From patchwork Fri Feb 10 03:53:12 2023
X-Patchwork-Submitter: Andrew Pinski
X-Patchwork-Id: 55209
To: gcc-patches@gcc.gnu.org
CC: Andrew Pinski
Subject: [PATCHv4] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Thu, 9 Feb 2023 19:53:12 -0800
Message-ID: <20230210035312.1630020-1-apinski@marvell.com>
X-Mailer: git-send-email 2.31.1
From: Andrew Pinski
Reply-To: Andrew Pinski

The problem here is that aarch64_expand_setmem does not change the
alignment for the strict-alignment case.

This is version 4 of the fix.  The major change from the previous
version is fixing the way store pairs are handled, which allows
storing two SImode values at a time.  This version also adds a
testcase to show a case where, with -mstrict-align, we can still do
store-pair word stores.

OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.

	PR target/103100

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_gen_store_pair): Add
	support for SImode.
	(aarch64_set_one_block_and_progress_pointer): Add use_pairs
	argument; rewrite and simplify the code.
	(aarch64_can_use_pair_load_stores): New function.
	(aarch64_expand_setmem): Rewrite mode selection to better
	handle strict alignment and the non-ldp/stp-pair case.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
	Reduce the size down to 207 and make s1 global and aligned
	to 16 bytes.
	* gcc.target/aarch64/memset-strict-align-2.c: New test.
	* gcc.target/aarch64/memset-strict-align-3.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 136 ++++++++++--------
 .../aarch64/memset-strict-align-1.c           |  19 ++-
 .../aarch64/memset-strict-align-2.c           |  14 ++
 .../aarch64/memset-strict-align-3.c           |  15 ++
 4 files changed, 113 insertions(+), 71 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5c40b6ed22a..3eaf9bd608a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8850,6 +8850,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
 {
   switch (mode)
     {
+    case E_SImode:
+      return gen_store_pair_sw_sisi (mem1, reg1, mem2, reg2);
+
     case E_DImode:
       return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
 
@@ -24896,42 +24899,49 @@ aarch64_expand_cpymem (rtx *operands)
    SRC is a register we have created with the duplicated value to be set.  */
 static void
 aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
-					    machine_mode mode)
+					    machine_mode mode, bool use_pairs)
 {
+  rtx reg = src;
+
   /* If we are copying 128bits or 256bits, we can do that straight from
      the SIMD register we prepared.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
-    {
-      mode = GET_MODE (src);
-      /* "Cast" the *dst to the correct mode.  */
-      *dst = adjust_address (*dst, mode, 0);
-      /* Emit the memset.  */
-      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
-					 aarch64_progress_pointer (*dst), src));
-
-      /* Move the pointers forward.  */
-      *dst = aarch64_move_pointer (*dst, 32);
-      return;
-    }
   if (known_eq (GET_MODE_BITSIZE (mode), 128))
-    {
-      /* "Cast" the *dst to the correct mode.  */
-      *dst = adjust_address (*dst, GET_MODE (src), 0);
-      /* Emit the memset.  */
-      emit_move_insn (*dst, src);
-      /* Move the pointers forward.  */
-      *dst = aarch64_move_pointer (*dst, 16);
-      return;
-    }
-  /* For copying less, we have to extract the right amount from src.  */
-  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
+    mode = GET_MODE (src);
+  else
+    /* For copying less, we have to extract the right amount from src.  */
+    reg = lowpart_subreg (mode, src, GET_MODE (src));
 
   /* "Cast" the *dst to the correct mode.  */
   *dst = adjust_address (*dst, mode, 0);
   /* Emit the memset.  */
-  emit_move_insn (*dst, reg);
+  if (use_pairs)
+    emit_insn (aarch64_gen_store_pair (mode, *dst, reg,
+				       aarch64_progress_pointer (*dst),
+				       reg));
+  else
+    emit_move_insn (*dst, reg);
+
   /* Move the pointer forward.  */
   *dst = aarch64_progress_pointer (*dst);
+  if (use_pairs)
+    *dst = aarch64_progress_pointer (*dst);
+}
+
+/* Returns true if size can be used as a store/load pair.
+   This is a helper function for aarch64_expand_setmem and others.  */
+static bool
+aarch64_can_use_pair_load_stores (unsigned HOST_WIDE_INT size)
+{
+  /* For DI and SI modes, we can use store pairs.  */
+  if (size == GET_MODE_BITSIZE (DImode)
+      || size == GET_MODE_BITSIZE (SImode))
+    return true;
+  /* For TI mode, we will use store pairs only if
+     the target wants to.  */
+  else if (size == GET_MODE_BITSIZE (TImode)
+	   && !(aarch64_tune_params.extra_tuning_flags
+		& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
+    return true;
+  return false;
 }
 
 /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
@@ -24974,9 +24984,21 @@ aarch64_expand_setmem (rtx *operands)
 
   bool size_p = optimize_function_for_size_p (cfun);
 
-  /* Default the maximum to 256-bytes when considering only libcall vs
-     SIMD broadcast sequence.  */
-  unsigned max_set_size = 256;
+  /* Maximum amount to copy in one go (without taking into account
+     store pairs).  */
+  int copy_limit = GET_MODE_BITSIZE (TImode);
+
+  /* For the strict alignment case, restrict the copy limit to the
+     compile-time known alignment of the memory.  */
+  if (STRICT_ALIGNMENT)
+    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));
+
+  bool use_pairs = aarch64_can_use_pair_load_stores (copy_limit);
+
+  /* The max set size is 8 instructions of the copy_limit sized stores
+     (also taking into account using pair stores or not);
+     for the non-strict-alignment case, this is 256 bytes.  */
+  unsigned max_set_size;
+  max_set_size = (copy_limit * (use_pairs ? 2 : 1) * 8) / BITS_PER_UNIT;
 
   len = INTVAL (operands[1]);
   if (len > max_set_size && !TARGET_MOPS)
@@ -24990,8 +25012,10 @@ aarch64_expand_setmem (rtx *operands)
      (zero constants can use XZR directly).  */
   unsigned mops_cost = 3 + 1 + cst_val;
   /* A libcall to memset in the worst case takes 3 instructions to prepare
-     the arguments + 1 for the call.  */
-  unsigned libcall_cost = 4;
+     the arguments + 1 for the call.
+     In the case of not optimizing for size, the cost of doing a libcall
+     is 17 instructions.  */
+  unsigned libcall_cost = size_p ? 4 : 17;
 
   /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
      when available.  */
@@ -25001,12 +25025,12 @@ aarch64_expand_setmem (rtx *operands)
 
   /* Attempt a sequence with a vector broadcast followed by stores.
      Count the number of operations involved to see if it's worth it
-     against the alternatives.  A simple counter simd_ops on the
+     against the alternatives.  A simple counter inlined_ops on the
     algorithmically-relevant operations is used rather than an rtx_insn count
     as all the pointer adjusmtents and mode reinterprets will be optimized
     away later.  */
   start_sequence ();
-  unsigned simd_ops = 0;
+  unsigned inlined_ops = 0;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
 
@@ -25014,16 +25038,10 @@ aarch64_expand_setmem (rtx *operands)
   /* Prepare the val using a DUP/MOVI v0.16B, val.  */
   src = expand_vector_broadcast (V16QImode, val);
   src = force_reg (V16QImode, src);
-  simd_ops++;
+  inlined_ops++;
 
   /* Convert len to bits to make the rest of the code simpler.  */
   n = len * BITS_PER_UNIT;
 
-  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
-     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
-  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
-			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
-			  ? GET_MODE_BITSIZE (TImode) : 256;
-
   while (n > 0)
     {
       /* Find the largest mode in which to do the copy without
@@ -25036,15 +25054,18 @@ aarch64_expand_setmem (rtx *operands)
       gcc_assert (cur_mode != BLKmode);
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
-      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
-      simd_ops++;
-      n -= mode_bits;
+      bool pairs = aarch64_can_use_pair_load_stores (mode_bits);
+      if (n < (mode_bits * 2))
+	pairs = false;
+      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode, pairs);
+      inlined_ops++;
+      n -= mode_bits * (pairs ? 2 : 1);
 
       /* Do certain trailing copies as overlapping if it's going to be
	 cheaper.  i.e. less instructions to do so.  For instance doing a 15
	 byte copy it's more efficient to do two overlapping 8 byte copies than
	 8 + 4 + 2 + 1.  Only do this when -mstrict-align is not supplied.  */
-      if (n > 0 && n < copy_limit / 2 && !STRICT_ALIGNMENT)
+      if (n > 0 && n < copy_limit && !STRICT_ALIGNMENT)
	{
	  next_mode = smallest_mode_for_size (n, MODE_INT);
	  int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
@@ -25056,24 +25077,17 @@ aarch64_expand_setmem (rtx *operands)
   rtx_insn *seq = get_insns ();
   end_sequence ();
 
-  if (size_p)
-    {
-      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
-	 call to memset or the MOPS expansion.  */
-      if (TARGET_MOPS
-	  && mops_cost <= libcall_cost
-	  && mops_cost <= simd_ops)
-	return aarch64_expand_setmem_mops (operands);
-      /* If MOPS is not available or not shorter pick a libcall if the SIMD
-	 sequence is too long.  */
-      else if (libcall_cost < simd_ops)
-	return false;
-      emit_insn (seq);
-      return true;
-    }
+  /* When optimizing for size we have 3 options: the inlined sequence,
+     call to memset or the MOPS expansion.  */
+  if (size_p
+      && TARGET_MOPS
+      && mops_cost <= libcall_cost
+      && mops_cost <= inlined_ops)
+    return aarch64_expand_setmem_mops (operands);
+  /* Pick a libcall if the inlined sequence is too long.  */
+  else if (libcall_cost < inlined_ops)
+    return false;
 
-  /* At this point the SIMD broadcast sequence is the best choice when
-     optimizing for speed.  */
   emit_insn (seq);
   return true;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
index 664d43aee13..63c864b25b0 100644
--- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
@@ -1,28 +1,27 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mstrict-align" } */
 
-struct s { char x[255]; };
+struct s { char x[207]; };
+struct s s1 __attribute__((aligned(16)));
 void foo (struct s *);
-void bar (void) { struct s s1 = {}; foo (&s1); }
+void bar (void) { s1 = (struct s){}; foo (&s1); }
 
-/* memset (s1 = {}, sizeof = 255) should be expanded out
+/* memset (s1 = {}, sizeof = 207) should be expanded out
    such that there are no overlap stores when -mstrict-align
    is in use.
-   so 7 pairs of 16 bytes stores (224 bytes).
-   1 16 byte stores
+   so 6 pairs of 16 bytes stores (192 bytes).
    1 8 byte store
    1 4 byte store
    1 2 byte store
    1 1 byte store */
 
-/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
-/* { dg-final { scan-assembler-times "str\tq" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tq" 6 } } */
+/* { dg-final { scan-assembler-times "str\tq" 0 } } */
 /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
 /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
 
-/* Also one store pair for the frame-pointer and the LR.  */
-/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
-
+/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
+/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
new file mode 100644
index 00000000000..650be86604b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstrict-align" } */
+
+struct s { char x[7]; };
+void foo (struct s *);
+void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
+
+/* memset (s1 = {}, sizeof = 7) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use.  As the alignment of s1 is unknown, byte stores are needed.
+   so 7 byte stores */
+
+/* { dg-final { scan-assembler-times "strb\twzr" 7 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
new file mode 100644
index 00000000000..09cb9a654dc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -mstrict-align" } */
+
+int a[2*3];
+int f(int t)
+{
+  __builtin_memset(a, t, 2*3*sizeof(int));
+}
+
+/* memset (a, (val), sizeof = 2*3*4) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use.  As the alignment of a is 4 bytes (due to -Os),
+   store word pairs are needed.  so 3 stp are in use.  */
+
+/* { dg-final { scan-assembler-times "stp\ts" 3 } } */