From patchwork Fri Feb 10 03:53:12 2023
X-Patchwork-Submitter: Andrew Pinski
X-Patchwork-Id: 55209
To: gcc-patches@gcc.gnu.org
CC: Andrew Pinski
Subject: [PATCHv4] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Thu, 9 Feb 2023 19:53:12 -0800
Message-ID: <20230210035312.1630020-1-apinski@marvell.com>
X-Mailer: git-send-email 2.31.1
From: Andrew Pinski
Reply-To: Andrew Pinski

The problem here is that aarch64_expand_setmem does not change the
alignment for the strict-alignment case.

This is version 4 of the fix.  The major change from the previous
version is fixing the way store pairs are handled, which allows
storing two SImode values at a time.  This version also adds a
testcase to show a case where, with -mstrict-align, we can still do
store-pair word stores.

OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.

	PR target/103100

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_gen_store_pair): Add
	support for SImode.
	(aarch64_set_one_block_and_progress_pointer): Add use_pairs
	argument; rewrite and simplify the code.
	(aarch64_can_use_pair_load_stores): New function.
	(aarch64_expand_setmem): Rewrite mode selection to better
	handle strict alignment and the non-ldp/stp-pair case.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
	Reduce the size down to 207 and make s1 global and aligned
	to 16 bytes.
	* gcc.target/aarch64/memset-strict-align-2.c: New test.
	* gcc.target/aarch64/memset-strict-align-3.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 136 ++++++++++--------
 .../aarch64/memset-strict-align-1.c           |  19 ++-
 .../aarch64/memset-strict-align-2.c           |  14 ++
 .../aarch64/memset-strict-align-3.c           |  15 ++
 4 files changed, 113 insertions(+), 71 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5c40b6ed22a..3eaf9bd608a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8850,6 +8850,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
 {
   switch (mode)
     {
+    case E_SImode:
+      return gen_store_pair_sw_sisi (mem1, reg1, mem2, reg2);
+
     case E_DImode:
       return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
 
@@ -24896,42 +24899,49 @@ aarch64_expand_cpymem (rtx *operands)
    SRC is a register we have created with the duplicated value to be set.  */
 static void
 aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
-					    machine_mode mode)
+					    machine_mode mode, bool use_pairs)
 {
+  rtx reg = src;
+
   /* If we are copying 128bits or 256bits, we can do that straight from
      the SIMD register we prepared.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
-    {
-      mode = GET_MODE (src);
-      /* "Cast" the *dst to the correct mode.  */
-      *dst = adjust_address (*dst, mode, 0);
-      /* Emit the memset.  */
-      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
-					 aarch64_progress_pointer (*dst), src));
-
-      /* Move the pointers forward.  */
-      *dst = aarch64_move_pointer (*dst, 32);
-      return;
-    }
   if (known_eq (GET_MODE_BITSIZE (mode), 128))
-    {
-      /* "Cast" the *dst to the correct mode.  */
-      *dst = adjust_address (*dst, GET_MODE (src), 0);
-      /* Emit the memset.  */
-      emit_move_insn (*dst, src);
-      /* Move the pointers forward.  */
-      *dst = aarch64_move_pointer (*dst, 16);
-      return;
-    }
-  /* For copying less, we have to extract the right amount from src.  */
-  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
+    mode = GET_MODE (src);
+  else
+    /* For copying less, we have to extract the right amount from src.  */
+    reg = lowpart_subreg (mode, src, GET_MODE (src));
 
   /* "Cast" the *dst to the correct mode.  */
   *dst = adjust_address (*dst, mode, 0);
   /* Emit the memset.  */
-  emit_move_insn (*dst, reg);
+  if (use_pairs)
+    emit_insn (aarch64_gen_store_pair (mode, *dst, reg,
+				       aarch64_progress_pointer (*dst),
+				       reg));
+  else
+    emit_move_insn (*dst, reg);
+
   /* Move the pointer forward.  */
   *dst = aarch64_progress_pointer (*dst);
+  if (use_pairs)
+    *dst = aarch64_progress_pointer (*dst);
+}
+
+/* Returns true if size can be used as a store/load pair.
+   This is a helper function for aarch64_expand_setmem and others.  */
+static bool
+aarch64_can_use_pair_load_stores (unsigned HOST_WIDE_INT size)
+{
+  /* For DI and SI modes, we can use store pairs.  */
+  if (size == GET_MODE_BITSIZE (DImode)
+      || size == GET_MODE_BITSIZE (SImode))
+    return true;
+  /* For TI mode, we will use store pairs only if
+     the target wants to.  */
+  else if (size == GET_MODE_BITSIZE (TImode)
+	   && !(aarch64_tune_params.extra_tuning_flags
+		& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
+    return true;
+  return false;
 }
 
 /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
@@ -24974,9 +24984,21 @@ aarch64_expand_setmem (rtx *operands)
 
   bool size_p = optimize_function_for_size_p (cfun);
 
-  /* Default the maximum to 256-bytes when considering only libcall vs
-     SIMD broadcast sequence.  */
-  unsigned max_set_size = 256;
+  /* Maximum amount to copy in one go (without taking into account
+     store pairs).  */
+  int copy_limit = GET_MODE_BITSIZE (TImode);
+
+  /* For the strict alignment case, restrict the copy limit to the
+     compile-time known alignment of the memory.  */
+  if (STRICT_ALIGNMENT)
+    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));
+
+  bool use_pairs = aarch64_can_use_pair_load_stores (copy_limit);
+
+  /* The max set size is 8 instructions of the copy_limit sized stores
+     (also taking into account using pair stores or not);
+     for the non-strict-alignment case, this is 256 bytes.  */
+  unsigned max_set_size;
+  max_set_size = (copy_limit * (use_pairs ? 2 : 1) * 8) / BITS_PER_UNIT;
 
   len = INTVAL (operands[1]);
   if (len > max_set_size && !TARGET_MOPS)
@@ -24990,8 +25012,10 @@ aarch64_expand_setmem (rtx *operands)
      (zero constants can use XZR directly).  */
   unsigned mops_cost = 3 + 1 + cst_val;
   /* A libcall to memset in the worst case takes 3 instructions to prepare
-     the arguments + 1 for the call.  */
-  unsigned libcall_cost = 4;
+     the arguments + 1 for the call.
+     In the case of not optimizing for size, the cost of doing a libcall
+     is 17 instructions.  */
+  unsigned libcall_cost = size_p ? 4 : 17;
 
   /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
      when available.  */
@@ -25001,12 +25025,12 @@ aarch64_expand_setmem (rtx *operands)
 
   /* Attempt a sequence with a vector broadcast followed by stores.
      Count the number of operations involved to see if it's worth it
-     against the alternatives.  A simple counter simd_ops on the
+     against the alternatives.  A simple counter inlined_ops on the
     algorithmically-relevant operations is used rather than an rtx_insn count
     as all the pointer adjusmtents and mode reinterprets will be optimized
     away later.  */
   start_sequence ();
-  unsigned simd_ops = 0;
+  unsigned inlined_ops = 0;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
 
@@ -25014,16 +25038,10 @@ aarch64_expand_setmem (rtx *operands)
   /* Prepare the val using a DUP/MOVI v0.16B, val.  */
   src = expand_vector_broadcast (V16QImode, val);
   src = force_reg (V16QImode, src);
-  simd_ops++;
+  inlined_ops++;
 
   /* Convert len to bits to make the rest of the code simpler.  */
   n = len * BITS_PER_UNIT;
 
-  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
-     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
-  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
-			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
-			  ? GET_MODE_BITSIZE (TImode) : 256;
-
   while (n > 0)
     {
       /* Find the largest mode in which to do the copy without
@@ -25036,15 +25054,18 @@ aarch64_expand_setmem (rtx *operands)
       gcc_assert (cur_mode != BLKmode);
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
-      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
-      simd_ops++;
-      n -= mode_bits;
+      bool pairs = aarch64_can_use_pair_load_stores (mode_bits);
+      if (n < (mode_bits * 2))
+	pairs = false;
+      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode, pairs);
+      inlined_ops++;
+      n -= mode_bits * (pairs ? 2 : 1);
 
       /* Do certain trailing copies as overlapping if it's going to be
	 cheaper.  i.e. less instructions to do so.  For instance doing a 15
	 byte copy it's more efficient to do two overlapping 8 byte copies than
	 8 + 4 + 2 + 1.  Only do this when -mstrict-align is not supplied.  */
-      if (n > 0 && n < copy_limit / 2 && !STRICT_ALIGNMENT)
+      if (n > 0 && n < copy_limit && !STRICT_ALIGNMENT)
	{
	  next_mode = smallest_mode_for_size (n, MODE_INT);
	  int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
@@ -25056,24 +25077,17 @@ aarch64_expand_setmem (rtx *operands)
   rtx_insn *seq = get_insns ();
   end_sequence ();
 
-  if (size_p)
-    {
-      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
-	 call to memset or the MOPS expansion.  */
-      if (TARGET_MOPS
-	  && mops_cost <= libcall_cost
-	  && mops_cost <= simd_ops)
-	return aarch64_expand_setmem_mops (operands);
-      /* If MOPS is not available or not shorter pick a libcall if the SIMD
-	 sequence is too long.  */
-      else if (libcall_cost < simd_ops)
-	return false;
-      emit_insn (seq);
-      return true;
-    }
+  /* When optimizing for size we have 3 options: the inlined sequence,
+     call to memset or the MOPS expansion.  */
+  if (size_p
+      && TARGET_MOPS
+      && mops_cost <= libcall_cost
+      && mops_cost <= inlined_ops)
+    return aarch64_expand_setmem_mops (operands);
+  /* Pick a libcall if the inlined sequence is too long.  */
+  else if (libcall_cost < inlined_ops)
+    return false;
 
-  /* At this point the SIMD broadcast sequence is the best choice when
-     optimizing for speed.  */
   emit_insn (seq);
   return true;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
index 664d43aee13..63c864b25b0 100644
--- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
@@ -1,28 +1,27 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mstrict-align" } */
 
-struct s { char x[255]; };
+struct s { char x[207]; };
+struct s s1 __attribute__((aligned(16)));
 void foo (struct s *);
-void bar (void) { struct s s1 = {}; foo (&s1); }
+void bar (void) { s1 = (struct s){}; foo (&s1); }
 
-/* memset (s1 = {}, sizeof = 255) should be expanded out
+/* memset (s1 = {}, sizeof = 207) should be expanded out
    such that there are no overlap stores when -mstrict-align
    is in use.
-   so 7 pairs of 16 bytes stores (224 bytes).
-   1 16 byte stores
+   so 6 pairs of 16 bytes stores (192 bytes).
    1 8 byte store
    1 4 byte store
    1 2 byte store
    1 1 byte store */
 
-/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
-/* { dg-final { scan-assembler-times "str\tq" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tq" 6 } } */
+/* { dg-final { scan-assembler-times "str\tq" 0 } } */
 /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
 /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
 
-/* Also one store pair for the frame-pointer and the LR.  */
-/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
-
+/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
+/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
new file mode 100644
index 00000000000..650be86604b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstrict-align" } */
+
+struct s { char x[7]; };
+void foo (struct s *);
+void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
+
+/* memset (s1 = {}, sizeof = 7) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use.  As the alignment of s1 is unknown, byte stores are needed.
+   so 7 byte stores */
+
+/* { dg-final { scan-assembler-times "strb\twzr" 7 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
new file mode 100644
index 00000000000..09cb9a654dc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -mstrict-align" } */
+
+int a[2*3];
+int f(int t)
+{
+  __builtin_memset(a, t, 2*3*sizeof(int));
+}
+
+/* memset (a, (val), sizeof = 2*3*4) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use.  As the alignment of a is 4 bytes (due to -Os),
+   store word pairs are needed.  so 3 stp are in use.  */
+
+/* { dg-final { scan-assembler-times "stp\ts" 3 } } */