From patchwork Wed Nov  2 03:37:28 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hongyu Wang <hongyu.wang@intel.com>
X-Patchwork-Id: 14013
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp3372504wru;
        Tue, 1 Nov 2022 20:38:28 -0700 (PDT)
X-Google-Smtp-Source: 
 AMsMyM5TxdhiEgsd5h2/uJ2h4O0Pt4kOeqiFtQ32lZSIkasdw2+SPaMSEmMqNDbVuZrlCZ/ozE9K
X-Received: by 2002:a17:906:558e:b0:7ad:ca65:32b8 with SMTP id
 y14-20020a170906558e00b007adca6532b8mr15220451ejp.456.1667360308114;
        Tue, 01 Nov 2022 20:38:28 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1667360308; cv=none;
        d=google.com; s=arc-20160816;
        b=KyomkuQuHNRYBjfPpJQMy3haiNDvllnYgYzVSQUJJVyWQ0w+35yWQGR5cWyJCzcuDn
         lzrFVGl0E2juYoyn0/UU6kbntczUNE3nmMt3kXPboRHDHa/7cw/UNGl8afICOyupXWVo
         U8lqgLKY4D0eN63/wKv8JJUSkxHROPsH71Jkv2vF3T0wB6UThLjxxowD7HwcSpc5S1+F
         cK8konfV9fDop+fs4CuBGcahw4LSrOCfnlJoDeck093SY+RBYgg0zvstMfMKLZStcDL8
         3Pmuyo9PFqQVvn+JbB+H67y+jLByjZmDmbAMs9t4Z2cbFK2wzSjOQpvA8mwPBI2tid6+
         CvJw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:cc:reply-to:from:list-subscribe:list-help
         :list-post:list-archive:list-unsubscribe:list-id:precedence
         :message-id:date:subject:to:dmarc-filter:delivered-to:dkim-signature
         :dkim-filter;
        bh=iSehfFkb0Bs7qu2cQ7wyC4ttovnQkD41iZ8cFazOnVU=;
        b=oD1IUB19DotB2hS/ImCyO/pYwmqT6iCCiIOhH/uB2W6+0v4Ce7nVXoyF2GxlLd32ms
         ol9+0llYFY22zlTYk/g2l8kcIZD9Vjj6ym06VFm8TQ03efyNpbDvdyaRRXputjYuZOhd
         CnHou7PsnEmnwci3fEOj3aLHkpf53yvl3Y8zPFMhwe9iAANr8cWDzpdeRQ0KFc0FE1zn
         sxQQyeSWy+wyPhUNgjjqakjP/VueBhIlvAcX868dONWOdHoYUED9fF7DKaoDGzPfaeNY
         wrXfeUaM5Uq7GqdIAHnBRyeqhZLbp6A8REQucTtRulrMXMgVJupHOg4MIRRrRE0X/H6v
         0KQA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=PAJeh7At;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates
 2620:52:3:1:0:246e:9693:128c as permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from sourceware.org (server2.sourceware.org.
 [2620:52:3:1:0:246e:9693:128c])
        by mx.google.com with ESMTPS id
 he16-20020a1709073d9000b007881b45441asi15339964ejc.721.2022.11.01.20.38.27
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 01 Nov 2022 20:38:28 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates
 2620:52:3:1:0:246e:9693:128c as permitted sender)
 client-ip=2620:52:3:1:0:246e:9693:128c;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=PAJeh7At;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates
 2620:52:3:1:0:246e:9693:128c as permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id E904F3858400
	for <ouuuleilei@gmail.com>; Wed,  2 Nov 2022 03:38:26 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E904F3858400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1667360306;
	bh=iSehfFkb0Bs7qu2cQ7wyC4ttovnQkD41iZ8cFazOnVU=;
	h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post:
	 List-Help:List-Subscribe:From:Reply-To:Cc:From;
	b=PAJeh7AtzBBdYPORQfZDLwHcPxo9wTL5xtjEjILGfUDdO7TS2l2hewihL05+xGlLt
	 uZ3yAb+vrncDzkVhtcmegGPuUwzafNbgeUGZwHVHmcDCDNNZIQH0tNVnS6XtHgyI+y
	 pfw1JV+H5VuL4zYOKowgMnwf0u28HtqT3x5BwTbQ=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mga11.intel.com (mga11.intel.com [192.55.52.93])
 by sourceware.org (Postfix) with ESMTPS id C79B33858CDA
 for <gcc-patches@gcc.gnu.org>; Wed,  2 Nov 2022 03:37:32 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C79B33858CDA
X-IronPort-AV: E=McAfee;i="6500,9779,10518"; a="306919966"
X-IronPort-AV: E=Sophos;i="5.95,232,1661842800"; d="scan'208";a="306919966"
Received: from fmsmga005.fm.intel.com ([10.253.24.32])
 by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 01 Nov 2022 20:37:31 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6500,9779,10518"; a="963366086"
X-IronPort-AV: E=Sophos;i="5.95,232,1661842800"; d="scan'208";a="963366086"
Received: from shvmail03.sh.intel.com ([10.239.245.20])
 by fmsmga005.fm.intel.com with ESMTP; 01 Nov 2022 20:37:29 -0700
Received: from shliclel320.sh.intel.com (shliclel320.sh.intel.com
 [10.239.240.127])
 by shvmail03.sh.intel.com (Postfix) with ESMTP id EC9EE10056B3;
 Wed,  2 Nov 2022 11:37:28 +0800 (CST)
To: gcc-patches@gcc.gnu.org
Subject: [PATCH V2] Enable small loop unrolling for O2
Date: Wed,  2 Nov 2022 11:37:28 +0800
Message-Id: <20221102033728.99379-1-hongyu.wang@intel.com>
X-Mailer: git-send-email 2.18.1
X-Spam-Status: No, score=-10.8 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, FREEMAIL_ENVFROM_END_DIGIT,
 FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, GIT_PATCH_0,
 HEADER_FROM_DIFFERENT_DOMAINS, KAM_NUMSUBJECT, KAM_SHORT, SPF_HELO_NONE,
 SPF_SOFTFAIL, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: Hongyu Wang via Gcc-patches
 <gcc-patches@gcc.gnu.org>
From: Hongyu Wang <hongyu.wang@intel.com>
Reply-To: Hongyu Wang <hongyu.wang@intel.com>
Cc: hongtao.liu@intel.com
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1748354002328028185?=
X-GMAIL-MSGID: =?utf-8?q?1748354002328028185?=

Hi, this is an updated version of the patch at
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html,
which uses targetm.loop_unroll_adjust as the gate for enabling small
loop unrolling.

This patch does not change rs6000/s390 since I don't have machines to
test them on, but I expect their default behavior to stay the same since
they already enable flag_unroll_loops at O2.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?

---------- Patch content --------

Modern processors have multi-way instruction decoders.
On x86, icelake/zen3 can handle 5 uops per cycle, so for a small loop
with <= 4 instructions (usually 3 uops, since the cmp/jmp pair can be
macro-fused), the decoder has a 2-uop bubble in each iteration and the
pipeline cannot be fully utilized.
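
The bubble arithmetic above can be sketched as a tiny C helper. This is
only an illustrative model, not part of the patch: the function name
decode_bubbles is hypothetical, and it assumes that the taken branch at
the end of each iteration closes the current decode group, so unused
slots in that group are lost.

```c
#include <assert.h>

/* Hypothetical model (an assumption for illustration): count the decode
   slots left idle per loop iteration when the decoder delivers
   DECODE_WIDTH uops per cycle and the loop-closing branch ends the
   current decode group.  */
static unsigned
decode_bubbles (unsigned uops_per_iter, unsigned decode_width)
{
  assert (decode_width > 0);
  unsigned rem = uops_per_iter % decode_width;
  /* A full last group wastes nothing; otherwise the rest is a bubble.  */
  return rem == 0 ? 0 : decode_width - rem;
}
```

With the figures from the text (3 uops per iteration, 5 slots) the model
gives 2 idle slots. If unrolling by 2 produced 5 uops per iteration (an
assumed count: two copies of a 2-uop body plus one fused cmp/jmp), the
model would give 0.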

Therefore, this patch enables loop unrolling for small loops at O2
to fill the decoder as much as possible. It turns on rtl loop
unrolling when targetm.loop_unroll_adjust exists and the function is
compiled at O2 or above and optimized for speed.
In the x86 backend the default behavior is to unroll small loops with
at most 4 insns once (unroll factor 2).
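
As a source-level illustration of what an unroll factor of 2 means, the
two functions below compute the same result. This is a hand-written
sketch with hypothetical names (sum_rolled, sum_unrolled2); the RTL
unroller works on insns rather than C source, and the code it generates
may look different.

```c
#include <stddef.h>

/* Original small loop: one add plus the loop control per iteration.  */
long
sum_rolled (const int *a, size_t n)
{
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Hand-written equivalent of unrolling by a factor of 2: two copies of
   the body per iteration, followed by a remainder step for odd N.  */
long
sum_unrolled2 (const int *a, size_t n)
{
  long s = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      s += a[i];
      s += a[i + 1];
    }
  if (i < n)
    s += a[i];
  return s;
}
```

The unrolled form executes the loop-control compare and branch half as
often, which is where the extra decode slots come from.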

This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
0.9% code size increase. For other benchmarks the variations are minor
and overall code size increases by 0.2%.

The kernel image size increases by 0.06%, and there is no impact on eembc.

gcc/ChangeLog:

	* common/config/i386/i386-common.cc (ix86_option_optimization_table):
	Enable small loop unroll at O2 by default.
	* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
	factor if -munroll-only-small-loops is enabled and -funroll-loops/
	-funroll-all-loops are disabled.
	* config/i386/i386.opt: Add -munroll-only-small-loops,
	-param=x86-small-unroll-ninsns= for loop insn limit,
	-param=x86-small-unroll-factor= for unroll factor.
	* doc/invoke.texi: Document -munroll-only-small-loops,
	x86-small-unroll-ninsns and x86-small-unroll-factor.
	* loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
	loop unrolling for -O2-speed and above if target hook
	loop_unroll_adjust exists.

gcc/testsuite/ChangeLog:

	* gcc.dg/guality/loop-1.c: Add additional option
	  -mno-unroll-only-small-loops.
	* gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
	* gcc.target/i386/pr93002.c: Likewise.
---
 gcc/common/config/i386/i386-common.cc   |  1 +
 gcc/config/i386/i386.cc                 | 18 ++++++++++++++++++
 gcc/config/i386/i386.opt                | 13 +++++++++++++
 gcc/doc/invoke.texi                     | 16 ++++++++++++++++
 gcc/loop-init.cc                        | 10 +++++++---
 gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
 gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
 8 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
index f66bdd5a2af..c6891486078 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
     /* The STC algorithm produces the smallest code at -Os, for x86.  */
     { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
       REORDER_BLOCKS_ALGORITHM_STC },
+    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index c0f37149ed0..0f94a3b609e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   unsigned i;
   unsigned mem_count = 0;
 
+  /* Unroll small size loop when unroll factor is not explicitly
+     specified.  */
+  if (!(flag_unroll_loops
+	|| flag_unroll_all_loops
+	|| loop->unroll))
+    {
+      nunroll = 1;
+
+      /* Any explicit -f{no-}unroll-{all-}loops turns off
+	 -munroll-only-small-loops.  */
+      if (ix86_unroll_only_small_loops
+	  && !OPTION_SET_P (flag_unroll_loops))
+	if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
+	  nunroll = (unsigned) ix86_small_unroll_factor;
+
+      return nunroll;
+    }
+
   if (!TARGET_ADJUST_UNROLL)
      return nunroll;
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 53d534f6392..6da9c8d670d 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1224,3 +1224,16 @@ mavxvnniint8
 Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
 Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
 AVXVNNIINT8 built-in functions and code generation.
+
+munroll-only-small-loops
+Target Var(ix86_unroll_only_small_loops) Init(0) Save
+Enable conservative small loop unrolling.
+
+-param=x86-small-unroll-ninsns=
+Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
+Instruction count limit for a loop to be unrolled under
+-munroll-only-small-loops.
+
+-param=x86-small-unroll-factor=
+Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
+Unroll factor for -munroll-only-small-loops.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 550aec87809..487218bd0ce 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15821,6 +15821,14 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
 @item x86-stlf-window-ninsns
 Instructions number above which STFL stall penalty can be compensated.
 
+@item x86-small-unroll-ninsns
+If @option{-munroll-only-small-loops} is enabled, only unroll loops whose
+instruction count does not exceed this parameter.  The default value is 4.
+
+@item x86-small-unroll-factor
+If @option{-munroll-only-small-loops} is enabled, use this value as the
+unroll factor.  The default value is 2, which means the loop is unrolled once.
+
 @end table
 
 @end table
@@ -25232,6 +25240,14 @@ environments where no dynamic link is performed, like firmwares, OS
 kernels, executables linked with @option{-static} or @option{-static-pie}.
 @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
 @option{-fpic}.
+
+@item -munroll-only-small-loops
+@itemx -mno-unroll-only-small-loops
+@opindex munroll-only-small-loops
+Controls conservative small loop unrolling.  It is enabled by default at
+@option{-O2}, and unrolls loops with at most 4 insns once.  An explicit
+@option{-f[no-]unroll-[all-]loops} disables this flag to avoid any
+unrolling behavior that the user does not want.
 @end table
 
 @node M32C Options
diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
index b9e07973dd6..9789efa1e11 100644
--- a/gcc/loop-init.cc
+++ b/gcc/loop-init.cc
@@ -565,9 +565,12 @@ public:
   {}
 
   /* opt_pass methods: */
-  bool gate (function *) final override
+  bool gate (function *fun) final override
     {
-      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
+      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
+	      || (targetm.loop_unroll_adjust
+		  && optimize >= 2
+		  && optimize_function_for_speed_p (fun)));
     }
 
   unsigned int execute (function *) final override;
@@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
       if (dump_file)
 	df_dump (dump_file);
 
-      if (flag_unroll_loops)
+      if (flag_unroll_loops
+	  || targetm.loop_unroll_adjust)
 	flags |= UAP_UNROLL;
       if (flag_unroll_all_loops)
 	flags |= UAP_UNROLL_ALL;
diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
index 1b1f6d32271..a32ea445a3f 100644
--- a/gcc/testsuite/gcc.dg/guality/loop-1.c
+++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
@@ -1,5 +1,7 @@
 /* { dg-do run } */
 /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
+/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
+
 
 #include "../nop.h"
 
diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
index 81841ef5bd7..cbc9fbb0450 100644
--- a/gcc/testsuite/gcc.target/i386/pr86270.c
+++ b/gcc/testsuite/gcc.target/i386/pr86270.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 
 int *a;
 long len;
diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
index 0248fcc00a5..f75a847f75d 100644
--- a/gcc/testsuite/gcc.target/i386/pr93002.c
+++ b/gcc/testsuite/gcc.target/i386/pr93002.c
@@ -1,6 +1,6 @@
 /* PR target/93002 */
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
 
 volatile int sink;