From patchwork Thu May 11 10:12:00 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
X-Patchwork-Id: 92472
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp4252216vqo;
        Thu, 11 May 2023 03:13:13 -0700 (PDT)
X-Google-Smtp-Source: 
 ACHHUZ6fWcqQhlsYXwSPuQqsT43ZXgNHKoI/QDMcZnA0m1udXDE8RmXNKhfU6sFFS3+K1cmNogIM
X-Received: by 2002:a17:907:a0c:b0:966:2123:e0ca with SMTP id
 bb12-20020a1709070a0c00b009662123e0camr14601446ejc.34.1683799993187;
        Thu, 11 May 2023 03:13:13 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1683799993; cv=none;
        d=google.com; s=arc-20160816;
        b=NCb39QrDTxHG+0sHc/gNRa3KoXMpQXtFmEaXGE4mIZc/MA245pi+nspSsgYjKXsTQa
         FiS73l8gGyZJymKE+DS5VEv1EC4H74fEZknpLRcllkem1VP0LBn5G0DEf86yqcbBkSS+
         rWKR051F3ungOOjr1ElrvOhIFRcz6grOQnMkt6mDiLBXHjQYJmdXx075AKy0yI4WXBbx
         xI8o6fbwIzRlga8iXCc8urSMBTRUYZI9IZ6GYi+G9PYtMRt0zrUzLKHd2N2Dm5LGG0tr
         VP8IWfGvkx7A7SMh0t3q9RJwB6LpLweQKoyjFLjY84YaF2SqXQh7Ghy3nlE2oCXzDs3T
         VznQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:from:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence
         :content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:dmarc-filter:delivered-to:dkim-signature:dkim-filter;
        bh=a6eFYHNt0PIeNXB2CIxWSPbRk/eMiDIj+6X5Xvg3ahQ=;
        b=pyWg3JeJLWIu2d8AJNjyFWK7CGS6iUTf9sSl4642L9oec7j7Kvt9ugGNRrxlioSO26
         2Ibx1SzZXAt2bk8k0qXgAT0Z3nbI6pO4kgLLmaRNXA5pZS8vh1A4R7xYWVBBTZ/Cl68a
         9NzaHCggvaE5VCDYAfIyHryPAyYvz56p999Y+9PSyn5IxJHu2l0TszCKDg54icnblusu
         IzlgCeG1GIcLHH6Xi0GVQdaFBBTG0KLMyR5ob1o2nNK8DeDZx0qunvKuEP9DdBBjBz3w
         h0QUdoaPcwCJsdL3M38USNu5xElhlytleEw3yit9iaRaycDmeyboiNg/glCu5f0Wcmmk
         ktQw==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=NFSDZF8Q;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from sourceware.org (server2.sourceware.org. [8.43.85.97])
        by mx.google.com with ESMTPS id
 g23-20020a50ee17000000b0050bc3b40192si4927114eds.638.2023.05.11.03.13.12
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 11 May 2023 03:13:13 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender) client-ip=8.43.85.97;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=NFSDZF8Q;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 0B1043856968
	for <ouuuleilei@gmail.com>; Thu, 11 May 2023 10:12:53 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0B1043856968
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1683799973;
	bh=a6eFYHNt0PIeNXB2CIxWSPbRk/eMiDIj+6X5Xvg3ahQ=;
	h=To:Cc:Subject:Date:List-Id:List-Unsubscribe:List-Archive:
	 List-Post:List-Help:List-Subscribe:From:Reply-To:From;
	b=NFSDZF8QaHhjUcHH1fIPDdKIESWkdn2x5H9gmZahwXzd1lU548x9+dfZ7VFuuvLLy
	 T9SenuQDHepSYFZJld3uayQIlEBsBsEHBJFzTTxU/O6gO9UiDhlR0TeEJSA0lqHrs7
	 K8oSdAWcS3C5BN4DMlQzTA4NUUOy6GGkmw38lp1o=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mga17.intel.com (mga17.intel.com [192.55.52.151])
 by sourceware.org (Postfix) with ESMTPS id F33243858C83
 for <gcc-patches@gcc.gnu.org>; Thu, 11 May 2023 10:12:04 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F33243858C83
X-IronPort-AV: E=McAfee;i="6600,9927,10706"; a="330815122"
X-IronPort-AV: E=Sophos;i="5.99,266,1677571200"; d="scan'208";a="330815122"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
 by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 11 May 2023 03:12:03 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10706"; a="693740205"
X-IronPort-AV: E=Sophos;i="5.99,266,1677571200"; d="scan'208";a="693740205"
Received: from scymds03.sc.intel.com ([10.148.94.166])
 by orsmga007.jf.intel.com with ESMTP; 11 May 2023 03:12:03 -0700
Received: from shgcc101.sh.intel.com (shgcc101.sh.intel.com [10.239.85.97])
 by scymds03.sc.intel.com (Postfix) with ESMTP id C5BB878;
 Thu, 11 May 2023 03:12:02 -0700 (PDT)
To: gcc-patches@gcc.gnu.org
Cc: Lili Cui <lili.cui@intel.com>
Subject: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the
 chain with FMA in reassoc pass
Date: Thu, 11 May 2023 10:12:00 +0000
Message-Id: <20230511101201.2052667-1-lili.cui@intel.com>
X-Mailer: git-send-email 2.25.1
MIME-Version: 1.0
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: "Cui,
 Lili via Gcc-patches" <gcc-patches@gcc.gnu.org>
From: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
Reply-To: "Cui, Lili" <lili.cui@intel.com>
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1765592261564359060?=
X-GMAIL-MSGID: =?utf-8?q?1765592261564359060?=

From: Lili Cui <lili.cui@intel.com>

Hi,

These two patches add a param to control the length of the chain with FMA in
the reassoc pass, and a tuning option in the backend.

Bootstrapped and regtested. Ok for trunk?

Regards
Lili.

Add a param for chains with FMA in the reassoc pass to make it friendlier to
the later FMA pass. First detect whether the chain can generate two or more
FMAs. If it can, and param_reassoc_max_chain_length_with_fma is enabled (set
to a value greater than 1), rearrange the ops so that they can be combined
into more FMAs. When the chain length exceeds
param_reassoc_max_chain_length_with_fma, build parallel chains according to
the given association width and try to keep as many FMA opportunities as
possible.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For -Ofast -march=icelake-server  GCC generates:
        vmulss  %xmm3, %xmm2, %xmm2
        vfmadd132ss     %xmm1, %xmm2, %xmm0
        vaddss  (%rdi), %xmm0, %xmm0
        ret

with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
        vfmadd213ss   (%rdi), %xmm1, %xmm0
        vfmadd231ss   %xmm2, %xmm3, %xmm0
        ret
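
The two instruction sequences above correspond to two association orders of
the same sum. A minimal sketch of the dataflow, assuming a hypothetical helper
madd () that stands in for a fused multiply-add (it rounds twice where a real
FMA rounds once, so it only illustrates the ordering, not the rounding):

```c
#include <assert.h>

/* Hypothetical stand-in for a fused multiply-add: a * b + acc.  */
static float
madd (float a, float b, float acc)
{
  return a * b + acc;
}

/* Default association: one multiply, one FMA, one add.  */
static float
foo_default (float a, float b, float c, float d, const float *e)
{
  float t = c * d;		/* vmulss */
  t = madd (a, b, t);		/* vfmadd132ss */
  return *e + t;		/* vaddss */
}

/* Association with the FMA chain kept: two dependent FMAs.  */
static float
foo_fma_chain (float a, float b, float c, float d, const float *e)
{
  float t = madd (a, b, *e);	/* vfmadd213ss */
  return madd (c, d, t);	/* vfmadd231ss */
}

/* Both orders agree when every intermediate value is exactly
   representable, as with these small integers.  */
static void
check_small_case (void)
{
  float e = 1.0f;
  assert (foo_default (2.0f, 3.0f, 4.0f, 5.0f, &e)
	  == foo_fma_chain (2.0f, 3.0f, 4.0f, 5.0f, &e));
}
```

In general the two orders can differ in the last bits, which is why the
transformation is only done under the reassociation rules that -Ofast enables.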

gcc/ChangeLog:

	PR gcc/98350
	* params.opt (reassoc-max-chain-length-with-fma): New param.
	* tree-ssa-reassoc.cc
	(rewrite_expr_tree_parallel_for_fma): New.
	(rank_ops_for_fma): Ditto.
	(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

	PR gcc/98350
	* gcc.dg/pr98350-1.c: New test.
	* gcc.dg/pr98350-2.c: Ditto.
---
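The operand reordering done by rank_ops_for_fma can be sketched outside of GCC
as a stable partition. This is only an illustration under stated assumptions:
struct op_sketch, MAX_OPS and rank_ops_sketch are hypothetical names, and the
from_mult flag replaces the SSA_NAME_DEF_STMT / MULT_EXPR check in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OPS 64	/* Hypothetical bound, only for this sketch.  */

/* Hypothetical stand-in for operand_entry: records only whether the
   operand is defined by a multiplication.  */
struct op_sketch
{
  const char *name;
  bool from_mult;
};

/* Move mult-defined operands in front of the others, keeping relative
   order within each group.  Return true when at least two operands come
   from a multiplication, i.e. when an FMA chain is worth keeping.  */
static bool
rank_ops_sketch (struct op_sketch *ops, size_t n)
{
  struct op_sketch reordered[MAX_OPS];
  size_t n_mult = 0, k = 0, i;

  assert (n <= MAX_OPS);
  for (i = 0; i < n; i++)
    if (ops[i].from_mult)
      n_mult++;
  if (n_mult < 2)
    return false;

  for (i = 0; i < n; i++)
    if (ops[i].from_mult)
      reordered[k++] = ops[i];
  for (i = 0; i < n; i++)
    if (!ops[i].from_mult)
      reordered[k++] = ops[i];
  for (i = 0; i < n; i++)
    ops[i] = reordered[i];
  return true;
}
```

For the operands of a * b + c * d + e given as { e, a * b, c * d }, the sketch
returns true and leaves e last in the vector.
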
 gcc/params.opt                   |   4 +
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 +++++
 gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
 gcc/tree-ssa-reassoc.cc          | 228 ++++++++++++++++++++++++++++---
 4 files changed, 264 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
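
The choice between the three rewrite paths in the reassociate_bb hunk can be
summarized as a small decision function. This is a sketch, not the code in the
patch: enum rewrite_kind and choose_rewrite are hypothetical names,
last_reassoc_pass stands for !reassoc_insert_powi_p, and width is passed in
instead of being computed by get_reassociation_width.

```c
#include <assert.h>
#include <stdbool.h>

enum rewrite_kind
{
  REWRITE_LINEAR,	/* rewrite_expr_tree */
  REWRITE_PARALLEL,	/* rewrite_expr_tree_parallel */
  REWRITE_PARALLEL_FMA	/* rewrite_expr_tree_parallel_for_fma */
};

/* Pick the rewrite path for a chain of LEN operands.  KEEP_FMA_CHAIN is
   the result of the FMA ranking step; MAX_CHAIN_LEN_WITH_FMA is the value
   of the new param.  */
static enum rewrite_kind
choose_rewrite (bool last_reassoc_pass, int len, bool keep_fma_chain,
		int max_chain_len_with_fma, int width)
{
  assert (len >= 0 && width >= 1);
  bool chain_too_long = len > max_chain_len_with_fma;

  if (last_reassoc_pass
      && len > 3
      && (!keep_fma_chain || chain_too_long)
      && width > 1)
    return keep_fma_chain ? REWRITE_PARALLEL_FMA : REWRITE_PARALLEL;

  /* Short chains, FMA chains within the limit, and width 1 all keep a
     single linear dependency chain.  */
  return REWRITE_LINEAR;
}
```

With the param at 7 and width 2, an FMA chain of 8 operands is split into
parallel chains, while one of 7 operands stays linear.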

diff --git a/gcc/params.opt b/gcc/params.opt
index 823cdb2ff85..f7c719afe64 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=reassoc-max-chain-length-with-fma=
+Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
+The maximum chain length with FMA considered in the reassociation pass.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 00000000000..32ecce13a2d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
+  }
+}
+/* { dg-final { scan-assembler-times "vfm" 6  } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 00000000000..246025d43b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
+
+/* Test that the compiler properly builds parallel chains according to the
+   given association width and keeps as many FMA opportunities as possible.  */
+#define N 33
+double a[N];
+
+void
+foo (void)
+{
+  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
+    + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
+    + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
+    + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
+}
+/* { dg-final { scan-assembler-times "vfm" 12  } } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 067a3f07f7e..6d2e158c4f5 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-reassoc.h"
 #include "tree-ssa-math-opts.h"
 #include "gimple-range.h"
+#include "internal-fn.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+/* Rewrite statements with dependency chain with regard to the chance to
+   generate FMA.  When the dependency chain length exceeds
+   param_reassoc_max_chain_length_with_fma, build parallel chains according to
+   the given association width and keep as many FMA opportunities as possible.
+   E.g.
+   e + f + g + a * b + c * d;
+
+   ssa1 = e + f;
+   ssa2 = g + a * b;
+   ssa3 = ssa1 + c * d;
+   ssa4 = ssa2 + ssa3;
+
+   This reassociation approach preserves the chance of fma generation as much
+   as possible.  */
+static void
+rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
+					 const vec<operand_entry *> &ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = ops.length ();
+  gcc_assert (op_num > 0);
+  int stmt_num = op_num - 1;
+  gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
+  int op_index = op_num - 1;
+  int width_count = width;
+  int i = 0, j = 0;
+  tree tmp_op[2], op1;
+  operand_entry *oe;
+  gimple *stmt1 = NULL;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  /* Build parallel FMA dependency chain according to width.  */
+  for (i = 0; i < width; i++)
+    {
+      for (j = 0; j < 2; j++)
+	{
+	  oe = ops[op_index--];
+	  tmp_op[j] = oe->op;
+	  stmt1 = oe->stmt_to_insert;
+	  if (stmt1)
+	    insert_stmt_before_use (stmts[i], stmt1);
+	  stmt1 = NULL;
+	}
+      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
+      gimple_set_visited (stmts[i], true);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
+	}
+    }
+
+  for (i = width; i < stmt_num; i++)
+    {
+      /* We keep original statement only for the last one.  All others are
+	 recreated.  */
+      if (op_index < 0)
+	{
+	  if (width_count == 2)
+	    {
+
+	      gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
+	      gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
+	    }
+	  else
+	    {
+
+	      stmts[i] =
+		build_and_add_sum (TREE_TYPE (last_rhs1),
+				   gimple_assign_lhs (stmts[i-width_count]),
+				   gimple_assign_lhs (stmts[i-width_count+1]),
+				   opcode);
+	      width_count--;
+	    }
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  oe = ops[op_index--];
+	  op1 = oe->op;
+	  stmt1 = oe->stmt_to_insert;
+	  if (stmt1)
+	    insert_stmt_before_use (stmts[i], stmt1);
+	  stmt1 = NULL;
+	  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
+				  gimple_assign_lhs (stmts[i-width]),
+				  op1,
+				  opcode);
+	  gimple_set_visited (stmts[i], true);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
+	}
+    }
+  remove_visited_stmt_chain (last_rhs1);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order and trying to allow operations to be executed in
@@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
     }
 }
 
+/* Rearrange ops to generate more FMAs when the chain may have 2 or more FMAs.
+   Putting ops that are not defined by a mult in front can generate more FMAs.
+   E.g.
+   a * b + c * d + e generates:
+
+   _4  = c_9(D) * d_10(D);
+   _12 = .FMA (a_7(D), b_8(D), _4);
+   _11 = e_6(D) + _12;
+
+   Rearranging ops to -> e + a * b + c * d generates:
+
+   _4  = .FMA (c_7(D), d_8(D), _3);
+   _11 = .FMA (a_5(D), b_6(D), _4);
+ */
+static bool
+rank_ops_for_fma (vec<operand_entry *> *ops)
+{
+  operand_entry *oe;
+  unsigned int i;
+  auto_vec<operand_entry *> ops_mult;
+  auto_vec<operand_entry *> ops_others;
+
+  FOR_EACH_VEC_ELT (*ops, i, oe)
+    {
+      if (TREE_CODE (oe->op) == SSA_NAME)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+	  if (is_gimple_assign (def_stmt)
+	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+	    ops_mult.safe_push (oe);
+	  else
+	    ops_others.safe_push (oe);
+	}
+      else
+	ops_others.safe_push (oe);
+    }
+  /* When ops_mult.length () >= 2, as in the following case,
+
+     a * b + c * d + e.
+
+     we need to rearrange the ops.
+
+     Putting ops that are not defined by a mult in front can generate more FMAs.  */
+  if (ops_mult.length () >= 2)
+    {
+      /* If all ops are defined with mult, we don't need to rearrange them.  */
+      if (ops_mult.length () != ops->length ())
+	{
+	  ops->truncate (0);
+	  FOR_EACH_VEC_ELT (ops_mult, i, oe)
+	    ops->safe_push (oe);
+	  FOR_EACH_VEC_ELT (ops_others, i, oe)
+	    ops->safe_push (oe);
+	}
+      return true;
+    }
+  return false;
+}
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
 
@@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
+		  bool keep_fma_chain = false;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb)
 		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
 		    std::swap (*ops[0], *ops[ops_num - 1]);
 
+		  optimization_type opt_type = bb_optimization_type (bb);
+
+		  /* When enabling param_reassoc_max_chain_length_with_fma to
+		     keep the chain with fma, rank_ops_for_fma will detect if
+		     the chain has fmas and if so it will rearrange the ops.  */
+		  if (param_reassoc_max_chain_length_with_fma > 1
+		      && direct_internal_fn_supported_p (IFN_FMA,
+							 TREE_TYPE (lhs),
+							 opt_type)
+		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
+		    {
+		      keep_fma_chain = rank_ops_for_fma (&ops);
+		    }
+
+		  int len = ops.length ();
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
-		      && ops.length () > 3
+		      && len > 3
+		      && (!keep_fma_chain
+			  || (keep_fma_chain
+			      && len > param_reassoc_max_chain_length_with_fma))
 		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+							    mode)) > 1)
 		    {
-		      if (dump_file && (dump_flags & TDF_DETAILS))
-			fprintf (dump_file,
-				 "Width = %d was chosen for reassociation\n",
-				 width);
-		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
-						  width, ops);
+		      if (keep_fma_chain)
+			{
+			  if (dump_file && (dump_flags & TDF_DETAILS))
+			    fprintf (dump_file,
+				     "Break chain len = %d into width for FMA\n",
+				     len);
+			  rewrite_expr_tree_parallel_for_fma
+			    (as_a <gassign *> (stmt), width, ops);
+			}
+		      else
+			{
+			  if (dump_file && (dump_flags & TDF_DETAILS))
+			    fprintf (dump_file,
+				     "Width = %d was chosen for reassociation\n",
+				     width);
+			  rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
+						      width, ops);
+			}
 		    }
 		  else
-                    {
-                      /* When there are three operands left, we want
-                         to make sure the ones that get the double
-                         binary op are chosen wisely.  */
-                      int len = ops.length ();
-                      if (len >= 3)
+		    {
+		      /* When there are three operands left, we want
+			 to make sure the ones that get the double
+			 binary op are chosen wisely.  */
+		      if (len >= 3 && !keep_fma_chain)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
 						   powi_result != NULL
 						   || negate_result,
 						   len != orig_len);
-                    }
-
+		    }
 		  /* If we combined some repeated factors into a 
 		     __builtin_powi call, multiply that result by the
 		     reassociated operands.  */

From patchwork Thu May 11 10:12:01 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
X-Patchwork-Id: 92471
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp4252215vqo;
        Thu, 11 May 2023 03:13:13 -0700 (PDT)
X-Google-Smtp-Source: 
 ACHHUZ55QaNXYWOSeYLDHT0pTBIOcME4lqOt3zoiBsDDtURDcuVgEsXcbO+JpE5HoAClvTP0Gktg
X-Received: by 2002:a17:907:7251:b0:96a:1ab:b4a2 with SMTP id
 ds17-20020a170907725100b0096a01abb4a2mr8580305ejc.25.1683799993127;
        Thu, 11 May 2023 03:13:13 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1683799993; cv=none;
        d=google.com; s=arc-20160816;
        b=TD4U3xUePNOXUnr+s49G9h1sIRjZFxgDgPxaGBsVqvyOxiQioLQhZbB/tOtj7w6L40
         6bWa92b43XWkwvbqXrAFOWJO3en+iv0upJEdIMz7OmG7WLh66K7TVhLQHstaFICI2G/q
         kWCLhMjm0dye+FNc03rzZsCLjNga5wwigX2cS6hA78mBI7jNE+el53MBeCUBaR8Mg9nN
         VdQjepXWdW2K8qDUob6hvfhJYH/gOvcUmpOVm7JRm6Gv/Xs0XpqlvtnOwVUoCS+H1DID
         w+94jXo9M95Vv2LvAZB4apYcwsWgFuYVL0vmoJPURhAlDLdBqhtBvgWQca6ZVAr/56LQ
         ftkQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:from:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence
         :content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:dmarc-filter:delivered-to
         :dkim-signature:dkim-filter;
        bh=qwJjoAi6rHP8udm6WodojqkvZR5oxp4cvUepQjCpHfs=;
        b=xdBSVktaozJe/g43WITJp0+rZUdbBXe8rv+Eeqo9SEZUOLzOTG89VPLUaM/ieFeI2Y
         bfWKP3UuZoTX4HQIyE+J7+Gk9I/zaAtDO6Yzu8ZvjW/jbRHZ11HFYDsc5STxfi7yTOmI
         4sNHFAflx58Wc+zbH/at4vrFPiuvKXBbTS1lz6o9HNNbbhfiU4lmy2pcUGcokZ2eTvGv
         eTwG41QGVwefaE3WnVfxHeQsNJ5ZYRg903KslAk3lNs5p6xYerelj/b2gG45RvDCJeTu
         P7tY8TTZzFDpNKnN2UkuAbZX7LOfefGqTBnWitCohCtgbhKkOabuT2DiA2KbonS8uWrw
         BMWQ==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=h7ELw5Kx;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from sourceware.org (server2.sourceware.org. [8.43.85.97])
        by mx.google.com with ESMTPS id
 qf37-20020a1709077f2500b009531dd6761esi6859968ejc.45.2023.05.11.03.13.12
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 11 May 2023 03:13:13 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender) client-ip=8.43.85.97;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=h7ELw5Kx;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id D3A8A385702F
	for <ouuuleilei@gmail.com>; Thu, 11 May 2023 10:12:52 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D3A8A385702F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1683799972;
	bh=qwJjoAi6rHP8udm6WodojqkvZR5oxp4cvUepQjCpHfs=;
	h=To:Cc:Subject:Date:In-Reply-To:References:List-Id:
	 List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe:
	 From:Reply-To:From;
	b=h7ELw5KxqzeEYo4x9uC7zT/1i2AQuLhJ947q+sm6/24uYrj//kRhIPrCmdImtlGPS
	 HN44ndJwEEg/f8s2wFB4SEEIY+KlF6dT1MUB2IuTiMOQTzB6ytdUUWshlgIF9/E2L4
	 3Pgxy40bBT3tbIYpxXu0CgYUyoLYABxh1aWrtPLY=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mga17.intel.com (mga17.intel.com [192.55.52.151])
 by sourceware.org (Postfix) with ESMTPS id 5AD443858C54
 for <gcc-patches@gcc.gnu.org>; Thu, 11 May 2023 10:12:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5AD443858C54
X-IronPort-AV: E=McAfee;i="6600,9927,10706"; a="330815126"
X-IronPort-AV: E=Sophos;i="5.99,266,1677571200"; d="scan'208";a="330815126"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
 by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 11 May 2023 03:12:05 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10706"; a="693740216"
X-IronPort-AV: E=Sophos;i="5.99,266,1677571200"; d="scan'208";a="693740216"
Received: from scymds03.sc.intel.com ([10.148.94.166])
 by orsmga007.jf.intel.com with ESMTP; 11 May 2023 03:12:05 -0700
Received: from shgcc101.sh.intel.com (shgcc101.sh.intel.com [10.239.85.97])
 by scymds03.sc.intel.com (Postfix) with ESMTP id D998D78;
 Thu, 11 May 2023 03:12:03 -0700 (PDT)
To: gcc-patches@gcc.gnu.org
Cc: Lili Cui <lili.cui@intel.com>
Subject: [PATCH 2/2] Add a tune option to control the length of the chain with
 FMA
Date: Thu, 11 May 2023 10:12:01 +0000
Message-Id: <20230511101201.2052667-2-lili.cui@intel.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20230511101201.2052667-1-lili.cui@intel.com>
References: <20230511101201.2052667-1-lili.cui@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-11.1 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 KAM_SHORT,
 SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: "Cui,
 Lili via Gcc-patches" <gcc-patches@gcc.gnu.org>
From: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
Reply-To: "Cui, Lili" <lili.cui@intel.com>
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1765592261319665044?=
X-GMAIL-MSGID: =?utf-8?q?1765592261319665044?=

From: Lili Cui <lili.cui@intel.com>

Set the length of the chain with FMA to 5 for icelake_cost.

With this patch applied,
SPR multi-copy: 508.namd_r increased by 3%
ICX multi-copy: 508.namd_r increased by 3.5%,
                507.cactuBSSN_r increased by 3.7%

Using FMA instead of mult + add reduces register pressure and the number of
instructions retired.

gcc/ChangeLog:

	* config/i386/i386-options.cc (ix86_option_override_internal):
	Set param_reassoc_max_chain_length_with_fma.
	* config/i386/i386.h (struct processor_costs): Add
	reassoc_max_chain_length_with_fma.
	* config/i386/x86-tune-costs.h (struct processor_costs): Set
	reassoc_max_chain_length_with_fma to 5 for icelake.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/fma-chain.c: New test.
---
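The effect of the SET_OPTION_IF_UNSET call added below is that the per-CPU
tuning value only applies when the user has not passed the --param
explicitly. A minimal sketch of that behavior, assuming a hypothetical
struct param_state and helper apply_tune_default in place of GCC's option
machinery:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified option state: the current value plus a flag
   recording whether the user set it on the command line.  */
struct param_state
{
  int value;
  bool set_by_user;
};

/* Install TUNE_DEFAULT unless the user already chose a value.  */
static void
apply_tune_default (struct param_state *p, int tune_default)
{
  assert (p != NULL);
  if (!p->set_by_user)
    p->value = tune_default;
}
```

Starting from the param's initial value of 1, an icelake tuning default of 5
takes effect; an explicit --param=reassoc-max-chain-length-with-fma=3 is left
at 3.
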
 gcc/config/i386/i386-options.cc           |  2 ++
 gcc/config/i386/i386.h                    |  3 ++
 gcc/config/i386/x86-tune-costs.h          | 35 +++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/fma-chain.c | 11 +++++++
 4 files changed, 51 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/fma-chain.c

diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 2cb0bddcd35..67d35d89d91 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -2684,6 +2684,8 @@ ix86_option_override_internal (bool main_args_p,
 		       ix86_tune_cost->l1_cache_size);
   SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
 		       ix86_tune_cost->l2_cache_size);
+  SET_OPTION_IF_UNSET (opts, opts_set, param_reassoc_max_chain_length_with_fma,
+		       ix86_tune_cost->reassoc_max_chain_length_with_fma);
 
   /* 64B is the accepted value for these for all x86.  */
   SET_OPTION_IF_UNSET (&global_options, &global_options_set,
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index c7439f89bdf..c7fa7312a67 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -206,6 +206,9 @@ struct processor_costs {
 				   to number of instructions executed in
 				   parallel.  See also
 				   ix86_reassociation_width.  */
+  const int reassoc_max_chain_length_with_fma;
+				/* Specify max reassociation chain length with
+				   FMA.  */
   struct stringop_algs *memcpy, *memset;
   const int cond_taken_branch_cost;    /* Cost of taken branch for vectorizer
 					  cost model.  */
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 4f7a67ca5c5..1f57a5ee2a7 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -127,6 +127,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of SQRTSS instruction.  */
   COSTS_N_BYTES (2),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   ix86_size_memcpy,
   ix86_size_memset,
   COSTS_N_BYTES (1),			/* cond_taken_branch_cost.  */
@@ -238,6 +239,7 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (122),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (122),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   i386_memcpy,
   i386_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -350,6 +352,7 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (83),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (83),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   i486_memcpy,
   i486_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -460,6 +463,7 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (70),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (70),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentium_memcpy,
   pentium_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -563,6 +567,7 @@ struct processor_costs lakemont_cost = {
   COSTS_N_INSNS (31),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (63),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentium_memcpy,
   pentium_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -681,6 +686,7 @@ struct processor_costs pentiumpro_cost = {
   COSTS_N_INSNS (31),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (31),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentiumpro_memcpy,
   pentiumpro_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -790,6 +796,7 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (54),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (54),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   geode_memcpy,
   geode_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -902,6 +909,7 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (56),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (56),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   k6_memcpy,
   k6_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -1015,6 +1023,7 @@ struct processor_costs athlon_cost = {
   COSTS_N_INSNS (19),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (19),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   athlon_memcpy,
   athlon_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -1137,6 +1146,7 @@ struct processor_costs k8_cost = {
   COSTS_N_INSNS (19),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (27),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   k8_memcpy,
   k8_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -1267,6 +1277,7 @@ struct processor_costs amdfam10_cost = {
   COSTS_N_INSNS (19),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (27),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   amdfam10_memcpy,
   amdfam10_memset,
   COSTS_N_INSNS (2),			/* cond_taken_branch_cost.  */
@@ -1390,6 +1401,7 @@ const struct processor_costs bdver_cost = {
   COSTS_N_INSNS (15),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (26),			/* cost of SQRTSD instruction.  */
   1, 2, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   bdver_memcpy,
   bdver_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1545,6 +1557,7 @@ struct processor_costs znver1_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver1_memcpy,
   znver1_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1704,6 +1717,7 @@ struct processor_costs znver2_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver2_memcpy,
   znver2_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1838,6 +1852,7 @@ struct processor_costs znver3_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver2_memcpy,
   znver2_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1974,6 +1989,7 @@ struct processor_costs znver4_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver2_memcpy,
   znver2_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -2100,6 +2116,7 @@ struct processor_costs skylake_cost = {
   COSTS_N_INSNS (12),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   skylake_memcpy,
   skylake_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2228,6 +2245,12 @@ struct processor_costs icelake_cost = {
   COSTS_N_INSNS (12),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  /* Icelake-server prefers FMA chains over breaking dependencies into
+     mult + add, which can reduce the number of instructions retired.
+     A value of 1 means FMA chains are not kept; a value greater than 1
+     enables them.  A chain longer than this value is split according
+     to the reassociation width.  */
+  5,					/* Reassoc max FMA chain length.  */
   icelake_memcpy,
   icelake_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2350,6 +2373,7 @@ struct processor_costs alderlake_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   alderlake_memcpy,
   alderlake_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -2465,6 +2489,7 @@ const struct processor_costs btver1_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (48),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   btver1_memcpy,
   btver1_memset,
   COSTS_N_INSNS (2),			/* cond_taken_branch_cost.  */
@@ -2577,6 +2602,7 @@ const struct processor_costs btver2_cost = {
   COSTS_N_INSNS (16),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (21),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   btver2_memcpy,
   btver2_memset,
   COSTS_N_INSNS (2),			/* cond_taken_branch_cost.  */
@@ -2688,6 +2714,7 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (23),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (38),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentium4_memcpy,
   pentium4_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2802,6 +2829,7 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (32),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (41),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   nocona_memcpy,
   nocona_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2914,6 +2942,7 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (31),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (63),			/* cost of SQRTSD instruction.  */
   2, 2, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   atom_memcpy,
   atom_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -3026,6 +3055,7 @@ struct processor_costs slm_cost = {
   COSTS_N_INSNS (20),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (35),			/* cost of SQRTSD instruction.  */
   1, 2, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   slm_memcpy,
   slm_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -3152,6 +3182,7 @@ struct processor_costs tremont_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   tremont_memcpy,
   tremont_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -3264,6 +3295,7 @@ struct processor_costs intel_cost = {
   COSTS_N_INSNS (40),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (40),			/* cost of SQRTSD instruction.  */
   1, 4, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   intel_memcpy,
   intel_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -3381,6 +3413,7 @@ struct processor_costs lujiazui_cost = {
   COSTS_N_INSNS (32),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (60),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   lujiazui_memcpy,
   lujiazui_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -3502,6 +3535,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   generic_memcpy,
   generic_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -3630,6 +3664,7 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (30),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (58),			/* cost of SQRTSD instruction.  */
   1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   core_memcpy,
   core_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
diff --git a/gcc/testsuite/gcc.target/i386/fma-chain.c b/gcc/testsuite/gcc.target/i386/fma-chain.c
new file mode 100644
index 00000000000..9de61f1b6ff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/fma-chain.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -Wno-attributes" } */
+
+/* Test that the compiler properly optimizes multiply and add
+   to generate more FMA instructions.  */
+float
+foo (float a, float b, float c, float d, float e, float f, float g, float h, float j)
+{
+   return a * b + c * d + e * f + g * h + j;
+}
+/* { dg-final { scan-assembler-times "vfm" 4 } } */
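
For illustration only (not part of the patch): the two shapes the
"Reassoc max FMA chain length" comment distinguishes can be sketched in
plain C for the expression used in fma-chain.c. The function names
fma_chain_form and split_form are hypothetical, and the code GCC actually
emits depends on the target, the options, and the reassociation width.
The sketch assumes that each "x * y + t" statement is contracted into a
single FMA, as is typical with -Ofast on an FMA-capable target.

```c
/* Chain form: four dependent multiply-adds.  With contraction enabled,
   each statement can become one FMA, which matches the four "vfm"
   instructions the test scans for.  */
float
fma_chain_form (float a, float b, float c, float d,
                float e, float f, float g, float h, float j)
{
  float t = a * b + j;   /* FMA 1 */
  t = c * d + t;         /* FMA 2, depends on FMA 1 */
  t = e * f + t;         /* FMA 3, depends on FMA 2 */
  return g * h + t;      /* FMA 4, depends on FMA 3 */
}

/* Split form: two shorter, independent chains joined by a final add.
   This shortens the dependency chain but needs a separate multiply and
   a separate add, so only three of the operations can be FMAs.  */
float
split_form (float a, float b, float c, float d,
            float e, float f, float g, float h, float j)
{
  float t0 = a * b + j;  /* FMA, chain 0 */
  float t1 = c * d;      /* plain multiply, chain 1 */
  t0 = e * f + t0;       /* FMA, chain 0 */
  t1 = g * h + t1;       /* FMA, chain 1 */
  return t0 + t1;        /* plain add joining both chains */
}
```

Both functions compute the same sum. The difference is the trade-off the
tuning value controls: fewer retired instructions in the chain form,
against more instruction-level parallelism in the split form.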