From patchwork Thu May 11 15:23:56 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
X-Patchwork-Id: 92697
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp4461650vqo;
        Thu, 11 May 2023 08:24:45 -0700 (PDT)
X-Google-Smtp-Source: 
 ACHHUZ5F3zYGZyx5AHGeoCi1FIToS5Fy+M9j6Zk6dt2SbOg16D6okPz7Hu/2YJ5XTDqJ7hoqEvhB
X-Received: by 2002:aa7:dbc9:0:b0:50b:c896:12eb with SMTP id
 v9-20020aa7dbc9000000b0050bc89612ebmr17157151edt.31.1683818685569;
        Thu, 11 May 2023 08:24:45 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1683818685; cv=none;
        d=google.com; s=arc-20160816;
        b=iAKhebppIYa+AXSzQJ45x/UoISbMTzEYZvuSa5NKZKhrAAsuVCSxX1nsBIzRGv/hUk
         381/P3CekcUgjWYrS3j4yODkuMp5JigW+Q1q/p+/b6ZbAvCVpceby1wpRThk3YDYS4cn
         /u+gmWtqMpKX5qbPhbZE8mtNkXojVFby1AcuYednfC1i6AULnM+IGBHrCFxJpF8NDUx3
         1opISbDICx9P+iqR7iPIoaPw78UzG1g8T3uOl7Khw42m8u2Xwnl1RhZfeRpeyVfVVgAR
         w9yFF1fxcbEkGJ51iVIzKnbLuMAESaJahi41YmMHI27oGtyhD6XmOPOjlwLBTbk6EYIY
         m+Wg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:from:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence
         :content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:dmarc-filter:delivered-to:dkim-signature:dkim-filter;
        bh=UxxkzMGYMjHFRbgAh7r5cXj+fr9izMkUqmPyOnhmR64=;
        b=dLggWi3WuvwLQBcPCRQNxz7JYZJht3MsL6FOLzIxXHXMnPJU+9Su3QtN7lutF4jGt5
         JIEseWRtuPXI74m5jtTGRkNCOpcEIN3/X7k6BQTCWmmlhUxW4dYjNKZ7AK0vQcFGY1LX
         B9x/xFMWCyr4Iw8Qe6D34bok4UcyvM8T+SOK8thIzV61MF5OUdI4FKhLELzvVsI1eqV1
         MO4gHMgdP/P9y+3M3m1L3HyzYTjIiFfrfW+UXbzzt7AV8dEOtJMoejtS5BOijxmqB9/e
         c0JF/Wr67BqvkTxUADcIpX6ZmvVImIXh+MKnEoTGPvBiigHeSwf61XWaSOJBIlNDfuKJ
         /jGg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=JbDVYq0m;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from sourceware.org (server2.sourceware.org. [8.43.85.97])
        by mx.google.com with ESMTPS id
 w12-20020a056402128c00b00506a0ee2bb3si4774772edv.99.2023.05.11.08.24.45
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 11 May 2023 08:24:45 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender) client-ip=8.43.85.97;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=JbDVYq0m;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 5811B3856973
	for <ouuuleilei@gmail.com>; Thu, 11 May 2023 15:24:44 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 5811B3856973
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1683818684;
	bh=UxxkzMGYMjHFRbgAh7r5cXj+fr9izMkUqmPyOnhmR64=;
	h=To:Cc:Subject:Date:List-Id:List-Unsubscribe:List-Archive:
	 List-Post:List-Help:List-Subscribe:From:Reply-To:From;
	b=JbDVYq0mdxDO+GbX9eR0HdEUCiIdXKyawWAxZT+ZbQkNAiCLYyDBB1oSMFL6D9nW5
	 k9KEEmqvMvSam9rPYM3/JrN6IJGWksX+VQfsSaHs2cusmi1wXNtaRQ9zmvxcRyM9SB
	 iIFXEinL5nGXToPO26+TH5MnWN3S0HfmZzcbigZk=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mga05.intel.com (mga05.intel.com [192.55.52.43])
 by sourceware.org (Postfix) with ESMTPS id 88A3C3858C54
 for <gcc-patches@gcc.gnu.org>; Thu, 11 May 2023 15:23:59 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 88A3C3858C54
X-IronPort-AV: E=McAfee;i="6600,9927,10707"; a="436865345"
X-IronPort-AV: E=Sophos;i="5.99,266,1677571200"; d="scan'208";a="436865345"
Received: from fmsmga002.fm.intel.com ([10.253.24.26])
 by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 11 May 2023 08:23:58 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10707"; a="811654894"
X-IronPort-AV: E=Sophos;i="5.99,266,1677571200"; d="scan'208";a="811654894"
Received: from scymds03.sc.intel.com ([10.148.94.166])
 by fmsmga002.fm.intel.com with ESMTP; 11 May 2023 08:23:58 -0700
Received: from shgcc101.sh.intel.com (shgcc101.sh.intel.com [10.239.85.97])
 by scymds03.sc.intel.com (Postfix) with ESMTP id 93A1878;
 Thu, 11 May 2023 08:23:57 -0700 (PDT)
To: gcc-patches@gcc.gnu.org
Cc: Lili Cui <lili.cui@intel.com>
Subject: [PATCH1/2] PR gcc/98350:Add a param to control the length of the
 chain with FMA in reassoc pass
Date: Thu, 11 May 2023 15:23:56 +0000
Message-Id: <20230511152356.2116735-1-lili.cui@intel.com>
X-Mailer: git-send-email 2.25.1
MIME-Version: 1.0
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: "Cui,
 Lili via Gcc-patches" <gcc-patches@gcc.gnu.org>
From: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
Reply-To: "Cui, Lili" <lili.cui@intel.com>
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1765592261564359060?=
X-GMAIL-MSGID: =?utf-8?q?1765611862099066944?=

From: Lili Cui <lili.cui@intel.com>

Add a param for the chain with FMA in reassoc pass to make it more friendly to
the fma pass later. First to detect if this chain has ability to
generate more than 2 FMAs,if yes and param_reassoc_max_chain_length_with_fma
is enabled, We will rearrange the ops so that they can be combined into more
FMAs. When the chain length exceeds param_reassoc_max_chain_length_with_fma,
build parallel chains according to given association width and try to keep FMA
opportunity as much as possible.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For -Ofast -march=icelake-server  GCC generates:
        vmulss  %xmm3, %xmm2, %xmm2
        vfmadd132ss     %xmm1, %xmm2, %xmm0
        vaddss  (%rdi), %xmm0, %xmm0
        ret

with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
        vfmadd213ss   (%rdi), %xmm1, %xmm0
        vfmadd231ss   %xmm2, %xmm3, %xmm0
        ret

gcc/ChangeLog:

	PR gcc/98350
	* params.opt (reassoc-max-fma-chain-length): New param.
	* tree-ssa-reassoc.cc
	(rewrite_expr_tree_parallel_for_fma): New.
	(rank_ops_for_fma): Ditto.
	(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

	PR gcc/98350
	* gcc.dg/pr98350-1.c: New test.
	* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/params.opt                   |   4 +
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 +++++
 gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
 gcc/tree-ssa-reassoc.cc          | 226 ++++++++++++++++++++++++++++---
 4 files changed, 262 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/params.opt b/gcc/params.opt
index 823cdb2ff85..f7c719afe64 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=reassoc-max-chain-length-with-fma=
+Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
+The maximum chain length with fma considered in reassociation pass.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 00000000000..265e0e57a49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=8 -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
+  }
+}
+/* { dg-final { scan-assembler-times "vfm" 6  } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 00000000000..246025d43b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
+
+/* Test that the compiler properly build parallel chains according to given
+   association width and try to keep FMA opportunity as much as possible.  */
+#define N 33
+double a[N];
+
+void
+foo (void)
+{
+  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
+    + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
+    + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
+    + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
+}
+/* { dg-final { scan-assembler-times "vfm" 12  } } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 067a3f07f7e..f8c70ccadab 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-reassoc.h"
 #include "tree-ssa-math-opts.h"
 #include "gimple-range.h"
+#include "internal-fn.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+/* Rewrite statements with dependency chain with regard to the chance to
+   generate FMA. When the dependency chain length exceeds
+   param_max_reassoc_chain_length_with_fma, build parallel chains according to
+   given association width and try to keep fma opportunity as much as possible.
+   E.g.
+   e + f + g + a * b + c * d;
+
+   ssa1 = e + f;
+   ssa2 = g + a * b;
+   ssa3 = ssa1 + c * d;
+   ssa4 = ssa2 + ssa3;
+
+   This reassociation approach preserves the chance of fma generation as much
+   as possible.  */
+static void
+rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
+					 const vec<operand_entry *> &ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = ops.length ();
+  gcc_assert (op_num > 0);
+  int stmt_num = op_num - 1;
+  gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
+  int op_index = op_num - 1;
+  int width_count = width;
+  int i = 0, j = 0;
+  tree tmp_op[2], op1;
+  operand_entry *oe;
+  gimple *stmt1 = NULL;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  /* Build parallel FMA dependency chain according to width.  */
+  for (i = 0; i < width; i++)
+    {
+      for (j = 0; j < 2; j++)
+	{
+	  oe = ops[op_index--];
+	  tmp_op[j] = oe->op;
+	  stmt1 = oe->stmt_to_insert;
+	  if (stmt1)
+	    insert_stmt_before_use (stmts[i], stmt1);
+	  stmt1 = NULL;
+	}
+      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
+      gimple_set_visited (stmts[i], true);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
+	}
+    }
+
+  for (i = width; i < stmt_num; i++)
+    {
+      /* We keep original statement only for the last one.  All others are
+	 recreated.  */
+      if ( op_index < 0)
+	{
+	  if (width_count == 2)
+	    {
+
+	      gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
+	      gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
+	    }
+	  else
+	    {
+
+	      stmts[i] =
+		build_and_add_sum (TREE_TYPE (last_rhs1),
+				   gimple_assign_lhs (stmts[i-width_count]),
+				   gimple_assign_lhs (stmts[i-width_count+1]),
+				   opcode);
+	      width_count--;
+	    }
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  oe = ops[op_index--];
+	  op1 = oe->op;
+	  stmt1 = oe->stmt_to_insert;
+	  if (stmt1)
+	    insert_stmt_before_use (stmts[i], stmt1);
+	  stmt1 = NULL;
+	  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
+				  gimple_assign_lhs (stmts[i-width]),
+				  op1,
+				  opcode);
+	  gimple_set_visited (stmts[i], true);
+  }
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
+	}
+    }
+  remove_visited_stmt_chain (last_rhs1);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order and trying to allow operations to be executed in
@@ -6649,6 +6758,62 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
     }
 }
 
+/* Rearrange ops to generate more FMA when the chain may has more than 2 fmas.
+   Putting ops that not def from mult in front can generate more fma.
+   E.g.
+   a * b + c * d + e generates:
+
+   _4  = c_9(D) * d_10(D);
+   _12 = .FMA (a_7(D), b_8(D), _4);
+   _11 = e_6(D) + _12;
+
+   Rtearrange ops to -> e + a * b + c * d generates:
+
+   _4  = .FMA (c_7(D), d_8(D), _3);
+   _11 = .FMA (a_5(D), b_6(D), _4);
+ */
+static bool
+rank_ops_for_fma (vec<operand_entry *> *ops)
+{
+  operand_entry *oe;
+  unsigned int i;
+  auto_vec<operand_entry *> ops_mult;
+  auto_vec<operand_entry *> ops_others;
+
+  FOR_EACH_VEC_ELT (*ops, i, oe)
+    {
+      if (TREE_CODE (oe->op) == SSA_NAME)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+	  if (is_gimple_assign (def_stmt)
+	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+	    ops_mult.safe_push (oe);
+	  else
+	    ops_others.safe_push (oe);
+	}
+      else
+	ops_others.safe_push (oe);
+    }
+  /* When ops_mult.length == 2, like the following case,
+
+     a * b + c * d + e.
+
+     we need to rearrange the ops.
+
+     Putting ops that not def from mult in front can generate more fmas.  */
+  if (ops_mult.length () >= 2)
+    {
+      /* If all ops are defined with mult, we don't need to rearrange them.  */
+      if (ops_mult.length () != ops->length ())
+	{
+	  ops->truncate (0);
+	  ops->splice (ops_mult);
+	  ops->splice (ops_others);
+	}
+      return true;
+    }
+  return false;
+}
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
 
@@ -6813,6 +6978,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
+		  bool keep_fma_chain = false;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -6826,36 +6992,64 @@ reassociate_bb (basic_block bb)
 		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
 		    std::swap (*ops[0], *ops[ops_num - 1]);
 
+		  optimization_type opt_type = bb_optimization_type (bb);
+
+		  /* When enabling param_reassoc_max_chain_length_with_fma to
+		     keep the chain with fma, rank_ops_for_fma will detect if
+		     the chain has fmas and if so it will rearrange the ops.  */
+		  if (param_reassoc_max_chain_length_with_fma > 1
+		      && direct_internal_fn_supported_p (IFN_FMA,
+							 TREE_TYPE (lhs),
+							 opt_type)
+		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
+		    {
+		      keep_fma_chain = rank_ops_for_fma(&ops);
+		    }
+
+		  int len = ops.length ();
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
-		      && ops.length () > 3
+		      && len > 3
+		      && (!keep_fma_chain
+			  || (keep_fma_chain
+			      && len > param_reassoc_max_chain_length_with_fma))
 		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+							    mode)) > 1)
 		    {
-		      if (dump_file && (dump_flags & TDF_DETAILS))
-			fprintf (dump_file,
-				 "Width = %d was chosen for reassociation\n",
-				 width);
-		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
-						  width, ops);
+		      if (keep_fma_chain)
+			{
+			  if (dump_file && (dump_flags & TDF_DETAILS))
+			    fprintf (dump_file,
+				     "Break chain len = %d into width for FMA\n",
+				     len);
+			  rewrite_expr_tree_parallel_for_fma
+			    (as_a <gassign *> (stmt), width, ops);
+			}
+		      else
+			{
+			  if (dump_file && (dump_flags & TDF_DETAILS))
+			    fprintf (dump_file,
+				     "Width = %d was chosen for reassociation\n",
+				     width);
+			  rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
+						      width, ops);
+			}
 		    }
 		  else
-                    {
-                      /* When there are three operands left, we want
-                         to make sure the ones that get the double
-                         binary op are chosen wisely.  */
-                      int len = ops.length ();
-                      if (len >= 3)
+		    {
+		      /* When there are three operands left, we want
+			 to make sure the ones that get the double
+			 binary op are chosen wisely.  */
+		      if (len >= 3 && !keep_fma_chain)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
 						   powi_result != NULL
 						   || negate_result,
 						   len != orig_len);
-                    }
-
+		    }
 		  /* If we combined some repeated factors into a 
 		     __builtin_powi call, multiply that result by the
 		     reassociated operands.  */