[1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
Checks
Commit Message
From: Lili Cui <lili.cui@intel.com>
Hi,
Those two patches each add a param to control the length of the chain with
FMA in reassoc pass and a tuning option in the backend.
Bootstrapped and regtested. Ok for trunk?
Regards
Lili.
Add a param for the chain with FMA in reassoc pass to make it more friendly to
the fma pass later. First to detect if this chain has ability to
generate more than 2 FMAs,if yes and param_reassoc_max_chain_length_with_fma
is enabled, We will rearrange the ops so that they can be combined into more
FMAs. When the chain length exceeds param_reassoc_max_chain_length_with_fma,
build parallel chains according to given association width and try to keep FMA
opportunity as much as possible.
TEST1:
float
foo (float a, float b, float c, float d, float *e)
{
return *e + a * b + c * d ;
}
For -Ofast -march=icelake-server GCC generates:
vmulss %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss (%rdi), %xmm0, %xmm0
ret
with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
vfmadd213ss (%rdi), %xmm1, %xmm0
vfmadd231ss %xmm2, %xmm3, %xmm0
ret
gcc/ChangeLog:
PR gcc/98350
* params.opt (reassoc-max-fma-chain-length): New param.
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel_for_fma): New.
(rank_ops_for_fma): Ditto.
(reassociate_bb): Handle new function.
gcc/testsuite/ChangeLog:
PR gcc/98350
* gcc.dg/pr98350-1.c: New test.
* gcc.dg/pr98350-2.c: Ditto.
---
gcc/params.opt | 4 +
gcc/testsuite/gcc.dg/pr98350-1.c | 31 +++++
gcc/testsuite/gcc.dg/pr98350-2.c | 17 +++
gcc/tree-ssa-reassoc.cc | 228 ++++++++++++++++++++++++++++---
4 files changed, 264 insertions(+), 16 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
Comments
On Thu, May 11, 2023 at 12:13 PM Cui, Lili via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> From: Lili Cui <lili.cui@intel.com>
>
> Hi,
>
> Those two patches each add a param to control the length of the chain with
> FMA in reassoc pass and a tuning option in the backend.
>
> Bootstrapped and regtested. Ok for trunk?
>
> Regards
> Lili.
>
> Add a param for the chain with FMA in reassoc pass to make it more friendly to
> the fma pass later. First to detect if this chain has ability to
> generate more than 2 FMAs,if yes and param_reassoc_max_chain_length_with_fma
> is enabled, We will rearrange the ops so that they can be combined into more
> FMAs. When the chain length exceeds param_reassoc_max_chain_length_with_fma,
> build parallel chains according to given association width and try to keep FMA
> opportunity as much as possible.
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e)
> {
> return *e + a * b + c * d ;
> }
>
> For -Ofast -march=icelake-server GCC generates:
> vmulss %xmm3, %xmm2, %xmm2
> vfmadd132ss %xmm1, %xmm2, %xmm0
> vaddss (%rdi), %xmm0, %xmm0
> ret
>
> with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
> vfmadd213ss (%rdi), %xmm1, %xmm0
> vfmadd231ss %xmm2, %xmm3, %xmm0
> ret
>
> gcc/ChangeLog:
>
> PR gcc/98350
> * params.opt (reassoc-max-fma-chain-length): New param.
> * tree-ssa-reassoc.cc
> (rewrite_expr_tree_parallel_for_fma): New.
> (rank_ops_for_fma): Ditto.
> (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
> PR gcc/98350
> * gcc.dg/pr98350-1.c: New test.
> * gcc.dg/pr98350-2.c: Ditto.
> ---
> gcc/params.opt | 4 +
> gcc/testsuite/gcc.dg/pr98350-1.c | 31 +++++
> gcc/testsuite/gcc.dg/pr98350-2.c | 17 +++
> gcc/tree-ssa-reassoc.cc | 228 ++++++++++++++++++++++++++++---
> 4 files changed, 264 insertions(+), 16 deletions(-)
> create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
> create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 823cdb2ff85..f7c719afe64 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
> Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
> Enable loop vectorization of floating point inductions.
>
> +-param=reassoc-max-chain-length-with-fma=
> +Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
> +The maximum chain length with fma considered in reassociation pass.
> +
> ; This comment is to ensure we retain the blank line above.
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 00000000000..32ecce13a2d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> + to generate more FMA instructions. */
> +#define N 1024
> +double a[N];
> +double b[N];
> +double c[N];
> +double d[N];
> +double e[N];
> +double f[N];
> +double g[N];
> +double h[N];
> +double j[N];
> +double k[N];
> +double l[N];
> +double m[N];
> +double o[N];
> +double p[N];
> +
> +
> +void
> +foo (void)
> +{
> + for (int i = 0; i < N; i++)
> + {
> + a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> + }
> +}
> +/* { dg-final { scan-assembler-times "vfm" 6 } } */
> diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> new file mode 100644
> index 00000000000..246025d43b8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
> +
> +/* Test that the compiler properly build parallel chains according to given
> + association width and try to keep FMA opportunity as much as possible. */
> +#define N 33
> +double a[N];
> +
> +void
> +foo (void)
> +{
> + a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
> + + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
> + + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
> + + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
> +}
> +/* { dg-final { scan-assembler-times "vfm" 12 } } */
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 067a3f07f7e..6d2e158c4f5 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see
> #include "tree-ssa-reassoc.h"
> #include "tree-ssa-math-opts.h"
> #include "gimple-range.h"
> +#include "internal-fn.h"
>
> /* This is a simple global reassociation pass. It is, in part, based
> on the LLVM pass of the same name (They do some things more/less
> @@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
> return width;
> }
>
> +/* Rewrite statements with dependency chain with regard to the chance to
> + generate FMA. When the dependency chain length exceeds
> + param_max_reassoc_chain_length_with_fma, build parallel chains according to
> + given association width and try to keep fma opportunity as much as possible.
> + E.g.
> + e + f + g + a * b + c * d;
> +
> + ssa1 = e + f;
> + ssa2 = g + a * b;
> + ssa3 = ssa1 + c * d;
> + ssa4 = ssa2 + ssa3;
> +
> + This reassociation approach preserves the chance of fma generation as much
> + as possible. */
> +static void
> +rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
> + const vec<operand_entry *> &ops)
> +{
> + enum tree_code opcode = gimple_assign_rhs_code (stmt);
> + int op_num = ops.length ();
> + gcc_assert (op_num > 0);
> + int stmt_num = op_num - 1;
> + gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> + int op_index = op_num - 1;
> + int width_count = width;
> + int i = 0, j = 0;
> + tree tmp_op[2], op1;
> + operand_entry *oe;
> + gimple *stmt1 = NULL;
> + tree last_rhs1 = gimple_assign_rhs1 (stmt);
> +
> + /* We start expression rewriting from the top statements.
> + So, in this loop we create a full list of statements
> + we will work with. */
> + stmts[stmt_num - 1] = stmt;
> + for (i = stmt_num - 2; i >= 0; i--)
> + stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> +
> + /* Build parallel FMA dependency chain according to width. */
> + for (i = 0; i < width; i++)
> + {
> + for (j = 0; j < 2; j++)
> + {
> + oe = ops[op_index--];
> + tmp_op[j] = oe->op;
> + stmt1 = oe->stmt_to_insert;
> + if (stmt1)
> + insert_stmt_before_use (stmts[i], stmt1);
> + stmt1 = NULL;
> + }
> + stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
> + gimple_set_visited (stmts[i], true);
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, " into ");
> + print_gimple_stmt (dump_file, stmts[i], 0);
> + }
> + }
> +
> + for (i = width; i < stmt_num; i++)
> + {
> + /* We keep original statement only for the last one. All others are
> + recreated. */
> + if ( op_index < 0)
> + {
> + if (width_count == 2)
> + {
> +
> + gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> + gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> + }
> + else
> + {
> +
> + stmts[i] =
> + build_and_add_sum (TREE_TYPE (last_rhs1),
> + gimple_assign_lhs (stmts[i-width_count]),
> + gimple_assign_lhs (stmts[i-width_count+1]),
> + opcode);
> + width_count--;
> + }
> + update_stmt (stmts[i]);
> + }
> + else
> + {
> + oe = ops[op_index--];
> + op1 = oe->op;
> + stmt1 = oe->stmt_to_insert;
> + if (stmt1)
> + insert_stmt_before_use (stmts[i], stmt1);
> + stmt1 = NULL;
> + stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> + gimple_assign_lhs (stmts[i-width]),
> + op1,
> + opcode);
> + gimple_set_visited (stmts[i], true);
> + }
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, " into ");
> + print_gimple_stmt (dump_file, stmts[i], 0);
> + }
> + }
> + remove_visited_stmt_chain (last_rhs1);
> +}
> +
> /* Recursively rewrite our linearized statements so that the operators
> match those in OPS[OPINDEX], putting the computation in rank
> order and trying to allow operations to be executed in
> @@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
> }
> }
>
> +/* Rearrange ops to generate more FMA when the chain may has more than 2 fmas.
> + Putting ops that not def from mult in front can generate more fma.
> + E.g.
> + a * b + c * d + e generates:
> +
> + _4 = c_9(D) * d_10(D);
> + _12 = .FMA (a_7(D), b_8(D), _4);
> + _11 = e_6(D) + _12;
> +
> + Rtearrange ops to -> e + a * b + c * d generates:
> +
> + _4 = .FMA (c_7(D), d_8(D), _3);
> + _11 = .FMA (a_5(D), b_6(D), _4);
> + */
> +static bool
> +rank_ops_for_fma (vec<operand_entry *> *ops)
> +{
> + operand_entry *oe;
> + unsigned int i;
> + auto_vec<operand_entry *> ops_mult;
> + auto_vec<operand_entry *> ops_others;
> +
> + FOR_EACH_VEC_ELT (*ops, i, oe)
> + {
> + if (TREE_CODE (oe->op) == SSA_NAME)
> + {
> + gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> + if (is_gimple_assign (def_stmt)
> + && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> + ops_mult.safe_push (oe);
> + else
> + ops_others.safe_push (oe);
> + }
> + else
> + ops_others.safe_push (oe);
> + }
> + /* When ops_mult.length == 2, like the following case,
> +
> + a * b + c * d + e.
> +
> + we need to rearrange the ops.
> +
> + Putting ops that not def from mult in front can generate more fmas. */
> + if (ops_mult.length () >= 2)
> + {
> + /* If all ops are defined with mult, we don't need to rearrange them. */
> + if (ops_mult.length () != ops->length ())
> + {
> + ops->truncate (0);
> + FOR_EACH_VEC_ELT (ops_mult, i, oe)
> + ops->safe_push (oe);
As you are not changing the number of ops you should be able to use
quick_push here
and below. You should be able to do
ops->splice (ops_mult);
ops->splice (ops_others);
as well.
> + FOR_EACH_VEC_ELT (ops_others, i, oe)
> + ops->safe_push (oe);
> + }
> + return true;
> + }
> + return false;
> +}
> /* Reassociate expressions in basic block BB and its post-dominator as
> children.
>
> @@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb)
> machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> int ops_num = ops.length ();
> int width;
> + bool keep_fma_chain = false;
>
> /* For binary bit operations, if there are at least 3
> operands and the last operand in OPS is a constant,
> @@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb)
> && TREE_CODE (ops.last ()->op) == INTEGER_CST)
> std::swap (*ops[0], *ops[ops_num - 1]);
>
> + optimization_type opt_type = bb_optimization_type (bb);
> +
> + /* When enabling param_reassoc_max_chain_length_with_fma to
> + keep the chain with fma, rank_ops_for_fma will detect if
> + the chain has fmas and if so it will rearrange the ops. */
> + if (param_reassoc_max_chain_length_with_fma > 1
> + && direct_internal_fn_supported_p (IFN_FMA,
> + TREE_TYPE (lhs),
> + opt_type)
> + && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> + {
> + keep_fma_chain = rank_ops_for_fma(&ops);
> + }
> +
> + int len = ops.length ();
> /* Only rewrite the expression tree to parallel in the
> last reassoc pass to avoid useless work back-and-forth
> with initial linearization. */
we are doing the parallel rewrite only in the last reassoc pass, i think it
makes sense to do the same for reassoc-for-fma.
Why do the existing expr rewrites not work after re-sorting the ops?
> if (!reassoc_insert_powi_p
> - && ops.length () > 3
> + && len > 3
> + && (!keep_fma_chain
> + || (keep_fma_chain
> + && len > param_reassoc_max_chain_length_with_fma))
in the case len < param_reassoc_max_chain_length_with_fma we have the
chain re-sorted but fall through to non-parallel rewrite. I wonder if we do not
want to instead adjust the reassociation width? I'd say it depends on the
number of mult cases in the chain (sth the re-sorting could have computed).
Why do we have two completely independent --params here? Can you give
an example --param
value combination that makes "sense" and show how it is beneficial?
> && (width = get_reassociation_width (ops_num, rhs_code,
> - mode)) > 1)
> + mode)) > 1)
> {
> - if (dump_file && (dump_flags & TDF_DETAILS))
> - fprintf (dump_file,
> - "Width = %d was chosen for reassociation\n",
> - width);
> - rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> - width, ops);
> + if (keep_fma_chain)
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> + "Break chain len = %d into width for FMA\n",
> + len);
> + rewrite_expr_tree_parallel_for_fma
> + (as_a <gassign *> (stmt), width, ops);
> + }
> + else
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> + "Width = %d was chosen for reassociation\n",
> + width);
> + rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> + width, ops);
> + }
> }
> else
> - {
> - /* When there are three operands left, we want
> - to make sure the ones that get the double
> - binary op are chosen wisely. */
> - int len = ops.length ();
> - if (len >= 3)
> + {
> + /* When there are three operands left, we want
> + to make sure the ones that get the double
> + binary op are chosen wisely. */
> + if (len >= 3 && !keep_fma_chain)
> swap_ops_for_binary_stmt (ops, len - 3);
>
> new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> powi_result != NULL
> || negate_result,
> len != orig_len);
> - }
> -
> + }
> /* If we combined some repeated factors into a
> __builtin_powi call, multiply that result by the
> reassociated operands. */
> --
> 2.25.1
>
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Thursday, May 11, 2023 6:53 PM
> To: Cui, Lili <lili.cui@intel.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of
> the chain with FMA in reassoc pass
Hi Richard,
Thanks for helping to review the patch.
>
> As you are not changing the number of ops you should be able to use
> quick_push here and below. You should be able to do
>
> ops->splice (ops_mult);
> ops->splice (ops_others);
>
> as well.
>
Done.
> > + /* When enabling param_reassoc_max_chain_length_with_fma
> to
> > + keep the chain with fma, rank_ops_for_fma will detect if
> > + the chain has fmas and if so it will rearrange the ops. */
> > + if (param_reassoc_max_chain_length_with_fma > 1
> > + && direct_internal_fn_supported_p (IFN_FMA,
> > + TREE_TYPE (lhs),
> > + opt_type)
> > + && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> > + {
> > + keep_fma_chain = rank_ops_for_fma(&ops);
> > + }
> > +
> > + int len = ops.length ();
> > /* Only rewrite the expression tree to parallel in the
> > last reassoc pass to avoid useless work back-and-forth
> > with initial linearization. */
>
> we are doing the parallel rewrite only in the last reassoc pass, i think it makes
> sense to do the same for reassoc-for-fma.
I rearranged the order of ops in reassoc1 without break the chain, it generated more vectorize during vector pass( seen in benchmark 503). So I rewrite the ssa tree and keep the chain with function "rewrite_expr_tree" in reassoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in reassoc2.
>
> Why do the existing expr rewrites not work after re-sorting the ops?
For case https://godbolt.org/z/3x9PWE9Kb: we put "j" at first.
j + l * m + a * b + c * d + e * f + g * h;
GCC trunk: width = 2, ops_num = 6, old function " rewrite_expr_tree_parallel " generates 3 FMAs.
---------------------------------------------------------------------------------------
_1 = l_10(D) * m_11(D);
_3 = a_13(D) * b_14(D);
_4 = j_12(D) + _3; --------> Here is one FMA.
_5 = c_15(D) * d_16(D);
_8 = _1 + _5; --------> Here is one FMA and lost one.
_7 = e_17(D) * f_18(D);
_9 = g_19(D) * h_20(D);
_2 = _7 + _9; --------> Here is one FMA and lost one.
_6 = _2 + _4;
_21 = _6 + _8;
# VUSE <.MEM_22(D)>
return _21;
--------------------------------------------------------------------------------------
width = 2, ops_num = 6, new function " rewrite_expr_tree_parallel_for_fma " generates 4 FMAs.
------------------------------------------------------------------------------
_1 = a_10(D) * b_11(D);
_3 = c_13(D) * d_14(D);
_5 = e_15(D) * f_16(D);
_7 = g_17(D) * h_18(D);
_4 = _5 + _7; --------> Here is one FMA and lost one.
_8 = _4 + _1; --------> Here is one FMA.
_9 = l_19(D) * m_20(D);
_2 = _9 + j_12(D); --------> Here is one FMA.
_6 = _2 + _3; --------> Here is one FMA.
_21 = _8 + _6;
return _21;
------------------------------------------------------------------------------------
>
> > if (!reassoc_insert_powi_p
> > - && ops.length () > 3
> > + && len > 3
> > + && (!keep_fma_chain
> > + || (keep_fma_chain
> > + && len >
> > + param_reassoc_max_chain_length_with_fma))
>
> in the case len < param_reassoc_max_chain_length_with_fma we have the
> chain re-sorted but fall through to non-parallel rewrite. I wonder if we do
> not want to instead adjust the reassociation width? I'd say it depends on the
> number of mult cases in the chain (sth the re-sorting could have computed).
> Why do we have two completely independent --params here? Can you give
> an example --param value combination that makes "sense" and show how it
> is beneficial?
For this small case https://godbolt.org/z/Pxczrre8P
a * b + c * d + e * f + j
GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp cost is 4). Calculated: Width = 2. we can get 2 FMAs.
----------------------------------
_1 = a_6(D) * b_7(D);
_2 = c_8(D) * d_9(D);
_5 = _1 + _2;
_4 = e_10(D) * f_11(D);
_3 = _4 + j_12(D);
_13 = _3 + _5;
--------------------------------------------------------
_2 = c_8(D) * d_9(D);
_5 = .FMA (a_6(D), b_7(D), _2);
_3 = .FMA (e_10(D), f_11(D), j_12(D));
_13 = _3 + _5;
--------------------------------------------------------
New patch: If just rearrange ops and fall through to parallel rewrite to break the chain with width = 2.
---------------------------------------------------------
_1 = a_6(D) * b_7(D);
_2 = j + _1; -----> put j at the first.
_3 = c_8(D) * d_9(D);
_4 = e_10(D) * f_11(D);
_5 = _3 + _4; -----> break chain with width = 2. we lost a FMA here.
_13 = _2 + 5;
-------------------------------------------------------
_3 = c_8(D) * d_9(D);
_2 = .FMA (a_6(D), b_7(D), j);
_5 = .FMA (e_10(D), f_11(D), _3);
_13 = _2 + _5;
--------------------------------------------------------
Sometimes break chain will lose FMA( break chain needs put two mult-ops together, which will lose one FMA ), we can only get 2 FMAs here, if we want to get 3 FMAs, we need to keep the chain and not break it. So I added a param to control chain length "param_reassoc_max_chain_length_with_fma = 4" (For the small case in Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
-------------------------------------------------------
_1 = a_6(D) * b_7(D);
_2 = c_8(D) * d_9(D);
_4 = e_10(D) * f_11(D);
_15 = _4 + j_12(D);
_16 = _15 + _2;
_13 = _16 + _1;
-------------------------------------------------------
_15 = .FMA (e_10(D), f_11(D), j_12(D));
_16 = .FMA (c_8(D), d_9(D), _15);
_13 = .FMA (a_6(D), b_7(D), _16);
-------------------------------------------------------
In some case we want to break the chain with width, we can set "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and break the chain with width.
>
> > && (width = get_reassociation_width (ops_num, rhs_code,
> > - mode)) > 1)
> > + mode)) >
> > + 1)
> > {
> > - if (dump_file && (dump_flags & TDF_DETAILS))
> > - fprintf (dump_file,
> > - "Width = %d was chosen for reassociation\n",
> > - width);
> > - rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > - width, ops);
> > + if (keep_fma_chain)
> > + {
> > + if (dump_file && (dump_flags & TDF_DETAILS))
> > + fprintf (dump_file,
> > + "Break chain len = %d into width for FMA\n",
> > + len);
> > + rewrite_expr_tree_parallel_for_fma
> > + (as_a <gassign *> (stmt), width, ops);
> > + }
> > + else
> > + {
> > + if (dump_file && (dump_flags & TDF_DETAILS))
> > + fprintf (dump_file,
> > + "Width = %d was chosen for reassociation\n",
> > + width);
> > + rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > + width, ops);
> > + }
> > }
> > else
> > - {
> > - /* When there are three operands left, we want
> > - to make sure the ones that get the double
> > - binary op are chosen wisely. */
> > - int len = ops.length ();
> > - if (len >= 3)
> > + {
> > + /* When there are three operands left, we want
> > + to make sure the ones that get the double
> > + binary op are chosen wisely. */
> > + if (len >= 3 && !keep_fma_chain)
> > swap_ops_for_binary_stmt (ops, len - 3);
> >
> > new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> > powi_result != NULL
> > || negate_result,
> > len != orig_len);
> > - }
> > -
> > + }
> > /* If we combined some repeated factors into a
> > __builtin_powi call, multiply that result by the
> > reassociated operands. */
> > --
> > 2.25.1
> >
On Thu, May 11, 2023 at 5:20 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Thursday, May 11, 2023 6:53 PM
> > To: Cui, Lili <lili.cui@intel.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of
> > the chain with FMA in reassoc pass
>
> Hi Richard,
> Thanks for helping to review the patch.
>
> >
> > As you are not changing the number of ops you should be able to use
> > quick_push here and below. You should be able to do
> >
> > ops->splice (ops_mult);
> > ops->splice (ops_others);
> >
> > as well.
> >
> Done.
>
> > > + /* When enabling param_reassoc_max_chain_length_with_fma
> > to
> > > + keep the chain with fma, rank_ops_for_fma will detect if
> > > + the chain has fmas and if so it will rearrange the ops. */
> > > + if (param_reassoc_max_chain_length_with_fma > 1
> > > + && direct_internal_fn_supported_p (IFN_FMA,
> > > + TREE_TYPE (lhs),
> > > + opt_type)
> > > + && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> > > + {
> > > + keep_fma_chain = rank_ops_for_fma(&ops);
> > > + }
> > > +
> > > + int len = ops.length ();
> > > /* Only rewrite the expression tree to parallel in the
> > > last reassoc pass to avoid useless work back-and-forth
> > > with initial linearization. */
> >
> > we are doing the parallel rewrite only in the last reassoc pass, i think it makes
> > sense to do the same for reassoc-for-fma.
>
> I rearranged the order of ops in reassoc1 without break the chain, it generated more vectorize during vector pass( seen in benchmark 503). So I rewrite the ssa tree and keep the chain with function "rewrite_expr_tree" in reassoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in reassoc2.
>
> >
> > Why do the existing expr rewrites not work after re-sorting the ops?
>
> For case https://godbolt.org/z/3x9PWE9Kb: we put "j" at first.
>
> j + l * m + a * b + c * d + e * f + g * h;
>
> GCC trunk: width = 2, ops_num = 6, old function " rewrite_expr_tree_parallel " generates 3 FMAs.
> ---------------------------------------------------------------------------------------
> _1 = l_10(D) * m_11(D);
> _3 = a_13(D) * b_14(D);
> _4 = j_12(D) + _3; --------> Here is one FMA.
> _5 = c_15(D) * d_16(D);
> _8 = _1 + _5; --------> Here is one FMA and lost one.
> _7 = e_17(D) * f_18(D);
> _9 = g_19(D) * h_20(D);
> _2 = _7 + _9; --------> Here is one FMA and lost one.
> _6 = _2 + _4;
> _21 = _6 + _8;
> # VUSE <.MEM_22(D)>
> return _21;
> --------------------------------------------------------------------------------------
> width = 2, ops_num = 6, new function " rewrite_expr_tree_parallel_for_fma " generates 4 FMAs.
> ------------------------------------------------------------------------------
> _1 = a_10(D) * b_11(D);
> _3 = c_13(D) * d_14(D);
> _5 = e_15(D) * f_16(D);
> _7 = g_17(D) * h_18(D);
> _4 = _5 + _7; --------> Here is one FMA and lost one.
> _8 = _4 + _1; --------> Here is one FMA.
> _9 = l_19(D) * m_20(D);
> _2 = _9 + j_12(D); --------> Here is one FMA.
> _6 = _2 + _3; --------> Here is one FMA.
> _21 = _8 + _6;
> return _21;
> ------------------------------------------------------------------------------------
ISTR there were no sufficient comments in the code explaining why
rewrite_expr_tree_parallel_for_fma is better by design. In fact ...
>
> >
> > > if (!reassoc_insert_powi_p
> > > - && ops.length () > 3
> > > + && len > 3
> > > + && (!keep_fma_chain
> > > + || (keep_fma_chain
> > > + && len >
> > > + param_reassoc_max_chain_length_with_fma))
> >
> > in the case len < param_reassoc_max_chain_length_with_fma we have the
> > chain re-sorted but fall through to non-parallel rewrite. I wonder if we do
> > not want to instead adjust the reassociation width? I'd say it depends on the
> > number of mult cases in the chain (sth the re-sorting could have computed).
> > Why do we have two completely independent --params here? Can you give
> > an example --param value combination that makes "sense" and show how it
> > is beneficial?
>
> For this small case https://godbolt.org/z/Pxczrre8P
> a * b + c * d + e * f + j
>
> GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> ----------------------------------
> _1 = a_6(D) * b_7(D);
> _2 = c_8(D) * d_9(D);
> _5 = _1 + _2;
> _4 = e_10(D) * f_11(D);
> _3 = _4 + j_12(D);
> _13 = _3 + _5;
> --------------------------------------------------------
> _2 = c_8(D) * d_9(D);
> _5 = .FMA (a_6(D), b_7(D), _2);
> _3 = .FMA (e_10(D), f_11(D), j_12(D));
> _13 = _3 + _5;
> --------------------------------------------------------
> New patch: If just rearrange ops and fall through to parallel rewrite to break the chain with width = 2.
>
> ---------------------------------------------------------
> _1 = a_6(D) * b_7(D);
> _2 = j + _1; -----> put j at the first.
> _3 = c_8(D) * d_9(D);
> _4 = e_10(D) * f_11(D);
> _5 = _3 + _4; -----> break chain with width = 2. we lost a FMA here.
> _13 = _2 + 5;
>
> -------------------------------------------------------
> _3 = c_8(D) * d_9(D);
> _2 = .FMA (a_6(D), b_7(D), j);
> _5 = .FMA (e_10(D), f_11(D), _3);
> _13 = _2 + _5;
> --------------------------------------------------------
> Sometimes break chain will lose FMA( break chain needs put two mult-ops together, which will lose one FMA ), we can only get 2 FMAs here, if we want to get 3 FMAs, we need to keep the chain and not break it. So I added a param to control chain length "param_reassoc_max_chain_length_with_fma = 4" (For the small case in Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> -------------------------------------------------------
> _1 = a_6(D) * b_7(D);
> _2 = c_8(D) * d_9(D);
> _4 = e_10(D) * f_11(D);
> _15 = _4 + j_12(D);
> _16 = _15 + _2;
> _13 = _16 + _1;
> -------------------------------------------------------
> _15 = .FMA (e_10(D), f_11(D), j_12(D));
> _16 = .FMA (c_8(D), d_9(D), _15);
> _13 = .FMA (a_6(D), b_7(D), _16);
> -------------------------------------------------------
> In some case we want to break the chain with width, we can set "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and break the chain with width.
... it sounds like the problem could be fully addressed by sorting the
chain with reassoc-width in mind?
Wouldn't it be preferable if rewrite_expr_tree_parallel would get a
vector of mul and a vector of
non-mul ops so it can pick from the optimal candidate?
That said, I think rewrite_expr_tree_parallel_for_fma at least needs
more comments.
> >
> > > && (width = get_reassociation_width (ops_num, rhs_code,
> > > - mode)) > 1)
> > > + mode)) >
> > > + 1)
> > > {
> > > - if (dump_file && (dump_flags & TDF_DETAILS))
> > > - fprintf (dump_file,
> > > - "Width = %d was chosen for reassociation\n",
> > > - width);
> > > - rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > > - width, ops);
> > > + if (keep_fma_chain)
> > > + {
> > > + if (dump_file && (dump_flags & TDF_DETAILS))
> > > + fprintf (dump_file,
> > > + "Break chain len = %d into width for FMA\n",
> > > + len);
> > > + rewrite_expr_tree_parallel_for_fma
> > > + (as_a <gassign *> (stmt), width, ops);
> > > + }
> > > + else
> > > + {
> > > + if (dump_file && (dump_flags & TDF_DETAILS))
> > > + fprintf (dump_file,
> > > + "Width = %d was chosen for reassociation\n",
> > > + width);
> > > + rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > > + width, ops);
> > > + }
> > > }
> > > else
> > > - {
> > > - /* When there are three operands left, we want
> > > - to make sure the ones that get the double
> > > - binary op are chosen wisely. */
> > > - int len = ops.length ();
> > > - if (len >= 3)
> > > + {
> > > + /* When there are three operands left, we want
> > > + to make sure the ones that get the double
> > > + binary op are chosen wisely. */
> > > + if (len >= 3 && !keep_fma_chain)
> > > swap_ops_for_binary_stmt (ops, len - 3);
> > >
> > > new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> > > powi_result != NULL
> > > || negate_result,
> > > len != orig_len);
> > > - }
> > > -
> > > + }
> > > /* If we combined some repeated factors into a
> > > __builtin_powi call, multiply that result by the
> > > reassociated operands. */
> > > --
> > > 2.25.1
> > >
> ISTR there were no sufficient comments in the code explaining why
> rewrite_expr_tree_parallel_for_fma is better by design. In fact ...
>
> >
> > >
> > > > if (!reassoc_insert_powi_p
> > > > - && ops.length () > 3
> > > > + && len > 3
> > > > + && (!keep_fma_chain
> > > > + || (keep_fma_chain
> > > > + && len >
> > > > + param_reassoc_max_chain_length_with_fma))
> > >
> > > in the case len < param_reassoc_max_chain_length_with_fma we have
> > > the chain re-sorted but fall through to non-parallel rewrite. I
> > > wonder if we do not want to instead adjust the reassociation width?
> > > I'd say it depends on the number of mult cases in the chain (sth the re-
> sorting could have computed).
> > > Why do we have two completely independent --params here? Can you
> > > give an example --param value combination that makes "sense" and
> > > show how it is beneficial?
> >
> > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e
> > * f + j
> >
> > GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp
> cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> > ----------------------------------
> > _1 = a_6(D) * b_7(D);
> > _2 = c_8(D) * d_9(D);
> > _5 = _1 + _2;
> > _4 = e_10(D) * f_11(D);
> > _3 = _4 + j_12(D);
> > _13 = _3 + _5;
> > --------------------------------------------------------
> > _2 = c_8(D) * d_9(D);
> > _5 = .FMA (a_6(D), b_7(D), _2);
> > _3 = .FMA (e_10(D), f_11(D), j_12(D));
> > _13 = _3 + _5;
> > --------------------------------------------------------
> > New patch: If just rearrange ops and fall through to parallel rewrite to
> break the chain with width = 2.
> >
> > ---------------------------------------------------------
> > _1 = a_6(D) * b_7(D);
> > _2 = j + _1; -----> put j at the first.
> > _3 = c_8(D) * d_9(D);
> > _4 = e_10(D) * f_11(D);
> > _5 = _3 + _4; -----> break chain with width = 2. we lost a FMA here.
> > _13 = _2 + 5;
> >
> > -------------------------------------------------------
> > _3 = c_8(D) * d_9(D);
> > _2 = .FMA (a_6(D), b_7(D), j);
> > _5 = .FMA (e_10(D), f_11(D), _3);
> > _13 = _2 + _5;
> > --------------------------------------------------------
> > Sometimes break chain will lose FMA( break chain needs put two
> > mult-ops together, which will lose one FMA ), we can only get 2 FMAs
> > here, if we want to get 3 FMAs, we need to keep the chain and not
> > break it. So I added a param to control chain length
> > "param_reassoc_max_chain_length_with_fma = 4" (For the small case in
> > Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> > -------------------------------------------------------
> > _1 = a_6(D) * b_7(D);
> > _2 = c_8(D) * d_9(D);
> > _4 = e_10(D) * f_11(D);
> > _15 = _4 + j_12(D);
> > _16 = _15 + _2;
> > _13 = _16 + _1;
> > -------------------------------------------------------
> > _15 = .FMA (e_10(D), f_11(D), j_12(D));
> > _16 = .FMA (c_8(D), d_9(D), _15);
> > _13 = .FMA (a_6(D), b_7(D), _16);
> > -------------------------------------------------------
> > In some case we want to break the chain with width, we can set
> "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and
> break the chain with width.
>
> ... it sounds like the problem could be fully addressed by sorting the chain
> with reassoc-width in mind?
> Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of
> mul and a vector of non-mul ops so it can pick from the optimal candidate?
>
> That said, I think rewrite_expr_tree_parallel_for_fma at least needs more
> comments.
>
Sorry for not writing note clearly enough, I'll add more.
I have two places that need to be clarified.
1. For some case we need to keep chain to generate more FMAs, because break chain will lose FMA.
for example g + a * b + c * d + e * f,
Keep chain can get 3 FMAs, break chain can get 2 FMAs. It's hard to say which one is better, so we provide a param for users to customize.
2. when the chain has FMAs and need to break the chain with width,
for example l + a * b + c * d + e * f + g * h + j * k;(we already put non-mul first)
rewrite_expr_tree_parallel :
when width = 2, it will break the chain like this. actually it break the chain in to 3. It ignores the width and adds all ops two by two. it will lose FMA.
ssa1 = l + a * b;
ssa2 = c * d + e * f;
ssa3 = g * h + j * k;
ssa4 = ssa1 + ssa2;
ssa5 = ssa4 + ssa3;
rewrite_expr_tree_parallel_for_fma
when width = 2, we break the chain into two like this.
ssa1 = l + a * b;
ssa2 = c * d + e * f;
ssa3 = ssa1 + g * h;
ssa4 = ssa2 + j * k;
ssa5 = ssa3 +ssa4;
I think it's okay to remove or keep rewrite_expr_tree_parallel_for_fma. More FMAs are generated only for some special cases.
I'm not sure whether the new method is better than the old one. I created a small case the execution time of the two sequences is almost the same on SPR and ICX.
Thanks,
Lili.
On Fri, May 12, 2023 at 11:05 AM Cui, Lili <lili.cui@intel.com> wrote:
>
> > ISTR there were no sufficient comments in the code explaining why
> > rewrite_expr_tree_parallel_for_fma is better by design. In fact ...
> >
> > >
> > > >
> > > > > if (!reassoc_insert_powi_p
> > > > > - && ops.length () > 3
> > > > > + && len > 3
> > > > > + && (!keep_fma_chain
> > > > > + || (keep_fma_chain
> > > > > + && len >
> > > > > + param_reassoc_max_chain_length_with_fma))
> > > >
> > > > in the case len < param_reassoc_max_chain_length_with_fma we have
> > > > the chain re-sorted but fall through to non-parallel rewrite. I
> > > > wonder if we do not want to instead adjust the reassociation width?
> > > > I'd say it depends on the number of mult cases in the chain (sth the re-
> > sorting could have computed).
> > > > Why do we have two completely independent --params here? Can you
> > > > give an example --param value combination that makes "sense" and
> > > > show how it is beneficial?
> > >
> > > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e
> > > * f + j
> > >
> > > GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp
> > cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> > > ----------------------------------
> > > _1 = a_6(D) * b_7(D);
> > > _2 = c_8(D) * d_9(D);
> > > _5 = _1 + _2;
> > > _4 = e_10(D) * f_11(D);
> > > _3 = _4 + j_12(D);
> > > _13 = _3 + _5;
> > > --------------------------------------------------------
> > > _2 = c_8(D) * d_9(D);
> > > _5 = .FMA (a_6(D), b_7(D), _2);
> > > _3 = .FMA (e_10(D), f_11(D), j_12(D));
> > > _13 = _3 + _5;
> > > --------------------------------------------------------
> > > New patch: If just rearrange ops and fall through to parallel rewrite to
> > break the chain with width = 2.
> > >
> > > ---------------------------------------------------------
> > > _1 = a_6(D) * b_7(D);
> > > _2 = j + _1; -----> put j at the first.
> > > _3 = c_8(D) * d_9(D);
> > > _4 = e_10(D) * f_11(D);
> > > _5 = _3 + _4; -----> break chain with width = 2. we lost a FMA here.
> > > _13 = _2 + 5;
> > >
> > > -------------------------------------------------------
> > > _3 = c_8(D) * d_9(D);
> > > _2 = .FMA (a_6(D), b_7(D), j);
> > > _5 = .FMA (e_10(D), f_11(D), _3);
> > > _13 = _2 + _5;
> > > --------------------------------------------------------
> > > Sometimes break chain will lose FMA( break chain needs put two
> > > mult-ops together, which will lose one FMA ), we can only get 2 FMAs
> > > here, if we want to get 3 FMAs, we need to keep the chain and not
> > > break it. So I added a param to control chain length
> > > "param_reassoc_max_chain_length_with_fma = 4" (For the small case in
> > > Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> > > -------------------------------------------------------
> > > _1 = a_6(D) * b_7(D);
> > > _2 = c_8(D) * d_9(D);
> > > _4 = e_10(D) * f_11(D);
> > > _15 = _4 + j_12(D);
> > > _16 = _15 + _2;
> > > _13 = _16 + _1;
> > > -------------------------------------------------------
> > > _15 = .FMA (e_10(D), f_11(D), j_12(D));
> > > _16 = .FMA (c_8(D), d_9(D), _15);
> > > _13 = .FMA (a_6(D), b_7(D), _16);
> > > -------------------------------------------------------
> > > In some case we want to break the chain with width, we can set
> > "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and
> > break the chain with width.
> >
> > ... it sounds like the problem could be fully addressed by sorting the chain
> > with reassoc-width in mind?
> > Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of
> > mul and a vector of non-mul ops so it can pick from the optimal candidate?
> >
> > That said, I think rewrite_expr_tree_parallel_for_fma at least needs more
> > comments.
> >
> Sorry for not writing note clearly enough, I'll add more.
> I have two places that need to be clarified.
>
> 1. For some case we need to keep chain to generate more FMAs, because break chain will lose FMA.
> for example g + a * b + c * d + e * f,
> Keep chain can get 3 FMAs, break chain can get 2 FMAs. It's hard to say which one is better, so we provide a param for users to customize.
>
> 2. when the chain has FMAs and need to break the chain with width,
> for example l + a * b + c * d + e * f + g * h + j * k;(we already put non-mul first)
> rewrite_expr_tree_parallel :
> when width = 2, it will break the chain like this. actually it break the chain in to 3. It ignores the width and adds all ops two by two. it will lose FMA.
>
> ssa1 = l + a * b;
> ssa2 = c * d + e * f;
> ssa3 = g * h + j * k;
> ssa4 = ssa1 + ssa2;
> ssa5 = ssa4 + ssa3;
>
> rewrite_expr_tree_parallel_for_fma
> when width = 2, we break the chain into two like this.
>
> ssa1 = l + a * b;
> ssa2 = c * d + e * f;
> ssa3 = ssa1 + g * h;
> ssa4 = ssa2 + j * k;
> ssa5 = ssa3 +ssa4;
>
> I think it's okay to remove or keep rewrite_expr_tree_parallel_for_fma. More FMAs are generated only for some special cases.
> I'm not sure whether the new method is better than the old one. I created a small case the execution time of the two sequences is almost the same on SPR and ICX.
I think to make a difference you need to hit the number of parallel
fadd/fmul the pipeline can perform. I don't
think issue width is ever a problem for chains w/o fma and throughput
of fma vs fadd + fmul should be similar.
That said, I think iff then we should try to improve
rewrite_expr_tree_parallel rather than adding a new
function. For example for the case with equal rank operands we can
try to sort adds first. I can't convince
myself that rewrite_expr_tree_parallel honors ranks properly quickly.
Richard.
> Thanks,
> Lili.
>
> I think to make a difference you need to hit the number of parallel fadd/fmul
> the pipeline can perform. I don't think issue width is ever a problem for
> chains w/o fma and throughput of fma vs fadd + fmul should be similar.
>
Yes, for x86 backend, fadd , fmul and fma have the same TP meaning they should have the same width.
The current implementation is reasonable /* reassoc int, fp, vec_int, vec_fp. */.
> That said, I think iff then we should try to improve
> rewrite_expr_tree_parallel rather than adding a new function. For example
> for the case with equal rank operands we can try to sort adds first. I can't
> convince myself that rewrite_expr_tree_parallel honors ranks properly
> quickly.
>
I rewrite this patch, there are mainly two changes:
1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The following example shows that the sequence generated by the this patch is better.
2. Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more fma and reducing the loss of FMA when breaking the chain.
With these two changes, GCC can break the chain with width = 2 and generates 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params.
------------------------------------------------------------------------------------------------------------------
Source code: g + h + j + s + m + n+a+b +e (https://godbolt.org/z/G8sb86n84)
Compile options: -Ofast -mfpmath=sse -mfma
Width = 3 was chosen for reassociation
-----------------------------------------------------------------------------------------------------------------
Old rewrite_expr_tree_parallel generates:
_6 = g_8(D) + h_9(D); ------> parallel 0
_3 = s_11(D) + m_12(D); ------> parallel 1
_5 = _3 + j_10(D);
_2 = n_13(D) + a_14(D); ------> parallel 2
_1 = b_15(D) + e_16(D); -----> Parallel 3, This is not necessary, and it is not friendly to FMA.
_4 = _1 + _2;
_7 = _4 + _5;
_17 = _6 + _7;
return _17;
When the width = 3, we need 5 cycles here.
---------------------------------------------first end---------------------------------------------------------
Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates:
_3 = s_11(D) + m_12(D); ------> parallel 0
_5 = _3 + j_10(D);
_2 = n_13(D) + a_14(D); ------> parallel 1
_1 = b_15(D) + e_16(D); ------> parallel 2
_4 = _1 + _2;
_6 = _4 + _5;
_7 = _6 + h_9(D);
_17 = _7 + g_8(D);
return _17;
When the width = 3, we need 5 cycles here.
---------------------------------------------second end-------------------------------------------------------
Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates:
_3 = s_11(D) + m_12(D);
_6 = _3 + g_8(D);
_2 = n_13(D) + a_14(D);
_5 = _2 + h_9(D);
_1 = b_15(D) + e_16(D);
_4 = _1 + j_10(D);
_7 = _4 + _5;
_17 = _7 + _6;
return _17;
When the width = 3, we need 4 cycles here.
--------------------------------------------third end-----------------------------------------------------------
Thanks,
Lili.
On Wed, May 17, 2023 at 3:05 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> > I think to make a difference you need to hit the number of parallel fadd/fmul
> > the pipeline can perform. I don't think issue width is ever a problem for
> > chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> >
>
> Yes, for x86 backend, fadd , fmul and fma have the same TP meaning they should have the same width.
> The current implementation is reasonable /* reassoc int, fp, vec_int, vec_fp. */.
>
> > That said, I think iff then we should try to improve
> > rewrite_expr_tree_parallel rather than adding a new function. For example
> > for the case with equal rank operands we can try to sort adds first. I can't
> > convince myself that rewrite_expr_tree_parallel honors ranks properly
> > quickly.
> >
>
> I rewrite this patch, there are mainly two changes:
> 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The following example shows that the sequence generated by the this patch is better.
> 2. Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more fma and reducing the loss of FMA when breaking the chain.
>
> With these two changes, GCC can break the chain with width = 2 and generates 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params.
>
> ------------------------------------------------------------------------------------------------------------------
> Source code: g + h + j + s + m + n+a+b +e (https://godbolt.org/z/G8sb86n84)
> Compile options: -Ofast -mfpmath=sse -mfma
> Width = 3 was chosen for reassociation
> -----------------------------------------------------------------------------------------------------------------
> Old rewrite_expr_tree_parallel generates:
> _6 = g_8(D) + h_9(D); ------> parallel 0
> _3 = s_11(D) + m_12(D); ------> parallel 1
> _5 = _3 + j_10(D);
> _2 = n_13(D) + a_14(D); ------> parallel 2
> _1 = b_15(D) + e_16(D); -----> Parallel 3, This is not necessary, and it is not friendly to FMA.
> _4 = _1 + _2;
> _7 = _4 + _5;
> _17 = _6 + _7;
> return _17;
>
> When the width = 3, we need 5 cycles here.
> ---------------------------------------------first end---------------------------------------------------------
> Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates:
>
> _3 = s_11(D) + m_12(D); ------> parallel 0
> _5 = _3 + j_10(D);
> _2 = n_13(D) + a_14(D); ------> parallel 1
> _1 = b_15(D) + e_16(D); ------> parallel 2
> _4 = _1 + _2;
> _6 = _4 + _5;
> _7 = _6 + h_9(D);
> _17 = _7 + g_8(D);
> return _17;
>
> When the width = 3, we need 5 cycles here.
> ---------------------------------------------second end-------------------------------------------------------
> Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates:
>
> _3 = s_11(D) + m_12(D);
> _6 = _3 + g_8(D);
> _2 = n_13(D) + a_14(D);
> _5 = _2 + h_9(D);
> _1 = b_15(D) + e_16(D);
> _4 = _1 + j_10(D);
> _7 = _4 + _5;
> _17 = _7 + _6;
> return _17;
>
> When the width = 3, we need 4 cycles here.
> --------------------------------------------third end-----------------------------------------------------------
Yes, so what I was saying is that I doubt rewrite_expr_tree_parallel
is optimal - you show
that for the specific example rewrite_expr_tree_parallel_for_fma is
better. I was arguing
we want a single function, whether we single out leaves with
multiplications or not.
And we want documentation that shows the strategy will result in optimal latency
(I think we should not sacrifice latency just for the sake of forming
more FMAs).
Richard.
>
> Thanks,
> Lili.
>
@@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
Enable loop vectorization of floating point inductions.
+-param=reassoc-max-chain-length-with-fma=
+Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
+The maximum chain length with fma considered in reassociation pass.
+
; This comment is to ensure we retain the blank line above.
new file mode 100644
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add
+ to generate more FMA instructions. */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+ for (int i = 0; i < N; i++)
+ {
+ a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
+ }
+}
+/* { dg-final { scan-assembler-times "vfm" 6 } } */
new file mode 100644
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
+
+/* Test that the compiler properly build parallel chains according to given
+ association width and try to keep FMA opportunity as much as possible. */
+#define N 33
+double a[N];
+
+void
+foo (void)
+{
+ a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
+ + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
+ + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
+ + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
+}
+/* { dg-final { scan-assembler-times "vfm" 12 } } */
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see
#include "tree-ssa-reassoc.h"
#include "tree-ssa-math-opts.h"
#include "gimple-range.h"
+#include "internal-fn.h"
/* This is a simple global reassociation pass. It is, in part, based
on the LLVM pass of the same name (They do some things more/less
@@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
return width;
}
+/* Rewrite statements with dependency chain with regard to the chance to
+ generate FMA. When the dependency chain length exceeds
+ param_max_reassoc_chain_length_with_fma, build parallel chains according to
+ given association width and try to keep fma opportunity as much as possible.
+ E.g.
+ e + f + g + a * b + c * d;
+
+ ssa1 = e + f;
+ ssa2 = g + a * b;
+ ssa3 = ssa1 + c * d;
+ ssa4 = ssa2 + ssa3;
+
+ This reassociation approach preserves the chance of fma generation as much
+ as possible. */
+static void
+rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
+ const vec<operand_entry *> &ops)
+{
+ enum tree_code opcode = gimple_assign_rhs_code (stmt);
+ int op_num = ops.length ();
+ gcc_assert (op_num > 0);
+ int stmt_num = op_num - 1;
+ gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
+ int op_index = op_num - 1;
+ int width_count = width;
+ int i = 0, j = 0;
+ tree tmp_op[2], op1;
+ operand_entry *oe;
+ gimple *stmt1 = NULL;
+ tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+ /* We start expression rewriting from the top statements.
+ So, in this loop we create a full list of statements
+ we will work with. */
+ stmts[stmt_num - 1] = stmt;
+ for (i = stmt_num - 2; i >= 0; i--)
+ stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+ /* Build parallel FMA dependency chain according to width. */
+ for (i = 0; i < width; i++)
+ {
+ for (j = 0; j < 2; j++)
+ {
+ oe = ops[op_index--];
+ tmp_op[j] = oe->op;
+ stmt1 = oe->stmt_to_insert;
+ if (stmt1)
+ insert_stmt_before_use (stmts[i], stmt1);
+ stmt1 = NULL;
+ }
+ stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
+ gimple_set_visited (stmts[i], true);
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, " into ");
+ print_gimple_stmt (dump_file, stmts[i], 0);
+ }
+ }
+
+ for (i = width; i < stmt_num; i++)
+ {
+ /* We keep original statement only for the last one. All others are
+ recreated. */
+ if ( op_index < 0)
+ {
+ if (width_count == 2)
+ {
+
+ gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
+ gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
+ }
+ else
+ {
+
+ stmts[i] =
+ build_and_add_sum (TREE_TYPE (last_rhs1),
+ gimple_assign_lhs (stmts[i-width_count]),
+ gimple_assign_lhs (stmts[i-width_count+1]),
+ opcode);
+ width_count--;
+ }
+ update_stmt (stmts[i]);
+ }
+ else
+ {
+ oe = ops[op_index--];
+ op1 = oe->op;
+ stmt1 = oe->stmt_to_insert;
+ if (stmt1)
+ insert_stmt_before_use (stmts[i], stmt1);
+ stmt1 = NULL;
+ stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
+ gimple_assign_lhs (stmts[i-width]),
+ op1,
+ opcode);
+ gimple_set_visited (stmts[i], true);
+ }
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, " into ");
+ print_gimple_stmt (dump_file, stmts[i], 0);
+ }
+ }
+ remove_visited_stmt_chain (last_rhs1);
+}
+
/* Recursively rewrite our linearized statements so that the operators
match those in OPS[OPINDEX], putting the computation in rank
order and trying to allow operations to be executed in
@@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
}
}
+/* Rearrange ops to generate more FMA when the chain may has more than 2 fmas.
+ Putting ops that not def from mult in front can generate more fma.
+ E.g.
+ a * b + c * d + e generates:
+
+ _4 = c_9(D) * d_10(D);
+ _12 = .FMA (a_7(D), b_8(D), _4);
+ _11 = e_6(D) + _12;
+
+ Rtearrange ops to -> e + a * b + c * d generates:
+
+ _4 = .FMA (c_7(D), d_8(D), _3);
+ _11 = .FMA (a_5(D), b_6(D), _4);
+ */
+static bool
+rank_ops_for_fma (vec<operand_entry *> *ops)
+{
+ operand_entry *oe;
+ unsigned int i;
+ auto_vec<operand_entry *> ops_mult;
+ auto_vec<operand_entry *> ops_others;
+
+ FOR_EACH_VEC_ELT (*ops, i, oe)
+ {
+ if (TREE_CODE (oe->op) == SSA_NAME)
+ {
+ gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+ if (is_gimple_assign (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+ ops_mult.safe_push (oe);
+ else
+ ops_others.safe_push (oe);
+ }
+ else
+ ops_others.safe_push (oe);
+ }
+ /* When ops_mult.length == 2, like the following case,
+
+ a * b + c * d + e.
+
+ we need to rearrange the ops.
+
+ Putting ops that not def from mult in front can generate more fmas. */
+ if (ops_mult.length () >= 2)
+ {
+ /* If all ops are defined with mult, we don't need to rearrange them. */
+ if (ops_mult.length () != ops->length ())
+ {
+ ops->truncate (0);
+ FOR_EACH_VEC_ELT (ops_mult, i, oe)
+ ops->safe_push (oe);
+ FOR_EACH_VEC_ELT (ops_others, i, oe)
+ ops->safe_push (oe);
+ }
+ return true;
+ }
+ return false;
+}
/* Reassociate expressions in basic block BB and its post-dominator as
children.
@@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb)
machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
int ops_num = ops.length ();
int width;
+ bool keep_fma_chain = false;
/* For binary bit operations, if there are at least 3
operands and the last operand in OPS is a constant,
@@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb)
&& TREE_CODE (ops.last ()->op) == INTEGER_CST)
std::swap (*ops[0], *ops[ops_num - 1]);
+ optimization_type opt_type = bb_optimization_type (bb);
+
+ /* When enabling param_reassoc_max_chain_length_with_fma to
+ keep the chain with fma, rank_ops_for_fma will detect if
+ the chain has fmas and if so it will rearrange the ops. */
+ if (param_reassoc_max_chain_length_with_fma > 1
+ && direct_internal_fn_supported_p (IFN_FMA,
+ TREE_TYPE (lhs),
+ opt_type)
+ && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
+ {
+ keep_fma_chain = rank_ops_for_fma(&ops);
+ }
+
+ int len = ops.length ();
/* Only rewrite the expression tree to parallel in the
last reassoc pass to avoid useless work back-and-forth
with initial linearization. */
if (!reassoc_insert_powi_p
- && ops.length () > 3
+ && len > 3
+ && (!keep_fma_chain
+ || (keep_fma_chain
+ && len > param_reassoc_max_chain_length_with_fma))
&& (width = get_reassociation_width (ops_num, rhs_code,
- mode)) > 1)
+ mode)) > 1)
{
- if (dump_file && (dump_flags & TDF_DETAILS))
- fprintf (dump_file,
- "Width = %d was chosen for reassociation\n",
- width);
- rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
- width, ops);
+ if (keep_fma_chain)
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Break chain len = %d into width for FMA\n",
+ len);
+ rewrite_expr_tree_parallel_for_fma
+ (as_a <gassign *> (stmt), width, ops);
+ }
+ else
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Width = %d was chosen for reassociation\n",
+ width);
+ rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
+ width, ops);
+ }
}
else
- {
- /* When there are three operands left, we want
- to make sure the ones that get the double
- binary op are chosen wisely. */
- int len = ops.length ();
- if (len >= 3)
+ {
+ /* When there are three operands left, we want
+ to make sure the ones that get the double
+ binary op are chosen wisely. */
+ if (len >= 3 && !keep_fma_chain)
swap_ops_for_binary_stmt (ops, len - 3);
new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
powi_result != NULL
|| negate_result,
len != orig_len);
- }
-
+ }
/* If we combined some repeated factors into a
__builtin_powi call, multiply that result by the
reassociated operands. */