From patchwork Thu Jun 29 01:49:49 2023
X-Patchwork-Id: 114121
From: "Cui, Lili" <lili.cui@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: richard.guenther@gmail.com, Lili Cui <lili.cui@intel.com>
Subject: [PATCH] PR gcc/110148: Avoid adding loop-carried ops to long chains
Date: Thu, 29 Jun 2023 01:49:49 +0000
Message-Id: <20230629014949.1995621-1-lili.cui@intel.com>
From: Lili Cui <lili.cui@intel.com>

Hi Maintainer,

This patch fixes the TSVC s242 regression related to loop-carried ops.

Bootstrapped and regtested. Ok for trunk?

Regards,
Lili.

Avoid adding loop-carried ops to long chains, otherwise the whole chain
will have dependencies across the loop iterations. Just keep the
loop-carried ops in a separate chain. (A small standalone sketch of this
filtering heuristic appears after the patch.)

E.g.
  x_1 = phi(x_0, x_2)
  y_1 = phi(y_0, y_2)

  a + b + c + d + e + x1 + y1

  SSA1 = a + b;
  SSA2 = c + d;
  SSA3 = SSA1 + e;
  SSA4 = SSA3 + SSA2;
  SSA5 = x1 + y1;
  SSA6 = SSA4 + SSA5;

With the patch applied, these test cases improved by 32%~100%
(standalone versions of the kernels also appear after the patch).

S242:
  for (int i = 1; i < LEN_1D; ++i)
    { a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i]; }

Case 1:
  for (int i = 1; i < LEN_1D; ++i)
    { a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i]; }

Case 2:
  for (int i = 1; i < LEN_1D; ++i)
    { a[i] = a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i]; }

The values are execution times (lower is better):
A: original version
B: with the FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 (based on A)
C: with the current patch (based on B)

           A       B       C       B/A          C/A
  s242     2.859   5.152   2.859   1.802028681  1
  case 1   5.489   5.488   3.511   0.999818     0.64
  case 2   7.216   7.499   4.885   1.039218     0.68

gcc/ChangeLog:

	PR tree-optimization/110148
	* tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle
	loop-carried ops in this function.
---
 gcc/tree-ssa-reassoc.cc | 236 ++++++++++++++++++++++++++++------------
 1 file changed, 167 insertions(+), 69 deletions(-)

diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 96c88ec003e..f5da385e0b2 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,37 +5471,62 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
+#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.  */
+#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
+
 /* Rewrite statements with dependency chain with regard the chance to
    generate FMA.
    For the chain with FMA: Try to keep fma opportunity as much as possible.
    For the chain without FMA: Putting the computation in rank order and trying
    to allow operations to be executed in parallel.
    E.g.
-   e + f + g + a * b + c * d;
+   e + f + a * b + c * d;
 
-   ssa1 = e + f;
-   ssa2 = g + a * b;
-   ssa3 = ssa1 + c * d;
-   ssa4 = ssa2 + ssa3;
+   ssa1 = e + a * b;
+   ssa2 = f + c * d;
+   ssa3 = ssa1 + ssa2;
 
    This reassociation approach preserves the chance of fma generation as much
-   as possible.  */
+   as possible.
+
+   Another thing is to avoid adding loop-carried ops to long chains, otherwise
+   the whole chain will have dependencies across the loop iteration.  Just keep
+   loop-carried ops in a separate chain.
+   E.g.
+   x_1 = phi(x_0, x_2)
+   y_1 = phi(y_0, y_2)
+
+   a + b + c + d + e + x1 + y1
+
+   SSA1 = a + b;
+   SSA2 = c + d;
+   SSA3 = SSA1 + e;
+   SSA4 = SSA3 + SSA2;
+   SSA5 = x1 + y1;
+   SSA6 = SSA4 + SSA5;
+ */
 static void
 rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
-			    const vec<operand_entry *> &ops)
+			    const vec<operand_entry *> &ops)
 {
   enum tree_code opcode = gimple_assign_rhs_code (stmt);
   int op_num = ops.length ();
+  int op_normal_num = op_num;
   gcc_assert (op_num > 0);
   int stmt_num = op_num - 1;
   gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
-  int op_index = op_num - 1;
-  int width_count = width;
   int i = 0, j = 0;
   tree tmp_op[2], op1;
   operand_entry *oe;
   gimple *stmt1 = NULL;
   tree last_rhs1 = gimple_assign_rhs1 (stmt);
+  int last_rhs1_stmt_index = 0, last_rhs2_stmt_index = 0;
+  int width_active = 0, width_count = 0;
+  bool has_biased = false, ops_changed = false;
+  auto_vec<operand_entry *> ops_normal;
+  auto_vec<operand_entry *> ops_biased;
+  vec<operand_entry *> *ops1;
 
   /* We start expression rewriting from the top statements.
      So, in this loop we create a full list of statements
@@ -5510,83 +5535,155 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
   for (i = stmt_num - 2; i >= 0; i--)
     stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
 
-  /* Width should not be larger than op_num / 2, since we can not create
+  /* Avoid adding loop-carried ops to long chains, first filter out the
+     loop-carried.  But we need to make sure that the length of the remainder
+     is not less than 4, which is the smallest ops length we can break the
+     dependency.  */
+  FOR_EACH_VEC_ELT (ops, i, oe)
+    {
+      if (TREE_CODE (oe->op) == SSA_NAME
+	  && bitmap_bit_p (biased_names, SSA_NAME_VERSION (oe->op))
+	  && op_normal_num > 4)
+	{
+	  ops_biased.safe_push (oe);
+	  has_biased = true;
+	  op_normal_num --;
+	}
+      else
+	ops_normal.safe_push (oe);
+    }
+
+  /* Width should not be larger than ops length / 2, since we can not create
      more parallel dependency chains that exceeds such value.  */
-  width = width <= op_num / 2 ? width : op_num / 2;
+  int width_normal = op_normal_num / 2;
+  int width_biased = (op_num - op_normal_num) / 2;
+  width_normal = width <= width_normal ? width : width_normal;
+  width_biased = width <= width_biased ? width : width_biased;
+
+  ops1 = &ops_normal;
+  width_count = width_active = width_normal;
 
   /* Build parallel dependency chain according to width.  */
-  for (i = 0; i < width; i++)
+  for (i = 0; i < stmt_num; i++)
     {
-      /* If the chain has FAM, we do not swap two operands.  */
-      if (op_index > 1 && !has_fma)
-	swap_ops_for_binary_stmt (ops, op_index - 2);
-
-      for (j = 0; j < 2; j++)
+      if (dump_file && (dump_flags & TDF_DETAILS))
 	{
-	  gcc_assert (op_index >= 0);
-	  oe = ops[op_index--];
-	  tmp_op[j] = oe->op;
-	  /* If the stmt that defines operand has to be inserted, insert it
-	     before the use.  */
-	  stmt1 = oe->stmt_to_insert;
-	  if (stmt1)
-	    insert_stmt_before_use (stmts[i], stmt1);
-	  stmt1 = NULL;
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
 	}
-      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
-				    tmp_op[1],
-				    tmp_op[0],
-				    opcode);
-      gimple_set_visited (stmts[i], true);
-      if (dump_file && (dump_flags & TDF_DETAILS))
+
+      /* When the work of normal ops is over, but the loop is not over,
+	 continue to do biased ops.  */
+      if (width_count == 0 && ops1 == &ops_normal)
 	{
-	  fprintf (dump_file, " into ");
-	  print_gimple_stmt (dump_file, stmts[i], 0);
+	  ops1 = &ops_biased;
+	  width_count = width_active = width_biased;
+	  ops_changed = true;
 	}
-    }
 
-  for (i = width; i < stmt_num; i++)
-    {
-      /* We keep original statement only for the last one.  All others are
-	 recreated.  */
-      if ( op_index < 0)
+      /* Swap the operands if no FMA in the chain.  */
+      if (ops1->length () > 2 && !has_fma)
+	swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
+
+      if (i < width_active
+	  || (ops_changed && i <= (last_rhs1_stmt_index + width_active)))
 	{
-	  if (width_count == 2)
+	  for (j = 0; j < 2; j++)
 	    {
-
-	      /* We keep original statement only for the last one.  All
-		 others are recreated.  */
-	      gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
-	      gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
-	      update_stmt (stmts[i]);
+	      oe = ops1->pop ();
+	      tmp_op[j] = oe->op;
+	      /* If the stmt that defines operand has to be inserted, insert it
+		 before the use.  */
+	      stmt1 = oe->stmt_to_insert;
+	      if (stmt1)
+		insert_stmt_before_use (stmts[i], stmt1);
+	      stmt1 = NULL;
 	    }
-	  else
-	    {
+	  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
+					tmp_op[1],
+					tmp_op[0],
+					opcode);
+	  gimple_set_visited (stmts[i], true);
 
-	      stmts[i] =
-		build_and_add_sum (TREE_TYPE (last_rhs1),
-				   gimple_assign_lhs (stmts[i-width_count]),
-				   gimple_assign_lhs (stmts[i-width_count+1]),
-				   opcode);
-	      gimple_set_visited (stmts[i], true);
-	      width_count--;
-	    }
 	}
       else
 	{
-	  /* Attach the rest of the ops to the parallel dependency chain.  */
-	  oe = ops[op_index--];
-	  op1 = oe->op;
-	  stmt1 = oe->stmt_to_insert;
-	  if (stmt1)
-	    insert_stmt_before_use (stmts[i], stmt1);
-	  stmt1 = NULL;
-	  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
-					gimple_assign_lhs (stmts[i-width]),
-					op1,
-					opcode);
-	  gimple_set_visited (stmts[i], true);
+	  /* We keep original statement only for the last one.  All others are
+	     recreated.  */
+	  if (!ops1->length ())
+	    {
+	      /* For biased length equal to 2.  */
+	      if (width_count == BIASED_END_STMT && !last_rhs2_stmt_index)
+		last_rhs2_stmt_index = i - 1;
+
+	      /* When width_count == 2 and there is no biased, just finish.  */
+	      if (width_count == NORMAL_END_STMT && !has_biased)
+		{
+		  last_rhs1_stmt_index = i - 1;
+		  last_rhs2_stmt_index = i - 2;
+		}
+	      if (last_rhs1_stmt_index && (last_rhs2_stmt_index || !has_biased))
+		{
+		  /* We keep original statement only for the last one.  All
+		     others are recreated.  */
+		  gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[last_rhs1_stmt_index]));
+		  gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[last_rhs2_stmt_index]));
+		  update_stmt (stmts[i]);
+		}
+	      else
+		{
+		  stmts[i] =
+		    build_and_add_sum (TREE_TYPE (last_rhs1),
+				       gimple_assign_lhs (stmts[i-width_count]),
+				       gimple_assign_lhs (stmts[i-width_count+1]),
+				       opcode);
+		  gimple_set_visited (stmts[i], true);
+		  width_count--;
+
+		  /* It is the end of normal or biased ops.
+		     last_rhs1_stmt_index used to record the last stmt index
+		     for normal ops.  last_rhs2_stmt_index used to record the
+		     last stmt index for biased ops.  */
+		  if (width_count == BIASED_END_STMT)
+		    {
+		      gcc_assert (has_biased);
+		      if (ops_biased.length ())
+			last_rhs1_stmt_index = i;
+		      else
+			last_rhs2_stmt_index = i;
+		      width_count--;
+		    }
+		}
+	    }
+	  else
+	    {
+	      /* Attach the rest of the ops to the parallel dependency chain.  */
+	      oe = ops1->pop ();
+	      op1 = oe->op;
+	      stmt1 = oe->stmt_to_insert;
+	      if (stmt1)
+		insert_stmt_before_use (stmts[i], stmt1);
+	      stmt1 = NULL;
+
+	      /* For only one biased ops.  */
+	      if (width_count == SPECIAL_BIASED_END_STMT)
+		{
+		  /* We keep original statement only for the last one.  All
+		     others are recreated.  */
+		  gcc_assert (has_biased);
+		  gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[last_rhs1_stmt_index]));
+		  gimple_assign_set_rhs2 (stmts[i], op1);
+		  update_stmt (stmts[i]);
+		}
+	      else
+		{
+		  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
+						gimple_assign_lhs (stmts[i-width_active]),
+						op1,
+						opcode);
+		  gimple_set_visited (stmts[i], true);
+		}
+	    }
 	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -5595,6 +5692,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
 	  print_gimple_stmt (dump_file, stmts[i], 0);
 	}
     }
+
   remove_visited_stmt_chain (last_rhs1);
 }
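
For readers who want to reproduce the numbers, below is a minimal standalone
sketch of the kernels above. It is not the TSVC source: LEN_1D, the float
element type, the array names, and the initialization are assumptions
patterned after the commit message. Build with something like
-O2 -ffast-math and compare the reassoc dumps (-fdump-tree-reassoc-details)
with and without the patch.

/* Standalone sketch of the s242-style kernels; names and sizes are
   assumptions patterned after TSVC, not taken from the testsuite.  */
#include <stdio.h>

#define LEN_1D 32000

static float a[LEN_1D], b[LEN_1D], c[LEN_1D], d[LEN_1D], e[LEN_1D];

/* s242: one loop-carried op (a[i - 1]) feeding a long add chain.  */
void s242 (float s1, float s2)
{
  for (int i = 1; i < LEN_1D; ++i)
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
}

/* Case 1: same loop-carried op, one more independent operand.  */
void case1 (float s1, float s2)
{
  for (int i = 1; i < LEN_1D; ++i)
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];
}

/* Case 2: two loop-carried ops (a[i - 1] and b[i - 1]).  */
void case2 (float s1, float s2)
{
  for (int i = 1; i < LEN_1D; ++i)
    a[i] = a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];
}

int main (void)
{
  /* Arbitrary deterministic initialization so the result is checkable.  */
  for (int i = 0; i < LEN_1D; ++i)
    {
      a[i] = 1.0f / (i + 1);
      b[i] = c[i] = d[i] = e[i] = a[i];
    }
  s242 (0.5f, 0.25f);
  case1 (0.5f, 0.25f);
  case2 (0.5f, 0.25f);
  printf ("%f\n", (double) a[LEN_1D - 1]);
  return 0;
}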
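
And a tiny self-contained sketch of the new filtering step, for intuition
only: plain C arrays stand in for GCC's vec/auto_vec, and a boolean field
stands in for the biased_names bitmap lookup. It mirrors the heuristic in
the patch (peel loop-carried ops into a separate chain while more than 4
normal ops remain, then cap each chain's width at half its length); it is
not GCC code.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a reassociation operand: a name plus whether its SSA name
   would be recorded as loop-carried ("biased") in biased_names.  */
struct op { const char *name; bool biased; };

int main (void)
{
  /* The example from the commit message: a + b + c + d + e + x1 + y1,
     where x1 and y1 are loop-carried PHI results.  */
  struct op ops[] = { {"a", false}, {"b", false}, {"c", false},
		      {"d", false}, {"e", false}, {"x1", true}, {"y1", true} };
  int op_num = (int) (sizeof ops / sizeof ops[0]);
  int op_normal_num = op_num;

  const char *ops_normal[7], *ops_biased[7];
  int n_normal = 0, n_biased = 0;

  /* Mirror of the patch's FOR_EACH_VEC_ELT filter: move loop-carried ops
     to a separate chain, but keep at least 4 normal ops -- the smallest
     chain whose loop-carried dependency can still be broken.  */
  for (int i = 0; i < op_num; i++)
    if (ops[i].biased && op_normal_num > 4)
      {
	ops_biased[n_biased++] = ops[i].name;
	op_normal_num--;
      }
    else
      ops_normal[n_normal++] = ops[i].name;

  /* As in the patch, each chain's width is capped at half its length.  */
  int width = 2;
  int width_normal = op_normal_num / 2;
  int width_biased = (op_num - op_normal_num) / 2;
  width_normal = width <= width_normal ? width : width_normal;
  width_biased = width <= width_biased ? width : width_biased;

  printf ("normal chain (width %d):", width_normal);
  for (int i = 0; i < n_normal; i++)
    printf (" %s", ops_normal[i]);
  printf ("\nbiased chain (width %d):", width_biased);
  for (int i = 0; i < n_biased; i++)
    printf (" %s", ops_biased[i]);
  printf ("\n");
  return 0;
}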