From patchwork Thu Aug 3 20:47:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 130849 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:9f41:0:b0:3e4:2afc:c1 with SMTP id v1csp1397332vqx; Thu, 3 Aug 2023 13:48:11 -0700 (PDT) X-Google-Smtp-Source: APBJJlEgA5xyOyJtPwlOipLc29gMkxZMPAJBNOt6XkcApY9nl6gi8dMHfv4nIQtFcKQAE6xFMENQ X-Received: by 2002:a05:6512:3d22:b0:4fb:7642:88dd with SMTP id d34-20020a0565123d2200b004fb764288ddmr9891436lfv.67.1691095691173; Thu, 03 Aug 2023 13:48:11 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1691095691; cv=none; d=google.com; s=arc-20160816; b=oqiF82Vm8moTp8MyQIJmho5PvToDa/ahkeYBEjh8iStVTfFtqZ4wXFsfCgIrUNNT3R J0s1RhXkpObmcwp8Fn5fDCcsfhqCq/cWd4yLBYN5Q8e06NrG0hX6oPHkGh6hUW06VTQP vow59822mHPnLESdq1UelxEhUIJQKEfjx06Tb1EBVkF9RQkfhUohnA/Ae8RiYeAYlPzD 61hITrws8iJN3Wbp1UYKxlPZ65Yy7zqnYBEDYBjulg+xpbMVctrbLxfh3BYdW4HF6LFT hOhD5zQPJ6Z+AajdkxMnAQRKVZm45B6HxOV3T5qbpbtG7Fxw6xRJEw0Q8u/WvLfopbKU bZMw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:reply-to:from:list-subscribe:list-help:list-post :list-archive:list-unsubscribe:list-id:precedence :content-disposition:mime-version:message-id:subject:to:date :dmarc-filter:delivered-to:dkim-signature:dkim-filter; bh=+b5S0afg5XyhJ4xPyXHjC5MLQbuDOJ+SPlq19nKRJ48=; fh=KxPrluukuxmHUzYWntxWAHNXePa4vWvGreVeug6Wy5k=; b=zb98jVmyFc8EaYoU0hC6PBd5338zMEpZYk3h1k/weEAJkpV6P1IkYFIfCMaGzPF9eQ CayP6Z4Rlx1saZzDjXaRJ4ClNV4wLkfOmipjgWMh2f368018AbGAk5UyuEH9c+/L7PLw TRtigQeKrgxxqBsIXTMlJ1QfsEvNrmVthrlqzlTDvglEBn4EfLbaTjxuVk16FubBRIWI ZI1TCEAIF88DjQdF6UIs/++3Z6ZbWMeBl8YE0SxwXVgl62z/0uyc2eVPeLdeB9+yp4rX LvBg6p4J5oGxlbJGVMN8sa85r2nopO8wm7lSawyzN9v1LVkaPwt1PCHpZVG0+ECTMB/N kxRw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gcc.gnu.org header.s=default header.b=jw50a26G; spf=pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org Received: from server2.sourceware.org (ip-8-43-85-97.sourceware.org. [8.43.85.97]) by mx.google.com with ESMTPS id s12-20020a056402164c00b005230d2e3431si373365edx.50.2023.08.03.13.48.10 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 03 Aug 2023 13:48:11 -0700 (PDT) Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) client-ip=8.43.85.97; Authentication-Results: mx.google.com; dkim=pass header.i=@gcc.gnu.org header.s=default header.b=jw50a26G; spf=pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 133213857C55 for ; Thu, 3 Aug 2023 20:48:10 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 133213857C55 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1691095690; bh=+b5S0afg5XyhJ4xPyXHjC5MLQbuDOJ+SPlq19nKRJ48=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=jw50a26G2vbBBCNpmdTAZZp2b2HV9ylXR9G4zy01AQ0T4GXpI40//KWMtpzvuPKV9 uVAzgp596tJ9GdJ3g+vTXULwk3TojmmMTAknxnDaJk1CosRxKvZCjI1wmKLDBnFbPi HBVGkEv2z2hWZuCuuMVoEmDgE+TQHyHwxVtM3R0U= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16]) by sourceware.org (Postfix) with ESMTPS id DD8A03857C40 for ; Thu, 3 Aug 2023 20:47:26 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DD8A03857C40 Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 747302828D3; Thu, 3 Aug 2023 22:47:25 +0200 (CEST) Date: Thu, 3 Aug 2023 22:47:25 +0200 To: gcc-patches@gcc.gnu.org Subject: Update estimated iteraitons counts after splitting Message-ID: MIME-Version: 1.0 Content-Disposition: inline X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, JMQ_SPF_NEUTRAL, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Jan Hubicka via Gcc-patches From: Jan Hubicka Reply-To: Jan Hubicka Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org Sender: "Gcc-patches" X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1773242355128727686 X-GMAIL-MSGID: 1773242355128727686 Hi, Hmmer's internal function has 4 loops. The following is the profile at start: loop 1: estimate 472 iterations by profile: 473.497707 (reliable) count in:84821 (precise, freq 0.9979) loop 2: estimate 99 iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104) loop 3: estimate 99 iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104) loop 4: estimate 100 iterations by profile: 100.999596 (reliable) execution count:84167 (precise, freq 0.9902) So the first loops is outer loop and second/third loops are nesed. Fourth loop is not critical. Precise iteraiton counts are unknown (473 and 100 comes from profile) Nested loop has following form: for (k = 1; k <= M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } We do quite some belly dancing here. 1) loop-ch slightly misupdates profile, so the estimates of 99 does not match profile setimate of 100. 2) loops-split splits on if (k < M) and produces two loops. It fails to notice that the second loop never iterates. It used to misupdate profile a lot which later caused internal loop to become cold. This is fixed now. 3) loop-dist introduces runtime aliasing checks for both loops 4) tree vectorizer vectorizes some of the copies of the loop produces and drops expected iteration counts 5) loop peeling peels the loops with expected low iteration counts 6) complete loop unrolling kills some loops in prologues/epilogues. We end up with quite many loops and run out of registers: iterations by profile: 5.312499 (unreliable, maybe flat) this is vectorized internal loops after loop peeling iterations by profile: 0.009495 (unreliable, maybe flat) iterations by profile: 0.009495 (unreliable, maybe flat) iterations by profile: 0.009495 (unreliable, maybe flat) iterations by profile: 0.009495 (unreliable, maybe flat) Those are all versioned/peeled and vectorized variants of the loop never looping iterations by profile: 100.000008 (unreliable) iterations by profile: 100.000000 (unreliable) Those are variants with failed aliasing checks iterations by profile: 9.662853 (unreliable, maybe flat) iterations by profile: 4.646072 (unreliable) iterations by profile: 100.000007 (unreliable) iterations by profile: 5.312500 (unreliable) iterations by profile: 473.497707 (reliable) This is loop 1 iterations by profile: 100.999596 (reliable) This is the loop 4. This patch fixes loop iteration estimate update after loop split so we get: iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081) This is remainder of the peeled vectorized loop 2. It misses estimate that is correct since after peeling it 6 times it is essentially impossible to tell what the remaining loop profile is (without histograms) iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374801 (guessed, freq 4.4094) Peeled split part of loop 2 (one that never loops). We ought to work this out but at least w estimate 99 iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122) estimate 99 iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100) estimate 99 iterations by profile: 9.662853 (unreliable, maybe flat) entry count:35505353 (guessed, freq 417.7100) Profile here mismatches estimate - I will need to work out why. estimate 5 iterations by profile: 4.646072 (unreliable) entry count:31954818 (guessed, freq 375.9390) This is vectorized but not peeled loop 3 estimate 99 iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420) Unvectorized variant of loop 3 estimate 5 iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512) Another vectorized variant of loop 3 estimate 472 iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979) Outer loop estimate 100 iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902) loop 4, not vectorized/peeled So there is still work to do on this testcase, but with the patch we prevent 3 useless loops. Bootstrapped/regtested x86_64-linux, committed. gcc/ChangeLog: * tree-ssa-loop-split.cc (split_loop): Update estimated iteration counts. diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc index 923e49e716d..2f7918c6e65 100644 --- a/gcc/tree-ssa-loop-split.cc +++ b/gcc/tree-ssa-loop-split.cc @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see #include "gimplify-me.h" #include "print-tree.h" #include "value-query.h" +#include "sreal.h" /* This file implements two kinds of loop splitting. @@ -691,6 +692,37 @@ split_loop (class loop *loop1) = loop1_prob.invert (); fix_loop_bb_probability (loop1, loop2, true_edge, false_edge); + /* If conditional we split on has reliable profilea nd both + preconditionals of loop1 and loop2 are constant true, we can + only redistribute the iteration counts to the split loops. + + If the conditionals we insert before loop1 or loop2 are non-trivial + they increase expected loop count, so account this accordingly. + If we do not know the probability of split conditional, avoid + reudcing loop estimates, since we do not really know how they are + split between of the two new loops. Keep orignal estimate since + it is likely better then completely dropping it. + + TODO: If we know that onle of the new loops has constant + number of iterations, we can do better. We could also update + upper bounds. */ + if (loop1->any_estimate + && wi::fits_shwi_p (loop1->nb_iterations_estimate)) + { + sreal scale = true_edge->probability.reliable_p () + ? true_edge->probability.to_sreal () : (sreal)1; + sreal scale2 = false_edge->probability.reliable_p () + ? false_edge->probability.to_sreal () : (sreal)1; + /* +1 to get header interations rather than latch iterations and then + -1 to convert back. */ + loop1->nb_iterations_estimate + = MAX ((((sreal)loop1->nb_iterations_estimate.to_shwi () + 1) * scale + / loop1_prob.to_sreal ()).to_nearest_int () - 1, 0); + loop2->nb_iterations_estimate + = MAX ((((sreal)loop2->nb_iterations_estimate.to_shwi () + 1) * scale2 + / profile_probability::very_likely ().to_sreal ()) + .to_nearest_int () - 1, 0); + } update_loop_exit_probability_scale_dom_bbs (loop1); update_loop_exit_probability_scale_dom_bbs (loop2); @@ -711,8 +743,6 @@ split_loop (class loop *loop1) tree guard_next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop1)); patch_loop_exit (loop1, guard_stmt, guard_next, newend, initial_true); - /* TODO: Update any_esitmate and upper bounds. */ - /* Finally patch out the two copies of the condition to be always true/false (or opposite). */ gcond *force_true = as_a (*gsi_last_bb (bbs[i]));