Message ID | 20220815052519.194582-1-guojiufu@linux.ibm.com |
---|---|
State | New, archived |
Headers |
To: gcc-patches@gcc.gnu.org Subject: [RFC]rs6000: split complicated constant to memory Date: Mon, 15 Aug 2022 13:25:19 +0800 Message-Id: <20220815052519.194582-1-guojiufu@linux.ibm.com> X-Mailer: git-send-email 2.17.1 List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> From: Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Jiufu Guo <guojiufu@linux.ibm.com> Cc: dje.gcc@gmail.com, segher@kernel.crashing.org, linkw@gcc.gnu.org |
Series |
[RFC] rs6000: split complicated constant to memory
|
|
Commit Message
Jiufu Guo
Aug. 15, 2022, 5:25 a.m. UTC
Hi,

This patch tries to put a constant into the constant pool if building the
constant requires 3 or more instructions.

But there is a concern: I'm wondering whether this patch is really profitable.

As I tested:
1. for a simple case, where the instructions are not run in parallel,
   loading the constant from memory may be faster; but
2. if some of the instructions can run in parallel, loading the constant
   from memory does not win compared with building the constant.
See the examples below.

For f1.c and f3.c, 'loading' the constant would be acceptable in terms of
runtime; for f2.c and f4.c, 'loading' the constant is visibly slower.

For real-world cases, both kinds of code sequences exist.

So, I'm not sure if we need to push this patch.

Each function below was run many times (1000000000) to check the runtime.

f1.c:
long foo (long *arg, long*, long *)
{
  *arg = 0x1234567800000000;
}
asm building constant:
        lis 10,0x1234
        ori 10,10,0x5678
        sldi 10,10,32
vs. asm loading
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
The runtimes of 'building' and 'loading' are similar: sometimes
'building' is faster; sometimes 'loading' is faster. And the difference is
slight.

f2.c
long foo (long *arg, long *arg2, long *arg3)
{
  *arg = 0x1234567800000000;
  *arg2 = 0x7965234700000000;
  *arg3 = 0x4689123700000000;
}
asm building constant:
        lis 7,0x1234
        lis 10,0x7965
        lis 9,0x4689
        ori 7,7,0x5678
        ori 10,10,0x2347
        ori 9,9,0x1237
        sldi 7,7,32
        sldi 10,10,32
        sldi 9,9,32
vs. loading
        addis 7,2,.LC0@toc@ha
        addis 10,2,.LC1@toc@ha
        addis 9,2,.LC2@toc@ha
        ld 7,.LC0@toc@l(7)
        ld 10,.LC1@toc@l(10)
        ld 9,.LC2@toc@l(9)
For this case, 'loading' is always slower than 'building' (>15%).

f3.c
long foo (long *arg, long *, long *)
{
  *arg = 384307168202282325;
}
        lis 10,0x555
        ori 10,10,0x5555
        sldi 10,10,32
        oris 10,10,0x5555
        ori 10,10,0x5555
For this case, 'building' (through 5 instructions) is slower, and 'loading'
is ~5% faster.

f4.c
long foo (long *arg, long *arg2, long *arg3)
{
  *arg = 384307168202282325;
  *arg2 = -6148914691236517205;
  *arg3 = 768614336404564651;
}
        lis 7,0x555
        lis 10,0xaaaa
        lis 9,0xaaa
        ori 7,7,0x5555
        ori 10,10,0xaaaa
        ori 9,9,0xaaaa
        sldi 7,7,32
        sldi 10,10,32
        sldi 9,9,32
        oris 7,7,0x5555
        oris 10,10,0xaaaa
        oris 9,9,0xaaaa
        ori 7,7,0x5555
        ori 10,10,0xaaab
        ori 9,9,0xaaab
For this case, since the 'building' sequences for the three constants can
run in parallel, 'loading' is slower: ~8%. On p10, 'loading' (through 'pld')
is also slower: >4%.


BR,
Jeff(Jiufu)

---
 gcc/config/rs6000/rs6000.cc                | 14 ++++++++++++++
 gcc/testsuite/gcc.target/powerpc/pr63281.c | 11 +++++++++++
 2 files changed, 25 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
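As a cross-check on the 'building' sequences quoted in the message, the following is a minimal C sketch that emulates the register effect of lis, ori, sldi and oris and confirms that the listed immediates produce the constants used in f1.c, f3.c and f4.c. It is only an illustration: the helper names (lis, ori, oris, sldi, build_const5, check_examples) are hypothetical, and it models values only, not latency or scheduling.

```c
#include <assert.h>
#include <stdint.h>

/* lis rD,imm: the 16-bit immediate is sign-extended and shifted left by 16. */
static uint64_t lis(uint16_t imm)
{
    uint64_t v = imm;
    if (v & 0x8000)
        v |= 0xFFFFFFFFFFFF0000ull;     /* sign-extend to 64 bits */
    return v << 16;
}

/* ori rD,rS,imm: OR in a zero-extended 16-bit immediate. */
static uint64_t ori(uint64_t r, uint16_t imm)
{
    return r | imm;
}

/* oris rD,rS,imm: OR in the immediate shifted left by 16. */
static uint64_t oris(uint64_t r, uint16_t imm)
{
    return r | ((uint64_t)imm << 16);
}

/* sldi rD,rS,n: shift left by n bits (n < 64). */
static uint64_t sldi(uint64_t r, unsigned n)
{
    return r << n;
}

/* The five-instruction form: lis, ori, sldi 32, oris, ori. */
static uint64_t build_const5(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    uint64_t r = lis(a);
    r = ori(r, b);
    r = sldi(r, 32);
    r = oris(r, c);
    r = ori(r, d);
    return r;
}

/* Checks the immediates from f1.c, f3.c and f4.c against their constants. */
static void check_examples(void)
{
    /* f1.c: three instructions. */
    assert(sldi(ori(lis(0x1234), 0x5678), 32) == 0x1234567800000000ull);
    /* f3.c and the first constant of f4.c. */
    assert(build_const5(0x0555, 0x5555, 0x5555, 0x5555)
           == (uint64_t)384307168202282325LL);
    /* Remaining constants of f4.c. */
    assert(build_const5(0xaaaa, 0xaaaa, 0xaaaa, 0xaaab)
           == (uint64_t)-6148914691236517205LL);
    assert(build_const5(0x0aaa, 0xaaaa, 0xaaaa, 0xaaab)
           == (uint64_t)768614336404564651LL);
}
```

Calling check_examples() from a main function runs the checks. Each step of build_const5 depends on the previous one, which is why a single constant forms a serial chain while the three chains in f4.c can overlap.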
Comments
On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> asm building constant:
>         lis 10,0x1234
>         ori 10,10,0x5678
>         sldi 10,10,32
> vs. asm loading
>         addis 10,2,.LC0@toc@ha
>         ld 10,.LC0@toc@l(10)
> The runtimes of 'building' and 'loading' are similar: sometimes
> 'building' is faster; sometimes 'loading' is faster. And the difference is
> slight.

I wonder if it is possible to decide this during scheduling - choose the
variant that, when the result is needed, is cheaper?  Post-RA might
be a bit difficult (I see the load from memory needs the TOC, but then
when the TOC is not available we could just always emit the build form),
and pre-reload precision might be not good enough to make this worth
the experiment?

Of course the scheduler might be lacking on the technical side as well.
Hi!

On Mon, Aug 15, 2022 at 01:25:19PM +0800, Jiufu Guo wrote:
> vs. asm loading
>         addis 10,2,.LC0@toc@ha
>         ld 10,.LC0@toc@l(10)

This is just a load insn, unless this is the only thing needing the TOC.
You can use crtl->uses_const_pool as an approximation here, to figure
out if we have that case?

> The runtimes of 'building' and 'loading' are similar: sometimes
> 'building' is faster; sometimes 'loading' is faster. And the difference is
> slight.

When there is only one constant, sure.  But that isn't the expensive
case we need to avoid :-)

>         addis 9,2,.LC2@toc@ha
>         ld 7,.LC0@toc@l(7)
>         ld 10,.LC1@toc@l(10)
>         ld 9,.LC2@toc@l(9)
> For this case, 'loading' is always slower than 'building' (>15%).

Only if there is nothing else to do, and only in cases where code size
does not matter (i.e. microbenchmarks).

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> @@ -0,0 +1,11 @@
> +/* PR target/63281 */
> +/* { dg-do compile { target lp64 } } */
> +/* { dg-options "-O2 -std=c99" } */

Why std=c99 btw?  The default is c17.  Is there something we need to
disable here?


Segher
Hi,

Richard Biener <richard.guenther@gmail.com> writes:

> I wonder if it is possible to decide this during scheduling - choose the
> variant that, when the result is needed, is cheaper?  Post-RA might
> be a bit difficult (I see the load from memory needs the TOC, but then
> when the TOC is not available we could just always emit the build form),
> and pre-reload precision might be not good enough to make this worth
> the experiment?

Thanks a lot for your comments!

Yes, post-RA may not handle all cases.
If there is no TOC available, we are not able to load the constant through
the TOC.  As Segher pointed out, crtl->uses_const_pool may serve as an
approximation.
The sched2 pass can optimize some cases (e.g. f2.c and f4.c), but in
other cases it may not spread out those 'building' instructions.

So, maybe we could add a peephole after sched2: if the five instructions
that build the constant are still consecutive, replace them with a 'load'
(after checking that the TOC is available).
But I'm not sure if it is worthwhile.

> Of course the scheduler might be lacking on the technical side as well.


BR,
Jeff(Jiufu)
Jiufu Guo <guojiufu@linux.ibm.com> writes:

> So, maybe we could add a peephole after sched2: if the five instructions
> that build the constant are still consecutive, replace them with a 'load'
> (after checking that the TOC is available).
> But I'm not sure if it is worthwhile.

Oh, after checking the object files (from GCC bootstrap and SPEC), it is
rare that the five instructions are consecutive.  Often 1 (or 2) of the
insns are scheduled apart, and the other 4 (or 3) instructions remain
consecutive.  So, using a peephole may not be very helpful.

BR,
Jeff(Jiufu)
Hi,

Segher Boessenkool <segher@kernel.crashing.org> writes:

>> vs. asm loading
>>         addis 10,2,.LC0@toc@ha
>>         ld 10,.LC0@toc@l(10)
>
> This is just a load insn, unless this is the only thing needing the TOC.
> You can use crtl->uses_const_pool as an approximation here, to figure
> out if we have that case?

Thanks for pointing this out!
crtl->uses_const_pool is set to 1 in force_const_mem.
create_TOC_reference would be called after force_const_mem.
One concern: there may be cases where crtl->uses_const_pool is not
cleared to zero after the related symbols are optimized out.

>> The runtimes of 'building' and 'loading' are similar: sometimes
>> 'building' is faster; sometimes 'loading' is faster. And the difference is
>> slight.
>
> When there is only one constant, sure.  But that isn't the expensive
> case we need to avoid :-)

Yes.  If there are other instructions around, the scheduler could arrange
the 'building' instructions to run in parallel with other instructions.
If we emit the 'building' instructions in the split1 pass (before sched1),
they are more likely to be scheduled well.  Then the 'building' form may
not be bad.

>>         addis 9,2,.LC2@toc@ha
>>         ld 7,.LC0@toc@l(7)
>>         ld 10,.LC1@toc@l(10)
>>         ld 9,.LC2@toc@l(9)
>> For this case, 'loading' is always slower than 'building' (>15%).
>
> Only if there is nothing else to do, and only in cases where code size
> does not matter (i.e. microbenchmarks).

Yes, 'loading' may save code size slightly.

>> +/* { dg-options "-O2 -std=c99" } */
>
> Why std=c99 btw?  The default is c17.  Is there something we need to
> disable here?

Oh, this option is not required.  Thanks!

BR,
Jeff(Jiufu)
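The timing harness behind the numbers discussed in this thread was not posted. A minimal sketch of the kind of loop described ("run a lot of times") might look like the following, assuming the functions are reached through a volatile function pointer so the compiler cannot inline or drop the calls. The names store_fn, store_one, store_three and time_calls are hypothetical, and int64_t is used in place of long so the constants fit on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*store_fn)(int64_t *, int64_t *, int64_t *);

/* f1.c-style body: a single constant, so one dependent 'building' chain. */
static void store_one(int64_t *a, int64_t *b, int64_t *c)
{
    (void)b;
    (void)c;
    *a = 0x1234567800000000;
}

/* f2.c-style body: three independent constants that can be built in parallel. */
static void store_three(int64_t *a, int64_t *b, int64_t *c)
{
    *a = 0x1234567800000000;
    *b = 0x7965234700000000;
    *c = 0x4689123700000000;
}

/* Calls fn iters times and returns the CPU time used, in seconds. */
static double time_calls(store_fn fn, unsigned long iters)
{
    assert(fn != NULL);
    store_fn volatile call = fn;    /* re-read each iteration; blocks inlining */
    int64_t a = 0, b = 0, c = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++)
        call(&a, &b, &c);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To compare 'building' with 'loading', the same file would be compiled once with and once without the patch and time_calls(store_three, 1000000000UL) run for each build. As noted above, such a loop has nothing else to schedule around the constants, so it measures the microbenchmark case only.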
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 4b727d2a500..3798e11bdbc 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -10098,6 +10098,20 @@ rs6000_emit_set_const (rtx dest, rtx source)
           c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
           emit_move_insn (lo, GEN_INT (c));
         }
+      else if (base_reg_operand (dest, mode)
+               && num_insns_constant (source, mode) > 2)
+        {
+          rtx sym = force_const_mem (mode, source);
+          if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
+              && use_toc_relative_ref (XEXP (sym, 0), mode))
+            {
+              rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
+              sym = gen_const_mem (mode, toc);
+              set_mem_alias_set (sym, get_TOC_alias_set ());
+            }
+
+          emit_insn (gen_rtx_SET (dest, sym));
+        }
       else
         rs6000_emit_set_long_const (dest, c);
       break;
diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
new file mode 100644
index 00000000000..469a8f64400
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
@@ -0,0 +1,11 @@
+/* PR target/63281 */
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -std=c99" } */
+
+void
+foo (unsigned long long *a)
+{
+  *a = 0x020805006106003;
+}
+
+/* { dg-final { scan-assembler-times {\mp?ld\M} 1 } } */
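The new branch in the patch is taken when num_insns_constant reports more than 2 instructions. To make that threshold concrete, here is a simplified, hypothetical cost model in C. It is not GCC's num_insns_constant (which knows about more instruction forms, such as rotate-and-mask sequences); it only counts the plain li/lis/ori/sldi/oris sequences shown in this thread, and it ignores the base_reg_operand and TOC checks. The names fits_s16, count_s32, model_insn_count, model_prefers_pool and check_model are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v, viewed as a signed 64-bit value, fits in a signed 16-bit field. */
static bool fits_s16(uint64_t v)
{
    return v + 0x8000u < 0x10000u;
}

/* Instructions for a value that fits in signed 32 bits:
   li (1), lis alone (1), or lis + ori (2). */
static int count_s32(uint64_t v)
{
    if (fits_s16(v))
        return 1;
    return (v & 0xffff) ? 2 : 1;
}

/* Simplified count for a 64-bit constant: build the high word,
   sldi by 32, then oris/ori for the non-zero low halfwords. */
static int model_insn_count(uint64_t v)
{
    if (v + 0x80000000ull < 0x100000000ull)     /* fits in signed 32 bits */
        return count_s32(v);

    uint64_t hi = v >> 32;
    if (hi & 0x80000000ull)
        hi |= 0xFFFFFFFF00000000ull;            /* sign-extend the high word */

    int n = count_s32(hi) + 1;                  /* high word + sldi */
    if ((v >> 16) & 0xffff)
        n++;                                    /* oris */
    if (v & 0xffff)
        n++;                                    /* ori */
    return n;
}

/* Mirrors the patch's threshold: more than 2 instructions goes to the pool. */
static bool model_prefers_pool(uint64_t v)
{
    return model_insn_count(v) > 2;
}

/* Checks the model against the sequences quoted in this thread. */
static void check_model(void)
{
    assert(model_insn_count(0x1234567800000000ull) == 3);   /* f1.c */
    assert(model_insn_count(384307168202282325ull) == 5);   /* f3.c */
    assert(model_insn_count(0x020805006106003ull) == 5);    /* pr63281.c */
    assert(model_insn_count(0x12345678ull) == 2);           /* lis + ori */
    assert(!model_prefers_pool(0x12345678ull));
    assert(model_prefers_pool(0x1234567800000000ull));
}
```

Under this model the three-instruction constant from f1.c already crosses the threshold, which matches the case where the thread reports 'building' and 'loading' to be about equally fast.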