Message ID | ZVreIppK5dO9j3oU@cowardly-lion.the-meissners.org |
---|---|
Headers |
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:9910:0:b0:403:3b70:6f57 with SMTP id i16csp1973198vqn; Sun, 19 Nov 2023 20:19:15 -0800 (PST) X-Google-Smtp-Source: AGHT+IEqchxkmL3odKNaA6u94aZTmOrBu43im8qJ9Q+m/jSDHi3SY0z0LfAXrl0osRfg68UMDpMP X-Received: by 2002:a05:6214:5188:b0:66d:2680:5a98 with SMTP id kl8-20020a056214518800b0066d26805a98mr8126789qvb.41.1700453955083; Sun, 19 Nov 2023 20:19:15 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1700453955; cv=pass; d=google.com; s=arc-20160816; b=HnM/Yf1O0pjh0kuraru7eGXnF++saCYlZ4/udpfj9bUWOKv5DSEq0MElRRY7RrNoUS qIIgUAN05Etbm2pdEJnwItTaHufjHvypYlxmJpn0xfeACZQIIaveDSqowLCTRZmepVpW PlxIfzZEH0Uglb7w8RMS6lUBZu55VGzCvcl9BEDf7IOG6qx4YxNbvQ0zyLjqMER7bzVo YZNEYL7TCLFNFV2vqPrplTvT1BgkZoNL/OF2XvJ4cpqhHDysH0v8itcq2D2Qhurr+VTI SrHGTLOon2hojCWbzKLQ+AS3GWPoQlOMhW/+4blXEW6hFAq9doK3QJktaF+bRol9J2NW oadA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=errors-to:list-subscribe:list-help:list-post:list-archive :list-unsubscribe:list-id:precedence:mime-version :content-disposition:mail-followup-to:message-id:subject:to:from :date:dkim-signature:arc-filter:dmarc-filter:delivered-to; bh=SmbIlG4zDAv85418WxgJQuucNSi7NOgkb0nOcMkv/WI=; fh=0bQqz0x7CUT8I+BDaNEJSevw0DVHT666RDr0Mh+m1Jk=; b=rDeO+vGrHcR/tR/RVNcivu4iKOsf49tRFbuPIwvxBBtBcZVnX5kVrSHte1OSrgsyau 4i7uMQtOqDlEF+XNSAFCogNvPl0ahipeeIj5Zx6AGSpTO9J3y3StwzxhnfM4J0Ftm1lZ aGDsU6nhvT1AQmJRsv0eR6k2jCTZc9ltMceFDzx3n8cSO88h6BKU34HmE3D+x8FSojsw zByjbzzgjkOw9Fa2ASzyukpufoy7qbWbYyekgVGZajQT/VJ0JqE/SpZduCN+VOmizj27 0TTndwd4EkBYNmlld6g9rX6dp7YQq5DZ3csXQi70HtwmoHKGpoNXqyHm75/DD+tAVR1/ xS9Q== ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=I6pAJ3ZU; arc=pass (i=1); spf=pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org"; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from server2.sourceware.org (server2.sourceware.org. 
[8.43.85.97]) by mx.google.com with ESMTPS id n9-20020a0cfbc9000000b0066dac72db9bsi6731995qvp.346.2023.11.19.20.19.15 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 19 Nov 2023 20:19:15 -0800 (PST) Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) client-ip=8.43.85.97; Authentication-Results: mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=I6pAJ3ZU; arc=pass (i=1); spf=pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org"; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id D681B3858284 for <ouuuleilei@gmail.com>; Mon, 20 Nov 2023 04:19:14 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id A7E9A3858C50 for <gcc-patches@gcc.gnu.org>; Mon, 20 Nov 2023 04:18:49 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A7E9A3858C50 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org A7E9A3858C50 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.156.1 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1700453931; cv=none; b=EUSHev5T2auYzaT7gPQo6YOzmTfU/TN/1LgWadCWSc+JEtyIIwnt5yk9tGIk3Q39oMIp5k9GX52pDIwFbpV6CrVl1AmqHj1KJKlc8o+bRQo9ZlbSZ4dI1oZkWtaEY5j9gkGZxeZIJjoY7+y7DmFUfYvwg/7jHM5CxFox+Wjt+LE= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1700453931; c=relaxed/simple; bh=A0kzJVWTrlm1Y1Yc6vawB3WhqjCcXXavz1s2IXfHKSA=; h=DKIM-Signature:Date:From:To:Subject:Message-ID:MIME-Version; b=vKVPGSAGxRXWJ8x1W2bnlg1/C5eOvIepzx86Kqp09FsBX/CS2bwUbKjCIoKkydWL5UrvyWJSUVtXE/+0RcddOpnKjEK8dmy/x4rrOxuWbVHGIBYwHSTxZwkOqO6qwxfRW1sHGTcx4JjEwtV4dPYpkW85e6eVKdAPkyF6tEMmlZY= ARC-Authentication-Results: i=1; server2.sourceware.org Received: from pps.filterd (m0353729.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AK2AbxX013285; Mon, 20 Nov 2023 04:18:47 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=date : from : to : subject : message-id : content-type : mime-version; s=pp1; bh=SmbIlG4zDAv85418WxgJQuucNSi7NOgkb0nOcMkv/WI=; b=I6pAJ3ZUpb4YOgh7RwGsqiWYh3iWrSFoDO8V8X20P87QJktSqyzA/XyEStAoC8gCTYKl /yO1zIkQOOxNhbid5x3fLpA/QzHBjrEe3tLeJTR0Ea9wLrhEwfzIdX3G8my9Mxscam7I 1XzGlUPz2QwujJnp3StnI6IkPZuUfO4iwK6lE/QhbTnBY/Wo3d6zp7AvjDzeZOK5n53W PoA76eG5SJnWdIWIsZo3O/H49vgJ6Z6pql0+Tjj4WxgTnRxo5H+BSb00GWaAozvqYHPe X3sutbL0H2kYIsmtMjs9THnMThVWholL0RNBTJIbQC70yHFRsFRXqnnu4InXQkwyr1b0 Jw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3ueywy2yj7-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 20 Nov 2023 04:18:47 +0000 Received: from m0353729.ppops.net (m0353729.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AK4HtlZ006092; Mon, 20 Nov 2023 04:18:46 GMT Received: from ppma11.dal12v.mail.ibm.com (db.9e.1632.ip4.static.sl-reverse.com [50.22.158.219]) by 
mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3ueywy2yhy-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 20 Nov 2023 04:18:46 +0000 Received: from pps.filterd (ppma11.dal12v.mail.ibm.com [127.0.0.1]) by ppma11.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AK1Yfmr000446; Mon, 20 Nov 2023 04:18:45 GMT Received: from smtprelay02.dal12v.mail.ibm.com ([172.16.1.4]) by ppma11.dal12v.mail.ibm.com (PPS) with ESMTPS id 3ufaa1pa2u-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 20 Nov 2023 04:18:45 +0000 Received: from smtpav02.wdc07v.mail.ibm.com (smtpav02.wdc07v.mail.ibm.com [10.39.53.229]) by smtprelay02.dal12v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AK4IiRS33096392 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 20 Nov 2023 04:18:45 GMT Received: from smtpav02.wdc07v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id BA33C5805B; Mon, 20 Nov 2023 04:18:44 +0000 (GMT) Received: from smtpav02.wdc07v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 142FD5805C; Mon, 20 Nov 2023 04:18:44 +0000 (GMT) Received: from cowardly-lion.the-meissners.org (unknown [9.61.1.46]) by smtpav02.wdc07v.mail.ibm.com (Postfix) with ESMTPS; Mon, 20 Nov 2023 04:18:43 +0000 (GMT) Date: Sun, 19 Nov 2023 23:18:42 -0500 From: Michael Meissner <meissner@linux.ibm.com> To: gcc-patches@gcc.gnu.org, Michael Meissner <meissner@linux.ibm.com>, Segher Boessenkool <segher@kernel.crashing.org>, "Kewen.Lin" <linkw@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>, Peter Bergner <bergner@linux.ibm.com> Subject: [PATCH 0/4] Add vector pair support to PowerPC attribute((vector_size(32))) Message-ID: <ZVreIppK5dO9j3oU@cowardly-lion.the-meissners.org> Mail-Followup-To: Michael Meissner <meissner@linux.ibm.com>, gcc-patches@gcc.gnu.org, Segher Boessenkool <segher@kernel.crashing.org>, "Kewen.Lin" <linkw@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>, Peter Bergner <bergner@linux.ibm.com> Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-TM-AS-GCONF: 00 X-Proofpoint-GUID: E4jbD196pSV1iAFaTnS1q00k9NEEwQTt X-Proofpoint-ORIG-GUID: PRWjoeVT5NHHTCdpUoLZPCkCWbkqzBwT X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.987,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-20_01,2023-11-17_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 phishscore=0 mlxlogscore=310 malwarescore=0 clxscore=1015 impostorscore=0 bulkscore=0 adultscore=0 priorityscore=1501 suspectscore=0 spamscore=0 lowpriorityscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311200028 X-Spam-Status: No, score=-3.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, KAM_SHORT, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: 
<mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783055206141910359 X-GMAIL-MSGID: 1783055206141910359 |
Series | Add vector pair support to PowerPC attribute((vector_size(32))) |
Message
Michael Meissner
Nov. 20, 2023, 4:18 a.m. UTC
This is similar to the patches posted on November 10th:

* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636077.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636078.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636083.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636080.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636081.html

which added a set of built-in functions that use the PowerPC __vector_pair type and provide basic operations on vector pairs.

After I posted those patches, it was decided that it would be better to have a new type rather than a bunch of new built-in functions.  Within the GCC context, the best way to add this support is to extend the vector modes so that V4DFmode, V8SFmode, V4DImode, V8SImode, V16HImode, and V32QImode are used.  These patches provide this new implementation.

While in theory you could add a whole new type that isn't a larger-size vector, my experience with IEEE 128-bit floating point is that GCC really doesn't like two modes that are the same size but have different implementations (as we see with IEEE 128-bit floating point and IBM double-double 128-bit floating point).  So I did not consider adding a new mode for use with vector pairs.

My original intention was to implement just V4DFmode and V8SFmode, since the primary users asking for vector pair support are people implementing high-end math libraries like Eigen and BLAS.  However, in implementing this code, I discovered that we will need integer vector pair support as well as floating point vector pairs.  The integer modes and types are needed to properly implement byte shuffling and vector comparisons, which require integer vector pairs.

With the current patches, vector pair support is not enabled by default.  The main reason is that I have not implemented the support for byte shuffling, which various tests depend on.

I would also like to implement overloads for the vector built-in functions like vec_add, vec_sum, etc. so that if you give them a vector pair, they handle it just as they would a vector type.  In addition, once the various bugs are addressed, I would implement support so that automatic vectorization would consider using vector pairs instead of vectors.

In terms of benchmarks, I wrote two benchmarks:

1) One benchmark is a saxpy-type loop: value[i] += (a[i] * b[i]).  That is a loop with 3 loads and a store per iteration.

2) The other benchmark produces a scalar sum of an entire vector.  This is a loop that has just a single load and no store.

For the saxpy-type loop, I get the following general numbers for both float and double:

1) The benchmarks that use attribute((vector_size(32))) are roughly 9-10% faster than using normal vector processing (both auto vectorization and using vector types).

2) The benchmarks that use attribute((vector_size(32))) are roughly 19-20% faster than if I write the loop using vector pair loads with the existing built-ins, and then manually split the values, do the arithmetic, and do single vector stores.

Unfortunately, for floating point, doing the sum of the whole vector in a simple loop is slower using the new vector pair support (compared to using the existing built-ins for disassembling vector pairs).  If I write more complex loops that manually unroll, then the floating point vector pair code performs like the integer vector pair code.  So there is some amount of tuning that will need to be done.
There are 4 patches in this set:

The first patch adds support for the types, implements the move operations, and provides some optimizations for extracting and setting an element.

The second patch implements the floating point arithmetic operations.

The third patch implements the integer operations.

The fourth patch provides new tests for these features.
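For readers who have not used the GCC generic vector extension, a minimal sketch of the kind of saxpy-style loop benchmarked above might look as follows; the type and function names are illustrative, not the actual benchmark source:

```c
/* Sketch only: a saxpy-style loop written with GCC's generic vector
   extension and a 32-byte vector size (one "vector pair" per element).  */
typedef double v4df __attribute__ ((vector_size (32)));   /* 4 doubles */

void
saxpy_like (v4df *restrict value, const v4df *restrict a,
            const v4df *restrict b, unsigned long n)
{
  for (unsigned long i = 0; i < n; i++)
    value[i] += a[i] * b[i];   /* 3 vector loads and 1 vector store per iteration */
}
```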
Comments
Add basic support for vector_size(32).

We have had several users ask us to implement ways of using the Power10 load vector pair and store vector pair instructions to give their code a speed up due to reduced memory bandwidth.

I had originally posted the following patches:

* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636077.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636078.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636083.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636080.html
* https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636081.html

to add a set of built-in functions that use the PowerPC __vector_pair type and provide basic operations on vector pairs.

After I posted these patches, it was decided that it would be better to have a new type rather than a bunch of new built-in functions.  Within the GCC context, the best way to add this support is to extend the vector modes so that V4DFmode, V8SFmode, V4DImode, V8SImode, V16HImode, and V32QImode are used.

While in theory you could add a whole new type that isn't a larger-size vector, my experience with IEEE 128-bit floating point is that GCC really doesn't like two modes that are the same size but have different implementations (as we see with IEEE 128-bit floating point and IBM double-double 128-bit floating point).  So I did not consider adding a new mode for use with vector pairs.

My original intention was to implement just V4DFmode and V8SFmode, since the primary users asking for vector pair support are people implementing high-end math libraries like Eigen and BLAS.  However, in implementing this code, I discovered that we will need integer vector pair support as well as floating point vector pairs.  The integer modes and types are needed to properly implement byte shuffling and vector comparisons, which require integer vector pairs.

With the current patches, vector pair support is not enabled by default.  The main reason is that I have not implemented the support for byte shuffling, which various tests depend on.

I would also like to implement overloads for the vector built-in functions like vec_add, vec_sum, etc. so that if you give them a vector pair, they handle it just as they would a vector type.  In addition, once the various bugs are addressed, I would implement support so that automatic vectorization would consider using vector pairs instead of vectors.

This is the first patch in the series.  It implements the basic modes and allows for initialization of the modes.  I've added some optimizations for extracting and setting fields within the vector pair.

The second patch will implement the floating point vector pair support.

The third patch will implement the integer vector pair support.

The fourth patch will provide new tests for the test suite.

When I test a saxpy-type loop (a[i] += (b[i] * c[i])), I generally see a 10% improvement over either auto vectorization or just using the vector types.

I have tested these patches on a little endian power10 system.  With -mvector-size-32 disabled by default, there are no regressions in the test suite.  I have also built and run the tests on both little endian power9 and big endian power9 systems, and there are no regressions.  Can I check these patches into the master branch?

2023-11-19  Michael Meissner  <meissner@linux.ibm.com>

gcc/

        * config/rs6000/constraints.md (eV): New constraint.
        * config/rs6000/predicates.md (const_0_to_31_operand): New predicate.
        (easy_vector_constant): Add support for vector pair constants.
        (easy_vector_pair_constant): New predicate.
        (mma_assemble_input_operand): Allow 16-byte vector modes other than V16QImode.
        * config/rs6000/rs6000-c.cc (rs6000_cpu_cpp_builtins): Define __VECTOR_SIZE_32__ if -mvector-size-32.
        * config/rs6000/rs6000-protos.h (vector_pair_to_vector_mode): New declaration.
        (split_vector_pair_constant): Likewise.
        (rs6000_expand_vector_pair_init): Likewise.
        * config/rs6000/rs6000.cc (rs6000_hard_regno_mode_ok_uncached): Use VECTOR_PAIR_MODE instead of comparing mode to OOmode.
        (rs6000_modes_tieable_p): Allow the various vector pair modes to pair with each other.  Allow 16-byte vectors to pair with vector pair modes.
        (rs6000_setup_reg_addr_masks): Use VECTOR_PAIR_MODE instead of comparing mode to OOmode.
        (rs6000_init_hard_regno_mode_ok): Set up vector pair mode basic type information and reload handlers.
        (rs6000_option_override_internal): Warn if -mvector-size-32 is used without -mcpu=power10 or -mmma.
        (vector_pair_to_vector_mode): New function.
        (split_vector_pair_constant): Likewise.
        (rs6000_expand_vector_pair_init): Likewise.
        (reg_offset_addressing_ok_p): Add support for vector pair modes.
        (rs6000_emit_move): Likewise.
        (rs6000_preferred_reload_class): Likewise.
        (altivec_expand_vec_perm_le): Likewise.
        (rs6000_opt_vars): Add -mvector-size-32 switch.
        (rs6000_split_multireg_move): Add support for vector pair modes.
        * config/rs6000/rs6000.h (VECTOR_PAIR_MODE): New macro.
        * config/rs6000/rs6000.md (wd mode attribute): Add vector pair modes.
        (RELOAD mode iterator): Likewise.
        (toplevel): Include vector-pair.md.
        * config/rs6000/rs6000.opt (-mvector-size-32): New option.
        * config/rs6000/vector-pair.md: New file.
        * doc/md.texi (PowerPC constraints): Document the eV constraint.
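As a rough, hedged illustration of the element extract and set operations this patch is said to optimize, the sketch below uses only attribute((vector_size)) and the __VECTOR_SIZE_32__ macro named in the ChangeLog; the type and function names are made up for the example:

```c
/* Sketch only: element extract/set on a 32-byte vector type using GCC's
   generic vector extension.  */
typedef float v8sf __attribute__ ((vector_size (32)));   /* 8 floats, one "vector pair" */

float
get_elem (v8sf v, unsigned i)
{
  return v[i];          /* element extract */
}

v8sf
set_elem (v8sf v, unsigned i, float x)
{
  v[i] = x;             /* element set */
  return v;
}

#if defined (__VECTOR_SIZE_32__)
/* Per the ChangeLog above, code could test this macro to know that the
   32-byte vector paths (-mvector-size-32) are enabled.  */
#endif
```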
On Mon, Nov 20, 2023 at 5:19 AM Michael Meissner <meissner@linux.ibm.com> wrote:
>
> [the full cover letter, quoted above, is snipped here]

I wouldn't expose the "fake" larger modes to the vectorizer but rather adjust m_suggested_unroll_factor (which you already do to some extent).
On Mon, Nov 20, 2023 at 08:24:35AM +0100, Richard Biener wrote:
> I wouldn't expose the "fake" larger modes to the vectorizer but rather
> adjust m_suggested_unroll_factor (which you already do to some extent).

Thanks.  I figure I first need to fix the shuffle bytes issue and get a clean test run (with the flag enabled by default) before delving into the vectorization issues.

But testing has shown that, at least in the loop I was looking at, using vector pair instructions (either through the built-ins I had previously posted or with these patches) is still faster, even with unrolling turned off completely for the vector pair case, than unrolling the loop 4 times with vector types (or auto vectorization).  Note that the margin is of course much smaller in this case.

vector double:           (a * b) + c, unroll 4        loop time: 0.55483
vector double:           (a * b) + c, unroll default  loop time: 0.55638
vector double:           (a * b) + c, unroll 0        loop time: 0.55686
vector double:           (a * b) + c, unroll 2        loop time: 0.55772

vector32, w/vector pair: (a * b) + c, unroll 4        loop time: 0.48257
vector32, w/vector pair: (a * b) + c, unroll 2        loop time: 0.50782
vector32, w/vector pair: (a * b) + c, unroll default  loop time: 0.50864
vector32, w/vector pair: (a * b) + c, unroll 0        loop time: 0.52224

Of course, being micro-benchmarks, this doesn't mean that it translates to the behavior of actual code.
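For anyone wanting to reproduce per-loop unroll variants like the ones timed above, here is a minimal sketch using GCC's loop unrolling pragma; the type, function name, and loop body are illustrative, not the actual benchmark code:

```c
/* Sketch only: (a * b) + c loop with an explicit per-loop unroll request.  */
typedef double v4df __attribute__ ((vector_size (32)));

void
fma_loop (v4df *restrict r, const v4df *restrict a,
          const v4df *restrict b, const v4df *restrict c, unsigned long n)
{
#pragma GCC unroll 4    /* the "unroll 4" variant; 0 or 1 blocks unrolling */
  for (unsigned long i = 0; i < n; i++)
    r[i] = a[i] * b[i] + c[i];   /* (a * b) + c as in the timings above */
}
```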
On Mon, Nov 20, 2023 at 9:56 AM Michael Meissner <meissner@linux.ibm.com> wrote:
> [earlier quoting snipped]
>
> But testing has shown that, at least in the loop I was looking at, using
> vector pair instructions (either through the built-ins I had previously
> posted or with these patches) is still faster, even with unrolling turned
> off completely for the vector pair case, than unrolling the loop 4 times
> with vector types (or auto vectorization).  Note that the margin is of
> course much smaller in this case.

But unrolling 2 times or doubling the vector mode size results in exactly the same thing - using a larger vectorization factor.

> [timings quoted above snipped]
>
> Of course, being micro-benchmarks, this doesn't mean that it translates to
> the behavior of actual code.

I guess the difference is from how RTL handles the larger modes vs. more instructions with the smaller mode (if you don't immediately expose the smaller modes during RTL expansion).

I'd compare the assembly of vector double with unroll 2 and vector32 with unroll 0.

Richard.
on 2023/11/20 16:56, Michael Meissner wrote:
> [reply and timings quoted above are snipped here]

I noticed that Ajit posted a patch for adding a new pass to replace vector loads (lxv) from contiguous addresses with lxvp:

https://inbox.sourceware.org/gcc-patches/ef0c54a5-c35c-3519-f062-9ac78ee66b81@linux.ibm.com/

How about making this kind of rs6000-specific pass pair both vector loads and stores?  Users can do more unrolling with parameters, and the memory accesses coming from unrolling should be neat, so I'd expect the pass could easily detect and pair the candidates.

BR,
Kewen
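To illustrate the idea, here is a small sketch of the kind of back-to-back 16-byte vector accesses such a pairing pass would look for; the names are illustrative, and the lxv/stxv to lxvp/stxvp fusion itself would be done by the compiler, not written in the source:

```c
/* Sketch only: adjacent 16-byte vector loads and stores that a pairing pass
   could fuse into lxvp/stxvp on Power10 (compile with -mcpu=power10).  */
#include <altivec.h>

void
copy_two_vectors (vector double *restrict dst, const vector double *restrict src)
{
  vector double lo = src[0];   /* two adjacent lxv loads ...            */
  vector double hi = src[1];   /* ... are candidates for a single lxvp  */
  dst[0] = lo;                 /* two adjacent stxv stores ...          */
  dst[1] = hi;                 /* ... are candidates for a single stxvp */
}
```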
On Fri, Nov 24, 2023 at 05:41:02PM +0800, Kewen.Lin wrote:
> [earlier quoting snipped]
>
> I noticed that Ajit posted a patch for adding a new pass to replace vector
> loads (lxv) from contiguous addresses with lxvp:
>
> https://inbox.sourceware.org/gcc-patches/ef0c54a5-c35c-3519-f062-9ac78ee66b81@linux.ibm.com/
>
> How about making this kind of rs6000-specific pass pair both vector loads
> and stores?  Users can do more unrolling with parameters, and the memory
> accesses coming from unrolling should be neat, so I'd expect the pass could
> easily detect and pair the candidates.

Yes, I tend to think a combination of things will be needed.

In my tests with a saxpy-type loop, I could not get the current built-ins to load/store vector pairs to be fast enough.  Peter's code that he posted helped, but ultimately it was still slower than adding vector_size(32).  I will try out the patch and compare it to my patches.