From patchwork Fri Mar 24 13:03:18 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 74530
Date: Fri, 24 Mar 2023 14:03:18 +0100 (CET)
To: gcc-patches@gcc.gnu.org
cc: richard.sandiford@arm.com
Subject: [PATCH 1/2] Add emulated scatter capability to the vectorizer
Message-Id: <20230324130319.5E23D138ED@imap2.suse-dmz.suse.de>
List-Id: Gcc-patches mailing list
From: Richard Biener
Reply-To: Richard Biener

This adds a scatter
vectorization capability to the vectorizer that works without target
support, by decomposing the offset and data vectors and then performing
scalar stores in the order of vector lanes.  This is aimed at cases
where vectorizing the rest of the loop offsets the cost of vectorizing
the scatter.

The offset load is still vectorized and costed as such, but as with
emulated gather it will be turned back into scalar loads by forwprop.

On top of the cost model patch, testing on SPEC CPU 2017 shows this
vectorizes an additional three loops with no runtime change.  Enabling
AVX512 scatters with
-mtune-ctrl=use_scatter_2parts,use_scatter_4parts,use_scatter
on the unpatched tree instead enables 166 new loops + epilogues to be
vectorized, so the actual Zen4 costs for AVX512 scatters are cheaper
than what we cost the emulated cases.  Note the handled cases do not
overlap 100%: there is one additional loop vectorized with the
emulation that is not vectorized with the AVX512 scatter ISA.

Bootstrapped and tested on x86_64-unknown-linux-gnu, queued for stage1.

Richard.

	* tree-vect-data-refs.cc (vect_analyze_data_refs): Always
	consider scatters.
	* tree-vect-stmts.cc (vect_model_store_cost): Pass in the
	gather-scatter info and cost emulated scatters accordingly.
	(get_load_store_type): Support emulated scatters.
	(vectorizable_store): Likewise.  Emulate them by extracting
	scalar offsets and data, doing scalar stores.

	* gcc.dg/vect/pr25413a.c: Un-XFAIL everywhere.
	* gcc.dg/vect/vect-71.c: Likewise.
	* gcc.dg/vect/tsvc/vect-tsvc-s4113.c: Likewise.
	* gcc.dg/vect/tsvc/vect-tsvc-s491.c: Likewise.
	* gcc.dg/vect/tsvc/vect-tsvc-vas.c: Likewise.
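
For illustration only (this is not part of the patch), the emulation
can be pictured in plain C roughly as follows.  The vectorization factor
VF = 4, the function names, and the use of small arrays in place of
vector registers are assumptions made for the sketch; it also assumes
that a does not alias off or b.  The real code works on GIMPLE vectors
and extracts lanes with BIT_FIELD_REF.

```c
#include <stddef.h>

/* Hypothetical vectorization factor for this sketch.  */
#define VF 4

/* The original scalar loop: a[off[i]] = b[i].  */
void
scatter_scalar (double *a, const int *off, const double *b, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

/* Emulated scatter, assuming a does not alias off or b.  Offsets and
   data for VF iterations are loaded as "vectors" (plain arrays here),
   then each lane is extracted and stored with a scalar store, in lane
   order so that repeated offsets keep the scalar semantics.  */
void
scatter_emulated (double *a, const int *off, const double *b, size_t n)
{
  size_t i = 0;
  for (; i + VF <= n; i += VF)
    {
      int voff[VF];
      double vdata[VF];
      for (int k = 0; k < VF; ++k)	/* stands in for the vector loads */
	{
	  voff[k] = off[i + k];
	  vdata[k] = b[i + k];
	}
      for (int k = 0; k < VF; ++k)	/* lane extract + scalar store */
	a[voff[k]] = vdata[k];
    }
  for (; i < n; ++i)			/* scalar epilogue */
    a[off[i]] = b[i];
}
```

Both functions should leave a in the same state for any offsets,
including repeated ones, because the stores happen in lane order.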
---
 gcc/testsuite/gcc.dg/vect/pr25413a.c          |   3 +-
 .../gcc.dg/vect/tsvc/vect-tsvc-s4113.c        |   2 +-
 .../gcc.dg/vect/tsvc/vect-tsvc-s491.c         |   2 +-
 .../gcc.dg/vect/tsvc/vect-tsvc-vas.c          |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-71.c           |   2 +-
 gcc/tree-vect-data-refs.cc                    |   4 +-
 gcc/tree-vect-stmts.cc                        | 102 +++++++++++++++---
 7 files changed, 91 insertions(+), 26 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr25413a.c b/gcc/testsuite/gcc.dg/vect/pr25413a.c
index e444b2c3e8e..ffb517c9ce0 100644
--- a/gcc/testsuite/gcc.dg/vect/pr25413a.c
+++ b/gcc/testsuite/gcc.dg/vect/pr25413a.c
@@ -123,7 +123,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { ! vect_scatter_store } } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target vect_scatter_store } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "vector alignment may not be reachable" 1 "vect" { target { ! vector_alignment_reachable } } } } */
 /* { dg-final { scan-tree-dump-times "Alignment of access forced using versioning" 1 "vect" { target { ! vector_alignment_reachable } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s4113.c b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s4113.c
index b64682a65df..ddb7e9dc0e8 100644
--- a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s4113.c
+++ b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s4113.c
@@ -39,4 +39,4 @@ int main (int argc, char **argv)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail { ! aarch64_sve } } } } */
+/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s491.c b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s491.c
index 8465e137070..29e90ff0aff 100644
--- a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s491.c
+++ b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s491.c
@@ -39,4 +39,4 @@ int main (int argc, char **argv)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail { ! aarch64_sve } } } } */
+/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-vas.c b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-vas.c
index 5ff38851f43..b72ee21a9a3 100644
--- a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-vas.c
+++ b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-vas.c
@@ -39,4 +39,4 @@ int main (int argc, char **argv)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail { ! aarch64_sve } } } } */
+/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-71.c b/gcc/testsuite/gcc.dg/vect/vect-71.c
index f15521176df..581473fa4a1 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-71.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-71.c
@@ -36,4 +36,4 @@ int main (void)
   return main1 ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_scatter_store } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 8daf7bd7dd3..17c235f5bf8 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -4464,9 +4464,7 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal)
                   && !TREE_THIS_VOLATILE (DR_REF (dr));
               bool maybe_scatter
                 = DR_IS_WRITE (dr)
-                  && !TREE_THIS_VOLATILE (DR_REF (dr))
-                  && (targetm.vectorize.builtin_scatter != NULL
-                      || supports_vec_scatter_store_p ());
+                  && !TREE_THIS_VOLATILE (DR_REF (dr));
 
               /* If target supports vector gather loads or scatter stores,
                  see if they can't be used.  */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index efa2d0daa52..423253da23d 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -942,6 +942,7 @@ cfun_returns (tree decl)
 static void
 vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
                        vect_memory_access_type memory_access_type,
+                       gather_scatter_info *gs_info,
                        dr_alignment_support alignment_support_scheme,
                        int misalignment,
                        vec_load_store_type vls_type, slp_tree slp_node,
@@ -997,8 +998,16 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_GATHER_SCATTER)
     {
-      /* N scalar stores plus extracting the elements.  */
       unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
+      if (memory_access_type == VMAT_GATHER_SCATTER
+          && gs_info->ifn == IFN_LAST && !gs_info->decl)
+        /* For emulated scatter N offset vector element extracts
+           (we assume the scalar scaling and ptr + offset add is consumed by
+           the load).  */
+        inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
+                                         vec_to_scalar, stmt_info, 0,
+                                         vect_body);
+      /* N scalar stores plus extracting the elements.  */
       inside_cost += record_stmt_cost (cost_vec,
                                        ncopies * assumed_nunits,
                                        scalar_store, stmt_info, 0, vect_body);
@@ -1008,7 +1017,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
                          misalignment, &inside_cost, cost_vec);
 
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || (memory_access_type == VMAT_GATHER_SCATTER
+          && gs_info->ifn == IFN_LAST && !gs_info->decl))
     {
       /* N scalar stores plus extracting the elements.  */
       unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
@@ -2503,19 +2514,11 @@ get_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
         }
       else if (gs_info->ifn == IFN_LAST && !gs_info->decl)
         {
-          if (vls_type != VLS_LOAD)
-            {
-              if (dump_enabled_p ())
-                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                                 "unsupported emulated scatter.\n");
-              return false;
-            }
-          else if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant ()
-                   || !TYPE_VECTOR_SUBPARTS
-                         (gs_info->offset_vectype).is_constant ()
-                   || !constant_multiple_p (TYPE_VECTOR_SUBPARTS
-                                              (gs_info->offset_vectype),
-                                            TYPE_VECTOR_SUBPARTS (vectype)))
+          if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant ()
+              || !TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype).is_constant ()
+              || !constant_multiple_p (TYPE_VECTOR_SUBPARTS
+                                         (gs_info->offset_vectype),
+                                       TYPE_VECTOR_SUBPARTS (vectype)))
             {
               if (dump_enabled_p ())
                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7751,6 +7754,15 @@ vectorizable_store (vec_info *vinfo,
                                  "unsupported access type for masked store.\n");
               return false;
             }
+          else if (memory_access_type == VMAT_GATHER_SCATTER
+                   && gs_info.ifn == IFN_LAST
+                   && !gs_info.decl)
+            {
+              if (dump_enabled_p ())
+                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                                 "unsupported masked emulated scatter.\n");
+              return false;
+            }
         }
       else
         {
@@ -7814,7 +7826,8 @@ vectorizable_store (vec_info *vinfo,
 
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
       vect_model_store_cost (vinfo, stmt_info, ncopies,
-                             memory_access_type, alignment_support_scheme,
+                             memory_access_type, &gs_info,
+                             alignment_support_scheme,
                              misalignment, vls_type, slp_node, cost_vec);
       return true;
     }
@@ -8575,7 +8588,8 @@ vectorizable_store (vec_info *vinfo,
                 final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
                                                final_mask, vec_mask, gsi);
 
-              if (memory_access_type == VMAT_GATHER_SCATTER)
+              if (memory_access_type == VMAT_GATHER_SCATTER
+                  && gs_info.ifn != IFN_LAST)
                 {
                   tree scale = size_int (gs_info.scale);
                   gcall *call;
@@ -8592,6 +8606,60 @@ vectorizable_store (vec_info *vinfo,
                   new_stmt = call;
                   break;
                 }
+              else if (memory_access_type == VMAT_GATHER_SCATTER)
+                {
+                  /* Emulated scatter.  */
+                  gcc_assert (!final_mask);
+                  unsigned HOST_WIDE_INT const_nunits = nunits.to_constant ();
+                  unsigned HOST_WIDE_INT const_offset_nunits
+                    = TYPE_VECTOR_SUBPARTS (gs_info.offset_vectype)
+                        .to_constant ();
+                  vec<constructor_elt, va_gc> *ctor_elts;
+                  vec_alloc (ctor_elts, const_nunits);
+                  gimple_seq stmts = NULL;
+                  tree elt_type = TREE_TYPE (vectype);
+                  unsigned HOST_WIDE_INT elt_size
+                    = tree_to_uhwi (TYPE_SIZE (elt_type));
+                  /* We support offset vectors with more elements
+                     than the data vector for now.  */
+                  unsigned HOST_WIDE_INT factor
+                    = const_offset_nunits / const_nunits;
+                  vec_offset = vec_offsets[j / factor];
+                  unsigned elt_offset = (j % factor) * const_nunits;
+                  tree idx_type = TREE_TYPE (TREE_TYPE (vec_offset));
+                  tree scale = size_int (gs_info.scale);
+                  align = get_object_alignment (DR_REF (first_dr_info->dr));
+                  tree ltype = build_aligned_type (TREE_TYPE (vectype), align);
+                  for (unsigned k = 0; k < const_nunits; ++k)
+                    {
+                      /* Compute the offsetted pointer.  */
+                      tree boff = size_binop (MULT_EXPR, TYPE_SIZE (idx_type),
+                                              bitsize_int (k + elt_offset));
+                      tree idx = gimple_build (&stmts, BIT_FIELD_REF,
+                                               idx_type, vec_offset,
+                                               TYPE_SIZE (idx_type), boff);
+                      idx = gimple_convert (&stmts, sizetype, idx);
+                      idx = gimple_build (&stmts, MULT_EXPR,
+                                          sizetype, idx, scale);
+                      tree ptr = gimple_build (&stmts, PLUS_EXPR,
+                                               TREE_TYPE (dataref_ptr),
+                                               dataref_ptr, idx);
+                      ptr = gimple_convert (&stmts, ptr_type_node, ptr);
+                      /* Extract the element to be stored.  */
+                      tree elt = gimple_build (&stmts, BIT_FIELD_REF,
+                                               TREE_TYPE (vectype), vec_oprnd,
+                                               TYPE_SIZE (elt_type),
+                                               bitsize_int (k * elt_size));
+                      gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
+                      stmts = NULL;
+                      tree ref = build2 (MEM_REF, ltype, ptr,
+                                         build_int_cst (ref_type, 0));
+                      new_stmt = gimple_build_assign (ref, elt);
+                      vect_finish_stmt_generation (vinfo, stmt_info,
+                                                   new_stmt, gsi);
+                    }
+                  break;
+                }
 
               if (i > 0)
                 /* Bump the vector pointer.  */

From patchwork Fri Mar 24 13:04:03 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 74531
Date: Fri, 24 Mar 2023 14:04:03 +0100 (CET)
To: gcc-patches@gcc.gnu.org
cc: hongtao.liu@intel.com, Jan Hubicka
Subject: [PATCH 2/2] [i386] Adjust costing of emulated vectorized gather/scatter
Message-Id: <20230324130404.2C4ED138ED@imap2.suse-dmz.suse.de>
List-Id: Gcc-patches mailing list
From: Richard Biener
Reply-To: Richard Biener

Emulated gather/scatter behave similarly to strided elementwise accesses
in that they need to decompose the offset vector and construct or
decompose the data vector, so handle them the same way, pessimizing the
cases with many elements.

For pr88531-2c.c instead of

.L4:
        leaq    (%r15,%rcx), %rdx
        incl    %edi
        movl    16(%rdx), %r13d
        movl    24(%rdx), %r14d
        movl    (%rdx), %r10d
        movl    4(%rdx), %r9d
        movl    8(%rdx), %ebx
        movl    12(%rdx), %r11d
        movl    20(%rdx), %r12d
        vmovss  (%rax,%r14,4), %xmm2
        movl    28(%rdx), %edx
        vmovss  (%rax,%r13,4), %xmm1
        vmovss  (%rax,%r10,4), %xmm0
        vinsertps       $0x10, (%rax,%rdx,4), %xmm2, %xmm2
        vinsertps       $0x10, (%rax,%r12,4), %xmm1, %xmm1
        vinsertps       $0x10, (%rax,%r9,4), %xmm0, %xmm0
        vmovlhps        %xmm2, %xmm1, %xmm1
        vmovss  (%rax,%rbx,4), %xmm2
        vinsertps       $0x10, (%rax,%r11,4), %xmm2, %xmm2
        vmovlhps        %xmm2, %xmm0, %xmm0
        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        vmulps  %ymm3, %ymm0, %ymm0
        vmovups %ymm0, (%r8,%rcx)
        addq    $32, %rcx
        cmpl    %esi, %edi
        jb      .L4

we now prefer

.L4:
        leaq    0(%rbp,%rdx,8), %rcx
        movl    (%rcx), %r10d
        movl    4(%rcx), %ecx
        vmovss  (%rsi,%r10,4), %xmm0
        vinsertps       $0x10, (%rsi,%rcx,4), %xmm0, %xmm0
        vmulps  %xmm1, %xmm0, %xmm0
        vmovlps %xmm0, (%rbx,%rdx,8)
        incq    %rdx
        cmpl    %edi, %edx
        jb      .L4

which vectorizes with SSE instead of AVX2, which looks like an
improvement.

When testing this on SPEC CPU 2017 with -Ofast -flto -march=znver4
there are quite a few cases where we now prefer SSE vectorization over
AVX512 + AVX2 epilogue and some cases where we now reject
vectorization.  Runtime-wise the changes are noise, with the off-noise
candidates better after the patch.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK for stage1?

Thanks,
Richard.

	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Tame
	down element extracts and scalar loads for gather/scatter
	similar to elementwise strided accesses.

	* gcc.target/i386/pr89618-2.c: New testcase.
	* gcc.target/i386/pr88531-2b.c: Adjust.
	* gcc.target/i386/pr88531-2c.c: Likewise.
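
To illustrate why this scaling steers the cost model towards narrower
vectors, here is a small sketch.  Only the "* (nunits + 1)" factor is
taken from the patch; the helper names are hypothetical, and the base
cost of 1 used in the numbers afterwards is made up (the real value
comes from ix86_builtin_vectorization_cost and depends on the statement
kind and the tuning).

```c
/* Cost of one element-wise statement after the scaling done in
   add_stmt_cost: stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1).
   BASE_COST stands in for ix86_builtin_vectorization_cost ().  */
unsigned
scaled_stmt_cost (unsigned base_cost, unsigned nunits)
{
  return base_cost * (nunits + 1);
}

/* Simplified total for processing N_ELEMS scalar elements with
   NUNITS-lane vectors, assuming one scaled statement per element
   (the vectorizer records ncopies * nunits such statements).  */
unsigned
elementwise_cost (unsigned n_elems, unsigned base_cost, unsigned nunits)
{
  return n_elems * scaled_stmt_cost (base_cost, nunits);
}
```

With a base cost of 1, handling 32 elements comes out at 32 * 5 = 160
with 4 lanes but 32 * 9 = 288 with 8 lanes, so the per-element cost of
the emulated accesses grows with the vector width.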
---
 gcc/config/i386/i386.cc                    |  6 ++++--
 gcc/testsuite/gcc.target/i386/pr88531-2b.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr88531-2c.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr89618-2.c  | 23 ++++++++++++++++++++++
 4 files changed, 29 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89618-2.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 6a8734c2346..7a0b48c62c5 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23555,8 +23555,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       && stmt_info
       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
-      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
-      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
+      && ((STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+           && (TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
+               != INTEGER_CST))
+          || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
diff --git a/gcc/testsuite/gcc.target/i386/pr88531-2b.c b/gcc/testsuite/gcc.target/i386/pr88531-2b.c
index 011607c3d54..cdefff2ce8e 100644
--- a/gcc/testsuite/gcc.target/i386/pr88531-2b.c
+++ b/gcc/testsuite/gcc.target/i386/pr88531-2b.c
@@ -3,4 +3,4 @@
 
 #include "pr88531-2a.c"
 
-/* { dg-final { scan-assembler-times "vmulps" 2 } } */
+/* { dg-final { scan-assembler-times "vmulps" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr88531-2c.c b/gcc/testsuite/gcc.target/i386/pr88531-2c.c
index 0f7ec3832f8..17b24c0dacc 100644
--- a/gcc/testsuite/gcc.target/i386/pr88531-2c.c
+++ b/gcc/testsuite/gcc.target/i386/pr88531-2c.c
@@ -3,4 +3,4 @@
 
 #include "pr88531-2a.c"
 
-/* { dg-final { scan-assembler-times "vmulps" 2 } } */
+/* { dg-final { scan-assembler-times "vmulps" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89618-2.c b/gcc/testsuite/gcc.target/i386/pr89618-2.c
new file mode 100644
index 00000000000..0b7dcfd8806
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89618-2.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx2 -fdump-tree-vect-details" } */
+
+void foo (int n, int *off, double *a)
+{
+  const int m = 32;
+
+  for (int j = 0; j < n/m; ++j)
+    {
+      int const start = j*m;
+      int const end = (j+1)*m;
+
+#pragma GCC ivdep
+      for (int i = start; i < end; ++i)
+        {
+          a[off[i]] = a[i] < 0 ? a[i] : 0;
+        }
+    }
+}
+
+/* Make sure the cost model selects SSE vectors rather than AVX to avoid
+   too many scalar ops for the address computes in the loop.  */
+/* { dg-final { scan-tree-dump "loop vectorized using 16 byte vectors" "vect" } } */
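
As a side note on the testcase above: with 16 byte vectors a vector of
double has 2 lanes, while a 16 byte vector of int offsets would have 4,
so one offset vector can feed two data vectors.  The following sketch
models the lane mapping that the emulated scatter code in patch 1/2
uses for that situation (the j / factor and (j % factor) * const_nunits
computation).  The lane counts, the function name and the array-based
"vectors" are assumptions for illustration, and whether the vectorizer
really picks a 4-lane offset vector here is not something the patch
states.

```c
#include <stddef.h>

#define DATA_NUNITS 2	/* assumed lanes in a 16 byte vector of double */
#define OFFSET_NUNITS 4	/* assumed lanes in a 16 byte vector of int */

/* Store data vector copy J through the matching lanes of the offset
   vectors.  Indexing BASE by element folds in the scaling that the
   real code does explicitly with gs_info.scale.  */
void
emulated_scatter_copy (double *base, int vec_offsets[][OFFSET_NUNITS],
		       const double vec_oprnd[DATA_NUNITS], unsigned j)
{
  /* How many data vectors one offset vector covers.  */
  unsigned factor = OFFSET_NUNITS / DATA_NUNITS;
  /* Offset vector used by copy J and the first lane within it.  */
  const int *vec_offset = vec_offsets[j / factor];
  unsigned elt_offset = (j % factor) * DATA_NUNITS;

  for (unsigned k = 0; k < DATA_NUNITS; ++k)
    {
      ptrdiff_t idx = vec_offset[k + elt_offset];	/* offset lane extract */
      base[idx] = vec_oprnd[k];				/* data lane + scalar store */
    }
}
```

Copy 0 then consumes offset lanes 0 and 1, and copy 1 consumes lanes 2
and 3 of the same offset vector.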