From patchwork Fri Mar 24 13:04:03 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 74531
Date: Fri, 24 Mar 2023 14:04:03 +0100 (CET)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
cc: hongtao.liu@intel.com, Jan Hubicka
Subject: [PATCH 2/2] [i386] Adjust costing of emulated vectorized gather/scatter
Message-Id: <20230324130404.2C4ED138ED@imap2.suse-dmz.suse.de>

Emulated gather/scatter behave similarly to strided elementwise
accesses in that they need to decompose the offset vector
and construct or decompose the data vector, so handle them
the same way, pessimizing the cases with many elements.

For pr88531-2c.c instead of

.L4:
        leaq    (%r15,%rcx), %rdx
        incl    %edi
        movl    16(%rdx), %r13d
        movl    24(%rdx), %r14d
        movl    (%rdx), %r10d
        movl    4(%rdx), %r9d
        movl    8(%rdx), %ebx
        movl    12(%rdx), %r11d
        movl    20(%rdx), %r12d
        vmovss  (%rax,%r14,4), %xmm2
        movl    28(%rdx), %edx
        vmovss  (%rax,%r13,4), %xmm1
        vmovss  (%rax,%r10,4), %xmm0
        vinsertps       $0x10, (%rax,%rdx,4), %xmm2, %xmm2
        vinsertps       $0x10, (%rax,%r12,4), %xmm1, %xmm1
        vinsertps       $0x10, (%rax,%r9,4), %xmm0, %xmm0
        vmovlhps        %xmm2, %xmm1, %xmm1
        vmovss  (%rax,%rbx,4), %xmm2
        vinsertps       $0x10, (%rax,%r11,4), %xmm2, %xmm2
        vmovlhps        %xmm2, %xmm0, %xmm0
        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        vmulps  %ymm3, %ymm0, %ymm0
        vmovups %ymm0, (%r8,%rcx)
        addq    $32, %rcx
        cmpl    %esi, %edi
        jb      .L4

we now prefer

.L4:
        leaq    0(%rbp,%rdx,8), %rcx
        movl    (%rcx), %r10d
        movl    4(%rcx), %ecx
        vmovss  (%rsi,%r10,4), %xmm0
        vinsertps       $0x10, (%rsi,%rcx,4), %xmm0, %xmm0
        vmulps  %xmm1, %xmm0, %xmm0
        vmovlps %xmm0, (%rbx,%rdx,8)
        incq    %rdx
        cmpl    %edi, %edx
        jb      .L4

which vectorizes with SSE instead of AVX2, which looks like an
improvement.

When testing this on SPEC CPU 2017 with -Ofast -flto -march=znver4
there are quite a few cases where we now prefer SSE vectorization
over AVX512 + AVX2 epilogue and some cases where we now reject
vectorization.  Runtime-wise the changes are in the noise, with the
off-noise candidates being better after the patch.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK for stage1?

Thanks,
Richard.

        * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Tame
        down element extracts and scalar loads for gather/scatter similar
        to elementwise strided accesses.

        * gcc.target/i386/pr89618-2.c: New testcase.
        * gcc.target/i386/pr88531-2b.c: Adjust.
        * gcc.target/i386/pr88531-2c.c: Likewise.
---
 gcc/config/i386/i386.cc                    |  6 ++++--
 gcc/testsuite/gcc.target/i386/pr88531-2b.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr88531-2c.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr89618-2.c  | 23 ++++++++++++++++++++++
 4 files changed, 29 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89618-2.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 6a8734c2346..7a0b48c62c5 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23555,8 +23555,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       && stmt_info
       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
-      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
-      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
+      && ((STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+           && (TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
+               != INTEGER_CST))
+          || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
diff --git a/gcc/testsuite/gcc.target/i386/pr88531-2b.c b/gcc/testsuite/gcc.target/i386/pr88531-2b.c
index 011607c3d54..cdefff2ce8e 100644
--- a/gcc/testsuite/gcc.target/i386/pr88531-2b.c
+++ b/gcc/testsuite/gcc.target/i386/pr88531-2b.c
@@ -3,4 +3,4 @@
 
 #include "pr88531-2a.c"
 
-/* { dg-final { scan-assembler-times "vmulps" 2 } } */
+/* { dg-final { scan-assembler-times "vmulps" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr88531-2c.c b/gcc/testsuite/gcc.target/i386/pr88531-2c.c
index 0f7ec3832f8..17b24c0dacc 100644
--- a/gcc/testsuite/gcc.target/i386/pr88531-2c.c
+++ b/gcc/testsuite/gcc.target/i386/pr88531-2c.c
@@ -3,4 +3,4 @@
 
 #include "pr88531-2a.c"
 
-/* { dg-final { scan-assembler-times "vmulps" 2 } } */
+/* { dg-final { scan-assembler-times "vmulps" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89618-2.c b/gcc/testsuite/gcc.target/i386/pr89618-2.c
new file mode 100644
index 00000000000..0b7dcfd8806
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89618-2.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx2 -fdump-tree-vect-details" } */
+
+void foo (int n, int *off, double *a)
+{
+  const int m = 32;
+
+  for (int j = 0; j < n/m; ++j)
+    {
+      int const start = j*m;
+      int const end = (j+1)*m;
+
+#pragma GCC ivdep
+      for (int i = start; i < end; ++i)
+        {
+          a[off[i]] = a[i] < 0 ? a[i] : 0;
+        }
+    }
+}
+
+/* Make sure the cost model selects SSE vectors rather than AVX to avoid
+   too many scalar ops for the address computes in the loop.  */
+/* { dg-final { scan-tree-dump "loop vectorized using 16 byte vectors" "vect" } } */