From patchwork Fri Sep  2 09:57:56 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Sandiford <richard.sandiford@arm.com>
X-Patchwork-Id: 925
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:adf:ecc5:0:0:0:0:0 with SMTP id s5csp648519wro;
        Fri, 2 Sep 2022 02:58:45 -0700 (PDT)
X-Google-Smtp-Source: 
 AA6agR5RUQe06SNlexILgpeBTj1bCtoFueXhe1SuBlpGWpAn5V/mWoIZikXV2ciDEAtqDn7ZHf3a
X-Received: by 2002:a17:907:162a:b0:742:7a6:a812 with SMTP id
 hb42-20020a170907162a00b0074207a6a812mr12919722ejc.403.1662112724923;
        Fri, 02 Sep 2022 02:58:44 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1662112724; cv=none;
        d=google.com; s=arc-20160816;
        b=lE+TjPonPOZoXiMbcgnx8n5fY33RAp5po/hmdDwa88PJXId8bmG6cQ4W1mJMeDqgGR
         1h4bBHQ+YFNCY4N4g/4YRd4N4SoYTJ5gczTbxm0QIEi1Ej1jiI+ALccY2EihGd+5rWeb
         d7NUQrtdRoxN8RCF6uqvqRp8Ng+5uGmIZxaaFrTmamcktj9T8aYU2Jr1wfCzziSHJAPI
         zPZoo0OlozkOhmIQSD9R6g+UUOUCwZW0H+GfazQzEgTgcLS+EAxEmX3A9CmLZzwdtKCr
         35Ss038E94z6XqoBPfDpHtEVUlBe1jt3rdrPho/zR9OQYmNGtBUf81pYMmbAUGAU9ioH
         TLJg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:cc:reply-to:from:list-subscribe:list-help
         :list-post:list-archive:list-unsubscribe:list-id:precedence
         :mime-version:user-agent:message-id:date:subject:mail-followup-to:to
         :dmarc-filter:delivered-to:dkim-signature:dkim-filter;
        bh=i6FR7UjRvR2gcBaDIYYr7Hd3NhV82dGDI/9h+C04arU=;
        b=Dtb9CXw2/n5Px/7ua2xpH40iMnzDLfbd+/OC3e/iPzwDS7s9RIyh9VkMM9cNPr47iT
         NSBh6hj42E+Pv8RJSg5rG1zpnH/EmcPFXsIKowbNUbYrs1HespdZ/a+zeVnqsiar2nGY
         tKaInZwtIGhfNObbcHZfFUVVjqBmBmkvgJfcuzNs27wBEcrk3zPzdUOHAYI2cHk8+hfE
         bzT/mVwgAQ7zdpEBnoXXJyFTAguLTduB0xUu+YeBGpUCTtQXSV+gRipqXBRvBt4oIvJh
         C3DkXlrb03vfxoxx2B64jDQ6Hgo7GaqnFX9N8xf/3MK8tAVe0/afOZZH8jZhgbAkDr4o
         Oy7A==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=CphBAhlv;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from sourceware.org (server2.sourceware.org. [8.43.85.97])
        by mx.google.com with ESMTPS id
 h12-20020a056402280c00b004473654d9b6si63801ede.337.2022.09.02.02.58.44
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Fri, 02 Sep 2022 02:58:44 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender) client-ip=8.43.85.97;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=CphBAhlv;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id BA1803858286
	for <ouuuleilei@gmail.com>; Fri,  2 Sep 2022 09:58:43 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BA1803858286
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1662112723;
	bh=i6FR7UjRvR2gcBaDIYYr7Hd3NhV82dGDI/9h+C04arU=;
	h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post:
	 List-Help:List-Subscribe:From:Reply-To:Cc:From;
	b=CphBAhlvWPry0b+1e3LM+4i1cxWcbIfLdKQCMl5ixOqmt/c7/y2JkpPgqHya+h+Sc
	 HFToq1NXBczAJkabIg+J0GK8VZP4qa9Ea1ZIWSupknTaedtnXCU/azrQVDVTT6Y3WB
	 CIX4XOOjnThhNsrMHnydEBKLyFiPmwJ5Ql7NIGaI=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id 41C593858C54
 for <gcc-patches@gcc.gnu.org>; Fri,  2 Sep 2022 09:57:59 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 41C593858C54
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 90C69D6E;
 Fri,  2 Sep 2022 02:58:04 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E28BB3F71A;
 Fri,  2 Sep 2022 02:57:57 -0700 (PDT)
To: gcc-patches@gcc.gnu.org
Mail-Followup-To: gcc-patches@gcc.gnu.org, rguenther@suse.de,
 richard.sandiford@arm.com
Subject: [PATCH] vect: Ensure SLP nodes don't end up in multiple BB partitions
 [PR106787]
Date: Fri, 02 Sep 2022 10:57:56 +0100
Message-ID: <mptfshacd3v.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
X-Spam-Status: No, score=-49.6 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_NONE, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, SPF_HELO_NONE,
 SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: Richard Sandiford via Gcc-patches
 <gcc-patches@gcc.gnu.org>
From: Richard Sandiford <richard.sandiford@arm.com>
Reply-To: Richard Sandiford <richard.sandiford@arm.com>
Cc: rguenther@suse.de
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1742851512600781742?=
X-GMAIL-MSGID: =?utf-8?q?1742851512600781742?=

In the PR we have two REDUC_PLUS SLP instances that share a common
load of stride 4.  Each instance also has a unique contiguous load.

Initially all three loads are out of order, so have a nontrivial
load permutation.  The layout pass puts them in order instead,
For the two contiguous loads it is possible to do this by adjusting the
SLP_LOAD_PERMUTATION to be { 0, 1, 2, 3 }.  But a SLP_LOAD_PERMUTATION
of { 0, 4, 8, 12 } is rejected as unsupported, so the pass creates a
separate VEC_PERM_EXPR instead.

Later the 4-stride load's initial SLP_LOAD_PERMUTATION is rejected too,
so that the load gets replaced by an external node built from scalars.
We then have an external node feeding a VEC_PERM_EXPR.

VEC_PERM_EXPRs created in this way do not have any associated
SLP_TREE_SCALAR_STMTS.  This means that they do not affect the
decision about which nodes should be in which subgraph for costing
purposes.  If the VEC_PERM_EXPR is fed by a vect_external_def,
then the VEC_PERM_EXPR's input doesn't affect that decision either.

The net effect is that a shared VEC_PERM_EXPR fed by an external def
can appear in more than one subgraph.  This triggered an ICE in
vect_schedule_node, which (rightly) expects to be called no more
than once for the same internal def.

There seemed to be many possible fixes, including:

(1) Replace unsupported loads with external defs *before* doing
    the layout optimisation.  This would avoid the need for the
    VEC_PERM_EXPR altogether.

(2) If the target doesn't support a load in its original layout,
    stop the layout optimisation from checking whether the target
    supports loads in any new candidate layout.  In other words,
    treat all layouts as if they were supported whenever the
    original layout is not in fact supported.

    I'd rather not do this.  In principle, the layout optimisation
    could convert an unsupported layout to a supported one.
    Selectively ignoring target support would work against that.

    We could try to look specifically for loads that will need
    to be decomposed, but that just seems like admitting that
    things are happening in the wrong order.

(3) Add SLP_TREE_SCALAR_STMTS to VEC_PERM_EXPRs.

    That would be OK for this case, but wouldn't be possible
    for external defs that represent existing vectors.

(4) Make vect_schedule_slp share SCC info between subgraphs.

    It feels like that's working around the partitioning problem
    rather than a real fix though.

(5) Directly ensure that internal def nodes belong to a single
    subgraph.

(1) is probably the best long-term fix, but (5) is much simpler.
The subgraph partitioning code already has a hash set to record
which nodes have been visited; we just need to convert that to a
map from nodes to instances instead.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


gcc/
	PR tree-optimization/106787
	* tree-vect-slp.cc (vect_map_to_instance): New function, split out
	from...
	(vect_bb_partition_graph_r): ...here.  Replace the visited set
	with a map from nodes to instances.  Ensure that a node only
	appears in one partition.
	(vect_bb_partition_graph): Update accordingly.

gcc/testsuite/
	* gcc.dg/vect/bb-slp-layout-19.c: New test.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c | 34 ++++++++++
 gcc/tree-vect-slp.cc                         | 69 ++++++++++++--------
 2 files changed, 77 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
new file mode 100644
index 00000000000..f075a83a25b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fno-tree-loop-vectorize" } */
+
+extern int a[][4], b[][4], c[][4], d[4], e[4];
+void f()
+{
+  int t0 = a[0][3];
+  int t1 = a[1][3];
+  int t2 = a[2][3];
+  int t3 = a[3][3];
+  int a0 = 0, a1 = 0, a2 = 0, a3 = 0, b0 = 0, b1 = 0, b2 = 0, b3 = 0;
+  for (int j = 0; j < 100; ++j)
+    for (int i = 0; i < 400; i += 4)
+      {
+	a0 += b[i][3] * t0;
+	a1 += b[i][2] * t1;
+	a2 += b[i][1] * t2;
+	a3 += b[i][0] * t3;
+	b0 += c[i][3] * t0;
+	b1 += c[i][2] * t1;
+	b2 += c[i][1] * t2;
+	b3 += c[i][0] * t3;
+      }
+  d[0] = a0;
+  d[1] = a1;
+  d[2] = a2;
+  d[3] = a3;
+  e[0] = b0;
+  e[1] = b1;
+  e[2] = b2;
+  e[3] = b3;
+}
+
+/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp1" { target { vect_int_mult && vect_perm } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 1cf79eee4a6..59ec66a6f96 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6435,47 +6435,64 @@ get_ultimate_leader (slp_instance instance,
   return instance;
 }
 
+namespace {
+/* Subroutine of vect_bb_partition_graph_r.  Map KEY to INSTANCE in
+   KEY_TO_INSTANCE, making INSTANCE the leader of any previous mapping
+   for KEY.  Return true if KEY was already in KEY_TO_INSTANCE.
+
+   INSTANCE_LEADER is as for get_ultimate_leader.  */
+
+template<typename T>
+bool
+vect_map_to_instance (slp_instance instance, T key,
+		      hash_map<T, slp_instance> &key_to_instance,
+		      hash_map<slp_instance, slp_instance> &instance_leader)
+{
+  bool existed_p;
+  slp_instance &key_instance = key_to_instance.get_or_insert (key, &existed_p);
+  if (!existed_p)
+    ;
+  else if (key_instance != instance)
+    {
+      /* If we're running into a previously marked key make us the
+	 leader of the current ultimate leader.  This keeps the
+	 leader chain acyclic and works even when the current instance
+	 connects two previously independent graph parts.  */
+      slp_instance key_leader
+	= get_ultimate_leader (key_instance, instance_leader);
+      if (key_leader != instance)
+	instance_leader.put (key_leader, instance);
+    }
+  key_instance = instance;
+  return existed_p;
+}
+}
+
 /* Worker of vect_bb_partition_graph, recurse on NODE.  */
 
 static void
 vect_bb_partition_graph_r (bb_vec_info bb_vinfo,
 			   slp_instance instance, slp_tree node,
 			   hash_map<stmt_vec_info, slp_instance> &stmt_to_instance,
-			   hash_map<slp_instance, slp_instance> &instance_leader,
-			   hash_set<slp_tree> &visited)
+			   hash_map<slp_tree, slp_instance> &node_to_instance,
+			   hash_map<slp_instance, slp_instance> &instance_leader)
 {
   stmt_vec_info stmt_info;
   unsigned i;
 
   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
-    {
-      bool existed_p;
-      slp_instance &stmt_instance
-	= stmt_to_instance.get_or_insert (stmt_info, &existed_p);
-      if (!existed_p)
-	;
-      else if (stmt_instance != instance)
-	{
-	  /* If we're running into a previously marked stmt make us the
-	     leader of the current ultimate leader.  This keeps the
-	     leader chain acyclic and works even when the current instance
-	     connects two previously independent graph parts.  */
-	  slp_instance stmt_leader
-	    = get_ultimate_leader (stmt_instance, instance_leader);
-	  if (stmt_leader != instance)
-	    instance_leader.put (stmt_leader, instance);
-	}
-      stmt_instance = instance;
-    }
+    vect_map_to_instance (instance, stmt_info, stmt_to_instance,
+			  instance_leader);
 
-  if (!SLP_TREE_SCALAR_STMTS (node).is_empty () && visited.add (node))
+  if (vect_map_to_instance (instance, node, node_to_instance,
+			    instance_leader))
     return;
 
   slp_tree child;
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
     if (child && SLP_TREE_DEF_TYPE (child) == vect_internal_def)
       vect_bb_partition_graph_r (bb_vinfo, instance, child, stmt_to_instance,
-				 instance_leader, visited);
+				 node_to_instance, instance_leader);
 }
 
 /* Partition the SLP graph into pieces that can be costed independently.  */
@@ -6489,16 +6506,16 @@ vect_bb_partition_graph (bb_vec_info bb_vinfo)
      corresponding SLP graph entry and upon visiting a previously
      marked stmt, make the stmts leader the current SLP graph entry.  */
   hash_map<stmt_vec_info, slp_instance> stmt_to_instance;
+  hash_map<slp_tree, slp_instance> node_to_instance;
   hash_map<slp_instance, slp_instance> instance_leader;
-  hash_set<slp_tree> visited;
   slp_instance instance;
   for (unsigned i = 0; bb_vinfo->slp_instances.iterate (i, &instance); ++i)
     {
       instance_leader.put (instance, instance);
       vect_bb_partition_graph_r (bb_vinfo,
 				 instance, SLP_INSTANCE_TREE (instance),
-				 stmt_to_instance, instance_leader,
-				 visited);
+				 stmt_to_instance, node_to_instance,
+				 instance_leader);
     }
 
   /* Then collect entries to each independent subgraph.  */