From patchwork Thu Sep 1 15:39:42 2022
X-Patchwork-Submitter: Chung-Lin Tang
X-Patchwork-Id: 901
Message-ID: <73084783-7ab7-0d96-27bf-6a3292a74179@codesourcery.com>
Date: Thu, 1 Sep 2022 23:39:42 +0800
From: Chung-Lin Tang
Subject: [OpenMP, nvptx] Use bar.sync/arrive for barriers when tasking is not used
To: gcc-patches, Tom de Vries, Jakub Jelinek, Catherine Moore

Hi, our work on the SPEChpc2021 benchmarks shows that, after the fix for
PR99555 was committed:

  [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
  https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1

while that patch fixed the hang, it introduced quite severe performance
regressions caused by the new barrier code.
Under OpenMP target offload mode, Minisweep regressed by about 350%,
while HPGMG-FV was about 2x slower. The culprit is presumably the new
barriers, which replaced erroneous but fast bar.sync instructions with
correct but heavy-weight futex_wait/wake operations on the GPU. This is
probably required for preserving correct task vs. barrier behavior.

However, the observation is: when no task-related functionality is used
by the team inside an OpenMP target region, a barrier is just a place to
wait for all threads to rejoin (there is no question of waking waiting
tasks to re-start), so such a barrier can be implemented by simple
bar.sync and bar.arrive PTX instructions. That should recover most of
the performance in the cases that usually matter, e.g. 'omp parallel
for' inside 'omp target' (a small example follows after the patch at the
end of this mail).

So the plan is to mark the cases where tasks are never used. This patch
adds a 'task_never_used' flag inside struct gomp_team, initialized to
true, and set to false when tasks are added to the team. The nvptx
specific gomp_team_barrier_wait_end routines can then use the simple
barrier when team->task_never_used is still true at the barrier.

Some other cases, like the master/masked and single constructs, also
need task_never_used set to false: these constructs inherently create
asymmetric loads where only a subset of threads runs through the region
(which may or may not use tasking), so different threads could otherwise
arrive at the ending barrier assuming different task_never_used states.
For correctness, these constructs must conservatively mark
team->task_never_used false at the start of the construct. (Sketches of
both the fast path and this conservative flag clearing also follow after
the patch below.)

This patch has been divided into two parts: the first inlines the
contents of config/linux/bar.c into config/nvptx/bar.c (instead of an
#include), which is needed because some parts of
gomp_team_barrier_wait_[cancel_]end now need nvptx specific adjustments;
the second contains the changes described above.

Tested on powerpc64le-linux and x86_64-linux with nvptx offloading,
seeking approval for trunk.

Thanks,
Chung-Lin

From c2fdc31880d2d040822e8abece015c29a6d7b472 Mon Sep 17 00:00:00 2001
From: Chung-Lin Tang
Date: Thu, 1 Sep 2022 05:53:49 -0700
Subject: [PATCH 1/2] libgomp: inline config/linux/bar.c into config/nvptx/bar.c

Preparing to add nvptx specific modifications to
gomp_team_barrier_wait_end, et al., this changes from using an #include
of config/linux/bar.c in config/nvptx/bar.c to a full copy of the
implementation.

2022-09-01  Chung-Lin Tang

libgomp/ChangeLog:

	* config/nvptx/bar.c: Adjust include of "../linux/bar.c" into an
	inlining of the contents of config/linux/bar.c.
---
 libgomp/config/nvptx/bar.c | 183 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 180 insertions(+), 3 deletions(-)

diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index eee2107..a850c22 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -161,6 +161,183 @@ static inline void
 do_wait (int *addr, int val)
 {
   futex_wait (addr, val);
 }
 
-/* Reuse the linux implementation. */
-#define GOMP_WAIT_H 1
-#include "../linux/bar.c"
+/* Below is based on the linux implementation.  */
+
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+                        MEMMODEL_RELEASE);
+      futex_wake ((int *) &bar->generation, INT_MAX);
+    }
+  else
+    {
+      do
+        do_wait ((int *) &bar->generation, state);
+      while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == state);
+    }
+}
+
+void
+gomp_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+}
+
+/* Like gomp_barrier_wait, except that if the encountering thread
+   is not the last one to hit the barrier, it returns immediately.
+   The intended usage is that a thread which intends to gomp_barrier_destroy
+   this barrier calls gomp_barrier_wait, while all other threads
+   call gomp_barrier_wait_last.  When gomp_barrier_wait returns,
+   the barrier can be safely destroyed.  */
+
+void
+gomp_barrier_wait_last (gomp_barrier_t *bar)
+{
+  gomp_barrier_state_t state = gomp_barrier_wait_start (bar);
+  if (state & BAR_WAS_LAST)
+    gomp_barrier_wait_end (bar, state);
+}
+
+void
+gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
+{
+  futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX : count);
+}
+
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  unsigned int generation, gen;
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      struct gomp_thread *thr = gomp_thread ();
+      struct gomp_team *team = thr->ts.team;
+
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
+      if (__builtin_expect (team->task_count, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          state &= ~BAR_WAS_LAST;
+        }
+      else
+        {
+          state &= ~BAR_CANCELLED;
+          state += BAR_INCR - BAR_WAS_LAST;
+          __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
+          futex_wake ((int *) &bar->generation, INT_MAX);
+          return;
+        }
+    }
+
+  generation = state;
+  state &= ~BAR_CANCELLED;
+  do
+    {
+      do_wait ((int *) &bar->generation, generation);
+      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+        }
+      generation |= gen & BAR_WAITING_FOR_TASK;
+    }
+  while (gen != state + BAR_INCR);
+}
+
+void
+gomp_team_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+}
+
+void
+gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+{
+  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    bar->awaited_final = bar->total;
+  gomp_team_barrier_wait_end (bar, state);
+}
+
+bool
+gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
+                                   gomp_barrier_state_t state)
+{
+  unsigned int generation, gen;
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      /* BAR_CANCELLED should never be set in state here, because
+         cancellation means that at least one of the threads has been
+         cancelled, thus on a cancellable barrier we should never see
+         all threads to arrive.  */
+      struct gomp_thread *thr = gomp_thread ();
+      struct gomp_team *team = thr->ts.team;
+
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
+      if (__builtin_expect (team->task_count, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          state &= ~BAR_WAS_LAST;
+        }
+      else
+        {
+          state += BAR_INCR - BAR_WAS_LAST;
+          __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
+          futex_wake ((int *) &bar->generation, INT_MAX);
+          return false;
+        }
+    }
+
+  if (__builtin_expect (state & BAR_CANCELLED, 0))
+    return true;
+
+  generation = state;
+  do
+    {
+      do_wait ((int *) &bar->generation, generation);
+      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & BAR_CANCELLED, 0))
+        return true;
+      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+        }
+      generation |= gen & BAR_WAITING_FOR_TASK;
+    }
+  while (gen != state + BAR_INCR);
+
+  return false;
+}
+
+bool
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
+{
+  return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
+}
+
+void
+gomp_team_barrier_cancel (struct gomp_team *team)
+{
+  gomp_mutex_lock (&team->task_lock);
+  if (team->barrier.generation & BAR_CANCELLED)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+  team->barrier.generation |= BAR_CANCELLED;
+  gomp_mutex_unlock (&team->task_lock);
+  futex_wake ((int *) &team->barrier.generation, INT_MAX);
+}
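
For illustration (this is not part of the patch or the testsuite), here
is a minimal example of the case that usually matters: an 'omp parallel
for' inside 'omp target' with no tasking at all, where
team->task_never_used would stay true at every implicit barrier, so all
of them could take the cheap bar.sync path:

#include <stdio.h>

#define N 1024

int
main (void)
{
  float a[N], b[N];
  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  /* No task, taskloop, master/masked or single construct appears in the
     target region, so no thread can ever be waiting on tasks at the
     implicit barriers here.  */
  #pragma omp target map(tofrom: a) map(to: b)
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] += b[i];

  printf ("a[10] = %f\n", a[10]);
  return 0;
}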
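
To sketch how the second patch is intended to use the flag (a rough
sketch under assumptions, not the actual implementation:
nvptx_simple_barrier is a hypothetical helper, the full-CTA "bar.sync 0"
form assumes all CTA threads participate in the barrier, and the real
code likely has to pass an explicit thread count):

/* Hypothetical sketch; assumes the libgomp-internal types used in the
   patch above (gomp_barrier_t, struct gomp_team, gomp_thread).  */

static inline void
nvptx_simple_barrier (void)
{
  /* PTX "bar.sync 0": block until all threads of the CTA have arrived
     at hardware barrier 0.  */
  asm volatile ("bar.sync 0;" : : : "memory");
}

void
gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
{
  struct gomp_team *team = gomp_thread ()->ts.team;

  if (__builtin_expect (team->task_never_used, 1))
    {
      /* No task was ever created in this team, so no thread can be
         waiting for tasks at this barrier: a plain hardware barrier
         suffices and the whole futex/generation protocol is skipped.  */
      if (state & BAR_WAS_LAST)
        bar->awaited = bar->total;  /* Reset for the next barrier.  */
      nvptx_simple_barrier ();
      return;
    }

  /* ... otherwise fall through to the futex-based code from the patch
     above ... */
}

Where a thread only needs to signal its arrival without waiting for the
others (e.g. the gomp_barrier_wait_last usage), a one-way bar.arrive
could be used instead of bar.sync, which is the "arrive" half of the
subject line.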
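
Finally, a sketch of the conservative flag clearing for asymmetric
constructs described above (gomp_team_sketch and
enter_asymmetric_construct are hypothetical names; the real change would
hook into the existing entry points of the single and masked/master
constructs):

#include <stdbool.h>

/* Sketch of the relevant team state.  */
struct gomp_team_sketch
{
  bool task_never_used;  /* True until tasking, or a construct that
                            could hide tasking, is seen.  */
  /* ... rest of the team state ... */
};

static void
enter_asymmetric_construct (struct gomp_team_sketch *team)
{
  /* Only a subset of the team runs the construct body, and that subset
     may create tasks the other threads never observe.  Clear the flag
     up front so every thread agrees on the same (heavyweight) barrier
     flavor at the join point.  */
  team->task_never_used = false;
}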