From patchwork Tue Jan 23 09:55:16 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tobias Burnus <tburnus@baylibre.com>
X-Patchwork-Id: 190822
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:7300:2553:b0:103:945f:af90 with SMTP id p19csp225976dyi;
        Tue, 23 Jan 2024 01:56:19 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IGl2kIHUdRudkZ06tYx1fP6k4B6y97QhLGiz3KG4d3e3QDIHrWwQrSI5i5yow8fbKtu0PJH
X-Received: by 2002:a05:620a:12c2:b0:783:7508:3fa0 with SMTP id
 e2-20020a05620a12c200b0078375083fa0mr6864263qkl.45.1706003778892;
        Tue, 23 Jan 2024 01:56:18 -0800 (PST)
Received: from server2.sourceware.org (server2.sourceware.org.
 [2620:52:3:1:0:246e:9693:128c])
        by mx.google.com with ESMTPS id
 dc30-20020a05620a521e00b007839efb5510si3954042qkb.278.2024.01.23.01.56.18
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 23 Jan 2024 01:56:18 -0800 (PST)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates
 2620:52:3:1:0:246e:9693:128c as permitted sender)
 client-ip=2620:52:3:1:0:246e:9693:128c;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify)
 header.i=@baylibre-com.20230601.gappssmtp.com header.s=20230601
 header.b=yjt5yzix;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates
 2620:52:3:1:0:246e:9693:128c as permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org"
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 910733858C5E
	for <ouuuleilei@gmail.com>; Tue, 23 Jan 2024 09:56:18 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mail-wm1-x334.google.com (mail-wm1-x334.google.com
 [IPv6:2a00:1450:4864:20::334])
 by sourceware.org (Postfix) with ESMTPS id 0F0173858C50
 for <gcc-patches@gcc.gnu.org>; Tue, 23 Jan 2024 09:55:20 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 0F0173858C50
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=baylibre.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=baylibre.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 0F0173858C50
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2a00:1450:4864:20::334
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1706003723; cv=none;
 b=PETwhOYLcn0afJZocluW2yVuMTrRm5PhrbhC1wguZu0mgez89hO8ZhOjZlmIU/VGwSnTnQLDz7ccuU5OjiEQQi7Y1bnPzjB/5Id7/eKIs8J6IasM5WAIo1JmwQCCTtlviUy9kjT7k3evf6f19fiTf/+lm/1cegDAFNYm2p05GsY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1706003723; c=relaxed/simple;
 bh=8muCmBNpMJNvNPUDCJnVX/MF2SSA75tGALG5HJq9ByU=;
 h=DKIM-Signature:Message-ID:Date:MIME-Version:Subject:From:To;
 b=cREhhUCVma0PLI/1s59JNuWjZs93gDGP0vDSNQFD575jTlFaN3zESJpRIXDOUhGiwXiQOQLq7+iPtKnFWXv2noqPW9LG6HARgFc+qH9pVnMcn01ssq0jReg8yRTbsxuIQiYefpX+hcBOf7Hg/AOIk9epe+wcn/ARETxrpSBkxMc=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-wm1-x334.google.com with SMTP id
 5b1f17b1804b1-40e60e135a7so40410255e9.0
 for <gcc-patches@gcc.gnu.org>; Tue, 23 Jan 2024 01:55:20 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=baylibre-com.20230601.gappssmtp.com; s=20230601; t=1706003718;
 x=1706608518;
 darn=gcc.gnu.org;
 h=in-reply-to:references:to:from:content-language:subject:user-agent
 :mime-version:date:message-id:from:to:cc:subject:date:message-id
 :reply-to; bh=fUQ+1tLAyj4rTU0XHnya0ScNLD4S+9FwQJZWdPpwiYI=;
 b=yjt5yzixbYGgYCA47g0qZS1pWaLoVAjyTJFXWYHcYTWkUvp44eyndaQCJoqN3qFXvt
 TNpmStrojQevmG0uEGfijnPhNfKOkZfXhhBCQmmzV6tItEz/pdM+mwGUpYfjHgbJ4V9T
 thQ6LwybYzKKa5opA5r2ikVpm2PbDnnD1tXNgqP07/rlXfqjxpmS6VyDiYOPV27IILvs
 pa4FNHm5hX9bqV7oHw+b7agx8uLj38/h6T2w8M3p72W/Mdlr9/KM9lNHQ+m7VNtyJelX
 8pSvjR7k/tTZ3wOBADhwKunTfdE5UQFdyc0Nkwcsz/6btndoUve3JPxKtbFWsDe/Izw7
 jnjw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1706003718; x=1706608518;
 h=in-reply-to:references:to:from:content-language:subject:user-agent
 :mime-version:date:message-id:x-gm-message-state:from:to:cc:subject
 :date:message-id:reply-to;
 bh=fUQ+1tLAyj4rTU0XHnya0ScNLD4S+9FwQJZWdPpwiYI=;
 b=vV+kczz2hfHqYnCpzWS6bD97eoDi0EgBCYGv5NryrNHZcztoq9UYZyneEjCsf8zIC0
 e7K5/gUnM1m5jgot8dCSSuVdpy2iNWLz+HD4dq/eGkJ9+wheVGf595xEUr0374uGx4ie
 7lR3pdRiNHCn2N7cSyoQmeKrWBhxGwDtDMBOd+9TTsMPCaSY1Bl951ySfZnhC1vsgFbm
 UXg5lOT7nRO13Xx2JMgOa0QyU0NVhiFsO35FikvY1Cezqun05KTUF4XYMWgT5w016OUN
 +xGkE5JmiJ6vmY+S59IExra7B+SH0jYIkfS5iTbP9Ievf/5N/eYnfjo0RQlEkBVmaP2A
 6O3g==
X-Gm-Message-State: AOJu0YyHi2kGYe4eJXVGb4l9sSue+JVApf0RcABs69CO2dPl2KPq1hw0
 axbl3ugy3zQ/5LFhQqFsrroaNOAjLtMDdreV4Waka7GqX7rnutYDYXt/0UdTVIpdfwg4gltId5L
 5jBY=
X-Received: by 2002:a05:600c:2ed3:b0:40e:42ca:94f6 with SMTP id
 q19-20020a05600c2ed300b0040e42ca94f6mr365711wmn.101.1706003718534;
 Tue, 23 Jan 2024 01:55:18 -0800 (PST)
Received: from ?IPV6:2001:16b8:3f0c:aa00:be03:58ff:fe31:f74?
 ([2001:16b8:3f0c:aa00:be03:58ff:fe31:f74])
 by smtp.gmail.com with ESMTPSA id
 m35-20020a05600c3b2300b0040e541ddcb1sm42033783wms.33.2024.01.23.01.55.17
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Tue, 23 Jan 2024 01:55:18 -0800 (PST)
Message-ID: <53a3c4e3-452c-4445-8d4a-be66dccc9e45@baylibre.com>
Date: Tue, 23 Jan 2024 10:55:16 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: [v2][patch] plugin/plugin-nvptx.c: Fix fini_device call when already
 shutdown [PR113513]
Content-Language: en-US
From: Tobias Burnus <tburnus@baylibre.com>
To: gcc-patches <gcc-patches@gcc.gnu.org>,
 Thomas Schwinge <tschwinge@baylibre.com>, Jakub Jelinek <jakub@redhat.com>
References: <30b08783-4f6d-4ae1-9459-9391fc8e6262@baylibre.com>
In-Reply-To: <30b08783-4f6d-4ae1-9459-9391fc8e6262@baylibre.com>
X-Spam-Status: No, score=-13.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, GIT_PATCH_0, HTML_MESSAGE, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,
 SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1788821134209539057
X-GMAIL-MSGID: 1788874618243745174

Slightly changed patch:

nvptx_attach_host_thread_to_device now fails again with an error for
CUDA_ERROR_DEINITIALIZED, except when it is called from
GOMP_OFFLOAD_fini_device.

I think it makes more sense that way.
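
To illustrate the return convention, here is a minimal, self-contained
sketch. It is not the plugin code: the enum mock_result and the helpers
attach, fini_device and alloc_like are hypothetical stand-ins that only
mirror the 1 / 0 / -1 scheme (success / error with message / already
deinitialized, and only if the caller asked for it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for CUDA driver result codes; the real plugin
   uses CUresult from the CUDA headers.  */
enum mock_result { MOCK_SUCCESS, MOCK_DEINITIALIZED, MOCK_OTHER_ERROR };

/* Number of error messages issued, to make the behavior observable.  */
static int errors_reported;

/* Returns 1 on success, 0 on error (a message is issued), and -1 for
   "already deinitialized", but only if FINI_OKAY is set.  */
static int
attach (enum mock_result r, bool fini_okay)
{
  if (fini_okay && r == MOCK_DEINITIALIZED)
    return -1;
  if (r != MOCK_SUCCESS)
    {
      fprintf (stderr, "attach error: %d\n", (int) r);
      errors_reported++;
      return 0;
    }
  return 1;
}

/* Finalization caller: fail only for 0, close the device only for 1,
   and treat -1 as "nothing left to do".  CLOSED records whether the
   close step ran.  */
static bool
fini_device (enum mock_result r, bool *closed)
{
  assert (closed != NULL);
  *closed = false;
  int res = attach (r, true);
  if (res == 0)
    return false;
  if (res == 1)
    *closed = true;  /* Stands in for nvptx_close_device.  */
  return true;
}

/* Any other caller passes false, so -1 is never returned and a plain
   truth test on the result keeps working.  */
static bool
alloc_like (enum mock_result r)
{
  return attach (r, false) != 0;
}
```

In this sketch, only fini_device tolerates the deinitialized state
silently; alloc_like still reports it as an error.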

Tobias Burnus wrote:
> Testing showed that the libgomp.c/target-52.c failed with:
>
> libgomp: cuCtxGetDevice error: unknown cuda error
>
> libgomp: device finalization failed
>
> This testcase uses OMP_DISPLAY_ENV=true and 
> OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only 
> fails if dg-set-target-env-var is honored.
>
> If both env vars are set, the device initialization occurs earlier as 
> OMP_DEFAULT_DEVICE is shown due to the display-env env var and its 
> value (when target-offload-var is 'mandatory') might be either 
> 'omp_invalid_device' or '0'.
>
> It turned out that this had an effect on device finalization, which 
> caused CUDA to stop earlier than expected. This patch now handles this 
> case gracefully. For details, see the commit log message in the 
> attached patch and/or the PR.
>
> Comments, remarks, suggestions?
>
> Does this look sensible? (I would like to see some acknowledgement by 
> someone who feels more comfortable with CUDA than me.)

Tobias

plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]

The following issue was found when running libgomp.c/target-52.c with
nvptx offloading when the dg-set-target-env-var was honored. The issue
occurred both with -foffload=disable and with offloading configured when
an Nvidia device is available.

At the end of the program, the offloading parts are shut down in two ways:
via the callback registered with 'atexit (gomp_target_fini)' and, via code
generated in mkoffload, the '__attribute__((destructor)) fini' function
that calls GOMP_offload_unregister_ver.

In normal processing, gomp_target_fini is called first, which sets
GOMP_DEVICE_FINALIZED for the device. GOMP_offload_unregister_ver is
called later, but it then does nothing because the state is
GOMP_DEVICE_FINALIZED.
If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
the call omp_display_env already invokes gomp_init_targets_once, i.e. it
occurs earlier than usual and is invoked via __attribute__((constructor))
initialize_env.

For unknown reasons, while this does not affect the order of the plugin
functions called for initialization, it changes the order of the function
calls for shutting down. Namely, when the two environment variables are
set, GOMP_offload_unregister_ver is now called before gomp_target_fini.
It also seems as if CUDA regards a call to cuModuleUnload (or unloading
the last module?) as an indication that the device context should be
destroyed; at least, calling cuCtxGetDevice afterwards returns
CUDA_ERROR_DEINITIALIZED.
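
The two shutdown orders can be modeled with a small state machine. This
is only a sketch under stated assumptions: struct device, unregister_image
and target_fini are hypothetical names, and the rule that unloading the
last image invalidates the context encodes the observation above, not
documented CUDA behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device states named in the text.  */
enum dev_state { DEV_UNINITIALIZED, DEV_INITIALIZED, DEV_FINALIZED };

struct device
{
  enum dev_state state;
  int images_loaded;
  bool context_alive;  /* Assumption: models the CUDA context.  */
};

/* Models GOMP_offload_unregister_ver: acts only on an initialized
   device.  Returns true if an image was unloaded.  */
static bool
unregister_image (struct device *d)
{
  if (d->state != DEV_INITIALIZED || d->images_loaded == 0)
    return false;
  d->images_loaded--;
  /* Assumption from the observation in the text: after unloading the
     last module, the context behaves as deinitialized.  */
  if (d->images_loaded == 0)
    d->context_alive = false;
  return true;
}

/* Models gomp_target_fini: finalizes the device and reports whether
   the context was still alive when finalization started.  */
static bool
target_fini (struct device *d)
{
  bool was_alive = d->context_alive;
  if (d->state == DEV_INITIALIZED)
    {
      d->context_alive = false;
      d->state = DEV_FINALIZED;
    }
  return was_alive;
}
```

Calling target_fini first makes the later unregister_image a no-op;
calling unregister_image first leaves target_fini with a context that is
already gone, which is the case the finalization code has to tolerate.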

As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
that result, it called
  GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
causing a fatal error of the program.

This commit now handles CUDA_ERROR_DEINITIALIZED specially such that
GOMP_OFFLOAD_fini_device just works.

While reading the code, the following was also observed:
When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
to ensure that the queue is emptied.  It seems to make sense to do
likewise for GOMP_offload_unregister_ver, which this commit also does.

libgomp/ChangeLog:

	PR libgomp/113513
	* target.c (GOMP_offload_unregister_ver): Call goacc_fini_asyncqueues
	before invoking gomp_unload_image_from_device.
	* plugin/plugin-nvptx.c (nvptx_attach_host_thread_to_device): Change
	return type to int and add new bool arg; if true, return -1 for
	CUDA_ERROR_DEINITIALIZED.
	(GOMP_OFFLOAD_fini_device): Handle the deinitialized case gracefully.
	(GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_alloc, GOMP_OFFLOAD_free,
	GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host, GOMP_OFFLOAD_memcpy2d,
	GOMP_OFFLOAD_memcpy3d, GOMP_OFFLOAD_openacc_async_host2dev,
	GOMP_OFFLOAD_openacc_async_dev2host): Update calls.

Signed-off-by: Tobias Burnus <tburnus@baylibre.com>

 libgomp/plugin/plugin-nvptx.c | 46 ++++++++++++++++++++++++++-----------------
 libgomp/target.c              |  7 +++++--
 2 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index c04c3acd679..318d3d2aca6 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -382,10 +382,13 @@ nvptx_init (void)
 }
 
 /* Select the N'th PTX device for the current host thread.  The device must
-   have been previously opened before calling this function.  */
+   have been previously opened before calling this function.
+   Returns 1 if successful, 0 if an error occurred and a message has been
+   issued; if fini_okay, -1 is returned for CUDA_ERROR_DEINITIALIZED and
+   no error message is printed in that case.  */
 
-static bool
-nvptx_attach_host_thread_to_device (int n)
+static int
+nvptx_attach_host_thread_to_device (int n, bool fini_okay)
 {
   CUdevice dev;
   CUresult r;
@@ -393,15 +396,17 @@ nvptx_attach_host_thread_to_device (int n)
   CUcontext thd_ctx;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
+  if (fini_okay && r == CUDA_ERROR_DEINITIALIZED)
+    return -1;
   if (r == CUDA_ERROR_NOT_PERMITTED)
     {
       /* Assume we're in a CUDA callback, just return true.  */
-      return true;
+      return 1;
     }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
-      return false;
+      return 0;
     }
 
   if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
@@ -414,7 +419,7 @@ nvptx_attach_host_thread_to_device (int n)
       if (!ptx_dev)
 	{
 	  GOMP_PLUGIN_error ("device %d not found", n);
-	  return false;
+	  return 0;
 	}
 
       CUDA_CALL (cuCtxGetCurrent, &thd_ctx);
@@ -426,7 +431,7 @@ nvptx_attach_host_thread_to_device (int n)
 
       CUDA_CALL (cuCtxPushCurrent, ptx_dev->ctx);
     }
-  return true;
+  return 1;
 }
 
 static struct ptx_device *
@@ -1252,8 +1257,11 @@ GOMP_OFFLOAD_fini_device (int n)
 
   if (ptx_devices[n] != NULL)
     {
-      if (!nvptx_attach_host_thread_to_device (n)
-	  || !nvptx_close_device (ptx_devices[n]))
+      /* Returns 1 if successful, 0 if an error occurred, and -1 for
+	 CUDA_ERROR_DEINITIALIZED.  */
+      int r = nvptx_attach_host_thread_to_device (n, true);
+      if (r == 0
+	  || (r == 1 && !nvptx_close_device (ptx_devices[n])))
 	{
 	  pthread_mutex_unlock (&ptx_dev_lock);
 	  return false;
@@ -1329,7 +1337,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       return -1;
     }
 
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !link_ptx (&module, img_header->ptx_objs, img_header->ptx_num))
     return -1;
 
@@ -1568,7 +1576,7 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 void *
 GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
-  if (!nvptx_attach_host_thread_to_device (ord))
+  if (!nvptx_attach_host_thread_to_device (ord, false))
     return NULL;
 
   struct ptx_device *ptx_dev = ptx_devices[ord];
@@ -1604,7 +1612,7 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
 bool
 GOMP_OFFLOAD_free (int ord, void *ptr)
 {
-  return (nvptx_attach_host_thread_to_device (ord)
+  return (nvptx_attach_host_thread_to_device (ord, false)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
@@ -1837,7 +1845,7 @@ cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
 bool
 GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
@@ -1847,7 +1855,7 @@ GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 bool
 GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
@@ -1868,7 +1876,8 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
 		       const void *src, size_t src_offset1_size,
 		       size_t src_offset0_len, size_t src_dim1_size)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord,
+					   false))
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -1960,7 +1969,8 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
 		       size_t src_offset0_len, size_t src_dim2_size,
 		       size_t src_dim1_len)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord,
+					   false))
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -2050,7 +2060,7 @@ bool
 GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
@@ -2061,7 +2071,7 @@ bool
 GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
diff --git a/libgomp/target.c b/libgomp/target.c
index 1367e9cce6c..8d05877deb7 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2706,8 +2706,11 @@ GOMP_offload_unregister_ver (unsigned version, const void *host_table,
       gomp_mutex_lock (&devicep->lock);
       if (devicep->type == target_type
 	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_unload_image_from_device (devicep, version,
-				       host_table, target_data);
+	{
+	  goacc_fini_asyncqueues (devicep);
+	  gomp_unload_image_from_device (devicep, version,
+					 host_table, target_data);
+	}
       gomp_mutex_unlock (&devicep->lock);
     }