nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes)
Hi Chung-Lin, Tom!
It's been a while:
On 2018-09-25T21:11:58+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> [...] NVPTX/CUDA-specific implementation
> of the new-style goacc_asyncqueues.
In an OpenACC 'async' setting, where the device kernel (expectedly)
crashes because of "an illegal memory access was encountered", I'm
running into a deadlock here:
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> +static void
> +cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
> +{
> + if (res != CUDA_SUCCESS)
> + GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
> + struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
> + cb->fn (cb->ptr);
> + free (ptr);
> +}
> +
> +void
> +GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
> + void (*callback_fn)(void *),
> + void *userptr)
> +{
> + struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
> + b->fn = callback_fn;
> + b->ptr = userptr;
> + b->aq = aq;
> + CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
> + cuda_callback_wrapper, (void *) b, 0);
> +}
In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
(..., which deadlocks); that's generally problematic: per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>
"'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".
Given that eventually we must reach a host/device synchronization point
(latest when the device is shut down at program termination), and the
non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
"nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
attached. OK to push?
(Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
error will be caught and reported at the next host/device synchronization
point? But I've not verified that.)
Grüße
Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Hi Thomas,
On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
> (..., which deadlocks); that's generally problematic: per
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".
I remember running into this myself when first creating this async support
(IIRC in my case it was cuFree()-ing something) yet you've found another mistake here! :)
> Given that eventually we must reach a host/device synchronization point
> (latest when the device is shut down at program termination), and the
> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
> attached. OK to push?
I think this patch is fine. Actual approval powers are yours or Tom's :)
>
> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
> error will be caught and reported at the next host/device synchronization
> point? But I've not verified that.)
Actually, the CUDA driver API docs are a bit vague on what exactly this
CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS' handling
here was basically just generic handling. I am not really sure what is the
true right thing to do here (is the error still retained by CUDA after the callback
completes?)
Chung-Lin
Hi!
On 2023-01-13T21:17:43+0800, Chung-Lin Tang <chunglin.tang@siemens.com> wrote:
> On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
>> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
>> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
>> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
>> (..., which deadlocks); that's generally problematic: per
>> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
>> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".
>
> I remember running into this myself when first creating this async support
> (IIRC in my case it was cuFree()-ing something) yet you've found another mistake here! :)
;-)
>> Given that eventually we must reach a host/device synchronization point
>> (latest when the device is shut down at program termination), and the
>> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
>> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
>> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
>> attached. OK to push?
>
> I think this patch is fine. Actual approval powers are yours or Tom's :)
ACK. I'll let it sit for some more time before 'git push'.
>> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
>> error will be caught and reported at the next host/device synchronization
>> point? But I've not verified that.)
>
> Actually, the CUDA driver API docs are a bit vague on what exactly this
> CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS' handling
> here was basically just generic handling.
I suppose this really is just for the callback's own use: for example, to
skip certain things in the presence of a pre-existing error?
> I am not really sure what is the
> true right thing to do here (is the error still retained by CUDA after the callback
> completes?)
Indeed the latter is what I do observe:
GOMP_OFFLOAD_openacc_async_exec: prepare mappings
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32
nvptx_exec: kernel main$_omp_fn$0: finished
libgomp: cuMemcpyDtoHAsync_v2 error: an illegal memory access was encountered
libgomp:
libgomp: Copying of dev object [0x7f9a45000000..0x7f9a45000028) to host object [0x1d89350..0x1d89378) failed
cuda_callback_wrapper error: an illegal memory access was encountered
libgomp: cuStreamDestroy error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Here, after the 'async' OpenACC 'parallel' a 'copyout' gets enqueued,
thus 'cuMemcpyDtoHAsync_v2', which is where we first get the device-side
fault reported (all as expected). Then -- CUDA-internally
multi-threaded, I suppose (thus the mangled printing) -- we print the
'Copying [...] failed' error plus get 'cuda_callback_wrapper' invoked.
This receives the previous 'CUresult' as seen, and then the error is
still visible at device shut-down, as shown by the following reports.
(This makes sense, as the 'CUcontext' does not magically recover.)
Also, per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>,
"In the event of a device error, all subsequently executed callbacks will
receive an appropriate 'CUresult'".
But again: I'm perfectly fine with the repeated error reporting.
Grüße
Thomas
From b7ddcc0807967750e3c884326ed4c53c05cde81f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 12 Jan 2023 14:39:46 +0100
Subject: [PATCH] nvptx: Avoid deadlock in 'cuStreamAddCallback' callback,
error case
When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
(..., which may deadlock); that's generally problematic: per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>
"'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".
Given that eventually we must reach a host/device synchronization point
(latest when the device is shut down at program termination), and the
non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error'.
libgomp/
* plugin/plugin-nvptx.c (cuda_callback_wrapper): Invoke
'GOMP_PLUGIN_error' instead of 'GOMP_PLUGIN_fatal'.
---
libgomp/plugin/plugin-nvptx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
@@ -1927,7 +1927,7 @@ static void
cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
{
if (res != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+ GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (res));
struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
cb->fn (cb->ptr);
free (ptr);
--
2.39.0