[nvptx,1/2] Reimplement libgomp barriers for nvptx
Commit Message
Hi Tom,
I submitted a patch earlier in which I reported that the current implementation of
barriers in libgomp on nvptx causes a quite significant performance drop on some SPEChpc 2021
benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
That previous patch was not well received (admittedly, it was something of a hack).
So in this patch, I have (mostly) re-implemented team barriers for NVPTX.
Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
barriers are now implemented simply with bar.* synchronization instructions.
Tasks are processed after threads have joined, and only if team->task_count != 0.
(Arguably, a little performance is forfeited because early-arriving threads could otherwise
have been used to process tasks ahead of the other threads. But that again would require
implementing complex futex-wait/wake-like behavior, and that kind of tasking is really not
what target offloading is usually used for.)
Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in the usual manner).
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() is now implemented using "bar.arrive".
4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
The main synchronization is done using a 'bar.red' instruction. This reduces the condition
(team->task_count != 0) across all threads, to enable the task processing below if any
thread created a task. (This bar.red usage is what requires the second GCC patch in this series.)
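In essence, the new gomp_team_barrier_wait_end() looks roughly like the following sketch
(condensed from the actual patch below; the bar->total > 1 guards, the work_share_cancelled
reset, and the cancel variant are omitted):

void
gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
{
  struct gomp_team *team = gomp_thread ()->ts.team;

  /* OR-reduce (team->task_count != 0) across all threads of the CTA, so
     every thread learns whether any thread created a task.  */
  bool run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
                                               team->task_count != 0);

  if (__builtin_expect (state & BAR_WAS_LAST, 0))
    bar->awaited = bar->total;  /* Reset for the next barrier round.  */

  if (__builtin_expect (run_tasks, 0))
    {
      /* Only now, after all threads have joined, process pending tasks,
         then synchronize once more before leaving the barrier.  */
      while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
             & BAR_TASK_PENDING)
        gomp_barrier_handle_tasks (state);
      asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
    }
}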
This patch has been tested on x86_64/powerpc64le with nvptx offloading, using the libgomp, ovo,
omptests, and sollve_vv testsuites, all without regressions. I also verified that the SPEChpc 2021
521.miniswp_t and 534.hpgmgfv_t performance regressions that occurred in the GCC 12 cycle have been
restored to devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
(I also suggest backporting this to the GCC 12 branch, if the performance regression can be
considered a defect.)
Thanks,
Chung-Lin
libgomp/ChangeLog:
2022-09-21 Chung-Lin Tang <cltang@codesourcery.com>
* config/nvptx/bar.c (generation_to_barrier): Remove.
(futex_wait, futex_wake, do_spin, do_wait): Remove.
(GOMP_WAIT_H): Remove.
(#include "../linux/bar.c"): Remove.
(gomp_barrier_wait_end): New function.
(gomp_barrier_wait): Likewise.
(gomp_barrier_wait_last): Likewise.
(gomp_team_barrier_wait_end): Likewise.
(gomp_team_barrier_wait): Likewise.
(gomp_team_barrier_wait_final): Likewise.
(gomp_team_barrier_wait_cancel_end): Likewise.
(gomp_team_barrier_wait_cancel): Likewise.
(gomp_team_barrier_cancel): Likewise.
* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
prototype, add new static inline function.
Comments
On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:
> Hi Tom,
> I had a patch submitted earlier, where I reported that the current way of implementing
> barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021
> benchmarks:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>
> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>
> Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
> barriers are implemented simplistically with bar.* synchronization instructions.
> Tasks are processed after threads have joined, and only if team->task_count != 0
>
> (arguably, there might be a little bit of performance forfeited where earlier arriving threads
> could've been used to process tasks ahead of other threads. But that again falls into requiring
> implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target
> offloading is usually used for)
I admit I don't have a good picture of whether people in the real world actually use
tasking in offloading regions, and how much and in what way, but the above would
definitely be a show-stopper for typical tasking workloads, where one thread (usually
from a master/masked/single construct's body) creates lots of tasks and can spend a
considerable amount of time in those preparations, while the other threads are expected
to handle those tasks.
Do we have an idea of how other implementations handle this?
I think it should be easily observable with atomics: have master/masked/single create
lots of tasks and then spend a long time doing something, have very small task bodies
that just increment some atomic counter, and at the end of the master/masked/single see
how many tasks have already been executed.
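A minimal sketch of such a probe might look like this (hypothetical and untested; the
iteration counts and the busy-wait loop are placeholders):

/* Tiny tasks bump an atomic counter while the single thread keeps busy
   after creating them; the snapshot taken at the end of the single
   region shows how many tasks other threads ran in the meantime.  */
#include <stdio.h>

int
main (void)
{
  int snapshot = 0;
  #pragma omp target map(tofrom: snapshot)
  {
    int done = 0;
    #pragma omp parallel
    #pragma omp single
    {
      for (int i = 0; i < 1000; i++)
        {
          #pragma omp task shared(done)
          {
            #pragma omp atomic update
            done++;
          }
        }
      for (volatile int spin = 0; spin < 100000000; spin++)
        ; /* Stand-in for "spends a long time doing something".  */
      #pragma omp atomic read
      snapshot = done;
    }
  }
  printf ("tasks already executed when single ended: %d\n", snapshot);
  return 0;
}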
Note, I don't have any smart ideas for how to handle this instead, and what
you posted might be OK for what people usually do on offloading targets
in OpenMP, if they use tasking at all; I just wanted to mention that there
could be workloads where the above is a serious problem. If there are,
say, hundreds of threads doing nothing until a single thread reaches a
barrier and there are hundreds of pending tasks...
E.g. note we have that 64 pending task limit, after which we start to
create undeferred tasks, so if we never start handling tasks until
one thread is done creating them, that would mean the single thread
would create 64 deferred tasks and then handle all the others itself,
making it take even longer until the other threads can help.
Jakub
On 2022/9/21 5:01 PM, Jakub Jelinek wrote:
> On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:
>> Hi Tom,
>> I had a patch submitted earlier, where I reported that the current way of implementing
>> barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021
>> benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>
>> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>
>> Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
>> barriers are implemented simplistically with bar.* synchronization instructions.
>> Tasks are processed after threads have joined, and only if team->task_count != 0
>>
>> (arguably, there might be a little bit of performance forfeited where earlier arriving threads
>> could've been used to process tasks ahead of other threads. But that again falls into requiring
>> implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target
>> offloading is usually used for)
>
> I admit I don't have a good picture if people in real-world actually use
> tasking in offloading regions and how much and in what way, but the above
> definitely would be a show-stopper for typical tasking workloads, where
> one thread (usually from master/masked/single construct's body) creates lots
> of tasks and can spend considerable amount of time in those preparations,
> while other threads are expected to handle those tasks.
I think the most common use case for target offloading is "parallel for".
Really, that the specification didn't simply remove tasking from target regions altogether
is just asking for trouble. If asynchronous offloaded tasks are to be supported, something
at the level of the whole GPU offload region is much more reasonable, like the async clause
functionality in OpenACC.
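For comparison, OpenACC expresses asynchrony at the region level, roughly like this
(a generic illustration, not taken from the patch):

/* Queue the whole offloaded region on async queue 1, overlap it with
   host-side work, then wait for the queue to drain.  */
#pragma acc parallel loop async(1)
for (int i = 0; i < n; i++)
  a[i] += b[i];

/* ... unrelated host-side work ... */

#pragma acc wait(1)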
> Do we have an idea how are other implementations handling this?
> I think it should be easily observable with atomics, have
> master/masked/single that creates lots of tasks and then spends a long time
> doing something, have very small task bodies that just increment some atomic
> counter and at the end of the master/masked/single see how many tasks were
> already encountered.
This could be an interesting test...
> Note, I don't have any smart ideas how to handle this instead and what
> you posted might be ok for what people usually do on offloading targets
> in OpenMP if they use tasking at all, just wanted to mention that there
> could be workloads where the above is a serious problem. If there are
> say hundreds of threads doing nothing until a single thread reaches a
> barrier and there are hundreds of pending tasks...
I think it might still be doable, just not in the very fine-grained "wake one thread" style
that the Linux-based implementation uses.
> E.g. note we have that 64 pending task limit after which we start to
> create undeferred tasks, so if we never start handling tasks until
> one thread is done with them, that would mean the single thread
> would create 64 deferred tasks and then handle all the others itself
> making it even longer until the other tasks can deal with it.
Okay, thanks for the reminder.
Chung-Lin
Ping.
Ping x2.
Ping x3.
Ping x4
Ping x5
Ping x6
On 9/21/22 09:45, Chung-Lin Tang wrote:
> Hi Tom,
> I had a patch submitted earlier, where I reported that the current way
> of implementing
> barriers in libgomp on nvptx created a quite significant performance
> drop on some SPEChpc2021
> benchmarks:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>
> That previous patch wasn't accepted well (admittedly, it was kind of a
> hack).
> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>
Ack.
> Basically, instead of trying to have the GPU do CPU-with-OS-like things
> that it isn't suited for,
> barriers are implemented simplistically with bar.* synchronization
> instructions.
> Tasks are processed after threads have joined, and only if
> team->task_count != 0
>
> (arguably, there might be a little bit of performance forfeited where
> earlier arriving threads
> could've been used to process tasks ahead of other threads. But that
> again falls into requiring
> implementing complex futex-wait/wake like behavior. Really, that kind of
> tasking is not what target
> offloading is usually used for)
>
Please try to add this insight somewhere as a comment in the code, f.i.
in the header comment of bar.h.
> Implementation highlight notes:
> 1. gomp_team_barrier_wake() is now an empty function (threads never
> "wake" in the usual manner)
> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>
> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
> The main synchronization is done using a 'bar.red' instruction. This
> reduces across all threads
> the condition (team->task_count != 0), to enable the task processing
> down below if any thread
> created a task. (this bar.red usage required the need of the second
> GCC patch in this series)
>
> This patch has been tested on x86_64/powerpc64le with nvptx offloading,
> using libgomp, ovo, omptests,
> and sollve_vv testsuites, all without regressions. Also verified that
> the SPEChpc 2021 521.miniswp_t
> and 534.hpgmgfv_t performance regressions that occurred in the GCC12
> cycle has been restored to
> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>
AFAIU the waiters and lock fields are no longer used, so they can be removed.
Yes, LGTM, please apply (after the other one).
Thanks for addressing this.
FWIW, tested on NVIDIA RTX A2000 with driver 525.60.11.
> (also suggest backporting to GCC12 branch, if performance regression can
> be considered a defect)
>
That's OK, but wait a while after applying it to trunk before doing that,
say a month.
Thanks,
- Tom
Hi!
On 2022-12-16T15:51:35+0100, Tom de Vries via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On 9/21/22 09:45, Chung-Lin Tang wrote:
>> I had a patch submitted earlier, where I reported that the current way
>> of implementing
>> barriers in libgomp on nvptx created a quite significant performance
>> drop on some SPEChpc2021
>> benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
Also: my 2022-03-17 report in <https://gcc.gnu.org/PR99555>
"[OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs":
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555#c13>.
>> This patch has been tested on x86_64/powerpc64le with nvptx offloading,
>> using libgomp, ovo, omptests,
>> and sollve_vv testsuites, all without regressions. Also verified that
>> the SPEChpc 2021 521.miniswp_t
>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12
>> cycle has been restored to
>> devel/omp/gcc-11 (OG11) branch levels.
I'm happy to confirm that this also does resolve the PR99555 issue
mentioned above, so please do reference PR99555 in the commit log.
>> Is this okay for trunk?
> Yes, LGTM, please apply (after the other one).
>
> Thanks for addressing this.
Thanks, Chung-Lin and Tom!
>> (also suggest backporting to GCC12 branch, if performance regression can
>> be considered a defect)
>
> That's ok, but wait a while after applying on trunk before doing that,
> say a month.
Regards,
Thomas
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -30,137 +30,143 @@
#include <limits.h>
#include "libgomp.h"
-/* For cpu_relax. */
-#include "doacross.h"
-
-/* Assuming ADDR is &bar->generation, return bar. Copied from
- rtems/bar.c. */
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
+ {
+ /* Next time we'll be awaiting TOTAL threads again. */
+ bar->awaited = bar->total;
+ __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+ MEMMODEL_RELEASE);
+ }
+ if (bar->total > 1)
+ asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+}
-static gomp_barrier_t *
-generation_to_barrier (int *addr)
+void
+gomp_barrier_wait (gomp_barrier_t *bar)
{
- char *bar
- = (char *) addr - __builtin_offsetof (gomp_barrier_t, generation);
- return (gomp_barrier_t *)bar;
+ gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
}
-/* Implement futex_wait-like behaviour to plug into the linux/bar.c
- implementation. Assumes ADDR is &bar->generation. */
+/* Like gomp_barrier_wait, except that if the encountering thread
+ is not the last one to hit the barrier, it returns immediately.
+ The intended usage is that a thread which intends to gomp_barrier_destroy
+ this barrier calls gomp_barrier_wait, while all other threads
+ call gomp_barrier_wait_last. When gomp_barrier_wait returns,
+ the barrier can be safely destroyed. */
-static inline void
-futex_wait (int *addr, int val)
+void
+gomp_barrier_wait_last (gomp_barrier_t *bar)
{
- gomp_barrier_t *bar = generation_to_barrier (addr);
+ /* The above described behavior matches 'bar.arrive' perfectly. */
+ if (bar->total > 1)
+ asm ("bar.arrive 1, %0;" : : "r" (32 * bar->total));
+}
- if (bar->total < 2)
- /* A barrier with less than two threads, nop. */
- return;
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+ struct gomp_thread *thr = gomp_thread ();
+ struct gomp_team *team = thr->ts.team;
- gomp_mutex_lock (&bar->lock);
+ bool run_tasks = (team->task_count != 0);
+ if (bar->total > 1)
+ run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
+ (team->task_count != 0));
- /* Futex semantics: only go to sleep if *addr == val. */
- if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_ACQUIRE) != val, 0))
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
{
- gomp_mutex_unlock (&bar->lock);
- return;
+ /* Next time we'll be awaiting TOTAL threads again. */
+ bar->awaited = bar->total;
+ team->work_share_cancelled = 0;
}
- /* Register as waiter. */
- unsigned int waiters
- = __atomic_add_fetch (&bar->waiters, 1, MEMMODEL_ACQ_REL);
- if (waiters == 0)
- __builtin_abort ();
- unsigned int waiter_id = waiters;
-
- if (waiters > 1)
+ if (__builtin_expect (run_tasks == true, 0))
{
- /* Wake other threads in bar.sync. */
- asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
+ while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
+ & BAR_TASK_PENDING)
+ gomp_barrier_handle_tasks (state);
- /* Ensure that they have updated waiters. */
- asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
+ if (bar->total > 1)
+ asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
}
+}
- gomp_mutex_unlock (&bar->lock);
-
- while (1)
- {
- /* Wait for next thread in barrier. */
- asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
-
- /* Get updated waiters. */
- unsigned int updated_waiters
- = __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
-
- /* Notify that we have updated waiters. */
- asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
-
- waiters = updated_waiters;
-
- if (waiter_id > waiters)
- /* A wake happened, and we're in the group of woken threads. */
- break;
-
- /* Continue waiting. */
- }
+void
+gomp_team_barrier_wait (gomp_barrier_t *bar)
+{
+ gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
}
-/* Implement futex_wake-like behaviour to plug into the linux/bar.c
- implementation. Assumes ADDR is &bar->generation. */
+void
+gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+{
+ gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
+ bar->awaited_final = bar->total;
+ gomp_team_barrier_wait_end (bar, state);
+}
-static inline void
-futex_wake (int *addr, int count)
+bool
+gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
+ gomp_barrier_state_t state)
{
- gomp_barrier_t *bar = generation_to_barrier (addr);
+ struct gomp_thread *thr = gomp_thread ();
+ struct gomp_team *team = thr->ts.team;
- if (bar->total < 2)
- /* A barrier with less than two threads, nop. */
- return;
+ bool run_tasks = (team->task_count != 0);
+ if (bar->total > 1)
+ run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
+ (team->task_count != 0));
+ if (state & BAR_CANCELLED)
+ return true;
- gomp_mutex_lock (&bar->lock);
- unsigned int waiters = __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
- if (waiters == 0)
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
{
- /* No threads to wake. */
- gomp_mutex_unlock (&bar->lock);
- return;
+ /* Note: BAR_CANCELLED should never be set in state here, because
+ cancellation means that at least one of the threads has been
+ cancelled, thus on a cancellable barrier we should never see
+ all threads to arrive. */
+
+ /* Next time we'll be awaiting TOTAL threads again. */
+ bar->awaited = bar->total;
+ team->work_share_cancelled = 0;
}
- if (count == INT_MAX)
- /* Release all threads. */
- __atomic_store_n (&bar->waiters, 0, MEMMODEL_RELEASE);
- else if (count < bar->total)
- /* Release count threads. */
- __atomic_add_fetch (&bar->waiters, -count, MEMMODEL_ACQ_REL);
- else
- /* Count has an illegal value. */
- __builtin_abort ();
-
- /* Wake other threads in bar.sync. */
- asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
+ if (__builtin_expect (run_tasks == true, 0))
+ {
+ while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
+ & BAR_TASK_PENDING)
+ gomp_barrier_handle_tasks (state);
- /* Let them get the updated waiters. */
- asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
+ if (bar->total > 1)
+ asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+ }
- gomp_mutex_unlock (&bar->lock);
+ return false;
}
-/* Copied from linux/wait.h. */
-
-static inline int do_spin (int *addr, int val)
+bool
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
{
- /* The current implementation doesn't spin. */
- return 1;
+ return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
}
-/* Copied from linux/wait.h. */
-
-static inline void do_wait (int *addr, int val)
+void
+gomp_team_barrier_cancel (struct gomp_team *team)
{
- if (do_spin (addr, val))
- futex_wait (addr, val);
-}
+ gomp_mutex_lock (&team->task_lock);
+ if (team->barrier.generation & BAR_CANCELLED)
+ {
+ gomp_mutex_unlock (&team->task_lock);
+ return;
+ }
+ team->barrier.generation |= BAR_CANCELLED;
+ gomp_mutex_unlock (&team->task_lock);
-/* Reuse the linux implementation. */
-#define GOMP_WAIT_H 1
-#include "../linux/bar.c"
+ /* The 'exit' instruction cancels this thread and also fulfills any other
+ CTA threads waiting on barriers. */
+ asm volatile ("exit;");
+}
--- a/libgomp/config/nvptx/bar.h
+++ b/libgomp/config/nvptx/bar.h
@@ -83,10 +83,16 @@ extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
gomp_barrier_state_t);
-extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
struct gomp_team;
extern void gomp_team_barrier_cancel (struct gomp_team *);
+static inline void
+gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
+{
+ /* We never "wake up" threads on nvptx. Threads wait at barrier
+ instructions until the barrier is fulfilled. Do nothing here. */
+}
+
static inline gomp_barrier_state_t
gomp_barrier_wait_start (gomp_barrier_t *bar)
{