libgomp.texi: Document libmemkind + nvptx/gcn specifics
Commit Message
I have had this patch lying around for about half a year. I tweaked and augmented it
a bit today, but finally want to get rid of it (locally - by getting it committed) ...
This patch changes -misa to -march for nvptx (the latter is now an alias
for the former), adds a new section about libmemkind, and adds some information
about the internals of our nvptx/gcn implementation. (The latter should be mostly
correct, but I might have missed some fine print or a more recent update.)
OK for mainline?
Tobias
Comments
On Mon, Aug 29, 2022 at 12:54:33PM +0200, Tobias Burnus wrote:
> I have had this patch lying around for about half a year. I tweaked and augmented it
> a bit today, but finally want to get rid of it (locally - by getting it committed) ...
>
> This patch changes -misa to -march for nvptx (the latter is now an alias
> for the former), adds a new section about libmemkind, and adds some information
> about the internals of our nvptx/gcn implementation. (The latter should be mostly
> correct, but I might have missed some fine print or a more recent update.)
>
> OK for mainline?
>
> Tobias
>
>
> libgomp.texi: Document libmemkind + nvptx/gcn specifics
>
> libgomp/ChangeLog:
>
> * libgomp.texi (OpenMP-Implementation Specifics): New; add libmemkind
> section; move OpenMP Context Selectors from ...
> (Offload-Target Specifics): ... here; add 'AMD Radeon (GCN)' and
> 'nvptx' sections.
> +All OpenMP and OpenACC levels are used, i.e.
> +@itemize
> +@item OpenMP's simd and OpenACC's vector map to work items (thread)
> +@item OpenMP's threads (``parallel'') and OpenACC's workers map
> + to wavefronts
> +@item OpenMP's teams and OpenACC's gang use use a threadpool with the
s/use use/use/
> +All OpenMP and OpenACC levels are used, i.e.
> +@itemize
> +@item OpenMP's simd and OpenACC's vector map to threads
> +@item OpenMP's threads (``parallel'') and OpenACC's workers map to warps
> +@item OpenMP's teams and OpenACC's gang use use a threadpool with the
Again.
Otherwise LGTM.
Jakub
libgomp.texi: Document libmemkind + nvptx/gcn specifics
libgomp/ChangeLog:
* libgomp.texi (OpenMP-Implementation Specifics): New; add libmemkind
section; move OpenMP Context Selectors from ...
(Offload-Target Specifics): ... here; add 'AMD Radeon (GCN)' and
'nvptx' sections.
libgomp/libgomp.texi | 132 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 126 insertions(+), 6 deletions(-)
@@ -113,6 +113,8 @@ changed to GNU Offloading and Multi Processing Runtime Library.
* OpenACC Library Interoperability:: OpenACC library interoperability with the
NVIDIA CUBLAS library.
* OpenACC Profiling Interface::
+* OpenMP-Implementation Specifics:: Notes on specifics of this OpenMP
+                                    implementation
* Offload-Target Specifics:: Notes on offload-target specific internals
* The libgomp ABI:: Notes on the external ABI presented by libgomp.
* Reporting Bugs:: How to report bugs in the GNU Offloading and
@@ -4280,16 +4282,15 @@ offloading devices (it's not clear if they should be):
@end itemize
@c ---------------------------------------------------------------------
-@c Offload-Target Specifics
+@c OpenMP-Implementation Specifics
@c ---------------------------------------------------------------------
-@node Offload-Target Specifics
-@chapter Offload-Target Specifics
-
-The following sections present notes on the offload-target specifics.
+@node OpenMP-Implementation Specifics
+@chapter OpenMP-Implementation Specifics
@menu
* OpenMP Context Selectors::
+* Memory allocation with libmemkind::
@end menu
@node OpenMP Context Selectors
@@ -4308,9 +4309,128 @@ The following sections present notes on the offload-target specifics.
@tab See @code{-march=} in ``AMD GCN Options''
@item @code{nvptx}
@tab @code{gpu}
- @tab See @code{-misa=} in ``Nvidia PTX Options''
+ @tab See @code{-march=} in ``Nvidia PTX Options''
@end multitable
+@node Memory allocation with libmemkind
+@section Memory allocation with libmemkind
+
+On Linux systems, where the @uref{https://github.com/memkind/memkind, memkind
+library} (@code{libmemkind.so.0}) is available at runtime, it is used when
+creating memory allocators requesting
+
+@itemize
+@item the memory space @code{omp_high_bw_mem_space}
+@item the memory space @code{omp_large_cap_mem_space}
+@item the partition trait @code{omp_atv_interleaved}
+@end itemize
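
For illustration, here is a minimal sketch that only uses standard OpenMP 5.x
allocator routines (the trait/space names are plain OpenMP; whether libmemkind
actually serves the request depends on the runtime conditions described above):

    #include <omp.h>

    int
    main (void)
    {
      /* High-bandwidth memory space plus the interleaved partition trait;
         per the above, libgomp may dispatch this to libmemkind when
         libmemkind.so.0 is found at runtime.  */
      omp_alloctrait_t traits[]
        = { { omp_atk_partition, omp_atv_interleaved } };
      omp_allocator_handle_t al
        = omp_init_allocator (omp_high_bw_mem_space, 1, traits);
      if (al == omp_null_allocator)
        return 1;  /* Space/trait unavailable; no fallback was requested.  */
      double *p = (double *) omp_alloc (1024 * sizeof (double), al);
      if (p != NULL)
        omp_free (p, al);
      omp_destroy_allocator (al);
      return 0;
    }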
+
+
+@c ---------------------------------------------------------------------
+@c Offload-Target Specifics
+@c ---------------------------------------------------------------------
+
+@node Offload-Target Specifics
+@chapter Offload-Target Specifics
+
+The following sections present notes on the offload-target specifics.
+
+@menu
+* AMD Radeon::
+* nvptx::
+@end menu
+
+@node AMD Radeon
+@section AMD Radeon (GCN)
+
+On the hardware side, there is the hierarchy (fine to coarse):
+@itemize
+@item work item (thread)
+@item wavefront
+@item work group
+@item compute unit (CU)
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to work items (threads)
+@item OpenMP's threads (``parallel'') and OpenACC's workers map
+ to wavefronts
+@item OpenMP's teams and OpenACC's gang use a threadpool whose size
+      equals the number of teams or gangs, respectively.
+@end itemize
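
As a sketch of that mapping (the function name and code are made up for
illustration), the three directives of a combined construct land on the
three hardware levels like this:

    void
    scale (int n, double *x)
    {
      /* teams -> threadpool, parallel -> wavefronts, simd -> work items.  */
      #pragma omp target teams distribute parallel for simd map(tofrom: x[0:n])
      for (int i = 0; i < n; i++)
        x[i] *= 2.0;
    }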
+
+The used sizes are:
+@itemize
+@item Number of teams is the specified @code{num_teams} (OpenMP) or
+      @code{num_gangs} (OpenACC) or otherwise the number of CUs
+@item Number of wavefronts is 4 for gfx900 and 16 otherwise;
+ @code{num_threads} (OpenMP) and @code{num_workers} (OpenACC)
+      override this if smaller.
+@item The wavefront has 102 scalars and 64 vectors
+@item Number of work items is always 64
+@item The hardware permits maximally 40 workgroups/CU and
+ 16 wavefronts/workgroup up to a limit of 40 wavefronts in total per CU.
+@item 80 scalar registers and 24 vector registers in non-kernel functions
+ (the chosen procedure-calling API).
+@item For the kernel itself: as many as register pressure demands (number of
+ teams and number of threads, scaled down if registers are exhausted)
+@end itemize
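
Working through those numbers (a sketch; the clause values are made up):
on gfx900 a team by default runs 4 wavefronts of 64 work items each, i.e.
256 work items per workgroup; a smaller num_workers request reduces the
wavefront count:

    void
    add_one (int n, double *x)
    {
      /* num_gangs(8): 8 threadpool entries; num_workers(2) is smaller
         than the gfx900 default of 4, so each workgroup runs only
         2 wavefronts of 64 work items.  */
      #pragma acc parallel loop gang worker num_gangs(8) num_workers(2) copy(x[0:n])
      for (int i = 0; i < n; i++)
        x[i] += 1.0;
    }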
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+ using the C library @code{printf} functions and the Fortran
+ @code{print}/@code{write} statements.
+@end itemize
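
A minimal example of the supported device-side I/O (my own sketch, not
taken from the patch):

    #include <stdio.h>

    int
    main (void)
    {
      /* printf is usable inside offloaded regions, per the remark above.  */
      #pragma omp target
      printf ("hello from the offload device\n");
      return 0;
    }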
+
+
+
+@node nvptx
+@section nvptx
+
+On the hardware side, there is the hierarchy (fine to coarse):
+@itemize
+@item thread
+@item warp
+@item thread block
+@item streaming multiprocessor
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to threads
+@item OpenMP's threads (``parallel'') and OpenACC's workers map to warps
+@item OpenMP's teams and OpenACC's gang use a threadpool whose size
+      equals the number of teams or gangs, respectively.
+@end itemize
+
+The used sizes are:
+@itemize
+@item The @code{warp_size} is always 32
+@item The CUDA kernel is launched with @code{dim=@{#teams,1,1@}, blocks=@{#threads,warp_size,1@}}.
+@end itemize
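
Working that formula once (the clause values 4 and 8 are made up for
illustration): num_teams(4) with num_threads(8) should give dim={4,1,1}
and blocks={8,32,1}, i.e. 8 warps of 32 threads per block:

    void
    mul2 (int n, double *x)
    {
      /* Expected launch, per the formula above: dim={4,1,1},
         blocks={8,32,1}, with warp_size == 32.  */
      #pragma omp target teams distribute parallel for num_teams(4) num_threads(8) map(tofrom: x[0:n])
      for (int i = 0; i < n; i++)
        x[i] *= 2.0;
    }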
+
+Additional information can be obtained by setting the environment variable
+@code{GOMP_DEBUG=1} (very verbose; grep for @code{kernel.*launch} for launch
+parameters).
+
+GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA,
+which caches the JIT-compiled code in the user's directory (see the CUDA
+documentation; this can be tuned by the environment variables
+@code{CUDA_CACHE_@{DISABLE,MAXSIZE,PATH@}}).
+
+Note: While the PTX ISA is generic, the @code{-mptx=} and @code{-march=}
+command-line options still affect the generated PTX ISA code and, thus, the
+requirements on the CUDA version and hardware.
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions. Note that the Fortran
+      @code{print}/@code{write} statements are not supported yet.
+@end itemize
+
@c ---------------------------------------------------------------------
@c The libgomp ABI