From patchwork Thu Feb 16 18:06:41 2023
X-Patchwork-Submitter: Andrew Stubbs
X-Patchwork-Id: 58181
Message-ID: <5eaeddf5-317a-4574-868b-87999bb6af33@codesourcery.com>
Date: Thu, 16 Feb 2023 18:06:41 +0000
To: "gcc-patches@gcc.gnu.org"
From: Andrew Stubbs
Subject: [OG12][committed] amdgcn: OpenMP low-latency allocator

These patches implement an LDS memory allocator for OpenMP on AMD.

1. 230216-basic-allocator.patch

   Separate the allocator from NVPTX so the code can be shared.

2. 230216-amd-low-lat.patch

   Allocate the memory, adjust the default address space, and hook up the
   allocator.

They will need to be integrated with the rest of the memory management
patch-stack when I repost that for mainline.

Andrew

nvptx, libgomp: Move the low-latency allocator code

There shouldn't be a functionality change; this is just so AMD can share
the code.
The new basic-allocator.c is designed to be included, so that it can be
used as a template multiple times and inlined.

libgomp/ChangeLog:

	* config/nvptx/allocator.c (BASIC_ALLOC_PREFIX): New define, and
	include basic-allocator.c.
	(__nvptx_lowlat_heap_root): Remove.
	(heapdesc): Remove.
	(nvptx_memspace_alloc): Move implementation to basic-allocator.c.
	(nvptx_memspace_calloc): Likewise.
	(nvptx_memspace_free): Likewise.
	(nvptx_memspace_realloc): Likewise.
	* config/nvptx/team.c (__nvptx_lowlat_heap_root): Remove.
	(gomp_nvptx_main): Call __nvptx_lowlat_init.
	* basic-allocator.c: New file.

amdgcn, libgomp: low-latency allocator

This implements the OpenMP low-latency memory allocator for AMD GCN using
the small per-team LDS memory (Local Data Store).

Since addresses can now refer to LDS space, the "Global" address space is
no longer compatible.  This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories).  A future patch
will re-enable "global" instructions for cases where it is known to be
safe to do so.

gcc/ChangeLog:

	* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
	* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
	addressing.
	(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	(GCN_LOWLAT_HEAP): New.
	* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
	(__gcn_lowlat_init): New prototype.
	(gomp_gcn_enter_kernel): Initialize the low-latency heap.
	* libgomp.h (TEAM_ARENA_START): Move to libgomp-gcn.h.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	* plugin/plugin-gcn.c (lowlat_size): New variable.
	(print_kernel_dispatch): Label the group_segment_size purpose.
	(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
	(create_kernel_dispatch): Pass the low-latency heap allocation to
	the kernel.
	(run_kernel): Use shadow; don't assume values.
	* testsuite/libgomp.c/allocators-7.c: Enable for amdgcn.
	* config/gcn/allocator.c: New file.
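To illustrate the template mechanism, here is a rough sketch of how a
further port would instantiate the shared allocator.  The "newport" names
and the pool symbol are made up for illustration; the real instantiations
are config/nvptx/allocator.c and config/gcn/allocator.c in the diff below.

  #include "libgomp.h"

  /* Defining BASIC_ALLOC_PREFIX and including the template generates
     __newport_lowlat_init, _alloc, _calloc, _free and _realloc, each
     taking the base address of the reserved low-latency pool as its
     first argument.  */
  #define BASIC_ALLOC_PREFIX __newport_lowlat
  #define BASIC_ALLOC_YIELD  /* optionally, a pause/yield instruction */
  #include "../../basic-allocator.c"

  extern char *newport_lowlat_pool;  /* hypothetical: however the port
                                        locates its reserved pool
                                        (cvta.shared on nvptx, an LDS
                                        offset cast to a flat address
                                        on amdgcn).  */

  static void *
  newport_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
  {
    if (memspace == omp_low_lat_mem_space)
      return __newport_lowlat_alloc (newport_lowlat_pool, size);
    return malloc (size);
  }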
diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def index f1cf30bbc94..3619cab4402 100644 --- a/gcc/config/gcn/gcn-builtins.def +++ b/gcc/config/gcn/gcn-builtins.def @@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN, _A1 (GCN_BTI_BOOL), gcn_expand_builtin_1) DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1) +DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR), + gcn_expand_builtin_1) DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN, _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1) diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index 0b21dbd256e..8e487b94e95 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -114,7 +114,8 @@ gcn_init_machine_status (void) f = ggc_cleared_alloc (); - if (TARGET_GCN3) + // FIXME: re-enable global addressing with safety for LDS-flat addresses + //if (TARGET_GCN3) f->use_flat_addressing = true; return f; @@ -4626,6 +4627,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ , } return ptr; } + case GCN_BUILTIN_DISPATCH_PTR: + { + rtx ptr; + if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0) + ptr = gen_rtx_REG (DImode, + cfun->machine->args.reg[DISPATCH_PTR_ARG]); + else + { + ptr = gen_reg_rtx (DImode); + emit_move_insn (ptr, const0_rtx); + } + return ptr; + } case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P: { /* Stash a marker in the unused upper 16 bits of s[0:1] to indicate diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c new file mode 100644 index 00000000000..001de89ffe0 --- /dev/null +++ b/libgomp/config/gcn/allocator.c @@ -0,0 +1,129 @@ +/* Copyright (C) 2023 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* The low-latency allocators use space reserved in LDS memory when the + kernel is launched. The heap is initialized in gomp_gcn_enter_kernel and + all allocations are forgotten when the kernel exits. Allocations to other + memory spaces all use the system malloc syscall. + + The pointers returned are 64-bit "Flat" addresses indistinguishable from + regular pointers, but only compatible with the "flat_load/store" + instructions. The compiler has been coded to assign default address + spaces accordingly. + + LDS memory is not visible to other teams, and therefore may only be used + when the memspace access trait is set accordingly. 
*/ + +#include "libgomp.h" +#include + +#define BASIC_ALLOC_PREFIX __gcn_lowlat +#define BASIC_ALLOC_YIELD asm("s_sleep 1" ::: "memory") +#include "../../basic-allocator.c" + +/* The low-latency heap is located in LDS memory, but we need the __flat + address space for compatibility reasons. */ +#define FLAT_HEAP_PTR \ + ((void*)(uintptr_t)(void __flat*)(void __lds *)GCN_LOWLAT_HEAP) + +static void * +gcn_memspace_alloc (omp_memspace_handle_t memspace, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + return __gcn_lowlat_alloc (shared_pool, size); + } + else if (memspace == ompx_host_mem_space) + return NULL; + else + return malloc (size); +} + +static void * +gcn_memspace_calloc (omp_memspace_handle_t memspace, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + return __gcn_lowlat_calloc (shared_pool, size); + } + else if (memspace == ompx_host_mem_space) + return NULL; + else + return calloc (1, size); +} + +static void +gcn_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + __gcn_lowlat_free (shared_pool, addr, size); + } + else + free (addr); +} + +static void * +gcn_memspace_realloc (omp_memspace_handle_t memspace, void *addr, + size_t oldsize, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + return __gcn_lowlat_realloc (shared_pool, addr, oldsize, size); + } + else if (memspace == ompx_host_mem_space) + return NULL; + else + return realloc (addr, size); +} + +static inline int +gcn_memspace_validate (omp_memspace_handle_t memspace, unsigned access) +{ + /* Disallow use of low-latency memory when it must be accessible by + all threads. */ + return (memspace != omp_low_lat_mem_space + || access != omp_atv_all); +} + +#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ + gcn_memspace_alloc (MEMSPACE, SIZE) +#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \ + gcn_memspace_calloc (MEMSPACE, SIZE) +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \ + gcn_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE) +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ + gcn_memspace_free (MEMSPACE, ADDR, SIZE) +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \ + gcn_memspace_validate (MEMSPACE, ACCESS) + +#include "../../allocator.c" diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h index 1521166baa3..3e8d7451453 100644 --- a/libgomp/config/gcn/libgomp-gcn.h +++ b/libgomp/config/gcn/libgomp-gcn.h @@ -33,6 +33,12 @@ #define DEFAULT_GCN_STACK_SIZE (32*1024) #define DEFAULT_TEAM_ARENA_SIZE (64*1024) +/* These define the LDS location of data needed by OpenMP. */ +#define TEAM_ARENA_START 16 /* LDS offset of free pointer. */ +#define TEAM_ARENA_FREE 24 /* LDS offset of free pointer. */ +#define TEAM_ARENA_END 32 /* LDS offset of end pointer. */ +#define GCN_LOWLAT_HEAP 40 /* LDS offset of the OpenMP low-latency heap. */ + struct heap { int64_t size; diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c index ffdc09b7f35..13641a4702c 100644 --- a/libgomp/config/gcn/team.c +++ b/libgomp/config/gcn/team.c @@ -29,6 +29,12 @@ #include #include +#define LITTLEENDIAN_CPU +#include "hsa.h" + +/* Defined in basic-allocator.c via config/amdgcn/allocator.c. 
*/ +void __gcn_lowlat_init (void *heap, size_t size); + static void gomp_thread_start (struct gomp_thread_pool *); /* This externally visible function handles target region entry. It @@ -71,6 +77,12 @@ gomp_gcn_enter_kernel (void) *arena_free = team_arena; *arena_end = team_arena + kernargs->arena_size_per_team; + /* Initialize the low-latency heap. The header is the size. */ + void __lds *lowlat = (void __lds *)GCN_LOWLAT_HEAP; + hsa_kernel_dispatch_packet_t *queue_ptr = __builtin_gcn_dispatch_ptr (); + __gcn_lowlat_init ((void*)(uintptr_t)(void __flat*)lowlat, + queue_ptr->group_segment_size - GCN_LOWLAT_HEAP); + /* Allocate and initialize the team-local-storage data. */ struct gomp_thread *thrs = team_malloc_cleared (sizeof (*thrs) * numthreads); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a0af66e396b..d1e45cc584e 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -114,9 +114,6 @@ extern void gomp_aligned_free (void *); #ifdef __AMDGCN__ #include "libgomp-gcn.h" /* The arena is initialized in config/gcn/team.c. */ -#define TEAM_ARENA_START 16 /* LDS offset of free pointer. */ -#define TEAM_ARENA_FREE 24 /* LDS offset of free pointer. */ -#define TEAM_ARENA_END 32 /* LDS offset of end pointer. */ static inline void * __attribute__((malloc)) team_malloc (size_t size) diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c index 70a555a24a2..ca89ba658fd 100644 --- a/libgomp/plugin/plugin-gcn.c +++ b/libgomp/plugin/plugin-gcn.c @@ -563,6 +563,7 @@ static size_t gcn_kernel_heap_size = DEFAULT_GCN_HEAP_SIZE; static int team_arena_size = DEFAULT_TEAM_ARENA_SIZE; static int stack_size = DEFAULT_GCN_STACK_SIZE; +static int lowlat_size = -1; /* Flag to decide whether print to stderr information about what is going on. Set in init_debug depending on environment variables. */ @@ -1047,8 +1048,8 @@ print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent) fprintf (stderr, "%*sobject: %lu\n", indent, "", dispatch->object); fprintf (stderr, "%*sprivate_segment_size: %u\n", indent, "", dispatch->private_segment_size); - fprintf (stderr, "%*sgroup_segment_size: %u\n", indent, "", - dispatch->group_segment_size); + fprintf (stderr, "%*sgroup_segment_size: %u (low-latency pool)\n", indent, + "", dispatch->group_segment_size); fprintf (stderr, "\n"); } @@ -1119,6 +1120,10 @@ init_environment_variables (void) if (tmp) stack_size = tmp;; } + + const char *lowlat = secure_getenv ("GOMP_GCN_LOWLAT_POOL"); + if (lowlat) + lowlat_size = atoi (lowlat); } /* Return malloc'd string with name of SYMBOL. */ @@ -1946,7 +1951,25 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams, shadow->signal = sync_signal.handle; shadow->private_segment_size = kernel->private_segment_size; - shadow->group_segment_size = kernel->group_segment_size; + + if (lowlat_size < 0) + { + /* Divide the LDS between the number of running teams. + Allocate not less than is defined in the kernel metadata. */ + int teams_per_cu = num_teams / get_cu_count (agent); + int LDS_per_team = (teams_per_cu ? 65536 / teams_per_cu : 65536); + shadow->group_segment_size + = (kernel->group_segment_size > LDS_per_team + ? kernel->group_segment_size + : LDS_per_team);; + } + else if (lowlat_size < GCN_LOWLAT_HEAP+8) + /* Ensure that there's space for the OpenMP libgomp data. */ + shadow->group_segment_size = GCN_LOWLAT_HEAP+8; + else + shadow->group_segment_size = (lowlat_size > 65536 + ? 
65536 + : lowlat_size); /* We expect kernels to request a single pointer, explicitly, and the rest of struct kernargs, implicitly. If they request anything else @@ -2305,9 +2328,9 @@ run_kernel (struct kernel_info *kernel, void *vars, print_kernel_dispatch (shadow, 2); } - packet->private_segment_size = kernel->private_segment_size; - packet->group_segment_size = kernel->group_segment_size; - packet->kernel_object = kernel->object; + packet->private_segment_size = shadow->private_segment_size; + packet->group_segment_size = shadow->group_segment_size; + packet->kernel_object = shadow->object; packet->kernarg_address = shadow->kernarg_address; hsa_signal_t s; s.handle = shadow->signal; diff --git a/libgomp/testsuite/libgomp.c/allocators-7.c b/libgomp/testsuite/libgomp.c/allocators-7.c index a0a738b1d1d..5ef0c5cb3e3 100644 --- a/libgomp/testsuite/libgomp.c/allocators-7.c +++ b/libgomp/testsuite/libgomp.c/allocators-7.c @@ -1,7 +1,7 @@ /* { dg-do run } */ /* { dg-require-effective-target offload_device } */ -/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */ +/* { dg-xfail-if "not implemented" { ! { offload_target_nvptx || offload_target_amdgcn } } } */ /* Test that GPU low-latency allocation is limited to team access. */ diff --git a/libgomp/basic-allocator.c b/libgomp/basic-allocator.c new file mode 100644 index 00000000000..94b99a89e0b --- /dev/null +++ b/libgomp/basic-allocator.c @@ -0,0 +1,380 @@ +/* Copyright (C) 2023 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This is a basic "malloc" implementation intended for use with small, + low-latency memories. + + To use this template, define BASIC_ALLOC_PREFIX, and then #include the + source file. The other configuration macros are optional. + + The root heap descriptor is stored in the first bytes of the heap, and each + free chunk contains a similar descriptor for the next free chunk in the + chain. + + The descriptor is two values: offset and size, which describe the + location of a chunk of memory available for allocation. The offset is + relative to the base of the heap. The special offset value 0xffffffff + indicates that the heap (free chain) is locked. The offset and size are + 32-bit values so the base alignment can be 8-bytes. + + Memory is allocated to the first free chunk that fits. The free chain + is always stored in order of the offset to assist coalescing adjacent + chunks. */ + +#include "libgomp.h" + +#ifndef BASIC_ALLOC_PREFIX +#error "BASIC_ALLOC_PREFIX not defined." 
+#endif + +#ifndef BASIC_ALLOC_YIELD +#deine BASIC_ALLOC_YIELD +#endif + +#define ALIGN(VAR) (((VAR) + 7) & ~7) /* 8-byte granularity. */ + +#define fn1(prefix, name) prefix ## _ ## name +#define fn(prefix, name) fn1 (prefix, name) +#define basic_alloc_init fn(BASIC_ALLOC_PREFIX,init) +#define basic_alloc_alloc fn(BASIC_ALLOC_PREFIX,alloc) +#define basic_alloc_calloc fn(BASIC_ALLOC_PREFIX,calloc) +#define basic_alloc_free fn(BASIC_ALLOC_PREFIX,free) +#define basic_alloc_realloc fn(BASIC_ALLOC_PREFIX,realloc) + +typedef struct { + uint32_t offset; + uint32_t size; +} heapdesc; + +void +basic_alloc_init (char *heap, size_t limit) +{ + if (heap == NULL) + return; + + /* Initialize the head of the free chain. */ + heapdesc *root = (heapdesc*)heap; + root->offset = ALIGN(1); + root->size = limit - root->offset; + + /* And terminate the chain. */ + heapdesc *next = (heapdesc*)(heap + root->offset); + next->offset = 0; + next->size = 0; +} + +static void * +basic_alloc_alloc (char *heap, size_t size) +{ + if (heap == NULL) + return NULL; + + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc*)heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + BASIC_ALLOC_YIELD; + } + while (1); + + /* Walk the free chain. */ + heapdesc chunk = root; + heapdesc *prev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc*)(heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && (uint32_t)size > chunk.size) + { + chunk = onward_chain; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc*)(heap + chunk.offset); + onward_chain = *chunkptr; + } + + void *result = NULL; + if (chunk.size != 0) + { + /* Allocation successful. */ + result = chunkptr; + + /* Update the free chain. */ + heapdesc stillfree = chunk; + stillfree.offset += size; + stillfree.size -= size; + heapdesc *stillfreeptr = (heapdesc*)(heap + stillfree.offset); + + if (stillfree.size == 0) + /* The whole chunk was used. */ + stillfree = onward_chain; + else + /* The chunk was split, so restore the onward chain. */ + *stillfreeptr = onward_chain; + + /* The previous free slot or root now points to stillfree. */ + if (prev_chunkptr) + *prev_chunkptr = stillfree; + else + root = stillfree; + } + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + + return result; +} + +static void * +basic_alloc_calloc (char *heap, size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + uint64_t *result = basic_alloc_alloc (heap, size); + if (result) + /* Inline memset in which we know size is a multiple of 8. */ + for (unsigned i = 0; i < (unsigned)size/8; i++) + result[i] = 0; + + return result; +} + +static void +basic_alloc_free (char *heap, void *addr, size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc*)heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + } + while (1); + + /* Walk the free chain to find where to insert a new entry. 
*/ + heapdesc chunk = root, prev_chunk; + heapdesc *prev_chunkptr = NULL, *prevprev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc*)(heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && addr > (void*)chunkptr) + { + prev_chunk = chunk; + chunk = onward_chain; + prevprev_chunkptr = prev_chunkptr; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc*)(heap + chunk.offset); + onward_chain = *chunkptr; + } + + /* Create the new chunk descriptor. */ + heapdesc newfreechunk; + newfreechunk.offset = (uint32_t)((uintptr_t)addr - (uintptr_t)heap); + newfreechunk.size = (uint32_t)size; + + /* Coalesce adjacent free chunks. */ + if (newfreechunk.offset + size == chunk.offset) + { + /* Free chunk follows. */ + newfreechunk.size += chunk.size; + chunk = onward_chain; + } + if (prev_chunkptr) + { + if (prev_chunk.offset + prev_chunk.size + == newfreechunk.offset) + { + /* Free chunk precedes. */ + newfreechunk.offset = prev_chunk.offset; + newfreechunk.size += prev_chunk.size; + addr = heap + prev_chunk.offset; + prev_chunkptr = prevprev_chunkptr; + } + } + + /* Update the free chain in the new and previous chunks. */ + *(heapdesc*)addr = chunk; + if (prev_chunkptr) + *prev_chunkptr = newfreechunk; + else + root = newfreechunk; + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + +} + +static void * +basic_alloc_realloc (char *heap, void *addr, size_t oldsize, + size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + oldsize = ALIGN (oldsize); + size = ALIGN (size); + + if (oldsize == size) + return addr; + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc*)heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + } + while (1); + + /* Walk the free chain. */ + heapdesc chunk = root; + heapdesc *prev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc*)(heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && (void*)chunkptr < addr) + { + chunk = onward_chain; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc*)(heap + chunk.offset); + onward_chain = *chunkptr; + } + + void *result = NULL; + if (size < oldsize) + { + /* The new allocation is smaller than the old; we can always + shrink an allocation in place. */ + result = addr; + + heapdesc *nowfreeptr = (heapdesc*)(addr + size); + + /* Update the free chain. */ + heapdesc nowfree; + nowfree.offset = (char*)nowfreeptr - heap; + nowfree.size = oldsize - size; + + if (nowfree.offset + size == chunk.offset) + { + /* Coalesce following free chunk. */ + nowfree.size += chunk.size; + *nowfreeptr = onward_chain; + } + else + *nowfreeptr = chunk; + + /* The previous free slot or root now points to nowfree. */ + if (prev_chunkptr) + *prev_chunkptr = nowfree; + else + root = nowfree; + } + else if (chunk.size != 0 + && (char *)addr + oldsize == (char *)chunkptr + && chunk.size >= size-oldsize) + { + /* The new allocation is larger than the old, and we found a + large enough free block right after the existing block, + so we extend into that space. */ + result = addr; + + uint32_t delta = size-oldsize; + + /* Update the free chain. 
*/ + heapdesc stillfree = chunk; + stillfree.offset += delta; + stillfree.size -= delta; + heapdesc *stillfreeptr = (heapdesc*)(heap + stillfree.offset); + + if (stillfree.size == 0) + /* The whole chunk was used. */ + stillfree = onward_chain; + else + /* The chunk was split, so restore the onward chain. */ + *stillfreeptr = onward_chain; + + /* The previous free slot or root now points to stillfree. */ + if (prev_chunkptr) + *prev_chunkptr = stillfree; + else + root = stillfree; + } + /* Else realloc in-place has failed and result remains NULL. */ + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + + if (result == NULL) + { + /* The allocation could not be extended in place, so we simply + allocate fresh memory and move the data. If we can't allocate + from low-latency memory then we leave the original alloaction + intact and return NULL. + We could do a fall-back to main memory, but we don't know what + the fall-back trait said to do. */ + result = basic_alloc_alloc (heap, size); + if (result != NULL) + { + /* Inline memcpy in which we know oldsize is a multiple of 8. */ + uint64_t *from = addr, *to = result; + for (unsigned i = 0; i < (unsigned)oldsize/8; i++) + to[i] = from[i]; + + basic_alloc_free (heap, addr, oldsize); + } + } + + return result; +} + +#undef ALIGN +#undef fn1 +#undef fn +#undef basic_alloc_init +#undef basic_alloc_alloc +#undef basic_alloc_free +#undef basic_alloc_realloc diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index c1a73511623..7c2a7463bf7 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -44,20 +44,13 @@ #include "libgomp.h" #include +#define BASIC_ALLOC_PREFIX __nvptx_lowlat +#include "../../basic-allocator.c" + /* There should be some .shared space reserved for us. There's no way to express this magic extern sizeless array in C so use asm. */ asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n"); -extern uint32_t __nvptx_lowlat_heap_root __attribute__((shared,nocommon)); - -typedef union { - uint32_t raw; - struct { - uint16_t offset; - uint16_t size; - } desc; -} heapdesc; - static void * nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) { @@ -66,64 +59,7 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) char *shared_pool; asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - /* Memory is allocated in 8-byte granularity. */ - size = (size + 7) & ~7; - - /* Acquire a lock on the low-latency heap. */ - heapdesc root; - do - { - root.raw = __atomic_exchange_n (&__nvptx_lowlat_heap_root, - 0xffffffff, MEMMODEL_ACQUIRE); - if (root.raw != 0xffffffff) - break; - /* Spin. */ - } - while (1); - - /* Walk the free chain. */ - heapdesc chunk = {root.raw}; - uint32_t *prev_chunkptr = NULL; - uint32_t *chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - heapdesc onward_chain = {chunkptr[0]}; - while (chunk.desc.size != 0 && (uint32_t)size > chunk.desc.size) - { - chunk.raw = onward_chain.raw; - prev_chunkptr = chunkptr; - chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - onward_chain.raw = chunkptr[0]; - } - - void *result = NULL; - if (chunk.desc.size != 0) - { - /* Allocation successful. */ - result = chunkptr; - - /* Update the free chain. 
*/ - heapdesc stillfree = {chunk.raw}; - stillfree.desc.offset += size; - stillfree.desc.size -= size; - uint32_t *stillfreeptr = (uint32_t*)(shared_pool - + stillfree.desc.offset); - - if (stillfree.desc.size == 0) - /* The whole chunk was used. */ - stillfree.raw = onward_chain.raw; - else - /* The chunk was split, so restore the onward chain. */ - stillfreeptr[0] = onward_chain.raw; - - /* The previous free slot or root now points to stillfree. */ - if (prev_chunkptr) - prev_chunkptr[0] = stillfree.raw; - else - root.raw = stillfree.raw; - } - - /* Update the free chain root and release the lock. */ - __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); - return result; + return __nvptx_lowlat_alloc (shared_pool, size); } else if (memspace == ompx_host_mem_space) return NULL; @@ -136,16 +72,10 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) { if (memspace == omp_low_lat_mem_space) { - /* Memory is allocated in 8-byte granularity. */ - size = (size + 7) & ~7; - - uint64_t *result = nvptx_memspace_alloc (memspace, size); - if (result) - /* Inline memset in which we know size is a multiple of 8. */ - for (unsigned i = 0; i < (unsigned)size/8; i++) - result[i] = 0; + char *shared_pool; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - return result; + return __nvptx_lowlat_calloc (shared_pool, size); } else if (memspace == ompx_host_mem_space) return NULL; @@ -161,71 +91,7 @@ nvptx_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size) char *shared_pool; asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - /* Memory is allocated in 8-byte granularity. */ - size = (size + 7) & ~7; - - /* Acquire a lock on the low-latency heap. */ - heapdesc root; - do - { - root.raw = __atomic_exchange_n (&__nvptx_lowlat_heap_root, - 0xffffffff, MEMMODEL_ACQUIRE); - if (root.raw != 0xffffffff) - break; - /* Spin. */ - } - while (1); - - /* Walk the free chain to find where to insert a new entry. */ - heapdesc chunk = {root.raw}, prev_chunk; - uint32_t *prev_chunkptr = NULL, *prevprev_chunkptr = NULL; - uint32_t *chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - heapdesc onward_chain = {chunkptr[0]}; - while (chunk.desc.size != 0 && addr > (void*)chunkptr) - { - prev_chunk.raw = chunk.raw; - chunk.raw = onward_chain.raw; - prevprev_chunkptr = prev_chunkptr; - prev_chunkptr = chunkptr; - chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - onward_chain.raw = chunkptr[0]; - } - - /* Create the new chunk descriptor. */ - heapdesc newfreechunk; - newfreechunk.desc.offset = (uint16_t)((uintptr_t)addr - - (uintptr_t)shared_pool); - newfreechunk.desc.size = (uint16_t)size; - - /* Coalesce adjacent free chunks. */ - if (newfreechunk.desc.offset + size == chunk.desc.offset) - { - /* Free chunk follows. */ - newfreechunk.desc.size += chunk.desc.size; - chunk.raw = onward_chain.raw; - } - if (prev_chunkptr) - { - if (prev_chunk.desc.offset + prev_chunk.desc.size - == newfreechunk.desc.offset) - { - /* Free chunk precedes. */ - newfreechunk.desc.offset = prev_chunk.desc.offset; - newfreechunk.desc.size += prev_chunk.desc.size; - addr = shared_pool + prev_chunk.desc.offset; - prev_chunkptr = prevprev_chunkptr; - } - } - - /* Update the free chain in the new and previous chunks. */ - ((uint32_t*)addr)[0] = chunk.raw; - if (prev_chunkptr) - prev_chunkptr[0] = newfreechunk.raw; - else - root.raw = newfreechunk.raw; - - /* Update the free chain root and release the lock. 
*/ - __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); + __nvptx_lowlat_free (shared_pool, addr, size); } else free (addr); @@ -240,123 +106,7 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, char *shared_pool; asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - /* Memory is allocated in 8-byte granularity. */ - oldsize = (oldsize + 7) & ~7; - size = (size + 7) & ~7; - - if (oldsize == size) - return addr; - - /* Acquire a lock on the low-latency heap. */ - heapdesc root; - do - { - root.raw = __atomic_exchange_n (&__nvptx_lowlat_heap_root, - 0xffffffff, MEMMODEL_ACQUIRE); - if (root.raw != 0xffffffff) - break; - /* Spin. */ - } - while (1); - - /* Walk the free chain. */ - heapdesc chunk = {root.raw}; - uint32_t *prev_chunkptr = NULL; - uint32_t *chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - heapdesc onward_chain = {chunkptr[0]}; - while (chunk.desc.size != 0 && (void*)chunkptr < addr) - { - chunk.raw = onward_chain.raw; - prev_chunkptr = chunkptr; - chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - onward_chain.raw = chunkptr[0]; - } - - void *result = NULL; - if (size < oldsize) - { - /* The new allocation is smaller than the old; we can always - shrink an allocation in place. */ - result = addr; - - uint32_t *nowfreeptr = (uint32_t*)(addr + size); - - /* Update the free chain. */ - heapdesc nowfree; - nowfree.desc.offset = (char*)nowfreeptr - shared_pool; - nowfree.desc.size = oldsize - size; - - if (nowfree.desc.offset + size == chunk.desc.offset) - { - /* Coalesce following free chunk. */ - nowfree.desc.size += chunk.desc.size; - nowfreeptr[0] = onward_chain.raw; - } - else - nowfreeptr[0] = chunk.raw; - - /* The previous free slot or root now points to nowfree. */ - if (prev_chunkptr) - prev_chunkptr[0] = nowfree.raw; - else - root.raw = nowfree.raw; - } - else if (chunk.desc.size != 0 - && (char *)addr + oldsize == (char *)chunkptr - && chunk.desc.size >= size-oldsize) - { - /* The new allocation is larger than the old, and we found a - large enough free block right after the existing block, - so we extend into that space. */ - result = addr; - - uint16_t delta = size-oldsize; - - /* Update the free chain. */ - heapdesc stillfree = {chunk.raw}; - stillfree.desc.offset += delta; - stillfree.desc.size -= delta; - uint32_t *stillfreeptr = (uint32_t*)(shared_pool - + stillfree.desc.offset); - - if (stillfree.desc.size == 0) - /* The whole chunk was used. */ - stillfree.raw = onward_chain.raw; - else - /* The chunk was split, so restore the onward chain. */ - stillfreeptr[0] = onward_chain.raw; - - /* The previous free slot or root now points to stillfree. */ - if (prev_chunkptr) - prev_chunkptr[0] = stillfree.raw; - else - root.raw = stillfree.raw; - } - /* Else realloc in-place has failed and result remains NULL. */ - - /* Update the free chain root and release the lock. */ - __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); - - if (result == NULL) - { - /* The allocation could not be extended in place, so we simply - allocate fresh memory and move the data. If we can't allocate - from low-latency memory then we leave the original alloaction - intact and return NULL. - We could do a fall-back to main memory, but we don't know what - the fall-back trait said to do. */ - result = nvptx_memspace_alloc (memspace, size); - if (result != NULL) - { - /* Inline memcpy in which we know oldsize is a multiple of 8. 
*/ - uint64_t *from = addr, *to = result; - for (unsigned i = 0; i < (unsigned)oldsize/8; i++) - to[i] = from[i]; - - nvptx_memspace_free (memspace, addr, oldsize); - } - } - return result; + return __nvptx_lowlat_realloc (shared_pool, addr, oldsize, size); } else if (memspace == ompx_host_mem_space) return NULL; diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c index 685610e00be..b30b8df178d 100644 --- a/libgomp/config/nvptx/team.c +++ b/libgomp/config/nvptx/team.c @@ -33,7 +33,6 @@ struct gomp_thread *nvptx_thrs __attribute__((shared,nocommon)); int __gomp_team_num __attribute__((shared,nocommon)); -uint32_t __nvptx_lowlat_heap_root __attribute__((shared,nocommon)); static void gomp_thread_start (struct gomp_thread_pool *); @@ -41,6 +40,9 @@ static void gomp_thread_start (struct gomp_thread_pool *); express this magic extern sizeless array in C so use asm. */ asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n"); +/* Defined in basic-allocator.c via config/nvptx/allocator.c. */ +void __nvptx_lowlat_init (void *heap, size_t size); + /* This externally visible function handles target region entry. It sets up a per-team thread pool and transfers control by calling FN (FN_DATA) in the master thread or gomp_thread_start in other threads. @@ -76,19 +78,7 @@ gomp_nvptx_main (void (*fn) (void *), void *fn_data) asm ("mov.u32\t%0, %%dynamic_smem_size;\n" : "=r"(shared_pool_size)); #endif - - /* ... and initialize it with an empty free-chain. */ - union { - uint32_t raw; - struct { - uint16_t offset; - uint16_t size; - } desc; - } root; - root.desc.offset = 0; /* The first byte is free. */ - root.desc.size = shared_pool_size; /* The whole space is free. */ - __nvptx_lowlat_heap_root = root.raw; - shared_pool[0] = 0; /* Terminate free chain. */ + __nvptx_lowlat_init (shared_pool, shared_pool_size); /* Initialize the thread pool. */ struct gomp_thread_pool *pool = alloca (sizeof (*pool));
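For reference, a minimal device-side usage sketch, in the spirit of
testsuite/libgomp.c/allocators-7.c (this file is not part of the patch):
an allocator in omp_low_lat_mem_space whose access trait is limited to the
contention group may be served from the LDS pool, whereas omp_atv_all
access is rejected by MEMSPACE_VALIDATE.

  #include <omp.h>

  int
  main (void)
  {
    int ok = 0;
  #pragma omp target map(from: ok)
    {
      /* Team-local access only; omp_atv_all would disqualify the
         low-latency pool.  */
      omp_alloctrait_t traits[1]
        = { { omp_atk_access, omp_atv_cgroup } };
      omp_allocator_handle_t lowlat
        = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
      int *p = (int *) omp_alloc (sizeof (int), lowlat);
      ok = (p != NULL);
      omp_free (p, lowlat);
      omp_destroy_allocator (lowlat);
    }
    return ok ? 0 : 1;
  }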