From patchwork Thu Feb 16 18:06:41 2023
X-Patchwork-Submitter: Andrew Stubbs
X-Patchwork-Id: 58181
Message-ID: <5eaeddf5-317a-4574-868b-87999bb6af33@codesourcery.com>
Date: Thu, 16 Feb 2023 18:06:41 +0000
To: "gcc-patches@gcc.gnu.org"
From: Andrew Stubbs
Subject: [OG12][committed] amdgcn: OpenMP low-latency allocator

These patches implement an LDS memory allocator for OpenMP on AMD.

1. 230216-basic-allocator.patch

   Separate the allocator from NVPTX so the code can be shared.

2. 230216-amd-low-lat.patch

   Allocate the memory, adjust the default address space, and hook up the
   allocator.

They will need to be integrated with the rest of the memory management
patch-stack when I repost that for mainline.

Andrew

nvptx, libgomp: Move the low-latency allocator code

There shouldn't be a functionality change; this is just so AMD can share
the code.
The new basic-allocator.c is designed to be included, so that it can be
used as a template multiple times and inlined.

libgomp/ChangeLog:

	* config/nvptx/allocator.c (BASIC_ALLOC_PREFIX): New define, and
	include basic-allocator.c.
	(__nvptx_lowlat_heap_root): Remove.
	(heapdesc): Remove.
	(nvptx_memspace_alloc): Move implementation to basic-allocator.c.
	(nvptx_memspace_calloc): Likewise.
	(nvptx_memspace_free): Likewise.
	(nvptx_memspace_realloc): Likewise.
	* config/nvptx/team.c (__nvptx_lowlat_heap_root): Remove.
	(gomp_nvptx_main): Call __nvptx_lowlat_init.
	* basic-allocator.c: New file.

amdgcn, libgomp: low-latency allocator

This implements the OpenMP low-latency memory allocator for AMD GCN using
the small per-team LDS memory (Local Data Store).

Since addresses can now refer to LDS space, the "Global" address space is
no longer compatible.  This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories).  A future patch
will re-enable "global" instructions for cases where it is known to be
safe to do so.

gcc/ChangeLog:

	* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
	* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
	addressing.
	(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	(GCN_LOWLAT_HEAP): New.
	* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
	(__gcn_lowlat_init): New prototype.
	(gomp_gcn_enter_kernel): Initialize the low-latency heap.
	* libgomp.h (TEAM_ARENA_START): Move to libgomp-gcn.h.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	* plugin/plugin-gcn.c (lowlat_size): New variable.
	(print_kernel_dispatch): Label the group_segment_size purpose.
	(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
	(create_kernel_dispatch): Pass the low-latency heap allocation to
	the kernel.
	(run_kernel): Use shadow; don't assume values.
	* testsuite/libgomp.c/allocators-7.c: Enable for amdgcn.
	* config/gcn/allocator.c: New file.
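To illustrate the template mechanism, here is a rough sketch of how a
further port would instantiate the shared allocator.  The "newport" names
and the pool symbol are made up for illustration; the real instantiations
are config/nvptx/allocator.c and config/gcn/allocator.c in the diff below.

  #include "libgomp.h"

  /* Defining BASIC_ALLOC_PREFIX and including the template generates
     __newport_lowlat_init, _alloc, _calloc, _free and _realloc, each
     taking the base address of the reserved low-latency pool as its
     first argument.  */
  #define BASIC_ALLOC_PREFIX __newport_lowlat
  #define BASIC_ALLOC_YIELD  /* optionally, a pause/yield instruction */
  #include "../../basic-allocator.c"

  extern char *newport_lowlat_pool;  /* hypothetical: however the port
                                        locates its reserved pool
                                        (cvta.shared on nvptx, an LDS
                                        offset cast to a flat address
                                        on amdgcn).  */

  static void *
  newport_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
  {
    if (memspace == omp_low_lat_mem_space)
      return __newport_lowlat_alloc (newport_lowlat_pool, size);
    return malloc (size);
  }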
diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def index f1cf30bbc94..3619cab4402 100644 --- a/gcc/config/gcn/gcn-builtins.def +++ b/gcc/config/gcn/gcn-builtins.def @@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN, _A1 (GCN_BTI_BOOL), gcn_expand_builtin_1) DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1) +DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR), + gcn_expand_builtin_1) DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN, _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1) diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index 0b21dbd256e..8e487b94e95 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -114,7 +114,8 @@ gcn_init_machine_status (void) f = ggc_cleared_alloc (); - if (TARGET_GCN3) + // FIXME: re-enable global addressing with safety for LDS-flat addresses + //if (TARGET_GCN3) f->use_flat_addressing = true; return f; @@ -4626,6 +4627,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ , } return ptr; } + case GCN_BUILTIN_DISPATCH_PTR: + { + rtx ptr; + if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0) + ptr = gen_rtx_REG (DImode, + cfun->machine->args.reg[DISPATCH_PTR_ARG]); + else + { + ptr = gen_reg_rtx (DImode); + emit_move_insn (ptr, const0_rtx); + } + return ptr; + } case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P: { /* Stash a marker in the unused upper 16 bits of s[0:1] to indicate diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c new file mode 100644 index 00000000000..001de89ffe0 --- /dev/null +++ b/libgomp/config/gcn/allocator.c @@ -0,0 +1,129 @@ +/* Copyright (C) 2023 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* The low-latency allocators use space reserved in LDS memory when the + kernel is launched. The heap is initialized in gomp_gcn_enter_kernel and + all allocations are forgotten when the kernel exits. Allocations to other + memory spaces all use the system malloc syscall. + + The pointers returned are 64-bit "Flat" addresses indistinguishable from + regular pointers, but only compatible with the "flat_load/store" + instructions. The compiler has been coded to assign default address + spaces accordingly. + + LDS memory is not visible to other teams, and therefore may only be used + when the memspace access trait is set accordingly. 
*/ + +#include "libgomp.h" +#include + +#define BASIC_ALLOC_PREFIX __gcn_lowlat +#define BASIC_ALLOC_YIELD asm("s_sleep 1" ::: "memory") +#include "../../basic-allocator.c" + +/* The low-latency heap is located in LDS memory, but we need the __flat + address space for compatibility reasons. */ +#define FLAT_HEAP_PTR \ + ((void*)(uintptr_t)(void __flat*)(void __lds *)GCN_LOWLAT_HEAP) + +static void * +gcn_memspace_alloc (omp_memspace_handle_t memspace, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + return __gcn_lowlat_alloc (shared_pool, size); + } + else if (memspace == ompx_host_mem_space) + return NULL; + else + return malloc (size); +} + +static void * +gcn_memspace_calloc (omp_memspace_handle_t memspace, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + return __gcn_lowlat_calloc (shared_pool, size); + } + else if (memspace == ompx_host_mem_space) + return NULL; + else + return calloc (1, size); +} + +static void +gcn_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + __gcn_lowlat_free (shared_pool, addr, size); + } + else + free (addr); +} + +static void * +gcn_memspace_realloc (omp_memspace_handle_t memspace, void *addr, + size_t oldsize, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool = FLAT_HEAP_PTR; + + return __gcn_lowlat_realloc (shared_pool, addr, oldsize, size); + } + else if (memspace == ompx_host_mem_space) + return NULL; + else + return realloc (addr, size); +} + +static inline int +gcn_memspace_validate (omp_memspace_handle_t memspace, unsigned access) +{ + /* Disallow use of low-latency memory when it must be accessible by + all threads. */ + return (memspace != omp_low_lat_mem_space + || access != omp_atv_all); +} + +#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ + gcn_memspace_alloc (MEMSPACE, SIZE) +#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \ + gcn_memspace_calloc (MEMSPACE, SIZE) +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \ + gcn_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE) +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ + gcn_memspace_free (MEMSPACE, ADDR, SIZE) +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \ + gcn_memspace_validate (MEMSPACE, ACCESS) + +#include "../../allocator.c" diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h index 1521166baa3..3e8d7451453 100644 --- a/libgomp/config/gcn/libgomp-gcn.h +++ b/libgomp/config/gcn/libgomp-gcn.h @@ -33,6 +33,12 @@ #define DEFAULT_GCN_STACK_SIZE (32*1024) #define DEFAULT_TEAM_ARENA_SIZE (64*1024) +/* These define the LDS location of data needed by OpenMP. */ +#define TEAM_ARENA_START 16 /* LDS offset of free pointer. */ +#define TEAM_ARENA_FREE 24 /* LDS offset of free pointer. */ +#define TEAM_ARENA_END 32 /* LDS offset of end pointer. */ +#define GCN_LOWLAT_HEAP 40 /* LDS offset of the OpenMP low-latency heap. */ + struct heap { int64_t size; diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c index ffdc09b7f35..13641a4702c 100644 --- a/libgomp/config/gcn/team.c +++ b/libgomp/config/gcn/team.c @@ -29,6 +29,12 @@ #include #include +#define LITTLEENDIAN_CPU +#include "hsa.h" + +/* Defined in basic-allocator.c via config/amdgcn/allocator.c. 
*/ +void __gcn_lowlat_init (void *heap, size_t size); + static void gomp_thread_start (struct gomp_thread_pool *); /* This externally visible function handles target region entry. It @@ -71,6 +77,12 @@ gomp_gcn_enter_kernel (void) *arena_free = team_arena; *arena_end = team_arena + kernargs->arena_size_per_team; + /* Initialize the low-latency heap. The header is the size. */ + void __lds *lowlat = (void __lds *)GCN_LOWLAT_HEAP; + hsa_kernel_dispatch_packet_t *queue_ptr = __builtin_gcn_dispatch_ptr (); + __gcn_lowlat_init ((void*)(uintptr_t)(void __flat*)lowlat, + queue_ptr->group_segment_size - GCN_LOWLAT_HEAP); + /* Allocate and initialize the team-local-storage data. */ struct gomp_thread *thrs = team_malloc_cleared (sizeof (*thrs) * numthreads); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a0af66e396b..d1e45cc584e 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -114,9 +114,6 @@ extern void gomp_aligned_free (void *); #ifdef __AMDGCN__ #include "libgomp-gcn.h" /* The arena is initialized in config/gcn/team.c. */ -#define TEAM_ARENA_START 16 /* LDS offset of free pointer. */ -#define TEAM_ARENA_FREE 24 /* LDS offset of free pointer. */ -#define TEAM_ARENA_END 32 /* LDS offset of end pointer. */ static inline void * __attribute__((malloc)) team_malloc (size_t size) diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c index 70a555a24a2..ca89ba658fd 100644 --- a/libgomp/plugin/plugin-gcn.c +++ b/libgomp/plugin/plugin-gcn.c @@ -563,6 +563,7 @@ static size_t gcn_kernel_heap_size = DEFAULT_GCN_HEAP_SIZE; static int team_arena_size = DEFAULT_TEAM_ARENA_SIZE; static int stack_size = DEFAULT_GCN_STACK_SIZE; +static int lowlat_size = -1; /* Flag to decide whether print to stderr information about what is going on. Set in init_debug depending on environment variables. */ @@ -1047,8 +1048,8 @@ print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent) fprintf (stderr, "%*sobject: %lu\n", indent, "", dispatch->object); fprintf (stderr, "%*sprivate_segment_size: %u\n", indent, "", dispatch->private_segment_size); - fprintf (stderr, "%*sgroup_segment_size: %u\n", indent, "", - dispatch->group_segment_size); + fprintf (stderr, "%*sgroup_segment_size: %u (low-latency pool)\n", indent, + "", dispatch->group_segment_size); fprintf (stderr, "\n"); } @@ -1119,6 +1120,10 @@ init_environment_variables (void) if (tmp) stack_size = tmp;; } + + const char *lowlat = secure_getenv ("GOMP_GCN_LOWLAT_POOL"); + if (lowlat) + lowlat_size = atoi (lowlat); } /* Return malloc'd string with name of SYMBOL. */ @@ -1946,7 +1951,25 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams, shadow->signal = sync_signal.handle; shadow->private_segment_size = kernel->private_segment_size; - shadow->group_segment_size = kernel->group_segment_size; + + if (lowlat_size < 0) + { + /* Divide the LDS between the number of running teams. + Allocate not less than is defined in the kernel metadata. */ + int teams_per_cu = num_teams / get_cu_count (agent); + int LDS_per_team = (teams_per_cu ? 65536 / teams_per_cu : 65536); + shadow->group_segment_size + = (kernel->group_segment_size > LDS_per_team + ? kernel->group_segment_size + : LDS_per_team);; + } + else if (lowlat_size < GCN_LOWLAT_HEAP+8) + /* Ensure that there's space for the OpenMP libgomp data. */ + shadow->group_segment_size = GCN_LOWLAT_HEAP+8; + else + shadow->group_segment_size = (lowlat_size > 65536 + ? 
65536 + : lowlat_size); /* We expect kernels to request a single pointer, explicitly, and the rest of struct kernargs, implicitly. If they request anything else @@ -2305,9 +2328,9 @@ run_kernel (struct kernel_info *kernel, void *vars, print_kernel_dispatch (shadow, 2); } - packet->private_segment_size = kernel->private_segment_size; - packet->group_segment_size = kernel->group_segment_size; - packet->kernel_object = kernel->object; + packet->private_segment_size = shadow->private_segment_size; + packet->group_segment_size = shadow->group_segment_size; + packet->kernel_object = shadow->object; packet->kernarg_address = shadow->kernarg_address; hsa_signal_t s; s.handle = shadow->signal; diff --git a/libgomp/testsuite/libgomp.c/allocators-7.c b/libgomp/testsuite/libgomp.c/allocators-7.c index a0a738b1d1d..5ef0c5cb3e3 100644 --- a/libgomp/testsuite/libgomp.c/allocators-7.c +++ b/libgomp/testsuite/libgomp.c/allocators-7.c @@ -1,7 +1,7 @@ /* { dg-do run } */ /* { dg-require-effective-target offload_device } */ -/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */ +/* { dg-xfail-if "not implemented" { ! { offload_target_nvptx || offload_target_amdgcn } } } */ /* Test that GPU low-latency allocation is limited to team access. */ diff --git a/libgomp/basic-allocator.c b/libgomp/basic-allocator.c new file mode 100644 index 00000000000..94b99a89e0b --- /dev/null +++ b/libgomp/basic-allocator.c @@ -0,0 +1,380 @@ +/* Copyright (C) 2023 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This is a basic "malloc" implementation intended for use with small, + low-latency memories. + + To use this template, define BASIC_ALLOC_PREFIX, and then #include the + source file. The other configuration macros are optional. + + The root heap descriptor is stored in the first bytes of the heap, and each + free chunk contains a similar descriptor for the next free chunk in the + chain. + + The descriptor is two values: offset and size, which describe the + location of a chunk of memory available for allocation. The offset is + relative to the base of the heap. The special offset value 0xffffffff + indicates that the heap (free chain) is locked. The offset and size are + 32-bit values so the base alignment can be 8-bytes. + + Memory is allocated to the first free chunk that fits. The free chain + is always stored in order of the offset to assist coalescing adjacent + chunks. */ + +#include "libgomp.h" + +#ifndef BASIC_ALLOC_PREFIX +#error "BASIC_ALLOC_PREFIX not defined." 
+#endif + +#ifndef BASIC_ALLOC_YIELD +#deine BASIC_ALLOC_YIELD +#endif + +#define ALIGN(VAR) (((VAR) + 7) & ~7) /* 8-byte granularity. */ + +#define fn1(prefix, name) prefix ## _ ## name +#define fn(prefix, name) fn1 (prefix, name) +#define basic_alloc_init fn(BASIC_ALLOC_PREFIX,init) +#define basic_alloc_alloc fn(BASIC_ALLOC_PREFIX,alloc) +#define basic_alloc_calloc fn(BASIC_ALLOC_PREFIX,calloc) +#define basic_alloc_free fn(BASIC_ALLOC_PREFIX,free) +#define basic_alloc_realloc fn(BASIC_ALLOC_PREFIX,realloc) + +typedef struct { + uint32_t offset; + uint32_t size; +} heapdesc; + +void +basic_alloc_init (char *heap, size_t limit) +{ + if (heap == NULL) + return; + + /* Initialize the head of the free chain. */ + heapdesc *root = (heapdesc*)heap; + root->offset = ALIGN(1); + root->size = limit - root->offset; + + /* And terminate the chain. */ + heapdesc *next = (heapdesc*)(heap + root->offset); + next->offset = 0; + next->size = 0; +} + +static void * +basic_alloc_alloc (char *heap, size_t size) +{ + if (heap == NULL) + return NULL; + + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc*)heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + BASIC_ALLOC_YIELD; + } + while (1); + + /* Walk the free chain. */ + heapdesc chunk = root; + heapdesc *prev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc*)(heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && (uint32_t)size > chunk.size) + { + chunk = onward_chain; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc*)(heap + chunk.offset); + onward_chain = *chunkptr; + } + + void *result = NULL; + if (chunk.size != 0) + { + /* Allocation successful. */ + result = chunkptr; + + /* Update the free chain. */ + heapdesc stillfree = chunk; + stillfree.offset += size; + stillfree.size -= size; + heapdesc *stillfreeptr = (heapdesc*)(heap + stillfree.offset); + + if (stillfree.size == 0) + /* The whole chunk was used. */ + stillfree = onward_chain; + else + /* The chunk was split, so restore the onward chain. */ + *stillfreeptr = onward_chain; + + /* The previous free slot or root now points to stillfree. */ + if (prev_chunkptr) + *prev_chunkptr = stillfree; + else + root = stillfree; + } + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + + return result; +} + +static void * +basic_alloc_calloc (char *heap, size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + uint64_t *result = basic_alloc_alloc (heap, size); + if (result) + /* Inline memset in which we know size is a multiple of 8. */ + for (unsigned i = 0; i < (unsigned)size/8; i++) + result[i] = 0; + + return result; +} + +static void +basic_alloc_free (char *heap, void *addr, size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc*)heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + } + while (1); + + /* Walk the free chain to find where to insert a new entry. 
*/ + heapdesc chunk = root, prev_chunk; + heapdesc *prev_chunkptr = NULL, *prevprev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc*)(heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && addr > (void*)chunkptr) + { + prev_chunk = chunk; + chunk = onward_chain; + prevprev_chunkptr = prev_chunkptr; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc*)(heap + chunk.offset); + onward_chain = *chunkptr; + } + + /* Create the new chunk descriptor. */ + heapdesc newfreechunk; + newfreechunk.offset = (uint32_t)((uintptr_t)addr - (uintptr_t)heap); + newfreechunk.size = (uint32_t)size; + + /* Coalesce adjacent free chunks. */ + if (newfreechunk.offset + size == chunk.offset) + { + /* Free chunk follows. */ + newfreechunk.size += chunk.size; + chunk = onward_chain; + } + if (prev_chunkptr) + { + if (prev_chunk.offset + prev_chunk.size + == newfreechunk.offset) + { + /* Free chunk precedes. */ + newfreechunk.offset = prev_chunk.offset; + newfreechunk.size += prev_chunk.size; + addr = heap + prev_chunk.offset; + prev_chunkptr = prevprev_chunkptr; + } + } + + /* Update the free chain in the new and previous chunks. */ + *(heapdesc*)addr = chunk; + if (prev_chunkptr) + *prev_chunkptr = newfreechunk; + else + root = newfreechunk; + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + +} + +static void * +basic_alloc_realloc (char *heap, void *addr, size_t oldsize, + size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + oldsize = ALIGN (oldsize); + size = ALIGN (size); + + if (oldsize == size) + return addr; + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc*)heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + } + while (1); + + /* Walk the free chain. */ + heapdesc chunk = root; + heapdesc *prev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc*)(heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && (void*)chunkptr < addr) + { + chunk = onward_chain; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc*)(heap + chunk.offset); + onward_chain = *chunkptr; + } + + void *result = NULL; + if (size < oldsize) + { + /* The new allocation is smaller than the old; we can always + shrink an allocation in place. */ + result = addr; + + heapdesc *nowfreeptr = (heapdesc*)(addr + size); + + /* Update the free chain. */ + heapdesc nowfree; + nowfree.offset = (char*)nowfreeptr - heap; + nowfree.size = oldsize - size; + + if (nowfree.offset + size == chunk.offset) + { + /* Coalesce following free chunk. */ + nowfree.size += chunk.size; + *nowfreeptr = onward_chain; + } + else + *nowfreeptr = chunk; + + /* The previous free slot or root now points to nowfree. */ + if (prev_chunkptr) + *prev_chunkptr = nowfree; + else + root = nowfree; + } + else if (chunk.size != 0 + && (char *)addr + oldsize == (char *)chunkptr + && chunk.size >= size-oldsize) + { + /* The new allocation is larger than the old, and we found a + large enough free block right after the existing block, + so we extend into that space. */ + result = addr; + + uint32_t delta = size-oldsize; + + /* Update the free chain. 
*/ + heapdesc stillfree = chunk; + stillfree.offset += delta; + stillfree.size -= delta; + heapdesc *stillfreeptr = (heapdesc*)(heap + stillfree.offset); + + if (stillfree.size == 0) + /* The whole chunk was used. */ + stillfree = onward_chain; + else + /* The chunk was split, so restore the onward chain. */ + *stillfreeptr = onward_chain; + + /* The previous free slot or root now points to stillfree. */ + if (prev_chunkptr) + *prev_chunkptr = stillfree; + else + root = stillfree; + } + /* Else realloc in-place has failed and result remains NULL. */ + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + + if (result == NULL) + { + /* The allocation could not be extended in place, so we simply + allocate fresh memory and move the data. If we can't allocate + from low-latency memory then we leave the original alloaction + intact and return NULL. + We could do a fall-back to main memory, but we don't know what + the fall-back trait said to do. */ + result = basic_alloc_alloc (heap, size); + if (result != NULL) + { + /* Inline memcpy in which we know oldsize is a multiple of 8. */ + uint64_t *from = addr, *to = result; + for (unsigned i = 0; i < (unsigned)oldsize/8; i++) + to[i] = from[i]; + + basic_alloc_free (heap, addr, oldsize); + } + } + + return result; +} + +#undef ALIGN +#undef fn1 +#undef fn +#undef basic_alloc_init +#undef basic_alloc_alloc +#undef basic_alloc_free +#undef basic_alloc_realloc diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index c1a73511623..7c2a7463bf7 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -44,20 +44,13 @@ #include "libgomp.h" #include +#define BASIC_ALLOC_PREFIX __nvptx_lowlat +#include "../../basic-allocator.c" + /* There should be some .shared space reserved for us. There's no way to express this magic extern sizeless array in C so use asm. */ asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n"); -extern uint32_t __nvptx_lowlat_heap_root __attribute__((shared,nocommon)); - -typedef union { - uint32_t raw; - struct { - uint16_t offset; - uint16_t size; - } desc; -} heapdesc; - static void * nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) { @@ -66,64 +59,7 @@ nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) char *shared_pool; asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - /* Memory is allocated in 8-byte granularity. */ - size = (size + 7) & ~7; - - /* Acquire a lock on the low-latency heap. */ - heapdesc root; - do - { - root.raw = __atomic_exchange_n (&__nvptx_lowlat_heap_root, - 0xffffffff, MEMMODEL_ACQUIRE); - if (root.raw != 0xffffffff) - break; - /* Spin. */ - } - while (1); - - /* Walk the free chain. */ - heapdesc chunk = {root.raw}; - uint32_t *prev_chunkptr = NULL; - uint32_t *chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - heapdesc onward_chain = {chunkptr[0]}; - while (chunk.desc.size != 0 && (uint32_t)size > chunk.desc.size) - { - chunk.raw = onward_chain.raw; - prev_chunkptr = chunkptr; - chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - onward_chain.raw = chunkptr[0]; - } - - void *result = NULL; - if (chunk.desc.size != 0) - { - /* Allocation successful. */ - result = chunkptr; - - /* Update the free chain. 
*/ - heapdesc stillfree = {chunk.raw}; - stillfree.desc.offset += size; - stillfree.desc.size -= size; - uint32_t *stillfreeptr = (uint32_t*)(shared_pool - + stillfree.desc.offset); - - if (stillfree.desc.size == 0) - /* The whole chunk was used. */ - stillfree.raw = onward_chain.raw; - else - /* The chunk was split, so restore the onward chain. */ - stillfreeptr[0] = onward_chain.raw; - - /* The previous free slot or root now points to stillfree. */ - if (prev_chunkptr) - prev_chunkptr[0] = stillfree.raw; - else - root.raw = stillfree.raw; - } - - /* Update the free chain root and release the lock. */ - __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); - return result; + return __nvptx_lowlat_alloc (shared_pool, size); } else if (memspace == ompx_host_mem_space) return NULL; @@ -136,16 +72,10 @@ nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) { if (memspace == omp_low_lat_mem_space) { - /* Memory is allocated in 8-byte granularity. */ - size = (size + 7) & ~7; - - uint64_t *result = nvptx_memspace_alloc (memspace, size); - if (result) - /* Inline memset in which we know size is a multiple of 8. */ - for (unsigned i = 0; i < (unsigned)size/8; i++) - result[i] = 0; + char *shared_pool; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - return result; + return __nvptx_lowlat_calloc (shared_pool, size); } else if (memspace == ompx_host_mem_space) return NULL; @@ -161,71 +91,7 @@ nvptx_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size) char *shared_pool; asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - /* Memory is allocated in 8-byte granularity. */ - size = (size + 7) & ~7; - - /* Acquire a lock on the low-latency heap. */ - heapdesc root; - do - { - root.raw = __atomic_exchange_n (&__nvptx_lowlat_heap_root, - 0xffffffff, MEMMODEL_ACQUIRE); - if (root.raw != 0xffffffff) - break; - /* Spin. */ - } - while (1); - - /* Walk the free chain to find where to insert a new entry. */ - heapdesc chunk = {root.raw}, prev_chunk; - uint32_t *prev_chunkptr = NULL, *prevprev_chunkptr = NULL; - uint32_t *chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - heapdesc onward_chain = {chunkptr[0]}; - while (chunk.desc.size != 0 && addr > (void*)chunkptr) - { - prev_chunk.raw = chunk.raw; - chunk.raw = onward_chain.raw; - prevprev_chunkptr = prev_chunkptr; - prev_chunkptr = chunkptr; - chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - onward_chain.raw = chunkptr[0]; - } - - /* Create the new chunk descriptor. */ - heapdesc newfreechunk; - newfreechunk.desc.offset = (uint16_t)((uintptr_t)addr - - (uintptr_t)shared_pool); - newfreechunk.desc.size = (uint16_t)size; - - /* Coalesce adjacent free chunks. */ - if (newfreechunk.desc.offset + size == chunk.desc.offset) - { - /* Free chunk follows. */ - newfreechunk.desc.size += chunk.desc.size; - chunk.raw = onward_chain.raw; - } - if (prev_chunkptr) - { - if (prev_chunk.desc.offset + prev_chunk.desc.size - == newfreechunk.desc.offset) - { - /* Free chunk precedes. */ - newfreechunk.desc.offset = prev_chunk.desc.offset; - newfreechunk.desc.size += prev_chunk.desc.size; - addr = shared_pool + prev_chunk.desc.offset; - prev_chunkptr = prevprev_chunkptr; - } - } - - /* Update the free chain in the new and previous chunks. */ - ((uint32_t*)addr)[0] = chunk.raw; - if (prev_chunkptr) - prev_chunkptr[0] = newfreechunk.raw; - else - root.raw = newfreechunk.raw; - - /* Update the free chain root and release the lock. 
*/ - __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); + __nvptx_lowlat_free (shared_pool, addr, size); } else free (addr); @@ -240,123 +106,7 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, char *shared_pool; asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); - /* Memory is allocated in 8-byte granularity. */ - oldsize = (oldsize + 7) & ~7; - size = (size + 7) & ~7; - - if (oldsize == size) - return addr; - - /* Acquire a lock on the low-latency heap. */ - heapdesc root; - do - { - root.raw = __atomic_exchange_n (&__nvptx_lowlat_heap_root, - 0xffffffff, MEMMODEL_ACQUIRE); - if (root.raw != 0xffffffff) - break; - /* Spin. */ - } - while (1); - - /* Walk the free chain. */ - heapdesc chunk = {root.raw}; - uint32_t *prev_chunkptr = NULL; - uint32_t *chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - heapdesc onward_chain = {chunkptr[0]}; - while (chunk.desc.size != 0 && (void*)chunkptr < addr) - { - chunk.raw = onward_chain.raw; - prev_chunkptr = chunkptr; - chunkptr = (uint32_t*)(shared_pool + chunk.desc.offset); - onward_chain.raw = chunkptr[0]; - } - - void *result = NULL; - if (size < oldsize) - { - /* The new allocation is smaller than the old; we can always - shrink an allocation in place. */ - result = addr; - - uint32_t *nowfreeptr = (uint32_t*)(addr + size); - - /* Update the free chain. */ - heapdesc nowfree; - nowfree.desc.offset = (char*)nowfreeptr - shared_pool; - nowfree.desc.size = oldsize - size; - - if (nowfree.desc.offset + size == chunk.desc.offset) - { - /* Coalesce following free chunk. */ - nowfree.desc.size += chunk.desc.size; - nowfreeptr[0] = onward_chain.raw; - } - else - nowfreeptr[0] = chunk.raw; - - /* The previous free slot or root now points to nowfree. */ - if (prev_chunkptr) - prev_chunkptr[0] = nowfree.raw; - else - root.raw = nowfree.raw; - } - else if (chunk.desc.size != 0 - && (char *)addr + oldsize == (char *)chunkptr - && chunk.desc.size >= size-oldsize) - { - /* The new allocation is larger than the old, and we found a - large enough free block right after the existing block, - so we extend into that space. */ - result = addr; - - uint16_t delta = size-oldsize; - - /* Update the free chain. */ - heapdesc stillfree = {chunk.raw}; - stillfree.desc.offset += delta; - stillfree.desc.size -= delta; - uint32_t *stillfreeptr = (uint32_t*)(shared_pool - + stillfree.desc.offset); - - if (stillfree.desc.size == 0) - /* The whole chunk was used. */ - stillfree.raw = onward_chain.raw; - else - /* The chunk was split, so restore the onward chain. */ - stillfreeptr[0] = onward_chain.raw; - - /* The previous free slot or root now points to stillfree. */ - if (prev_chunkptr) - prev_chunkptr[0] = stillfree.raw; - else - root.raw = stillfree.raw; - } - /* Else realloc in-place has failed and result remains NULL. */ - - /* Update the free chain root and release the lock. */ - __atomic_store_n (&__nvptx_lowlat_heap_root, root.raw, MEMMODEL_RELEASE); - - if (result == NULL) - { - /* The allocation could not be extended in place, so we simply - allocate fresh memory and move the data. If we can't allocate - from low-latency memory then we leave the original alloaction - intact and return NULL. - We could do a fall-back to main memory, but we don't know what - the fall-back trait said to do. */ - result = nvptx_memspace_alloc (memspace, size); - if (result != NULL) - { - /* Inline memcpy in which we know oldsize is a multiple of 8. 
*/ - uint64_t *from = addr, *to = result; - for (unsigned i = 0; i < (unsigned)oldsize/8; i++) - to[i] = from[i]; - - nvptx_memspace_free (memspace, addr, oldsize); - } - } - return result; + return __nvptx_lowlat_realloc (shared_pool, addr, oldsize, size); } else if (memspace == ompx_host_mem_space) return NULL; diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c index 685610e00be..b30b8df178d 100644 --- a/libgomp/config/nvptx/team.c +++ b/libgomp/config/nvptx/team.c @@ -33,7 +33,6 @@ struct gomp_thread *nvptx_thrs __attribute__((shared,nocommon)); int __gomp_team_num __attribute__((shared,nocommon)); -uint32_t __nvptx_lowlat_heap_root __attribute__((shared,nocommon)); static void gomp_thread_start (struct gomp_thread_pool *); @@ -41,6 +40,9 @@ static void gomp_thread_start (struct gomp_thread_pool *); express this magic extern sizeless array in C so use asm. */ asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n"); +/* Defined in basic-allocator.c via config/nvptx/allocator.c. */ +void __nvptx_lowlat_init (void *heap, size_t size); + /* This externally visible function handles target region entry. It sets up a per-team thread pool and transfers control by calling FN (FN_DATA) in the master thread or gomp_thread_start in other threads. @@ -76,19 +78,7 @@ gomp_nvptx_main (void (*fn) (void *), void *fn_data) asm ("mov.u32\t%0, %%dynamic_smem_size;\n" : "=r"(shared_pool_size)); #endif - - /* ... and initialize it with an empty free-chain. */ - union { - uint32_t raw; - struct { - uint16_t offset; - uint16_t size; - } desc; - } root; - root.desc.offset = 0; /* The first byte is free. */ - root.desc.size = shared_pool_size; /* The whole space is free. */ - __nvptx_lowlat_heap_root = root.raw; - shared_pool[0] = 0; /* Terminate free chain. */ + __nvptx_lowlat_init (shared_pool, shared_pool_size); /* Initialize the thread pool. */ struct gomp_thread_pool *pool = alloca (sizeof (*pool));
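For reference, a minimal device-side usage sketch, in the spirit of
testsuite/libgomp.c/allocators-7.c (this file is not part of the patch):
an allocator in omp_low_lat_mem_space whose access trait is limited to the
contention group may be served from the LDS pool, whereas omp_atv_all
access is rejected by MEMSPACE_VALIDATE.

  #include <omp.h>

  int
  main (void)
  {
    int ok = 0;
  #pragma omp target map(from: ok)
    {
      /* Team-local access only; omp_atv_all would disqualify the
         low-latency pool.  */
      omp_alloctrait_t traits[1]
        = { { omp_atk_access, omp_atv_cgroup } };
      omp_allocator_handle_t lowlat
        = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
      int *p = (int *) omp_alloc (sizeof (int), lowlat);
      ok = (p != NULL);
      omp_free (p, lowlat);
      omp_destroy_allocator (lowlat);
    }
    return ok ? 0 : 1;
  }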