Message ID | 20230427162301.1151333-1-patrick@rivosinc.com |
---|---|
Headers |
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp389161vqo; Thu, 27 Apr 2023 09:26:51 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ4goOGOaSMejlKXuNMKk0pEfIf+9+fISrEQnajJ4O1GxdBRyD1GgVHr5ajZY8HRjUB34CuN X-Received: by 2002:a17:907:7245:b0:94f:36a0:da45 with SMTP id ds5-20020a170907724500b0094f36a0da45mr2090942ejc.29.1682612811303; Thu, 27 Apr 2023 09:26:51 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682612811; cv=none; d=google.com; s=arc-20160816; b=f8AewZZCaiMmNhv0uy6urgKlokYKgDT+fEUu+wlHV0vAaLxDYDubW24QFaq4K4n5za EqVwRsjpvK8pqf5YLa55fyCv5CCx1C5VwC/m7ThZUeQvO668s0wOAT2m7V/lYzFLlcTA A69v4+p5bSLPcsSxffCYUNBHkwZtbBualt2THhP2L6VzgzyPjDHjCkXc46JVFNgJXoXN 74SKHtBBH9s5SllfjUy42dI6n39MvYa4iPmvBtuPA/up3EbcW9tA1UXajOa2fs2Hbpjn wMheCwsrrH3ga17Nw15BiBTFH7Qf8mPDYr0zcEXf7wdTPp15RsNSbRpEW+Kd+TbJSVEX XP9w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:list-subscribe:list-help:list-post:list-archive :list-unsubscribe:list-id:precedence:content-transfer-encoding :mime-version:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature:dmarc-filter:delivered-to; bh=Qo1zh7K46HZcPmGmRUqcEecKbQhYmnqStYqIHPykW5s=; b=FJud8jGItjrMJ5FI4nN9gM/CU6jcHFxOvfE2WozMD1V6fm8Mz+D5VFX7vgmuCzpKWy h7q+QKCdJjsB2gPkn2ecy4oILhFWuYHdjuXNBzSoi3B9oIkXamlAYk57I0k5gU8Hw5d4 jDS7aXfLFYR/LKCKR9teiazRhdC+OH/GrBxV1r/SaqeHyLHnsHPb8D/hNHj/wbSYSCOj iOp/iZ5Atuc2m/NYqvxsy/WS3759rlIwBMvObz1EozSwL9KGTokvbPqVP0Zq1LLjPwxA kAwFV4XIx4e0tJQ16lk7bll/BuLpRkiPHAezxmhQBcZCNHMBzfBmirQ5oqAVGg4wWuiV 8YCA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@rivosinc-com.20221208.gappssmtp.com header.s=20221208 header.b=N525rPf3; spf=pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org" Received: from sourceware.org (server2.sourceware.org. [8.43.85.97]) by mx.google.com with ESMTPS id w19-20020aa7da53000000b00506b8918a77si13843830eds.576.2023.04.27.09.26.50 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Apr 2023 09:26:51 -0700 (PDT) Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) client-ip=8.43.85.97; Authentication-Results: mx.google.com; dkim=pass header.i=@rivosinc-com.20221208.gappssmtp.com header.s=20221208 header.b=N525rPf3; spf=pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org" Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id E87B2388204D for <ouuuleilei@gmail.com>; Thu, 27 Apr 2023 16:24:50 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mail-pl1-x641.google.com (mail-pl1-x641.google.com [IPv6:2607:f8b0:4864:20::641]) by sourceware.org (Postfix) with ESMTPS id C90873858284 for <gcc-patches@gcc.gnu.org>; Thu, 27 Apr 2023 16:24:09 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C90873858284 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=rivosinc.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivosinc.com Received: by mail-pl1-x641.google.com with SMTP id d9443c01a7336-1a52667955dso88980875ad.1 for <gcc-patches@gcc.gnu.org>; Thu, 27 Apr 2023 09:24:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=rivosinc-com.20221208.gappssmtp.com; s=20221208; t=1682612648; x=1685204648; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Qo1zh7K46HZcPmGmRUqcEecKbQhYmnqStYqIHPykW5s=; b=N525rPf3rxp58VmpF90d/BK8wP46yeMiqZsYkgMt7bYRTwa8AAqYXma4jhYT/R273U V0BXXyP+I63eWGQX/nYVrOwAqstR5bhl6XrTJx5d0iZZoiQjtBYldqSkq8hpGRufUMZK nSmUGA8ShfNWqrGfcEtJcJjTYWVdb0T/DdQiV/pVJ5lpCOolE3qGKX/KheYnA9pi2+ki V96aOGeuYRlvU/VCSHcslBn7Mtb7d0lGj6ium4pJ2h5xJwJoq4oJCZWNzXZiWIEHhhvv ta+ZAD1Wy4zaH95mI/3IAjoDtu/4KL1pSkh+oYeq8+sqDfgbVpxJId3a3bmbOBRP8bMJ VXeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1682612648; x=1685204648; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Qo1zh7K46HZcPmGmRUqcEecKbQhYmnqStYqIHPykW5s=; b=N5kkXzQ8GIymMtTlMLyc59azRhojwbUxo+OndPJNfag9zhn4C69J11TOBKCFw4wsSu Ln5HmjDanUyZC7Q8xjkTpTXC9vwAZcKy7Zo7k8itPc94KfH9v/vbZ/rqvS1JvkYtZBih cHi+3Ca4/Jh5yclzC8Z4olrk+6zE3r0+ecNTncebBAdTrAnwwObiPFie/qicpUW1W6bZ JWSQqhungbO6lZu0aA3hsW5T+9MMkClejXxLK5+cg40KJMGOHadZaV4sMCNcy3FfxUYs TgFjlnI5bJtB1UGZTYvHVlSpTmQufU+a57ifzFO1Ou2kgjbyqJtBWMnsn/W86P/uqMRI nvHg== X-Gm-Message-State: AC+VfDyVofBwWbngkBcw4E3p5RzhCs4ip2c1AG8HKajvuUGCGrkLt1UC U3RU7UgwCSrdlbAVBgEyNyz3OKQA/hDz1uQhyW+z4Ipomn0= X-Received: by 2002:a17:902:e750:b0:1a9:433e:41d5 with SMTP id p16-20020a170902e75000b001a9433e41d5mr2366264plf.56.1682612648297; Thu, 27 Apr 2023 09:24:08 -0700 (PDT) Received: from patrick-ThinkPad-X1-Carbon-Gen-8.hq.rivosinc.com ([50.221.140.188]) by smtp.gmail.com with ESMTPSA id n19-20020a170902969300b001a6db2bef16sm11815906plp.303.2023.04.27.09.24.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Apr 2023 09:24:07 -0700 (PDT) From: Patrick O'Neill <patrick@rivosinc.com> To: gcc-patches@gcc.gnu.org Cc: palmer@rivosinc.com, gnu-toolchain@rivosinc.com, vineetg@rivosinc.com, andrew@sifive.com, kito.cheng@sifive.com, dlustig@nvidia.com, cmuellner@gcc.gnu.org, andrea@rivosinc.com, hboehm@google.com, jeffreyalaw@gmail.com, Patrick O'Neill <patrick@rivosinc.com> Subject: [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings Date: Thu, 27 Apr 2023 09:22:50 -0700 Message-Id: <20230427162301.1151333-1-patrick@rivosinc.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414170942.1695672-1-patrick@rivosinc.com> References: <20230414170942.1695672-1-patrick@rivosinc.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-3.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org> X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764347411196977640?= X-GMAIL-MSGID: =?utf-8?q?1764347411196977640?= |
Series |
RISC-V: Implement ISA Manual Table A.6 Mappings
|
|
Message
Patrick O'Neill
April 27, 2023, 4:22 p.m. UTC
This patchset aims to make the RISCV atomics implementation stronger than the recommended mapping present in table A.6 of the ISA manual. https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 Context --------- GCC defined RISC-V mappings [1] before the Memory Model task group finalized their work and provided the ISA Manual Table A.6/A.7 mappings[2]. For at least a year now, we've known that the mappings were different, but it wasn't clear if these unique mappings had correctness issues. Andrea Parri found an issue with the GCC mappings, showing that atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) mappings do not enforce release ordering guarantees. (Meaning the GCC mappings have a correctness issue). https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ Why not A.6? --------- We can update our mappings now, so the obvious choice would be to implement Table A.6 (what LLVM implements/ISA manual recommends). The reason why that isn't the best path forward for GCC is due to a proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl. For context, there is discussion about fast-tracking the addition of these instructions. The RISCV architectural review committee supports adopting a "new and common atomics ABI for gcc and LLVM toochains ... that assumes the addition of the preceding instructions”. That common ABI is likely to be A.7. https://lists.riscv.org/g/tech-privileged/message/1284 Transitioning from A.6 to A.7 will cause an ABI break. We can hedge against that risk by emitting a conservative fence after SEQ_CST stores to make the mapping compatible with both A.6 and A.7. What does a mapping compatible with both A.6 & A.7 look like? --------- It is exactly the same as Table A.6, but SEQ_CST stores have a trailing fence rw,rw. It's strictly stronger than Table A.6. Microbenchmark --------- Hans Boehm helpfully wrote a microbenchmark [3] that uses ARM to give a rough estimate for the performance benefits/penalties of the different mappings. The microbenchmark is single threaded and almost-write-only. This case seems unlikely but is useful for getting a rough idea of the workload that would be impacted the most. Testcases ------- Control: A simple volatile store. This is most similar to a relaxed store. Release Store: This is most similar to Sw.rl (one of the instructions in Hans' proposal). Store with release fence: This is most similar to the mapping present in Table A.6. Store with two fences: This is most similar to the compatibility mapping present in this patchset. Machines ------- Intel(R) Core(TM) i7-8650U (sanity check only): x86 TSO Cortex A53 (Raspberry pi): ARM In order core Cortex A55 (Pixel 6 Pro): ARM In order core Cortex A76 (Pixel 6 Pro): ARM Out of order core Cortex X1 (Pixel 6 Pro): ARM Out of order core Microbenchmark Results [4] -------- Units are nsecs per iteration. Sanity check Machine CONTROL REL_STORE STORE_REL_FENCE STORE_TWO_FENCE ------- ------- --------- --------------- --------------- Intel i7-8650U 1.34812 1.30038 1.2933 18.0474 Machine CONTROL REL_STORE STORE_REL_FENCE STORE_TWO_FENCE ------- ------- --------- --------------- --------------- Cortex A53 7.15224 10.7282 7.15221 10.013 Cortex A55 2.77965 8.89654 4.44787 7.78331 Cortex A76 1.78021 1.86095 5.33088 8.88462 Cortex X1 2.14252 2.14258 4.32982 7.05234 Reordered tests (using -r flag on microbenchmark) Machine CONTROL REL_STORE STORE_REL_FENCE STORE_TWO_FENCE ------- ------- --------- --------------- --------------- Cortex A53 7.15227 10.7282 7.16113 10.034 Cortex A55 2.78024 8.89574 4.44844 7.78428 Cortex A76 1.77686 1.81081 5.3301 8.88346 Cortex X1 2.14254 2.14251 4.3273 7.05239 Benchmark Interpretation -------- As expected, out of order machines are significantly faster with the REL_STORE mappings. Unexpectedly, the in-order machines are significantly slower with REL_STORE rather than REL_STORE_FENCE. Most machines in the wild are expected to use Table A.7 once the instructions are introduced. Incurring this added cost now will make it easier for compiled RISC-V binaries to transition to the A.7 memory model mapping. The performance benefits of moving to A.7 can be more clearly seen using an almost-all-load microbenchmark (included on page 3 of Hans’ proposal). The code for that microbenchmark is attached below [5]. https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf https://lists.riscv.org/g/tech-unprivileged/topic/92916241 Caveats -------- This is a very synthetic microbenchmark that represents what is expected to be a very unlikely workload. Nevertheless, it's helpful to see the worst-case price we are paying for compatibility. “All times include an entire loop iteration, indirect dispatch and all. The benchmark alternates tests, but does not lock CPU frequency. Since a single core was in use, I expect this was running at basically full speed. Any throttling affected everything more or less uniformly.” - Hans Boehm Patchset overview -------- Patch 1 simplifies the memmodel to ignore MEMMODEL_SYNC_* cases (legacy cases that aren't handled differently for RISC-V). Patches 2-6 make the mappings strictly stronger. Patches 7-9 weaken the mappings to be in line with table A.6 of the ISA manual. Patch 11 adds some basic conformance tests to ensure the implemented mapping matches table A.6 with stronger SEQ_CST stores. Conformance test cases notes -------- The conformance tests in this patch are a good sanity check but do not guarantee exactly following Table A.6. It checks that the right instructions are emitted (ex. fence rw,r) but not the order of those instructions. LLVM mapping notes -------- LLVM emits corresponding fences for atomic_signal_fence instructions. This seems to be an oversight since AFAIK atomic_signal_fence acts as a compiler directive. GCC does not emit any fences for atomic_signal_fence instructions. Future work -------- There still remains some work to be done in this space after this patchset fixes the correctness of the GCC mappings. * Look into explicitly handling subword loads/stores. * Look into using AMOSWAP.rl for store words/doubles. * L{b|h|w|d}.aq/rl & S{b|h|w|d}.aq/rl support once ratified. * zTSO mappings. Prior Patchsets -------- Patchset v1: https://gcc.gnu.org/pipermail/gcc-patches/2022-April/592950.html Patchset v2: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615264.html Patchset v3: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615431.html Patchset v4: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615748.html Changelogs -------- Changes for v2: * Use memmodel_base rather than a custom simplify_memmodel function (Inspired by Christoph Muellner's patch 1/9) * Move instruction styling change from [v1 5/7] to [v2 3/8] to reduce [v2 6/8]'s complexity * Eliminated %K flag for atomic store introduced in v1 in favor of if/else * Rebase/test Changes for v3: * Use a trailing fence for atomic stores to be compatible with table A.7 * Emit an optimized fence r,rw following a SEQ_CST load * Consolidate tests in [PATCH v3 10/10] * Add tests for basic A.6 conformance Changes for v4: * Update cover letter to cover more of the reasoning behind moving to a compatibility mapping * Improve conformance testcases patch assertions and add new compare-exchange testcases Changes for v5: * Update cover letter to cover more context and reasoning behind moving to a compatibility mapping * Rebase to include the subword-atomic cases introduced here: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616080.html * Add basic amo-add subword atomic testcases * Reformat changelogs * Fix misc. whitespace issues [1] GCC port with mappings merged 06 Feb 2017 https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=09cae7507d9e88f2b05cf3a9404bf181e65ccbac [2] A.6 mappings added to ISA manual 12 Dec 2017 https://github.com/riscv/riscv-isa-manual/commit/9da1a115bcc4fe327f35acceb851d4850d12e9fa [3] Hans Boehm almost-all-store Microbenchmark: // Copyright 2023 Google LLC. // SPDX-License-Identifier: Apache-2.0 #include <atomic> #include <iostream> #include <time.h> static constexpr int INNER_ITERS = 10'000'000; static constexpr int OUTER_ITERS = 20; static constexpr int N_TESTS = 4; volatile int the_volatile(17); std::atomic<int> the_atomic(17); void test1(int i) { the_volatile = i; } void test2(int i) { the_atomic.store(i, std::memory_order_release); } void test3(int i) { atomic_thread_fence(std::memory_order_release); the_atomic.store(i, std::memory_order_relaxed); } void test4(int i) { atomic_thread_fence(std::memory_order_release); the_atomic.store(i, std::memory_order_relaxed); atomic_thread_fence(std::memory_order_seq_cst); } typedef void (*int_func)(int); uint64_t getnanos() { struct timespec result; if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &result) != 0) { std::cerr << "clock_gettime() failed\n"; exit(1); } return (uint64_t)result.tv_nsec + 1'000'000'000 * (uint64_t)result.tv_sec; } int_func tests[N_TESTS] = { test1, test2, test3, test4 }; const char *test_names[N_TESTS] = { "control", "release store", "store with release fence", "store with two fences" }; uint64_t total_time[N_TESTS] = { 0 }; int main(int argc, char **argv) { struct timespec res; if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0) { std::cerr << "clock_getres() failed\n"; exit(1); } else { std::cout << "nsec resolution = " << res.tv_nsec << std::endl; } if (argc == 2 && argv[1][0] == 'r') { // Run tests in reverse order. for (int i = 0; i < N_TESTS / 2; ++i) { std::swap(tests[i], tests[N_TESTS - 1 - i]); std::swap(test_names[i], test_names[N_TESTS - 1 - i]); } } for (int i = 0; i < OUTER_ITERS; ++i) { // Alternate tests to minimize bias due to thermal throttling. for (int j = 0; j < N_TESTS; ++j) { uint64_t start_time = getnanos(); for (int k = 1; k <= INNER_ITERS; ++k) { tests[j](k); // Provides memory accesses between tests. } // Ignore first iteration for all tests. The first iteration of the first test is // empirically slightly slower. if (i != 0) { total_time[j] += getnanos() - start_time; } if ((tests[j] == test1 ? the_volatile : the_atomic.load()) != INNER_ITERS) { std::cerr << "result check failed, test = " << j << ", " << the_volatile << std::endl; exit(1); } } } for (int i = 0; i < N_TESTS; ++i) { double nsecs_per_iter = (double) total_time[i] / INNER_ITERS / (OUTER_ITERS - 1); std::cout << test_names[i] << " took " << nsecs_per_iter << " nseconds per iteration\n"; } exit(0); } [4] Hans Boehm Raw Microbenchmark Results Intel(R) Core(TM) i7-8650U (sanity check only): hboehm@hboehm-glaptop0:~/tests$ ./a.out nsec resolution = 1 control took 1.34812 nseconds per iteration release store took 1.30038 nseconds per iteration store with release fence took 1.2933 nseconds per iteration store with two fences took 18.0474 nseconds per iteration Cortex A53 (Raspberry pi) hboehm@rpi3-20210823:~/tests$ ./a.out nsec resolution = 1 control took 7.15224 nseconds per iteration release store took 10.7282 nseconds per iteration store with release fence took 7.15221 nseconds per iteration store with two fences took 10.013 nseconds per iteration hboehm@rpi3-20210823:~/tests$ ./a.out -r nsec resolution = 1 control took 7.15227 nseconds per iteration release store took 10.7282 nseconds per iteration store with release fence took 7.16133 nseconds per iteration store with two fences took 10.034 nseconds per iteration Cortex A55 (Pixel 6 Pro) raven:/data/tmp # taskset 0f ./release-timer nsec resolution = 1 control took 2.77965 nseconds per iteration release store took 8.89654 nseconds per iteration store with release fence took 4.44787 nseconds per iteration store with two fences took 7.78331 nseconds per iteration raven:/data/tmp # taskset 0f ./release-timer -r nsec resolution = 1 control took 2.78024 nseconds per iteration release store took 8.89574 nseconds per iteration store with release fence took 4.44844 nseconds per iteration store with two fences took 7.78428 nseconds per iteration Cortex A76 (Pixel 6 Pro) raven:/data/tmp # taskset 30 ./release-timer -r nsec resolution = 1 control took 1.77686 nseconds per iteration release store took 1.81081 nseconds per iteration store with release fence took 5.3301 nseconds per iteration store with two fences took 8.88346 nseconds per iteration raven:/data/tmp # taskset 30 ./release-timer nsec resolution = 1 control took 1.78021 nseconds per iteration release store took 1.86095 nseconds per iteration store with release fence took 5.33088 nseconds per iteration store with two fences took 8.88462 nseconds per iteration Cortex X1 (Pixel 6 Pro) raven:/data/tmp # taskset c0 ./release-timer nsec resolution = 1 control took 2.14252 nseconds per iteration release store took 2.14258 nseconds per iteration store with release fence took 4.32982 nseconds per iteration store with two fences took 7.05234 nseconds per iteration raven:/data/tmp # taskset c0 ./release-timer -r nsec resolution = 1 control took 2.14254 nseconds per iteration release store took 2.14251 nseconds per iteration store with release fence took 4.3273 nseconds per iteration store with two fences took 7.05239 nseconds per iteration [5] Hans Boehm almost-all-load Microbenchmark: // Copyright 2023 Google LLC. // SPDX-License-Identifier: Apache-2.0 #include <atomic> #include <iostream> #include <time.h> static constexpr int INNER_ITERS = 10'000'000; static constexpr int OUTER_ITERS = 20; static constexpr int N_TESTS = 4; volatile int the_volatile(17); std::atomic<int> the_atomic(17); int test1() { return the_volatile; } int test2() { return the_atomic.load(std::memory_order_acquire); } int test3() { int result = the_atomic.load(std::memory_order_relaxed); atomic_thread_fence(std::memory_order_acquire); return result; } int test4() { atomic_thread_fence(std::memory_order_seq_cst); int result = the_atomic.load(std::memory_order_relaxed); atomic_thread_fence(std::memory_order_acquire); return result; } typedef int (*int_func)(); uint64_t getnanos() { struct timespec result; if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &result) != 0) { std::cerr << "clock_gettime() failed\n"; exit(1); } return (uint64_t)result.tv_nsec + 1'000'000'000 * (uint64_t)result.tv_sec; } int_func tests[N_TESTS] = { test1, test2, test3, test4 }; const char *test_names[N_TESTS] = { "control", "acquire load", "load with acquire fence", "load with two fences" }; uint64_t total_time[N_TESTS] = { 0 }; uint sum, last_sum = 0; int main(int argc, char **argv) { struct timespec res; if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0) { std::cerr << "clock_getres() failed\n"; exit(1); } else { std::cout << "nsec resolution = " << res.tv_nsec << std::endl; } if (argc == 2 && argv[1][0] == 'r') { // Run tests in reverse order. for (int i = 0; i < N_TESTS / 2; ++i) { std::swap(tests[i], tests[N_TESTS - 1 - i]); std::swap(test_names[i], test_names[N_TESTS - 1 - i]); } } for (int i = 0; i < OUTER_ITERS; ++i) { // Alternate tests to minimize bias due to thermal throttling. for (int j = 0; j < N_TESTS; ++j) { sum = 0; uint64_t start_time = getnanos(); for (int k = 0; k < INNER_ITERS; ++k) { sum += tests[j](); // Provides memory accesses between tests. } // Ignore first iteration for all tests. The first iteration of the first test is // empirically slightly slower. if (i != 0) { total_time[j] += getnanos() - start_time; } if (sum == 0 || last_sum != 0 && sum != last_sum) { std::cerr << "result check failed"; exit(1); } last_sum = sum; } } for (int i = 0; i < N_TESTS; ++i) { double nsecs_per_iter = (double) total_time[i] / INNER_ITERS / (OUTER_ITERS - 1); std::cout << test_names[i] << " took " << nsecs_per_iter << " nseconds per iteration\n"; } exit(0); } Patrick O'Neill (11): RISC-V: Eliminate SYNC memory models RISC-V: Enforce Libatomic LR/SC SEQ_CST RISC-V: Enforce subword atomic LR/SC SEQ_CST RISC-V: Enforce atomic compare_exchange SEQ_CST RISC-V: Add AMO release bits RISC-V: Strengthen atomic stores RISC-V: Eliminate AMO op fences RISC-V: Weaken LR/SC pairs RISC-V: Weaken mem_thread_fence RISC-V: Weaken atomic loads RISC-V: Table A.6 conformance tests gcc/config/riscv/riscv-protos.h | 3 + gcc/config/riscv/riscv.cc | 66 ++++-- gcc/config/riscv/sync.md | 194 ++++++++++++------ .../riscv/amo-table-a-6-amo-add-1.c | 8 + .../riscv/amo-table-a-6-amo-add-2.c | 8 + .../riscv/amo-table-a-6-amo-add-3.c | 8 + .../riscv/amo-table-a-6-amo-add-4.c | 8 + .../riscv/amo-table-a-6-amo-add-5.c | 8 + .../riscv/amo-table-a-6-compare-exchange-1.c | 10 + .../riscv/amo-table-a-6-compare-exchange-2.c | 10 + .../riscv/amo-table-a-6-compare-exchange-3.c | 10 + .../riscv/amo-table-a-6-compare-exchange-4.c | 10 + .../riscv/amo-table-a-6-compare-exchange-5.c | 10 + .../riscv/amo-table-a-6-compare-exchange-6.c | 11 + .../riscv/amo-table-a-6-compare-exchange-7.c | 10 + .../gcc.target/riscv/amo-table-a-6-fence-1.c | 8 + .../gcc.target/riscv/amo-table-a-6-fence-2.c | 10 + .../gcc.target/riscv/amo-table-a-6-fence-3.c | 10 + .../gcc.target/riscv/amo-table-a-6-fence-4.c | 10 + .../gcc.target/riscv/amo-table-a-6-fence-5.c | 10 + .../gcc.target/riscv/amo-table-a-6-load-1.c | 9 + .../gcc.target/riscv/amo-table-a-6-load-2.c | 11 + .../gcc.target/riscv/amo-table-a-6-load-3.c | 11 + .../gcc.target/riscv/amo-table-a-6-store-1.c | 9 + .../gcc.target/riscv/amo-table-a-6-store-2.c | 11 + .../riscv/amo-table-a-6-store-compat-3.c | 11 + .../riscv/amo-table-a-6-subword-amo-add-1.c | 9 + .../riscv/amo-table-a-6-subword-amo-add-2.c | 9 + .../riscv/amo-table-a-6-subword-amo-add-3.c | 9 + .../riscv/amo-table-a-6-subword-amo-add-4.c | 9 + .../riscv/amo-table-a-6-subword-amo-add-5.c | 9 + gcc/testsuite/gcc.target/riscv/pr89835.c | 9 + libgcc/config/riscv/atomic.c | 4 +- 33 files changed, 467 insertions(+), 75 deletions(-) create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-3.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-4.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-5.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-3.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-4.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-5.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-6.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-7.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-3.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-4.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-5.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-3.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-compat-3.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-1.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-2.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-3.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-4.c create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-5.c create mode 100644 gcc/testsuite/gcc.target/riscv/pr89835.c
Comments
On Thu, Apr 27, 2023 at 09:22:50AM -0700, Patrick O'Neill wrote: > This patchset aims to make the RISCV atomics implementation stronger > than the recommended mapping present in table A.6 of the ISA manual. > https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 > > Context > --------- > GCC defined RISC-V mappings [1] before the Memory Model task group > finalized their work and provided the ISA Manual Table A.6/A.7 mappings[2]. > > For at least a year now, we've known that the mappings were different, > but it wasn't clear if these unique mappings had correctness issues. > > Andrea Parri found an issue with the GCC mappings, showing that > atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) mappings do > not enforce release ordering guarantees. (Meaning the GCC mappings have > a correctness issue). > https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ > > Why not A.6? > --------- > We can update our mappings now, so the obvious choice would be to > implement Table A.6 (what LLVM implements/ISA manual recommends). > > The reason why that isn't the best path forward for GCC is due to a > proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl. > > For context, there is discussion about fast-tracking the addition of > these instructions. The RISCV architectural review committee supports > adopting a "new and common atomics ABI for gcc and LLVM toochains ... > that assumes the addition of the preceding instructions”. That common > ABI is likely to be A.7. > https://lists.riscv.org/g/tech-privileged/message/1284 > > Transitioning from A.6 to A.7 will cause an ABI break. We can hedge > against that risk by emitting a conservative fence after SEQ_CST stores > to make the mapping compatible with both A.6 and A.7. > > What does a mapping compatible with both A.6 & A.7 look like? > --------- > It is exactly the same as Table A.6, but SEQ_CST stores have a trailing > fence rw,rw. It's strictly stronger than Table A.6. > > Microbenchmark > --------- > Hans Boehm helpfully wrote a microbenchmark [3] that uses ARM to give a > rough estimate for the performance benefits/penalties of the different > mappings. The microbenchmark is single threaded and almost-write-only. > This case seems unlikely but is useful for getting a rough idea of the > workload that would be impacted the most. > > Testcases > ------- > Control: A simple volatile store. This is most similar to a relaxed > store. > Release Store: This is most similar to Sw.rl (one of the instructions in > Hans' proposal). > Store with release fence: This is most similar to the mapping present in > Table A.6. > Store with two fences: This is most similar to the compatibility mapping > present in this patchset. > > Machines > ------- > Intel(R) Core(TM) i7-8650U (sanity check only): x86 TSO > Cortex A53 (Raspberry pi): ARM In order core > Cortex A55 (Pixel 6 Pro): ARM In order core > Cortex A76 (Pixel 6 Pro): ARM Out of order core > Cortex X1 (Pixel 6 Pro): ARM Out of order core > > Microbenchmark Results [4] > -------- > Units are nsecs per iteration. > > Sanity check > Machine CONTROL REL_STORE STORE_REL_FENCE STORE_TWO_FENCE > ------- ------- --------- --------------- --------------- > Intel i7-8650U 1.34812 1.30038 1.2933 18.0474 > > > Machine CONTROL REL_STORE STORE_REL_FENCE STORE_TWO_FENCE > ------- ------- --------- --------------- --------------- > Cortex A53 7.15224 10.7282 7.15221 10.013 > Cortex A55 2.77965 8.89654 4.44787 7.78331 > Cortex A76 1.78021 1.86095 5.33088 8.88462 > Cortex X1 2.14252 2.14258 4.32982 7.05234 > > Reordered tests (using -r flag on microbenchmark) > Machine CONTROL REL_STORE STORE_REL_FENCE STORE_TWO_FENCE > ------- ------- --------- --------------- --------------- > Cortex A53 7.15227 10.7282 7.16113 10.034 > Cortex A55 2.78024 8.89574 4.44844 7.78428 > Cortex A76 1.77686 1.81081 5.3301 8.88346 > Cortex X1 2.14254 2.14251 4.3273 7.05239 > > Benchmark Interpretation > -------- > As expected, out of order machines are significantly faster with the > REL_STORE mappings. Unexpectedly, the in-order machines are > significantly slower with REL_STORE rather than REL_STORE_FENCE. > > Most machines in the wild are expected to use Table A.7 once the > instructions are introduced. > Incurring this added cost now will make it easier for compiled RISC-V > binaries to transition to the A.7 memory model mapping. > > The performance benefits of moving to A.7 can be more clearly seen using > an almost-all-load microbenchmark (included on page 3 of Hans’ > proposal). The code for that microbenchmark is attached below [5]. > https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf > https://lists.riscv.org/g/tech-unprivileged/topic/92916241 > > Caveats > -------- > This is a very synthetic microbenchmark that represents what is expected > to be a very unlikely workload. Nevertheless, it's helpful to see the > worst-case price we are paying for compatibility. > > “All times include an entire loop iteration, indirect dispatch and all. > The benchmark alternates tests, but does not lock CPU frequency. Since a > single core was in use, I expect this was running at basically full > speed. Any throttling affected everything more or less uniformly.” > - Hans Boehm > > Patchset overview > -------- > Patch 1 simplifies the memmodel to ignore MEMMODEL_SYNC_* cases (legacy > cases that aren't handled differently for RISC-V). > Patches 2-6 make the mappings strictly stronger. > Patches 7-9 weaken the mappings to be in line with table A.6 of the ISA > manual. > Patch 11 adds some basic conformance tests to ensure the implemented > mapping matches table A.6 with stronger SEQ_CST stores. > > Conformance test cases notes > -------- > The conformance tests in this patch are a good sanity check but do not > guarantee exactly following Table A.6. It checks that the right > instructions are emitted (ex. fence rw,r) but not the order of those > instructions. > > LLVM mapping notes > -------- > LLVM emits corresponding fences for atomic_signal_fence instructions. > This seems to be an oversight since AFAIK atomic_signal_fence acts as a > compiler directive. GCC does not emit any fences for atomic_signal_fence > instructions. > > Future work > -------- > There still remains some work to be done in this space after this > patchset fixes the correctness of the GCC mappings. > * Look into explicitly handling subword loads/stores. > * Look into using AMOSWAP.rl for store words/doubles. > * L{b|h|w|d}.aq/rl & S{b|h|w|d}.aq/rl support once ratified. > * zTSO mappings. > > Prior Patchsets > -------- > Patchset v1: > https://gcc.gnu.org/pipermail/gcc-patches/2022-April/592950.html > > Patchset v2: > https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615264.html > > Patchset v3: > https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615431.html > > Patchset v4: > https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615748.html > > Changelogs > -------- > Changes for v2: > * Use memmodel_base rather than a custom simplify_memmodel function > (Inspired by Christoph Muellner's patch 1/9) > * Move instruction styling change from [v1 5/7] to [v2 3/8] to reduce > [v2 6/8]'s complexity > * Eliminated %K flag for atomic store introduced in v1 in favor of > if/else > * Rebase/test > > Changes for v3: > * Use a trailing fence for atomic stores to be compatible with table A.7 > * Emit an optimized fence r,rw following a SEQ_CST load > * Consolidate tests in [PATCH v3 10/10] > * Add tests for basic A.6 conformance > > Changes for v4: > * Update cover letter to cover more of the reasoning behind moving to a > compatibility mapping > * Improve conformance testcases patch assertions and add new > compare-exchange testcases > > Changes for v5: > * Update cover letter to cover more context and reasoning behind moving > to a compatibility mapping > * Rebase to include the subword-atomic cases introduced here: > https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616080.html > * Add basic amo-add subword atomic testcases > * Reformat changelogs > * Fix misc. whitespace issues > > [1] GCC port with mappings merged 06 Feb 2017 > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=09cae7507d9e88f2b05cf3a9404bf181e65ccbac > > [2] A.6 mappings added to ISA manual 12 Dec 2017 > https://github.com/riscv/riscv-isa-manual/commit/9da1a115bcc4fe327f35acceb851d4850d12e9fa > > [3] Hans Boehm almost-all-store Microbenchmark: > // Copyright 2023 Google LLC. > // SPDX-License-Identifier: Apache-2.0 > > #include <atomic> > #include <iostream> > #include <time.h> > > static constexpr int INNER_ITERS = 10'000'000; > static constexpr int OUTER_ITERS = 20; > static constexpr int N_TESTS = 4; > > volatile int the_volatile(17); > std::atomic<int> the_atomic(17); > > void test1(int i) { > the_volatile = i; > } > > void test2(int i) { > the_atomic.store(i, std::memory_order_release); > } > > void test3(int i) { > atomic_thread_fence(std::memory_order_release); > the_atomic.store(i, std::memory_order_relaxed); > } > > void test4(int i) { > atomic_thread_fence(std::memory_order_release); > the_atomic.store(i, std::memory_order_relaxed); > atomic_thread_fence(std::memory_order_seq_cst); > } > > typedef void (*int_func)(int); > > uint64_t getnanos() { > struct timespec result; > if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &result) != 0) { > std::cerr << "clock_gettime() failed\n"; > exit(1); > } > return (uint64_t)result.tv_nsec + 1'000'000'000 * (uint64_t)result.tv_sec; > } > > int_func tests[N_TESTS] = { test1, test2, test3, test4 }; > const char *test_names[N_TESTS] = > { "control", "release store", "store with release fence", "store with two fences" }; > uint64_t total_time[N_TESTS] = { 0 }; > > int main(int argc, char **argv) { > struct timespec res; > if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0) { > std::cerr << "clock_getres() failed\n"; > exit(1); > } else { > std::cout << "nsec resolution = " << res.tv_nsec << std::endl; > } > if (argc == 2 && argv[1][0] == 'r') { > // Run tests in reverse order. > for (int i = 0; i < N_TESTS / 2; ++i) { > std::swap(tests[i], tests[N_TESTS - 1 - i]); > std::swap(test_names[i], test_names[N_TESTS - 1 - i]); > } > } > for (int i = 0; i < OUTER_ITERS; ++i) { > // Alternate tests to minimize bias due to thermal throttling. > for (int j = 0; j < N_TESTS; ++j) { > uint64_t start_time = getnanos(); > for (int k = 1; k <= INNER_ITERS; ++k) { > tests[j](k); // Provides memory accesses between tests. > } > // Ignore first iteration for all tests. The first iteration of the first test is > // empirically slightly slower. > if (i != 0) { > total_time[j] += getnanos() - start_time; > } > if ((tests[j] == test1 ? the_volatile : the_atomic.load()) != INNER_ITERS) { > std::cerr << "result check failed, test = " << j << ", " << the_volatile << std::endl; > exit(1); > } > } > } > for (int i = 0; i < N_TESTS; ++i) { > double nsecs_per_iter = (double) total_time[i] / INNER_ITERS / (OUTER_ITERS - 1); > std::cout << test_names[i] << " took " << nsecs_per_iter << " nseconds per iteration\n"; > } > exit(0); > } > > [4] Hans Boehm Raw Microbenchmark Results > Intel(R) Core(TM) i7-8650U (sanity check only): > > hboehm@hboehm-glaptop0:~/tests$ ./a.out > nsec resolution = 1 > control took 1.34812 nseconds per iteration > release store took 1.30038 nseconds per iteration > store with release fence took 1.2933 nseconds per iteration > store with two fences took 18.0474 nseconds per iteration > > Cortex A53 (Raspberry pi) > hboehm@rpi3-20210823:~/tests$ ./a.out > nsec resolution = 1 > control took 7.15224 nseconds per iteration > release store took 10.7282 nseconds per iteration > store with release fence took 7.15221 nseconds per iteration > store with two fences took 10.013 nseconds per iteration > hboehm@rpi3-20210823:~/tests$ ./a.out -r > nsec resolution = 1 > control took 7.15227 nseconds per iteration > release store took 10.7282 nseconds per iteration > store with release fence took 7.16133 nseconds per iteration > store with two fences took 10.034 nseconds per iteration > > Cortex A55 (Pixel 6 Pro) > > raven:/data/tmp # taskset 0f ./release-timer > nsec resolution = 1 > control took 2.77965 nseconds per iteration > release store took 8.89654 nseconds per iteration > store with release fence took 4.44787 nseconds per iteration > store with two fences took 7.78331 nseconds per iteration > raven:/data/tmp # taskset 0f ./release-timer -r > nsec resolution = 1 > control took 2.78024 nseconds per iteration > release store took 8.89574 nseconds per iteration > store with release fence took 4.44844 nseconds per iteration > store with two fences took 7.78428 nseconds per iteration > > Cortex A76 (Pixel 6 Pro) > raven:/data/tmp # taskset 30 ./release-timer -r > nsec resolution = 1 > control took 1.77686 nseconds per iteration > release store took 1.81081 nseconds per iteration > store with release fence took 5.3301 nseconds per iteration > store with two fences took 8.88346 nseconds per iteration > raven:/data/tmp # taskset 30 ./release-timer > nsec resolution = 1 > control took 1.78021 nseconds per iteration > release store took 1.86095 nseconds per iteration > store with release fence took 5.33088 nseconds per iteration > store with two fences took 8.88462 nseconds per iteration > > Cortex X1 (Pixel 6 Pro) > raven:/data/tmp # taskset c0 ./release-timer > nsec resolution = 1 > control took 2.14252 nseconds per iteration > release store took 2.14258 nseconds per iteration > store with release fence took 4.32982 nseconds per iteration > store with two fences took 7.05234 nseconds per iteration > raven:/data/tmp # taskset c0 ./release-timer -r > nsec resolution = 1 > control took 2.14254 nseconds per iteration > release store took 2.14251 nseconds per iteration > store with release fence took 4.3273 nseconds per iteration > store with two fences took 7.05239 nseconds per iteration > > [5] Hans Boehm almost-all-load Microbenchmark: > // Copyright 2023 Google LLC. > // SPDX-License-Identifier: Apache-2.0 > > #include <atomic> > #include <iostream> > #include <time.h> > > static constexpr int INNER_ITERS = 10'000'000; > static constexpr int OUTER_ITERS = 20; > static constexpr int N_TESTS = 4; > > volatile int the_volatile(17); > std::atomic<int> the_atomic(17); > > int test1() { > return the_volatile; > } > > int test2() { > return the_atomic.load(std::memory_order_acquire); > } > > int test3() { > int result = the_atomic.load(std::memory_order_relaxed); > atomic_thread_fence(std::memory_order_acquire); > return result; > } > > int test4() { > atomic_thread_fence(std::memory_order_seq_cst); > int result = the_atomic.load(std::memory_order_relaxed); > atomic_thread_fence(std::memory_order_acquire); > return result; > } > > typedef int (*int_func)(); > > uint64_t getnanos() { > struct timespec result; > if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &result) != 0) { > std::cerr << "clock_gettime() failed\n"; > exit(1); > } > return (uint64_t)result.tv_nsec + 1'000'000'000 * (uint64_t)result.tv_sec; > } > > int_func tests[N_TESTS] = { test1, test2, test3, test4 }; > const char *test_names[N_TESTS] = > { "control", "acquire load", "load with acquire fence", "load with two fences" }; > uint64_t total_time[N_TESTS] = { 0 }; > > uint sum, last_sum = 0; > > int main(int argc, char **argv) { > struct timespec res; > if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0) { > std::cerr << "clock_getres() failed\n"; > exit(1); > } else { > std::cout << "nsec resolution = " << res.tv_nsec << std::endl; > } > if (argc == 2 && argv[1][0] == 'r') { > // Run tests in reverse order. > for (int i = 0; i < N_TESTS / 2; ++i) { > std::swap(tests[i], tests[N_TESTS - 1 - i]); > std::swap(test_names[i], test_names[N_TESTS - 1 - i]); > } > } > for (int i = 0; i < OUTER_ITERS; ++i) { > // Alternate tests to minimize bias due to thermal throttling. > for (int j = 0; j < N_TESTS; ++j) { > sum = 0; > uint64_t start_time = getnanos(); > for (int k = 0; k < INNER_ITERS; ++k) { > sum += tests[j](); // Provides memory accesses between tests. > } > // Ignore first iteration for all tests. The first iteration of the first test is > // empirically slightly slower. > if (i != 0) { > total_time[j] += getnanos() - start_time; > } > if (sum == 0 || last_sum != 0 && sum != last_sum) { > std::cerr << "result check failed"; > exit(1); > } > last_sum = sum; > } > } > for (int i = 0; i < N_TESTS; ++i) { > double nsecs_per_iter = (double) total_time[i] / INNER_ITERS / (OUTER_ITERS - 1); > std::cout << test_names[i] << " took " << nsecs_per_iter << " nseconds per iteration\n"; > } > exit(0); > } > > Patrick O'Neill (11): > RISC-V: Eliminate SYNC memory models > RISC-V: Enforce Libatomic LR/SC SEQ_CST > RISC-V: Enforce subword atomic LR/SC SEQ_CST > RISC-V: Enforce atomic compare_exchange SEQ_CST > RISC-V: Add AMO release bits > RISC-V: Strengthen atomic stores > RISC-V: Eliminate AMO op fences > RISC-V: Weaken LR/SC pairs > RISC-V: Weaken mem_thread_fence > RISC-V: Weaken atomic loads > RISC-V: Table A.6 conformance tests > > gcc/config/riscv/riscv-protos.h | 3 + > gcc/config/riscv/riscv.cc | 66 ++++-- > gcc/config/riscv/sync.md | 194 ++++++++++++------ > .../riscv/amo-table-a-6-amo-add-1.c | 8 + > .../riscv/amo-table-a-6-amo-add-2.c | 8 + > .../riscv/amo-table-a-6-amo-add-3.c | 8 + > .../riscv/amo-table-a-6-amo-add-4.c | 8 + > .../riscv/amo-table-a-6-amo-add-5.c | 8 + > .../riscv/amo-table-a-6-compare-exchange-1.c | 10 + > .../riscv/amo-table-a-6-compare-exchange-2.c | 10 + > .../riscv/amo-table-a-6-compare-exchange-3.c | 10 + > .../riscv/amo-table-a-6-compare-exchange-4.c | 10 + > .../riscv/amo-table-a-6-compare-exchange-5.c | 10 + > .../riscv/amo-table-a-6-compare-exchange-6.c | 11 + > .../riscv/amo-table-a-6-compare-exchange-7.c | 10 + > .../gcc.target/riscv/amo-table-a-6-fence-1.c | 8 + > .../gcc.target/riscv/amo-table-a-6-fence-2.c | 10 + > .../gcc.target/riscv/amo-table-a-6-fence-3.c | 10 + > .../gcc.target/riscv/amo-table-a-6-fence-4.c | 10 + > .../gcc.target/riscv/amo-table-a-6-fence-5.c | 10 + > .../gcc.target/riscv/amo-table-a-6-load-1.c | 9 + > .../gcc.target/riscv/amo-table-a-6-load-2.c | 11 + > .../gcc.target/riscv/amo-table-a-6-load-3.c | 11 + > .../gcc.target/riscv/amo-table-a-6-store-1.c | 9 + > .../gcc.target/riscv/amo-table-a-6-store-2.c | 11 + > .../riscv/amo-table-a-6-store-compat-3.c | 11 + > .../riscv/amo-table-a-6-subword-amo-add-1.c | 9 + > .../riscv/amo-table-a-6-subword-amo-add-2.c | 9 + > .../riscv/amo-table-a-6-subword-amo-add-3.c | 9 + > .../riscv/amo-table-a-6-subword-amo-add-4.c | 9 + > .../riscv/amo-table-a-6-subword-amo-add-5.c | 9 + > gcc/testsuite/gcc.target/riscv/pr89835.c | 9 + > libgcc/config/riscv/atomic.c | 4 +- > 33 files changed, 467 insertions(+), 75 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-1.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-2.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-3.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-4.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-5.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-1.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-2.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-3.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-4.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-5.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-6.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-7.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-1.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-2.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-3.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-4.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-5.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-1.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-2.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-3.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-1.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-2.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-compat-3.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-1.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-2.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-3.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-4.c > create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-5.c > create mode 100644 gcc/testsuite/gcc.target/riscv/pr89835.c These changes address and fix all the issues I reported/I'm aware of, thank you! Tested-by: Andrea Parri <andrea@rivosinc.com> Andrea
On 4/27/23 10:22, Patrick O'Neill wrote: > This patchset aims to make the RISCV atomics implementation stronger > than the recommended mapping present in table A.6 of the ISA manual. > https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 > > Context > --------- > GCC defined RISC-V mappings [1] before the Memory Model task group > finalized their work and provided the ISA Manual Table A.6/A.7 mappings[2]. > > For at least a year now, we've known that the mappings were different, > but it wasn't clear if these unique mappings had correctness issues. > > Andrea Parri found an issue with the GCC mappings, showing that > atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) mappings do > not enforce release ordering guarantees. (Meaning the GCC mappings have > a correctness issue). > https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ Right. I recall this discussion, but thanks for the back reference. > > Why not A.6? > --------- > We can update our mappings now, so the obvious choice would be to > implement Table A.6 (what LLVM implements/ISA manual recommends). > > The reason why that isn't the best path forward for GCC is due to a > proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl. > > For context, there is discussion about fast-tracking the addition of > these instructions. The RISCV architectural review committee supports > adopting a "new and common atomics ABI for gcc and LLVM toochains ... > that assumes the addition of the preceding instructions”. That common > ABI is likely to be A.7. > https://lists.riscv.org/g/tech-privileged/message/1284 > > Transitioning from A.6 to A.7 will cause an ABI break. We can hedge > against that risk by emitting a conservative fence after SEQ_CST stores > to make the mapping compatible with both A.6 and A.7. So I like that we can have compatible sequences across A.6 and A.7. Of course the concern is performance ;-) > > What does a mapping compatible with both A.6 & A.7 look like? > --------- > It is exactly the same as Table A.6, but SEQ_CST stores have a trailing > fence rw,rw. It's strictly stronger than Table A.6. Right. So my worry here is silicon that is either already available or coming online shortly. Those implementations simply aren't going to be able to use the A.7 mapping, so they pay a penalty. Does it make sense to have the compatibility fences conditional? > > Benchmark Interpretation > -------- > As expected, out of order machines are significantly faster with the > REL_STORE mappings. Unexpectedly, the in-order machines are > significantly slower with REL_STORE rather than REL_STORE_FENCE. Yea, that's a bit of a surprise. > > Most machines in the wild are expected to use Table A.7 once the > instructions are introduced. > Incurring this added cost now will make it easier for compiled RISC-V > binaries to transition to the A.7 memory model mapping. > > The performance benefits of moving to A.7 can be more clearly seen using > an almost-all-load microbenchmark (included on page 3 of Hans’ > proposal). The code for that microbenchmark is attached below [5]. > https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf > https://lists.riscv.org/g/tech-unprivileged/topic/92916241 Yea. I'm not questioning the value of the new instructions that are on the horizon, just the value of trying to make everything A.7 compatible. > > Conformance test cases notes > -------- > The conformance tests in this patch are a good sanity check but do not > guarantee exactly following Table A.6. It checks that the right > instructions are emitted (ex. fence rw,r) but not the order of those > instructions. Note there is a way to check ordering as well. You might look at the check-function-bodies approach. I think there are some recent examples in the gcc risc-v specific tests. > > LLVM mapping notes > -------- > LLVM emits corresponding fences for atomic_signal_fence instructions. > This seems to be an oversight since AFAIK atomic_signal_fence acts as a > compiler directive. GCC does not emit any fences for atomic_signal_fence > instructions. This starts to touch on a larger concern. Specifically I'd really like the two compilers to be compatible in terms of the code they generate for the various atomics. What I worry about is code being written (by design or accident) that is dependent on the particular behavior of one compiler and then if that code gets built with the other compiler, and we end up different behavior. Worse yet, if/when this happens, it's likely to be tough to expose, reproduce & debug. Do you have any sense of where Clang/LLVM is going to go WRT providing an A.6 mapping that is compatible with A.7 by using the additional fences? Jeff
On Fri, 28 Apr 2023 09:14:00 PDT (-0700), jeffreyalaw@gmail.com wrote: > > > On 4/27/23 10:22, Patrick O'Neill wrote: >> This patchset aims to make the RISCV atomics implementation stronger >> than the recommended mapping present in table A.6 of the ISA manual. >> https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 >> >> Context >> --------- >> GCC defined RISC-V mappings [1] before the Memory Model task group >> finalized their work and provided the ISA Manual Table A.6/A.7 mappings[2]. >> >> For at least a year now, we've known that the mappings were different, >> but it wasn't clear if these unique mappings had correctness issues. >> >> Andrea Parri found an issue with the GCC mappings, showing that >> atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) mappings do >> not enforce release ordering guarantees. (Meaning the GCC mappings have >> a correctness issue). >> https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ > Right. I recall this discussion, but thanks for the back reference. Yep, and it's an important one: that's why we're calling the change a bug fix and dropping the current GCC mappings. If we didn't have the bug we'd be talking about an ABI break, and since the GCC mappings predate the ISA mappings we'd likely need an additional compatibility mode. So I guess we're lucky that we have a concurrency bug. I think it's the first time I've said that ;) >> Why not A.6? >> --------- >> We can update our mappings now, so the obvious choice would be to >> implement Table A.6 (what LLVM implements/ISA manual recommends). >> >> The reason why that isn't the best path forward for GCC is due to a >> proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl. >> >> For context, there is discussion about fast-tracking the addition of >> these instructions. The RISCV architectural review committee supports >> adopting a "new and common atomics ABI for gcc and LLVM toochains ... >> that assumes the addition of the preceding instructions”. That common >> ABI is likely to be A.7. >> https://lists.riscv.org/g/tech-privileged/message/1284 >> >> Transitioning from A.6 to A.7 will cause an ABI break. We can hedge >> against that risk by emitting a conservative fence after SEQ_CST stores >> to make the mapping compatible with both A.6 and A.7. > So I like that we can have compatible sequences across A.6 and A.7. Of > course the concern is performance ;-) > > >> >> What does a mapping compatible with both A.6 & A.7 look like? >> --------- >> It is exactly the same as Table A.6, but SEQ_CST stores have a trailing >> fence rw,rw. It's strictly stronger than Table A.6. > Right. So my worry here is silicon that is either already available or > coming online shortly. Those implementations simply aren't going to be > able to use the A.7 mapping, so they pay a penalty. Does it make sense > to have the compatibility fences conditional? IIRC this was discussed somewhere in some thread, but I think there's really three ABIs that could be implemented here (ignoring the current GCC mappings as they're broken): * ABI compatible with the current mappings in the ISA manual (A.6). This will presumably perform best on extant hardware, given that it's what the words in the PDF say to do. * ABI compatible with the proposed mappings for the ISA manual (A.7). This may perform better on new hardware. * ABI compatible with both A.6 and A.7. This is likely slow on both new and old hardware, but allows cross-linking. If there's no performance issues this would be the only mode we need, but that seems unlikely. IMO those should be encoded somewhere in the ELF. I'd just do it as two bits in the header, but last time I proposed header bits the psABI folks wanted to do something different. I don't think where we encode this matters all that much, but if we're doing to treat these as real long-term ABIs we should have some way to encode that. There's also the orthogonal axis of whether we use the new instructions. Those aren't in specs yet so I think we can hold off on them for a bit, but they're the whole point of doing the ABI break so we should at least think them over. I think we're OK because we've just split out the ABI from the ISA here, but I'm not sure if I'm missing something. Now that I wrote that, though, I remember talking to Patrick about it and we drew a bunch of stuff on the whiteboard and then got confused. So sorry if I'm just out of the loop here... > > > >> >> Benchmark Interpretation >> -------- >> As expected, out of order machines are significantly faster with the >> REL_STORE mappings. Unexpectedly, the in-order machines are >> significantly slower with REL_STORE rather than REL_STORE_FENCE. > Yea, that's a bit of a surprise. > >> >> Most machines in the wild are expected to use Table A.7 once the >> instructions are introduced. >> Incurring this added cost now will make it easier for compiled RISC-V >> binaries to transition to the A.7 memory model mapping. >> >> The performance benefits of moving to A.7 can be more clearly seen using >> an almost-all-load microbenchmark (included on page 3 of Hans’ >> proposal). The code for that microbenchmark is attached below [5]. >> https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf >> https://lists.riscv.org/g/tech-unprivileged/topic/92916241 > Yea. I'm not questioning the value of the new instructions that are on > the horizon, just the value of trying to make everything A.7 compatible. > > >> >> Conformance test cases notes >> -------- >> The conformance tests in this patch are a good sanity check but do not >> guarantee exactly following Table A.6. It checks that the right >> instructions are emitted (ex. fence rw,r) but not the order of those >> instructions. > Note there is a way to check ordering as well. You might look at the > check-function-bodies approach. I think there are some recent examples > in the gcc risc-v specific tests. > > >> >> LLVM mapping notes >> -------- >> LLVM emits corresponding fences for atomic_signal_fence instructions. >> This seems to be an oversight since AFAIK atomic_signal_fence acts as a >> compiler directive. GCC does not emit any fences for atomic_signal_fence >> instructions. > This starts to touch on a larger concern. Specifically I'd really like > the two compilers to be compatible in terms of the code they generate > for the various atomics. > > What I worry about is code being written (by design or accident) that is > dependent on the particular behavior of one compiler and then if that > code gets built with the other compiler, and we end up different > behavior. Worse yet, if/when this happens, it's likely to be tough to > expose, reproduce & debug. > > Do you have any sense of where Clang/LLVM is going to go WRT providing > an A.6 mapping that is compatible with A.7 by using the additional fences? > > > Jeff
On 4/28/23 09:29, Palmer Dabbelt wrote: > On Fri, 28 Apr 2023 09:14:00 PDT (-0700), jeffreyalaw@gmail.com wrote: >> >> >> On 4/27/23 10:22, Patrick O'Neill wrote: >>> This patchset aims to make the RISCV atomics implementation stronger >>> than the recommended mapping present in table A.6 of the ISA manual. >>> https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 >>> >>> >>> Context >>> --------- >>> GCC defined RISC-V mappings [1] before the Memory Model task group >>> finalized their work and provided the ISA Manual Table A.6/A.7 >>> mappings[2]. >>> >>> For at least a year now, we've known that the mappings were different, >>> but it wasn't clear if these unique mappings had correctness issues. >>> >>> Andrea Parri found an issue with the GCC mappings, showing that >>> atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) >>> mappings do >>> not enforce release ordering guarantees. (Meaning the GCC mappings have >>> a correctness issue). >>> https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ >> Right. I recall this discussion, but thanks for the back reference. > > Yep, and it's an important one: that's why we're calling the change a > bug fix and dropping the current GCC mappings. If we didn't have the > bug we'd be talking about an ABI break, and since the GCC mappings > predate the ISA mappings we'd likely need an additional compatibility > mode. > > So I guess we're lucky that we have a concurrency bug. I think it's > the first time I've said that ;) > >>> Why not A.6? >>> --------- >>> We can update our mappings now, so the obvious choice would be to >>> implement Table A.6 (what LLVM implements/ISA manual recommends). >>> >>> The reason why that isn't the best path forward for GCC is due to a >>> proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl. >>> >>> For context, there is discussion about fast-tracking the addition of >>> these instructions. The RISCV architectural review committee supports >>> adopting a "new and common atomics ABI for gcc and LLVM toochains ... >>> that assumes the addition of the preceding instructions”. That common >>> ABI is likely to be A.7. >>> https://lists.riscv.org/g/tech-privileged/message/1284 >>> >>> Transitioning from A.6 to A.7 will cause an ABI break. We can hedge >>> against that risk by emitting a conservative fence after SEQ_CST stores >>> to make the mapping compatible with both A.6 and A.7. >> So I like that we can have compatible sequences across A.6 and A.7. Of >> course the concern is performance ;-) >> >> >>> >>> What does a mapping compatible with both A.6 & A.7 look like? >>> --------- >>> It is exactly the same as Table A.6, but SEQ_CST stores have a trailing >>> fence rw,rw. It's strictly stronger than Table A.6. >> Right. So my worry here is silicon that is either already available or >> coming online shortly. Those implementations simply aren't going to be >> able to use the A.7 mapping, so they pay a penalty. Does it make sense >> to have the compatibility fences conditional? > > IIRC this was discussed somewhere in some thread, but I think there's > really three ABIs that could be implemented here (ignoring the current > GCC mappings as they're broken): > > * ABI compatible with the current mappings in the ISA manual (A.6). > This will presumably perform best on extant hardware, given that it's > what the words in the PDF say to do. > * ABI compatible with the proposed mappings for the ISA manual (A.7). > This may perform better on new hardware. > * ABI compatible with both A.6 and A.7. This is likely slow on both > new and old hardware, but allows cross-linking. If there's no > performance issues this would be the only mode we need, but that > seems unlikely. > > IMO those should be encoded somewhere in the ELF. I'd just do it as > two bits in the header, but last time I proposed header bits the psABI > folks wanted to do something different. I don't think where we encode > this matters all that much, but if we're doing to treat these as real > long-term ABIs we should have some way to encode that. > > There's also the orthogonal axis of whether we use the new > instructions. Those aren't in specs yet so I think we can hold off on > them for a bit, but they're the whole point of doing the ABI break so > we should at least think them over. I think we're OK because we've > just split out the ABI from the ISA here, but I'm not sure if I'm > missing something. > > Now that I wrote that, though, I remember talking to Patrick about it > and we drew a bunch of stuff on the whiteboard and then got confused. > So sorry if I'm just out of the loop here... This looks up-to-date with how I understand it. >> >>> >>> Benchmark Interpretation >>> -------- >>> As expected, out of order machines are significantly faster with the >>> REL_STORE mappings. Unexpectedly, the in-order machines are >>> significantly slower with REL_STORE rather than REL_STORE_FENCE. >> Yea, that's a bit of a surprise. >> >>> >>> Most machines in the wild are expected to use Table A.7 once the >>> instructions are introduced. >>> Incurring this added cost now will make it easier for compiled RISC-V >>> binaries to transition to the A.7 memory model mapping. >>> >>> The performance benefits of moving to A.7 can be more clearly seen >>> using >>> an almost-all-load microbenchmark (included on page 3 of Hans’ >>> proposal). The code for that microbenchmark is attached below [5]. >>> https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf >>> https://lists.riscv.org/g/tech-unprivileged/topic/92916241 >> Yea. I'm not questioning the value of the new instructions that are on >> the horizon, just the value of trying to make everything A.7 compatible. >> >> >>> >>> Conformance test cases notes >>> -------- >>> The conformance tests in this patch are a good sanity check but do not >>> guarantee exactly following Table A.6. It checks that the right >>> instructions are emitted (ex. fence rw,r) but not the order of those >>> instructions. >> Note there is a way to check ordering as well. You might look at the >> check-function-bodies approach. I think there are some recent examples >> in the gcc risc-v specific tests. I'll check that out - thank you! >>> >>> LLVM mapping notes >>> -------- >>> LLVM emits corresponding fences for atomic_signal_fence instructions. >>> This seems to be an oversight since AFAIK atomic_signal_fence acts as a >>> compiler directive. GCC does not emit any fences for >>> atomic_signal_fence >>> instructions. >> This starts to touch on a larger concern. Specifically I'd really like >> the two compilers to be compatible in terms of the code they generate >> for the various atomics. >> >> What I worry about is code being written (by design or accident) that is >> dependent on the particular behavior of one compiler and then if that >> code gets built with the other compiler, and we end up different >> behavior. Worse yet, if/when this happens, it's likely to be tough to >> expose, reproduce & debug. Agreed. I'll open an issue with LLVM and see what they have to say about this particular behavior. Ideally we'd have perfectly compatible compilers (for atomic ops) by the end of this :) AFAICT GCC hasn't ever been emitting fences for these instructions. (& This behavior isn't touched by the patchset). >> >> Do you have any sense of where Clang/LLVM is going to go WRT providing >> an A.6 mapping that is compatible with A.7 by using the additional >> fences? I don't - Based on how LLVM initially waited for A.6, I'd speculate that they would wait for an official compatibility mapping to be added the PSABI doc or ISA manual before implementing it. I imagine there will be demand for an ABI-compatible mapping so the ABI break has a transition plan, but the timeline for LLVM isn't clear to me. Patrick >> >> Jeff
On 4/28/23 10:44, Patrick O'Neill wrote: > On 4/28/23 09:29, Palmer Dabbelt wrote: >> On Fri, 28 Apr 2023 09:14:00 PDT (-0700), jeffreyalaw@gmail.com wrote: >>> On 4/27/23 10:22, Patrick O'Neill wrote: ... >>>> LLVM mapping notes >>>> -------- >>>> LLVM emits corresponding fences for atomic_signal_fence instructions. >>>> This seems to be an oversight since AFAIK atomic_signal_fence acts >>>> as a >>>> compiler directive. GCC does not emit any fences for >>>> atomic_signal_fence >>>> instructions. >>> This starts to touch on a larger concern. Specifically I'd really like >>> the two compilers to be compatible in terms of the code they generate >>> for the various atomics. >>> >>> What I worry about is code being written (by design or accident) >>> that is >>> dependent on the particular behavior of one compiler and then if that >>> code gets built with the other compiler, and we end up different >>> behavior. Worse yet, if/when this happens, it's likely to be tough to >>> expose, reproduce & debug. > Agreed. > > I'll open an issue with LLVM and see what they have to say about this > particular behavior. Ideally we'd have perfectly compatible compilers > (for atomic ops) by the end of this :) > > AFAICT GCC hasn't ever been emitting fences for these instructions. > (& This behavior isn't touched by the patchset). I re-ran the set of tests I was using and couldn't replicate LLVM's behavior that was noted here. I think I mixed had up atomic_thread_fence with atomic_signal_fence at some point. That was the only difference I could find during my testing of this patchset (other than the strengthened SEQ_CST store), so I think GCC/LLVM atomics will be fully compatible once this patchset is applied.
We're certainly pushing for the same ABI (A.6 + trailing fence on store) in LLVM as well. I'm about to upload a pull request for the psABI document that describes this version of the ABI, and a bit of the rationale for it. I'll attach the current draft here. I agree that compatibility is critical here, not just across llvm and gcc, but also with other language implementations. That's part of the reason to get this correct asap. I believe that standardizing on A.6 + trailing fence on store, though initially suboptimal, is by far the best bet to get us to an efficient ABI in the long term. I expect the A.7 ABI to perform well. A.6, even without the trailing store fence, has annoyingly expensive seq_cst loads, which I would really like to get away from. Hans On Fri, Apr 28, 2023 at 10:44 AM Patrick O'Neill <patrick@rivosinc.com> wrote: > On 4/28/23 09:29, Palmer Dabbelt wrote: > > On Fri, 28 Apr 2023 09:14:00 PDT (-0700), jeffreyalaw@gmail.com wrote: > >> > >> > >> On 4/27/23 10:22, Patrick O'Neill wrote: > >>> This patchset aims to make the RISCV atomics implementation stronger > >>> than the recommended mapping present in table A.6 of the ISA manual. > >>> > https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 > >>> > >>> > >>> Context > >>> --------- > >>> GCC defined RISC-V mappings [1] before the Memory Model task group > >>> finalized their work and provided the ISA Manual Table A.6/A.7 > >>> mappings[2]. > >>> > >>> For at least a year now, we've known that the mappings were different, > >>> but it wasn't clear if these unique mappings had correctness issues. > >>> > >>> Andrea Parri found an issue with the GCC mappings, showing that > >>> atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) > >>> mappings do > >>> not enforce release ordering guarantees. (Meaning the GCC mappings have > >>> a correctness issue). > >>> https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ > >> Right. I recall this discussion, but thanks for the back reference. > > > > Yep, and it's an important one: that's why we're calling the change a > > bug fix and dropping the current GCC mappings. If we didn't have the > > bug we'd be talking about an ABI break, and since the GCC mappings > > predate the ISA mappings we'd likely need an additional compatibility > > mode. > > > > So I guess we're lucky that we have a concurrency bug. I think it's > > the first time I've said that ;) > > > >>> Why not A.6? > >>> --------- > >>> We can update our mappings now, so the obvious choice would be to > >>> implement Table A.6 (what LLVM implements/ISA manual recommends). > >>> > >>> The reason why that isn't the best path forward for GCC is due to a > >>> proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl. > >>> > >>> For context, there is discussion about fast-tracking the addition of > >>> these instructions. The RISCV architectural review committee supports > >>> adopting a "new and common atomics ABI for gcc and LLVM toochains ... > >>> that assumes the addition of the preceding instructions”. That common > >>> ABI is likely to be A.7. > >>> https://lists.riscv.org/g/tech-privileged/message/1284 > >>> > >>> Transitioning from A.6 to A.7 will cause an ABI break. We can hedge > >>> against that risk by emitting a conservative fence after SEQ_CST stores > >>> to make the mapping compatible with both A.6 and A.7. > >> So I like that we can have compatible sequences across A.6 and A.7. Of > >> course the concern is performance ;-) > >> > >> > >>> > >>> What does a mapping compatible with both A.6 & A.7 look like? > >>> --------- > >>> It is exactly the same as Table A.6, but SEQ_CST stores have a trailing > >>> fence rw,rw. It's strictly stronger than Table A.6. > >> Right. So my worry here is silicon that is either already available or > >> coming online shortly. Those implementations simply aren't going to be > >> able to use the A.7 mapping, so they pay a penalty. Does it make sense > >> to have the compatibility fences conditional? > > > > IIRC this was discussed somewhere in some thread, but I think there's > > really three ABIs that could be implemented here (ignoring the current > > GCC mappings as they're broken): > > > > * ABI compatible with the current mappings in the ISA manual (A.6). > > This will presumably perform best on extant hardware, given that it's > > what the words in the PDF say to do. > > * ABI compatible with the proposed mappings for the ISA manual (A.7). > > This may perform better on new hardware. > > * ABI compatible with both A.6 and A.7. This is likely slow on both > > new and old hardware, but allows cross-linking. If there's no > > performance issues this would be the only mode we need, but that > > seems unlikely. > > > > IMO those should be encoded somewhere in the ELF. I'd just do it as > > two bits in the header, but last time I proposed header bits the psABI > > folks wanted to do something different. I don't think where we encode > > this matters all that much, but if we're doing to treat these as real > > long-term ABIs we should have some way to encode that. > > > > There's also the orthogonal axis of whether we use the new > > instructions. Those aren't in specs yet so I think we can hold off on > > them for a bit, but they're the whole point of doing the ABI break so > > we should at least think them over. I think we're OK because we've > > just split out the ABI from the ISA here, but I'm not sure if I'm > > missing something. > > > > Now that I wrote that, though, I remember talking to Patrick about it > > and we drew a bunch of stuff on the whiteboard and then got confused. > > So sorry if I'm just out of the loop here... > This looks up-to-date with how I understand it. > >> > >>> > >>> Benchmark Interpretation > >>> -------- > >>> As expected, out of order machines are significantly faster with the > >>> REL_STORE mappings. Unexpectedly, the in-order machines are > >>> significantly slower with REL_STORE rather than REL_STORE_FENCE. > >> Yea, that's a bit of a surprise. > >> > >>> > >>> Most machines in the wild are expected to use Table A.7 once the > >>> instructions are introduced. > >>> Incurring this added cost now will make it easier for compiled RISC-V > >>> binaries to transition to the A.7 memory model mapping. > >>> > >>> The performance benefits of moving to A.7 can be more clearly seen > >>> using > >>> an almost-all-load microbenchmark (included on page 3 of Hans’ > >>> proposal). The code for that microbenchmark is attached below [5]. > >>> > https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf > >>> https://lists.riscv.org/g/tech-unprivileged/topic/92916241 > >> Yea. I'm not questioning the value of the new instructions that are on > >> the horizon, just the value of trying to make everything A.7 compatible. > >> > >> > >>> > >>> Conformance test cases notes > >>> -------- > >>> The conformance tests in this patch are a good sanity check but do not > >>> guarantee exactly following Table A.6. It checks that the right > >>> instructions are emitted (ex. fence rw,r) but not the order of those > >>> instructions. > >> Note there is a way to check ordering as well. You might look at the > >> check-function-bodies approach. I think there are some recent examples > >> in the gcc risc-v specific tests. > I'll check that out - thank you! > >>> > >>> LLVM mapping notes > >>> -------- > >>> LLVM emits corresponding fences for atomic_signal_fence instructions. > >>> This seems to be an oversight since AFAIK atomic_signal_fence acts as a > >>> compiler directive. GCC does not emit any fences for > >>> atomic_signal_fence > >>> instructions. > >> This starts to touch on a larger concern. Specifically I'd really like > >> the two compilers to be compatible in terms of the code they generate > >> for the various atomics. > >> > >> What I worry about is code being written (by design or accident) that is > >> dependent on the particular behavior of one compiler and then if that > >> code gets built with the other compiler, and we end up different > >> behavior. Worse yet, if/when this happens, it's likely to be tough to > >> expose, reproduce & debug. > Agreed. > > I'll open an issue with LLVM and see what they have to say about this > particular behavior. Ideally we'd have perfectly compatible compilers > (for atomic ops) by the end of this :) > > AFAICT GCC hasn't ever been emitting fences for these instructions. > (& This behavior isn't touched by the patchset). > >> > >> Do you have any sense of where Clang/LLVM is going to go WRT providing > >> an A.6 mapping that is compatible with A.7 by using the additional > >> fences? > I don't - Based on how LLVM initially waited for A.6, I'd speculate that > they would wait for an official compatibility mapping to be added the > PSABI doc or ISA manual before implementing it. > > I imagine there will be demand for an ABI-compatible mapping so the ABI > break has a transition plan, but the timeline for LLVM isn't clear to me. > > Patrick > >> > >> Jeff >
On 4/28/23 12:45, Hans Boehm wrote: > We're certainly pushing for the same ABI (A.6 + trailing fence on store) > in LLVM as well. I'm about to upload a pull request for the psABI > document that describes this version of the ABI, and a bit of the > rationale for it. I'll attach the current draft here. > > I agree that compatibility is critical here, not just across llvm and > gcc, but also with other language implementations. That's part of the > reason to get this correct asap. > > I believe that standardizing on A.6 + trailing fence on store, though > initially suboptimal, is by far the best bet to get us to an efficient > ABI in the long term. I expect the A.7 ABI to perform well. A.6, even > without the trailing store fence, has annoyingly expensive seq_cst > loads, which I would really like to get away from. Thanks for the additional info. This stuff is well outside my area of expertise, so having someone with your background to give key insights is definitely appreciated. jeff