Message ID | 20230118150703.4024-1-ubizjak@gmail.com |
---|---|
State | New |
Headers |
From: Uros Bizjak <ubizjak@gmail.com>
To: linux-kernel@vger.kernel.org
Cc: Uros Bizjak <ubizjak@gmail.com>, Andrew Morton <akpm@linux-foundation.org>
Subject: [PATCH] lib/genalloc: use try_cmpxchg in {set,clear}_bits_ll
Date: Wed, 18 Jan 2023 16:07:03 +0100
Message-Id: <20230118150703.4024-1-ubizjak@gmail.com> |
Series |
lib/genalloc: use try_cmpxchg in {set,clear}_bits_ll
|
|
Commit Message
Uros Bizjak
Jan. 18, 2023, 3:07 p.m. UTC
Use try_cmpxchg instead of cmpxchg(*ptr, old, new) == old in
{set,clear}_bits_ll. The x86 CMPXCHG instruction reports success in the
ZF flag, so this change saves a compare after cmpxchg (and the related
move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns the old value of *ptr to "old" when
cmpxchg fails.

Note that the value of *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.

The patch also declares these two functions inline, to ensure inlining.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
lib/genalloc.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
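For readers unfamiliar with the API: try_cmpxchg() returns a boolean
and, on failure, writes the value it observed back through its "old"
pointer, which is what makes the shorter loops in this patch possible.
A minimal sketch of those semantics, modeled on the kernel's generic
fallback (simplified, not the exact macro):

	/*
	 * Simplified sketch of try_cmpxchg() semantics, along the lines of
	 * the kernel's generic atomic fallbacks; not the exact definition.
	 */
	#define try_cmpxchg_sketch(ptr, oldp, new)			\
	({								\
		typeof(*(ptr)) __old = *(oldp);				\
		typeof(*(ptr)) __ret = cmpxchg((ptr), __old, (new));	\
		if (__ret != __old)					\
			*(oldp) = __ret; /* hand fresh value back */	\
		__ret == __old;		 /* true on success */		\
	})

On x86 this maps directly onto CMPXCHG's ZF result, so the caller's
"did it succeed" test needs no extra compare instruction.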
Comments
On Wed, 18 Jan 2023 16:07:03 +0100 Uros Bizjak <ubizjak@gmail.com> wrote:

> [snip]
> The patch also declares these two functions inline, to ensure inlining.

But why is that better? This adds a few hundred bytes more text, which
has a cost.
On Wed, Jan 18, 2023 at 10:18 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > The patch also declares these two functions inline, to ensure inlining.
>
> But why is that better? This adds a few hundred bytes more text, which
> has a cost.

With the original code, both functions are inlined and the size of the
object file is (gcc version 12.2.1, x86_64):

   text    data     bss     dec     hex filename
   4661     480       0    5141    1415 genalloc-orig.o

When try_cmpxchg is used, gcc chooses not to inline set_bits_ll (its
estimate of code size is not very precise when multi-line assembly is
involved), resulting in:

   text    data     bss     dec     hex filename
   4705     488       0    5193    1449 genalloc-noinline.o

And with inline added to avoid gcc's quirks:

   text    data     bss     dec     hex filename
   4629     480       0    5109    13f5 genalloc.o

Considering that these two changed functions are used only in
genalloc.o, adding the inline qualifier is a win, even compared to the
original size.

Uros.
On Wed, Jan 18, 2023 at 10:47 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> [size comparison snipped]
>
> Considering that these two changed functions are used only in
> genalloc.o, adding the inline qualifier is a win, even compared to the
> original size.

BTW: Recently, it was determined [1] that the usage of cpu_relax()
inside the cmpxchg loop can be harmful for performance. We actually have
the same situation here, so perhaps cpu_relax() should be removed in the
same way it was removed from the lockref.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f5fe24ef17b5fbe6db49534163e77499fb10ae8c

Uros.
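For concreteness, a variant of the patched loop with cpu_relax() dropped
in the way the lockref commit above did would look roughly like the
sketch below; this was not actually posted in the thread:

	/* Sketch only: set_bits_ll without cpu_relax(), mirroring the
	 * lockref change referenced in [1]; not code from this thread. */
	static inline int set_bits_ll(unsigned long *addr,
				      unsigned long mask_to_set)
	{
		unsigned long val = READ_ONCE(*addr);

		do {
			if (val & mask_to_set)
				return -EBUSY;
			/* retry immediately; try_cmpxchg refreshed val */
		} while (!try_cmpxchg(addr, &val, val | mask_to_set));

		return 0;
	}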
On Wed, Jan 18, 2023 at 10:55 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> BTW: Recently, it was determined [1] that the usage of cpu_relax()
> inside the cmpxchg loop can be harmful for performance. [snip]

I forgot to add some CCs that may be interested in the above.

Uros.
> BTW: Recently, it was determined [1] that the usage of cpu_relax()
> inside the cmpxchg loop can be harmful for performance. We actually
> have the same situation here, so perhaps cpu_relax() should be removed
> in the same way it was removed from the lockref.

I'm not sure you ever want a cpu_relax() in a loop that is implementing
an atomic operation. Even the ia64 (die...) issue was with a loop that
was waiting for another cpu to change the location (eg a spinlock).

For an atomic operation an immediate retry is likely to succeed. Any
kind of deferral to another cpu can only make it worse.

Clearly if you have 100s of cpus looping doing atomic operations on the
same cache line it is likely that some get starved. But to fix that you
need to increase the time between successful operations, not delay on
failure.

	David
On Thu, Jan 19, 2023 at 1:47 PM David Laight <David.Laight@aculab.com> wrote:
>
> I'm not sure you ever want a cpu_relax() in a loop that is implementing
> an atomic operation. [snip]

I would like to point out that the wikipedia article on compare-and-swap
claims [1] that:

    Instead of immediately retrying after a CAS operation fails,
    researchers have found that total system performance can be improved
    in multiprocessor systems—where many threads constantly update some
    particular shared variable—if threads that see their CAS fail use
    exponential backoff—in other words, wait a little before retrying
    the CAS [2].

[1] https://en.wikipedia.org/wiki/Compare-and-swap#Overview
[2] https://arxiv.org/pdf/1305.5800.pdf

Uros.
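To make the cited backoff idea concrete, here is a sketch of what such a
loop might look like if applied to set_bits_ll(); the cap of 64 and the
placement of the backoff are invented for illustration, not something
proposed in the thread:

	/* Illustrative sketch: CAS retry with capped exponential
	 * backoff. Function name and constants are hypothetical. */
	static inline int set_bits_ll_backoff(unsigned long *addr,
					      unsigned long mask_to_set)
	{
		unsigned long val = READ_ONCE(*addr);
		unsigned int spins = 1;

		do {
			unsigned int i;

			if (val & mask_to_set)
				return -EBUSY;
			for (i = 0; i < spins; i++)
				cpu_relax();	/* back off before retry */
			if (spins < 64)
				spins <<= 1;	/* exponential, capped */
		} while (!try_cmpxchg(addr, &val, val | mask_to_set));

		return 0;
	}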
From: Uros Bizjak
> Sent: 23 January 2023 15:05
...
>     Instead of immediately retrying after a CAS operation fails,
>     researchers have found that total system performance can be improved
>     in multiprocessor systems—where many threads constantly update some
>     particular shared variable—if threads that see their CAS fail use
>     exponential backoff—in other words, wait a little before retrying
>     the CAS [2].

Probably, but the real solution is 'don't do that'.

In any case I suspect the cpu_relax() explicitly lets the other
hyperthreading cpu run - which isn't useful at all.

What you actually want is for the cache logic to avoid losing
'exclusive' access to the cache line for enough clocks after a failed
compare+exchange to allow the cpu to re-issue the memory cycle with an
updated value.

You can't do anything about one cpu being starved, but a short delay
there is almost certainly beneficial. (Some hardware cache engineer will
probably say otherwise.)

	David
On 1/23/23, Uros Bizjak <ubizjak@gmail.com> wrote:
> I would like to point out that the wikipedia article on
> compare-and-swap claims [1] that:
> [snip]
> [1] https://en.wikipedia.org/wiki/Compare-and-swap#Overview
> [2] https://arxiv.org/pdf/1305.5800.pdf

I skimmed the paper and I think I got the gist of it, which boils down
to a very common idea for damage control when multiple CPUs modify the
same thing: make some of the contending CPUs bugger off for a time. How
specifically to do it is another matter. The key problem with it is that
CPUs which already failed to make any progress now artificially delay
the entire thing even more.

The fundamental observation is that 2 or more CPUs rolling with an
atomic op on the same cacheline are going to interfere with each other,
and that effect is only going to get worse as you keep adding more of
them. If you can make some of them temporarily bugger off, that is a win
in the sense that the ping-pong is reduced. If this is finely tuned, you
may get an overall win by achieving better throughput while maintaining
acceptable worst-case latency. Otherwise you are starving other
participants, which is not acceptable. tl;dr you are walking a very
tight rope here and I'm not sure it is worth it.

For a toy example one can imagine 2 CPUs running code to do an op n
times. If they both do it in a tight loop there is plenty of ping-pong,
but if one is simply made to wait until the other one is finished with
all the iterations, you register much shorter total real time. But then
again everything happening on the other CPU had to wait so much longer.

I don't know why, but many people love to throw away the value returned
by CAS and instead do another plain access. While I don't see the actual
code used, the description in the paper matches this problematic
behavior, and I guarantee it artificially makes a tight CAS loop appear
worse than it is. My own numbers from the cpu_relax removal commit show
that throughput flattens very early, and all the extra computing power
thrown at it simply goes to waste.

All that said: is a plain cpu_relax in the loop good? Definitely not on
x86-64. Is something more elaborate going to be good? Maybe; it highly
depends on the CPUs at hand and the workload, and if you mess it up some
of them get starved. This in turn can degrade performance elsewhere if
the starved thread is holding a lock waited on by someone else, and so
on. I would definitely not play any games of the sort in the code at
hand.

In that case what about lockref? It definitely is not the fastest it can
be; for one, some provisions could be made for NUMA-aware handling. The
basic idea would be to have competing CPUs aggregate the total change on
the lock and then have one of them make it happen across the
interconnect -- so instead of add/sub 1 you would add/sub n, where n can
very well be quite big.

Personally, for lockref, I think the way to go forward is to use it less
to begin with and to look for ways to convert it into a lock xadd-able
primitive instead. The "doing it less" thing could be done by
implementing an RCU-only mode for select syscalls, which defaults to
opportunistically avoiding refing squat. If the caller manages to do
what it needed to do, that is a massive win; otherwise refs get taken.
This could work well in the common case for syscalls like statx and
access, but would not for open. Frankly I'm surprised something like
this is not already implemented (maybe I missed a major showstopper?).
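The "throw away the value returned by CAS" habit mentioned above,
contrasted with reusing the returned value (illustrative sketches; the
function names are made up):

	/* Anti-pattern: discard cmpxchg's return value and re-read
	 * memory, adding an extra shared-cacheline access per failed
	 * attempt. */
	static inline void set_bit_discard(unsigned long *addr,
					   unsigned long bit)
	{
		unsigned long old;

		do {
			old = READ_ONCE(*addr);
		} while (cmpxchg(addr, old, old | bit) != old);
	}

	/* Better: reuse the value cmpxchg returns, which is exactly what
	 * the try_cmpxchg() form in this patch does for its caller. */
	static inline void set_bit_reuse(unsigned long *addr,
					 unsigned long bit)
	{
		unsigned long old = READ_ONCE(*addr), cur;

		while ((cur = cmpxchg(addr, old, old | bit)) != old)
			old = cur;
	}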
On Mon, Jan 23, 2023 at 7:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> In that case what about lockref? It definitely is not the fastest it
> can be; for one, some provisions could be made for NUMA-aware handling.

Note that the contention case for lockref really is fairly made-up. Yes,
yes, it's easy enough to trigger with microbenchmarks. In real life less
so.

> The basic idea would be to have competing CPUs aggregate the total
> change on the lock and then have one of them make it happen across the
> interconnect -- so instead of add/sub 1 you would add/sub n, where n
> can very well be quite big.

No. The use is literally serialized, and done one at a time, and needs
to be synchronous.

We're talking about looking up a *single* path component, while at the
same time making sure that that path component isn't being removed by
another CPU - and there is by definition no other locking going on,
because the lockref *is* the locking.

And it really is one single path component, because this is all very
much the equivalent of doing a list traversal (where the "list" is the
full path, and the list entries are the individual components), and
there is absolutely zero parallelism except for entirely independent
threads doing this in parallel on *different* paths (that may have
common components, of course, with the root and home directory in
particular being shared - but they are "shared" only in the sense that
lots of threads use them entirely independently).

Now, there are ways to do lockref more efficiently, but I suspect

 (a) most of them are about hardware optimizations

 (b) it's very seldom an issue in practice, because the code is actually
     very good at avoiding them already (ie the RCU pathname walk
     already avoids it)

Thanks to RCU pathname lookup, the lockref thing *should* come into play
only when you actually fall out of RCU mode, ie for the last component.
That's a huge win, because that's what avoids the whole "everybody
hammers on the root/pwd dentry counts".

Your benchmark that looked up the *same* last component in parallel all
the time is not really a real load.

As to the actual hardware optimizations, it would be interesting to see
if an LL/SC architecture might do a "native" lockref implementation
better than the "emulated" one that just uses LL/SC for the cmpxchg.

And Intel *has* talked about instruction set extensions for remote
atomics, and using CMPxxXADD *may* be a win. At least it would avoid the
*looping* on cmpxchg, and it *might* avoid bouncing the value too if the
op is actually done using L3 accesses or something.

(But I do also worry that CMPxxXADD might actually make things *worse*
because of any remote access, because caches work and contention is
rare, and in many common cases you might be better off with a local
cache than with a remote one.)

So I'm not claiming lockref is perfect, but I do suspect you're barking
up the wrong tree here. The optimizations you are talking about are
either not realistic in the first place (because serialization and
locking) or have already mostly been done (ie avoiding the op entirely
except for the last component).

HOWEVER. There is one special exception that might be interesting and
that has never been done: 'fstat()' and friends could possibly avoid the
"try_to_unlazy()" even for the last component.

IOW, we might be able to do fstatat() without ever even finalizing the
RCU state of the pathname, and actually looking up the stat information
while still in RCU mode, and instead of doing the actual final
lockref_get_not_dead() to turn an RCU path into a real ref-counted path,
we could just do the "get the stat info, then do read_seqcount_retry()
to verify that the RCU walk _and_ the stat info is still valid".

But I'm not convinced that final complexity would be worth it. It sounds
like a potentially fun and interesting exercise (Al Viro added to
participants so that he can say "No! That's not 'fun and exciting',
that's just crazy!") if somebody really wants to, but see above: the
last path component is very seldom something that sees contention. You
look up the same root/pwd over and over again, but nobody sane looks up
the same full pathname over and over again.

And extending the LOOKUP_RCU window all the way over the stat info
gathering would require a lot of care, and force us to expose the kinds
of things we do for LOOKUP_RCU in namei.c to fs/stat.c too.

So it might be fun and exciting, but it might also be quite nasty, and
it's not entirely clear that it would be worth the pain. But since you
clearly were looking at fstatat() performance, I can only point you at
this case and say "There's _potentially_ upside there".

                Linus
On 1/23/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, Jan 23, 2023 at 7:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> The basic idea would be to have competing CPUs aggregate the total
>> change on the lock and then have one of them make it happen across the
>> interconnect -- so instead of add/sub 1 you would add/sub n, where n
>> can very well be quite big.
>
> No. The use is literally serialized, and done one at a time, and needs
> to be synchronous.
>
> We're talking about looking up a *single* path component, while at the
> same time making sure that that path component isn't being removed by
> another CPU - and there is by definition no other locking going on,
> because the lockref *is* the locking.
> [snip]
>
> Thanks to RCU pathname lookup, the lockref thing *should* come into
> play only when you actually fall out of RCU mode, ie for the last
> component. That's a huge win, because that's what avoids the whole
> "everybody hammers on the root/pwd dentry counts".
>
> Your benchmark that looked up the *same* last component in parallel
> all the time is not really a real load.
> [snip2]
>
> So I'm not claiming lockref is perfect, but I do suspect you're
> barking up the wrong tree here. The optimizations you are talking
> about are either not realistic in the first place (because
> serialization and locking) or have already mostly been done (ie
> avoiding the op entirely except for the last component).

That was a minor remark in the spirit of attempting to reduce
ping-pong, only made because of the paper. It was not a serious
proposal, but perhaps I failed to make that clear enough. The more
serious remark was in the last paragraph.

> HOWEVER. There is one special exception that might be interesting and
> that has never been done: 'fstat()' and friends could possibly avoid
> the "try_to_unlazy()" even for the last component.
>
> IOW, we might be able to do fstatat() without ever even finalizing the
> RCU state of the pathname, and actually looking up the stat
> information while still in RCU mode, and instead of doing the actual
> final lockref_get_not_dead() to turn an RCU path into a real
> ref-counted path, we could just do the "get the stat info, then do
> read_seqcount_retry() to verify that the RCU walk _and_ the stat info
> is still valid".

Not mentioning specific routines by name, this is precisely what I
proposed in that e-mail. To quote myself:

> Personally, for lockref, I think the way to go forward is to use it
> less to begin with and to look for ways to convert it into a lock
> xadd-able primitive instead. The "doing it less" thing could be done
> by implementing an RCU-only mode for select syscalls, which defaults
> to opportunistically avoiding refing squat. If the caller manages to
> do what it needed to do, that is a massive win; otherwise refs get
> taken. This could work well in the common case for syscalls like statx
> and access, but would not for open. Frankly I'm surprised something
> like this is not already implemented (maybe I missed a major
> showstopper?).

Now back to your response:

> But I'm not convinced that final complexity would be worth it. It
> sounds like a potentially fun and interesting exercise (Al Viro added
> to participants so that he can say "No! That's not 'fun and exciting',
> that's just crazy!") if somebody really wants to, but see above: the
> last path component is very seldom something that sees contention. You
> look up the same root/pwd over and over again, but nobody sane looks
> up the same full pathname over and over again.
> [snip]
> But since you clearly were looking at fstatat() performance, I can
> only point you at this case and say "There's _potentially_ upside
> there".

So if you strace something like gcc compiling stuff you will find:

- some access calls on shared dirs, for example:

  78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
  78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
  78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0

- same with newfstatat:

  87428 newfstatat(AT_FDCWD, "./arch/x86/include", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
  87428 newfstatat(AT_FDCWD, "./arch/x86/include/generated", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
  87428 newfstatat(AT_FDCWD, "./include", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
  87428 newfstatat(AT_FDCWD, "./arch/x86/include/uapi", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0

- there is also quite a bit of readlink:

  87502 readlink("/tmp", 0x7ffe28847ac0, 1023) = -1 EINVAL (Invalid argument)
  87502 readlink("/tmp/ccTh37oI.s", 0x7ffe28847ac0, 1023) = -1 EINVAL (Invalid argument)

That last bit is glibc doing realpath(). A case can be made for making
realpath into a syscall instead, but I'm not going to flame over it for
the time being. :)

Anyhow, my point is that 2 gcc processes compiling different files do
invoke path lookups with the same terminal path components, and lockref
stuff gets bounced around when that happens.

On another point, on how to end up dealing with lockref less: I have to
note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
AT_EMPTY_PATH) internally. This adds a lot of work single-threaded, as
it forces a path lookup with an associated stac/clac trip; but more
importantly, the kernel no longer relies on the liveness of the dentry
provided by the file object, and uses lockref instead.

Something which allows getting the data produced by newfstatat without
forcing a lookup sounds like a welcome addition. Same goes for statx,
which seems to be the recommended syscall. I'll be writing about this
to -fsdevel some time later.
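For reference, the glibc change being described amounts to the following
userspace equivalence (a sketch; the wrapper name is made up):

	#define _GNU_SOURCE	/* for AT_EMPTY_PATH */
	#include <fcntl.h>
	#include <sys/stat.h>

	/* Sketch of what a glibc-style fstat(fd, st) expands to when
	 * built on newfstatat: an empty path plus AT_EMPTY_PATH makes
	 * the kernel stat the file the descriptor refers to, going
	 * through the path-walk machinery (and, per the description
	 * above, a dentry lockref) rather than just the struct file. */
	static int fstat_via_newfstatat(int fd, struct stat *st)
	{
		return fstatat(fd, "", st, AT_EMPTY_PATH);
	}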
On Mon, Jan 23, 2023 at 4:11 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On another point, on how to end up dealing with lockref less: I have
> to note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
> AT_EMPTY_PATH) internally.

Yeah, that's just plain stupid.

I don't know what the thinking (or rather, lack thereof) was inside
glibc for that one.

                Linus
On 1/24/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On another point, on how to end up dealing with lockref less: I have
>> to note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
>> AT_EMPTY_PATH) internally.
>
> Yeah, that's just plain stupid.
>
> I don't know what the thinking (or rather, lack thereof) was inside
> glibc for that one.

To my reading the fstat system call is now considered legacy, so they
switched to something else. There is a lot to say about the stat
situation (going past the fd vs lookup issue), which I'm going to do on
fsdevel (perhaps next week).

All flaming aside, I think Uros is still waiting for an answer on what
should be done in cmpxchg loops, and someone else will have to provide
it. I stated my case for the best course of action being to do nothing
(for x86-64 anyway), but I'm not an authority figure.
From: Mateusz Guzik
> Sent: 24 January 2023 00:11
...
> So if you strace something like gcc compiling stuff you will find:
>
> - some access calls on shared dirs, for example:
>
>   78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
>   78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
>   78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0

Are they back to back? Which is just stupid. Once per invocation of gcc
would be noise.

> [newfstatat and readlink examples snipped]
>
> That last bit is glibc doing realpath(). A case can be made for making
> realpath into a syscall instead, but I'm not going to flame over it
> for the time being. :)

I remember looking at syscall counts during a (NetBSD) build and
deciding that the dominant system call was actually failed opens from
the compiler searching long -I paths looking for headers.

You can speed things up by copying all the .h files from the fixed -I
path list into a single directory.

	David
On Tue, Jan 24, 2023 at 12:54 AM David Laight <David.Laight@aculab.com> wrote:
>
> I remember looking at syscall counts during a (NetBSD) build and
> deciding that the dominant system call was actually failed opens from
> the compiler searching long -I paths looking for headers.

For some loads, yes.

> You can speed things up by copying all the .h files from the fixed -I
> path list into a single directory.

Or, better yet, do a proper pathname cache that has negative entries in
it too. Like Linux does.

Because it's a common pattern, not just for include paths. You find it
for the regular PATH handling, and for various "look up my config file
in these directories" kinds of thing. So caching not just successful
pathname lookups, but the failed ones too, is actually important.

(Obviously it's still faster to not have to even search at all, and put
everything in one directory and not have a search path at all, but it
does have its own serious downsides, notably the whole "now we have to
keep that union directory in sync with all the source directories")

                Linus
> (Obviously it's still faster to not have to even search at all, and
> put everything in one directory and not have a search path at all, but
> it does have its own serious downsides, notably the whole "now we have
> to keep that union directory in sync with all the source directories")

I've used make to do the copies for a 'day job' project. Just a matter
of identifying the .h files that need copying to obj/include.

	David
On Mon, Jan 23, 2023 at 11:36:26AM -0800, Linus Torvalds wrote:
> HOWEVER. There is one special exception that might be interesting and
> that has never been done: 'fstat()' and friends could possibly avoid
> the "try_to_unlazy()" even for the last component.
> [snip]
> And extending the LOOKUP_RCU window all the way over the stat info
> gathering would require a lot of care, and force us to expose the
> kinds of things we do for LOOKUP_RCU in namei.c to fs/stat.c too.

Interesting... So basically a variant of filename_lookup() that fills
struct kstat instead of doing that to struct path?

Not impossible, but it will take some care; in particular, I'd consider
a non-default ->getattr() as "fall back to non-RCU *now*, we are not
going there" - otherwise it becomes too nasty. STATX_DIOALIGN on block
devices is even worse...

Need to check what struct user_namespace and associated stuff looks like
wrt RCU; OTOH, I wouldn't expect too much headache there, given that
things like may_follow_link() already cover most (if not everything) of
what we'd need in generic_fillattr().

Looks like the main obstacle is how to deal with duplication between
that thing and vfs_getattr{,_nosec}(); one possibility would be to teach
vfs_getattr about the possibility of being called from RCU mode, and
have it used by that new thing. However, I really want to avoid passing
struct nameidata to that sucker.

Hmmmm... OK, so what are the reasons for falling out of RCU mode there,
and how early can we detect those?

1) non-NULL ->getattr() (OK, corresponding new bit in ->i_opflags).
   Can be seen immediately.

2) STATX_DIOALIGN combined with S_ISBLK(). Also can be seen immediately.

3) security_inode_getattr(). That can get unpleasant - it exposes
   apparmor and tomoyo to operating in RCU mode. Until now they had
   been spared that. OTOH, they can just fail with -ECHILD if they see
   whatever indicator of "you are called in RCU mode" we end up
   choosing...

Hell knows - that just might be doable. vfs_getattr() grows a flag (not
sure where - passing the inode separately looks like a plausible
approach, with non-NULL meaning "we are in RCU mode"). If it sees the
indication of being called in RCU mode, it should return -ECHILD ASAP if
it can't handle things. Old callers just pass it NULL for the inode;
filename_statx() (in RCU mode) does something like

	err = vfs_getattr(nd->path, nd->inode, ....);
	if (err == -ECHILD)
		if (!try_to_unlazy(nd))
			sod off and restart in non-RCU mode
		err = vfs_getattr(nd->path, NULL, ....);
	if (!err && (nd->flags & LOOKUP_RCU)) {
		do a trimmed-down variant of try_to_unlazy, just checking
		the seq counts and not grabbing anything.
	}

Oh, wait - there's the rest of the stuff done by complete_walk()...
Ugh... The ->d_weak_revalidate() part obviously means "no RCU for you",
but it needs to be handled for non-RCU modes, and there's the
LOOKUP_IS_SCOPED part...

I'll poke around at some point, but I've a bunch of other pending crap
at the moment. It _might_ be doable without too much ugliness, but then
this is just from a cursory look - there might be dragons.
On Thu, Jan 26, 2023 at 7:54 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Interesting... So basically a variant of filename_lookup() that fills
> struct kstat instead of doing that to struct path?

Well, kinda. You still want the fallback to struct path in case you
can't fill in the kstat without it (because the filesystem needs to do
something under RCU).

So I think what you'd really want is something like a special version of
filename_lookup() that is then given an optimistic "call this function
under RCU before you finalize the path".

And then, *if* the filesystem can do the kstat lookup etc under RCU, it
can just fill it in there, and instead of finalizing the path, we can
just terminate the walk without ever doing the try_to_unlazy() that
legitimizes the path.

I suspect it's fairly close to how we do d_revalidate() for the path
component, except we'd do this not per-component, but as a "for the
final result, let's do one last thing under RCU, and if it succeeded
there, we don't actually need the path at all after all, because we've
already done everything we needed".

I think the only really relevant situation where this is the case is
basically the stat() family of functions, since those don't actually
care about the path after the operation. But there are other cases that
have the

	error = filename_lookup(dfd, filename, lookup_flags, &path, NULL);
	... do something simple once ...
	path_put(&path);

pattern, where we just do the 'put' on the path and don't have any other
use for it. The io_uring xattr code matches that same pattern, for
example - but it may simply not be worth worrying about.

So either some generic "callback under RCU before we finalize it", or we
could make it very specific: "fill in kstat and don't bother finalizing
the path when this flag is set".

> Looks like the main obstacle is how to deal with duplication between
> that thing and vfs_getattr{,_nosec}();

I think we'd need something like a new ->rcu_getattr() function, and
filesystems that can do getattr under RCU would just set that function
pointer.

I dunno. I didn't look at it at all, but it *feels* like you could have
something that just says "if I have this function, and calling it
returns success, then we can just do terminate_walk() without ever doing
try_to_unlazy()".

                Linus
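Reading Al's and Linus's proposals together, the flow under discussion
might look something like the sketch below; every name other than
vfs_getattr() and try_to_unlazy() is hypothetical, and this is a reading
of the thread rather than a real patch:

	/* Hypothetical only: RCU-mode stat flow combining Al Viro's
	 * vfs_getattr fallback idea with Linus's ->rcu_getattr()
	 * suggestion. Neither ->rcu_getattr() nor
	 * seq_counts_still_valid() exists in the kernel. */
	static int filename_statx_rcu(struct nameidata *nd,
				      struct kstat *stat, u32 request_mask)
	{
		int err = -ECHILD;

		if (nd->inode->i_op->rcu_getattr)	/* hypothetical */
			err = nd->inode->i_op->rcu_getattr(&nd->path, stat,
							   request_mask);

		/* If it succeeded, the RCU walk must still be valid. */
		if (!err && !seq_counts_still_valid(nd)) /* made-up helper */
			err = -ECHILD;

		if (err == -ECHILD) {
			/* Fall back: legitimize the path, normal getattr. */
			if (!try_to_unlazy(nd))
				return -ECHILD;	/* restart in ref-walk */
			err = vfs_getattr(&nd->path, stat, request_mask, 0);
		}
		return err;
	}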
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 00fc50d0a640..0c883d6fbd44 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -40,32 +40,30 @@ static inline size_t chunk_size(const struct gen_pool_chunk *chunk)
 	return chunk->end_addr - chunk->start_addr + 1;
 }
 
-static int set_bits_ll(unsigned long *addr, unsigned long mask_to_set)
+static inline int
+set_bits_ll(unsigned long *addr, unsigned long mask_to_set)
 {
-	unsigned long val, nval;
+	unsigned long val = READ_ONCE(*addr);
 
-	nval = *addr;
 	do {
-		val = nval;
 		if (val & mask_to_set)
 			return -EBUSY;
 		cpu_relax();
-	} while ((nval = cmpxchg(addr, val, val | mask_to_set)) != val);
+	} while (!try_cmpxchg(addr, &val, val | mask_to_set));
 
 	return 0;
 }
 
-static int clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)
+static inline int
+clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)
 {
-	unsigned long val, nval;
+	unsigned long val = READ_ONCE(*addr);
 
-	nval = *addr;
 	do {
-		val = nval;
 		if ((val & mask_to_clear) != mask_to_clear)
 			return -EBUSY;
 		cpu_relax();
-	} while ((nval = cmpxchg(addr, val, val & ~mask_to_clear)) != val);
+	} while (!try_cmpxchg(addr, &val, val & ~mask_to_clear));
 
 	return 0;
 }