Message ID | 20231011204150.51166-1-ubizjak@gmail.com |
---|---|
State | New |
From | Uros Bizjak <ubizjak@gmail.com> |
To | x86@kernel.org, linux-kernel@vger.kernel.org |
Cc | Linus Torvalds <torvalds@linux-foundation.org>, Nadav Amit <namit@vmware.com>, Ingo Molnar <mingo@kernel.org>, Andy Lutomirski <luto@kernel.org>, Brian Gerst <brgerst@gmail.com>, Denys Vlasenko <dvlasenk@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, Peter Zijlstra <peterz@infradead.org>, Thomas Gleixner <tglx@linutronix.de>, Josh Poimboeuf <jpoimboe@redhat.com> |
Subject | [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr() |
Date | Wed, 11 Oct 2023 22:40:36 +0200 |
Series | [tip] x86/percpu: Rewrite arch_raw_cpu_ptr() |
Commit Message
Uros Bizjak
Oct. 11, 2023, 8:40 p.m. UTC
Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
add the ptr value to the base. This way, the compiler can propagate
the addend to the following instruction and simplify address calculation.
E.g.: address calculation in amd_pmu_enable_virt() improves from:
48 c7 c0 00 00 00 00 mov $0x0,%rax
87b7: R_X86_64_32S cpu_hw_events
65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
00
87bf: R_X86_64_PC32 this_cpu_off-0x4
48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
00 00 00 00
to:
65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
00
8798: R_X86_64_PC32 this_cpu_off-0x4
48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
00 00 00 00
87a6: R_X86_64_32S cpu_hw_events+0x1328
The compiler can also eliminate redundant loads from this_cpu_off,
reducing the number of percpu offset reads (either from this_cpu_off
or with rdgsbase) from 1663 to 1571.
Additionally, the patch introduces a 'rdgsbase' alternative for CPUs with
X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
only decoding in the first decoder etc. But we're talking single-cycle
kind of effects, and the rdgsbase case should be much better from
a cache perspective and might use fewer memory pipeline resources to
offset the fact that it uses an unusual front end decoder resource...
The only drawback of the patch is larger binary size:
text data bss dec hex filename
25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
that increases by 31k (0.123%), due to 1578 rdgsbase altinstructions
that are placed in the text section. The increase in text-size is not
"real" - the 'rdgsbase' instruction should be smaller than a 'mov %gs';
binary size increases because we obviously have two instructions, but
the actual *executable* part likely stays the same, and it's just that
we grow the altinstruction metadata.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
---
arch/x86/include/asm/percpu.h | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
Comments
On Wed, Oct 11, 2023, Uros Bizjak wrote:
> Additionally, the patch introduces a 'rdgsbase' alternative for CPUs with
> X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> only decoding in the first decoder etc. But we're talking single-cycle
> kind of effects, and the rdgsbase case should be much better from
> a cache perspective and might use fewer memory pipeline resources to
> offset the fact that it uses an unusual front end decoder resource...

The switch to RDGSBASE should be a separate patch, and should come with actual
performance numbers.

A significant percentage of data accesses in Intel's TDX-Module[*] use this
pattern, e.g. even global data is relative to GS.base in the module due to its
rather odd and restricted environment. Back in the early days of TDX, the
module used RD{FS,GS}BASE instead of prefixes to get pointers to per-CPU and
global data structures in the TDX-Module. It's been a few years so I forget
the exact numbers, but at the time a single transition between guest and host
would have something like ~100 reads of FS.base or GS.base. Switching from
RD{FS,GS}BASE to prefixed accesses reduced the latency for a guest<->host
transition through the TDX-Module by several thousand cycles, as every
RD{FS,GS}BASE had a latency of ~18 cycles (again, going off 3+ year old
memories).

The TDX-Module code is pretty much a pathological worst case scenario, but I
suspect its usage is very similar to most usage of raw_cpu_ptr(), e.g. get a
pointer to some data structure and then do multiple reads/writes from/to that
data structure.

The other wrinkle with RD{FS,GS}BASE is that they are trivially easy to
emulate. If a hypervisor/VMM is advertising FSGSBASE even when it's not
supported by hardware, e.g. to migrate VMs to older hardware, then every
RDGSBASE will end up taking a few thousand cycles (#UD -> VM-Exit -> emulate).
I would be surprised if any hypervisor actually does this as it would be
easier/smarter to simply not advertise FSGSBASE if migrating to older hardware
might be necessary, e.g. KVM doesn't support emulating RD{FS,GS}BASE. But at
the same time, the whole reason I stumbled on the TDX-Module's sub-optimal
RD{FS,GS}BASE usage was because I had hacked KVM to emulate RD{FS,GS}BASE so
that I could do KVM TDX development on older hardware. I.e. it's not
impossible that this code could run on hardware where RDGSBASE is emulated in
software.

[*] https://www.intel.com/content/www/us/en/download/738875/intel-trust-domain-extension-intel-tdx-module.html
On Fri, Oct 13, 2023 at 6:04 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Oct 11, 2023, Uros Bizjak wrote:
> > Additionally, the patch introduces a 'rdgsbase' alternative for CPUs with
> > X86_FEATURE_FSGSBASE. [...]
>
> The switch to RDGSBASE should be a separate patch, and should come with actual
> performance numbers.

This *is* the patch to switch to RDGSBASE. The propagation of
arguments is a nice side-effect of the patch, due to the explicit
addition of the offset addend to the %gs base. This patch is an
alternative implementation of [1].

[1] x86/percpu: Use C for arch_raw_cpu_ptr(),
    https://lore.kernel.org/lkml/20231010164234.140750-1-ubizjak@gmail.com/

Unfortunately, I have no idea how to measure the impact of such a
low-level feature, so I'll at least need some guidance. The "gut
feeling" says that a special instruction, intended to support the
feature, is always better than emulating said feature with a memory
access.

> A significant percentage of data accesses in Intel's TDX-Module[*] use this
> pattern, e.g. even global data is relative to GS.base in the module due to its
> rather odd and restricted environment. [...] Switching from RD{FS,GS}BASE to
> prefixed accesses reduced the latency for a guest<->host transition through
> the TDX-Module by several thousand cycles, as every RD{FS,GS}BASE had a
> latency of ~18 cycles (again, going off 3+ year old memories).
>
> The TDX-Module code is pretty much a pathological worst case scenario, but I
> suspect its usage is very similar to most usage of raw_cpu_ptr(), e.g. get a
> pointer to some data structure and then do multiple reads/writes from/to that
> data structure.
>
> The other wrinkle with RD{FS,GS}BASE is that they are trivially easy to
> emulate. If a hypervisor/VMM is advertising FSGSBASE even when it's not
> supported by hardware, e.g. to migrate VMs to older hardware, then every
> RDGSBASE will end up taking a few thousand cycles (#UD -> VM-Exit ->
> emulate). [...] I.e. it's not impossible that this code could run on hardware
> where RDGSBASE is emulated in software.

There are some other issues when memory access to the percpu area is
implemented with an asm. An ongoing analysis shows that compilers
can't CSE asm over basic-block boundaries; the CSE of asm is a very
simple pattern matching through the BB.

So, taking into account all presented facts and the noticeable increase
in binary size due to the use of alternatives, I'm leaning towards using
C for arch_raw_cpu_ptr(). Compilers are extremely good at optimizing
memory access, so I guess the original patch [1] outweighs the trouble
with RDGSBASE. Unless someone proves that RDGSBASE is noticeably
*better* than the current (or future optimized) implementation.

> [*] https://www.intel.com/content/www/us/en/download/738875/intel-trust-domain-extension-intel-tdx-module.html

Thanks,
Uros.
On Fri, 13 Oct 2023 at 12:30, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> There are some other issues when memory access to the percpu area is
> implemented with an asm. An ongoing analysis shows that compilers
> can't CSE asm over basic-block boundaries, the CSE of asm is a very
> simple pattern matching through the BB.

Ahh. That explains the odd "partial CSE". I was assuming that CSE was
done on the SSA level with some dominance analysis. Which presumably
all the load simplification ends up doing.

Linus
On Fri, Oct 13, 2023, Uros Bizjak wrote:
> On Fri, Oct 13, 2023 at 6:04 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > The switch to RDGSBASE should be a separate patch, and should come with actual
> > performance numbers.
>
> This *is* the patch to switch to RDGSBASE. The propagation of
> arguments is a nice side-effect of the patch, due to the explicit
> addition of the offset addend to the %gs base. This patch is an
> alternative implementation of [1].
>
> [1] x86/percpu: Use C for arch_raw_cpu_ptr(),
>     https://lore.kernel.org/lkml/20231010164234.140750-1-ubizjak@gmail.com/

Me confused, can't you first switch to MOV with tcp_ptr__ += (unsigned
long)(ptr), and then introduce the RDGSBASE alternative?

> Unfortunately, I have no idea how to measure the impact of such a
> low-level feature, so I'll at least need some guidance. The "gut
> feeling" says that a special instruction, intended to support the
> feature, is always better than emulating said feature with a memory
> access.

AIUI, {RD,WR}{FS,GS}BASE were added as faster alternatives to {RD,WR}MSR,
not to accelerate actual accesses to per-CPU data, TLS, etc. E.g. loading a
64-bit base via a MOV to FS/GS is impossible. And presumably saving a
userspace-controlled value by actually accessing FS/GS is dangerous for one
reason or another. The instructions are guarded by a CR4 bit, and the ucode
cost just to check CR4.FSGSBASE is probably non-trivial.
* Uros Bizjak <ubizjak@gmail.com> wrote:

> Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
> add the ptr value to the base. This way, the compiler can propagate
> the addend to the following instruction and simplify address calculation.
>
> [...]
>
> The compiler can also eliminate redundant loads from this_cpu_off,
> reducing the number of percpu offset reads (either from this_cpu_off
> or with rdgsbase) from 1663 to 1571.
>
> Additionally, the patch introduces a 'rdgsbase' alternative for CPUs with
> X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> only decoding in the first decoder etc. But we're talking single-cycle
> kind of effects, and the rdgsbase case should be much better from
> a cache perspective and might use fewer memory pipeline resources to
> offset the fact that it uses an unusual front end decoder resource...

So the 'additionally' wording in the changelog should have been a big hint
already that the introduction of RDGSBASE usage needs to be a separate
patch. ;-)

Thanks,

	Ingo
On Sat, Oct 14, 2023 at 12:04 PM Ingo Molnar <mingo@kernel.org> wrote:
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
> > [...]
>
> So the 'additionally' wording in the changelog should have been a big hint
> already that the introduction of RDGSBASE usage needs to be a separate
> patch. ;-)

Indeed. I think that the first part should be universally beneficial, as
it converts

    mov symbol, %rax
    add %gs:this_cpu_off, %rax

to:

    mov %gs:this_cpu_off, %rax
    add symbol, %rax

and allows the compiler to propagate the addition into address calculation
(the latter is also similar to the code generated by the __seg_gs approach).

At this point, the "experimental" part could either:

a) introduce RDGSBASE: As discussed with Sean, this could be problematic,
   at least with KVM, and has some other drawbacks (e.g. larger binary
   size, limited CSE of asm).

b) move to the __seg_gs approach via _raw_cpu_read [1]: This approach
   solves the "limited CSE with assembly" compiler issue, since it exposes
   the load to the compiler, and has greater optimization potential.

[1] https://lore.kernel.org/lkml/20231010164234.140750-1-ubizjak@gmail.com/

Unfortunately, these two are mutually exclusive, since RDGSBASE is
implemented as asm. To move things forward, I propose to proceed
conservatively with the original patch [1], but split it into two parts.
The first will introduce the switch to MOV with tcp_ptr__ += (unsigned
long)(ptr), and the second will add the __seg_gs part. At this point, we
can experiment with RDGSBASE, and compare it with both approaches, with
and without __seg_gs, by just changing the asm template to:

+	asm (ALTERNATIVE("mov " __percpu_arg(1) ", %0",		\
+			 "rdgsbase %0",				\
+			 X86_FEATURE_FSGSBASE)			\

Uros.
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..e047a0bc5554 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,18 +49,32 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
-/*
- * Compared to the generic __my_cpu_offset version, the following
- * saves one instruction and avoids clobbering a temp register.
- */
+#ifdef CONFIG_X86_64
+#define arch_raw_cpu_ptr(ptr)					\
+({								\
+	unsigned long tcp_ptr__;				\
+	asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0",	\
+			 "rdgsbase %0",				\
+			 X86_FEATURE_FSGSBASE)			\
+	     : "=r" (tcp_ptr__)					\
+	     : "m" (__my_cpu_var(this_cpu_off)));		\
+								\
+	tcp_ptr__ += (unsigned long)(ptr);			\
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
+})
+#else /* CONFIG_X86_64 */
 #define arch_raw_cpu_ptr(ptr)					\
 ({								\
 	unsigned long tcp_ptr__;				\
-	asm ("add " __percpu_arg(1) ", %0"			\
+	asm ("movl " __percpu_arg(1) ", %0"			\
 	     : "=r" (tcp_ptr__)					\
-	     : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));	\
+	     : "m" (__my_cpu_var(this_cpu_off)));		\
+								\
+	tcp_ptr__ += (unsigned long)(ptr);			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
+#endif /* CONFIG_X86_64 */
+
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""