Message ID | 20230721161018.50214-1-brgerst@gmail.com |
---|---|
Headers |
From: Brian Gerst <brgerst@gmail.com>
To: linux-kernel@vger.kernel.org, x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>, Borislav Petkov <bp@alien8.de>, "H. Peter Anvin" <hpa@zytor.com>, Andy Lutomirski <luto@kernel.org>, Mika Penttilä <mpenttil@redhat.com>, Brian Gerst <brgerst@gmail.com>
Subject: [PATCH v2 0/6] x86: Clean up fast syscall return validation
Date: Fri, 21 Jul 2023 12:10:11 -0400
Message-ID: <20230721161018.50214-1-brgerst@gmail.com>
|
Series |
x86: Clean up fast syscall return validation
|
Message
Brian Gerst
July 21, 2023, 4:10 p.m. UTC
This patch set cleans up the tests done to determine if a fast syscall
return instruction can be used to return to userspace.  It converts the
code to C, and refactors existing code to be more readable.

v2:
- Fix shift value for canonical RIP test and use
  __is_canonical_address()

Brian Gerst (6):
  x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
  x86/entry/64: Convert SYSRET validation tests to C
  x86/entry/compat: Combine return value test from syscall handler
  x86/entry/32: Convert do_fast_syscall_32() to bool return type
  x86/entry/32: Remove SEP test for SYSEXIT
  x86/entry/32: Clean up syscall fast exit tests

 arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
 arch/x86/entry/entry_32.S        |  2 +-
 arch/x86/entry/entry_64.S        | 68 +---------------------
 arch/x86/entry/entry_64_compat.S | 12 ++--
 arch/x86/include/asm/syscall.h   |  6 +-
 5 files changed, 77 insertions(+), 110 deletions(-)
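For readers skimming the thread: the checks being moved out of entry_64.S are the long-standing SYSRET preconditions. The following is a rough, illustrative C sketch of what such a converted test looks like - it is not the literal patch, and the function name is made up for illustration:

	/*
	 * Illustrative sketch only (not the posted patch): the kind of C test
	 * do_syscall_64() ends up with, mirroring the checks the asm in
	 * entry_64.S has always documented.  The helper name is an assumption.
	 */
	static __always_inline bool can_sysret(struct pt_regs *regs)
	{
		/* SYSRET requires RCX == RIP and R11 == RFLAGS */
		if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
			return false;

		/* CS and SS must match the values hard-coded in MSR_STAR */
		if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
			return false;

		/*
		 * On Intel, SYSRET with a non-canonical RIP #GPs from ring 0
		 * with a user-controlled RSP, so the RIP must be canonical.
		 */
		if (unlikely(!__is_canonical_address(regs->ip,
						     pgtable_l5_enabled() ? 57 : 48)))
			return false;

		/* SYSRET cannot restore RF, and restoring TF would trap immediately */
		if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
			return false;

		return true;
	}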
Comments
* Brian Gerst <brgerst@gmail.com> wrote:

> This patch set cleans up the tests done to determine if a fast syscall
> return instruction can be used to return to userspace.  It converts the
> code to C, and refactors existing code to be more readable.
>
> v2:
> - Fix shift value for canonical RIP test and use
>   __is_canonical_address()
>
> Brian Gerst (6):
>   x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
>   x86/entry/64: Convert SYSRET validation tests to C
>   x86/entry/compat: Combine return value test from syscall handler
>   x86/entry/32: Convert do_fast_syscall_32() to bool return type
>   x86/entry/32: Remove SEP test for SYSEXIT
>   x86/entry/32: Clean up syscall fast exit tests
>
>  arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
>  arch/x86/entry/entry_32.S        |  2 +-
>  arch/x86/entry/entry_64.S        | 68 +---------------------
>  arch/x86/entry/entry_64_compat.S | 12 ++--
>  arch/x86/include/asm/syscall.h   |  6 +-
>  5 files changed, 77 insertions(+), 110 deletions(-)

Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry,
(ie. skipped #2 & #6 for now), as they look correct and are good
improvements.  None of these four patches depend on the skipped
patches in some way I missed, correct?

As for #2, I looked at the before/after disassembly, and the new
C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
if I was reading it right.

Mind re-evaluating that, and if you still think the C conversion
is a good idea, mind putting a before/after analysis of the
generated instructions into the changelog?  This is our primary
system call return path after all.

Thanks,

	Ingo
On Thu, Oct 5, 2023 at 4:22 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
> > This patch set cleans up the tests done to determine if a fast syscall
> > return instruction can be used to return to userspace.  It converts the
> > code to C, and refactors existing code to be more readable.
> >
> > v2:
> > - Fix shift value for canonical RIP test and use
> >   __is_canonical_address()
> >
> > Brian Gerst (6):
> >   x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
> >   x86/entry/64: Convert SYSRET validation tests to C
> >   x86/entry/compat: Combine return value test from syscall handler
> >   x86/entry/32: Convert do_fast_syscall_32() to bool return type
> >   x86/entry/32: Remove SEP test for SYSEXIT
> >   x86/entry/32: Clean up syscall fast exit tests
> >
> >  arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
> >  arch/x86/entry/entry_32.S        |  2 +-
> >  arch/x86/entry/entry_64.S        | 68 +---------------------
> >  arch/x86/entry/entry_64_compat.S | 12 ++--
> >  arch/x86/include/asm/syscall.h   |  6 +-
> >  5 files changed, 77 insertions(+), 110 deletions(-)
>
> Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry,
> (ie. skipped #2 & #6 for now), as they look correct and are good
> improvements.  None of these four patches depend on the skipped
> patches in some way I missed, correct?
>
> As for #2, I looked at the before/after disassembly, and the new
> C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
> if I was reading it right.
>
> Mind re-evaluating that, and if you still think the C conversion
> is a good idea, mind putting a before/after analysis of the
> generated instructions into the changelog?  This is our primary
> system call return path after all.

Looking at the compiled output, the only suboptimal code appears to be
the canonical address test, where the C code uses the CL register for
the shifts instead of immediates.

 180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
                        181: R_X86_64_PC32      .altinstr_aux-0x4
 185:   b9 07 00 00 00          mov    $0x7,%ecx
 18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
 18c:   b9 10 00 00 00          mov    $0x10,%ecx
 191:   48 89 c2                mov    %rax,%rdx
 194:   48 d3 e2                shl    %cl,%rdx
 197:   48 d3 fa                sar    %cl,%rdx
 19a:   48 39 d0                cmp    %rdx,%rax
 19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>

Was there anything else specifically that you can point out?

Brian Gerst
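For context, the helpers behind that code are essentially the generic canonical-address routines from arch/x86/include/asm/page.h (paraphrased below). Because the number of address bits is only known at run time (pgtable_l5_enabled() ? 57 : 48), the shift count has to live in %cl instead of being an immediate - which is exactly the mov $0x7 / mov $0x10 pair (64 - 57 and 64 - 48) visible in the dump above:

	/* Paraphrase of the generic helpers; not new code from this series. */
	static __always_inline u64 __canonical_address(u64 vaddr, u8 vaddr_bits)
	{
		/* sign-extend from bit (vaddr_bits - 1) upward */
		return ((s64)vaddr << (64 - vaddr_bits)) >> (64 - vaddr_bits);
	}

	static __always_inline bool __is_canonical_address(u64 vaddr, u8 vaddr_bits)
	{
		return __canonical_address(vaddr, vaddr_bits) == vaddr;
	}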
* Brian Gerst <brgerst@gmail.com> wrote:

> Looking at the compiled output, the only suboptimal code appears to be
> the canonical address test, where the C code uses the CL register for
> the shifts instead of immediates.
>
>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
>                         181: R_X86_64_PC32      .altinstr_aux-0x4
>  185:   b9 07 00 00 00          mov    $0x7,%ecx
>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
>  191:   48 89 c2                mov    %rax,%rdx
>  194:   48 d3 e2                shl    %cl,%rdx
>  197:   48 d3 fa                sar    %cl,%rdx
>  19a:   48 39 d0                cmp    %rdx,%rax
>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>

Yeah, it didn't look equivalent - so I guess we want a C equivalent for:

-       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
-               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

instead of the pgtable_l5_enabled() runtime test that
__is_canonical_address() uses?

Thanks,

	Ingo
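One way to keep the immediate shift counts while still writing the test from C is to wrap the existing ALTERNATIVE in inline asm and let the compiler pick the scratch register. A hypothetical sketch (not from the posted series; assumes asm/alternative.h and cpufeatures.h):

	/*
	 * Hypothetical sketch: the original patched shl/sar pair, moved into
	 * inline asm so the immediates survive while the compiler still
	 * allocates the register.
	 */
	static __always_inline bool rip_is_canonical(unsigned long ip)
	{
		unsigned long tmp = ip;

		asm(ALTERNATIVE("shl $(64 - 48), %[tmp]; sar $(64 - 48), %[tmp]",
				"shl $(64 - 57), %[tmp]; sar $(64 - 57), %[tmp]",
				X86_FEATURE_LA57)
		    : [tmp] "+r" (tmp));

		return tmp == ip;
	}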
On 10/5/23 13:20, Ingo Molnar wrote:
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
>> Looking at the compiled output, the only suboptimal code appears to be
>> the canonical address test, where the C code uses the CL register for
>> the shifts instead of immediates.
>>
>>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
>>                         181: R_X86_64_PC32      .altinstr_aux-0x4
>>  185:   b9 07 00 00 00          mov    $0x7,%ecx
>>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
>>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
>>  191:   48 89 c2                mov    %rax,%rdx
>>  194:   48 d3 e2                shl    %cl,%rdx
>>  197:   48 d3 fa                sar    %cl,%rdx
>>  19a:   48 39 d0                cmp    %rdx,%rax
>>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
>
> Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
>
> -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
>
> instead of the pgtable_l5_enabled() runtime test that
> __is_canonical_address() uses?

I don't think such a thing (without simply duplicating the above as an
alternative asm, which is obviously easy enough, and still allows the
compiler to pick the register used) would be possible without immediate
patching support[*].

Incidentally, this is a question for Uros: is there a reason this is a
mov to %ecx and not just %cl, which would save 3 bytes?

Incidentally, it is possible to save one instruction and use only *one*
alternative immediate:

	leaq (%rax,%rax),%rdx
	xorq %rax,%rdx
	shrq $(63 - LA),%rdx	# Yes, 63, not 64
	# ZF=1 if canonical

This works because if bit [x] is set in the output, then bit [x] and
[x-1] in the input are different (bit [-1] considered to be zero); and
by definition an address is canonical if and only if all the bits [63:LA]
are identical, thus bits [63:LA+1] in the output must all be zero.

The first two instructions are pure arithmetic and can thus be done in C:

	bar = foo ^ (foo << 1);

... leaving only one instruction needing to be patched at runtime.

	-hpa

[*] which is a whole bit of infrastructure that I know first-hand is
extremely complex[**] -- I had an almost-complete patchset done at one
time, but it has severely bitrotted.  The good part is that it is
probably a lot easier to do today, with the alternatives-patching
callback routines available.

[**] the absolute best would of course be if such infrastructure could
be compiler-supported (probably as part of some really fancy support for
alternatives/static branch), but that would probably be a nightmare to
do in the compiler or a plugin.
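The arithmetic behind the xor trick can be checked in isolation. A small, self-contained sketch follows, with the shift count written directly as the number of address bits (48 or 57) - the one value that would still need patching by an alternative:

	/*
	 * Userspace sanity check of the v ^ (v << 1) canonicality trick:
	 * an address is canonical for 'bits' address bits iff bits
	 * [63:bits-1] are all identical, which after v ^ (v << 1) is the
	 * same as bits [63:bits] all being zero.
	 */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	static bool is_canonical(uint64_t v, unsigned int bits)	/* bits = 48 or 57 */
	{
		return ((v ^ (v << 1)) >> bits) == 0;
	}

	int main(void)
	{
		printf("%d\n", is_canonical(0x00007ffffffff000ULL, 48));	/* 1: user half */
		printf("%d\n", is_canonical(0xffff800000000000ULL, 48));	/* 1: kernel half */
		printf("%d\n", is_canonical(0x0000800000000000ULL, 48));	/* 0: in the hole */
		return 0;
	}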
On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <brgerst@gmail.com> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> >>                         181: R_X86_64_PC32      .altinstr_aux-0x4
> >>  185:   b9 07 00 00 00          mov    $0x7,%ecx
> >>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> >>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
> >>  191:   48 89 c2                mov    %rax,%rdx
> >>  194:   48 d3 e2                shl    %cl,%rdx
> >>  197:   48 d3 fa                sar    %cl,%rdx
> >>  19a:   48 39 d0                cmp    %rdx,%rax
> >>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
>
> I don't think such a thing (without simply duplicating the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?
>
> Incidentally, it is possible to save one instruction and use only *one*
> alternative immediate:
>
> 	leaq (%rax,%rax),%rdx
> 	xorq %rax,%rdx
> 	shrq $(63 - LA),%rdx	# Yes, 63, not 64
> 	# ZF=1 if canonical
>
> This works because if bit [x] is set in the output, then bit [x] and
> [x-1] in the input are different (bit [-1] considered to be zero); and
> by definition an address is canonical if and only if all the bits [63:LA]
> are identical, thus bits [63:LA+1] in the output must all be zero.
>
> The first two instructions are pure arithmetic and can thus be done in C:
>
> 	bar = foo ^ (foo << 1);
>
> ... leaving only one instruction needing to be patched at runtime.
>
> 	-hpa

One other alternative I have been considering is comparing against
TASK_SIZE_MAX.  The only user-executable address above that is the
long deprecated vsyscall page.  IMHO it's not worth optimizing for
that case, so just let it fall back to using IRET.

	if (unlikely(regs->ip >= TASK_SIZE_MAX))
		return false;

compiles to:

 180:   48 b9 00 f0 ff ff ff    movabs $0x7ffffffff000,%rcx
 187:   7f 00 00
 18a:   48 39 c8                cmp    %rcx,%rax
 18d:   73 39                   jae    1c8 <do_syscall_64+0xc8>

0000000000000000 <.altinstr_replacement>:
   0:   48 b9 00 f0 ff ff ff    movabs $0xfffffffffff000,%rcx
   7:   ff ff 00

Brian Gerst
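The two immediates in that dump line up with TASK_SIZE_MAX for 4-level and 5-level paging, i.e. (1 << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE with shifts of 47 and 56, so the alternative only has to patch a single 64-bit constant. A quick compile-time check of the arithmetic (standalone, not kernel code):

	/* Sanity-check the constants shown in the disassembly above. */
	#define PAGE_SIZE	0x1000ULL

	/* 4-level paging: __VIRTUAL_MASK_SHIFT == 47 */
	_Static_assert((1ULL << 47) - PAGE_SIZE == 0x00007ffffffff000ULL,
		       "matches the default movabs immediate");
	/* 5-level paging (LA57): __VIRTUAL_MASK_SHIFT == 56 */
	_Static_assert((1ULL << 56) - PAGE_SIZE == 0x00fffffffffff000ULL,
		       "matches the .altinstr_replacement immediate");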
On 10/6/23 11:59, H. Peter Anvin wrote:
>
> Incidentally, it is possible to save one instruction and use only *one*
> alternative immediate:
>
> 	leaq (%rax,%rax),%rdx
> 	xorq %rax,%rdx
> 	shrq $(63 - LA),%rdx	# Yes, 63, not 64
> 	# ZF=1 if canonical
>
> This works because if bit [x] is set in the output, then bit [x] and
> [x-1] in the input are different (bit [-1] considered to be zero); and
> by definition an address is canonical if and only if all the bits [63:LA]
> are identical, thus bits [63:LA+1] in the output must all be zero.

Yes, I'm a doofus.  Bits [63:LA-1] must be identical, so 64 is correct :$)

	-hpa
* Brian Gerst <brgerst@gmail.com> wrote:

> On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > On 10/5/23 13:20, Ingo Molnar wrote:
> > >
> > > * Brian Gerst <brgerst@gmail.com> wrote:
> > >
> > >> Looking at the compiled output, the only suboptimal code appears to be
> > >> the canonical address test, where the C code uses the CL register for
> > >> the shifts instead of immediates.
> > >>
> > >>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> > >>                         181: R_X86_64_PC32      .altinstr_aux-0x4
> > >>  185:   b9 07 00 00 00          mov    $0x7,%ecx
> > >>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> > >>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
> > >>  191:   48 89 c2                mov    %rax,%rdx
> > >>  194:   48 d3 e2                shl    %cl,%rdx
> > >>  197:   48 d3 fa                sar    %cl,%rdx
> > >>  19a:   48 39 d0                cmp    %rdx,%rax
> > >>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> > >
> > > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> > >
> > > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> > >
> > > instead of the pgtable_l5_enabled() runtime test that
> > > __is_canonical_address() uses?
> >
> > I don't think such a thing (without simply duplicating the above as an
> > alternative asm, which is obviously easy enough, and still allows the
> > compiler to pick the register used) would be possible without immediate
> > patching support[*].
> >
> > Incidentally, this is a question for Uros: is there a reason this is a
> > mov to %ecx and not just %cl, which would save 3 bytes?
> >
> > Incidentally, it is possible to save one instruction and use only *one*
> > alternative immediate:
> >
> > 	leaq (%rax,%rax),%rdx
> > 	xorq %rax,%rdx
> > 	shrq $(63 - LA),%rdx	# Yes, 63, not 64
> > 	# ZF=1 if canonical
> >
> > This works because if bit [x] is set in the output, then bit [x] and
> > [x-1] in the input are different (bit [-1] considered to be zero); and
> > by definition an address is canonical if and only if all the bits [63:LA]
> > are identical, thus bits [63:LA+1] in the output must all be zero.
> >
> > The first two instructions are pure arithmetic and can thus be done in C:
> >
> > 	bar = foo ^ (foo << 1);
> >
> > ... leaving only one instruction needing to be patched at runtime.
> >
> > 	-hpa
>
> One other alternative I have been considering is comparing against
> TASK_SIZE_MAX.  The only user-executable address above that is the
> long deprecated vsyscall page.  IMHO it's not worth optimizing for
> that case, so just let it fall back to using IRET.
>
> 	if (unlikely(regs->ip >= TASK_SIZE_MAX))
> 		return false;
>
> compiles to:
>
>  180:   48 b9 00 f0 ff ff ff    movabs $0x7ffffffff000,%rcx
>  187:   7f 00 00
>  18a:   48 39 c8                cmp    %rcx,%rax
>  18d:   73 39                   jae    1c8 <do_syscall_64+0xc8>
>
> 0000000000000000 <.altinstr_replacement>:
>    0:   48 b9 00 f0 ff ff ff    movabs $0xfffffffffff000,%rcx
>    7:   ff ff 00

That sounds good - and we could do this as a separate patch on top of
your existing patches, to keep it bisectable in case there's any
problems.

Thanks,

	Ingo
On Fri, Oct 6, 2023 at 8:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <brgerst@gmail.com> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> >>                         181: R_X86_64_PC32      .altinstr_aux-0x4
> >>  185:   b9 07 00 00 00          mov    $0x7,%ecx
> >>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> >>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
> >>  191:   48 89 c2                mov    %rax,%rdx
> >>  194:   48 d3 e2                shl    %cl,%rdx
> >>  197:   48 d3 fa                sar    %cl,%rdx
> >>  19a:   48 39 d0                cmp    %rdx,%rax
> >>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
>
> I don't think such a thing (without simply duplicating the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?

The compiler uses 32-bit moves to move values between registers, even
when they are less than 32 bits wide.  To avoid partial register stalls,
it uses 32-bit operations as much as possible (e.g. it zero-extends from
memory to load an 8-bit value, loads 32-bit constants, etc.).  Since the
kernel is compiled with -O2, the compiler does not care that much about
the size of instructions, and it uses the full 32-bit width to initialize
a register with a constant.

Please note that the 8-bit movb instruction in fact represents an insert
into a word-mode register.  The compiler does not know how this word-mode
register will be used, so to avoid partial register stalls, it takes a
cautious approach and (with -O2) moves the constant to a register with a
word-width instruction.

Also, the compiler is quite eager to CSE constants.  When there are two
or more uses of the same constant, it will move the constant into a
register first.

Uros.
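A minimal userspace reproduction of the code generation being described - a hypothetical standalone example, not kernel code - where a runtime shift count forces the compiler to materialize the count in %ecx (typically with a 5-byte mov $imm32 rather than a 2-byte mov to %cl, matching the dump earlier in the thread) before shifting by %cl:

	/*
	 * Standalone demo: because 'shift' is only known at run time, the
	 * compiler loads it into %ecx/%cl and emits shl/sar %cl, mirroring
	 * the do_syscall_64() output quoted above.  Relies on GCC's
	 * arithmetic right-shift semantics, like the kernel helper does.
	 */
	#include <stdint.h>

	uint64_t sign_extend_demo(uint64_t vaddr, int la57)
	{
		int shift = la57 ? 64 - 57 : 64 - 48;	/* 7 or 16, runtime value */

		return (uint64_t)(((int64_t)vaddr << shift) >> shift);
	}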
On Sat, 7 Oct 2023 at 02:57, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Please note that the 8-bit movb instruction in fact represents an insert
> into a word-mode register.  The compiler does not know how this word-mode
> register will be used, so to avoid partial register stalls, it takes a
> cautious approach and (with -O2) moves the constant to a register with a
> word-width instruction.

Yes. In this case I really think it's our kernel code that is disgusting.

People seem to think that because cpu_feature_enabled() uses static
jumps, it's somehow "free".  Not so.  The static jumps are often
horrendous for code quality, and they effectively make the compiler
blind to otherwise obvious optimizations.

We need to stop using those so blindly.  I see code like this

	return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;

and it just makes me sad.

And yes, that virtual mask is disgusting.  This:

	#define __VIRTUAL_MASK_SHIFT	(pgtable_l5_enabled() ? 56 : 47)

is *NOT* cheap.  The end result is *not* a constant, it's a two-way
branch, and while the branch is static and the compiler may then
optimize each side of the branch with the constant in question, most of
the time it just results in *disgusting* code like this where gcc just
has to load a constant into a register, and treat it as a variable.

With the code shuffling, it's possibly *more* expensive than just
loading from memory, in that you just change a D$ load into an I$ one
instead.  I'm sure that looks great on benchmarks that stay entirely in
the I$, but ...

Something like using alternatives is *much* cheaper.  Better yet is the
suggestion to simplify the conditionals entirely to not depend on the
virtual shift at all (ie our recent access_ok() simplifications that
were due to _another_ CPU feature).

> Also, the compiler is quite eager to CSE constants.  When there are two
> or more uses of the same constant, it will move the constant into a
> register first.

Now, honestly, that may be a good thing for other architectures that
have issues with constants, but on x86 - and integer constants - it's
likely a bad thing unless the constant is particularly complicated.

Most x86 instructions will take the usual 8/32-bit constants (but if we
have arbitrary 64-bit ones, you tend to need them in registers).

		Linus