Message ID | 20231024-delay-verw-v2-1-f1881340c807@linux.intel.com |
---|---|
State | New |
Headers |
Date: Tue, 24 Oct 2023 01:08:21 -0700 From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Peter Zijlstra <peterz@infradead.org>, Josh Poimboeuf <jpoimboe@kernel.org>, Andy Lutomirski <luto@kernel.org>, Jonathan Corbet <corbet@lwn.net>, Sean Christopherson <seanjc@google.com>, Paolo Bonzini <pbonzini@redhat.com>, tony.luck@intel.com, ak@linux.intel.com, tim.c.chen@linux.intel.com Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, Alyssa Milburn <alyssa.milburn@linux.intel.com>, Daniel Sneddon <daniel.sneddon@linux.intel.com>, antonio.gomez.iglesias@linux.intel.com, Pawan Gupta <pawan.kumar.gupta@linux.intel.com>, Alyssa Milburn <alyssa.milburn@intel.com> Subject: [PATCH v2 1/6] x86/bugs: Add asm helpers for executing VERW Message-ID: <20231024-delay-verw-v2-1-f1881340c807@linux.intel.com> References: <20231024-delay-verw-v2-0-f1881340c807@linux.intel.com> In-Reply-To: <20231024-delay-verw-v2-0-f1881340c807@linux.intel.com> |
Series | Delay VERW |
Commit Message
Pawan Gupta
Oct. 24, 2023, 8:08 a.m. UTC
MDS mitigation requires clearing the CPU buffers before returning to
user. This needs to be done late in the exit-to-user path. Current
location of VERW leaves a possibility of kernel data ending up in CPU
buffers for memory accesses done after VERW such as:
1. Kernel data accessed by an NMI between VERW and return-to-user can
remain in CPU buffers (since an NMI returning to the kernel does not
execute VERW to clear CPU buffers).
2. Alyssa reported that after VERW is executed,
CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
call. Memory accesses during stack scrubbing can move kernel stack
contents into CPU buffers.
3. When caller-saved registers are restored after a return from a
function executing VERW, the kernel stack accesses can remain in
CPU buffers (since they occur after VERW).
To fix this, VERW needs to be moved very late in the exit-to-user path.
In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make them depend on a new feature flag
X86_FEATURE_CLEAR_CPU_BUF.
Reported-by: Alyssa Milburn <alyssa.milburn@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/nospec-branch.h | 19 +++++++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)
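For orientation, here is a minimal sketch of where a macro like CLEAR_CPU_BUFFERS is meant to be invoked: late in the 64-bit exit path, after the user registers have been restored, so that no kernel memory access follows the VERW. The ordering below is illustrative only and is not part of this patch.

	/* illustrative ordering only (macro names as used in the x86 entry
	 * code, exact surrounding context omitted):
	 *
	 *	POP_REGS		restore user registers
	 *	CLEAR_CPU_BUFFERS	VERW, after the last kernel memory access
	 *	swapgs
	 *	sysretq			return to userspace
	 */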
Comments
On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> +.macro CLEAR_CPU_BUFFERS
> +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> +	/* nopl __KERNEL_DS(%rax) */
> +	.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> +.Lverw_arg_\@: .word __KERNEL_DS;
> +.Ldo_verw_\@: verw _ASM_RIP(.Lverw_arg_\@);
> +.Lskip_verw_\@:
> +.endm

Why can't this be:

	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF

And have that mds_verw_sel thing be out-of-line ? That gives much better
code for the case where we don't need this.
On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
>
> > +.macro CLEAR_CPU_BUFFERS
> > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > +	/* nopl __KERNEL_DS(%rax) */
> > +	.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > +.Lverw_arg_\@: .word __KERNEL_DS;
> > +.Ldo_verw_\@: verw _ASM_RIP(.Lverw_arg_\@);
> > +.Lskip_verw_\@:
> > +.endm
>
> Why can't this be:
>
> 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
>
> And have that mds_verw_sel thing be out-of-line ?

I haven't done this way because its a tad bit fragile as it depends on
modules being within 4GB of kernel.

> That gives much better code for the case where we don't need this.

If this is the preferred way let me test this and roll a new revision.
On Tue, Oct 24, 2023 at 09:35:15AM -0700, Pawan Gupta wrote:
> On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> >
> > > +.macro CLEAR_CPU_BUFFERS
> > > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > > +	/* nopl __KERNEL_DS(%rax) */
> > > +	.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > > +.Lverw_arg_\@: .word __KERNEL_DS;
> > > +.Ldo_verw_\@: verw _ASM_RIP(.Lverw_arg_\@);
> > > +.Lskip_verw_\@:
> > > +.endm
> >
> > Why can't this be:
> >
> > 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
> >
> > And have that mds_verw_sel thing be out-of-line ?
>
> I haven't done this way because its a tad bit fragile as it depends on
> modules being within 4GB of kernel.

We 100% rely on that *everywhere*, nothing fragile about it.
On Tue, Oct 24, 2023 at 06:36:21PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 09:35:15AM -0700, Pawan Gupta wrote:
> > On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> > >
> > > > +.macro CLEAR_CPU_BUFFERS
> > > > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > > > +	/* nopl __KERNEL_DS(%rax) */
> > > > +	.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > > > +.Lverw_arg_\@: .word __KERNEL_DS;
> > > > +.Ldo_verw_\@: verw _ASM_RIP(.Lverw_arg_\@);
> > > > +.Lskip_verw_\@:
> > > > +.endm
> > >
> > > Why can't this be:
> > >
> > > 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
> > >
> > > And have that mds_verw_sel thing be out-of-line ?
> >
> > I haven't done this way because its a tad bit fragile as it depends on
> > modules being within 4GB of kernel.
>
> We 100% rely on that *everywhere*, nothing fragile about it.

I didn't know that, doing it this way then. Thanks.
On Tue, Oct 24, 2023 at 09:45:20AM -0700, Pawan Gupta wrote:
> > > modules being within 4GB of kernel.
FWIW, it's 2G, it's a s32 displacement, the highest most address can
jump 2g down, while the lowest most address can jump 2g up. Leaving a 2G
directly addressable range.
And yeah, we ensure kernel text and modules are inside that 2G range.
On Tue, Oct 24, 2023 at 07:02:48PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 09:45:20AM -0700, Pawan Gupta wrote:
>
> > > > modules being within 4GB of kernel.
>
> FWIW, it's 2G, it's a s32 displacement, the highest most address can
> jump 2g down, while the lowest most address can jump 2g up. Leaving a 2G
> directly addressable range.
>
> And yeah, we ensure kernel text and modules are inside that 2G range.

Ah, okay. Thanks for the info.
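To make the s32-displacement point concrete, here is a hand-written sketch (not from the thread): a RIP-relative operand is encoded with a signed 32-bit displacement, so the referenced symbol has to sit within roughly +/-2 GiB of the instruction, which is why kernel text and modules are kept inside one 2G window. The selector value below is only a stand-in for the demo.

	/* sketch.S: assemble with "as --64 -o sketch.o sketch.S";
	 * "objdump -dr sketch.o" shows the 4-byte displacement field:
	 *
	 *   0:  0f 00 2d 00 00 00 00   verw 0x0(%rip)
	 *            3: R_X86_64_PC32  mds_verw_sel-0x4
	 */
	.text
	verw	mds_verw_sel(%rip)	/* reach limited to +/-2 GiB */

	.section .rodata
mds_verw_sel:
	.word	0x18			/* stand-in selector value for the demo */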
On October 24, 2023 10:02:48 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 24, 2023 at 09:45:20AM -0700, Pawan Gupta wrote:
>
> > > > modules being within 4GB of kernel.
>
> FWIW, it's 2G, it's a s32 displacement, the highest most address can
> jump 2g down, while the lowest most address can jump 2g up. Leaving a 2G
> directly addressable range.
>
> And yeah, we ensure kernel text and modules are inside that 2G range.

To be specific, we don't require that it is located at any particular
*physical* address, but all modules including the root module are remapped
into the [-2GiB,0) range.

If we didn't do that, modules would have to be compiled with the pic memory
model rather than the kernel memory model, which is what they currently use.
This would add substantial overhead due to the need for a GOT (the PLT is
optional if all symbols are resolved at load time.)

The kernel is different from user space objects since it is always fully
loaded into physical memory and is never paged or shared. Therefore, inline
relocations, which break sharing and create dirty pages in user space, have
zero execution cost in the kernel; the only overhead to modules other than
load time (including the runtime linking) is that modules can't
realistically be mapped using large page entries.
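A rough sketch of the codegen difference hpa describes (hand-written to mirror typical compiler output; some_global is a placeholder, not a real symbol): the kernel memory model reaches a global directly, while the pic model has to load its address from the GOT first.

	/* kernel memory model (-mcmodel=kernel): one direct access, possible
	 * because the symbol lives inside the same 2 GiB window */
	movq	some_global(%rip), %rax

	/* pic memory model (-fpic): the address is first loaded from the GOT,
	 * then dereferenced, i.e. an extra memory access per reference */
	movq	some_global@GOTPCREL(%rip), %rax
	movq	(%rax), %rax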
> the only overhead to modules other than load time (including the runtime linking) is that modules can't realistically be mapped using large page entries.
If there were some significant win for using large pages, couldn't the
kernel pre-allocate some 2MB pages in the [-2GiB,0) range? Boot parameter
for how many (perhaps two for separate code/data pages). First few loaded
modules get to use that space until it is all gone.
It would all be quite messy if those modules were later unloaded/reloaded
... so there would have to be some compelling benchmarks to justify
the complexity.
That's probably why Peter said "can't realistically".
-Tony
On October 24, 2023 11:49:07 AM PDT, "Luck, Tony" <tony.luck@intel.com> wrote:
> > the only overhead to modules other than load time (including the runtime
> > linking) is that modules can't realistically be mapped using large page
> > entries.
>
> If there were some significant win for using large pages, couldn't the
> kernel pre-allocate some 2MB pages in the [-2GiB,0) range? Boot parameter
> for how many (perhaps two for separate code/data pages). First few loaded
> modules get to use that space until it is all gone.
>
> It would all be quite messy if those modules were later unloaded/reloaded
> ... so there would have to be some compelling benchmarks to justify
> the complexity.
>
> That's probably why Peter said "can't realistically".
>
> -Tony

Sure it could, but it would mean the kernel is sitting on an average of 6 MB
of unusable memory. It would also mean that unloaded modules would create
holes in that memory which would have to be managed.
> Sure it could, but it would mean the kernel is sitting on an average of 6 MB of unusable memory. It would also mean that unloaded modules would create holes in that memory which would have to be managed.
On my Fedora38 desktop:
$ lsmod | awk '{ bytes += $2 } END {print bytes/(1024*1024)}'
21.0859
Lots more than 6MB memory already essentially pinned by loaded modules.
$ head -3 /proc/meminfo
MemTotal: 65507344 kB
MemFree: 56762336 kB
MemAvailable: 63358552 kB
Pinning 20 or so Mbytes isn't going to make a dent in that free memory.
Managing the holes for unloading/reloading modules adds some complexity ... but shouldn't be awful.
If this code managed at finer granularity than "page", it would save some memory.
$ lsmod | wc -l
123
All those modules rounding text/data up to 4K boundaries is wasting a bunch of it.
-Tony
On October 24, 2023 12:40:02 PM PDT, "Luck, Tony" <tony.luck@intel.com> wrote:
> > Sure it could, but it would mean the kernel is sitting on an average of
> > 6 MB of unusable memory. It would also mean that unloaded modules would
> > create holes in that memory which would have to be managed.
>
> On my Fedora38 desktop:
>
> $ lsmod | awk '{ bytes += $2 } END {print bytes/(1024*1024)}'
> 21.0859
>
> Lots more than 6MB memory already essentially pinned by loaded modules.
>
> $ head -3 /proc/meminfo
> MemTotal:       65507344 kB
> MemFree:        56762336 kB
> MemAvailable:   63358552 kB
>
> Pinning 20 or so Mbytes isn't going to make a dent in that free memory.
>
> Managing the holes for unloading/reloading modules adds some complexity
> ... but shouldn't be awful.
>
> If this code managed at finer granularity than "page", it would save
> some memory.
>
> $ lsmod | wc -l
> 123
>
> All those modules rounding text/data up to 4K boundaries is wasting a
> bunch of it.
>
> -Tony

Sure, but is it worth the effort?
On Tue, Oct 24, 2023 at 11:27:36AM -0700, H. Peter Anvin wrote:
> the only overhead to modules other than load time (including the
> runtime linking) is that modules can't realistically be mapped using
> large page entries.

Song is working on that. Currently the module allocator is split between
all (5 IIRC) different page permissions required, next step is using large
page pools for those.
On Tue, Oct 24, 2023 at 06:49:07PM +0000, Luck, Tony wrote:
> > the only overhead to modules other than load time (including the runtime
> > linking) is that modules can't realistically be mapped using large page
> > entries.
>
> If there were some significant win for using large pages, couldn't the

There is. The 4k TLBs really hurt. Thomas and me ran into that when doing
the retbleed call depth crud. Similarly Song ran into it with BPF, they
really want eBPF JIT to write to large pages.
On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
>
> > +.macro CLEAR_CPU_BUFFERS
> > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > +	/* nopl __KERNEL_DS(%rax) */
> > +	.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > +.Lverw_arg_\@: .word __KERNEL_DS;
> > +.Ldo_verw_\@: verw _ASM_RIP(.Lverw_arg_\@);
> > +.Lskip_verw_\@:
> > +.endm
>
> Why can't this be:
>
> 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
>
> And have that mds_verw_sel thing be out-of-line ? That gives much better
> code for the case where we don't need this.

Overall the code generated with this approach is much better. But, in my
testing I am seeing an issue with runtime patching in 32-bit mode, when
mitigations are off. Instead of NOPs I am seeing a random instruction. I
don't see any issue with 64-bit mode.

config1: mitigations=on, 32-bit mode, post-boot

entry_SYSENTER_32:
   ...
   0xc1a3748e <+222>:	pop    %eax
   0xc1a3748f <+223>:	verw   0xc1a38240
   0xc1a37496 <+230>:	sti
   0xc1a37497 <+231>:	sysexit

---------------------------------------------

config2: mitigations=off, 32-bit mode, post-boot

entry_SYSENTER_32:
   ...
   0xc1a3748e <+222>:	pop    %eax
   0xc1a3748f <+223>:	lea    0x0(%esi,%eiz,1),%esi	<---- Doesn't look right
   0xc1a37496 <+230>:	sti
   0xc1a37497 <+231>:	sysexit

---------------------------------------------

config3: 32-bit mode, pre-boot objdump

entry_SYSENTER_32:
   ...
   c8e:	58		pop    %eax
   c8f:	90		nop
   c90:	90		nop
   c91:	90		nop
   c92:	90		nop
   c93:	90		nop
   c94:	90		nop
   c95:	90		nop
   c96:	fb		sti
   c97:	0f 35		sysexit

These tests were done with the below patch:

-----8<-----
From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Date: Mon, 23 Oct 2023 15:04:56 -0700
Subject: [PATCH] x86/bugs: Add asm helpers for executing VERW

MDS mitigation requires clearing the CPU buffers before returning to
user. This needs to be done late in the exit-to-user path. Current
location of VERW leaves a possibility of kernel data ending up in CPU
buffers for memory accesses done after VERW such as:

1. Kernel data accessed by an NMI between VERW and return-to-user can
   remain in CPU buffers (since an NMI returning to the kernel does not
   execute VERW to clear CPU buffers).

2. Alyssa reported that after VERW is executed,
   CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
   call. Memory accesses during stack scrubbing can move kernel stack
   contents into CPU buffers.

3. When caller-saved registers are restored after a return from a
   function executing VERW, the kernel stack accesses can remain in
   CPU buffers (since they occur after VERW).

To fix this, VERW needs to be moved very late in the exit-to-user path.

In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make them depend on a new feature flag
X86_FEATURE_CLEAR_CPU_BUF.
Reported-by: Alyssa Milburn <alyssa.milburn@intel.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h   |  2 +-
 arch/x86/include/asm/nospec-branch.h | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..f21fc0f12737 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,10 +308,10 @@
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
 #define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
-
 #define X86_FEATURE_SRSO		(11*32+24) /* "" AMD BTB untrain RETs */
 #define X86_FEATURE_SRSO_ALIAS		(11*32+25) /* "" AMD BTB untrain RETs through aliasing */
 #define X86_FEATURE_IBPB_ON_VMEXIT	(11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_CLEAR_CPU_BUF	(11*32+27) /* "" Clear CPU buffers */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c55cc243592e..ed8218e2d9a7 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -13,6 +13,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/percpu.h>
 #include <asm/current.h>
+#include <asm/segment.h>
 
 /*
  * Call depth tracking for Intel SKL CPUs to address the RSB underflow
@@ -329,6 +330,29 @@
 #endif
 .endm
 
+/*
+ * Macros to execute VERW instruction that mitigate transient data sampling
+ * attacks such as MDS. On affected systems a microcode update overloaded VERW
+ * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
+ *
+ * Note: Only the memory operand variant of VERW clears the CPU buffers.
+ */
+.pushsection .rodata
+.align 64
+mds_verw_sel:
+	.word __KERNEL_DS
+	.byte 0xcc
+.align 64
+.popsection
+
+.macro EXEC_VERW
+	verw _ASM_RIP(mds_verw_sel)
+.endm
+
+.macro CLEAR_CPU_BUFFERS
+	ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_RETPOLINE_SAFE \
On Tue, Oct 24, 2023 at 09:00:29PM -0700, Pawan Gupta wrote:
> config1: mitigations=on, 32-bit mode, post-boot
>
> entry_SYSENTER_32:
>    ...
>    0xc1a3748e <+222>:	pop    %eax
>    0xc1a3748f <+223>:	verw   0xc1a38240
>    0xc1a37496 <+230>:	sti
>    0xc1a37497 <+231>:	sysexit
>
> ---------------------------------------------
>
> config2: mitigations=off, 32-bit mode, post-boot
>
> entry_SYSENTER_32:
>    ...
>    0xc1a3748e <+222>:	pop    %eax
>    0xc1a3748f <+223>:	lea    0x0(%esi,%eiz,1),%esi	<---- Doesn't look right
>    0xc1a37496 <+230>:	sti
>    0xc1a37497 <+231>:	sysexit
>
> ---------------------------------------------
>
> config3: 32-bit mode, pre-boot objdump
>
> entry_SYSENTER_32:
>    ...
>    c8e:	58		pop    %eax
>    c8f:	90		nop
>    c90:	90		nop
>    c91:	90		nop
>    c92:	90		nop
>    c93:	90		nop
>    c94:	90		nop
>    c95:	90		nop
>    c96:	fb		sti
>    c97:	0f 35		sysexit
>

If you look at arch/x86/include/asm/nops.h, you'll find (for 32bit):

 * 7: leal 0x0(%esi,%eiz,1),%esi

Which reads as:

  load-effective-address of %esi[0] into %esi

which is, of course, just %esi. You can also get this from GAS using:

  .nops 7

which results in:

   0:	8d b4 26 00 00 00 00	lea    0x0(%esi,%eiz,1),%esi

It is basically abusing bizarro x86 instruction encoding rules to create
a 7 byte nop without using NOPL.

If you really want to know, volume 2 of the SDM has a ton of stuff on
instruction encoding, also the interweb :-)
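Spelling out the byte-level decode of that 7-byte instruction (my own breakdown, not from the mail above), which also explains the config2 listing: with mitigations=off the feature bit stays clear, the patched site keeps its NOP padding, and the boot-time NOP optimizer rewrites the seven 0x90 bytes into this single long NOP.

	/* 8d b4 26 00 00 00 00 = lea 0x0(%esi,%eiz,1),%esi
	 *
	 *   8d           LEA opcode
	 *   b4           ModRM: mod=10 (disp32), reg=110 (%esi), rm=100 (SIB follows)
	 *   26           SIB:   scale=0, index=100 (none, shown as %eiz), base=110 (%esi)
	 *   00 00 00 00  disp32 = 0
	 *
	 * i.e. %esi = %esi + 0: architecturally a no-op, but one 7-byte
	 * instruction instead of seven 1-byte NOPs.
	 */
	.code32
	.nops	7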
On Tue, Oct 24, 2023 at 09:00:29PM -0700, Pawan Gupta wrote:
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index c55cc243592e..ed8218e2d9a7 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -13,6 +13,7 @@
>  #include <asm/unwind_hints.h>
>  #include <asm/percpu.h>
>  #include <asm/current.h>
> +#include <asm/segment.h>
> 
>  /*
>   * Call depth tracking for Intel SKL CPUs to address the RSB underflow
> @@ -329,6 +330,29 @@
>  #endif
>  .endm
> 
> +/*
> + * Macros to execute VERW instruction that mitigate transient data sampling
> + * attacks such as MDS. On affected systems a microcode update overloaded VERW
> + * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
> + *
> + * Note: Only the memory operand variant of VERW clears the CPU buffers.
> + */
> +.pushsection .rodata
> +.align 64
> +mds_verw_sel:
> +	.word __KERNEL_DS
> +	.byte 0xcc
> +.align 64
> +.popsection

This should not be in a header file, you'll get an instance of this per
translation unit, not what you want.

> +
> +.macro EXEC_VERW
> +	verw _ASM_RIP(mds_verw_sel)
> +.endm
> +
> +.macro CLEAR_CPU_BUFFERS
> +	ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
> +.endm
> +
> #else /* __ASSEMBLY__ */
On Wed, Oct 25, 2023 at 08:56:10AM +0200, Peter Zijlstra wrote:
> > config3: 32-bit mode, pre-boot objdump
> >
> > entry_SYSENTER_32:
> >    ...
> >    c8e:	58		pop    %eax
> >    c8f:	90		nop
> >    c90:	90		nop
> >    c91:	90		nop
> >    c92:	90		nop
> >    c93:	90		nop
> >    c94:	90		nop
> >    c95:	90		nop
> >    c96:	fb		sti
> >    c97:	0f 35		sysexit
> >
>
> If you look at arch/x86/include/asm/nops.h, you'll find (for 32bit):
>
>  * 7: leal 0x0(%esi,%eiz,1),%esi
>
> Which reads as:
>
>   load-effective-address of %esi[0] into %esi

Wow, never imagined that this would be one of the magician's tricks. I
will go read up on why it is better than NOPL.
On Wed, Oct 25, 2023 at 08:58:18AM +0200, Peter Zijlstra wrote:
> > +.pushsection .rodata
> > +.align 64
> > +mds_verw_sel:
> > +	.word __KERNEL_DS
> > +	.byte 0xcc
> > +.align 64
> > +.popsection
>
> This should not be in a header file, you'll get an instance of this per
> translation unit, not what you want.

Agh, sorry I missed it.
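One possible shape for that fix is sketched below: the selector word moves into a single .S file that is built exactly once, and the header macro only references the symbol. The file name, section choice and layout here are assumptions on my part, not taken from this thread; if the macro is also used from module code (KVM, for example), the symbol would additionally have to be exported.

/* mds_verw_sel.S (name assumed): built exactly once, so there is a single
 * instance of the VERW operand kernel-wide */
#include <linux/linkage.h>
#include <asm/segment.h>
#include <asm/cache.h>

	.section .rodata
	.align	L1_CACHE_BYTES
SYM_DATA_START(mds_verw_sel)
	.word	__KERNEL_DS
	.align	L1_CACHE_BYTES
SYM_DATA_END(mds_verw_sel)

/* the header then keeps only the reference, e.g.:
 *
 *	.macro CLEAR_CPU_BUFFERS
 *	 ALTERNATIVE "", __stringify(verw _ASM_RIP(mds_verw_sel)), X86_FEATURE_CLEAR_CPU_BUF
 *	.endm
 */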
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..f21fc0f12737 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,10 +308,10 @@
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
 #define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
-
 #define X86_FEATURE_SRSO		(11*32+24) /* "" AMD BTB untrain RETs */
 #define X86_FEATURE_SRSO_ALIAS		(11*32+25) /* "" AMD BTB untrain RETs through aliasing */
 #define X86_FEATURE_IBPB_ON_VMEXIT	(11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_CLEAR_CPU_BUF	(11*32+27) /* "" Clear CPU buffers */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c55cc243592e..c269ee74682c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -329,6 +329,25 @@
 #endif
 .endm
 
+/*
+ * Macro to execute VERW instruction to mitigate transient data sampling
+ * attacks such as MDS. On affected systems a microcode update overloaded VERW
+ * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
+ *
+ * Note: Only the memory operand variant of VERW clears the CPU buffers. To
+ * handle the case when VERW is executed after user registers are restored, use
+ * RIP to point the memory operand to a part of the NOPL instruction that
+ * contains __KERNEL_DS.
+ */
+.macro CLEAR_CPU_BUFFERS
+	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
+	/* nopl __KERNEL_DS(%rax) */
+	.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
+.Lverw_arg_\@:	.word __KERNEL_DS;
+.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
+.Lskip_verw_\@:
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_RETPOLINE_SAFE \
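For comparison, my own expansion sketch (not part of the patch) of what the two approaches leave in the exit path when X86_FEATURE_CLEAR_CPU_BUF is not set; this is what the "much better code" remark in the comments refers to.

	/* v2 macro above, feature bit clear: a taken branch over inline data
	 * on every exit */
	jmp	.Lskip_verw_1
	.byte	0x0f, 0x1f, 0x80, 0x00, 0x00	/* start of the nopl */
.Lverw_arg_1:
	.word	__KERNEL_DS			/* inline VERW operand */
.Ldo_verw_1:
	verw	_ASM_RIP(.Lverw_arg_1)		/* only reached when patched */
.Lskip_verw_1:

	/* out-of-line variant, feature bit clear: the site is only NOP padding */
	.nops	7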