[v3,01/13] x86/fpu/xstate: Avoid getting xstate address of init_fpstate if fpstate contains the component
Message ID | 20230221163655.920289-2-mizhang@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp106290wrd; Tue, 21 Feb 2023 08:38:00 -0800 (PST) X-Google-Smtp-Source: AK7set/21P8sP472LZecMM+olBKa4enXXvGcnJ8cTbOOhjcSwGVy3ZgQiRcBX+sDWCvwqaDJYL93 X-Received: by 2002:aa7:d957:0:b0:4ad:316:b4d9 with SMTP id l23-20020aa7d957000000b004ad0316b4d9mr5544926eds.15.1676997479942; Tue, 21 Feb 2023 08:37:59 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676997479; cv=none; d=google.com; s=arc-20160816; b=PnQQBtF8kbviSmP6YaAURQFETOw1Sa0Fu2644bGKPuI1xHOXSzBvqDqgbIkmQlH46K 8hYmAqjYSkPV2aKuby2iB97NYU8356BpdNiTolKQwy9Wm3xe99zsKXj3GLsJ+fa0cARJ ltWqJ3PLDQnmTesenbSVL1mKAAn4pIF3N0jw3cZXfHU9xNnK11VmSuNWc1yBTz6j+NXf 8U36DDsveEqKweBwVvdNQD4QY4aLE7pPIvrM3lO1kwCQ4JaECTzCESlP1tq6UQhalRkJ PPnlwcRBaOMNyUBw7XLIL8k2Qb/jLqQEZ5uiUZo1d46A2e+w1qbaO6Veb6vqEt+dDVGH 050g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:reply-to:dkim-signature; bh=2iAI7qeN019fW7cnXiCIU1jbXC6JQmM9w058O8z3YH8=; b=UuYXgRyrS6DFQGE18xMJcN9yHTCg4Qh8ctPZBDFiO+ETWkK2Djt/1hY1wLvVjnjlkd un3nTLcHj9M3SQ7Pbylahe6pKFfVkeVaIlpd8xQFx+kKL2NkhmhImsgeGmGXcJVdtQzj Lq1l5Vg1ua8PwM3M12Oa1aqdffqgU5oxzWs0nTDubydF6jEfHaoD6KpoiPLA8QesojFd hG8xVRWW16NO/qb6xGX7vGJVtkLqBLShTUv8O4IezdQ+WxTEcxhE4sDhtd4igGNz74iw sLJTz8hO0pChBvM7x9KN6fV6AlpkNx+ugujIQkIuU8r6Ysbhs8v54G7WqIkOIpAsRabU T2vg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=XdkwDudC; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x20-20020aa7cd94000000b004ab4db42572si6714033edv.580.2023.02.21.08.37.37; Tue, 21 Feb 2023 08:37:59 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=XdkwDudC; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234708AbjBUQhX (ORCPT <rfc822;hanasaki@gmail.com> + 99 others); Tue, 21 Feb 2023 11:37:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34770 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234231AbjBUQhU (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 21 Feb 2023 11:37:20 -0500 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5B4CA279AA for <linux-kernel@vger.kernel.org>; Tue, 21 Feb 2023 08:37:19 -0800 (PST) Received: by mail-pf1-x449.google.com with SMTP id j19-20020aa783d3000000b005b9ac633454so2912122pfn.0 for <linux-kernel@vger.kernel.org>; Tue, 21 Feb 2023 08:37:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:from:to:cc:subject:date:message-id:reply-to; bh=2iAI7qeN019fW7cnXiCIU1jbXC6JQmM9w058O8z3YH8=; b=XdkwDudCAzkUeqDZelIh5hAd5V3dHZzXBuFYZkqJd4pFfa+UTybJ8/AW+lAhYwqmS+ tbVt7VgxCnQmFM8/M2zF08LXl+P5rYttqkZsSMpRyNm2aUA2qkAZtGyIcbLnnVGKH7MB vhtd85nvW4gXenpyrZSEHHvz0+5qG0Ei5unm5oCB1/Qmmt3XyFzr8cJWPQfnh05mvVAV jpcjB8DFA/xaMuE68egf7ZIkGP57N7Fc7cCyBUTy31SaY7FwcTnTA1B86Fv9eMZ5LzA+ OY1P17jE1KaBI0GR9zW0iasO6w+Zn6hq5XF74zuLr9/ITXX3ToAx4IVTHsLJQoRAX22W ccKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=2iAI7qeN019fW7cnXiCIU1jbXC6JQmM9w058O8z3YH8=; b=DFwqAA9akhWMv//6SZV9ImLAruHYXMftW5Q0ijWkdVxwonlq/RVKd/9qfVG6oQ2iBN 8m6sObIvPLsGGHQLFQqu416Xcgn9HDh6JsRYc38akS9me0CIgPxmSQVc1lgcBIDmHh4p h6jhws5Qb8L4lYBwg3elGv3Ui+7OeltUHJkNR52BUp9JTAi+FGdrwbefYux83iV+CW1A vs9dA1cgMLIBTw8HexcP9aw+np0S/IewxjjxImnApn+c7hNL7NeLds/QeRbUlrEBnJBU hNPsGyOmPSKa85FF2nUzoq27itthW/SJOhz3hDOdhAhpS/9Zifp+8W+AEMHYyohsoyMH ho0w== X-Gm-Message-State: AO0yUKV+2Xq7dCUaruYGlRDmttVYPvlS0OzuDeJXKAqHNEnNtj0FTr/Y ZLJBW9DBGno82/JWCq2Mmgq41RsKHT9j X-Received: from mizhang-super.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:1071]) (user=mizhang job=sendgmr) by 2002:a63:2c4f:0:b0:502:2111:ba7c with SMTP id s76-20020a632c4f000000b005022111ba7cmr669064pgs.2.1676997438814; Tue, 21 Feb 2023 08:37:18 -0800 (PST) Reply-To: Mingwei Zhang <mizhang@google.com> Date: Tue, 21 Feb 2023 16:36:43 +0000 In-Reply-To: <20230221163655.920289-1-mizhang@google.com> Mime-Version: 1.0 References: <20230221163655.920289-1-mizhang@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Message-ID: <20230221163655.920289-2-mizhang@google.com> Subject: [PATCH v3 01/13] x86/fpu/xstate: Avoid getting xstate address of init_fpstate if fpstate contains the component From: Mingwei Zhang <mizhang@google.com> To: Sean Christopherson <seanjc@google.com>, Paolo Bonzini <pbonzini@redhat.com>, Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org, Mingwei Zhang <mizhang@google.com>, Jim Mattson <jmattson@google.com>, Venkatesh Srinivas <venkateshs@google.com>, Aaron Lewis <aaronlewis@google.com>, "Chang S. Bae" <chang.seok.bae@intel.com>, Chao Gao <chao.gao@intel.com> Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758459309269783619?= X-GMAIL-MSGID: =?utf-8?q?1758459309269783619?= |
Series |
Overhauling amx_test
|
|
Commit Message
Mingwei Zhang
Feb. 21, 2023, 4:36 p.m. UTC
Avoid getting xstate address of init_fpstate if fpstate contains the xstate
component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls
__raw_xsave_addr(xinit, 18), it triggers a warning as follows.
__raw_xsave_addr() is an internal function that assume caller does the
checking, ie., all function arguments should be checked before calling.
So, instead of removing the WARNING, add checks in
__copy_xstate_to_uabi_buf().
[ 168.814082] ------------[ cut here ]------------
[ 168.814083] WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0
[ 168.814088] Modules linked in: kvm_intel dummy bridge stp llc cdc_ncm cdc_eem cdc_ether usbnet mii ehci_pci ehci_hcd vfat fat cdc_acm xhci_pci xhci_hcd idpf(O)
[ 168.814100] CPU: 35 PID: 15304 Comm: amx_test Tainted: G S O 6.2.0-smp-DEV #6
[ 168.814103] RIP: 0010:__raw_xsave_addr+0xc8/0xe0
[ 168.814105] Code: 83 f9 40 72 b0 eb 10 48 63 ca 44 8b 04 8d 60 13 1e 82 eb 03 41 89 f8 44 89 c1 48 01 c8 48 83 c4 08 5d c3 cc 0f 0b 31 c0 eb f3 <0f> 0b 48 c7 c7 c7 28 11 82 e8 da 30 b0 00 31 c0 eb e1 66 0f 1f 44
[ 168.814106] RSP: 0018:ff110020ef79bc90 EFLAGS: 00010246
[ 168.814108] RAX: ffffffff821e0340 RBX: 0000000000000012 RCX: 0000000000000012
[ 168.814109] RDX: 0000000000000012 RSI: 80000000000206e7 RDI: 0000000000040000
[ 168.814110] RBP: ff110020ef79bc98 R08: 0000000000000a00 R09: 0000000000000012
[ 168.814112] R10: 0000000000000012 R11: 0000000000000004 R12: ffa00000089f2a40
[ 168.814113] R13: 0000001200000000 R14: 0000000000000012 R15: ff110020ef288b00
[ 168.814114] FS: 00007f1812761300(0000) GS:ff11003fff4c0000(0000) knlGS:0000000000000000
[ 168.814116] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 168.814117] CR2: 00007f1812555008 CR3: 0000002093a80002 CR4: 0000000000373ee0
[ 168.814118] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 168.814119] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 168.814120] Call Trace:
[ 168.814121] <TASK>
[ 168.814122] __copy_xstate_to_uabi_buf+0x3cb/0x520
[ 168.814125] fpu_copy_guest_fpstate_to_uabi+0x29/0x50
[ 168.814127] kvm_arch_vcpu_ioctl+0x9f7/0xee0
[ 168.814130] ? __kmem_cache_free+0x16b/0x220
[ 168.814133] kvm_vcpu_ioctl+0x47c/0x5a0
[ 168.814136] __se_sys_ioctl+0x77/0xc0
[ 168.814138] __x64_sys_ioctl+0x1d/0x20
[ 168.814139] do_syscall_64+0x3d/0x80
[ 168.814142] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 168.814146] RIP: 0033:0x7f1812892c87
[ 168.814148] Code: 5d c3 cc 48 8b 05 39 1d 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 1d 07 00 f7 d8 64 89 01 48
[ 168.814149] RSP: 002b:00007ffc4cebf538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 168.814151] RAX: ffffffffffffffda RBX: 00007f1812761280 RCX: 00007f1812892c87
[ 168.814152] RDX: 00000000004dcda0 RSI: 000000009000aecf RDI: 0000000000000007
[ 168.814153] RBP: 0000000000002b00 R08: 00000000004d5010 R09: 0000000000002710
[ 168.814154] R10: 00007f1812906980 R11: 0000000000000246 R12: 00000000004d8110
[ 168.814155] R13: 0000000000000004 R14: 00000000004d78b0 R15: 0000000000000004
[ 168.814156] </TASK>
[ 168.814157] ---[ end trace 0000000000000000 ]---
Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()")
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kernel/fpu/xstate.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
Comments
Mingwei! On Tue, Feb 21 2023 at 16:36, Mingwei Zhang wrote: > Avoid getting xstate address of init_fpstate if fpstate contains the xstate > component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls > __raw_xsave_addr(xinit, 18), it triggers a warning as follows. This does not parse. Aside of that it gets the ordering of the changelog wrong. It explain first what the patch is doing by repeating the way too long subject line and then tries to give some explanation about the problem. KVM does not call __raw_xsave_addr() and the problem is completely independent of KVM. > __raw_xsave_addr() is an internal function that assume caller does the > checking, ie., all function arguments should be checked before calling. > So, instead of removing the WARNING, add checks in > __copy_xstate_to_uabi_buf(). I don't see checks added. The patch open codes copy_feature() and makes the __raw_xsave_addr() invocations conditional. So something like this: Subject: x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_to_uabi_buf() __copy_xstate_to_uabi_buf() copies either from the tasks XSAVE buffer or from init_fpstate into the ptrace buffer. Dynamic features, like XTILEDATA, have an all zeroes init state and are not saved in init_fpstate, which means the corresponding bit is not set in the xfeatures bitmap of the init_fpstate header. But __copy_xstate_to_uabi_buf() retrieves addresses for both the tasks xstate and init_fpstate unconditionally via __raw_xsave_addr(). So if the tasks XSAVE buffer has a dynamic feature set, then the address retrieval for init_fpstate triggers the warning in __raw_xsave_addr() which checks the feature bit in the init_fpstate header. Prevent this by open coding copy_feature() and making the address retrieval conditional on the tasks XSAVE header bit. So the order here is (in paragraphs): 1) Explain the context 2) Explain whats wrong 3) Explain the consequence 4) Explain the solution briefly It does not even need a backtrace, because the backtrace does not give any relevant information which is not in the text above already. The actual callchain is completely irrelevant for desribing this issue. If you really want to add a backtrace, then please trim it by removing the irrelevant information. See https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces for further information. So for this case this would be: WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0 Call Trace: <TASK> __copy_xstate_to_uabi_buf+0x3cb/0x520 fpu_copy_guest_fpstate_to_uabi+0x29/0x50 But even fpu_copy_guest_fpstate_to_uabi() is already useless because __copy_xstate_to_uabi_buf() has multiple callers which all have the very same problem and they are very simple to find. Backtraces are useful to illustrate a hard to understand callchain, but having ~40 lines of backtrace which only contains two relevant lines is just a distraction. See? > Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()") The problem did not exist at this point in time because dynamic xfeatures did not exist, neither in hardware nor in kernel support. Even if dynamic features would have existed then the commit would not be the one which introduced the problem, All the commit does is to move back then valid code into a conditional code path. It fixes: 471f0aa7fa64 ("x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly") which attempted to fix exactly the problem you are seeing, but only addressed half of it. The underlying problem was introduced with 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Random fixes tags are not really helpful. > @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, > pkru.pkru = pkru_val; > membuf_write(&to, &pkru, sizeof(pkru)); > } else { > - copy_feature(header.xfeatures & BIT_ULL(i), &to, > - __raw_xsave_addr(xsave, i), > - __raw_xsave_addr(xinit, i), > - xstate_sizes[i]); Please add a comment here to explain why this is open coded and does not use copy_feature(). Something like: /* * Open code copy_feature() to prevent retrieval * of a dynamic features address from xinit, which * is invalid because xinit does not contain the * all zeros init states of dynamic features and * emits a warning. */ > + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? > + __raw_xsave_addr(xsave, i) : > + __raw_xsave_addr(xinit, i); > + > + membuf_write(&to, xsave_addr, xstate_sizes[i]); Other than that. Nice detective work! Thanks, tglx
On 2/21/2023 8:36 AM, Mingwei Zhang wrote: > Avoid getting xstate address of init_fpstate if fpstate contains the xstate > component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls > __raw_xsave_addr(xinit, 18), it triggers a warning as follows. > > __raw_xsave_addr() is an internal function that assume caller does the > checking, ie., all function arguments should be checked before calling. > So, instead of removing the WARNING, add checks in > __copy_xstate_to_uabi_buf(). > <snip> > @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, > pkru.pkru = pkru_val; > membuf_write(&to, &pkru, sizeof(pkru)); > } else { > - copy_feature(header.xfeatures & BIT_ULL(i), &to, > - __raw_xsave_addr(xsave, i), > - __raw_xsave_addr(xinit, i), > - xstate_sizes[i]); > + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? > + __raw_xsave_addr(xsave, i) : > + __raw_xsave_addr(xinit, i); > + > + membuf_write(&to, xsave_addr, xstate_sizes[i]); > } > /* > * Keep track of the last copied state in the non-compacted So this hunk is under for_each_extended_xfeature(i, mask) -- it skips the copy routine if mask[i] == 0; instead, it fills zeros. We have this [1]: if (fpu_state_size_dynamic()) mask &= (header.xfeatures | xinit->header.xcomp_bv); If header.xfeatures[18] = 0 then mask[18] = 0 because xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm confused about the problem that you described here. Can you elaborate on your test case a bit? Let me try to reproduce the issue on my end. Thanks, Chang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1134
On Tue, Feb 21 2023 at 19:05, Chang S. Bae wrote: > On 2/21/2023 8:36 AM, Mingwei Zhang wrote: >> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, >> pkru.pkru = pkru_val; >> membuf_write(&to, &pkru, sizeof(pkru)); >> } else { >> - copy_feature(header.xfeatures & BIT_ULL(i), &to, >> - __raw_xsave_addr(xsave, i), >> - __raw_xsave_addr(xinit, i), >> - xstate_sizes[i]); >> + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? >> + __raw_xsave_addr(xsave, i) : >> + __raw_xsave_addr(xinit, i); >> + >> + membuf_write(&to, xsave_addr, xstate_sizes[i]); >> } >> /* >> * Keep track of the last copied state in the non-compacted > > So this hunk is under for_each_extended_xfeature(i, mask) -- it skips > the copy routine if mask[i] == 0; instead, it fills zeros. > > We have this [1]: > > if (fpu_state_size_dynamic()) > mask &= (header.xfeatures | xinit->header.xcomp_bv); > > If header.xfeatures[18] = 0 then mask[18] = 0 because > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm > confused about the problem that you described here. Read the suggested changelog I wrote in my reply to Mingwei. TLDR: xsave.header.xfeatures[18] = 1 xinit.header.xfeatures[18] = 0 -> mask[18] = 1 -> __raw_xsave_addr(xsave, 18) <- Success -> __raw_xsave_addr(xinit, 18) <- WARN Thanks, tglx
> > We have this [1]: > > > > if (fpu_state_size_dynamic()) > > mask &= (header.xfeatures | xinit->header.xcomp_bv); > > > > If header.xfeatures[18] = 0 then mask[18] = 0 because > > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm > > confused about the problem that you described here. > > Read the suggested changelog I wrote in my reply to Mingwei. > > TLDR: > > xsave.header.xfeatures[18] = 1 > xinit.header.xfeatures[18] = 0 > -> mask[18] = 1 > -> __raw_xsave_addr(xsave, 18) <- Success > -> __raw_xsave_addr(xinit, 18) <- WARN > > Thanks, > > tglx Hi Thomas, Thanks for the review and I will provide the next version separately from the series, since this one is independent from the rest. Chang: to reproduce this issue, you can simply run the amx_test in the kvm selftest directory. Thanks. -Mingwei
On 2/22/2023 10:40 AM, Mingwei Zhang wrote: >>> We have this [1]: >>> >>> if (fpu_state_size_dynamic()) >>> mask &= (header.xfeatures | xinit->header.xcomp_bv); >>> >>> If header.xfeatures[18] = 0 then mask[18] = 0 because >>> xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm >>> confused about the problem that you described here. >> >> Read the suggested changelog I wrote in my reply to Mingwei. >> >> TLDR: >> >> xsave.header.xfeatures[18] = 1 >> xinit.header.xfeatures[18] = 0 >> -> mask[18] = 1 >> -> __raw_xsave_addr(xsave, 18) <- Success >> -> __raw_xsave_addr(xinit, 18) <- WARN Oh, sigh.. This should be caught last time. Hmm, then since we store init state for legacy ones [1], unless it is too aggressive, perhaps the loop can be simplified like this: diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 714166cc25f2..2dac6f5f3ade 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, zerofrom = offsetof(struct xregs_state, extended_state_area); /* - * The ptrace buffer is in non-compacted XSAVE format. In - * non-compacted format disabled features still occupy state space, - * but there is no state to copy from in the compacted - * init_fpstate. The gap tracking will zero these states. + * Indicate which states to copy from fpstate. When not present in + * fpstate, those extended states are either initialized or + * disabled. They are also known to have an all zeros init state. + * Thus, remove them from 'mask' to zero those features in the user + * buffer instead of retrieving them from init_fpstate. */ - mask = fpstate->user_xfeatures; - - /* - * Dynamic features are not present in init_fpstate. When they are - * in an all zeros init state, remove those from 'mask' to zero - * those features in the user buffer instead of retrieving them - * from init_fpstate. - */ - if (fpu_state_size_dynamic()) - mask &= (header.xfeatures | xinit->header.xcomp_bv); + mask = header.xfeatures; for_each_extended_xfeature(i, mask) { /* @@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, pkru.pkru = pkru_val; membuf_write(&to, &pkru, sizeof(pkru)); } else { - copy_feature(header.xfeatures & BIT_ULL(i), &to, + membuf_write(&to, __raw_xsave_addr(xsave, i), - __raw_xsave_addr(xinit, i), xstate_sizes[i]); } /* > Chang: to reproduce this issue, you can simply run the amx_test in the > kvm selftest directory. Yeah, I was able to reproduce it with this ptrace test: diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c index 625e42901237..ae02bc81846d 100644 --- a/tools/testing/selftests/x86/amx.c +++ b/tools/testing/selftests/x86/amx.c @@ -14,8 +14,10 @@ #include <sys/auxv.h> #include <sys/mman.h> #include <sys/shm.h> +#include <sys/ptrace.h> #include <sys/syscall.h> #include <sys/wait.h> +#include <sys/uio.h> #include "../kselftest.h" /* For __cpuid_count() */ @@ -826,6 +828,76 @@ static void test_context_switch(void) free(finfo); } +/* Ptrace test */ + +static bool inject_tiledata(pid_t target) +{ + struct xsave_buffer *xbuf; + struct iovec iov; + + xbuf = alloc_xbuf(); + if (!xbuf) + fatal_error("unable to allocate XSAVE buffer"); + + load_rand_tiledata(xbuf); + + memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset], + &xbuf->bytes[xtiledata.xbuf_offset], + xtiledata.size); + + iov.iov_base = xbuf; + iov.iov_len = xbuf_size; + + if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) + fatal_error("PTRACE_SETREGSET"); + + if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) + err(1, "PTRACE_GETREGSET"); + + if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset], + &xbuf->bytes[xtiledata.xbuf_offset], + xtiledata.size)) + return true; + else + return false; +} + +static void test_ptrace(void) +{ + pid_t child; + int status; + + child = fork(); + if (child < 0) { + err(1, "fork"); + } else if (!child) { + if (ptrace(PTRACE_TRACEME, 0, NULL, NULL)) + err(1, "PTRACE_TRACEME"); + + /* Use the state to expand the kernel buffer */ + load_rand_tiledata(stashed_xsave); + + raise(SIGTRAP); + _exit(0); + } + + do { + wait(&status); + } while (WSTOPSIG(status) != SIGTRAP); + + printf("\tInject tile data via ptrace()\n"); + + if (inject_tiledata(child)) + printf("[OK]\tTile data was written on ptracee.\n"); + else + printf("[FAIL]\tTile data was not written on ptracee.\n"); + + ptrace(PTRACE_DETACH, child, NULL, NULL); + wait(&status); + if (!WIFEXITED(status) || WEXITSTATUS(status)) + err(1, "ptrace test"); +} + int main(void) { /* Check hardware availability at first */ @@ -846,6 +918,8 @@ int main(void) ctxtswtest_config.num_threads = 5; test_context_switch(); + test_ptrace(); + clearhandler(SIGILL); free_stashed_xsave(); Thanks, Chang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
On Wed, Feb 22, 2023, Chang S. Bae wrote: > On 2/22/2023 10:40 AM, Mingwei Zhang wrote: > > > > We have this [1]: > > > > > > > > if (fpu_state_size_dynamic()) > > > > mask &= (header.xfeatures | xinit->header.xcomp_bv); > > > > > > > > If header.xfeatures[18] = 0 then mask[18] = 0 because > > > > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm > > > > confused about the problem that you described here. > > > > > > Read the suggested changelog I wrote in my reply to Mingwei. > > > > > > TLDR: > > > > > > xsave.header.xfeatures[18] = 1 > > > xinit.header.xfeatures[18] = 0 > > > -> mask[18] = 1 > > > -> __raw_xsave_addr(xsave, 18) <- Success > > > -> __raw_xsave_addr(xinit, 18) <- WARN > > Oh, sigh.. This should be caught last time. > > Hmm, then since we store init state for legacy ones [1], unless it is too > aggressive, perhaps the loop can be simplified like this: > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c > index 714166cc25f2..2dac6f5f3ade 100644 > --- a/arch/x86/kernel/fpu/xstate.c > +++ b/arch/x86/kernel/fpu/xstate.c > @@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to, > struct fpstate *fpstate, > zerofrom = offsetof(struct xregs_state, extended_state_area); > > /* > - * The ptrace buffer is in non-compacted XSAVE format. In > - * non-compacted format disabled features still occupy state space, > - * but there is no state to copy from in the compacted > - * init_fpstate. The gap tracking will zero these states. > + * Indicate which states to copy from fpstate. When not present in > + * fpstate, those extended states are either initialized or > + * disabled. They are also known to have an all zeros init state. > + * Thus, remove them from 'mask' to zero those features in the user > + * buffer instead of retrieving them from init_fpstate. > */ > - mask = fpstate->user_xfeatures; Do we need to change this line and the comments? I don't see any of these was relevant to this issue. The original code semantic is to traverse all user_xfeatures, if it is available in fpstate, copy it from there; otherwise, copy it from init_fpstate. We do not assume the component in init_fpstate (but not in fpstate) are all zeros, do we? If it is safe to assume that, then it might be ok. But at least in this patch, I want to keep the original semantics as is without the assumption. > - > - /* > - * Dynamic features are not present in init_fpstate. When they are > - * in an all zeros init state, remove those from 'mask' to zero > - * those features in the user buffer instead of retrieving them > - * from init_fpstate. > - */ > - if (fpu_state_size_dynamic()) > - mask &= (header.xfeatures | xinit->header.xcomp_bv); > + mask = header.xfeatures; Same here. Let's not adding this optimization in this patch. > > for_each_extended_xfeature(i, mask) { > /* > @@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to, > struct fpstate *fpstate, > pkru.pkru = pkru_val; > membuf_write(&to, &pkru, sizeof(pkru)); > } else { > - copy_feature(header.xfeatures & BIT_ULL(i), &to, > + membuf_write(&to, > __raw_xsave_addr(xsave, i), > - __raw_xsave_addr(xinit, i), > xstate_sizes[i]); > } > /* > > > Chang: to reproduce this issue, you can simply run the amx_test in the > > kvm selftest directory. > > Yeah, I was able to reproduce it with this ptrace test: > > diff --git a/tools/testing/selftests/x86/amx.c > b/tools/testing/selftests/x86/amx.c > index 625e42901237..ae02bc81846d 100644 > --- a/tools/testing/selftests/x86/amx.c > +++ b/tools/testing/selftests/x86/amx.c > @@ -14,8 +14,10 @@ > #include <sys/auxv.h> > #include <sys/mman.h> > #include <sys/shm.h> > +#include <sys/ptrace.h> > #include <sys/syscall.h> > #include <sys/wait.h> > +#include <sys/uio.h> > > #include "../kselftest.h" /* For __cpuid_count() */ > > @@ -826,6 +828,76 @@ static void test_context_switch(void) > free(finfo); > } > > +/* Ptrace test */ > + > +static bool inject_tiledata(pid_t target) > +{ > + struct xsave_buffer *xbuf; > + struct iovec iov; > + > + xbuf = alloc_xbuf(); > + if (!xbuf) > + fatal_error("unable to allocate XSAVE buffer"); > + > + load_rand_tiledata(xbuf); > + > + memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset], > + &xbuf->bytes[xtiledata.xbuf_offset], > + xtiledata.size); > + > + iov.iov_base = xbuf; > + iov.iov_len = xbuf_size; > + > + if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) > + fatal_error("PTRACE_SETREGSET"); > + > + if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) > + err(1, "PTRACE_GETREGSET"); > + > + if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset], > + &xbuf->bytes[xtiledata.xbuf_offset], > + xtiledata.size)) > + return true; > + else > + return false; > +} > + > +static void test_ptrace(void) > +{ > + pid_t child; > + int status; > + > + child = fork(); > + if (child < 0) { > + err(1, "fork"); > + } else if (!child) { > + if (ptrace(PTRACE_TRACEME, 0, NULL, NULL)) > + err(1, "PTRACE_TRACEME"); > + > + /* Use the state to expand the kernel buffer */ > + load_rand_tiledata(stashed_xsave); > + > + raise(SIGTRAP); > + _exit(0); > + } > + > + do { > + wait(&status); > + } while (WSTOPSIG(status) != SIGTRAP); > + > + printf("\tInject tile data via ptrace()\n"); > + > + if (inject_tiledata(child)) > + printf("[OK]\tTile data was written on ptracee.\n"); > + else > + printf("[FAIL]\tTile data was not written on ptracee.\n"); > + > + ptrace(PTRACE_DETACH, child, NULL, NULL); > + wait(&status); > + if (!WIFEXITED(status) || WEXITSTATUS(status)) > + err(1, "ptrace test"); > +} > + > int main(void) > { > /* Check hardware availability at first */ > @@ -846,6 +918,8 @@ int main(void) > ctxtswtest_config.num_threads = 5; > test_context_switch(); > > + test_ptrace(); > + > clearhandler(SIGILL); > free_stashed_xsave(); > > Thanks, > Chang > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386 > Nice one. Yeah both ptrace and KVM are calling this function so the above code would also be enough to trigger the bug. Thanks. -Mingwei
On 2/24/2023 3:56 PM, Mingwei Zhang wrote: > On Wed, Feb 22, 2023, Chang S. Bae wrote: >> >> /* >> - * The ptrace buffer is in non-compacted XSAVE format. In >> - * non-compacted format disabled features still occupy state space, >> - * but there is no state to copy from in the compacted >> - * init_fpstate. The gap tracking will zero these states. >> + * Indicate which states to copy from fpstate. When not present in >> + * fpstate, those extended states are either initialized or >> + * disabled. They are also known to have an all zeros init state. >> + * Thus, remove them from 'mask' to zero those features in the user >> + * buffer instead of retrieving them from init_fpstate. >> */ >> - mask = fpstate->user_xfeatures; > > Do we need to change this line and the comments? I don't see any of > these was relevant to this issue. The original code semantic is to > traverse all user_xfeatures, if it is available in fpstate, copy it from > there; otherwise, copy it from init_fpstate. We do not assume the > component in init_fpstate (but not in fpstate) are all zeros, do we? If > it is safe to assume that, then it might be ok. But at least in this > patch, I want to keep the original semantics as is without the > assumption. Here it has [1]: * * XSAVE could be used, but that would require to reshuffle the * data when XSAVEC/S is available because XSAVEC/S uses xstate * compaction. But doing so is a pointless exercise because most * components have an all zeros init state except for the legacy * ones (FP and SSE). Those can be saved with FXSAVE into the * legacy area. Adding new features requires to ensure that init * state is all zeroes or if not to add the necessary handling * here. */ fxsave(&init_fpstate.regs.fxsave); Thus, init_fpstate has zeros for those extended states. Then, copying from init_fpstate is the same as membuf_zero() by the gap tracking. But, we have two ways to do the same thing here. So I think it works that simply copying the state from fpstate only for those present there, then letting the gap tracking zero out for the rest of the userspace buffer for features that are either disabled or initialized. Then, we can remove accessing init_fpstate in the copy loop and which is the source of the problem. So I think this line change is relevant and also makes the code simple. I guess I'm fine if you don't want to do this. Then, let me follow up with something like this at first. Something like yours could be a fallback option for other good reasons, otherwise. Thanks, Chang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
On Fri, Feb 24, 2023, Chang S. Bae wrote: > On 2/24/2023 3:56 PM, Mingwei Zhang wrote: > > On Wed, Feb 22, 2023, Chang S. Bae wrote: > > > > > > /* > > > - * The ptrace buffer is in non-compacted XSAVE format. In > > > - * non-compacted format disabled features still occupy state space, > > > - * but there is no state to copy from in the compacted > > > - * init_fpstate. The gap tracking will zero these states. > > > + * Indicate which states to copy from fpstate. When not present in > > > + * fpstate, those extended states are either initialized or > > > + * disabled. They are also known to have an all zeros init state. > > > + * Thus, remove them from 'mask' to zero those features in the user > > > + * buffer instead of retrieving them from init_fpstate. > > > */ > > > - mask = fpstate->user_xfeatures; > > > > Do we need to change this line and the comments? I don't see any of > > these was relevant to this issue. The original code semantic is to > > traverse all user_xfeatures, if it is available in fpstate, copy it from > > there; otherwise, copy it from init_fpstate. We do not assume the > > component in init_fpstate (but not in fpstate) are all zeros, do we? If > > it is safe to assume that, then it might be ok. But at least in this > > patch, I want to keep the original semantics as is without the > > assumption. > > Here it has [1]: > > * > * XSAVE could be used, but that would require to reshuffle the > * data when XSAVEC/S is available because XSAVEC/S uses xstate > * compaction. But doing so is a pointless exercise because most > * components have an all zeros init state except for the legacy > * ones (FP and SSE). Those can be saved with FXSAVE into the > * legacy area. Adding new features requires to ensure that init > * state is all zeroes or if not to add the necessary handling > * here. > */ > fxsave(&init_fpstate.regs.fxsave); ah, I see. > > Thus, init_fpstate has zeros for those extended states. Then, copying from > init_fpstate is the same as membuf_zero() by the gap tracking. But, we have > two ways to do the same thing here. > > So I think it works that simply copying the state from fpstate only for > those present there, then letting the gap tracking zero out for the rest of > the userspace buffer for features that are either disabled or initialized. > > Then, we can remove accessing init_fpstate in the copy loop and which is the > source of the problem. So I think this line change is relevant and also > makes the code simple. > > I guess I'm fine if you don't want to do this. Then, let me follow up with > something like this at first. Something like yours could be a fallback > option for other good reasons, otherwise. hmm. I see. But this is still because of the software implementation. What if there is a new hardware component that requires a non-zero init state. For instance, in the past, we had PKRU component, whose init value is 0x555...54. Of course, that is a bad example because now we kick it out of the XSAVE/XRSTOR and special handling that, but there is no guarantee that in the future we will never need a non-zero init state. So, I will send out my fix and let you, Thomas and potentially other folks to decide what is the best option. Overall, I get your point. Thanks -Mingwei > > Thanks, > Chang > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386 > >
On 2/24/2023 5:09 PM, Mingwei Zhang wrote: > On Fri, Feb 24, 2023, Chang S. Bae wrote: >> On 2/24/2023 3:56 PM, Mingwei Zhang wrote: >>> On Wed, Feb 22, 2023, Chang S. Bae wrote: >>>> >>>> /* >>>> - * The ptrace buffer is in non-compacted XSAVE format. In >>>> - * non-compacted format disabled features still occupy state space, >>>> - * but there is no state to copy from in the compacted >>>> - * init_fpstate. The gap tracking will zero these states. >>>> + * Indicate which states to copy from fpstate. When not present in >>>> + * fpstate, those extended states are either initialized or >>>> + * disabled. They are also known to have an all zeros init state. >>>> + * Thus, remove them from 'mask' to zero those features in the user >>>> + * buffer instead of retrieving them from init_fpstate. >>>> */ >>>> - mask = fpstate->user_xfeatures; >>> >>> Do we need to change this line and the comments? I don't see any of >>> these was relevant to this issue. The original code semantic is to >>> traverse all user_xfeatures, if it is available in fpstate, copy it from >>> there; otherwise, copy it from init_fpstate. We do not assume the >>> component in init_fpstate (but not in fpstate) are all zeros, do we? If >>> it is safe to assume that, then it might be ok. But at least in this >>> patch, I want to keep the original semantics as is without the >>> assumption. >> >> Here it has [1]: >> >> * >> * XSAVE could be used, but that would require to reshuffle the >> * data when XSAVEC/S is available because XSAVEC/S uses xstate >> * compaction. But doing so is a pointless exercise because most >> * components have an all zeros init state except for the legacy >> * ones (FP and SSE). Those can be saved with FXSAVE into the >> * legacy area. Adding new features requires to ensure that init >> * state is all zeroes or if not to add the necessary handling >> * here. >> */ >> fxsave(&init_fpstate.regs.fxsave); > > ah, I see. >> >> Thus, init_fpstate has zeros for those extended states. Then, copying from >> init_fpstate is the same as membuf_zero() by the gap tracking. But, we have >> two ways to do the same thing here. >> >> So I think it works that simply copying the state from fpstate only for >> those present there, then letting the gap tracking zero out for the rest of >> the userspace buffer for features that are either disabled or initialized. >> >> Then, we can remove accessing init_fpstate in the copy loop and which is the >> source of the problem. So I think this line change is relevant and also >> makes the code simple. >> >> I guess I'm fine if you don't want to do this. Then, let me follow up with >> something like this at first. Something like yours could be a fallback >> option for other good reasons, otherwise. > > hmm. I see. But this is still because of the software implementation. > What if there is a new hardware component that requires a non-zero init > state. > > For instance, in the past, we had PKRU component, whose init value is > 0x555...54. Of course, that is a bad example because now we kick it out > of the XSAVE/XRSTOR and special handling that, but there is no guarantee > that in the future we will never need a non-zero init state. Yeah, then that feature enabling has to update [1] to record the special init value there. So one who writes that code has to responsibly check what has to be adjusted like for other feature enabling in general. Now we have established the dynamic states which are also assumed to have an all-zeros init state. Anything new state located after that should be a dynamic state because of the sigaltstack. Of course, with that though, I don't know whether we will see anything with a non-zero init state in the future or not. > So, I will send out my fix and let you, Thomas and potentially other > folks to decide what is the best option. Overall, I get your point. Let's coordinate this. We shouldn't throw multiple versions at the same time. (Let me follow up with you.) Thanks, Chang
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 714166cc25f2..5cc1426c3800 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -1063,6 +1063,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, struct xregs_state *xsave = &fpstate->regs.xsave; struct xstate_header header; unsigned int zerofrom; + void *xsave_addr; u64 mask; int i; @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, pkru.pkru = pkru_val; membuf_write(&to, &pkru, sizeof(pkru)); } else { - copy_feature(header.xfeatures & BIT_ULL(i), &to, - __raw_xsave_addr(xsave, i), - __raw_xsave_addr(xinit, i), - xstate_sizes[i]); + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? + __raw_xsave_addr(xsave, i) : + __raw_xsave_addr(xinit, i); + + membuf_write(&to, xsave_addr, xstate_sizes[i]); } /* * Keep track of the last copied state in the non-compacted