[v3,01/13] x86/fpu/xstate: Avoid getting xstate address of init_fpstate if fpstate contains the component

Message ID 20230221163655.920289-2-mizhang@google.com
State New
Headers
Series Overhauling amx_test |

Commit Message

Mingwei Zhang Feb. 21, 2023, 4:36 p.m. UTC
  Avoid getting xstate address of init_fpstate if fpstate contains the xstate
component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls
__raw_xsave_addr(xinit, 18), it triggers a warning as follows.

__raw_xsave_addr() is an internal function that assume caller does the
checking, ie., all function arguments should be checked before calling.
So, instead of removing the WARNING, add checks in
__copy_xstate_to_uabi_buf().

[  168.814082] ------------[ cut here ]------------
[  168.814083] WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0
[  168.814088] Modules linked in: kvm_intel dummy bridge stp llc cdc_ncm cdc_eem cdc_ether usbnet mii ehci_pci ehci_hcd vfat fat cdc_acm xhci_pci xhci_hcd idpf(O)
[  168.814100] CPU: 35 PID: 15304 Comm: amx_test Tainted: G S         O       6.2.0-smp-DEV #6
[  168.814103] RIP: 0010:__raw_xsave_addr+0xc8/0xe0
[  168.814105] Code: 83 f9 40 72 b0 eb 10 48 63 ca 44 8b 04 8d 60 13 1e 82 eb 03 41 89 f8 44 89 c1 48 01 c8 48 83 c4 08 5d c3 cc 0f 0b 31 c0 eb f3 <0f> 0b 48 c7 c7 c7 28 11 82 e8 da 30 b0 00 31 c0 eb e1 66 0f 1f 44
[  168.814106] RSP: 0018:ff110020ef79bc90 EFLAGS: 00010246
[  168.814108] RAX: ffffffff821e0340 RBX: 0000000000000012 RCX: 0000000000000012
[  168.814109] RDX: 0000000000000012 RSI: 80000000000206e7 RDI: 0000000000040000
[  168.814110] RBP: ff110020ef79bc98 R08: 0000000000000a00 R09: 0000000000000012
[  168.814112] R10: 0000000000000012 R11: 0000000000000004 R12: ffa00000089f2a40
[  168.814113] R13: 0000001200000000 R14: 0000000000000012 R15: ff110020ef288b00
[  168.814114] FS:  00007f1812761300(0000) GS:ff11003fff4c0000(0000) knlGS:0000000000000000
[  168.814116] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  168.814117] CR2: 00007f1812555008 CR3: 0000002093a80002 CR4: 0000000000373ee0
[  168.814118] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.814119] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  168.814120] Call Trace:
[  168.814121]  <TASK>
[  168.814122]  __copy_xstate_to_uabi_buf+0x3cb/0x520
[  168.814125]  fpu_copy_guest_fpstate_to_uabi+0x29/0x50
[  168.814127]  kvm_arch_vcpu_ioctl+0x9f7/0xee0
[  168.814130]  ? __kmem_cache_free+0x16b/0x220
[  168.814133]  kvm_vcpu_ioctl+0x47c/0x5a0
[  168.814136]  __se_sys_ioctl+0x77/0xc0
[  168.814138]  __x64_sys_ioctl+0x1d/0x20
[  168.814139]  do_syscall_64+0x3d/0x80
[  168.814142]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  168.814146] RIP: 0033:0x7f1812892c87
[  168.814148] Code: 5d c3 cc 48 8b 05 39 1d 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 1d 07 00 f7 d8 64 89 01 48
[  168.814149] RSP: 002b:00007ffc4cebf538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  168.814151] RAX: ffffffffffffffda RBX: 00007f1812761280 RCX: 00007f1812892c87
[  168.814152] RDX: 00000000004dcda0 RSI: 000000009000aecf RDI: 0000000000000007
[  168.814153] RBP: 0000000000002b00 R08: 00000000004d5010 R09: 0000000000002710
[  168.814154] R10: 00007f1812906980 R11: 0000000000000246 R12: 00000000004d8110
[  168.814155] R13: 0000000000000004 R14: 00000000004d78b0 R15: 0000000000000004
[  168.814156]  </TASK>
[  168.814157] ---[ end trace 0000000000000000 ]---

Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()")
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kernel/fpu/xstate.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
  

Comments

Thomas Gleixner Feb. 21, 2023, 8:54 p.m. UTC | #1
Mingwei!

On Tue, Feb 21 2023 at 16:36, Mingwei Zhang wrote:
> Avoid getting xstate address of init_fpstate if fpstate contains the xstate
> component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls
> __raw_xsave_addr(xinit, 18), it triggers a warning as follows.

This does not parse. Aside of that it gets the ordering of the changelog
wrong. It explain first what the patch is doing by repeating the way too
long subject line and then tries to give some explanation about the
problem.

KVM does not call __raw_xsave_addr() and the problem is completely
independent of KVM.

> __raw_xsave_addr() is an internal function that assume caller does the
> checking, ie., all function arguments should be checked before calling.
> So, instead of removing the WARNING, add checks in
> __copy_xstate_to_uabi_buf().

I don't see checks added. The patch open codes copy_feature() and makes
the __raw_xsave_addr() invocations conditional.

So something like this:

   Subject: x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_to_uabi_buf()

   __copy_xstate_to_uabi_buf() copies either from the tasks XSAVE buffer
   or from init_fpstate into the ptrace buffer. Dynamic features, like
   XTILEDATA, have an all zeroes init state and are not saved in
   init_fpstate, which means the corresponding bit is not set in the
   xfeatures bitmap of the init_fpstate header.

   But __copy_xstate_to_uabi_buf() retrieves addresses for both the
   tasks xstate and init_fpstate unconditionally via __raw_xsave_addr().

   So if the tasks XSAVE buffer has a dynamic feature set, then the
   address retrieval for init_fpstate triggers the warning in
   __raw_xsave_addr() which checks the feature bit in the init_fpstate
   header.

   Prevent this by open coding copy_feature() and making the address
   retrieval conditional on the tasks XSAVE header bit.

So the order here is (in paragraphs):

   1) Explain the context
   2) Explain whats wrong
   3) Explain the consequence
   4) Explain the solution briefly

It does not even need a backtrace, because the backtrace does not give
any relevant information which is not in the text above already.

The actual callchain is completely irrelevant for desribing this
issue. If you really want to add a backtrace, then please trim it by
removing the irrelevant information. See
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces
for further information.

So for this case this would be:

WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0
Call Trace:
 <TASK>
 __copy_xstate_to_uabi_buf+0x3cb/0x520
 fpu_copy_guest_fpstate_to_uabi+0x29/0x50

But even fpu_copy_guest_fpstate_to_uabi() is already useless because
__copy_xstate_to_uabi_buf() has multiple callers which all have the very
same problem and they are very simple to find.

Backtraces are useful to illustrate a hard to understand callchain, but
having ~40 lines of backtrace which only contains two relevant lines is
just a distraction.

See?

> Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()")

The problem did not exist at this point in time because dynamic
xfeatures did not exist, neither in hardware nor in kernel support.

Even if dynamic features would have existed then the commit would not be
the one which introduced the problem, All the commit does is to move
back then valid code into a conditional code path.

It fixes:

  471f0aa7fa64 ("x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly")

which attempted to fix exactly the problem you are seeing, but only
addressed half of it. The underlying problem was introduced with
2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")

Random fixes tags are not really helpful.

> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>  			pkru.pkru = pkru_val;
>  			membuf_write(&to, &pkru, sizeof(pkru));
>  		} else {
> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
> -				     __raw_xsave_addr(xsave, i),
> -				     __raw_xsave_addr(xinit, i),
> -				     xstate_sizes[i]);

Please add a comment here to explain why this is open coded and does not
use copy_feature(). Something like:

    			/*
                         * Open code copy_feature() to prevent retrieval
                         * of a dynamic features address from xinit, which
                         * is invalid because xinit does not contain the
                         * all zeros init states of dynamic features and
                         * emits a warning.
                         */

> +			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
> +				__raw_xsave_addr(xsave, i) :
> +				__raw_xsave_addr(xinit, i);
> +
> +			membuf_write(&to, xsave_addr, xstate_sizes[i]);

Other than that. Nice detective work!

Thanks,

        tglx
  
Chang S. Bae Feb. 22, 2023, 3:05 a.m. UTC | #2
On 2/21/2023 8:36 AM, Mingwei Zhang wrote:
> Avoid getting xstate address of init_fpstate if fpstate contains the xstate
> component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls
> __raw_xsave_addr(xinit, 18), it triggers a warning as follows.
> 
> __raw_xsave_addr() is an internal function that assume caller does the
> checking, ie., all function arguments should be checked before calling.
> So, instead of removing the WARNING, add checks in
> __copy_xstate_to_uabi_buf().
> 

<snip>

> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>   			pkru.pkru = pkru_val;
>   			membuf_write(&to, &pkru, sizeof(pkru));
>   		} else {
> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
> -				     __raw_xsave_addr(xsave, i),
> -				     __raw_xsave_addr(xinit, i),
> -				     xstate_sizes[i]);
> +			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
> +				__raw_xsave_addr(xsave, i) :
> +				__raw_xsave_addr(xinit, i);
> +
> +			membuf_write(&to, xsave_addr, xstate_sizes[i]);
>   		}
>   		/*
>   		 * Keep track of the last copied state in the non-compacted

So this hunk is under for_each_extended_xfeature(i, mask) -- it skips 
the copy routine if mask[i] == 0; instead, it fills zeros.

We have this [1]:

	if (fpu_state_size_dynamic())
		mask &= (header.xfeatures | xinit->header.xcomp_bv);

If header.xfeatures[18] = 0 then mask[18] = 0 because 
xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm 
confused about the problem that you described here.

Can you elaborate on your test case a bit? Let me try to reproduce the 
issue on my end.

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1134
  
Thomas Gleixner Feb. 22, 2023, 8:38 a.m. UTC | #3
On Tue, Feb 21 2023 at 19:05, Chang S. Bae wrote:
> On 2/21/2023 8:36 AM, Mingwei Zhang wrote:
>> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>>   			pkru.pkru = pkru_val;
>>   			membuf_write(&to, &pkru, sizeof(pkru));
>>   		} else {
>> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
>> -				     __raw_xsave_addr(xsave, i),
>> -				     __raw_xsave_addr(xinit, i),
>> -				     xstate_sizes[i]);
>> +			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
>> +				__raw_xsave_addr(xsave, i) :
>> +				__raw_xsave_addr(xinit, i);
>> +
>> +			membuf_write(&to, xsave_addr, xstate_sizes[i]);
>>   		}
>>   		/*
>>   		 * Keep track of the last copied state in the non-compacted
>
> So this hunk is under for_each_extended_xfeature(i, mask) -- it skips 
> the copy routine if mask[i] == 0; instead, it fills zeros.
>
> We have this [1]:
>
> 	if (fpu_state_size_dynamic())
> 		mask &= (header.xfeatures | xinit->header.xcomp_bv);
>
> If header.xfeatures[18] = 0 then mask[18] = 0 because 
> xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm 
> confused about the problem that you described here.

Read the suggested changelog I wrote in my reply to Mingwei.

TLDR:

        xsave.header.xfeatures[18] = 1
        xinit.header.xfeatures[18] = 0
    ->  mask[18] = 1
    ->  __raw_xsave_addr(xsave, 18)     <- Success
    ->  __raw_xsave_addr(xinit, 18)     <- WARN

Thanks,

        tglx
  
Mingwei Zhang Feb. 22, 2023, 6:40 p.m. UTC | #4
> > We have this [1]:
> >
> >       if (fpu_state_size_dynamic())
> >               mask &= (header.xfeatures | xinit->header.xcomp_bv);
> >
> > If header.xfeatures[18] = 0 then mask[18] = 0 because
> > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm
> > confused about the problem that you described here.
>
> Read the suggested changelog I wrote in my reply to Mingwei.
>
> TLDR:
>
>         xsave.header.xfeatures[18] = 1
>         xinit.header.xfeatures[18] = 0
>     ->  mask[18] = 1
>     ->  __raw_xsave_addr(xsave, 18)     <- Success
>     ->  __raw_xsave_addr(xinit, 18)     <- WARN
>
> Thanks,
>
>         tglx

Hi Thomas,

Thanks for the review and I will provide the next version separately
from the series, since this one is independent from the rest.

Chang: to reproduce this issue, you can simply run the amx_test in the
kvm selftest directory.

Thanks.
-Mingwei
  
Chang S. Bae Feb. 22, 2023, 10:13 p.m. UTC | #5
On 2/22/2023 10:40 AM, Mingwei Zhang wrote:
>>> We have this [1]:
>>>
>>>        if (fpu_state_size_dynamic())
>>>                mask &= (header.xfeatures | xinit->header.xcomp_bv);
>>>
>>> If header.xfeatures[18] = 0 then mask[18] = 0 because
>>> xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm
>>> confused about the problem that you described here.
>>
>> Read the suggested changelog I wrote in my reply to Mingwei.
>>
>> TLDR:
>>
>>          xsave.header.xfeatures[18] = 1
>>          xinit.header.xfeatures[18] = 0
>>      ->  mask[18] = 1
>>      ->  __raw_xsave_addr(xsave, 18)     <- Success
>>      ->  __raw_xsave_addr(xinit, 18)     <- WARN

Oh, sigh.. This should be caught last time.

Hmm, then since we store init state for legacy ones [1], unless it is 
too aggressive, perhaps the loop can be simplified like this:

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 714166cc25f2..2dac6f5f3ade 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to, 
struct fpstate *fpstate,
         zerofrom = offsetof(struct xregs_state, extended_state_area);

         /*
-        * The ptrace buffer is in non-compacted XSAVE format.  In
-        * non-compacted format disabled features still occupy state space,
-        * but there is no state to copy from in the compacted
-        * init_fpstate. The gap tracking will zero these states.
+        * Indicate which states to copy from fpstate. When not present in
+        * fpstate, those extended states are either initialized or
+        * disabled. They are also known to have an all zeros init state.
+        * Thus, remove them from 'mask' to zero those features in the user
+        * buffer instead of retrieving them from init_fpstate.
          */
-       mask = fpstate->user_xfeatures;
-
-       /*
-        * Dynamic features are not present in init_fpstate. When they are
-        * in an all zeros init state, remove those from 'mask' to zero
-        * those features in the user buffer instead of retrieving them
-        * from init_fpstate.
-        */
-       if (fpu_state_size_dynamic())
-               mask &= (header.xfeatures | xinit->header.xcomp_bv);
+       mask = header.xfeatures;

         for_each_extended_xfeature(i, mask) {
                 /*
@@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to, 
struct fpstate *fpstate,
                         pkru.pkru = pkru_val;
                         membuf_write(&to, &pkru, sizeof(pkru));
                 } else {
-                       copy_feature(header.xfeatures & BIT_ULL(i), &to,
+                       membuf_write(&to,
                                      __raw_xsave_addr(xsave, i),
-                                    __raw_xsave_addr(xinit, i),
                                      xstate_sizes[i]);
                 }
                 /*

> Chang: to reproduce this issue, you can simply run the amx_test in the
> kvm selftest directory.

Yeah, I was able to reproduce it with this ptrace test:

diff --git a/tools/testing/selftests/x86/amx.c 
b/tools/testing/selftests/x86/amx.c
index 625e42901237..ae02bc81846d 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -14,8 +14,10 @@
  #include <sys/auxv.h>
  #include <sys/mman.h>
  #include <sys/shm.h>
+#include <sys/ptrace.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
+#include <sys/uio.h>

  #include "../kselftest.h" /* For __cpuid_count() */

@@ -826,6 +828,76 @@ static void test_context_switch(void)
         free(finfo);
  }

+/* Ptrace test */
+
+static bool inject_tiledata(pid_t target)
+{
+       struct xsave_buffer *xbuf;
+       struct iovec iov;
+
+       xbuf = alloc_xbuf();
+       if (!xbuf)
+               fatal_error("unable to allocate XSAVE buffer");
+
+       load_rand_tiledata(xbuf);
+
+       memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset],
+              &xbuf->bytes[xtiledata.xbuf_offset],
+              xtiledata.size);
+
+       iov.iov_base = xbuf;
+       iov.iov_len = xbuf_size;
+
+       if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+               fatal_error("PTRACE_SETREGSET");
+
+       if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+               err(1, "PTRACE_GETREGSET");
+
+       if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset],
+                   &xbuf->bytes[xtiledata.xbuf_offset],
+                   xtiledata.size))
+               return true;
+       else
+               return false;
+}
+
+static void test_ptrace(void)
+{
+       pid_t child;
+       int status;
+
+       child = fork();
+       if (child < 0) {
+               err(1, "fork");
+       } else if (!child) {
+               if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+                       err(1, "PTRACE_TRACEME");
+
+               /* Use the state to expand the kernel buffer */
+               load_rand_tiledata(stashed_xsave);
+
+               raise(SIGTRAP);
+               _exit(0);
+       }
+
+       do {
+               wait(&status);
+       } while (WSTOPSIG(status) != SIGTRAP);
+
+       printf("\tInject tile data via ptrace()\n");
+
+       if (inject_tiledata(child))
+               printf("[OK]\tTile data was written on ptracee.\n");
+       else
+               printf("[FAIL]\tTile data was not written on ptracee.\n");
+
+       ptrace(PTRACE_DETACH, child, NULL, NULL);
+       wait(&status);
+       if (!WIFEXITED(status) || WEXITSTATUS(status))
+               err(1, "ptrace test");
+}
+
  int main(void)
  {
         /* Check hardware availability at first */
@@ -846,6 +918,8 @@ int main(void)
         ctxtswtest_config.num_threads = 5;
         test_context_switch();

+       test_ptrace();
+
         clearhandler(SIGILL);
         free_stashed_xsave();

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
  
Mingwei Zhang Feb. 24, 2023, 11:56 p.m. UTC | #6
On Wed, Feb 22, 2023, Chang S. Bae wrote:
> On 2/22/2023 10:40 AM, Mingwei Zhang wrote:
> > > > We have this [1]:
> > > > 
> > > >        if (fpu_state_size_dynamic())
> > > >                mask &= (header.xfeatures | xinit->header.xcomp_bv);
> > > > 
> > > > If header.xfeatures[18] = 0 then mask[18] = 0 because
> > > > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm
> > > > confused about the problem that you described here.
> > > 
> > > Read the suggested changelog I wrote in my reply to Mingwei.
> > > 
> > > TLDR:
> > > 
> > >          xsave.header.xfeatures[18] = 1
> > >          xinit.header.xfeatures[18] = 0
> > >      ->  mask[18] = 1
> > >      ->  __raw_xsave_addr(xsave, 18)     <- Success
> > >      ->  __raw_xsave_addr(xinit, 18)     <- WARN
> 
> Oh, sigh.. This should be caught last time.
> 
> Hmm, then since we store init state for legacy ones [1], unless it is too
> aggressive, perhaps the loop can be simplified like this:
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 714166cc25f2..2dac6f5f3ade 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
> struct fpstate *fpstate,
>         zerofrom = offsetof(struct xregs_state, extended_state_area);
> 
>         /*
> -        * The ptrace buffer is in non-compacted XSAVE format.  In
> -        * non-compacted format disabled features still occupy state space,
> -        * but there is no state to copy from in the compacted
> -        * init_fpstate. The gap tracking will zero these states.
> +        * Indicate which states to copy from fpstate. When not present in
> +        * fpstate, those extended states are either initialized or
> +        * disabled. They are also known to have an all zeros init state.
> +        * Thus, remove them from 'mask' to zero those features in the user
> +        * buffer instead of retrieving them from init_fpstate.
>          */
> -       mask = fpstate->user_xfeatures;

Do we need to change this line and the comments? I don't see any of
these was relevant to this issue. The original code semantic is to
traverse all user_xfeatures, if it is available in fpstate, copy it from
there; otherwise, copy it from init_fpstate. We do not assume the
component in init_fpstate (but not in fpstate) are all zeros, do we? If
it is safe to assume that, then it might be ok. But at least in this
patch, I want to keep the original semantics as is without the
assumption.
> -
> -       /*
> -        * Dynamic features are not present in init_fpstate. When they are
> -        * in an all zeros init state, remove those from 'mask' to zero
> -        * those features in the user buffer instead of retrieving them
> -        * from init_fpstate.
> -        */
> -       if (fpu_state_size_dynamic())
> -               mask &= (header.xfeatures | xinit->header.xcomp_bv);
> +       mask = header.xfeatures;

Same here. Let's not adding this optimization in this patch.

>
>         for_each_extended_xfeature(i, mask) {
>                 /*
> @@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
> struct fpstate *fpstate,
>                         pkru.pkru = pkru_val;
>                         membuf_write(&to, &pkru, sizeof(pkru));
>                 } else {
> -                       copy_feature(header.xfeatures & BIT_ULL(i), &to,
> +                       membuf_write(&to,
>                                      __raw_xsave_addr(xsave, i),
> -                                    __raw_xsave_addr(xinit, i),
>                                      xstate_sizes[i]);
>                 }
>                 /*
> 
> > Chang: to reproduce this issue, you can simply run the amx_test in the
> > kvm selftest directory.
> 
> Yeah, I was able to reproduce it with this ptrace test:
> 
> diff --git a/tools/testing/selftests/x86/amx.c
> b/tools/testing/selftests/x86/amx.c
> index 625e42901237..ae02bc81846d 100644
> --- a/tools/testing/selftests/x86/amx.c
> +++ b/tools/testing/selftests/x86/amx.c
> @@ -14,8 +14,10 @@
>  #include <sys/auxv.h>
>  #include <sys/mman.h>
>  #include <sys/shm.h>
> +#include <sys/ptrace.h>
>  #include <sys/syscall.h>
>  #include <sys/wait.h>
> +#include <sys/uio.h>
> 
>  #include "../kselftest.h" /* For __cpuid_count() */
> 
> @@ -826,6 +828,76 @@ static void test_context_switch(void)
>         free(finfo);
>  }
> 
> +/* Ptrace test */
> +
> +static bool inject_tiledata(pid_t target)
> +{
> +       struct xsave_buffer *xbuf;
> +       struct iovec iov;
> +
> +       xbuf = alloc_xbuf();
> +       if (!xbuf)
> +               fatal_error("unable to allocate XSAVE buffer");
> +
> +       load_rand_tiledata(xbuf);
> +
> +       memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset],
> +              &xbuf->bytes[xtiledata.xbuf_offset],
> +              xtiledata.size);
> +
> +       iov.iov_base = xbuf;
> +       iov.iov_len = xbuf_size;
> +
> +       if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
> +               fatal_error("PTRACE_SETREGSET");
> +
> +       if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
> +               err(1, "PTRACE_GETREGSET");
> +
> +       if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset],
> +                   &xbuf->bytes[xtiledata.xbuf_offset],
> +                   xtiledata.size))
> +               return true;
> +       else
> +               return false;
> +}
> +
> +static void test_ptrace(void)
> +{
> +       pid_t child;
> +       int status;
> +
> +       child = fork();
> +       if (child < 0) {
> +               err(1, "fork");
> +       } else if (!child) {
> +               if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
> +                       err(1, "PTRACE_TRACEME");
> +
> +               /* Use the state to expand the kernel buffer */
> +               load_rand_tiledata(stashed_xsave);
> +
> +               raise(SIGTRAP);
> +               _exit(0);
> +       }
> +
> +       do {
> +               wait(&status);
> +       } while (WSTOPSIG(status) != SIGTRAP);
> +
> +       printf("\tInject tile data via ptrace()\n");
> +
> +       if (inject_tiledata(child))
> +               printf("[OK]\tTile data was written on ptracee.\n");
> +       else
> +               printf("[FAIL]\tTile data was not written on ptracee.\n");
> +
> +       ptrace(PTRACE_DETACH, child, NULL, NULL);
> +       wait(&status);
> +       if (!WIFEXITED(status) || WEXITSTATUS(status))
> +               err(1, "ptrace test");
> +}
> +
>  int main(void)
>  {
>         /* Check hardware availability at first */
> @@ -846,6 +918,8 @@ int main(void)
>         ctxtswtest_config.num_threads = 5;
>         test_context_switch();
> 
> +       test_ptrace();
> +
>         clearhandler(SIGILL);
>         free_stashed_xsave();
> 
> Thanks,
> Chang
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
> 

Nice one. Yeah both ptrace and KVM are calling this function so the above
code would also be enough to trigger the bug.


Thanks.
-Mingwei
  
Chang S. Bae Feb. 25, 2023, 12:47 a.m. UTC | #7
On 2/24/2023 3:56 PM, Mingwei Zhang wrote:
> On Wed, Feb 22, 2023, Chang S. Bae wrote:
>>
>>          /*
>> -        * The ptrace buffer is in non-compacted XSAVE format.  In
>> -        * non-compacted format disabled features still occupy state space,
>> -        * but there is no state to copy from in the compacted
>> -        * init_fpstate. The gap tracking will zero these states.
>> +        * Indicate which states to copy from fpstate. When not present in
>> +        * fpstate, those extended states are either initialized or
>> +        * disabled. They are also known to have an all zeros init state.
>> +        * Thus, remove them from 'mask' to zero those features in the user
>> +        * buffer instead of retrieving them from init_fpstate.
>>           */
>> -       mask = fpstate->user_xfeatures;
> 
> Do we need to change this line and the comments? I don't see any of
> these was relevant to this issue. The original code semantic is to
> traverse all user_xfeatures, if it is available in fpstate, copy it from
> there; otherwise, copy it from init_fpstate. We do not assume the
> component in init_fpstate (but not in fpstate) are all zeros, do we? If
> it is safe to assume that, then it might be ok. But at least in this
> patch, I want to keep the original semantics as is without the
> assumption.

Here it has [1]:

	 *
	 * XSAVE could be used, but that would require to reshuffle the
	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
	 * compaction. But doing so is a pointless exercise because most
	 * components have an all zeros init state except for the legacy
	 * ones (FP and SSE). Those can be saved with FXSAVE into the
	 * legacy area. Adding new features requires to ensure that init
	 * state is all zeroes or if not to add the necessary handling
	 * here.
	 */
	fxsave(&init_fpstate.regs.fxsave);

Thus, init_fpstate has zeros for those extended states. Then, copying 
from init_fpstate is the same as membuf_zero() by the gap tracking. But, 
we have two ways to do the same thing here.

So I think it works that simply copying the state from fpstate only for 
those present there, then letting the gap tracking zero out for the rest 
of the userspace buffer for features that are either disabled or 
initialized.

Then, we can remove accessing init_fpstate in the copy loop and which is 
the source of the problem. So I think this line change is relevant and 
also makes the code simple.

I guess I'm fine if you don't want to do this. Then, let me follow up 
with something like this at first. Something like yours could be a 
fallback option for other good reasons, otherwise.

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
  
Mingwei Zhang Feb. 25, 2023, 1:09 a.m. UTC | #8
On Fri, Feb 24, 2023, Chang S. Bae wrote:
> On 2/24/2023 3:56 PM, Mingwei Zhang wrote:
> > On Wed, Feb 22, 2023, Chang S. Bae wrote:
> > > 
> > >          /*
> > > -        * The ptrace buffer is in non-compacted XSAVE format.  In
> > > -        * non-compacted format disabled features still occupy state space,
> > > -        * but there is no state to copy from in the compacted
> > > -        * init_fpstate. The gap tracking will zero these states.
> > > +        * Indicate which states to copy from fpstate. When not present in
> > > +        * fpstate, those extended states are either initialized or
> > > +        * disabled. They are also known to have an all zeros init state.
> > > +        * Thus, remove them from 'mask' to zero those features in the user
> > > +        * buffer instead of retrieving them from init_fpstate.
> > >           */
> > > -       mask = fpstate->user_xfeatures;
> > 
> > Do we need to change this line and the comments? I don't see any of
> > these was relevant to this issue. The original code semantic is to
> > traverse all user_xfeatures, if it is available in fpstate, copy it from
> > there; otherwise, copy it from init_fpstate. We do not assume the
> > component in init_fpstate (but not in fpstate) are all zeros, do we? If
> > it is safe to assume that, then it might be ok. But at least in this
> > patch, I want to keep the original semantics as is without the
> > assumption.
> 
> Here it has [1]:
> 
> 	 *
> 	 * XSAVE could be used, but that would require to reshuffle the
> 	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
> 	 * compaction. But doing so is a pointless exercise because most
> 	 * components have an all zeros init state except for the legacy
> 	 * ones (FP and SSE). Those can be saved with FXSAVE into the
> 	 * legacy area. Adding new features requires to ensure that init
> 	 * state is all zeroes or if not to add the necessary handling
> 	 * here.
> 	 */
> 	fxsave(&init_fpstate.regs.fxsave);

ah, I see.
> 
> Thus, init_fpstate has zeros for those extended states. Then, copying from
> init_fpstate is the same as membuf_zero() by the gap tracking. But, we have
> two ways to do the same thing here.
> 
> So I think it works that simply copying the state from fpstate only for
> those present there, then letting the gap tracking zero out for the rest of
> the userspace buffer for features that are either disabled or initialized.
> 
> Then, we can remove accessing init_fpstate in the copy loop and which is the
> source of the problem. So I think this line change is relevant and also
> makes the code simple.
> 
> I guess I'm fine if you don't want to do this. Then, let me follow up with
> something like this at first. Something like yours could be a fallback
> option for other good reasons, otherwise.

hmm. I see. But this is still because of the software implementation.
What if there is a new hardware component that requires a non-zero init
state.

For instance, in the past, we had PKRU component, whose init value is
0x555...54. Of course, that is a bad example because now we kick it out
of the XSAVE/XRSTOR and special handling that, but there is no guarantee
that in the future we will never need a non-zero init state.

So, I will send out my fix and let you, Thomas and potentially other
folks to decide what is the best option. Overall, I get your point.

Thanks
-Mingwei
>
> Thanks,
> Chang
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
> 
>
  
Chang S. Bae Feb. 25, 2023, 1:39 a.m. UTC | #9
On 2/24/2023 5:09 PM, Mingwei Zhang wrote:
> On Fri, Feb 24, 2023, Chang S. Bae wrote:
>> On 2/24/2023 3:56 PM, Mingwei Zhang wrote:
>>> On Wed, Feb 22, 2023, Chang S. Bae wrote:
>>>>
>>>>           /*
>>>> -        * The ptrace buffer is in non-compacted XSAVE format.  In
>>>> -        * non-compacted format disabled features still occupy state space,
>>>> -        * but there is no state to copy from in the compacted
>>>> -        * init_fpstate. The gap tracking will zero these states.
>>>> +        * Indicate which states to copy from fpstate. When not present in
>>>> +        * fpstate, those extended states are either initialized or
>>>> +        * disabled. They are also known to have an all zeros init state.
>>>> +        * Thus, remove them from 'mask' to zero those features in the user
>>>> +        * buffer instead of retrieving them from init_fpstate.
>>>>            */
>>>> -       mask = fpstate->user_xfeatures;
>>>
>>> Do we need to change this line and the comments? I don't see any of
>>> these was relevant to this issue. The original code semantic is to
>>> traverse all user_xfeatures, if it is available in fpstate, copy it from
>>> there; otherwise, copy it from init_fpstate. We do not assume the
>>> component in init_fpstate (but not in fpstate) are all zeros, do we? If
>>> it is safe to assume that, then it might be ok. But at least in this
>>> patch, I want to keep the original semantics as is without the
>>> assumption.
>>
>> Here it has [1]:
>>
>> 	 *
>> 	 * XSAVE could be used, but that would require to reshuffle the
>> 	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
>> 	 * compaction. But doing so is a pointless exercise because most
>> 	 * components have an all zeros init state except for the legacy
>> 	 * ones (FP and SSE). Those can be saved with FXSAVE into the
>> 	 * legacy area. Adding new features requires to ensure that init
>> 	 * state is all zeroes or if not to add the necessary handling
>> 	 * here.
>> 	 */
>> 	fxsave(&init_fpstate.regs.fxsave);
> 
> ah, I see.
>>
>> Thus, init_fpstate has zeros for those extended states. Then, copying from
>> init_fpstate is the same as membuf_zero() by the gap tracking. But, we have
>> two ways to do the same thing here.
>>
>> So I think it works that simply copying the state from fpstate only for
>> those present there, then letting the gap tracking zero out for the rest of
>> the userspace buffer for features that are either disabled or initialized.
>>
>> Then, we can remove accessing init_fpstate in the copy loop and which is the
>> source of the problem. So I think this line change is relevant and also
>> makes the code simple.
>>
>> I guess I'm fine if you don't want to do this. Then, let me follow up with
>> something like this at first. Something like yours could be a fallback
>> option for other good reasons, otherwise.
> 
> hmm. I see. But this is still because of the software implementation.
> What if there is a new hardware component that requires a non-zero init
> state.
> 
> For instance, in the past, we had PKRU component, whose init value is
> 0x555...54. Of course, that is a bad example because now we kick it out
> of the XSAVE/XRSTOR and special handling that, but there is no guarantee
> that in the future we will never need a non-zero init state.

Yeah, then that feature enabling has to update [1] to record the special 
init value there. So one who writes that code has to responsibly check 
what has to be adjusted like for other feature enabling in general.

Now we have established the dynamic states which are also assumed to 
have an all-zeros init state. Anything new state located after that 
should be a dynamic state because of the sigaltstack.

Of course, with that though, I don't know whether we will see anything 
with a non-zero init state in the future or not.

> So, I will send out my fix and let you, Thomas and potentially other
> folks to decide what is the best option. Overall, I get your point.

Let's coordinate this. We shouldn't throw multiple versions at the same 
time. (Let me follow up with you.)

Thanks,
Chang
  

Patch

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 714166cc25f2..5cc1426c3800 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1063,6 +1063,7 @@  void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 	struct xregs_state *xsave = &fpstate->regs.xsave;
 	struct xstate_header header;
 	unsigned int zerofrom;
+	void *xsave_addr;
 	u64 mask;
 	int i;
 
@@ -1151,10 +1152,11 @@  void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 			pkru.pkru = pkru_val;
 			membuf_write(&to, &pkru, sizeof(pkru));
 		} else {
-			copy_feature(header.xfeatures & BIT_ULL(i), &to,
-				     __raw_xsave_addr(xsave, i),
-				     __raw_xsave_addr(xinit, i),
-				     xstate_sizes[i]);
+			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
+				__raw_xsave_addr(xsave, i) :
+				__raw_xsave_addr(xinit, i);
+
+			membuf_write(&to, xsave_addr, xstate_sizes[i]);
 		}
 		/*
 		 * Keep track of the last copied state in the non-compacted