[v2,0/6] x86: Clean up fast syscall return validation

Message ID 20230721161018.50214-1-brgerst@gmail.com
Series: x86: Clean up fast syscall return validation

Message

Brian Gerst July 21, 2023, 4:10 p.m. UTC
  This patch set cleans up the tests done to determine if a fast syscall
return instruction can be used to return to userspace.  It converts the
code to C, and refactors existing code to be more readable.

v2:
 - Fix shift value for canonical RIP test and use
   __is_canonical_address()

Brian Gerst (6):
  x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
  x86/entry/64: Convert SYSRET validation tests to C
  x86/entry/compat: Combine return value test from syscall handler
  x86/entry/32: Convert do_fast_syscall_32() to bool return type
  x86/entry/32: Remove SEP test for SYSEXIT
  x86/entry/32: Clean up syscall fast exit tests

 arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
 arch/x86/entry/entry_32.S        |  2 +-
 arch/x86/entry/entry_64.S        | 68 +---------------------
 arch/x86/entry/entry_64_compat.S | 12 ++--
 arch/x86/include/asm/syscall.h   |  6 +-
 5 files changed, 77 insertions(+), 110 deletions(-)
  

Comments

Ingo Molnar Oct. 5, 2023, 8:22 a.m. UTC | #1
* Brian Gerst <brgerst@gmail.com> wrote:

> This patch set cleans up the tests done to determine if a fast syscall
> return instruction can be used to return to userspace.  It converts the
> code to C, and refactors existing code to be more readable.
> 
> v2:
>  - Fix shift value for canonical RIP test and use
>    __is_canonical_address()
> 
> Brian Gerst (6):
>   x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
>   x86/entry/64: Convert SYSRET validation tests to C
>   x86/entry/compat: Combine return value test from syscall handler
>   x86/entry/32: Convert do_fast_syscall_32() to bool return type
>   x86/entry/32: Remove SEP test for SYSEXIT
>   x86/entry/32: Clean up syscall fast exit tests
> 
>  arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
>  arch/x86/entry/entry_32.S        |  2 +-
>  arch/x86/entry/entry_64.S        | 68 +---------------------
>  arch/x86/entry/entry_64_compat.S | 12 ++--
>  arch/x86/include/asm/syscall.h   |  6 +-
>  5 files changed, 77 insertions(+), 110 deletions(-)

Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry
(i.e. skipped #2 & #6 for now), as they look correct and are good
improvements. None of these four patches depend on the skipped
patches in some way I missed, correct?

As for #2, I looked at the before/after disassembly, and the new
C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
if I was reading it right.

Mind re-evaluating that, and if you still think the C conversion
is a good idea, mind putting a before/after analysis of the
generated instructions into the changelog? This is our primary
system call return path after all.

Thanks,

	Ingo
  
Brian Gerst Oct. 5, 2023, 3:13 p.m. UTC | #2
On Thu, Oct 5, 2023 at 4:22 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
> > This patch set cleans up the tests done to determine if a fast syscall
> > return instruction can be used to return to userspace.  It converts the
> > code to C, and refactors existing code to be more readable.
> >
> > v2:
> >  - Fix shift value for canonical RIP test and use
> >    __is_canonical_address()
> >
> > Brian Gerst (6):
> >   x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
> >   x86/entry/64: Convert SYSRET validation tests to C
> >   x86/entry/compat: Combine return value test from syscall handler
> >   x86/entry/32: Convert do_fast_syscall_32() to bool return type
> >   x86/entry/32: Remove SEP test for SYSEXIT
> >   x86/entry/32: Clean up syscall fast exit tests
> >
> >  arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
> >  arch/x86/entry/entry_32.S        |  2 +-
> >  arch/x86/entry/entry_64.S        | 68 +---------------------
> >  arch/x86/entry/entry_64_compat.S | 12 ++--
> >  arch/x86/include/asm/syscall.h   |  6 +-
> >  5 files changed, 77 insertions(+), 110 deletions(-)
>
> Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry,
> (ie. skipped #2 & #6 for now), as they look correct and are good
> improvements. None of these four patches depend on the skipped
> patches in some way I missed, correct?
>
> As for #2, I looked at the before/after disassembly, and the new
> C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
> if I was reading it right.
>
> Mind re-evaluating that, and if you still think the C conversion
> is a good idea, mind putting a before/after analysis of the
> generated instructions into the changelog? This is our primary
> system call return path after all.

Looking at the compiled output, the only suboptimal code appears to be
the canonical address test, where the C code uses the CL register for
the shifts instead of immediates.

 180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
                        181: R_X86_64_PC32      .altinstr_aux-0x4
 185:   b9 07 00 00 00          mov    $0x7,%ecx
 18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
 18c:   b9 10 00 00 00          mov    $0x10,%ecx
 191:   48 89 c2                mov    %rax,%rdx
 194:   48 d3 e2                shl    %cl,%rdx
 197:   48 d3 fa                sar    %cl,%rdx
 19a:   48 39 d0                cmp    %rdx,%rax
 19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
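
For reference, a minimal C sketch of the sign-extension style of test this
likely compiles from (an illustration in the spirit of
__is_canonical_address(); the names below are placeholders, not the kernel
code).  Because the address width is only known at run time, the shift
count has to live in a register, which is why the shl/sar above go through
%cl:

    #include <stdbool.h>

    /* vaddr_bits is 48, or 57 when 5-level paging (LA57) is enabled */
    static inline bool canonical_sketch(unsigned long vaddr, unsigned int vaddr_bits)
    {
            /* sign-extend from the top implemented bit, then compare */
            long ext = (long)(vaddr << (64 - vaddr_bits)) >> (64 - vaddr_bits);

            return (unsigned long)ext == vaddr;
    }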

Was there anything else specifically that you can point out?

Brian Gerst
  
Ingo Molnar Oct. 5, 2023, 8:20 p.m. UTC | #3
* Brian Gerst <brgerst@gmail.com> wrote:

> Looking at the compiled output, the only suboptimal code appears to be
> the canonical address test, where the C code uses the CL register for
> the shifts instead of immediates.
> 
>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
>                         181: R_X86_64_PC32      .altinstr_aux-0x4
>  185:   b9 07 00 00 00          mov    $0x7,%ecx
>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
>  191:   48 89 c2                mov    %rax,%rdx
>  194:   48 d3 e2                shl    %cl,%rdx
>  197:   48 d3 fa                sar    %cl,%rdx
>  19a:   48 39 d0                cmp    %rdx,%rax
>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>

Yeah, it didn't look equivalent - so I guess we want a C equivalent for:

-       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
-               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

instead of the pgtable_l5_enabled() runtime test that 
__is_canonical_address() uses?

Thanks,

	Ingo
  
H. Peter Anvin Oct. 6, 2023, 6:59 p.m. UTC | #4
On 10/5/23 13:20, Ingo Molnar wrote:
> 
> * Brian Gerst <brgerst@gmail.com> wrote:
> 
>> Looking at the compiled output, the only suboptimal code appears to be
>> the canonical address test, where the C code uses the CL register for
>> the shifts instead of immediates.
>>
>>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
>>                          181: R_X86_64_PC32      .altinstr_aux-0x4
>>   185:   b9 07 00 00 00          mov    $0x7,%ecx
>>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
>>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
>>   191:   48 89 c2                mov    %rax,%rdx
>>   194:   48 d3 e2                shl    %cl,%rdx
>>   197:   48 d3 fa                sar    %cl,%rdx
>>   19a:   48 39 d0                cmp    %rdx,%rax
>>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> 
> Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> 
> -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> 
> instead of the pgtable_l5_enabled() runtime test that
> __is_canonical_address() uses?
> 

I don't think such a thing (without simply duplicating the above as an
alternative asm, which is obviously easy enough, and still allows the
compiler to pick the register used) would be possible without immediate
patching support[*].
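
For concreteness, the "duplicate it as an alternative asm" route could look
roughly like the sketch below from C (an assumption for illustration, using
the ALTERNATIVE() inline-asm macro and X86_FEATURE_LA57; this is not code
from the series):

    #include <linux/types.h>
    #include <asm/alternative.h>
    #include <asm/cpufeatures.h>

    static inline bool ip_is_canonical(unsigned long ip)
    {
            unsigned long tmp = ip;

            /* patched at boot: 48-bit shifts by default, 57-bit shifts on LA57 */
            asm(ALTERNATIVE("shl $(64 - 48), %[p]; sar $(64 - 48), %[p]",
                            "shl $(64 - 57), %[p]; sar $(64 - 57), %[p]",
                            X86_FEATURE_LA57)
                : [p] "+r" (tmp));

            return tmp == ip;
    }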

Incidentally, this is a question for Uros: is there a reason this is a 
mov to %ecx and not just %cl, which would save 3 bytes?

Incidentally, it is possible to save one instruction and use only *one* 
alternative immediate:

	leaq (%rax,%rax),%rdx
	xorq %rax,%rdx
	shrq $(63 - LA),%rdx		# Yes, 63, not 64
	# ZF=1 if canonical

This works because if bit [x] is set in the output, then bits [x] and
[x-1] in the input are different (bit [-1] considered to be zero); and
by definition an address is canonical if and only if all the bits [63:LA]
are identical, thus bits [63:LA+1] in the output must all be zero.

The first two instructions are pure arithmetic and can thus be done in C:

	bar = foo ^ (foo << 1);

... leaving only one instruction needing to be patched at runtime.
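
Spelled out as standalone C, the fold is roughly the sketch below (an
illustration only, parameterized by the runtime address width 48 or 57; the
exact immediate that the single patched shift should use is what the
follow-ups below go back and forth on):

    #include <stdbool.h>

    static inline bool canonical_fold(unsigned long addr, unsigned int vaddr_bits)
    {
            /* bit x is set iff bits x and x-1 of addr differ (bit -1 taken as zero) */
            unsigned long diff = addr ^ (addr << 1);

            /* canonical iff bits [63:vaddr_bits-1] of addr are all identical,
               i.e. diff has no bit set at or above vaddr_bits */
            return (diff >> vaddr_bits) == 0;
    }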

	-hpa



[*] which is a whole bit of infrastructure that I know first-hand is 
extremely complex[**] -- I had an almost-complete patchset done at one 
time, but it has severely bitrotted. The good part is that it is 
probably a lot easier to do today, with the alternatives-patching 
callback routines available.

[**] the absolute best would of course be if such infrastructure could 
be compiler-supported (probably as part of some really fancy support for 
alternatives/static branch), but that would probably be a nightmare to 
do in the compiler or a plugin.
  
Brian Gerst Oct. 6, 2023, 9:32 p.m. UTC | #5
On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <brgerst@gmail.com> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> >>                          181: R_X86_64_PC32      .altinstr_aux-0x4
> >>   185:   b9 07 00 00 00          mov    $0x7,%ecx
> >>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> >>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
> >>   191:   48 89 c2                mov    %rax,%rdx
> >>   194:   48 d3 e2                shl    %cl,%rdx
> >>   197:   48 d3 fa                sar    %cl,%rdx
> >>   19a:   48 39 d0                cmp    %rdx,%rax
> >>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
> >
>
> I don't think such a thing (without simply duplicate the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?
>
> Incidentally, it is possible to save one instruction and use only *one*
> alternative immediate:
>
>         leaq (%rax,%rax),%rdx
>         xorq %rax,%rdx
>         shrq $(63 - LA),%rdx            # Yes, 63, not 64
>         # ZF=1 if canonical
>
> This works because if bit [x] is set in the output, then bit [x] and
> [x-1] in the input are different (bit [-1] considered to be zero); and
> by definition a bit is canonical if and only if all the bits [63:LA] are
> identical, thus bits [63:LA+1] in the output must all be zero.
>
> The first two instructions are pure arithmetic and can thus be done in C:
>
>         bar = foo ^ (foo << 1);
>
> ... leaving only one instruction needing to be patched at runtime.
>
>         -hpa

One other alternative I have been considering is comparing against
TASK_SIZE_MAX.  The only user-executable address above that is the
long deprecated vsyscall page.  IMHO it's not worth optimizing for
that case, so just let it fall back to using IRET.

    if (unlikely(regs->ip >= TASK_SIZE_MAX)) return false;

compiles to:

 180:   48 b9 00 f0 ff ff ff    movabs $0x7ffffffff000,%rcx
 187:   7f 00 00
 18a:   48 39 c8                cmp    %rcx,%rax
 18d:   73 39                   jae    1c8 <do_syscall_64+0xc8>

0000000000000000 <.altinstr_replacement>:
   0:   48 b9 00 f0 ff ff ff    movabs $0xfffffffffff000,%rcx
   7:   ff ff 00
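
(For context, and as an assumption about the backing definition rather than
something quoted here: TASK_SIZE_MAX on x86-64 works out as below, which is
where the two movabs immediates come from.

    /* 0x7ffffffff000 with 4-level paging, 0xfffffffffff000 with LA57 */
    #define TASK_SIZE_MAX   ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)

That is one page below the top of the canonical user address range.)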

Brian Gerst
  
H. Peter Anvin Oct. 6, 2023, 11:58 p.m. UTC | #6
On 10/6/23 11:59, H. Peter Anvin wrote:
> 
> Incidentally, it is possible to save one instruction and use only *one* 
> alternative immediate:
> 
>      leaq (%rax,%rax),%rdx
>      xorq %rax,%rdx
>      shrq $(63 - LA),%rdx        # Yes, 63, not 64
>      # ZF=1 if canonical
> 
> This works because if bit [x] is set in the output, then bit [x] and 
> [x-1] in the input are different (bit [-1] considered to be zero); and 
> by definition a bit is canonical if and only if all the bits [63:LA] are 
> identical, thus bits [63:LA+1] in the output must all be zero.
> 

Yes, I'm a doofus. Bits [63:LA-1] must be identical, so 64 is correct :$)

	-hpa
  
Ingo Molnar Oct. 7, 2023, 9:42 a.m. UTC | #7
* Brian Gerst <brgerst@gmail.com> wrote:

> On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > On 10/5/23 13:20, Ingo Molnar wrote:
> > >
> > > * Brian Gerst <brgerst@gmail.com> wrote:
> > >
> > >> Looking at the compiled output, the only suboptimal code appears to be
> > >> the canonical address test, where the C code uses the CL register for
> > >> the shifts instead of immediates.
> > >>
> > >>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> > >>                          181: R_X86_64_PC32      .altinstr_aux-0x4
> > >>   185:   b9 07 00 00 00          mov    $0x7,%ecx
> > >>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> > >>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
> > >>   191:   48 89 c2                mov    %rax,%rdx
> > >>   194:   48 d3 e2                shl    %cl,%rdx
> > >>   197:   48 d3 fa                sar    %cl,%rdx
> > >>   19a:   48 39 d0                cmp    %rdx,%rax
> > >>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> > >
> > > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> > >
> > > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> > >
> > > instead of the pgtable_l5_enabled() runtime test that
> > > __is_canonical_address() uses?
> > >
> >
> > I don't think such a thing (without simply duplicate the above as an
> > alternative asm, which is obviously easy enough, and still allows the
> > compiler to pick the register used) would be possible without immediate
> > patching support[*].
> >
> > Incidentally, this is a question for Uros: is there a reason this is a
> > mov to %ecx and not just %cl, which would save 3 bytes?
> >
> > Incidentally, it is possible to save one instruction and use only *one*
> > alternative immediate:
> >
> >         leaq (%rax,%rax),%rdx
> >         xorq %rax,%rdx
> >         shrq $(63 - LA),%rdx            # Yes, 63, not 64
> >         # ZF=1 if canonical
> >
> > This works because if bit [x] is set in the output, then bit [x] and
> > [x-1] in the input are different (bit [-1] considered to be zero); and
> > by definition a bit is canonical if and only if all the bits [63:LA] are
> > identical, thus bits [63:LA+1] in the output must all be zero.
> >
> > The first two instructions are pure arithmetic and can thus be done in C:
> >
> >         bar = foo ^ (foo << 1);
> >
> > ... leaving only one instruction needing to be patched at runtime.
> >
> >         -hpa
> 
> One other alternative I have been considering is comparing against
> TASK_SIZE_MAX.  The only user-executable address above that is the
> long deprecated vsyscall page.  IMHO it's not worth optimizing for
> that case, so just let it fall back to using IRET.
> 
>     if (unlikely(regs->ip >= TASK_SIZE_MAX)) return false;
> 
> compiles to:
> 
>  180:   48 b9 00 f0 ff ff ff    movabs $0x7ffffffff000,%rcx
>  187:   7f 00 00
>  18a:   48 39 c8                cmp    %rcx,%rax
>  18d:   73 39                   jae    1c8 <do_syscall_64+0xc8>
> 
> 0000000000000000 <.altinstr_replacement>:
>    0:   48 b9 00 f0 ff ff ff    movabs $0xfffffffffff000,%rcx
>    7:   ff ff 00

That sounds good - and we could do this as a separate patch on top
of your existing patches, to keep it bisectable in case there are
any problems.

Thanks,

	Ingo
  
Uros Bizjak Oct. 7, 2023, 9:56 a.m. UTC | #8
On Fri, Oct 6, 2023 at 8:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <brgerst@gmail.com> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> >>                          181: R_X86_64_PC32      .altinstr_aux-0x4
> >>   185:   b9 07 00 00 00          mov    $0x7,%ecx
> >>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> >>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
> >>   191:   48 89 c2                mov    %rax,%rdx
> >>   194:   48 d3 e2                shl    %cl,%rdx
> >>   197:   48 d3 fa                sar    %cl,%rdx
> >>   19a:   48 39 d0                cmp    %rdx,%rax
> >>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
> >
>
> I don't think such a thing (without simply duplicate the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?

The compiler uses 32-bit operations to move values between registers, even
when they are less than 32 bits wide. To avoid partial register stalls,
it uses 32-bit operations as much as possible (e.g. a zero-extending load
for an 8-bit value from memory, a 32-bit load of a constant, etc.). Since
the kernel is compiled with -O2, the compiler does not care that much
about the size of instructions, and it uses the full 32-bit width to
initialize a register with a constant.

Please note that an 8-bit movb instruction in fact represents an insert
into a word-mode register. The compiler does not know how this word-mode
register will be used, so to avoid partial register stalls, it takes a
cautious approach and (with -O2) moves the constant to a register with a
word-width instruction.

Also, the compiler is quite eager to CSE constants. When there are two
or more uses of the same constant, it will move the constant into a
register first.
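
A tiny illustration of the zero-extension point (a hypothetical snippet,
not something from the series):

    unsigned char g;

    unsigned int load_byte(void)
    {
            /* at -O2 this is typically "movzbl g(%rip), %eax" rather than
               an 8-bit load into %al, avoiding a partial-register write */
            return g;
    }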

Uros.
  
Linus Torvalds Oct. 7, 2023, 6:07 p.m. UTC | #9
On Sat, 7 Oct 2023 at 02:57, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Please note that 8-bit movb instruction in fact represents insert into
> word-mode register. The compiler does not know how this word-mode
> register will be used, so to avoid partial register stalls, it takes a
> cautious approach and (with -O2) moves constant to a register with a
> word-width instruction.

Yes. In this case I really think it's our kernel code that is disgusting.

People seem to think that because cpu_feature_enabled() uses static
jumps, it's somehow "free". Not so. The static jumps are often
horrendous for code quality, and they effectively make the compiler
blind to otherwise obvious optimizations.

We need to stop using those so blindly. I see code like this

       return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;

and it just makes me sad. And yes, that virtual mask is disgusting. This:

    #define __VIRTUAL_MASK_SHIFT       (pgtable_l5_enabled() ? 56 : 47)

is *NOT* cheap. The end result is *not* a constant, it's a two-way
branch, and while the branch is static and the compiler may then
optimize each side of the branch with the constant in question, most
of the time it just results in *disgusting* code like this where gcc
just has to load a constant into a register, and treat it as a
variable.

With the code shuffling, it's possibly *more* expensive than just
loading from memory, in that you just change a D$ load into an I$ one
instead. I'm sure that looks great on benchmarks that stay entirely in
the I$, but ...

Something like using alternatives is *much* cheaper. Better yet is the
suggestion to simplify the conditionals entirely to not depend on the
virtual shift at all (i.e. our recent access_ok() simplifications that
were due to _another_ CPU feature).

> Also, the compiler is quite eager to CSE constants. When there are two
> or more uses of the same constant, it will move a constant into the
> register first.

Now, honestly, that may be a good thing for other architectures that
have issues with constants, but on x86 - and integer constants - it's
likely a bad thing unless the constant is particularly complicated.

Most x86 instructions will take the usual 8/32-bit constants (but if
we have arbitrary 64-bit ones, you tend to need them in registers).

             Linus