[v7,30/41] x86/shstk: Handle thread shadow stack

Message ID 20230227222957.24501-31-rick.p.edgecombe@intel.com
State New
Headers
Series Shadow stacks for userspace |

Commit Message

Edgecombe, Rick P Feb. 27, 2023, 10:29 p.m. UTC
  From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-cet case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack size. A
compat-mode thread shadow stack size is further reduced to 1/4. This
allows more threads to run in a 32-bit address space. The clone() does not
pass stack_size, which was added to clone3(). In that case, use
RLIMIT_STACK size and cap to 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET does
not add any additional limitations for vfork().

Userspace implementing posix vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared in the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v3:
 - Fix update_fpu_shstk() stub (Mike Rapoport)
 - Fix chunks around alloc_shstk() in wrong patch (Kees)
 - Fix stack_size/flags swap (Kees)
 - Use centralized stack size logic (Kees)

v2:
 - Have fpu_clone() take new shadow stack pointer and update SSP in
   xsave buffer for new task. (tglx)

v1:
 - Expand commit log.
 - Add more comments.
 - Switch to xsave helpers.

Yu-cheng v30:
 - Update comments about clone()/clone3(). (Borislav Petkov)
---
 arch/x86/include/asm/fpu/sched.h   |  3 ++-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/include/asm/shstk.h       |  7 +++++
 arch/x86/kernel/fpu/core.c         | 41 +++++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 18 ++++++++++++-
 arch/x86/kernel/shstk.c            | 43 ++++++++++++++++++++++++++++--
 6 files changed, 109 insertions(+), 5 deletions(-)
  

Comments

Szabolcs Nagy March 2, 2023, 5:34 p.m. UTC | #1
The 02/27/2023 14:29, Rick Edgecombe wrote:
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from
> returning from the vfork() calling function, using CET. Glibc does this
> by adjusting the shadow stack pointer in the child, so that the child
> receives a #CP if it tries to return from vfork() calling function.

this commit message implies there is protection against
the vfork child clobbering the parent's shadow stack,
but actually the child can INCSSP (or longjmp) and then
clobber it.

so the glibc code just tries to catch bugs and accidents
not a strong security mechanism. i'd skip this paragraph.
  
Edgecombe, Rick P March 2, 2023, 9:48 p.m. UTC | #2
On Thu, 2023-03-02 at 17:34 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > For shadow stack enabled vfork(), the parent and child can share
> > the same
> > shadow stack, like they can share a normal stack. Since the parent
> > is
> > suspended until the child terminates, the child will not interfere
> > with
> > the parent while executing as long as it doesn't return from the
> > vfork()
> > and overwrite up the shadow stack. The child can safely overwrite
> > down
> > the shadow stack, as the parent can just overwrite this later. So
> > CET does
> > not add any additional limitations for vfork().
> > 
> > Userspace implementing posix vfork() can actually prevent the child
> > from
> > returning from the vfork() calling function, using CET. Glibc does
> > this
> > by adjusting the shadow stack pointer in the child, so that the
> > child
> > receives a #CP if it tries to return from vfork() calling function.
> 
> this commit message implies there is protection against
> the vfork child clobbering the parent's shadow stack,
> but actually the child can INCSSP (or longjmp) and then
> clobber it.

It's true the vfork child could use INCSSP and clobber to create
problems, so it is not a strong guarantee of shadow stack integrity.
But that's not claimed either. It does "prevent the child from
returning from the vfork() calling function" as much as shadow stack
protections apply, which I think would be reasonably understood. The
vfork child could also use wrss to write the return address to the
shadow stack and actually return, or disable shadow stack and return,
as other ways to create problems.

> 
> so the glibc code just tries to catch bugs and accidents
> not a strong security mechanism. i'd skip this paragraph.

Yep. I think it's very much a "nice to have" thing and not intended for
security. The paragraph is an aside anyway, because it is specifics
about another project. I don't have any objection to dropping it if the
opportunity comes up. IIRC it was added because someone thought vfork
couldn't work with shadow stack, so people might like to have the
details of how can be done.

I wouldn't even be too bothered if the discussed glibc behavior was
dropped either. vfork() can go wrong many ways regardless of shadow
stack. Is it worth the extra special behavior? Maybe just barely...
  
Borislav Petkov March 8, 2023, 3:26 p.m. UTC | #3
On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When a process is duplicated, but the child shares the address space with
> the parent, there is potential for the threads sharing a single stack to
> cause conflicts for each other. In the normal non-cet case this is handled

"non-CET"

> in two ways.
> 
> With regular CLONE_VM a new stack is provided by userspace such that the
> parent and child have different stacks.
> 
> For vfork, the parent is suspended until the child exits. So as long as
> the child doesn't return from the vfork()/CLONE_VFORK calling function and
> sticks to a limited set of operations, the parent and child can share the
> same stack.
> 
> For shadow stack, these scenarios present similar sharing problems. For the
> CLONE_VM case, the child and the parent must have separate shadow stacks.
> Instead of changing clone to take a shadow stack, have the kernel just
> allocate one and switch to it.
> 
> Use stack_size passed from clone3() syscall for thread shadow stack size. A
> compat-mode thread shadow stack size is further reduced to 1/4. This
> allows more threads to run in a 32-bit address space. The clone() does not
> pass stack_size, which was added to clone3(). In that case, use
> RLIMIT_STACK size and cap to 4 GB.
> 
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from

"POSIX"

...

> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index f851558b673f..bc3de4aeb661 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
>  	}
>  }
>  
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> +{
> +	struct cet_user_state *xstate;
> +
> +	/* If ssp update is not needed. */
> +	if (!ssp)
> +		return 0;
> +
> +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> +				XFEATURE_CET_USER);
> +
> +	/*
> +	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
> +	 * stack and the fpu state should be up to date since it was just copied
> +	 * from the parent in fpu_clone(). So there must be a valid non-init CET
> +	 * state location in the buffer.
> +	 */
> +	if (WARN_ON_ONCE(!xstate))
> +		return 1;
> +
> +	xstate->user_ssp = (u64)ssp;
> +
> +	return 0;
> +}
> +#else
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
								      ^^^^^^^^^^^
ssp, like above.

Better yet:

static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
{
#ifdef CONFIG_X86_USER_SHADOW_STACK
	...
#endif
	return 0;
}

and less ifdeffery.



> +{
> +	return 0;
> +}
> +#endif
> +
>  /* Clone current's FPU state on fork */
> -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
> +	      unsigned long ssp)
>  {
>  	struct fpu *src_fpu = &current->thread.fpu;
>  	struct fpu *dst_fpu = &dst->thread.fpu;
> @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
>  	if (use_xsave())
>  		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
>  
> +	/*
> +	 * Update shadow stack pointer, in case it changed during clone.
> +	 */
> +	if (update_fpu_shstk(dst, ssp))
> +		return 1;
> +
>  	trace_x86_fpu_copy_src(src_fpu);
>  	trace_x86_fpu_copy_dst(dst_fpu);
>  
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b650cde3f64d..bf703f53fa49 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -48,6 +48,7 @@
>  #include <asm/frame.h>
>  #include <asm/unwind.h>
>  #include <asm/tdx.h>
> +#include <asm/shstk.h>
>  
>  #include "process.h"
>  
> @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
>  
>  	free_vm86(t);
>  
> +	shstk_free(tsk);
>  	fpu__drop(fpu);
>  }
>  
> @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	struct inactive_task_frame *frame;
>  	struct fork_frame *fork_frame;
>  	struct pt_regs *childregs;
> +	unsigned long shstk_addr = 0;
>  	int ret = 0;
>  
>  	childregs = task_pt_regs(p);
> @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	frame->flags = X86_EFLAGS_FIXED;
>  #endif
>  
> -	fpu_clone(p, clone_flags, args->fn);
> +	/* Allocate a new shadow stack for pthread if needed */
> +	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
> +				       &shstk_addr);

That function will return 0 even if shstk_addr hasn't been written in it
and you will continue merrily and call

	fpu_clone(..., shstk_addr=0);

why don't you return the shadow stack address or negative on error
instead of adding an I/O parameter which is pretty much always nasty to
deal with.



> +	if (ret)
> +		return ret;
> +
> +	fpu_clone(p, clone_flags, args->fn, shstk_addr);
>  
>  	/* Kernel thread ? */
>  	if (unlikely(p->flags & PF_KTHREAD)) {

...
  
Edgecombe, Rick P March 8, 2023, 8:03 p.m. UTC | #4
On Wed, 2023-03-08 at 16:26 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > When a process is duplicated, but the child shares the address
> > space with
> > the parent, there is potential for the threads sharing a single
> > stack to
> > cause conflicts for each other. In the normal non-cet case this is
> > handled
> 
> "non-CET"

Sure.

> 
> > in two ways.
> > 
> > With regular CLONE_VM a new stack is provided by userspace such
> > that the
> > parent and child have different stacks.
> > 
> > For vfork, the parent is suspended until the child exits. So as
> > long as
> > the child doesn't return from the vfork()/CLONE_VFORK calling
> > function and
> > sticks to a limited set of operations, the parent and child can
> > share the
> > same stack.
> > 
> > For shadow stack, these scenarios present similar sharing problems.
> > For the
> > CLONE_VM case, the child and the parent must have separate shadow
> > stacks.
> > Instead of changing clone to take a shadow stack, have the kernel
> > just
> > allocate one and switch to it.
> > 
> > Use stack_size passed from clone3() syscall for thread shadow stack
> > size. A
> > compat-mode thread shadow stack size is further reduced to 1/4.
> > This
> > allows more threads to run in a 32-bit address space. The clone()
> > does not
> > pass stack_size, which was added to clone3(). In that case, use
> > RLIMIT_STACK size and cap to 4 GB.
> > 
> > For shadow stack enabled vfork(), the parent and child can share
> > the same
> > shadow stack, like they can share a normal stack. Since the parent
> > is
> > suspended until the child terminates, the child will not interfere
> > with
> > the parent while executing as long as it doesn't return from the
> > vfork()
> > and overwrite up the shadow stack. The child can safely overwrite
> > down
> > the shadow stack, as the parent can just overwrite this later. So
> > CET does
> > not add any additional limitations for vfork().
> > 
> > Userspace implementing posix vfork() can actually prevent the child
> > from
> 
> "POSIX"

Ok.

> 
> ...
> 
> > diff --git a/arch/x86/kernel/fpu/core.c
> > b/arch/x86/kernel/fpu/core.c
> > index f851558b673f..bc3de4aeb661 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct
> > fpu *dst_fpu)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long
> > ssp)
> > +{
> > +	struct cet_user_state *xstate;
> > +
> > +	/* If ssp update is not needed. */
> > +	if (!ssp)
> > +		return 0;
> > +
> > +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> > +				XFEATURE_CET_USER);
> > +
> > +	/*
> > +	 * If there is a non-zero ssp, then 'dst' must be configured
> > with a shadow
> > +	 * stack and the fpu state should be up to date since it was
> > just copied
> > +	 * from the parent in fpu_clone(). So there must be a valid
> > non-init CET
> > +	 * state location in the buffer.
> > +	 */
> > +	if (WARN_ON_ONCE(!xstate))
> > +		return 1;
> > +
> > +	xstate->user_ssp = (u64)ssp;
> > +
> > +	return 0;
> > +}
> > +#else
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long
> > shstk_addr)
> 
> 								      ^
> ^^^^^^^^^^
> ssp, like above.
> 
> Better yet:
> 
> static int update_fpu_shstk(struct task_struct *dst, unsigned long
> ssp)
> {
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> 	...
> #endif
> 	return 0;
> }
> 
> and less ifdeffery.

Sure. Sometimes people tell me to only ifdef out whole functions to
make it easier to read. I suppose in this case it's not hard to see.


> 
> 
> 
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /* Clone current's FPU state on fork */
> > -int fpu_clone(struct task_struct *dst, unsigned long clone_flags,
> > bool minimal)
> > +int fpu_clone(struct task_struct *dst, unsigned long clone_flags,
> > bool minimal,
> > +	      unsigned long ssp)
> >  {
> >  	struct fpu *src_fpu = &current->thread.fpu;
> >  	struct fpu *dst_fpu = &dst->thread.fpu;
> > @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst,
> > unsigned long clone_flags, bool minimal)
> >  	if (use_xsave())
> >  		dst_fpu->fpstate->regs.xsave.header.xfeatures &=
> > ~XFEATURE_MASK_PASID;
> >  
> > +	/*
> > +	 * Update shadow stack pointer, in case it changed during
> > clone.
> > +	 */
> > +	if (update_fpu_shstk(dst, ssp))
> > +		return 1;
> > +
> >  	trace_x86_fpu_copy_src(src_fpu);
> >  	trace_x86_fpu_copy_dst(dst_fpu);
> >  
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index b650cde3f64d..bf703f53fa49 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -48,6 +48,7 @@
> >  #include <asm/frame.h>
> >  #include <asm/unwind.h>
> >  #include <asm/tdx.h>
> > +#include <asm/shstk.h>
> >  
> >  #include "process.h"
> >  
> > @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
> >  
> >  	free_vm86(t);
> >  
> > +	shstk_free(tsk);
> >  	fpu__drop(fpu);
> >  }
> >  
> > @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const
> > struct kernel_clone_args *args)
> >  	struct inactive_task_frame *frame;
> >  	struct fork_frame *fork_frame;
> >  	struct pt_regs *childregs;
> > +	unsigned long shstk_addr = 0;
> >  	int ret = 0;
> >  
> >  	childregs = task_pt_regs(p);
> > @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const
> > struct kernel_clone_args *args)
> >  	frame->flags = X86_EFLAGS_FIXED;
> >  #endif
> >  
> > -	fpu_clone(p, clone_flags, args->fn);
> > +	/* Allocate a new shadow stack for pthread if needed */
> > +	ret = shstk_alloc_thread_stack(p, clone_flags, args-
> > >stack_size,
> > +				       &shstk_addr);
> 
> That function will return 0 even if shstk_addr hasn't been written in
> it
> and you will continue merrily and call
> 
> 	fpu_clone(..., shstk_addr=0);
> 
> why don't you return the shadow stack address or negative on error
> instead of adding an I/O parameter which is pretty much always nasty
> to
> deal with.

On a shadow stack allocation error, we fail the copy_thread(). When
shadow stack is enabled, the app might be able to handle a clone
failure, but would not be able to handle starting a new thread without
getting a new shadow stack.

So in your suggestion I guess we would have two types of failure one
that signifies shadow stack is enabled and the allocation failed, and
another that signifies that shadow stack is not enabled, so zero needs
to be passed into fpu_clone()?

We need the output param in shstk_alloc_thread_stack() because we need
to update the SSP to the new shadow stack. If we want to make the non-
shadow stack case handled differently, I think the extra conditionals
are worse, like:
/* Allocate a new shadow stack for pthread if needed */
ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
				&shstk_addr);
if (ret == -EOPNOTSUPP)
	fpu_clone(p, clone_flags, args->fn, 0);
else if (ret < 0)
	return ret;
else
	fpu_clone(p, clone_flags, args->fn, shstk_addr);

Do you think?

It used to be that shstk_alloc_thread_stack() reached into FPU
internals to do the SSP update itself. Then the ability to do this was
removed. So I came up with an interface for allowing features to modify
XSAVE buffers from outside the FPU code. On further discussion, letting
code outside the FPU have flexible access to the XSAVE buffer could
constrain the FPU code from adding optimizations. So Thomas suggested
to pass the SSP along into FPU code so that the FPU modification could
be all monolithic and flexible.

If the default SSP value logic is too hidden, what about some clearer
code and comments, like this?

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..bd123527fcca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct
kernel_clone_args *args)
        struct inactive_task_frame *frame;
        struct fork_frame *fork_frame;
        struct pt_regs *childregs;
-       unsigned long shstk_addr = 0;
+       unsigned long new_ssp;
        int ret = 0;
 
        childregs = task_pt_regs(p);
@@ -177,13 +177,18 @@ int copy_thread(struct task_struct *p, const
struct kernel_clone_args *args)
        frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-       /* Allocate a new shadow stack for pthread if needed */
+       /*
+        * Allocate a new shadow stack for thread if needed. If shadow
stack,
+        * is disabled, new_ssp will remain 0, and fpu_clone() will
know not to
+        * update it.
+        */
+       new_ssp = 0;
        ret = shstk_alloc_thread_stack(p, clone_flags, args-
>stack_size,
-                                      &shstk_addr);
+                                      &new_ssp);
        if (ret)
                return ret;
 
-       fpu_clone(p, clone_flags, args->fn, shstk_addr);
+       fpu_clone(p, clone_flags, args->fn, new_ssp);
 
        /* Kernel thread ? */
        if (unlikely(p->flags & PF_KTHREAD)) {
  
Borislav Petkov March 9, 2023, 2:12 p.m. UTC | #5
On Wed, Mar 08, 2023 at 08:03:17PM +0000, Edgecombe, Rick P wrote:

Btw,

pls try to trim your replies as I need ot scroll through pages of quoted
text to find the response.

> Sure. Sometimes people tell me to only ifdef out whole functions to
> make it easier to read. I suppose in this case it's not hard to see.

Yeah, the less ifdeffery we have, the better.

> If the default SSP value logic is too hidden, what about some clearer
> code and comments, like this?

The problem with this function is that it needs to return three things:

* success:
 ** 0
 or
 ** shadow stack address
* failure: due to allocation.

How about this below instead? (totally untested ofc):

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..6e323d4e32fc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
-	unsigned long shstk_addr = 0;
+	unsigned long shstk_addr;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -178,10 +178,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 #endif
 
 	/* Allocate a new shadow stack for pthread if needed */
-	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
-				       &shstk_addr);
-	if (ret)
-		return ret;
+	shstk_addr = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	if (IS_ERR_VALUE(shstk_addr))
+		return PTR_ERR((void *)shstk_addr);
 
 	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 13c02747386f..b1668b499e9a 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -157,8 +157,8 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-			     unsigned long stack_size, unsigned long *shstk_addr)
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+				       unsigned long stack_size)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
 	unsigned long addr, size;
@@ -180,14 +180,12 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 	size = adjust_shstk_size(stack_size);
 	addr = alloc_shstk(size);
 	if (IS_ERR_VALUE(addr))
-		return PTR_ERR((void *)addr);
+		return addr;
 
 	shstk->base = addr;
 	shstk->size = size;
 
-	*shstk_addr = addr + size;
-
-	return 0;
+	return addr + size;
 }
 
 static unsigned long get_user_shstk_addr(void)
  
Edgecombe, Rick P March 9, 2023, 4:59 p.m. UTC | #6
On Thu, 2023-03-09 at 15:12 +0100, Borislav Petkov wrote:
> On Wed, Mar 08, 2023 at 08:03:17PM +0000, Edgecombe, Rick P wrote:
> 
> Btw,
> 
> pls try to trim your replies as I need ot scroll through pages of
> quoted
> text to find the response.

Sure sorry.


[...]
> 
> > If the default SSP value logic is too hidden, what about some
> > clearer
> > code and comments, like this?
> 
> The problem with this function is that it needs to return three
> things:
> 
> * success:
>  ** 0
>  or
>  ** shadow stack address
> * failure: due to allocation.
> 
> How about this below instead? (totally untested ofc):

Ah, I see what you were saying now. It looks like it will work to me if
you think it is better stylistically.
  
Borislav Petkov March 9, 2023, 5:04 p.m. UTC | #7
On Thu, Mar 09, 2023 at 04:59:52PM +0000, Edgecombe, Rick P wrote:
> Ah, I see what you were saying now. It looks like it will work to me if
> you think it is better stylistically.

Yeah, having a function return an error *and* an I/O parameter at the
same time is more complicated and error prone instead of when you have
a single retval and only input parameters.

I'd say.
  
Edgecombe, Rick P March 9, 2023, 8:29 p.m. UTC | #8
On Thu, 2023-03-09 at 18:04 +0100, Borislav Petkov wrote:
> On Thu, Mar 09, 2023 at 04:59:52PM +0000, Edgecombe, Rick P wrote:
> > Ah, I see what you were saying now. It looks like it will work to
> > me if
> > you think it is better stylistically.
> 
> Yeah, having a function return an error *and* an I/O parameter at the
> same time is more complicated and error prone instead of when you
> have
> a single retval and only input parameters.
> 
> I'd say.

Yea, I agree it's better this way, and at first I just missed your
point. By "if you think it's better", I just meant that if someone told
me to do it the other way I wouldn't die on the hill.
  

Patch

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..3c2903bbb456 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@ 
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 extern void fpu__drop(struct fpu *fpu);
-extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int  fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+		      unsigned long shstk_addr);
 extern void fpu_flush_thread(void);
 
 /*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e01aa74a6de7..9714f08d941b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -147,6 +147,8 @@  do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 2b1f7c9b9995..1399f4df098b 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -15,11 +15,18 @@  struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size,
+			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size,
+					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index f851558b673f..bc3de4aeb661 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,41 @@  static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 	}
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+	struct cet_user_state *xstate;
+
+	/* If ssp update is not needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+
+	return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
+{
+	return 0;
+}
+#endif
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +646,12 @@  int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer, in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b650cde3f64d..bf703f53fa49 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -48,6 +48,7 @@ 
 #include <asm/frame.h>
 #include <asm/unwind.h>
 #include <asm/tdx.h>
+#include <asm/shstk.h>
 
 #include "process.h"
 
@@ -119,6 +120,7 @@  void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -140,6 +142,7 @@  int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long shstk_addr = 0;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -174,7 +177,13 @@  int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/* Allocate a new shadow stack for pthread if needed */
+	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
+				       &shstk_addr);
+	if (ret)
+		return ret;
+
+	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
@@ -220,6 +229,13 @@  int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
+	/*
+	 * If copy_thread() if failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);
+
 	return ret;
 }
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3cb85224d856..1d30295e0066 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@  static unsigned long alloc_shstk(unsigned long size)
 	unsigned long addr, unused;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
 		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 
 	mmap_write_unlock(mm);
@@ -126,6 +126,39 @@  void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size, unsigned long *shstk_addr)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long addr, size;
+
+	/*
+	 * If shadow stack is not enabled on the new thread, skip any
+	 * switch to a new shadow stack.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+	size = adjust_shstk_size(stack_size);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	shstk->base = addr;
+	shstk->size = size;
+
+	*shstk_addr = addr + size;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +167,13 @@  void shstk_free(struct task_struct *tsk)
 	    !features_enabled(ARCH_SHSTK_SHSTK))
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	unmap_shadow_stack(shstk->base, shstk->size);