[GIT,PULL] x86/mm for 6.4

Message ID 20230427225647.1101172-1-dave.hansen@linux.intel.com
State New
Series [GIT,PULL] x86/mm for 6.4

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

Commit Message

Dave Hansen April 27, 2023, 10:56 p.m. UTC
  Hi Linus,

Please pull some x86/mm changes for 6.4.  The content here is solely a
new revision of Kirill's Linear Address Masking implementation.

You had some concerns[1] with the last version about its (lack of)
locking while the feature was being enabled in multithreaded programs.
It's been fixed up since then to allow enabling LAM only when the
process is single-threaded.  It's also accumulated a few random fixes
and cleanups.

This conflicts with the shadow stack work (x86/shstk) that I sent
earlier this week.  This is no surprise since LAM and shadow stacks both
add prctl()s and selftests and touch the same headers.

Despite there being a few sites of conflict, the merge is logically
straightforward.  I've included a suggested resolution.  Both Kirill
(LAM) and Rick (shadow stacks) tested the result and confirmed that
nothing broke.

1. https://lore.kernel.org/all/CAHk-=wi=TY3Kte5Z1_nvfcsEh+rcz86pYnzeASw=pbG9QtpJEQ@mail.gmail.com/

--

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

for you to fetch changes up to 97740266de26e5dfe6e4fbecacb6995b66c2e378:

  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside (2023-04-06 13:45:06 -0700)

----------------------------------------------------------------
Add support for the new Linear Address Masking CPU feature.  This is similar
to ARM's Top Byte Ignore and allows userspace to store metadata in some
bits of pointers without masking it out before use.
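
As an aside, here is a minimal userspace sketch of the intended usage via
the arch_prctl() interface added in this series (error handling trimmed;
the '6' asks for six tag bits, i.e. LAM_U57, and enabling must happen
while the process is still single-threaded):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define ARCH_ENABLE_TAGGED_ADDR 0x4002	/* from asm/prctl.h */

  int main(void)
  {
          /* Ask the kernel for 6 tag bits (LAM_U57). */
          if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL))
                  return 1;

          uint64_t *p = malloc(sizeof(*p));
          *p = 42;

          /* Plant metadata in bits 62:57; hardware masks it on access. */
          uint64_t *tagged = (uint64_t *)((uint64_t)p | (0x2aUL << 57));
          printf("%lu\n", (unsigned long)*tagged);	/* prints 42 */

          free(p);
          return 0;
  }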

----------------------------------------------------------------
Kirill A. Shutemov (14):
      x86/mm: Rework address range check in get_user() and put_user()
      x86: Allow atomic MM_CONTEXT flags setting
      x86: CPUID and CR3/CR4 flags for Linear Address Masking
      x86/mm: Handle LAM on context switch
      mm: Introduce untagged_addr_remote()
      x86/uaccess: Provide untagged_addr() and remove tags before address check
      x86/mm: Reduce untagged_addr() overhead for systems without LAM
      x86/mm: Provide arch_prctl() interface for LAM
      mm: Expose untagging mask in /proc/$PID/status
      iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
      x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
      selftests/x86/lam: Add test cases for LAM vs thread creation
      x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
      x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside

Weihong Zhang (5):
      selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
      selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
      selftests/x86/lam: Add io_uring test cases for linear-address masking
      selftests/x86/lam: Add inherit test cases for linear-address masking
      selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

 arch/arm64/include/asm/mmu_context.h        |    6 +
 arch/sparc/include/asm/mmu_context_64.h     |    6 +
 arch/sparc/include/asm/uaccess_64.h         |    2 +
 arch/x86/Kconfig                            |   11 +
 arch/x86/entry/vsyscall/vsyscall_64.c       |    2 +-
 arch/x86/include/asm/cpufeatures.h          |    1 +
 arch/x86/include/asm/disabled-features.h    |    8 +-
 arch/x86/include/asm/mmu.h                  |   18 +-
 arch/x86/include/asm/mmu_context.h          |   49 +-
 arch/x86/include/asm/processor-flags.h      |    2 +
 arch/x86/include/asm/tlbflush.h             |   48 +-
 arch/x86/include/asm/uaccess.h              |   58 +-
 arch/x86/include/uapi/asm/prctl.h           |    5 +
 arch/x86/include/uapi/asm/processor-flags.h |    6 +
 arch/x86/kernel/process.c                   |    6 +
 arch/x86/kernel/process_64.c                |   68 +-
 arch/x86/kernel/traps.c                     |    6 +-
 arch/x86/lib/getuser.S                      |   83 +-
 arch/x86/lib/putuser.S                      |   54 +-
 arch/x86/mm/init.c                          |    5 +
 arch/x86/mm/tlb.c                           |   53 +-
 drivers/iommu/iommu-sva.c                   |    8 +-
 drivers/vfio/vfio_iommu_type1.c             |    2 +-
 fs/proc/array.c                             |    7 +
 fs/proc/task_mmu.c                          |    9 +-
 include/linux/ioasid.h                      |    9 -
 include/linux/mm.h                          |   11 -
 include/linux/mmu_context.h                 |   14 +
 include/linux/sched/mm.h                    |    8 +-
 include/linux/uaccess.h                     |   22 +
 mm/gup.c                                    |    4 +-
 mm/madvise.c                                |    5 +-
 mm/migrate.c                                |   11 +-
 tools/testing/selftests/x86/Makefile        |    2 +-
 tools/testing/selftests/x86/lam.c           | 1241 +++++++++++++++++++++++++++
 35 files changed, 1701 insertions(+), 149 deletions(-)
 create mode 100644 tools/testing/selftests/x86/lam.c
+++ b/arch/x86/include/asm/disabled-features.h
@@@ -120,8 -126,8 +132,8 @@@
  #define DISABLED_MASK9	(DISABLE_SGX)
  #define DISABLED_MASK10	0
  #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- 			 DISABLE_CALL_DEPTH_TRACKING)
+ 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 -#define DISABLED_MASK12	0
 +#define DISABLED_MASK12	(DISABLE_LAM)
  #define DISABLED_MASK13	0
  #define DISABLED_MASK14	0
  #define DISABLED_MASK15	0
+++ b/arch/x86/include/uapi/asm/prctl.h
@@@ -20,9 -20,16 +20,21 @@@
  #define ARCH_MAP_VDSO_32		0x2002
  #define ARCH_MAP_VDSO_64		0x2003
  
+ /* Don't use 0x3001-0x3004 because of old glibcs */
+ 
 +#define ARCH_GET_UNTAG_MASK		0x4001
 +#define ARCH_ENABLE_TAGGED_ADDR		0x4002
 +#define ARCH_GET_MAX_TAG_BITS		0x4003
 +#define ARCH_FORCE_TAGGED_SVA		0x4004
 +
+ #define ARCH_SHSTK_ENABLE		0x5001
+ #define ARCH_SHSTK_DISABLE		0x5002
+ #define ARCH_SHSTK_LOCK			0x5003
+ #define ARCH_SHSTK_UNLOCK		0x5004
+ #define ARCH_SHSTK_STATUS		0x5005
+ 
+ /* ARCH_SHSTK_ features bits */
+ #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+ #define ARCH_SHSTK_WRSS			(1ULL <<  1)
+ 
  #endif /* _ASM_X86_PRCTL_H */
+++ b/arch/x86/kernel/process.c
@@@ -48,7 -48,7 +48,8 @@@
  #include <asm/frame.h>
  #include <asm/unwind.h>
  #include <asm/tdx.h>
 +#include <asm/mmu_context.h>
+ #include <asm/shstk.h>
  
  #include "process.h"
  
+++ b/arch/x86/kernel/process_64.c
@@@ -875,22 -831,13 +877,28 @@@ long do_arch_prctl_64(struct task_struc
  # endif
  	case ARCH_MAP_VDSO_64:
  		return prctl_map_vdso(&vdso_image_64, arg2);
 +#endif
 +#ifdef CONFIG_ADDRESS_MASKING
 +	case ARCH_GET_UNTAG_MASK:
 +		return put_user(task->mm->context.untag_mask,
 +				(unsigned long __user *)arg2);
 +	case ARCH_ENABLE_TAGGED_ADDR:
 +		return prctl_enable_tagged_addr(task->mm, arg2);
 +	case ARCH_FORCE_TAGGED_SVA:
 +		set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
 +		return 0;
 +	case ARCH_GET_MAX_TAG_BITS:
 +		if (!cpu_feature_enabled(X86_FEATURE_LAM))
 +			return put_user(0, (unsigned long __user *)arg2);
 +		else
 +			return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
  #endif
+ 	case ARCH_SHSTK_ENABLE:
+ 	case ARCH_SHSTK_DISABLE:
+ 	case ARCH_SHSTK_LOCK:
+ 	case ARCH_SHSTK_UNLOCK:
+ 	case ARCH_SHSTK_STATUS:
+ 		return shstk_prctl(task, option, arg2);
  	default:
  		ret = -EINVAL;
  		break;
+++ b/fs/proc/array.c
@@@ -424,11 -423,11 +424,16 @@@ static inline void task_thp_status(stru
  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
  }
  
 +static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm)
 +{
 +	seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
 +}
 +
+ __weak void arch_proc_pid_thread_features(struct seq_file *m,
+ 					  struct task_struct *task)
+ {
+ }
+ 
  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
  			struct pid *pid, struct task_struct *task)
  {
+++ b/tools/testing/selftests/x86/Makefile
@@@ -18,7 -18,7 +18,7 @@@ TARGETS_C_32BIT_ONLY := entry_from_vm8
  			test_FCMOV test_FCOMI test_FISTTP \
  			vdso_restorer
  TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- 			corrupt_xstate_header amx lam
 -			corrupt_xstate_header amx test_shadow_stack
++			corrupt_xstate_header amx test_shadow_stack lam
  # Some selftests require 32bit support enabled also on 64bit systems
  TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
  

Comments

pr-tracker-bot@kernel.org April 28, 2023, 5:23 p.m. UTC | #1
The pull request you sent on Thu, 27 Apr 2023 15:56:47 -0700:

> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/22b8cc3e78f5448b4c5df00303817a9137cd663f

Thank you!
  
Linus Torvalds April 28, 2023, 8:07 p.m. UTC | #2
On Thu, Apr 27, 2023 at 3:57 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Please pull some x86/mm changes for 6.4.  The content here is solely a
> new revision of Kirill's Linear Address Masking implementation.

So I was waiting for this for my final piece of the x86 user copy
changes, so here goes...

I think we should now make 'access_ok()' do the same thing that
get/put_user() do with the LAM code: only worry about the sign bit.
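
Conceptually, something like this (just a sketch of the shape; the
actual patch may differ in detail):

  static inline bool __access_ok(const void __user *ptr, unsigned long size)
  {
          unsigned long addr = (unsigned long)ptr;

          /* Small constant size: the sign bit of the start is enough,
           * since anything that overruns by less than a page lands in
           * the guard hole below the kernel range. */
          if (__builtin_constant_p(size) && size <= PAGE_SIZE)
                  return (long)addr >= 0;

          /* Otherwise check the whole range stays in the low half. */
          return addr + size >= addr && (long)(addr + size) >= 0;
  }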

So here's my suggested change on top of the current tree. Comments?

PeterZ also added to the cc, because he's the source of that
WARN_ON_IN_IRQ() in the x86 'access_ok()' macro. That's the only
reason x86 has its own copy of that.

I wonder if that WARN_ON_IN_IRQ() should just be removed, or perhaps
moved into the generic code in <asm-generic/access_ok.h>?

                  Linus
  
Linus Torvalds April 28, 2023, 8:15 p.m. UTC | #3
On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's my suggested change on top of the current tree. Comments?

Oh, and I wanted to particularly mention that

    We could probably just do that "check only starting address" for any
    arbitrary range size: realistically all kernel accesses to user space
    will be done starting at the low address.  But let's leave that kind of
    optimization for later.  As it is, this already allows us to generate
    simpler code and not worry about any tag bits in the address.

part of the commit log.

Right now that patch only simplifies the range check for when the
compiler statically knows that the range is small (which does happen,
but not very often, because 'copy_to/from_user()' isn't inlined on
x86-64, so the compiler doesn't actually see the constant size case
that is very common there).

However, that "check the end of the range" is sometimes actually
fairly complicated code, and it would be nice to drop that entirely.

See for example the fs/readdir.c case, where the length of the access
is kind of annoying:

        if (!user_write_access_begin(dirent,
                        (unsigned long)(dirent->d_name + namlen + 1) -
                                (unsigned long)dirent))
                goto efault;

and there really isn't any actual reason to check the end of the
access on x86: if the beginning address has the high bit clear, it
doesn't really matter what the end is, because we'll either have a
good area, or we'll fault in the non-canonical area even if the sign
changes.

So being careful about the range is kind of annoying, when we don't
really need it.

                  Linus
  
Kirill A. Shutemov April 29, 2023, 12:38 a.m. UTC | #4
On Fri, Apr 28, 2023 at 01:15:33PM -0700, Linus Torvalds wrote:
> On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So here's my suggested change on top of the current tree. Comments?
> 
> Oh, and I wanted to particularly mention that
> 
>     We could probably just do that "check only starting address" for any
>     arbitrary range size: realistically all kernel accesses to user space
>     will be done starting at the low address.  But let's leave that kind of
>     optimization for later.  As it is, this already allows us to generate
>     simpler code and not worry about any tag bits in the address.
> 
> part of the commit log.
> 
> Right now that patch only simplifies the range check for when the
> compiler statically knows that the range is small (which does happen,
> but not very often, because 'copy_to/from_user()' isn't inlined on
> x86-64, so the compiler doesn't actually see the constant size case
> that is very common there).

BTW, I think the static check can be relaxed. Checking size against
PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
paging) guard hole at the beginning of the kernel address space.

> However, that "check the end of the range" is sometimes actually
> fairly complicated code, and it would be nice to drop that entirely.
> 
> See for example the fs/readdir.c case, where the length of the access
> is kind of annoying:
> 
>         if (!user_write_access_begin(dirent,
>                         (unsigned long)(dirent->d_name + namlen + 1) -
>                                 (unsigned long)dirent))
>                 goto efault;
> 
> and there really isn't any actual reason to check the end of the
> access on x86: if the beginning address has the high bit clear, it
> doesn't really matter what the end is, because we'll either have a
> good area, or we'll fault in the non-canonical area even if the sign
> changes.
> 
> So being careful about the range is kind of annoying, when we don't
> really need it.

Hm. Is there anybody who accesses high to low after the check (glibc
memcpy() bug flashbacks)? Or not in any particular order?
  
Linus Torvalds April 29, 2023, 1:04 a.m. UTC | #5
On Fri, Apr 28, 2023 at 5:38 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> BTW, I think the static check can be relaxed. Checking size against
> PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of the kernel address space.

So I don't worry about the size per se - we just don't have any
constant sized accesses that are bigger than a page.

The constant-sized case is for things like structures being copied to
user space.
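
For instance, a pattern like this hypothetical helper, where the
compiler can see the small fixed size at the access_ok() site:

  /* hypothetical: any small fixed-size copy out to user space */
  static int put_ts(struct __kernel_timespec __user *uts,
                    const struct __kernel_timespec *kts)
  {
          /* sizeof(*uts) is a compile-time constant well under
           * PAGE_SIZE, so only the start address needs checking */
          if (!access_ok(uts, sizeof(*uts)))
                  return -EFAULT;
          return __copy_to_user(uts, kts, sizeof(*uts)) ? -EFAULT : 0;
  }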

And having a big gap is nice for the syzkaller case, although I don't
think that GP fault has triggered lately (or ever, I don't remember).
Having random system call arguments that trigger "oh, this is in the
non-canonical region" is a good thing.

> > So being careful about the range is kind of annoying, when we don't
> > really need it.
>
> Hm. Is there anybody who accesses high to low after the check (glibc
> memcpy() bug flashbacks)? Or not in any particular order?

Yeah, I can't think of a single case, which is why it seems so silly
to even bother.

Almost all real life cases end up being limited by things like the
page/folio size.

We do have exceptions, like module loading etc that might copy a
bigger area from user space, but no, we don't have any backwards
copies.

So you'd almost have to have some "access_ok()" followed by random
access with a user-controlled offset, and that seems nonsensical and
fundamentally impossible anyway.

But just because I can't think of it, and go "that would be insane"
doesn't mean that some driver ioctl interface might not try it.

Which is why I think having others look at it would be a good idea.

               Linus
  
Dave Hansen May 2, 2023, 3:42 p.m. UTC | #6
On 4/28/23 17:38, Kirill A. Shutemov wrote:
> BTW, I think the static check can be relaxed. Checking size against
> PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of the kernel address space.

Whatever we relax it to, let's make sure we get a note in
Documentation/x86/x86_64/mm.rst.  But that's totally minor and we can
fix it up later.

Has anyone seen any actual code generation difference between:

	return (long)ptr >= 0;

and

        return !((unsigned long)ptr & (1UL<<(BITS_PER_LONG-1)));

?  I'm seeing gcc generate the same code for both the <=PAGE_SIZE side
and the 'sum' side.

It's longer, but I'd rather read the explicit "check bit 63" than the
positive/negative address space thing.  I certainly grok both, but have
to think through the "(long)ptr >= 0" check every time.

I guess it also wouldn't matter as much either if we hid it in a helper
like the attached patch and I didn't have to read it twice. ;)
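
Something like this, presumably (a sketch; the actual attachment isn't
shown here):

  /* true iff the pointer is in the low, user half: bit 63 clear */
  static inline bool valid_user_address(const void __user *ptr)
  {
          return (long)ptr >= 0;
  }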
  
Linus Torvalds May 2, 2023, 4 p.m. UTC | #7
On Tue, May 2, 2023 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Has anyone seen any actual code generation difference between:
>
>         return (long)ptr >= 0;
>
> and
>
>         return !((unsigned long)ptr & (1UL<<(BITS_PER_LONG-1)));

No, as far as I know, they both generate the same code.

> It's longer, but I'd rather read the explicit "check bit 63" than the
> positive/negative address space thing.  I certainly grok both, but have
> to think through the "(long)ptr >= 0" check every time.

I'm very much the other way. I think it's much clearer to say "check
the sign bit".

Doing the explicit bit check means that I have to look at what the bit
number is, and that is a much more complex expression.

In fact, I'd find it easier to read

       return !((unsigned long)ptr & (1UL << 63));

just because then you go "Oh, checking bit 63" without having to parse
the expression.

But even then I find the '!' is easy to miss, so you really have to parse it.

But:

> I guess it also wouldn't matter as much either if we hid it in a helper
> like the attached patch and I didn't have to read it twice. ;)

Yeah, I think that's a good solution.

                 Linus
  
Linus Torvalds May 2, 2023, 8:14 p.m. UTC | #8
On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > I guess it also wouldn't matter as much either if we hid it in a helper
> > like the attached patch and I didn't have to read it twice. ;)
>
> Yeah, I think that's a good solution.

Hmm. And as I was rebasing the patch to fix up my patch, I realized
that the current -git top-of-tree state is actually broken.

That

  #define access_ok(addr, size)                                           \
  ({                                                                      \
          WARN_ON_IN_IRQ();                                               \
          likely(__access_ok(untagged_addr(addr), size));                 \
  })

is actually *wrong* in two ways.

Now, in my original patch, I added a comment about how that
"WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.

I ended up going back in time to see why it was added, and I think it
was added because we used to access 'current' in access_ok(), due to
it using that user_addr_max() thing:

        likely(!__range_not_ok(addr, size, user_addr_max()));

but that was all removed by the set_fs() removal by Christoph Hellwig.

So there is actually nothing task-related in "access_ok()" any more,
and any warning about using it in the wrong context is just bogus.
That warning simply shouldn't exist any more (or maybe it should be in
a different place, like the actual user copy functions)

But that's actually not the problem with the above.

No, the problem is that probably *because* "access_ok()" has that
warning, not all users use "access_ok()" at all. We have places that
use "__access_ok()" instead. Like copy_from_nmi().

So now copy_from_nmi() doesn't do the untagging, so if you were to use
tagged pointers for the stack, you'd not get stack traces.

End result: I think that

 (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems

 (b) the current "use untagged_addr() in access_ok()" model is also broken

and my patch - which was meant to just improve code generation -
actually fixes this.

            Linus
  
Dave Hansen May 3, 2023, 12:53 a.m. UTC | #9
On 5/2/23 13:14, Linus Torvalds wrote:
> No, the problem is that probably *because* "access_ok()" has that
> warning, not all users use "access_ok()" at all. We have places that
> use "__access_ok()" instead. Like copy_from_nmi().
> 
> So now copy_from_nmi() doesn't do the untagging, so if you were to use
> tagged pointers for the stack, you'd not get stack traces.
> 
> End result: I think that
> 
>  (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems
> 
>  (b) the current "use untagged_addr() in access_ok()" model is also broken

Ugh, yes.

The fallout seems limited to (probably) perf and tracing poking at user
stack frames.  But, yes, it definitely looks broken there.

While I bet we could shoehorn the existing tlbstate checks into the
__access_ok() sites, I also vastly prefer the high bit checks in
access_ok() instead.  The less state we have to consult, the better.

Once the WARN_ON_IN_IRQ() is gone, it seems like it's just a matter of
collapsing __access_ok() into access_ok() and converting the (~3) callers.
  
Linus Torvalds May 3, 2023, 1:17 a.m. UTC | #10
On Tue, May 2, 2023 at 5:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> The fallout seems limited to (probably) perf and tracing poking at user
> stack frames.  But, yes, it definitely looks broken there.

So I ended up doing more cleanups - moving the 64-bit specific code to
'uaccess_64.h' etc.

And in the process I found another broken
thing: __untagged_addr_remote() is very very buggy.

The reason?

        long sign = addr >> 63;

that does *not* do at all what '__untagged_addr()' does, because while
'sign' is a signed long, 'addr' is an *unsigned* long.

So the actual shift ends up being done as an unsigned shift, and then
just the result is assigned to a signed variable.

End result? 'sign' ends up being 0 for user space (intentional) or 1
for kernel space (not intentional).

It's not 0 or ~0 as in __untagged_addr(), which uses a signed
shift in the inline asm ('sar').
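
The difference is easy to demonstrate in isolation:

  #include <stdio.h>

  int main(void)
  {
          unsigned long addr = 0xffff888000001000UL; /* kernel address */

          long buggy = addr >> 63;	/* logical shift:    1  */
          long fixed = (long)addr >> 63;	/* arithmetic shift: ~0 */

          /* prints "1 ffffffffffffffff" */
          printf("%lx %lx\n", buggy, fixed);
          return 0;
  }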

That said, while this is all horribly buggy, I'm not actually 100%
sure that we care too much about untagging kernel addresses. So it's
an error case that doesn't get marked as an error.

So I guess in *practice* the horribly buggy code is actually just a
small harmless bug, and causes an invalid (kernel) address that gets
passed in to look like a possibly valid user address after all.

HOWEVER.

Why does it do that "shift-by-63" game there, instead of making
tlbstate_untag_mask just have bit #63 always set?

Untagging a kernel address is not a sensible operation, so the only
thing you want is to *keep* a kernel address as a bad address. You
don't have to actually keep it a *valid* kernel address, it just
should not become a (possibly valid) user address after untagging.

Hmmm? Am I missing something?

                  Linus
  
Peter Zijlstra May 3, 2023, 8:01 a.m. UTC | #11
On Tue, May 02, 2023 at 01:14:33PM -0700, Linus Torvalds wrote:
> On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > > I guess it also wouldn't matter as much either if we hid it in a helper
> > > like the attached patch and I didn't have to read it twice. ;)
> >
> > Yeah, I think that's a good solution.
> 
> Hmm. And as I was rebasing the patch to fix up my patch, I realized
> that the current -git top-of-tree state is actually broken.
> 
> That
> 
>   #define access_ok(addr, size)                                           \
>   ({                                                                      \
>           WARN_ON_IN_IRQ();                                               \
>           likely(__access_ok(untagged_addr(addr), size));                 \
>   })
> 
> is actually *wrong* in two ways.
> 
> Now, in my original patch, I added a comment about how that
> "WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.
> 
> I ended up going back in time to see why it was added, and I think it
> was added because we used to access 'current' in access_ok(), due to
> it using that user_addr_max() thing:
> 
>         likely(!__range_not_ok(addr, size, user_addr_max()));
> 
> but that was all removed by the set_fs() removal by Christoph Hellwig.

So I had a poke around, trying to figure out where it came from, and
yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame
pointer") is the reason that WARN_ON_IN_IRQ() thing got added.
  
Linus Torvalds May 3, 2023, 4:38 p.m. UTC | #12
On Tue, May 2, 2023 at 6:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And in the process I found another broken
> thing: __untagged_addr_remote() is very very buggy.
>
> The reason?
>
>         long sign = addr >> 63;
>
> that does *not* do at all what '__untagged_addr()' does, because while
> 'sign' is a signed long, 'addr' is an *unsigned* long.
>
> So the actual shift ends up being done as an unsigned shift, and then
> just the result is assigned to a signed variable.
>
> End result? 'sign' ends up being 0 for user space (intentional) or 1
> for kernel space (not intentional).

Looking around, this same bug used to exist for the normal
(non-remote) case too, until it was accidentally fixed when changing
that to use inline asm and the alternatives code.

At that point the non-remote case got an explicit 'sar' instruction,
and the result really was ~0 for kernel mode addresses.

> Why does it do that "shift-by-63" game there, instead of making
> tlbstate_untag_mask just have bit #63 always set?

And it turns out that bit #63 really _is_ always set, so I think the
solution to this all is to remove the sign games in untag_addr()
entirely.

Untagging a kernel address will "corrupt" it, but it will stay a
kernel address (well, it will stay a "high bit set" address), which is
all we care about anyway.

If somebody actually tries to untag a kernel address, that would be a
bug anyway, as far as I can tell.

So I'm going to just remove the 'sign' games entirely. They are
completely broken in 'untagged_addr_remote()', they _used_ to be
completely broken in 'untagged_addr()', and it looks like it's all
unnecessary.

           Linus
  
Linus Torvalds May 3, 2023, 4:38 p.m. UTC | #13
On Wed, May 3, 2023 at 1:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I had a poke around, trying to figure out where it came from, and
> yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame
> pointer") is the reason that WARN_ON_IN_IRQ() thing got added.

I added a pointer to that commit in the commit message, and took the
above to mean 'Ack' from you.

               Linus
  
Dave Hansen May 3, 2023, 4:44 p.m. UTC | #14
On 5/3/23 09:38, Linus Torvalds wrote:
>> Why does it do that "shift-by-63" game there, instead of making
>> tlbstate_untag_mask just have bit #63 always set?
> And it turns out that bit #63 really _is_ always set, so I think the
> solution to this all is to remove the sign games in untag_addr()
> entirely.

Yes, there are only two possible values right now, both of which have
bit 63 set:

LAM off: mm->context.untag_mask = -1UL;
LAM on:  mm->context.untag_mask = ~GENMASK(62, 57);
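
So for a tagged user pointer the AND just clears the tag bits, e.g.
(illustrative values):

  0x7e007fffffe01000 & ~GENMASK(62, 57) = 0x00007fffffe01000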

> Untagging a kernel address will "corrupt" it, but it will stay a
> kernel address (well, it will stay a "high bit set" address), which is
> all we care about anyway.
> 
> If somebody actually tries to untag a kernel address, that would be a
> bug anyway, as far as I can tell.

Is it a bug?  The do_madvise() path, for instance, is passing a value in
there that came right from userspace.

> So I'm going to just remove the 'sign' games entirely. They are
> completely broken in 'untagged_addr_remote()', they _used_ to be
> completely broken in 'untagged_addr()', and it looks like it's all
> unnecessary.

Yes, it looks completely superfluous.
  
Linus Torvalds May 3, 2023, 4:51 p.m. UTC | #15
On Wed, May 3, 2023 at 9:45 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/3/23 09:38, Linus Torvalds wrote:
> > If somebody actually tries to untag a kernel address, that would be a
> > bug anyway, as far as I can tell.
>
> Is it a bug?  The do_madvise() path, for instance, is passing a value in
> there that came right from userspace.

That's still a "user address" - just not a *valid* one.

So we do not want to mask the high bit off - because that is what will
catch people later doing things like vma address range comparisons on
it and notice "that's not a valid address", but it's also not a
"kernel address" that we need to preserve as such.

So yeah, it's a bit confusing in that it's _also_ true that "kernel
addresses have the high bit set" and "user addresses have the high bit
clear", and I'm basically using two different semantics for "kernel
address".

IOW: what I mean by "it's not valid to do 'untagged_addr()' on a
kernel address" is that you can't take a (valid) kernel address, do
'untagged_addr()' on it, and expect it to still work as a kernel
address.

But at the same time you *are* supposed to be able to use
'untagged_addr()' on a - untrusted and possibly invalid - user
pointer, and it's supposed to end up having the tag bits clear and
still be usable as a user pointer. And yet still also be caught by any
validity checks (ie a high bit set would never be a valid user
pointer, not even after doing 'untagged_addr()' on it).

                Linus
  
Linus Torvalds May 3, 2023, 5:54 p.m. UTC | #16
On Wed, May 3, 2023 at 9:38 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm going to just remove the 'sign' games entirely. They are
> completely broken in 'untagged_addr_remote()', they _used_ to be
> completely broken in 'untagged_addr()', and it looks like it's all
> unnecessary.

Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch.

I think it's all good, but it would be really nice to hear it's been
tested on a setup that actually has LAM (simulator? or maybe there is
actual hw inside Intel)

And any other commentary is appreciated,

              Linus
  
Peter Zijlstra May 3, 2023, 7:01 p.m. UTC | #17
On Wed, May 03, 2023 at 10:54:38AM -0700, Linus Torvalds wrote:
> On Wed, May 3, 2023 at 9:38 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I'm going to just remove the 'sign' games entirely. They are
> > completely broken in 'untagged_addr_remote()', they _used_ to be
> > completely broken in 'untagged_addr()', and it looks like it's all
> > unnecessary.
> 
> Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch.
> 
> I think it's all good, but it would be really nice to hear it's been
> tested on a setup that actually has LAM (simulator? or maybe there is
> actual hw inside Intel)
> 
> And any other commentary is appreciated,

Looks sane from a first reading. But I'll try and have another look
tomorrow.

Also, per how 2's complement is constructed, 0 has to be in the positive
space for it to be 'half'. Also, typically 0 is included in N and the
traditional 'positive integers' set is then written as N \ {0}, but
that's somewhat modern and a lot of variation exists -- although
typically books tend to specify where they stand on that issue.

I suppose that's a very long-winded way of saying that yes, of course 0
is a positive number :-)
  
Linus Torvalds May 3, 2023, 7:19 p.m. UTC | #18
On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I suppose that's a very long-winded way of saying that yes, of course 0
> is a positive number :-)

Well, I do consider it as such, but I guess I took all my math classes
when people still considered negative, zero and positive to be three
disjoint sets, and 'non-negative' was required for rigor.

Some googling around says that a lot of people still think that, and
that it might even be language-specific.

I think the commit commentary about "ok, strictly non-negative" might
still be relevant. At least to some people, and at least for
sticklers.

Also, I do consider 0 to be part of ℕ, although I wouldn't consider
that to be an argument about "positive" at all.

The argument for 0 in ℕ would be that without it, you don't have an
identity element for addition, which would arguably make natural
numbers kind of broken as a set.

            Linus
  
Kirill A. Shutemov May 4, 2023, 6:28 a.m. UTC | #19
On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote:
> > Why does it do that "shift-by-63" game there, instead of making
> > tlbstate_untag_mask just have bit #63 always set?
> 
> And it turns out that bit #63 really _is_ always set, so I think the
> solution to this all is to remove the sign games in untag_addr()
> entirely.

Untagging a kernel pointer with LAM enabled will land it in the canonical
hole, which is safe as it leads to #GP on dereference.
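
For illustration, with the LAM_U57 mask:

  0xffff888000001000 & ~GENMASK(62, 57) = 0x81ff888000001000

which is non-canonical, so dereferencing it takes #GP.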

> Untagging a kernel address will "corrupt" it, but it will stay a
> kernel address (well, it will stay a "high bit set" address), which is
> all we care about anyway.

The interesting case to consider is untagging a kernel pointer when LAM_U48
is enabled (not part of the current LAM enabling). LAM_U48 would make the
untagging mask wider -- ~GENMASK(62, 48). With 5-level paging and LAM_SUP
enabled (also not supported yet), untagging a kernel pointer may transform
it into another valid kernel pointer.

So we cannot rely on #GP as a backstop here. The kernel has to exclude
kernel pointers by other means. It can be fun to debug.

Looks like you are not worried about this kind of corruption. Hm.

Maybe it's worth indicating in the helper name that it is intended only
for userspace addresses? Rename it to untagged_uaddr() or something?
  
Dave Hansen May 4, 2023, 3:25 p.m. UTC | #20
On 5/3/23 23:28, Kirill A. Shutemov wrote:
>> Untagging a kernel address will "corrupt" it, but it will stay a
>> kernel address (well, it will stay a "high bit set" address), which is
>> all we care about anyway.
> The interesting case to consider is untagging a kernel pointer when LAM_U48
> is enabled (not part of the current LAM enabling). LAM_U48 would make the
> untagging mask wider -- ~GENMASK(62, 48). With 5-level paging and LAM_SUP
> enabled (also not supported yet), untagging a kernel pointer may transform
> it into another valid kernel pointer.
> 
> So we cannot rely on #GP as a backstop here. The kernel has to exclude
> kernel pointers by other means. It can be fun to debug.

Yeah, I have the feeling that we're really going to need a pair of
untagging functions once we get to doing kernel LAM for a _bunch_ of
reasons.

Just as a practical matter, I think we should OR bits into the mask on
the kernel side, effectively:

unsigned long untag_kernel_addr(unsigned long addr)
{
	return addr | kernel_untag_mask;
}

and kernel_untag_mask should have bit 63 *clear*.

That way the pointers that have gone through untagging won't look silly.
 If you untag VMALLOC_END or something, it'll still look like the
addresses we have in mm.rst.

Also, it'll be impossible to have the mask turn a userspace address into
a kernel one.
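
For example (hypothetical mask value):

  kernel_untag_mask = GENMASK(62, 57);	/* bit 63 clear */

  0xffffc90000001000 | kernel_untag_mask = 0xffffc90000001000
  0x00007f0000001000 | kernel_untag_mask = 0x7e007f0000001000

The kernel pointer is unchanged (its tag-range bits are already all
ones), and the user pointer still has bit 63 clear.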

Last, we can add some debugging in there, probably conditional on some
mm debugging options like:

	if (WARN_ON_ONCE(!valid_user_address(addr)))
		return 0;

It's kinda like "void __user *" versus "void *".  The __user ones can
*absolutely* point anywhere, user or kernel.  That's why we can't WARN()
in the untagged_addr() function that takes user pointers.

But "void *" can only point to the kernel.  It has totally different rules.

We should probably also do something like the attached patch sooner
rather than later.  'untag_mask' really is a misleading name for a mask
that gets applied only to user addresses.
  
Linus Torvalds May 4, 2023, 5:10 p.m. UTC | #21
On Wed, May 3, 2023 at 11:28 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote:
> > > Why does it do that "shift-by-63" game there, instead of making
> > > tlbstate_untag_mask just have bit #63 always set?
> >
> > And it turns out that bit #63 really _is_ always set, so I think the
> > solution to this all is to remove the sign games in untag_addr()
> > entirely.
>
> Untagging kernel pointer with LAM enabled will land it in the canonical
> hole which is safe as it leads to #GP on dereference.

You are entirely missing the point.

The GP fault does *NOT MATTER*.

Think of 'untagged_addr()' as a companion to - but absolutely *NOT* a
replacement for - 'access_ok()'.

You have three distinct cases:

 (a) user-supplied valid user address

 (b) user-supplied invalid user address (i.e. high bit set)

 (c) actual kernel address

and 'untagged_addr()' and 'access_ok()' work on the same basic input
domain: cases (a) and (b).

And the important thing for 'untagged_addr()' is that in case (a) it
needs to remove the tag bits, and in case (b) needs to *keep* the
address as an invalid user address.

Note that it does not need to keep the value the *same*. Nobody cares.

And also note that the resulting pointer may or may not GP-fault. Nobody cares.

Just to hit this home with a very explicit example, think of the case
where the kernel config for address masking isn't even enabled, and
'untagged_addr()' is a 1:1 map with no changes.

Doing 'untagged_addr()' on a valid user address results in a valid
user address. And doing it on an invalid user address results in an
invalid user address. GOOD.

Note that doing 'untagged_addr()' on an invalid user address does NOT
IN ANY WAY make that invalid address somehow "safe" and cause a GP
fault. Sure, it _might_ GP-fault. Or it might not. It might point to
random kernel data.

Verification of the address is simply not the job of
'untagged_addr()'. Never has been, never will be, and fundamentally
*cannot* be, since for the forseeable future 'untagged_addr()' is a
no-op for every single user outside of Intel.

Verification is separate. The verification is never "let's randomly
just access this pointer and see if it gets a GP-fault". No. We have a
user pointer, it needs *checking*. It needs to have something like
"use lookup_vma() to look up the vma that is associated with that
address". But it could also be something like "just do the range check
in GUP".

And that's why "keep an invalid user address as an invalid address",
because that *separate* stage of verifying the address needs to still
show that it's invalid.
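
Roughly the pattern mm/madvise.c already follows -- untag first, then
let the normal VMA lookup reject anything that isn't a valid user
address (simplified sketch):

  start = untagged_addr(start);	/* tag bits gone, bit 63 kept */
  vma = find_vma(mm, start);
  if (!vma || vma->vm_start > start)
          return -ENOMEM;		/* invalid addresses die here */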

Now, sometimes the "verification" might actually be "access_ok()"
itself, but honestly, if the address is used for an actual access,
then it shouldn't have gone through the 'untagged_addr()' thing at
all. It should just have been used as-is for the access. So normally
'untagged_addr()' does not get used *together* with 'access_ok()',
although that should obviously also work.

End result: all that really matters on x86-64 is that
'untagged_addr()' must keep the high bit as-is. That's the "this is
not a valid user address" bit. That's very much explicit in my current
series, of course, but even without my current series it was
implicitly the truth with those LAM patches (particularly considering
that 'untagged_addr()' didn't actually do the "keep kernel addresses
as-is" that it *thought* it did due to the signed type confusion).

So what about that (c) case? It doesn't matter. It's simply
fundamentally wrong for the kernel to pass an actual kernel address to
'untagged_addr()' and expect something useful back. It's a nonsensical
thing to do, and it's a huge bug.

So for the (c) case, the fact that the result would be useless and
*usually* GP-fault on access is a good thing. But it's not a
requirement, it's more of a "oh, cool, it's good that the nonsensical
operation causes quick failures".

So in that case, the GP fault is a "feature", but not a requirement.
Again, the normal 'untagged_addr()' case (of not changing the pointer
at all), obviously does *not* cause the kernel pointer corruption, but
maybe we could even have a debug mode. We could literally make
'untagged_addr()' do something like

   #ifdef CONFIG_DEBUG_NONLAM
   // Make sure nobody tries to use untagged_addr() on non-user addresses
   #define untagged_addr(x) ((x) | (long)(x)>>63)
   #endif

except obviously with the "get the types right and use 'x' only once"
thing (so the above #define is buggy, and purely for conceptual
documentation purposes).
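
A version with the types right and single evaluation might look like
this (CONFIG_DEBUG_NONLAM remains hypothetical):

  #define untagged_addr(x) ({				\
          unsigned long __addr = (unsigned long)(x);	\
          __addr | (unsigned long)((long)__addr >> 63);	\
  })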

See?

Side note: one day, maybe we want to use address tagging *inside* the
kernel. However, that will not use 'untagged_addr()'. That would use
some *other* model for cleaning up kernel pointers when necessary.

Even on x86-64, the tagging rules for kernel and user space are
entirely different, in that user-space LAM rules are "U48" or "U57",
while kernel LAM rules depend on the paging mode, and the two are
*not* tied together.

But more importantly from a kernel perspective: regardless of hardware
implementations like that, the notion of masking bits off an untrusted
user pointer is simply completely *different* from the notion of
masking off bits of our own kernel pointers. You can see this in the
kernel everywhere: user pointers should be statically always user
pointers, and should look something like

   struct iocb __user *user_iocb

which is very very different from a kernel pointer that doesn't have
that "__user" thing. Again, on some architectures, the *EXACT*SAME*
bit pattern pointer value may point to different things, because user
accesses are simply in a different address space entirely.

So if/when KASAN starts using LAM inside the kernel, it will do its
own things. It had better not use "untagged_addr()".

           Linus
  
Kirill A. Shutemov May 4, 2023, 5:10 p.m. UTC | #22
On Thu, May 04, 2023 at 08:25:58AM -0700, Dave Hansen wrote:
> On 5/3/23 23:28, Kirill A. Shutemov wrote:
> >> Untagging a kernel address will "corrupt" it, but it will stay a
> >> kernel address (well, it will stay a "high bit set" address), which is
> >> all we care about anyway.
> > The interesting case to consider is untagging a kernel pointer when LAM_U48
> > is enabled (not part of the current LAM enabling). LAM_U48 would make the
> > untagging mask wider -- ~GENMASK(62, 48). With 5-level paging and LAM_SUP
> > enabled (also not supported yet), untagging a kernel pointer may transform
> > it into another valid kernel pointer.
> > 
> > So we cannot rely on #GP as a backstop here. The kernel has to exclude
> > kernel pointers by other means. It can be fun to debug.
> 
> Yeah, I have the feeling that we're really going to need a pair of
> untagging functions once we get to doing kernel LAM for a _bunch_ of
> reasons.

There's already arch_kasan_reset_tag() used on ARM64 (and two more helpers
to set/get the tag). I don't see a reason to add new ones.

> Just as a practical matter, I think we should OR bits into the mask on
> the kernel side, effectively:
> 
> unsigned long untag_kernel_addr(unsigned long addr)
> {
> 	return addr | kernel_untag_mask;
> }
> 
> and kernel_untag_mask should have bit 63 *clear*.
> 
> That way the pointers that have gone through untagging won't look silly.
>  If you untag VMALLOC_END or something, it'll still look like the
> addresses we have in mm.rst.
> 
> Also, it'll be impossible to have the mask turn a userspace address into
> a kernel one.
> 
> Last, we can add some debugging in there, probably conditional on some
> mm debugging options like:
> 
> 	if (WARN_ON_ONCE(!valid_user_address(addr)))
> 		return 0;
> 
> It's kinda like "void __user *" versus "void *".  The __user ones can
> *absolutely* point anywhere, user or kernel.  That's why we can't WARN()
> in the untagged_addr() function that takes user pointers.
> 
> But "void *" can only point to the kernel.  It has totally different rules.
> 
> We should probably also do something like the attached patch sooner
> rather than later.  'untag_mask' really is a misleading name for a mask
> that gets applied only to user addresses.

A bit too verbose to my liking, but okay.

Maybe _uaddr() instead of _user_addr()?
  
Linus Torvalds May 5, 2023, 7:56 p.m. UTC | #23
On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Looks sane from a first reading. But I'll try and have another look
> tomorrow.

Well, I didn't hear back any more, and I was merging architecture
code, so I merged my branch too.

It does fix at least two bugs, even if those bugs are not necessarily
all that noticeable.

I guess we'll see if it introduces new ones...

                    Linus
  
Kirill A. Shutemov May 5, 2023, 8:59 p.m. UTC | #24
On Fri, May 05, 2023 at 12:56:31PM -0700, Linus Torvalds wrote:
> On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Looks sane from a first reading. But I'll try and have another look
> > tomorrow.
> 
> Well, I didn't hear back any more, and I was merging architecture
> code, so I merged my branch too.
> 
> It does fix at least two bugs, even if those bugs are not necessarily
> all that noticeable.
> 
> I guess we'll see if it introduces new ones...

The branch passed LAM selftests fine.
  
Alexander Potapenko June 16, 2023, 8:47 a.m. UTC | #25
Hi Linus,

>  static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
> -			       struct pt_regs *regs, int trapnr)
> +			       struct pt_regs *regs, int trapnr,
> +			       unsigned long fault_address)
>  {
> -	WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?");
> +	WARN_ONCE(trapnr == X86_TRAP_GP && !gp_fault_address_ok(fault_address),
> +		"General protection fault in user access. Non-canonical address?");
>  	return ex_handler_default(fixup, regs);
>  }

Shouldn't ex_handler_copy() be fixed in the same way?
Looks like it's still possible for a tagged userspace address to be passed to it and trigger a warning.

Alex
  
Linus Torvalds June 16, 2023, 4:22 p.m. UTC | #26
On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote:
>
> Shouldn't ex_handler_copy() be fixed in the same way?

I don't think ex_handler_copy() is actually reachable any more.

The only thing ex_handler_copy() did was to set %ax to the fault type.

It was used by the strange copy_user_generic_unrolled() that had
a special machine check case for "don't do the tail if we get
X86_TRAP_MC".

But that was always bogus. The MC case in question was for the
__copy_user_nocache function, and the machine check case was for the
*destination*, which wasn't in user space at all.

So instead of looking at "what was the trap number", the code should
have just noticed "trapped on the destination", and stopped for *that*
reason.

See commit 034ff37d3407 ("x86: rewrite '__copy_user_nocache'
function") and in particular, see the comment there about writes on
the destination:

  * An exception on a write means that we're
  * done, but we need to update the count
  * depending on where in the unrolled loop
  * we were.

but yeah, I never removed the actual, now-unused, _ASM_EXTABLE_CPY case.

Honestly, I had no way of even testing the code. I doubt anybody does.
There are a couple of users:

 - rdma mis-uses it for regular kernel-to-kernel copies that don't
fault (because rdma wants the "nocache" case). So it can't fault at
all.

 - a couple of GPU drivers mis-use it similarly to rdma, but at least
with a user source in the form of __copy_from_user_inatomic_nocache()

 - _copy_from_iter_flushcache uses it for pmem and dax, because it
wants that machine check handling for non-volatile memory

So two of three users don't actually *have* the MC case at all on the
destination.

And the third case - the actual "copy using uncached accesses in order
to get errors on the destination" case - depends on hardware that I'm
not convinced really exists any more.

Oh, I think there's some odd ntb mis-use too.

I'd love for somebody to actually test the machine check case, but the
old code was completely insane with odd "we handle _one_ case of
4-byte alignment, but not actually the general case", so I wonder if
the MC case ever really worked in reality. And as mentioned, I am not
convinced the hardware is available.

                 Linus
  
Alexander Potapenko June 19, 2023, 12:03 p.m. UTC | #27
On Fri, Jun 16, 2023 at 6:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote:
> >
> > Shouldn't ex_handler_copy() be fixed in the same way?
>
> I don't think ex_handler_copy() is actually reachable any more.
>
> The only thing ex_handler_copy() did was to set %ax to the fault type.
>
> It was used by the strange copy_user_generic_unrolled() that had
> special machine check case for "don't do the tail if we get
> X86_TRAP_MC".

Oh, I see, thanks!
I am using a downstream kernel (way before 034ff37d3407) that still
has _ASM_EXTABLE_CPY.
Looks like I just need to adjust ex_handler_copy() in my code.
  

Patch

diff --cc arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4,652e366b68a0..702d93fdd10e
--- a/arch/x86/include/asm/disabled-features.h
diff --cc arch/x86/include/uapi/asm/prctl.h
index eb290d89cb32,1b85bc876c2d..abe3fe6db6d2
--- a/arch/x86/include/uapi/asm/prctl.h
diff --cc arch/x86/kernel/process.c
index 50d950771371,8bf13cff0141..7157b09d1cbf
--- a/arch/x86/kernel/process.c
diff --cc arch/x86/kernel/process_64.c
index b46924c9e46d,31241930b60c..74c7e84a94d8
--- a/arch/x86/kernel/process_64.c
diff --cc fs/proc/array.c
index 6daea628bc76,3e1a33dcd0d0..a880c4e44752
--- a/fs/proc/array.c
diff --cc tools/testing/selftests/x86/Makefile
index 598135d3162b,cfc8a26ad151..fa2216a8c0d5
--- a/tools/testing/selftests/x86/Makefile