[GIT,PULL] x86/mm for 6.2

Message ID 20221213174234.688534-1-dave.hansen@linux.intel.com
State New
Headers
Series [GIT,PULL] x86/mm for 6.2 |

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.2

Commit Message

Dave Hansen Dec. 13, 2022, 5:42 p.m. UTC
  Hi Linus,

Please pull some x86/mm changes for 6.2.

This includes some new randomization of the per-cpu entry areas from
Peter Z.  Without it, these areas are a tasty target for attackers.
The entry code and mappings are especially tricky code and this has
caused some issues along the way, but they have settled down.

This also contains a new hardware feature: Linear Address Masking
(LAM).  It is similar conceptually to the ARM Top-Byte-Ignore (TBI)
feature and should allow userspace memory sanitizers to be used
with less overhead on x86.

LAM adds some pointer tag/untag functions which are used relatively
widely in the tree, but were just stubs on x86 until now.  The
new functions also unearthed a SCSI build issue.  There's a fix in
the SCSI tree for it:

	4e80eef45ad7 ("scsi: sg: Fix get_user() in call sg_scsi_ioctl()")

but this should not be pulled without that patch being in place
first.

Last, there is a merge conflict between the set_memory_rox() work
in this branch and x86/core.  I've included a suggested resolution
below from Ingo.

---

Merge branch 'x86/core' into x86/mm, to resolve conflicts

Resolve conflicts between this commit in tip:x86/core:

      4c4eb3ecc91f ("x86/modules: Set VM_FLUSH_RESET_PERMS in module_alloc()")

    ... and this one in tip:x86/mm:

      1f6eae430528 mm: Introduce set_memory_rox()

The set_vm_flush_reset_perms() calls need to be removed in the
set_memory_rox()-simplified code too.

Conflicts:
	arch/x86/kernel/ftrace.c
	arch/x86/kernel/kprobes/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

+++ b/arch/x86/kernel/ftrace.c
@@@ -413,9 -421,9 +421,8 @@@ create_trampoline(struct ftrace_ops *op
  	/* ALLOC_TRAMP flags lets us know we created it */
  	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
  
- 	set_vm_flush_reset_perms(trampoline);
- 
 -	if (likely(system_state != SYSTEM_BOOTING))
 -		set_memory_ro((unsigned long)trampoline, npages);
 -	set_memory_x((unsigned long)trampoline, npages);
 +	set_memory_rox((unsigned long)trampoline, npages);
++
  	return (unsigned long)trampoline;
  fail:
  	tramp_free(trampoline);
+++ b/arch/x86/kernel/kprobes/core.c
@@@ -414,8 -414,12 +414,6 @@@ void *alloc_insn_page(void
  	if (!page)
  		return NULL;
  
- 	set_vm_flush_reset_perms(page);
 -	/*
 -	 * First make the page read-only, and only then make it executable to
 -	 * prevent it from being W+X in between.
 -	 */
 -	set_memory_ro((unsigned long)page, 1);
--
  	/*
  	 * TODO: Once additional kernel code protection mechanisms are set, ensure
  	 * that the page was not maliciously altered and it is still zeroed.

--
The following changes since commit 30a0b95b1335e12efef89dd78518ed3e4a71a763:

  Linux 6.1-rc3 (2022-10-30 15:19:28 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.2

for you to fetch changes up to ce66a02538f39f071443bac9bc6ff8f3a780ab92:

  x86/mm: Fix sparse warnings in untagged_ptr() (2022-11-28 15:12:25 -0800)

----------------------------------------------------------------
New Features:
* Introduce Hardware-based Linear Address Masking (LAM)
* Randomize the per-cpu entry areas
Cleanups:
* Move to "native" set_memory_rox() helper
* Clean up pmd_get_atomic() and i386-PAE
* Remove some unused page table size macros

----------------------------------------------------------------
Andrey Ryabinin (1):
      x86/kasan: Map shadow for percpu pages on demand

Dave Hansen (1):
      x86/mm: Ensure forced page table splitting

Kirill A. Shutemov (12):
      x86/mm: Fix CR3_ADDR_MASK
      x86: CPUID and CR3/CR4 flags for Linear Address Masking
      mm: Pass down mm_struct to untagged_addr()
      x86/mm: Handle LAM on context switch
      x86/uaccess: Provide untagged_addr() and remove tags before address check
      KVM: Serialize tagged address check against tagging enabling
      x86/mm: Provide arch_prctl() interface for LAM
      x86/mm: Reduce untagged_addr() overhead until the first LAM user
      mm: Expose untagging mask in /proc/$PID/status
      iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
      x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
      x86/mm: Fix sparse warnings in untagged_ptr()

Pasha Tatashin (1):
      x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros

Peter Zijlstra (27):
      x86/mm: Randomize per-cpu entry area
      Merge branch 'v6.1-rc3'
      mm: Move mm_cachep initialization to mm_init()
      x86/mm: Use mm_alloc() in poking_init()
      x86/mm: Initialize text poking earlier
      x86/ftrace: Remove SYSTEM_BOOTING exceptions
      x86/mm: Do verify W^X at boot up
      mm: Introduce set_memory_rox()
      x86/mm: Implement native set_memory_rox()
      mm: Update ptep_get_lockless()'s comment
      x86/mm/pae: Make pmd_t similar to pte_t
      sh/mm: Make pmd_t similar to pte_t
      mm: Fix pmd_read_atomic()
      mm: Rename GUP_GET_PTE_LOW_HIGH
      mm: Rename pmd_read_atomic()
      mm/gup: Fix the lockless PMD access
      x86/mm/pae: Don't (ab)use atomic64
      x86/mm/pae: Use WRITE_ONCE()
      x86/mm/pae: Be consistent with pXXp_get_and_clear()
      x86_64: Remove pointless set_64bit() usage
      x86/mm/pae: Get rid of set_64bit()
      mm: Remove pointless barrier() after pmdp_get_lockless()
      mm: Convert __HAVE_ARCH_P..P_GET to the new style
      x86/mm: Add a few comments
      x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
      x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
      x86/mm: Rename __change_page_attr_set_clr(.checkalias)

Sean Christopherson (5):
      x86/mm: Recompute physical address for every page of per-CPU CEA mapping
      x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
      x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
      x86/kasan: Add helpers to align shadow addresses up and down
      x86/kasan: Populate shadow for shared chunk of the CPU entry area

Weihong Zhang (5):
      selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
      selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
      selftests/x86/lam: Add io_uring test cases for linear-address masking
      selftests/x86/lam: Add inherit test cases for linear-address masking
      selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

 arch/arm/mach-omap1/sram-init.c                  |    8 +-
 arch/arm/mach-omap2/sram.c                       |    8 +-
 arch/arm64/include/asm/memory.h                  |    4 +-
 arch/arm64/include/asm/mmu_context.h             |    6 +
 arch/arm64/include/asm/signal.h                  |    2 +-
 arch/arm64/include/asm/uaccess.h                 |    2 +-
 arch/arm64/kernel/hw_breakpoint.c                |    2 +-
 arch/arm64/kernel/traps.c                        |    4 +-
 arch/arm64/mm/fault.c                            |   10 +-
 arch/mips/Kconfig                                |    2 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h     |    2 +-
 arch/powerpc/kernel/kprobes.c                    |    9 +-
 arch/sh/Kconfig                                  |    2 +-
 arch/sh/include/asm/pgtable-3level.h             |   10 +-
 arch/sparc/include/asm/mmu_context_64.h          |    6 +
 arch/sparc/include/asm/pgtable_64.h              |    2 +-
 arch/sparc/include/asm/uaccess_64.h              |    2 +
 arch/um/include/asm/pgtable-3level.h             |    8 -
 arch/x86/Kconfig                                 |    2 +-
 arch/x86/include/asm/cmpxchg_32.h                |   28 -
 arch/x86/include/asm/cmpxchg_64.h                |    5 -
 arch/x86/include/asm/cpu_entry_area.h            |    4 -
 arch/x86/include/asm/cpufeatures.h               |    1 +
 arch/x86/include/asm/kasan.h                     |    3 +
 arch/x86/include/asm/mmu.h                       |   12 +-
 arch/x86/include/asm/mmu_context.h               |   47 +
 arch/x86/include/asm/page_types.h                |   12 +-
 arch/x86/include/asm/pgtable-3level.h            |  171 +---
 arch/x86/include/asm/pgtable-3level_types.h      |    7 +
 arch/x86/include/asm/pgtable_64_types.h          |    1 +
 arch/x86/include/asm/pgtable_areas.h             |    8 +-
 arch/x86/include/asm/pgtable_types.h             |    4 +-
 arch/x86/include/asm/processor-flags.h           |    4 +-
 arch/x86/include/asm/set_memory.h                |    3 +
 arch/x86/include/asm/tlbflush.h                  |   34 +
 arch/x86/include/asm/uaccess.h                   |   42 +-
 arch/x86/include/uapi/asm/prctl.h                |    5 +
 arch/x86/include/uapi/asm/processor-flags.h      |    6 +
 arch/x86/kernel/alternative.c                    |   10 -
 arch/x86/kernel/amd_gart_64.c                    |    2 +-
 arch/x86/kernel/ftrace.c                         |    6 +-
 arch/x86/kernel/head64.c                         |    2 +-
 arch/x86/kernel/hw_breakpoint.c                  |    2 +-
 arch/x86/kernel/kprobes/core.c                   |    9 +-
 arch/x86/kernel/process.c                        |    3 +
 arch/x86/kernel/process_64.c                     |   87 +-
 arch/x86/kernel/traps.c                          |    6 +-
 arch/x86/mm/cpu_entry_area.c                     |   50 +-
 arch/x86/mm/init.c                               |    2 +-
 arch/x86/mm/kasan_init_64.c                      |   53 +-
 arch/x86/mm/mem_encrypt_boot.S                   |    4 +-
 arch/x86/mm/mem_encrypt_identity.c               |   18 +-
 arch/x86/mm/pat/set_memory.c                     |  105 +-
 arch/x86/mm/pti.c                                |    2 +-
 arch/x86/mm/tlb.c                                |   53 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c          |    2 +-
 drivers/gpu/drm/radeon/radeon_gem.c              |    2 +-
 drivers/infiniband/hw/mlx4/mr.c                  |    2 +-
 drivers/iommu/intel/irq_remapping.c              |   13 +-
 drivers/iommu/iommu-sva-lib.c                    |   16 +-
 drivers/media/common/videobuf2/frame_vector.c    |    2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c    |    2 +-
 drivers/misc/sram-exec.c                         |    7 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c   |    2 +-
 drivers/tee/tee_shm.c                            |    2 +-
 drivers/vfio/vfio_iommu_type1.c                  |    2 +-
 fs/proc/array.c                                  |    6 +
 fs/proc/task_mmu.c                               |    2 +-
 include/linux/filter.h                           |    3 +-
 include/linux/ioasid.h                           |    9 -
 include/linux/mm.h                               |   11 -
 include/linux/mmu_context.h                      |   14 +
 include/linux/pgtable.h                          |   73 +-
 include/linux/sched/mm.h                         |    8 +-
 include/linux/sched/task.h                       |    2 +-
 include/linux/set_memory.h                       |   10 +
 include/linux/uaccess.h                          |   15 +
 init/main.c                                      |    4 +-
 kernel/bpf/bpf_struct_ops.c                      |    3 +-
 kernel/bpf/core.c                                |    6 +-
 kernel/bpf/trampoline.c                          |    3 +-
 kernel/events/core.c                             |    2 +-
 kernel/fork.c                                    |   37 +-
 lib/strncpy_from_user.c                          |    2 +-
 lib/strnlen_user.c                               |    2 +-
 mm/Kconfig                                       |    2 +-
 mm/gup.c                                         |    8 +-
 mm/hmm.c                                         |    3 +-
 mm/khugepaged.c                                  |    2 +-
 mm/madvise.c                                     |    2 +-
 mm/mapping_dirty_helpers.c                       |    2 +-
 mm/mempolicy.c                                   |    6 +-
 mm/migrate.c                                     |    2 +-
 mm/mincore.c                                     |    2 +-
 mm/mlock.c                                       |    4 +-
 mm/mmap.c                                        |    2 +-
 mm/mprotect.c                                    |    4 +-
 mm/mremap.c                                      |    2 +-
 mm/msync.c                                       |    2 +-
 mm/userfaultfd.c                                 |    2 +-
 mm/vmscan.c                                      |    5 +-
 net/bpf/bpf_dummy_struct_ops.c                   |    3 +-
 tools/testing/selftests/x86/Makefile             |    2 +-
 tools/testing/selftests/x86/lam.c                | 1149 ++++++++++++++++++++++
 virt/kvm/kvm_main.c                              |   14 +-
 106 files changed, 1901 insertions(+), 485 deletions(-)
 create mode 100644 tools/testing/selftests/x86/lam.c
  

Comments

Linus Torvalds Dec. 14, 2022, 10:36 p.m. UTC | #1
On Tue, Dec 13, 2022 at 9:43 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> This also contains a new hardware feature: Linear Address Masking
> (LAM).  It is similar conceptually to the ARM Top-Byte-Ignore (TBI)
> feature and should allow userspace memory sanitizers to be used
> with less overhead on x86.

Christ.

Is it too late to ask Intel to call this "Top-Bits-Ignore", and
instead of adding another crazy TLA, we'd just all agree to call this
"TBI"?

I  know, I know, NIH and all that, but at least as long as we are
limiting ourselves to regular US-ASCII, we really only have 17576
TLA's to go around, and at some point it gets not only confusing, but
really quite wasteful, to have everybody make up their own
architecture-specific TLA.

And while I'm on the subject: I really think that the changes to
"untagged_addr()" are fundamentally broken.

Why? That whole LAM (or BTI) is not necessarily per-mm. It can easily
be per-*thread*.

Imagine, if you will, a setup where you have some threads that use
tagged pointers, and some threads that don't.

For example, maybe the upper bits of the address contains a tag that
is used only used within a virtual machine? You could even have the
"native" mode use the full address space, and put itself and its
private data in the upper bits virtually.

IOW, imagine using the virtual address masking as not just memory
sanitizers, but as an actual honest-to-goodness separation feature (eg
JITed code might fundamentally have access only to the lower bits,
while the JITter itself sees the whole address space).

Maybe that's not how LAM works on x86, but your changes to
untagged_addr() are *not* x86-specific.

So I really think this is completely wrong, quite aside from the
naming. It just makes assumptions that aren't valid.

The fact that you made this mm-specific actually ends up being an
active bug in the code, even on x86-64. You use the mmap lock to
serialize this all in prctl_enable_tagged_addr(), but then the read
side (ie just untagged_addr()) isn't actually serialized at all - and
*shouldn't* be serialized.

So I really think this is a fundamental design mistake, and while I
pulled it and sorted out the trivial conflicts, I've unpulled it again
as being actively mis-designed.

                Linus
  
Kirill A. Shutemov Dec. 15, 2022, 12:30 p.m. UTC | #2
On Wed, Dec 14, 2022 at 02:36:04PM -0800, Linus Torvalds wrote:
> On Tue, Dec 13, 2022 at 9:43 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >
> > This also contains a new hardware feature: Linear Address Masking
> > (LAM).  It is similar conceptually to the ARM Top-Byte-Ignore (TBI)
> > feature and should allow userspace memory sanitizers to be used
> > with less overhead on x86.
> 
> Christ.
> 
> Is it too late to ask Intel to call this "Top-Bits-Ignore", and
> instead of adding another crazy TLA, we'd just all agree to call this
> "TBI"?

The very top bit 63 is not ignored in LAM case. And in ARM case, TBI is Top
Byte Ignore, not Bits implies number of ignored bits and their placement.
The same TLA that means different thing does not help situation IMO.

I'm not sure if we can change how Intel calls the feature at this point,
but we can change kernel nomenclature if needed.

BTW, we already have somewhat related feature on Sparc that called ADI
which also hooks up into untagged_addr() machinery.

Maybe it is okay to have architecture-specific terminology as long as it
stays in arch code? Per-arch TLA namespace :P

> I  know, I know, NIH and all that, but at least as long as we are
> limiting ourselves to regular US-ASCII, we really only have 17576
> TLA's to go around, and at some point it gets not only confusing, but
> really quite wasteful, to have everybody make up their own
> architecture-specific TLA.
> 
> And while I'm on the subject: I really think that the changes to
> "untagged_addr()" are fundamentally broken.
> 
> Why? That whole LAM (or BTI) is not necessarily per-mm. It can easily
> be per-*thread*.
> 
> Imagine, if you will, a setup where you have some threads that use
> tagged pointers, and some threads that don't.
> 
> For example, maybe the upper bits of the address contains a tag that
> is used only used within a virtual machine? You could even have the
> "native" mode use the full address space, and put itself and its
> private data in the upper bits virtually.
> 
> IOW, imagine using the virtual address masking as not just memory
> sanitizers, but as an actual honest-to-goodness separation feature (eg
> JITed code might fundamentally have access only to the lower bits,
> while the JITter itself sees the whole address space).
> 
> Maybe that's not how LAM works on x86, but your changes to
> untagged_addr() are *not* x86-specific.

My initial enabling allowed userspace to enable the feature per-thread.
I pointed out the same potential use -- JIT -- case as you did[1].
But it made harder to handle corner cases. Like, it is not obvious what
LAM setup kernel thread has to use when acts on behalf of the process
(io_uring for instance). I ended up with using the most permissive LAM
mode (that allowed the most tag bits), but it is clearly a hack.

Thomas pointed[2] that out and following versions used per-process
approach. It simplified the patchset substantially.

Speaking about, untagged_addr(), how do you see the interface should look
like? It has to work correctly when called from a kernel thread. I can
only think of a hack that makes it branch on task->flags & PF_KTHREAD.

Hm?

[1] https://lore.kernel.org/all/20220513225936.qo4cy6sijqpzmvpt@black.fi.intel.com
[2] https://lore.kernel.org/all/878rr6x4iu.ffs@tglx

> So I really think this is completely wrong, quite aside from the
> naming. It just makes assumptions that aren't valid.
> 
> The fact that you made this mm-specific actually ends up being an
> active bug in the code, even on x86-64. You use the mmap lock to
> serialize this all in prctl_enable_tagged_addr(), but then the read
> side (ie just untagged_addr()) isn't actually serialized at all - and
> *shouldn't* be serialized.

mmap lock in prctl_enable_tagged_addr() serializes LAM enabling against
PASID allocation for the process as SVA and LAM is not currently
compatible.

I re-used the locking for KVM serialization where it wants to make sure
the userspace_addr has no tag bits set. But I just realized that replacing
access_ok() with __access_ok() should work too, no need for unagged_addr()
dance. I will rewrite the patch.

Normal untagged_addr() usage doesn't require serialization: IPI in
prctl_enable_tagged_addr() makes mm->context.untag_mask visible to all
CPUs that runs the mm before it returns. And we only allow one-shot LAM
enabling. It cannot be disabled later on.

Could you elaborate, what active bug do you see?

> So I really think this is a fundamental design mistake, and while I
> pulled it and sorted out the trivial conflicts, I've unpulled it again
> as being actively mis-designed.
> 
>                 Linus
  
Linus Torvalds Dec. 15, 2022, 5:17 p.m. UTC | #3
On Thu, Dec 15, 2022 at 4:30 AM <kirill.shutemov@linux.intel.com> wrote:
>
> My initial enabling allowed userspace to enable the feature per-thread.
> I pointed out the same potential use -- JIT -- case as you did[1].
> But it made harder to handle corner cases. Like, it is not obvious what
> LAM setup kernel thread has to use when acts on behalf of the process
> (io_uring for instance). I ended up with using the most permissive LAM
> mode (that allowed the most tag bits), but it is clearly a hack.

... and doing it per-mm causes more corner cases.

Look at the - actively buggy - uses of the mask outside the mm lock.

And no, "take the mm lock for reading every time you want to do
untagged_addr()" isn't the answer either.

I didn't look at every single use of 'untagged_addr()', but I did look
at a couple, and every single one of the ones I looked at was racy and
wrong.

Does the race _matter_? Almost certainly not. Nobody will ever care.
It's something we would have happily ignored for years, and then
sighed, and added a READ_ONCE() for to make KCSAN happy.

But when I get a new pull request, and the first thing I look at is
fundamentally racy, just take a wild guess at how happy it makes me
about that pull request?

> Thomas pointed[2] that out and following versions used per-process
> approach. It simplified the patchset substantially.

No, it sure as hell didn't.

It "simplified" things by just introducing another bug instead.

> Speaking about, untagged_addr(), how do you see the interface should look
> like? It has to work correctly when called from a kernel thread. I can
> only think of a hack that makes it branch on task->flags & PF_KTHREAD.

So I think that the current branch contents I see are simply not an
option. I do not want to pull code that when I look at it, I see a
data race on context.untag_mask .

I also find the "pass in mm" to be fundamentally objectionable,
because even if that then ends up being the right approach for x86,
this interface really isn't x86-centric.

Yes, arm64 ends up making it all unconditional, which simplifies
things enormously and just means that arm64 doesn't care at all. But
what if somebody else comes along, and really does want to make it
per-thread?

So I'd be ok with an interface that passes in 'struct task_struct',
and then x86 could use "tsk->mm" to get the mm if that really is the
interface x86 people want. Because then at least the *interface* isn't
hardcoded to be broken.

So those are my two fundamental objections against the current LAM
code. An outright bug in x86 code, and a bad interface in non-x86
code.

Here's *one* suggested solution:

 - pass in the 'task' for the target for untagged_addr(), so that
we're not locked into one model

 - if x86 really wants to make thing per-mm, let it, but have a "lock"
operation that says "now my setting is set in stone". You already do
that in prctl_enable_tagged_addr() with that

        /* Already enabled? */
        if (mm->context.lam_cr3_mask) {
                ret = -EBUSY;
                goto out;
        }

   just expose that as a thing.

 - make thread creation lock it (and unlock it on creating a new mm at
execve, but not fork).

And yes, I would actually suggest that _any_ thread creation locks it,
so that you never *EVER* have any issues with "oh, now I need to
synchronize with other threads". A process can set its LAM state at
startup, not in the middle of running!

Note that io_uring will automatically do that locking, because it does
that thread creation (create_io_thread()). So io_uring threads won't
even be a special case.

You could go even further, and just make it a ELF binary flag, and
basically have that LAM thing be a "when creating the mm, this is what
it gets created as". That is probably what most people would actually
want, but it sounds painful from a bootstrapping standpoint.

Now, the above is just a suggestion off the top of my head. It
simplifies things a lot, in that now even if the state is "per mm",
you can actually *cache* it per-thread, and there are no data races.

It also means that prctl_enable_tagged_addr() doesn't have to worry
about serialization and that whole

        on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);

thing.

> Normal untagged_addr() usage doesn't require serialization: IPI in
> prctl_enable_tagged_addr() makes mm->context.untag_mask visible to all
> CPUs that runs the mm before it returns.

BS.

That operation is not serialized IN ANY WAY with the other threads
reading the value. For all you know, untagged_addr() may be reading
that value multiple times due to compiler reloads, and getting
different values between two reads, and doing random things because
you use one value for one thing, and another for the later
"find_vma()" when the untagged address is actually used.

The IPI does *nothing*.  It serializes the cores, but since there is
no actual locking (or even a READ_ONCE() or anything), it doesn't
actually do anything on a software level.

The code is buggy. Really. It's likely a "this works in practice
because reloads are unlikely and you wouldn't hit them anyway" kind of
bug, but no, that IPI does not help.

(Of course, the IPI helps in that now at least the cr3 value is
synchronized, so the IPI is needed for the *hardware* accesses, but
doesn't help software use of mm->context.untag_mask).

                Linus
  
Linus Torvalds Dec. 15, 2022, 5:26 p.m. UTC | #4
On Thu, Dec 15, 2022 at 9:17 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Here's *one* suggested solution:

Note again: this is not a "you need to do it this way" suggestion.
This is just a "at least this way doesn't have the issues I object
to". There are bound to be other ways to do it.

But if you feel like all threads have to share the same LAM state, it
does seem a lot simpler if you just say "you need to set that state
before you start any threads". No?

> And yes, I would actually suggest that _any_ thread creation locks it,
> so that you never *EVER* have any issues with "oh, now I need to
> synchronize with other threads". A process can set its LAM state at
> startup, not in the middle of running!

Note that this "no serialization needed" is just about the SW side.

The *hardware* side may still need the IPI just to make sure that it
forces a TLB flush - even if we are single-threaded, that single
thread may have run on other CPU's before.

But I think at that point it's just a regular TLB flush, and doesn't
need that LAM-specific IPI.

But maybe there's some bigger HW serialization that is needed for the
LAM bit, I have not looked at that enough to know.

              Linus
  
Dave Hansen Dec. 15, 2022, 9:53 p.m. UTC | #5
On 12/15/22 09:26, Linus Torvalds wrote:
> But if you feel like all threads have to share the same LAM state, it
> does seem a lot simpler if you just say "you need to set that state
> before you start any threads". No?

That would be a lot simpler.  I have one bit of hesitation there, though.

Back in the MPX days, we had some users pop up and tell us that MPX
wasn't working for them on certain threads.  Those threads ended up
having been clone()'d even _before_ MPX got enabled which was via some
pre-main() startup code that the compiler inserted.  Since these early
threads' FPU state was copied before MPX setup was done, they never got
MPX-enabled.

Right or wrong, since then I've basically assumed that someone might be
creating threads behind my back.

Maybe we call those "early thread" folks too crazy to get LAM.  Maybe I
need to forget it ever happened, but they were actual users that got
bitten and cared enough to report it.  Or, heck, maybe I'm just
delusional because I can't find any trace of this conversation in the
list archives.

LAM _is_ slightly different than what MPX hit, though, since instead of
totally silent breakage the app can whine about the LAM prctl() having
failed.

Anyway, message heard loud and clear about the untagged_addr() races and
the interfaces.  We'll find some way to fix those up for the next merge
window.
  
Linus Torvalds Dec. 15, 2022, 10:46 p.m. UTC | #6
On Thu, Dec 15, 2022 at 1:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Back in the MPX days, we had some users pop up and tell us that MPX
> wasn't working for them on certain threads.  Those threads ended up
> having been clone()'d even _before_ MPX got enabled which was via some
> pre-main() startup code that the compiler inserted.  Since these early
> threads' FPU state was copied before MPX setup was done, they never got
> MPX-enabled.

Yeah, I can see that happening, but I do think:

> Maybe we call those "early thread" folks too crazy to get LAM.

I think we should at least *start* that way.

Yes, I can imagine some early LD linkage thing deciding "this early
dynamic linking is very expensive, I will thread it".

I personally think that's a bit crazy - if your early linking is that
expensive, you have some serious design issues - but even if it does
happen, I'd rather start with very strict rules, see if that works
out, and then if we hit problems we have other alternatives.

Those other alternatives could involve relaxing the rules later and
saying "ok, we'll allow you to enable LAM even in this case, because
you only did Xyz".

But another alternative could also be to just have that LAM enabled
even earlier by adding an ELF flag, so that the address masking is
essentially set up at execve() time.

And independently of that, there's another issue: I suspect you want
separate the concept of "the kernel will mask virtual addresses in
certain system calls" from "the CPU will ignore those masked bits"

And I say that because that's basically what arm64 does.

On arm64, the 'untagged_addr()' masking is always enabled, but whether
the actual CPU hardware ignores the flags when virtually addressed is
a separate thing that you need to do the prctl() for.

Put another way: you can always pass tagged addresses to things like
mmap() and friends, *even if* those tagged addresses would cause a
fault when accessed normally, and wouldn't work for "read()" and
'write()" and friends (because read/write uses regular CPU accesses).

Now, the Intel LAM model is mode complicated, because the mask is
dynamic, and because there are people who want to use the full linear
address space with no masking.

But the whole "want to use the full linear address space" issue is
actually in some ways closer to the the "use five-level page tables"
decision - in that by *default* we don't use VA space ab9ove 47, and
you have to basically actively *ask* for that 5-level address space.

And I think the LAM model might want to take that same approach: maybe
by *default*, untagged_addr() always does the masking, and x86 takes
the arm64 approach.

And then the truly *special* apps that want unmasked addresses are the
ones who mark themselves special, kind of the same way they already
have to if they want 5-level paging.

And again: untagged_addr() doing masking does *not* necessarily mean
that the CPU does. I think arm64 does actually set the TBI bit by
default, but there's very much a prctl() to enable/disable it. But
that "BTI is disabled" doesn't actually mean that the mmap() address
masking is disabled.

I dunno. I'm really just saying "look, arm64 does this very
differently, and it seems to work fine there".

                    Linus
  
Kirill A. Shutemov Dec. 16, 2022, 2:16 a.m. UTC | #7
On Thu, Dec 15, 2022 at 09:17:11AM -0800, Linus Torvalds wrote:
>  - make thread creation lock it (and unlock it on creating a new mm at
> execve, but not fork).
> 
> And yes, I would actually suggest that _any_ thread creation locks it,
> so that you never *EVER* have any issues with "oh, now I need to
> synchronize with other threads". A process can set its LAM state at
> startup, not in the middle of running!

By thread creation, you mean CLONE_VM, right?

Do we care to avoid locking for CLONE_VFORK case? Seems excessive.

> Note that io_uring will automatically do that locking, because it does
> that thread creation (create_io_thread()). So io_uring threads won't
> even be a special case.

I've missed that io_uring does not use kthread_use_mm() anymore. But what
about other kthread_use_mm() users?

I did not find concrete cases where missing LAM setup would breaks things,
but I cannot prove opposite too.

kthread_use_mm() usage is suspicious in amdkfd, but I donno.

io_uring was a clear before it moved away from kthread_use_mm().
  
Kirill A. Shutemov Dec. 16, 2022, 3:05 p.m. UTC | #8
On Fri, Dec 16, 2022 at 05:16:45AM +0300, kirill.shutemov@linux.intel.com wrote:
> On Thu, Dec 15, 2022 at 09:17:11AM -0800, Linus Torvalds wrote:
> >  - make thread creation lock it (and unlock it on creating a new mm at
> > execve, but not fork).
> > 
> > And yes, I would actually suggest that _any_ thread creation locks it,
> > so that you never *EVER* have any issues with "oh, now I need to
> > synchronize with other threads". A process can set its LAM state at
> > startup, not in the middle of running!
> 
> By thread creation, you mean CLONE_VM, right?
> 
> Do we care to avoid locking for CLONE_VFORK case? Seems excessive.
> 
> > Note that io_uring will automatically do that locking, because it does
> > that thread creation (create_io_thread()). So io_uring threads won't
> > even be a special case.
> 
> I've missed that io_uring does not use kthread_use_mm() anymore. But what
> about other kthread_use_mm() users?
> 
> I did not find concrete cases where missing LAM setup would breaks things,
> but I cannot prove opposite too.
> 
> kthread_use_mm() usage is suspicious in amdkfd, but I donno.
> 
> io_uring was a clear before it moved away from kthread_use_mm().

Below is preliminary fixup that suppose to address the issue. It does not
include change to untagged_addr() interface to avoid the clutter.

LAM mode kept per-mm, but cannot be changed after the first thread has
spawn (or LAM enabled). Bit in mm->context.flags is used to indicate LAM
mode lock.

I've added few test cases for new limitations around threads.

kthread_use_mm() should be safe as long as no arch actually implements
per-thread tagging enabling.

Any comments?

If looks okay, I will integrate the fixup into the LAM patchset properly.
Or maybe you want an incremental patchset on top?

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 4af81df133ee..ed320e145cf9 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -317,7 +317,7 @@ static struct vm_area_struct gate_vma __ro_after_init = {
 struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
 {
 #ifdef CONFIG_COMPAT
-	if (!mm || !(mm->context.flags & MM_CONTEXT_HAS_VSYSCALL))
+	if (!mm || test_bit(MM_CONTEXT_HAS_VSYSCALL, &mm->context.flags))
 		return NULL;
 #endif
 	if (vsyscall_mode == NONE)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 1215b0a714c9..4e7e8d9893de 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -9,11 +9,13 @@
 #include <linux/bits.h>
 
 /* Uprobes on this MM assume 32-bit code */
-#define MM_CONTEXT_UPROBE_IA32		BIT(0)
+#define MM_CONTEXT_UPROBE_IA32		0
 /* vsyscall page is accessible on this MM */
-#define MM_CONTEXT_HAS_VSYSCALL		BIT(1)
+#define MM_CONTEXT_HAS_VSYSCALL		1
 /* Allow LAM and SVA coexisting */
-#define MM_CONTEXT_FORCE_TAGGED_SVA	BIT(2)
+#define MM_CONTEXT_FORCE_TAGGED_SVA	2
+/* Do not allow changing LAM mode */
+#define MM_CONTEXT_LOCK_LAM		3
 
 /*
  * x86 has arch-specific MMU state beyond what lives in mm_struct.
@@ -41,7 +43,7 @@ typedef struct {
 #endif
 
 #ifdef CONFIG_X86_64
-	unsigned short flags;
+	unsigned long flags;
 
 	/* Active LAM mode:  X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
 	unsigned long lam_cr3_mask;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 32483ab3f763..4bc95c35cbd3 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -118,7 +118,7 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 static inline bool arch_pgtable_dma_compat(struct mm_struct *mm)
 {
 	return !mm_lam_cr3_mask(mm) ||
-		(mm->context.flags & MM_CONTEXT_FORCE_TAGGED_SVA);
+		test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags);
 }
 #else
 
@@ -228,7 +228,7 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
 static inline bool is_64bit_mm(struct mm_struct *mm)
 {
 	return	!IS_ENABLED(CONFIG_IA32_EMULATION) ||
-		!(mm->context.flags & MM_CONTEXT_UPROBE_IA32);
+		!test_bit(MM_CONTEXT_UPROBE_IA32, &mm->context.flags);
 }
 #else
 static inline bool is_64bit_mm(struct mm_struct *mm)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index bd92e1ee1c1a..6286885dc1e2 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -35,7 +35,7 @@ DECLARE_STATIC_KEY_FALSE(tagged_addr_key);
 	u64 __addr = (__force u64)(addr);				\
 	if (static_branch_likely(&tagged_addr_key)) {			\
 		s64 sign = (s64)__addr >> 63;				\
-		__addr &= (mm)->context.untag_mask | sign;		\
+		__addr &= READ_ONCE((mm)->context.untag_mask) | sign;	\
 	}								\
 	(__force __typeof__(addr))__addr;				\
 })
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index d1e83ba21130..9bcec8195714 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -162,6 +162,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 
 	savesegment(es, p->thread.es);
 	savesegment(ds, p->thread.ds);
+
+	if (p->mm && (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM)
+		set_bit(MM_CONTEXT_LOCK_LAM, &p->mm->context.flags);
 #else
 	p->thread.sp0 = (unsigned long) (childregs + 1);
 	savesegment(gs, p->thread.gs);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b06de18215ce..7b8df4769486 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -671,7 +671,7 @@ void set_personality_64bit(void)
 	task_pt_regs(current)->orig_ax = __NR_execve;
 	current_thread_info()->status &= ~TS_COMPAT;
 	if (current->mm)
-		current->mm->context.flags = MM_CONTEXT_HAS_VSYSCALL;
+		__set_bit(MM_CONTEXT_HAS_VSYSCALL, &current->mm->context.flags);
 
 	/* TBD: overwrites user setup. Should have two bits.
 	   But 64bit processes have always behaved this way,
@@ -708,7 +708,7 @@ static void __set_personality_ia32(void)
 		 * uprobes applied to this MM need to know this and
 		 * cannot use user_64bit_mode() at that time.
 		 */
-		current->mm->context.flags = MM_CONTEXT_UPROBE_IA32;
+		__set_bit(MM_CONTEXT_UPROBE_IA32, &current->mm->context.flags);
 	}
 
 	current->personality |= force_personality32;
@@ -746,31 +746,6 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
 DEFINE_STATIC_KEY_FALSE(tagged_addr_key);
 EXPORT_SYMBOL_GPL(tagged_addr_key);
 
-static void enable_lam_func(void *mm)
-{
-	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
-	unsigned long lam_mask;
-	unsigned long cr3;
-
-	if (loaded_mm != mm)
-		return;
-
-	lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
-
-	/*
-	 * Update CR3 to get LAM active on the CPU.
-	 *
-	 * This might not actually need to update CR3 if a context switch
-	 * happened between updating 'lam_cr3_mask' and running this IPI
-	 * handler.  Update it unconditionally for simplicity.
-	 */
-	cr3 = __read_cr3();
-	cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
-	cr3 |= lam_mask;
-	write_cr3(cr3);
-	set_tlbstate_cr3_lam_mask(lam_mask);
-}
-
 #define LAM_U57_BITS 6
 
 static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
@@ -780,36 +755,27 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
 	if (!cpu_feature_enabled(X86_FEATURE_LAM))
 		return -ENODEV;
 
-	if (mmap_write_lock_killable(mm))
-		return -EINTR;
-
-	/* Already enabled? */
-	if (mm->context.lam_cr3_mask) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags))
+		return -EBUSY;
 
 	if (mm_valid_pasid(mm) &&
-	    !(mm->context.flags & MM_CONTEXT_FORCE_TAGGED_SVA)) {
-		ret = -EBUSY;
-		goto out;
-	}
+	    !test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags))
+		return -EBUSY;
 
 	if (!nr_bits) {
-		ret = -EINVAL;
-		goto out;
+		return -EINVAL;
 	} else if (nr_bits <= LAM_U57_BITS) {
 		mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
 		mm->context.untag_mask =  ~GENMASK(62, 57);
 	} else {
-		ret = -EINVAL;
-		goto out;
+		return -EINVAL;
 	}
 
-	on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
+	set_tlbstate_cr3_lam_mask(mm->context.lam_cr3_mask);
+	set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
+
 	static_branch_enable(&tagged_addr_key);
-out:
-	mmap_write_unlock(mm);
 	return ret;
 }
 
@@ -906,10 +872,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_ENABLE_TAGGED_ADDR:
 		return prctl_enable_tagged_addr(task->mm, arg2);
 	case ARCH_FORCE_TAGGED_SVA:
-		if (mmap_write_lock_killable(task->mm))
-			return -EINTR;
-		task->mm->context.flags |= MM_CONTEXT_FORCE_TAGGED_SVA;
-		mmap_write_unlock(task->mm);
+		set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
 		return 0;
 	case ARCH_GET_MAX_TAG_BITS:
 		if (!cpu_feature_enabled(X86_FEATURE_LAM))
diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
index 8934c32e923c..bfe70d45dd1e 100644
--- a/drivers/iommu/iommu-sva-lib.c
+++ b/drivers/iommu/iommu-sva-lib.c
@@ -33,14 +33,8 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
 	    min == 0 || max < min)
 		return -EINVAL;
 
-	/* Serialize against address tagging enabling */
-	if (mmap_write_lock_killable(mm))
-		return -EINTR;
-
-	if (!arch_pgtable_dma_compat(mm)) {
-		mmap_write_unlock(mm);
+	if (!arch_pgtable_dma_compat(mm))
 		return -EBUSY;
-	}
 
 	mutex_lock(&iommu_sva_lock);
 	/* Is a PASID already associated with this mm? */
@@ -57,7 +51,6 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
 		mm_pasid_set(mm, pasid);
 out:
 	mutex_unlock(&iommu_sva_lock);
-	mmap_write_unlock(mm);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 52a876a7ccb8..93e6089164b6 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -12,6 +13,7 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <inttypes.h>
+#include <sched.h>
 
 #include <sys/uio.h>
 #include <linux/io_uring.h>
@@ -50,6 +52,8 @@
 
 #define PAGE_SIZE               (4 << 10)
 
+#define STACK_SIZE		65536
+
 #define barrier() ({						\
 		   __asm__ __volatile__("" : : : "memory");	\
 })
@@ -731,6 +735,75 @@ static int handle_inheritance(struct testcases *test)
 	return 0;
 }
 
+static int thread_fn_get_lam(void *arg)
+{
+	return get_lam();
+}
+
+static int thread_fn_set_lam(void *arg)
+{
+	struct testcases *test = arg;
+
+	return set_lam(test->lam);
+}
+
+static int handle_thread(struct testcases *test)
+{
+	char stack[STACK_SIZE];
+	int ret, child_ret;
+	int lam = 0;
+	pid_t pid;
+
+	/* Set LAM mode in parent process */
+	if (!test->later) {
+		lam = test->lam;
+		if (set_lam(lam) != 0)
+			return 1;
+	}
+
+	pid = clone(thread_fn_get_lam, stack + STACK_SIZE,
+		    SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM, NULL);
+	if (pid < 0) {
+		perror("Clone failed.");
+		return 1;
+	}
+
+	waitpid(pid, &child_ret, 0);
+	ret = WEXITSTATUS(child_ret);
+
+	if (lam != ret)
+		return 1;
+
+	if (test->later) {
+		if (set_lam(test->lam) != 0)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int handle_thread_enable(struct testcases *test)
+{
+	char stack[STACK_SIZE];
+	int ret, child_ret;
+	int lam = test->lam;
+	pid_t pid;
+
+	pid = clone(thread_fn_set_lam, stack + STACK_SIZE,
+		    SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM, test);
+	if (pid < 0) {
+		perror("Clone failed.");
+		return 1;
+	}
+
+	waitpid(pid, &child_ret, 0);
+	ret = WEXITSTATUS(child_ret);
+
+	if (lam != ret)
+		return 1;
+
+	return 0;
+}
 static void run_test(struct testcases *test, int count)
 {
 	int i, ret = 0;
@@ -846,6 +919,25 @@ static struct testcases inheritance_cases[] = {
 		.test_func = handle_inheritance,
 		.msg = "FORK: LAM_U57, child process should get LAM mode same as parent\n",
 	},
+	{
+		.expected = 0,
+		.lam = LAM_U57_BITS,
+		.test_func = handle_thread,
+		.msg = "THREAD: LAM_U57, child thread should get LAM mode same as parent\n",
+	},
+	{
+		.expected = 1,
+		.lam = LAM_U57_BITS,
+		.test_func = handle_thread_enable,
+		.msg = "THREAD: [NEGATIVE] Enable LAM in child.\n",
+	},
+	{
+		.expected = 1,
+		.later = 1,
+		.lam = LAM_U57_BITS,
+		.test_func = handle_thread,
+		.msg = "THREAD: [NEGATIVE] Enable LAM in parent after thread created.\n",
+	},
 	{
 		.expected = 0,
 		.lam = LAM_U57_BITS,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a587e34fc750..14d7e7d72f14 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1943,22 +1943,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-
-	/* Serialize against tagging enabling */
-	if (mmap_read_lock_killable(kvm->mm))
-		return -EINTR;
-
 	/* We can read the guest memory with __xxx_user() later on. */
 	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
 	    (mem->userspace_addr != untagged_addr(kvm->mm, mem->userspace_addr)) ||
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
-			mem->memory_size)) {
-		mmap_read_unlock(kvm->mm);
+			mem->memory_size))
 		return -EINVAL;
-	}
-
-	mmap_read_unlock(kvm->mm);
-
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
  
Linus Torvalds Dec. 16, 2022, 3:43 p.m. UTC | #9
On Fri, Dec 16, 2022 at 9:05 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Below is preliminary fixup that suppose to address the issue. It does not
> include change to untagged_addr() interface to avoid the clutter.

Looks like the right direction.

And once you change untagged_addr() to take 'tsk', you should then be
able to cache all the information in the thread struct, and avoid the
'tsk->mm' dereference entirely.

> kthread_use_mm() should be safe as long as no arch actually implements
> per-thread tagging enabling.

I think in a perfect world the (few) users of kthread_use_mm() would
also just make sure they did the locking of thing, so that they can't
have that race with somebody that then would enable LAM later.

             Linus
  
Jason A. Donenfeld Dec. 16, 2022, 8:09 p.m. UTC | #10
Hi Dave,

On Tue, Dec 13, 2022 at 09:42:34AM -0800, Dave Hansen wrote:
> This includes some new randomization of the per-cpu entry areas from
> Peter Z.  Without it, these areas are a tasty target for attackers.
> The entry code and mappings are especially tricky code and this has
> caused some issues along the way, but they have settled down.
>       x86/mm: Randomize per-cpu entry area

I've got a trivial patch teed up for late in the window that I wanted to
send in after Linus merges this pull. I was wondering whether you were
going to send in another x86/mm pull during the merge window (perhaps
without LAM in it?), or if this whole thing is going to wait for 6.3.
Either way is fine with me, but it'd save me a lot of F5 refreshing if I
knew what to expect.

Jason
  
Kirill A. Shutemov Dec. 17, 2022, 4:43 p.m. UTC | #11
On Fri, Dec 16, 2022 at 09:43:11AM -0600, Linus Torvalds wrote:
> On Fri, Dec 16, 2022 at 9:05 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > Below is preliminary fixup that suppose to address the issue. It does not
> > include change to untagged_addr() interface to avoid the clutter.
> 
> Looks like the right direction.
> 
> And once you change untagged_addr() to take 'tsk', you should then be
> able to cache all the information in the thread struct, and avoid the
> 'tsk->mm' dereference entirely.

Making untagged_addr() take task as an argument does not work well. We
don't have relevant task on hands in some places. The most notably GUP
only has 'mm'.

It is also not clear what untagging rules has to be applied: whoever runs
untagged_addr() or the target task/mm.

Up until now I choose the target mm rules. It makes things easier for
GDB, but it is not strong enough justification. Maybe always use rules of
'current' is the right way?


In other thread you suggested to make untagging in untagged_addr()
unconditional and I see how it solves the problem, but I don't think it is
good idea.

The current LAM patchset only enables LAM_U57 mode that doesn't compete
for real virtual address bits, but hardware also support LAM_U48 which
provides more tagging bits, but limits available address space for the
process. Unconditionally untag according to LAM_U48 rules is obviously
broken.

Although, x86 maintainers rejected LAM_U48 upstreaming as of now, I want
to keep door open in case a strong use case for it comes. I have code that
enables the mode stashed, just in case.
  

Patch

diff --cc arch/x86/kernel/ftrace.c
index 03579460d0ec,cf15ef5aecff..3d883eb989c7
--- a/arch/x86/kernel/ftrace.c
diff --cc arch/x86/kernel/kprobes/core.c
index e7b7ca64acdf,01b8d956aa76..66299682b6b7
--- a/arch/x86/kernel/kprobes/core.c