Message ID | 20221213174234.688534-1-dave.hansen@linux.intel.com |
---|---|
State | New |
Headers |
From: Dave Hansen <dave.hansen@linux.intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, kirill.shutemov@linux.intel.com, jejb@linux.ibm.com, martin.petersen@oracle.com
Subject: [GIT PULL] x86/mm for 6.2
Date: Tue, 13 Dec 2022 09:42:34 -0800
Message-Id: <20221213174234.688534-1-dave.hansen@linux.intel.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Series |
[GIT,PULL] x86/mm for 6.2
Pull-request
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.2

Commit Message
Dave Hansen
Dec. 13, 2022, 5:42 p.m. UTC
Hi Linus,

Please pull some x86/mm changes for 6.2.

This includes some new randomization of the per-cpu entry areas from Peter Z. Without it, these areas are a tasty target for attackers. The entry code and mappings are especially tricky code and this has caused some issues along the way, but they have settled down.

This also contains a new hardware feature: Linear Address Masking (LAM). It is similar conceptually to the ARM Top-Byte-Ignore (TBI) feature and should allow userspace memory sanitizers to be used with less overhead on x86.

LAM adds some pointer tag/untag functions which are used relatively widely in the tree, but were just stubs on x86 until now. The new functions also unearthed a SCSI build issue. There's a fix in the SCSI tree for it:

  4e80eef45ad7 ("scsi: sg: Fix get_user() in call sg_scsi_ioctl()")

but this should not be pulled without that patch being in place first.

Last, there is a merge conflict between the set_memory_rox() work in this branch and x86/core. I've included a suggested resolution below from Ingo.

---

Merge branch 'x86/core' into x86/mm, to resolve conflicts

Resolve conflicts between this commit in tip:x86/core:

  4c4eb3ecc91f ("x86/modules: Set VM_FLUSH_RESET_PERMS in module_alloc()")

... and this one in tip:x86/mm:

  1f6eae430528 mm: Introduce set_memory_rox()

The set_vm_flush_reset_perms() calls need to be removed in the set_memory_rox()-simplified code too.

Conflicts:
	arch/x86/kernel/ftrace.c
	arch/x86/kernel/kprobes/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

+++ b/arch/x86/kernel/ftrace.c
@@@ -413,9 -421,9 +421,8 @@@ create_trampoline(struct ftrace_ops *op
  	/* ALLOC_TRAMP flags lets us know we created it */
  	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;

- 	set_vm_flush_reset_perms(trampoline);
-
- 	if (likely(system_state != SYSTEM_BOOTING))
- 		set_memory_ro((unsigned long)trampoline, npages);
- 	set_memory_x((unsigned long)trampoline, npages);
+ 	set_memory_rox((unsigned long)trampoline, npages);
++
  	return (unsigned long)trampoline;
  fail:
  	tramp_free(trampoline);

+++ b/arch/x86/kernel/kprobes/core.c
@@@ -414,8 -414,12 +414,6 @@@ void *alloc_insn_page(void
  	if (!page)
  		return NULL;

- 	set_vm_flush_reset_perms(page);
- 	/*
- 	 * First make the page read-only, and only then make it executable to
- 	 * prevent it from being W+X in between.
- 	 */
- 	set_memory_ro((unsigned long)page, 1);
--
  	/*
  	 * TODO: Once additional kernel code protection mechanisms are set, ensure
  	 * that the page was not maliciously altered and it is still zeroed.

--

The following changes since commit 30a0b95b1335e12efef89dd78518ed3e4a71a763:

  Linux 6.1-rc3 (2022-10-30 15:19:28 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.2

for you to fetch changes up to ce66a02538f39f071443bac9bc6ff8f3a780ab92:

  x86/mm: Fix sparse warnings in untagged_ptr() (2022-11-28 15:12:25 -0800)

----------------------------------------------------------------
New Features:
 * Introduce Hardware-based Linear Address Masking (LAM)
 * Randomize the per-cpu entry areas

Cleanups:
 * Move to "native" set_memory_rox() helper
 * Clean up pmd_get_atomic() and i386-PAE
 * Remove some unused page table size macros

----------------------------------------------------------------
Andrey Ryabinin (1):
      x86/kasan: Map shadow for percpu pages on demand

Dave Hansen (1):
      x86/mm: Ensure forced page table splitting

Kirill A. Shutemov (12):
      x86/mm: Fix CR3_ADDR_MASK
      x86: CPUID and CR3/CR4 flags for Linear Address Masking
      mm: Pass down mm_struct to untagged_addr()
      x86/mm: Handle LAM on context switch
      x86/uaccess: Provide untagged_addr() and remove tags before address check
      KVM: Serialize tagged address check against tagging enabling
      x86/mm: Provide arch_prctl() interface for LAM
      x86/mm: Reduce untagged_addr() overhead until the first LAM user
      mm: Expose untagging mask in /proc/$PID/status
      iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
      x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
      x86/mm: Fix sparse warnings in untagged_ptr()

Pasha Tatashin (1):
      x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros

Peter Zijlstra (27):
      x86/mm: Randomize per-cpu entry area
      Merge branch 'v6.1-rc3'
      mm: Move mm_cachep initialization to mm_init()
      x86/mm: Use mm_alloc() in poking_init()
      x86/mm: Initialize text poking earlier
      x86/ftrace: Remove SYSTEM_BOOTING exceptions
      x86/mm: Do verify W^X at boot up
      mm: Introduce set_memory_rox()
      x86/mm: Implement native set_memory_rox()
      mm: Update ptep_get_lockless()'s comment
      x86/mm/pae: Make pmd_t similar to pte_t
      sh/mm: Make pmd_t similar to pte_t
      mm: Fix pmd_read_atomic()
      mm: Rename GUP_GET_PTE_LOW_HIGH
      mm: Rename pmd_read_atomic()
      mm/gup: Fix the lockless PMD access
      x86/mm/pae: Don't (ab)use atomic64
      x86/mm/pae: Use WRITE_ONCE()
      x86/mm/pae: Be consistent with pXXp_get_and_clear()
      x86_64: Remove pointless set_64bit() usage
      x86/mm/pae: Get rid of set_64bit()
      mm: Remove pointless barrier() after pmdp_get_lockless()
      mm: Convert __HAVE_ARCH_P..P_GET to the new style
      x86/mm: Add a few comments
      x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
      x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
      x86/mm: Rename __change_page_attr_set_clr(.checkalias)

Sean Christopherson (5):
      x86/mm: Recompute physical address for every page of per-CPU CEA mapping
      x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
      x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
      x86/kasan: Add helpers to align shadow addresses up and down
      x86/kasan: Populate shadow for shared chunk of the CPU entry area

Weihong Zhang (5):
      selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
      selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
      selftests/x86/lam: Add io_uring test cases for linear-address masking
      selftests/x86/lam: Add inherit test cases for linear-address masking
      selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

 arch/arm/mach-omap1/sram-init.c | 8 +-
 arch/arm/mach-omap2/sram.c | 8 +-
 arch/arm64/include/asm/memory.h | 4 +-
 arch/arm64/include/asm/mmu_context.h | 6 +
 arch/arm64/include/asm/signal.h | 2 +-
 arch/arm64/include/asm/uaccess.h | 2 +-
 arch/arm64/kernel/hw_breakpoint.c | 2 +-
 arch/arm64/kernel/traps.c | 4 +-
 arch/arm64/mm/fault.c | 10 +-
 arch/mips/Kconfig | 2 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
 arch/powerpc/kernel/kprobes.c | 9 +-
 arch/sh/Kconfig | 2 +-
 arch/sh/include/asm/pgtable-3level.h | 10 +-
 arch/sparc/include/asm/mmu_context_64.h | 6 +
 arch/sparc/include/asm/pgtable_64.h | 2 +-
 arch/sparc/include/asm/uaccess_64.h | 2 +
 arch/um/include/asm/pgtable-3level.h | 8 -
 arch/x86/Kconfig | 2 +-
 arch/x86/include/asm/cmpxchg_32.h | 28 -
 arch/x86/include/asm/cmpxchg_64.h | 5 -
 arch/x86/include/asm/cpu_entry_area.h | 4 -
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/kasan.h | 3 +
 arch/x86/include/asm/mmu.h | 12 +-
 arch/x86/include/asm/mmu_context.h | 47 +
 arch/x86/include/asm/page_types.h | 12 +-
 arch/x86/include/asm/pgtable-3level.h | 171 +---
 arch/x86/include/asm/pgtable-3level_types.h | 7 +
 arch/x86/include/asm/pgtable_64_types.h | 1 +
 arch/x86/include/asm/pgtable_areas.h | 8 +-
 arch/x86/include/asm/pgtable_types.h | 4 +-
 arch/x86/include/asm/processor-flags.h | 4 +-
 arch/x86/include/asm/set_memory.h | 3 +
 arch/x86/include/asm/tlbflush.h | 34 +
 arch/x86/include/asm/uaccess.h | 42 +-
 arch/x86/include/uapi/asm/prctl.h | 5 +
 arch/x86/include/uapi/asm/processor-flags.h | 6 +
 arch/x86/kernel/alternative.c | 10 -
 arch/x86/kernel/amd_gart_64.c | 2 +-
 arch/x86/kernel/ftrace.c | 6 +-
 arch/x86/kernel/head64.c | 2 +-
 arch/x86/kernel/hw_breakpoint.c | 2 +-
 arch/x86/kernel/kprobes/core.c | 9 +-
 arch/x86/kernel/process.c | 3 +
 arch/x86/kernel/process_64.c | 87 +-
 arch/x86/kernel/traps.c | 6 +-
 arch/x86/mm/cpu_entry_area.c | 50 +-
 arch/x86/mm/init.c | 2 +-
 arch/x86/mm/kasan_init_64.c | 53 +-
 arch/x86/mm/mem_encrypt_boot.S | 4 +-
 arch/x86/mm/mem_encrypt_identity.c | 18 +-
 arch/x86/mm/pat/set_memory.c | 105 +-
 arch/x86/mm/pti.c | 2 +-
 arch/x86/mm/tlb.c | 53 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
 drivers/gpu/drm/radeon/radeon_gem.c | 2 +-
 drivers/infiniband/hw/mlx4/mr.c | 2 +-
 drivers/iommu/intel/irq_remapping.c | 13 +-
 drivers/iommu/iommu-sva-lib.c | 16 +-
 drivers/media/common/videobuf2/frame_vector.c | 2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
 drivers/misc/sram-exec.c | 7 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
 drivers/tee/tee_shm.c | 2 +-
 drivers/vfio/vfio_iommu_type1.c | 2 +-
 fs/proc/array.c | 6 +
 fs/proc/task_mmu.c | 2 +-
 include/linux/filter.h | 3 +-
 include/linux/ioasid.h | 9 -
 include/linux/mm.h | 11 -
 include/linux/mmu_context.h | 14 +
 include/linux/pgtable.h | 73 +-
 include/linux/sched/mm.h | 8 +-
 include/linux/sched/task.h | 2 +-
 include/linux/set_memory.h | 10 +
 include/linux/uaccess.h | 15 +
 init/main.c | 4 +-
 kernel/bpf/bpf_struct_ops.c | 3 +-
 kernel/bpf/core.c | 6 +-
 kernel/bpf/trampoline.c | 3 +-
 kernel/events/core.c | 2 +-
 kernel/fork.c | 37 +-
 lib/strncpy_from_user.c | 2 +-
 lib/strnlen_user.c | 2 +-
 mm/Kconfig | 2 +-
 mm/gup.c | 8 +-
 mm/hmm.c | 3 +-
 mm/khugepaged.c | 2 +-
 mm/madvise.c | 2 +-
 mm/mapping_dirty_helpers.c | 2 +-
 mm/mempolicy.c | 6 +-
 mm/migrate.c | 2 +-
 mm/mincore.c | 2 +-
 mm/mlock.c | 4 +-
 mm/mmap.c | 2 +-
 mm/mprotect.c | 4 +-
 mm/mremap.c | 2 +-
 mm/msync.c | 2 +-
 mm/userfaultfd.c | 2 +-
 mm/vmscan.c | 5 +-
 net/bpf/bpf_dummy_struct_ops.c | 3 +-
 tools/testing/selftests/x86/Makefile | 2 +-
 tools/testing/selftests/x86/lam.c | 1149 ++++++++++++++++++++++
 virt/kvm/kvm_main.c | 14 +-
 106 files changed, 1901 insertions(+), 485 deletions(-)
 create mode 100644 tools/testing/selftests/x86/lam.c
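As background for the discussion in the comments below: user space opts into LAM per-process and then simply stores metadata in the ignored pointer bits. The sketch below is illustrative only and is not taken from tools/testing/selftests/x86/lam.c; it assumes the ARCH_ENABLE_TAGGED_ADDR arch_prctl() added by this series (the constant's value is an assumption here and should come from the patched asm/prctl.h), LAM_U57's six tag bits in 62:57, and LAM-capable hardware running this branch.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Added by this series in arch/x86/include/uapi/asm/prctl.h; the value
     * shown here is an assumption -- use the real uapi header. */
    #ifndef ARCH_ENABLE_TAGGED_ADDR
    #define ARCH_ENABLE_TAGGED_ADDR 0x4002
    #endif

    #define LAM_U57_BITS 6          /* tag lives in bits 62:57 */

    int main(void)
    {
            /* Ask for LAM_U57 for this (still single-threaded) process. */
            if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, LAM_U57_BITS)) {
                    perror("arch_prctl(ARCH_ENABLE_TAGGED_ADDR)");
                    return 1;
            }

            char *p = malloc(64);
            strcpy(p, "hello");

            /* Stash a 6-bit tag in the ignored bits; the hardware masks it
             * off when the pointer is dereferenced. */
            char *tagged = (char *)((uintptr_t)p | (0x2aULL << 57));
            printf("%s\n", tagged);

            free(p);
            return 0;
    }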
Comments
On Tue, Dec 13, 2022 at 9:43 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> This also contains a new hardware feature: Linear Address Masking
> (LAM). It is similar conceptually to the ARM Top-Byte-Ignore (TBI)
> feature and should allow userspace memory sanitizers to be used
> with less overhead on x86.

Christ.

Is it too late to ask Intel to call this "Top-Bits-Ignore", and instead of adding another crazy TLA, we'd just all agree to call this "TBI"?

I know, I know, NIH and all that, but at least as long as we are limiting ourselves to regular US-ASCII, we really only have 17576 TLA's to go around, and at some point it gets not only confusing, but really quite wasteful, to have everybody make up their own architecture-specific TLA.

And while I'm on the subject: I really think that the changes to "untagged_addr()" are fundamentally broken.

Why? That whole LAM (or BTI) is not necessarily per-mm. It can easily be per-*thread*.

Imagine, if you will, a setup where you have some threads that use tagged pointers, and some threads that don't.

For example, maybe the upper bits of the address contains a tag that is used only used within a virtual machine? You could even have the "native" mode use the full address space, and put itself and its private data in the upper bits virtually.

IOW, imagine using the virtual address masking as not just memory sanitizers, but as an actual honest-to-goodness separation feature (eg JITed code might fundamentally have access only to the lower bits, while the JITter itself sees the whole address space).

Maybe that's not how LAM works on x86, but your changes to untagged_addr() are *not* x86-specific.

So I really think this is completely wrong, quite aside from the naming. It just makes assumptions that aren't valid.

The fact that you made this mm-specific actually ends up being an active bug in the code, even on x86-64. You use the mmap lock to serialize this all in prctl_enable_tagged_addr(), but then the read side (ie just untagged_addr()) isn't actually serialized at all - and *shouldn't* be serialized.

So I really think this is a fundamental design mistake, and while I pulled it and sorted out the trivial conflicts, I've unpulled it again as being actively mis-designed.

                Linus
On Wed, Dec 14, 2022 at 02:36:04PM -0800, Linus Torvalds wrote:
> On Tue, Dec 13, 2022 at 9:43 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >
> > This also contains a new hardware feature: Linear Address Masking
> > (LAM). It is similar conceptually to the ARM Top-Byte-Ignore (TBI)
> > feature and should allow userspace memory sanitizers to be used
> > with less overhead on x86.
>
> Christ.
>
> Is it too late to ask Intel to call this "Top-Bits-Ignore", and
> instead of adding another crazy TLA, we'd just all agree to call this
> "TBI"?

The very top bit 63 is not ignored in LAM case. And in ARM case, TBI is Top Byte Ignore, not Bits implies number of ignored bits and their placement. The same TLA that means different thing does not help situation IMO.

I'm not sure if we can change how Intel calls the feature at this point, but we can change kernel nomenclature if needed.

BTW, we already have somewhat related feature on Sparc that called ADI which also hooks up into untagged_addr() machinery. Maybe it is okay to have architecture-specific terminology as long as it stays in arch code? Per-arch TLA namespace :P

> I know, I know, NIH and all that, but at least as long as we are
> limiting ourselves to regular US-ASCII, we really only have 17576
> TLA's to go around, and at some point it gets not only confusing, but
> really quite wasteful, to have everybody make up their own
> architecture-specific TLA.
>
> And while I'm on the subject: I really think that the changes to
> "untagged_addr()" are fundamentally broken.
>
> Why? That whole LAM (or BTI) is not necessarily per-mm. It can easily
> be per-*thread*.
>
> Imagine, if you will, a setup where you have some threads that use
> tagged pointers, and some threads that don't.
>
> For example, maybe the upper bits of the address contains a tag that
> is used only used within a virtual machine? You could even have the
> "native" mode use the full address space, and put itself and its
> private data in the upper bits virtually.
>
> IOW, imagine using the virtual address masking as not just memory
> sanitizers, but as an actual honest-to-goodness separation feature (eg
> JITed code might fundamentally have access only to the lower bits,
> while the JITter itself sees the whole address space).
>
> Maybe that's not how LAM works on x86, but your changes to
> untagged_addr() are *not* x86-specific.

My initial enabling allowed userspace to enable the feature per-thread. I pointed out the same potential use -- JIT -- case as you did[1]. But it made harder to handle corner cases. Like, it is not obvious what LAM setup kernel thread has to use when acts on behalf of the process (io_uring for instance). I ended up with using the most permissive LAM mode (that allowed the most tag bits), but it is clearly a hack.

Thomas pointed[2] that out and following versions used per-process approach. It simplified the patchset substantially.

Speaking about, untagged_addr(), how do you see the interface should look like? It has to work correctly when called from a kernel thread. I can only think of a hack that makes it branch on task->flags & PF_KTHREAD. Hm?

[1] https://lore.kernel.org/all/20220513225936.qo4cy6sijqpzmvpt@black.fi.intel.com
[2] https://lore.kernel.org/all/878rr6x4iu.ffs@tglx

> So I really think this is completely wrong, quite aside from the
> naming. It just makes assumptions that aren't valid.
>
> The fact that you made this mm-specific actually ends up being an
> active bug in the code, even on x86-64. You use the mmap lock to
> serialize this all in prctl_enable_tagged_addr(), but then the read
> side (ie just untagged_addr()) isn't actually serialized at all - and
> *shouldn't* be serialized.

mmap lock in prctl_enable_tagged_addr() serializes LAM enabling against PASID allocation for the process as SVA and LAM is not currently compatible. I re-used the locking for KVM serialization where it wants to make sure the userspace_addr has no tag bits set. But I just realized that replacing access_ok() with __access_ok() should work too, no need for unagged_addr() dance. I will rewrite the patch.

Normal untagged_addr() usage doesn't require serialization: IPI in prctl_enable_tagged_addr() makes mm->context.untag_mask visible to all CPUs that runs the mm before it returns. And we only allow one-shot LAM enabling. It cannot be disabled later on.

Could you elaborate, what active bug do you see?

> So I really think this is a fundamental design mistake, and while I
> pulled it and sorted out the trivial conflicts, I've unpulled it again
> as being actively mis-designed.
>
>                 Linus
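For reference when following the race discussion: the x86 untagging this series adds to arch/x86/include/asm/uaccess.h boils down to an AND with a per-mm untag mask, plus a sign-extension trick so that kernel (negative) addresses pass through unchanged. The stand-alone model below mirrors that logic for illustration; it is a user-space sketch, not the kernel macro itself.

    #include <stdint.h>
    #include <stdio.h>

    /* LAM_U57: bits 62:57 carry the tag, so the untag mask clears them. */
    #define LAM_U57_UNTAG_MASK      (~(0x3fULL << 57))

    /* Model of untagged_addr(): user addresses (bit 63 clear) get the tag
     * bits masked out; kernel addresses (bit 63 set) are left untouched,
     * because the sign extension ORs all ones into the mask. */
    static uint64_t untag(uint64_t addr, uint64_t untag_mask)
    {
            int64_t sign = (int64_t)addr >> 63;     /* 0 for user, ~0 for kernel */

            return addr & (untag_mask | (uint64_t)sign);
    }

    int main(void)
    {
            uint64_t tagged = 0x00007f1234567890ULL | (0x15ULL << 57);

            printf("%#llx -> %#llx\n", (unsigned long long)tagged,
                   (unsigned long long)untag(tagged, LAM_U57_UNTAG_MASK));
            return 0;
    }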
On Thu, Dec 15, 2022 at 4:30 AM <kirill.shutemov@linux.intel.com> wrote:
>
> My initial enabling allowed userspace to enable the feature per-thread.
> I pointed out the same potential use -- JIT -- case as you did[1].
> But it made harder to handle corner cases. Like, it is not obvious what
> LAM setup kernel thread has to use when acts on behalf of the process
> (io_uring for instance). I ended up with using the most permissive LAM
> mode (that allowed the most tag bits), but it is clearly a hack.

... and doing it per-mm causes more corner cases. Look at the - actively buggy - uses of the mask outside the mm lock. And no, "take the mm lock for reading every time you want to do untagged_addr()" isn't the answer either.

I didn't look at every single use of 'untagged_addr()', but I did look at a couple, and every single one of the ones I looked at was racy and wrong.

Does the race _matter_? Almost certainly not. Nobody will ever care. It's something we would have happily ignored for years, and then sighed, and added a READ_ONCE() for to make KCSAN happy.

But when I get a new pull request, and the first thing I look at is fundamentally racy, just take a wild guess at how happy it makes me about that pull request?

> Thomas pointed[2] that out and following versions used per-process
> approach. It simplified the patchset substantially.

No, it sure as hell didn't. It "simplified" things by just introducing another bug instead.

> Speaking about, untagged_addr(), how do you see the interface should look
> like? It has to work correctly when called from a kernel thread. I can
> only think of a hack that makes it branch on task->flags & PF_KTHREAD.

So I think that the current branch contents I see are simply not an option. I do not want to pull code that when I look at it, I see a data race on context.untag_mask .

I also find the "pass in mm" to be fundamentally objectionable, because even if that then ends up being the right approach for x86, this interface really isn't x86-centric.

Yes, arm64 ends up making it all unconditional, which simplifies things enormously and just means that arm64 doesn't care at all. But what if somebody else comes along, and really does want to make it per-thread?

So I'd be ok with an interface that passes in 'struct task_struct', and then x86 could use "tsk->mm" to get the mm if that really is the interface x86 people want. Because then at least the *interface* isn't hardcoded to be broken.

So those are my two fundamental objections against the current LAM code. An outright bug in x86 code, and a bad interface in non-x86 code.

Here's *one* suggested solution:

 - pass in the 'task' for the target for untagged_addr(), so that we're not locked into one model

 - if x86 really wants to make thing per-mm, let it, but have a "lock" operation that says "now my setting is set in stone". You already do that in prctl_enable_tagged_addr() with that

        /* Already enabled? */
        if (mm->context.lam_cr3_mask) {
                ret = -EBUSY;
                goto out;
        }

   just expose that as a thing.

 - make thread creation lock it (and unlock it on creating a new mm at execve, but not fork).

And yes, I would actually suggest that _any_ thread creation locks it, so that you never *EVER* have any issues with "oh, now I need to synchronize with other threads". A process can set its LAM state at startup, not in the middle of running!

Note that io_uring will automatically do that locking, because it does that thread creation (create_io_thread()). So io_uring threads won't even be a special case.

You could go even further, and just make it a ELF binary flag, and basically have that LAM thing be a "when creating the mm, this is what it gets created as". That is probably what most people would actually want, but it sounds painful from a bootstrapping standpoint.

Now, the above is just a suggestion off the top of my head. It simplifies things a lot, in that now even if the state is "per mm", you can actually *cache* it per-thread, and there are no data races.

It also means that prctl_enable_tagged_addr() doesn't have to worry about serialization and that whole

        on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);

thing.

> Normal untagged_addr() usage doesn't require serialization: IPI in
> prctl_enable_tagged_addr() makes mm->context.untag_mask visible to all
> CPUs that runs the mm before it returns.

BS.

That operation is not serialized IN ANY WAY with the other threads reading the value. For all you know, untagged_addr() may be reading that value multiple times due to compiler reloads, and getting different values between two reads, and doing random things because you use one value for one thing, and another for the later "find_vma()" when the untagged address is actually used.

The IPI does *nothing*. It serializes the cores, but since there is no actual locking (or even a READ_ONCE() or anything), it doesn't actually do anything on a software level. The code is buggy. Really.

It's likely a "this works in practice because reloads are unlikely and you wouldn't hit them anyway" kind of bug, but no, that IPI does not help.

(Of course, the IPI helps in that now at least the cr3 value is synchronized, so the IPI is needed for the *hardware* accesses, but doesn't help software use of mm->context.untag_mask).

                Linus
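The reload problem described above is the usual plain-load-of-shared-data issue: nothing stops the compiler from reading mm->context.untag_mask more than once inside one untagged_addr() expansion, and the two reads can straddle the prctl update. A minimal user-space illustration follows; the READ_ONCE() here is the common volatile-cast idiom, not the kernel's (more elaborate) definition.

    #include <stdint.h>

    #define READ_ONCE(x)    (*(const volatile __typeof__(x) *)&(x))

    struct ctx {
            uint64_t untag_mask;    /* updated by the prctl on another CPU */
    };

    /* Racy: the compiler may legally reload ctx->untag_mask between the
     * test and the masking, so the two uses can disagree. */
    static uint64_t untag_racy(struct ctx *ctx, uint64_t addr)
    {
            if (addr & ~ctx->untag_mask)            /* first read */
                    addr &= ctx->untag_mask;        /* possibly a second read */
            return addr;
    }

    /* Snapshot once: every use within the call sees one consistent value. */
    static uint64_t untag_once(struct ctx *ctx, uint64_t addr)
    {
            uint64_t mask = READ_ONCE(ctx->untag_mask);

            if (addr & ~mask)
                    addr &= mask;
            return addr;
    }

    int main(void)
    {
            struct ctx c = { .untag_mask = ~(0x3fULL << 57) };
            uint64_t a = 0x00007f0000001000ULL | (1ULL << 57);

            return untag_racy(&c, a) != untag_once(&c, a);
    }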
On Thu, Dec 15, 2022 at 9:17 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Here's *one* suggested solution:

Note again: this is not a "you need to do it this way" suggestion. This is just a "at least this way doesn't have the issues I object to". There are bound to be other ways to do it.

But if you feel like all threads have to share the same LAM state, it does seem a lot simpler if you just say "you need to set that state before you start any threads". No?

> And yes, I would actually suggest that _any_ thread creation locks it,
> so that you never *EVER* have any issues with "oh, now I need to
> synchronize with other threads". A process can set its LAM state at
> startup, not in the middle of running!

Note that this "no serialization needed" is just about the SW side. The *hardware* side may still need the IPI just to make sure that it forces a TLB flush - even if we are single-threaded, that single thread may have run on other CPU's before.

But I think at that point it's just a regular TLB flush, and doesn't need that LAM-specific IPI. But maybe there's some bigger HW serialization that is needed for the LAM bit, I have not looked at that enough to know.

                Linus
On 12/15/22 09:26, Linus Torvalds wrote:
> But if you feel like all threads have to share the same LAM state, it
> does seem a lot simpler if you just say "you need to set that state
> before you start any threads". No?

That would be a lot simpler. I have one bit of hesitation there, though.

Back in the MPX days, we had some users pop up and tell us that MPX wasn't working for them on certain threads. Those threads ended up having been clone()'d even _before_ MPX got enabled which was via some pre-main() startup code that the compiler inserted. Since these early threads' FPU state was copied before MPX setup was done, they never got MPX-enabled.

Right or wrong, since then I've basically assumed that someone might be creating threads behind my back.

Maybe we call those "early thread" folks too crazy to get LAM. Maybe I need to forget it ever happened, but they were actual users that got bitten and cared enough to report it. Or, heck, maybe I'm just delusional because I can't find any trace of this conversation in the list archives.

LAM _is_ slightly different than what MPX hit, though, since instead of totally silent breakage the app can whine about the LAM prctl() having failed.

Anyway, message heard loud and clear about the untagged_addr() races and the interfaces. We'll find some way to fix those up for the next merge window.
On Thu, Dec 15, 2022 at 1:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Back in the MPX days, we had some users pop up and tell us that MPX
> wasn't working for them on certain threads. Those threads ended up
> having been clone()'d even _before_ MPX got enabled which was via some
> pre-main() startup code that the compiler inserted. Since these early
> threads' FPU state was copied before MPX setup was done, they never got
> MPX-enabled.

Yeah, I can see that happening, but I do think:

> Maybe we call those "early thread" folks too crazy to get LAM.

I think we should at least *start* that way.

Yes, I can imagine some early LD linkage thing deciding "this early dynamic linking is very expensive, I will thread it". I personally think that's a bit crazy - if your early linking is that expensive, you have some serious design issues - but even if it does happen, I'd rather start with very strict rules, see if that works out, and then if we hit problems we have other alternatives.

Those other alternatives could involve relaxing the rules later and saying "ok, we'll allow you to enable LAM even in this case, because you only did Xyz".

But another alternative could also be to just have that LAM enabled even earlier by adding an ELF flag, so that the address masking is essentially set up at execve() time.

And independently of that, there's another issue: I suspect you want separate the concept of "the kernel will mask virtual addresses in certain system calls" from "the CPU will ignore those masked bits"

And I say that because that's basically what arm64 does. On arm64, the 'untagged_addr()' masking is always enabled, but whether the actual CPU hardware ignores the flags when virtually addressed is a separate thing that you need to do the prctl() for.

Put another way: you can always pass tagged addresses to things like mmap() and friends, *even if* those tagged addresses would cause a fault when accessed normally, and wouldn't work for "read()" and 'write()" and friends (because read/write uses regular CPU accesses).

Now, the Intel LAM model is more complicated, because the mask is dynamic, and because there are people who want to use the full linear address space with no masking.

But the whole "want to use the full linear address space" issue is actually in some ways closer to the "use five-level page tables" decision - in that by *default* we don't use VA space above 47, and you have to basically actively *ask* for that 5-level address space.

And I think the LAM model might want to take that same approach: maybe by *default*, untagged_addr() always does the masking, and x86 takes the arm64 approach.

And then the truly *special* apps that want unmasked addresses are the ones who mark themselves special, kind of the same way they already have to if they want 5-level paging.

And again: untagged_addr() doing masking does *not* necessarily mean that the CPU does. I think arm64 does actually set the TBI bit by default, but there's very much a prctl() to enable/disable it. But that "TBI is disabled" doesn't actually mean that the mmap() address masking is disabled.

I dunno. I'm really just saying "look, arm64 does this very differently, and it seems to work fine there".

                Linus
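For comparison, the arm64 behaviour referred to above amounts to unconditionally sign-extending user pointers from bit 55, i.e. the top byte is always dropped by untagged_addr() regardless of whether the task has asked for hardware TBI via prctl(). The helper below is a rough paraphrase of arm64's sign_extend64(addr, 55)-based definition, written as plain user-space C for illustration rather than copied from arch/arm64.

    #include <stdint.h>

    /* arm64-style untagging, paraphrased: AND the address with its own
     * sign extension from bit 55. User addresses (bit 55 clear) lose the
     * top byte; kernel addresses (top bits all ones) come back unchanged. */
    static inline uint64_t arm64_style_untag(uint64_t addr)
    {
            uint64_t sext = (uint64_t)((int64_t)(addr << 8) >> 8);

            return addr & sext;
    }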
On Thu, Dec 15, 2022 at 09:17:11AM -0800, Linus Torvalds wrote:
> - make thread creation lock it (and unlock it on creating a new mm at
> execve, but not fork).
>
> And yes, I would actually suggest that _any_ thread creation locks it,
> so that you never *EVER* have any issues with "oh, now I need to
> synchronize with other threads". A process can set its LAM state at
> startup, not in the middle of running!

By thread creation, you mean CLONE_VM, right?

Do we care to avoid locking for CLONE_VFORK case? Seems excessive.

> Note that io_uring will automatically do that locking, because it does
> that thread creation (create_io_thread()). So io_uring threads won't
> even be a special case.

I've missed that io_uring does not use kthread_use_mm() anymore. But what about other kthread_use_mm() users?

I did not find concrete cases where missing LAM setup would breaks things, but I cannot prove opposite too.

kthread_use_mm() usage is suspicious in amdkfd, but I donno.

io_uring was a clear before it moved away from kthread_use_mm().
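For context on the kthread_use_mm() question: a kernel thread doing work on behalf of a process temporarily adopts that process's mm roughly as sketched below. This uses the existing kthread_use_mm()/kthread_unuse_mm() API and is not code from the series; while it runs, "current" is a kthread, so any per-thread tagging state would be absent and only the mm (plus whatever is cached in it) is available.

    #include <linux/kthread.h>
    #include <linux/mm_types.h>

    /* Sketch: a kernel thread accessing user memory of 'mm'. Which LAM /
     * untagging setup applies here is exactly the open question above. */
    static void do_work_on_behalf_of(struct mm_struct *mm)
    {
            kthread_use_mm(mm);

            /* ... copy_from_user()/copy_to_user() against 'mm' ... */

            kthread_unuse_mm(mm);
    }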
On Fri, Dec 16, 2022 at 05:16:45AM +0300, kirill.shutemov@linux.intel.com wrote: > On Thu, Dec 15, 2022 at 09:17:11AM -0800, Linus Torvalds wrote: > > - make thread creation lock it (and unlock it on creating a new mm at > > execve, but not fork). > > > > And yes, I would actually suggest that _any_ thread creation locks it, > > so that you never *EVER* have any issues with "oh, now I need to > > synchronize with other threads". A process can set its LAM state at > > startup, not in the middle of running! > > By thread creation, you mean CLONE_VM, right? > > Do we care to avoid locking for CLONE_VFORK case? Seems excessive. > > > Note that io_uring will automatically do that locking, because it does > > that thread creation (create_io_thread()). So io_uring threads won't > > even be a special case. > > I've missed that io_uring does not use kthread_use_mm() anymore. But what > about other kthread_use_mm() users? > > I did not find concrete cases where missing LAM setup would breaks things, > but I cannot prove opposite too. > > kthread_use_mm() usage is suspicious in amdkfd, but I donno. > > io_uring was a clear before it moved away from kthread_use_mm(). Below is preliminary fixup that suppose to address the issue. It does not include change to untagged_addr() interface to avoid the clutter. LAM mode kept per-mm, but cannot be changed after the first thread has spawn (or LAM enabled). Bit in mm->context.flags is used to indicate LAM mode lock. I've added few test cases for new limitations around threads. kthread_use_mm() should be safe as long as no arch actually implements per-thread tagging enabling. Any comments? If looks okay, I will integrate the fixup into the LAM patchset properly. Or maybe you want an incremental patchset on top? diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c index 4af81df133ee..ed320e145cf9 100644 --- a/arch/x86/entry/vsyscall/vsyscall_64.c +++ b/arch/x86/entry/vsyscall/vsyscall_64.c @@ -317,7 +317,7 @@ static struct vm_area_struct gate_vma __ro_after_init = { struct vm_area_struct *get_gate_vma(struct mm_struct *mm) { #ifdef CONFIG_COMPAT - if (!mm || !(mm->context.flags & MM_CONTEXT_HAS_VSYSCALL)) + if (!mm || test_bit(MM_CONTEXT_HAS_VSYSCALL, &mm->context.flags)) return NULL; #endif if (vsyscall_mode == NONE) diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 1215b0a714c9..4e7e8d9893de 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -9,11 +9,13 @@ #include <linux/bits.h> /* Uprobes on this MM assume 32-bit code */ -#define MM_CONTEXT_UPROBE_IA32 BIT(0) +#define MM_CONTEXT_UPROBE_IA32 0 /* vsyscall page is accessible on this MM */ -#define MM_CONTEXT_HAS_VSYSCALL BIT(1) +#define MM_CONTEXT_HAS_VSYSCALL 1 /* Allow LAM and SVA coexisting */ -#define MM_CONTEXT_FORCE_TAGGED_SVA BIT(2) +#define MM_CONTEXT_FORCE_TAGGED_SVA 2 +/* Do not allow changing LAM mode */ +#define MM_CONTEXT_LOCK_LAM 3 /* * x86 has arch-specific MMU state beyond what lives in mm_struct. 
@@ -41,7 +43,7 @@ typedef struct { #endif #ifdef CONFIG_X86_64 - unsigned short flags; + unsigned long flags; /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */ unsigned long lam_cr3_mask; diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 32483ab3f763..4bc95c35cbd3 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -118,7 +118,7 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm) static inline bool arch_pgtable_dma_compat(struct mm_struct *mm) { return !mm_lam_cr3_mask(mm) || - (mm->context.flags & MM_CONTEXT_FORCE_TAGGED_SVA); + test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags); } #else @@ -228,7 +228,7 @@ static inline void arch_exit_mmap(struct mm_struct *mm) static inline bool is_64bit_mm(struct mm_struct *mm) { return !IS_ENABLED(CONFIG_IA32_EMULATION) || - !(mm->context.flags & MM_CONTEXT_UPROBE_IA32); + !test_bit(MM_CONTEXT_UPROBE_IA32, &mm->context.flags); } #else static inline bool is_64bit_mm(struct mm_struct *mm) diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index bd92e1ee1c1a..6286885dc1e2 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -35,7 +35,7 @@ DECLARE_STATIC_KEY_FALSE(tagged_addr_key); u64 __addr = (__force u64)(addr); \ if (static_branch_likely(&tagged_addr_key)) { \ s64 sign = (s64)__addr >> 63; \ - __addr &= (mm)->context.untag_mask | sign; \ + __addr &= READ_ONCE((mm)->context.untag_mask) | sign; \ } \ (__force __typeof__(addr))__addr; \ }) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index d1e83ba21130..9bcec8195714 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -162,6 +162,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args) savesegment(es, p->thread.es); savesegment(ds, p->thread.ds); + + if (p->mm && (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) + set_bit(MM_CONTEXT_LOCK_LAM, &p->mm->context.flags); #else p->thread.sp0 = (unsigned long) (childregs + 1); savesegment(gs, p->thread.gs); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index b06de18215ce..7b8df4769486 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -671,7 +671,7 @@ void set_personality_64bit(void) task_pt_regs(current)->orig_ax = __NR_execve; current_thread_info()->status &= ~TS_COMPAT; if (current->mm) - current->mm->context.flags = MM_CONTEXT_HAS_VSYSCALL; + __set_bit(MM_CONTEXT_HAS_VSYSCALL, ¤t->mm->context.flags); /* TBD: overwrites user setup. Should have two bits. But 64bit processes have always behaved this way, @@ -708,7 +708,7 @@ static void __set_personality_ia32(void) * uprobes applied to this MM need to know this and * cannot use user_64bit_mode() at that time. */ - current->mm->context.flags = MM_CONTEXT_UPROBE_IA32; + __set_bit(MM_CONTEXT_UPROBE_IA32, ¤t->mm->context.flags); } current->personality |= force_personality32; @@ -746,31 +746,6 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr) DEFINE_STATIC_KEY_FALSE(tagged_addr_key); EXPORT_SYMBOL_GPL(tagged_addr_key); -static void enable_lam_func(void *mm) -{ - struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); - unsigned long lam_mask; - unsigned long cr3; - - if (loaded_mm != mm) - return; - - lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask); - - /* - * Update CR3 to get LAM active on the CPU. 
- * - * This might not actually need to update CR3 if a context switch - * happened between updating 'lam_cr3_mask' and running this IPI - * handler. Update it unconditionally for simplicity. - */ - cr3 = __read_cr3(); - cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57); - cr3 |= lam_mask; - write_cr3(cr3); - set_tlbstate_cr3_lam_mask(lam_mask); -} - #define LAM_U57_BITS 6 static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) @@ -780,36 +755,27 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) if (!cpu_feature_enabled(X86_FEATURE_LAM)) return -ENODEV; - if (mmap_write_lock_killable(mm)) - return -EINTR; - - /* Already enabled? */ - if (mm->context.lam_cr3_mask) { - ret = -EBUSY; - goto out; - } + if (test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags)) + return -EBUSY; if (mm_valid_pasid(mm) && - !(mm->context.flags & MM_CONTEXT_FORCE_TAGGED_SVA)) { - ret = -EBUSY; - goto out; - } + !test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags)) + return -EBUSY; if (!nr_bits) { - ret = -EINVAL; - goto out; + return -EINVAL; } else if (nr_bits <= LAM_U57_BITS) { mm->context.lam_cr3_mask = X86_CR3_LAM_U57; mm->context.untag_mask = ~GENMASK(62, 57); } else { - ret = -EINVAL; - goto out; + return -EINVAL; } - on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true); + write_cr3(__read_cr3() | mm->context.lam_cr3_mask); + set_tlbstate_cr3_lam_mask(mm->context.lam_cr3_mask); + set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags); + static_branch_enable(&tagged_addr_key); -out: - mmap_write_unlock(mm); return ret; } @@ -906,10 +872,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2) case ARCH_ENABLE_TAGGED_ADDR: return prctl_enable_tagged_addr(task->mm, arg2); case ARCH_FORCE_TAGGED_SVA: - if (mmap_write_lock_killable(task->mm)) - return -EINTR; - task->mm->context.flags |= MM_CONTEXT_FORCE_TAGGED_SVA; - mmap_write_unlock(task->mm); + set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags); return 0; case ARCH_GET_MAX_TAG_BITS: if (!cpu_feature_enabled(X86_FEATURE_LAM)) diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c index 8934c32e923c..bfe70d45dd1e 100644 --- a/drivers/iommu/iommu-sva-lib.c +++ b/drivers/iommu/iommu-sva-lib.c @@ -33,14 +33,8 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max) min == 0 || max < min) return -EINVAL; - /* Serialize against address tagging enabling */ - if (mmap_write_lock_killable(mm)) - return -EINTR; - - if (!arch_pgtable_dma_compat(mm)) { - mmap_write_unlock(mm); + if (!arch_pgtable_dma_compat(mm)) return -EBUSY; - } mutex_lock(&iommu_sva_lock); /* Is a PASID already associated with this mm? 
*/ @@ -57,7 +51,6 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max) mm_pasid_set(mm, pasid); out: mutex_unlock(&iommu_sva_lock); - mmap_write_unlock(mm); return ret; } EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid); diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c index 52a876a7ccb8..93e6089164b6 100644 --- a/tools/testing/selftests/x86/lam.c +++ b/tools/testing/selftests/x86/lam.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <string.h> @@ -12,6 +13,7 @@ #include <sys/stat.h> #include <fcntl.h> #include <inttypes.h> +#include <sched.h> #include <sys/uio.h> #include <linux/io_uring.h> @@ -50,6 +52,8 @@ #define PAGE_SIZE (4 << 10) +#define STACK_SIZE 65536 + #define barrier() ({ \ __asm__ __volatile__("" : : : "memory"); \ }) @@ -731,6 +735,75 @@ static int handle_inheritance(struct testcases *test) return 0; } +static int thread_fn_get_lam(void *arg) +{ + return get_lam(); +} + +static int thread_fn_set_lam(void *arg) +{ + struct testcases *test = arg; + + return set_lam(test->lam); +} + +static int handle_thread(struct testcases *test) +{ + char stack[STACK_SIZE]; + int ret, child_ret; + int lam = 0; + pid_t pid; + + /* Set LAM mode in parent process */ + if (!test->later) { + lam = test->lam; + if (set_lam(lam) != 0) + return 1; + } + + pid = clone(thread_fn_get_lam, stack + STACK_SIZE, + SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM, NULL); + if (pid < 0) { + perror("Clone failed."); + return 1; + } + + waitpid(pid, &child_ret, 0); + ret = WEXITSTATUS(child_ret); + + if (lam != ret) + return 1; + + if (test->later) { + if (set_lam(test->lam) != 0) + return 1; + } + + return 0; +} + +static int handle_thread_enable(struct testcases *test) +{ + char stack[STACK_SIZE]; + int ret, child_ret; + int lam = test->lam; + pid_t pid; + + pid = clone(thread_fn_set_lam, stack + STACK_SIZE, + SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM, test); + if (pid < 0) { + perror("Clone failed."); + return 1; + } + + waitpid(pid, &child_ret, 0); + ret = WEXITSTATUS(child_ret); + + if (lam != ret) + return 1; + + return 0; +} static void run_test(struct testcases *test, int count) { int i, ret = 0; @@ -846,6 +919,25 @@ static struct testcases inheritance_cases[] = { .test_func = handle_inheritance, .msg = "FORK: LAM_U57, child process should get LAM mode same as parent\n", }, + { + .expected = 0, + .lam = LAM_U57_BITS, + .test_func = handle_thread, + .msg = "THREAD: LAM_U57, child thread should get LAM mode same as parent\n", + }, + { + .expected = 1, + .lam = LAM_U57_BITS, + .test_func = handle_thread_enable, + .msg = "THREAD: [NEGATIVE] Enable LAM in child.\n", + }, + { + .expected = 1, + .later = 1, + .lam = LAM_U57_BITS, + .test_func = handle_thread, + .msg = "THREAD: [NEGATIVE] Enable LAM in parent after thread created.\n", + }, { .expected = 0, .lam = LAM_U57_BITS, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index a587e34fc750..14d7e7d72f14 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1943,22 +1943,12 @@ int __kvm_set_memory_region(struct kvm *kvm, return -EINVAL; if (mem->guest_phys_addr & (PAGE_SIZE - 1)) return -EINVAL; - - /* Serialize against tagging enabling */ - if (mmap_read_lock_killable(kvm->mm)) - return -EINTR; - /* We can read the guest memory with __xxx_user() later on. 
*/ if ((mem->userspace_addr & (PAGE_SIZE - 1)) || (mem->userspace_addr != untagged_addr(kvm->mm, mem->userspace_addr)) || !access_ok((void __user *)(unsigned long)mem->userspace_addr, - mem->memory_size)) { - mmap_read_unlock(kvm->mm); + mem->memory_size)) return -EINVAL; - } - - mmap_read_unlock(kvm->mm); - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) return -EINVAL; if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
On Fri, Dec 16, 2022 at 9:05 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Below is preliminary fixup that suppose to address the issue. It does not
> include change to untagged_addr() interface to avoid the clutter.

Looks like the right direction.

And once you change untagged_addr() to take 'tsk', you should then be able to cache all the information in the thread struct, and avoid the 'tsk->mm' dereference entirely.

> kthread_use_mm() should be safe as long as no arch actually implements
> per-thread tagging enabling.

I think in a perfect world the (few) users of kthread_use_mm() would also just make sure they did the locking of thing, so that they can't have that race with somebody that then would enable LAM later.

                Linus
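One way to read the "cache it in the thread struct" suggestion, purely as a hypothetical sketch (none of these fields or helpers exist in the series as posted): snapshot the mm-wide mask into each thread when LAM is enabled (still single-threaded per the locking rules above) and when threads are created, so the read side never touches mm->context at all.

    /* Hypothetical per-thread cache of the untag mask -- sketch only. */
    struct thread_untag_sketch {
            unsigned long untag_mask;       /* snapshot of mm->context.untag_mask */
    };

    /* Reader side: purely task-local state, no shared mutable data, no races. */
    static inline unsigned long untagged_addr_cached(const struct thread_untag_sketch *t,
                                                     unsigned long addr)
    {
            long sign = (long)addr >> 63;

            return addr & (t->untag_mask | sign);
    }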
Hi Dave,

On Tue, Dec 13, 2022 at 09:42:34AM -0800, Dave Hansen wrote:
> This includes some new randomization of the per-cpu entry areas from
> Peter Z. Without it, these areas are a tasty target for attackers.
> The entry code and mappings are especially tricky code and this has
> caused some issues along the way, but they have settled down.
> x86/mm: Randomize per-cpu entry area

I've got a trivial patch teed up for late in the window that I wanted to send in after Linus merges this pull. I was wondering whether you were going to send in another x86/mm pull during the merge window (perhaps without LAM in it?), or if this whole thing is going to wait for 6.3.

Either way is fine with me, but it'd save me a lot of F5 refreshing if I knew what to expect.

Jason
On Fri, Dec 16, 2022 at 09:43:11AM -0600, Linus Torvalds wrote:
> On Fri, Dec 16, 2022 at 9:05 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > Below is preliminary fixup that suppose to address the issue. It does not
> > include change to untagged_addr() interface to avoid the clutter.
>
> Looks like the right direction.
>
> And once you change untagged_addr() to take 'tsk', you should then be
> able to cache all the information in the thread struct, and avoid the
> 'tsk->mm' dereference entirely.

Making untagged_addr() take task as an argument does not work well. We don't have relevant task on hands in some places. The most notably GUP only has 'mm'.

It is also not clear what untagging rules has to be applied: whoever runs untagged_addr() or the target task/mm. Up until now I choose the target mm rules. It makes things easier for GDB, but it is not strong enough justification. Maybe always use rules of 'current' is the right way?

In other thread you suggested to make untagging in untagged_addr() unconditional and I see how it solves the problem, but I don't think it is good idea. The current LAM patchset only enables LAM_U57 mode that doesn't compete for real virtual address bits, but hardware also support LAM_U48 which provides more tagging bits, but limits available address space for the process. Unconditionally untag according to LAM_U48 rules is obviously broken.

Although, x86 maintainers rejected LAM_U48 upstreaming as of now, I want to keep door open in case a strong use case for it comes. I have code that enables the mode stashed, just in case.
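Concretely, the two modes differ only in which bits are ignored: LAM_U57 masks bits 62:57 (the series installs ~GENMASK(62, 57) as the untag mask), while LAM_U48 would mask bits 62:48 and is therefore only usable if the process stays within the smaller 4-level address range -- which is why one unconditional mask cannot serve both. A small stand-alone illustration follows (GENMASK expanded by hand; the LAM_U48 mask is an inference, since that mode is not part of this series).

    #include <stdint.h>
    #include <stdio.h>

    /* ~GENMASK(62, 57): clear the 6 LAM_U57 tag bits, keep everything else. */
    #define LAM_U57_UNTAG_MASK      (~(((1ULL << 6) - 1) << 57))
    /* ~GENMASK(62, 48): clear the 15 LAM_U48 tag bits (inferred, not merged). */
    #define LAM_U48_UNTAG_MASK      (~(((1ULL << 15) - 1) << 48))

    int main(void)
    {
            printf("LAM_U57 untag mask: %#llx\n",
                   (unsigned long long)LAM_U57_UNTAG_MASK);
            printf("LAM_U48 untag mask: %#llx\n",
                   (unsigned long long)LAM_U48_UNTAG_MASK);
            return 0;
    }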
diff --cc arch/x86/kernel/ftrace.c
index 03579460d0ec,cf15ef5aecff..3d883eb989c7
--- a/arch/x86/kernel/ftrace.c
diff --cc arch/x86/kernel/kprobes/core.c
index e7b7ca64acdf,01b8d956aa76..66299682b6b7
--- a/arch/x86/kernel/kprobes/core.c