Message ID | 20230427225647.1101172-1-dave.hansen@linux.intel.com
---|---
State | New
Series | [GIT,PULL] x86/mm for 6.4
Pull-request
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

Commit Message
Dave Hansen
April 27, 2023, 10:56 p.m. UTC
Hi Linus,

Please pull some x86/mm changes for 6.4. The content here is solely a
new revision of Kirill's Linear Address Masking implementation. You had
some concerns[1] with the last version and the (lack of) locking while
the feature was being enabled in multithreaded programs. It has been
fixed up since then to simply only allow LAM enabling while the process
is single-threaded, and it has also accumulated a few random fixes and
cleanups.

This conflicts with the shadow stack work (x86/shstk) that I sent
earlier this week. That is no surprise, since LAM and shadow stacks
both add prctl()s and selftests and touch the same headers. Despite
there being a few sites of conflict, the merge is logically
straightforward. I've included a suggested resolution. Both Kirill
(LAM) and Rick (shadow stacks) tested the result and confirmed that
nothing broke.

1. https://lore.kernel.org/all/CAHk-=wi=TY3Kte5Z1_nvfcsEh+rcz86pYnzeASw=pbG9QtpJEQ@mail.gmail.com/

--

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

for you to fetch changes up to 97740266de26e5dfe6e4fbecacb6995b66c2e378:

  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside (2023-04-06 13:45:06 -0700)

----------------------------------------------------------------
Add support for the new Linear Address Masking CPU feature. This is
similar to ARM's Top Byte Ignore and allows userspace to store metadata
in some bits of pointers without masking it out before use.

----------------------------------------------------------------
Kirill A. Shutemov (14):
      x86/mm: Rework address range check in get_user() and put_user()
      x86: Allow atomic MM_CONTEXT flags setting
      x86: CPUID and CR3/CR4 flags for Linear Address Masking
      x86/mm: Handle LAM on context switch
      mm: Introduce untagged_addr_remote()
      x86/uaccess: Provide untagged_addr() and remove tags before address check
      x86/mm: Reduce untagged_addr() overhead for systems without LAM
      x86/mm: Provide arch_prctl() interface for LAM
      mm: Expose untagging mask in /proc/$PID/status
      iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
      x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
      selftests/x86/lam: Add test cases for LAM vs thread creation
      x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
      x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside

Weihong Zhang (5):
      selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
      selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
      selftests/x86/lam: Add io_uring test cases for linear-address masking
      selftests/x86/lam: Add inherit test cases for linear-address masking
      selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

 arch/arm64/include/asm/mmu_context.h        |    6 +
 arch/sparc/include/asm/mmu_context_64.h     |    6 +
 arch/sparc/include/asm/uaccess_64.h         |    2 +
 arch/x86/Kconfig                            |   11 +
 arch/x86/entry/vsyscall/vsyscall_64.c       |    2 +-
 arch/x86/include/asm/cpufeatures.h          |    1 +
 arch/x86/include/asm/disabled-features.h    |    8 +-
 arch/x86/include/asm/mmu.h                  |   18 +-
 arch/x86/include/asm/mmu_context.h          |   49 +-
 arch/x86/include/asm/processor-flags.h      |    2 +
 arch/x86/include/asm/tlbflush.h             |   48 +-
 arch/x86/include/asm/uaccess.h              |   58 +-
 arch/x86/include/uapi/asm/prctl.h           |    5 +
 arch/x86/include/uapi/asm/processor-flags.h |    6 +
 arch/x86/kernel/process.c                   |    6 +
 arch/x86/kernel/process_64.c                |   68 +-
 arch/x86/kernel/traps.c                     |    6 +-
 arch/x86/lib/getuser.S                      |   83 +-
 arch/x86/lib/putuser.S                      |   54 +-
 arch/x86/mm/init.c                          |    5 +
 arch/x86/mm/tlb.c                           |   53 +-
 drivers/iommu/iommu-sva.c                   |    8 +-
 drivers/vfio/vfio_iommu_type1.c             |    2 +-
 fs/proc/array.c                             |    7 +
 fs/proc/task_mmu.c                          |    9 +-
 include/linux/ioasid.h                      |    9 -
 include/linux/mm.h                          |   11 -
 include/linux/mmu_context.h                 |   14 +
 include/linux/sched/mm.h                    |    8 +-
 include/linux/uaccess.h                     |   22 +
 mm/gup.c                                    |    4 +-
 mm/madvise.c                                |    5 +-
 mm/migrate.c                                |   11 +-
 tools/testing/selftests/x86/Makefile        |    2 +-
 tools/testing/selftests/x86/lam.c           | 1241 +++++++++++++++++++++++++++
 35 files changed, 1701 insertions(+), 149 deletions(-)
 create mode 100644 tools/testing/selftests/x86/lam.c

The suggested merge resolution:

diff --cc arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4,652e366b68a0..702d93fdd10e
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@@ -120,8 -126,8 +132,8 @@@
  #define DISABLED_MASK9	(DISABLE_SGX)
  #define DISABLED_MASK10	0
  #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- 			 DISABLE_CALL_DEPTH_TRACKING)
+ 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 -#define DISABLED_MASK12	0
 +#define DISABLED_MASK12	(DISABLE_LAM)
  #define DISABLED_MASK13	0
  #define DISABLED_MASK14	0
  #define DISABLED_MASK15	0

diff --cc arch/x86/include/uapi/asm/prctl.h
index eb290d89cb32,1b85bc876c2d..abe3fe6db6d2
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@@ -20,9 -20,16 +20,21 @@@
  #define ARCH_MAP_VDSO_32		0x2002
  #define ARCH_MAP_VDSO_64		0x2003
  
+ /* Don't use 0x3001-0x3004 because of old glibcs */
+ 
 +#define ARCH_GET_UNTAG_MASK		0x4001
 +#define ARCH_ENABLE_TAGGED_ADDR		0x4002
 +#define ARCH_GET_MAX_TAG_BITS		0x4003
 +#define ARCH_FORCE_TAGGED_SVA		0x4004
 +
+ #define ARCH_SHSTK_ENABLE		0x5001
+ #define ARCH_SHSTK_DISABLE		0x5002
+ #define ARCH_SHSTK_LOCK			0x5003
+ #define ARCH_SHSTK_UNLOCK		0x5004
+ #define ARCH_SHSTK_STATUS		0x5005
+ 
+ /* ARCH_SHSTK_ features bits */
+ #define ARCH_SHSTK_SHSTK		(1ULL << 0)
+ #define ARCH_SHSTK_WRSS			(1ULL << 1)
+ 
  #endif /* _ASM_X86_PRCTL_H */

diff --cc arch/x86/kernel/process.c
index 50d950771371,8bf13cff0141..7157b09d1cbf
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@@ -48,7 -48,7 +48,8 @@@
  #include <asm/frame.h>
  #include <asm/unwind.h>
  #include <asm/tdx.h>
 +#include <asm/mmu_context.h>
+ #include <asm/shstk.h>
  
  #include "process.h"

diff --cc arch/x86/kernel/process_64.c
index b46924c9e46d,31241930b60c..74c7e84a94d8
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@@ -875,22 -831,13 +877,28 @@@ long do_arch_prctl_64(struct task_struc
  # endif
  	case ARCH_MAP_VDSO_64:
  		return prctl_map_vdso(&vdso_image_64, arg2);
 +#endif
 +#ifdef CONFIG_ADDRESS_MASKING
 +	case ARCH_GET_UNTAG_MASK:
 +		return put_user(task->mm->context.untag_mask,
 +				(unsigned long __user *)arg2);
 +	case ARCH_ENABLE_TAGGED_ADDR:
 +		return prctl_enable_tagged_addr(task->mm, arg2);
 +	case ARCH_FORCE_TAGGED_SVA:
 +		set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
 +		return 0;
 +	case ARCH_GET_MAX_TAG_BITS:
 +		if (!cpu_feature_enabled(X86_FEATURE_LAM))
 +			return put_user(0, (unsigned long __user *)arg2);
 +		else
 +			return put_user(LAM_U57_BITS,
 +					(unsigned long __user *)arg2);
  #endif
+ 	case ARCH_SHSTK_ENABLE:
+ 	case ARCH_SHSTK_DISABLE:
+ 	case ARCH_SHSTK_LOCK:
+ 	case ARCH_SHSTK_UNLOCK:
+ 	case ARCH_SHSTK_STATUS:
+ 		return shstk_prctl(task, option, arg2);
  	default:
  		ret = -EINVAL;
  		break;

diff --cc fs/proc/array.c
index 6daea628bc76,3e1a33dcd0d0..a880c4e44752
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@@ -424,11 -423,11 +424,16 @@@ static inline void task_thp_status(stru
  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
  }
  
 +static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm)
 +{
 +	seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
 +}
 +
+ __weak void arch_proc_pid_thread_features(struct seq_file *m,
+ 					  struct task_struct *task)
+ {
+ }
+ 
  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
  			struct pid *pid, struct task_struct *task)
  {

diff --cc tools/testing/selftests/x86/Makefile
index 598135d3162b,cfc8a26ad151..fa2216a8c0d5
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@@ -18,7 -18,7 +18,7 @@@ TARGETS_C_32BIT_ONLY := entry_from_vm8
  			test_FCMOV test_FCOMI test_FISTTP \
  			vdso_restorer
  TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- 			corrupt_xstate_header amx lam
 -			corrupt_xstate_header amx test_shadow_stack
++			corrupt_xstate_header amx test_shadow_stack lam
  
  # Some selftests require 32bit support enabled also on 64bit systems
  TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
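(For readers new to LAM: the arch_prctl() values in the diff above are
the whole userspace-facing interface. Below is a minimal sketch, not
taken from the thread, of how a program might use it. It assumes a
kernel with this pull merged, CONFIG_ADDRESS_MASKING=y, and LAM-capable
hardware; on anything else the tagged dereference would fault. Error
handling is trimmed.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from arch/x86/include/uapi/asm/prctl.h in this pull. */
#define ARCH_GET_UNTAG_MASK	0x4001
#define ARCH_ENABLE_TAGGED_ADDR	0x4002

int main(void)
{
	unsigned long mask = 0;

	/* Must be called while single-threaded; ask for 6 tag bits (LAM_U57). */
	if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL)) {
		perror("ARCH_ENABLE_TAGGED_ADDR");
		return 1;
	}

	syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);
	printf("untag mask: %#lx\n", mask);	/* expect ~GENMASK(62, 57) */

	/* Store metadata in bits 62:57; hardware masks them on dereference. */
	uint64_t *p = malloc(sizeof(*p));
	uint64_t *tagged = (uint64_t *)((uint64_t)p | (0x2aULL << 57));
	*tagged = 42;
	printf("*p = %llu\n", (unsigned long long)*p);

	free(p);
	return 0;
}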
Comments
The pull request you sent on Thu, 27 Apr 2023 15:56:47 -0700:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/22b8cc3e78f5448b4c5df00303817a9137cd663f
Thank you!
On Thu, Apr 27, 2023 at 3:57 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Please pull some x86/mm changes for 6.4. The content here is solely a
> new revision of Kirill's Linear Address Masking implementation.

So I was waiting for this for my final piece of the x86 user copy
changes, so here goes...

I think we should now make 'access_ok()' do the same thing that
get/put_user() do with the LAM code: only worry about the sign bit.

So here's my suggested change on top of the current tree. Comments?

PeterZ also added to the cc, because he's the source of that
WARN_ON_IN_IRQ() in the x86 'access_ok()' macro. That's the only
reason x86 has its own copy of that.

I wonder if that WARN_ON_IN_IRQ() should just be removed, or perhaps
moved into the generic code in <asm-generic/access_ok.h>?

              Linus
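(Linus's patch itself isn't reproduced in this archive. As a rough
standalone sketch of the idea he describes -- the helper and function
names here are stand-ins, and PAGE_SIZE is hardcoded so it compiles in
userspace -- the check reduces to something like:)

#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* Bit 63 clear: either valid user memory or the non-canonical hole,
 * and the hole faults on access anyway. */
static inline bool valid_user_address(unsigned long addr)
{
	return (long)addr >= 0;
}

static inline bool access_ok_sketch(const void *ptr, unsigned long size)
{
	unsigned long addr = (unsigned long)ptr;

	/* Small constant sizes: checking the start address is enough. */
	if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE)
		return valid_user_address(addr);

	/* Otherwise check the end, guarding against wraparound. */
	unsigned long end = addr + size;
	return end >= addr && valid_user_address(end);
}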
On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's my suggested change on top of the current tree. Comments?

Oh, and I wanted to particularly mention that

    We could probably just do that "check only starting address" for any
    arbitrary range size: realistically all kernel accesses to user space
    will be done starting at the low address. But let's leave that kind of
    optimization for later. As it is, this already allows us to generate
    simpler code and not worry about any tag bits in the address.

part of the commit log.

Right now that patch only simplifies the range check for when the
compiler statically knows that the range is small (which does happen,
but not very often, because 'copy_to/from_user()' isn't inlined on
x86-64, so the compiler doesn't actually see the constant-size case
that is very common there).

However, that "check the end of the range" is sometimes actually
fairly complicated code, and it would be nice to drop that entirely.

See for example the fs/readdir.c case, where the length of the access
is kind of annoying:

        if (!user_write_access_begin(dirent,
                        (unsigned long)(dirent->d_name + namlen + 1) -
                        (unsigned long)dirent))
                goto efault;

and there really isn't any actual reason to check the end of the
access on x86: if the beginning address has the high bit clear, it
doesn't really matter what the end is, because we'll either have a
good area, or we'll fault in the non-canonical area even if the sign
changes.

So being careful about the range is kind of annoying, when we don't
really need it.

               Linus
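(What that length expression computes, spelled out; the struct layout
here is illustrative only, not the real struct linux_dirent:)

#include <stddef.h>
#include <stdio.h>

struct dirent_like {
	unsigned long	d_ino;
	unsigned long	d_off;
	unsigned short	d_reclen;
	char		d_name[];
};

int main(void)
{
	size_t namlen = 5;	/* e.g. strlen("hello") */

	/* Same arithmetic as the pointer subtraction above: the access
	 * spans from the start of the struct through the name's NUL. */
	size_t len = offsetof(struct dirent_like, d_name) + namlen + 1;

	printf("access length: %zu bytes\n", len);
	return 0;
}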
On Fri, Apr 28, 2023 at 01:15:33PM -0700, Linus Torvalds wrote:
> On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So here's my suggested change on top of the current tree. Comments?
>
> Oh, and I wanted to particularly mention that
>
>     We could probably just do that "check only starting address" for any
>     arbitrary range size: realistically all kernel accesses to user space
>     will be done starting at the low address. But let's leave that kind of
>     optimization for later. As it is, this already allows us to generate
>     simpler code and not worry about any tag bits in the address.
>
> part of the commit log.
>
> Right now that patch only simplifies the range check for when the
> compiler statically knows that the range is small (which does happen,
> but not very often, because 'copy_to/from_user()' isn't inlined on
> x86-64, so the compiler doesn't actually see the constant-size case
> that is very common there).

BTW, I think the static check can be relaxed. Checking size against
PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
paging) guard hole at the beginning of the kernel address space.

> However, that "check the end of the range" is sometimes actually
> fairly complicated code, and it would be nice to drop that entirely.
>
> See for example the fs/readdir.c case, where the length of the access
> is kind of annoying:
>
>         if (!user_write_access_begin(dirent,
>                         (unsigned long)(dirent->d_name + namlen + 1) -
>                         (unsigned long)dirent))
>                 goto efault;
>
> and there really isn't any actual reason to check the end of the
> access on x86: if the beginning address has the high bit clear, it
> doesn't really matter what the end is, because we'll either have a
> good area, or we'll fault in the non-canonical area even if the sign
> changes.
>
> So being careful about the range is kind of annoying, when we don't
> really need it.

Hm. Is there anybody who accesses high to low after the check (glibc
memcpy() bug flashbacks)? Or not in any particular order?
On Fri, Apr 28, 2023 at 5:38 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> BTW, I think the static check can be relaxed. Checking size against
> PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of the kernel address space.

So I don't worry about the size per se - we just don't have any
constant-sized accesses that are bigger than a page. The constant-sized
case is for things like structures being copied to user space.

And having a big gap is nice for the syzkaller case, although I don't
think that GP fault has triggered lately (or ever, I don't remember).
Having random system call arguments that trigger "oh, this is in the
non-canonical region" is a good thing.

> > So being careful about the range is kind of annoying, when we don't
> > really need it.
>
> Hm. Is there anybody who accesses high to low after the check (glibc
> memcpy() bug flashbacks)? Or not in any particular order?

Yeah, I can't think of a single case, which is why it seems so silly to
even bother.

Almost all real-life cases end up being limited by things like the
page/folio size. We do have exceptions, like module loading etc that
might copy a bigger area from user space, but no, we don't have any
backwards copies.

So you'd almost have to have some "access_ok()" followed by random
access with a user-controlled offset, and that seems nonsensical and
fundamentally impossible anyway.

But just because I can't think of it, and go "that would be insane",
doesn't mean that some driver ioctl interface might not try it. Which
is why I think having others look at it would be a good idea.

                Linus
On 4/28/23 17:38, Kirill A. Shutemov wrote:
> BTW, I think the static check can be relaxed. Checking size against
> PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of the kernel address space.

Whatever we relax it to, let's make sure we get a note in
Documentation/x86/x86_64/mm.rst. But that's totally minor and we can
fix it up later.

Has anyone seen any actual code generation difference between:

        return (long)ptr >= 0;

and

        return !((unsigned long)ptr & (1UL << (BITS_PER_LONG - 1)));

? I'm seeing gcc generate the same code for both the <=PAGE_SIZE side
and the 'sum' side.

It's longer, but I'd rather read the explicit "check bit 63" than the
positive/negative address space thing. I certainly grok both, but have
to think through the "(long)ptr >= 0" check every time.

I guess it also wouldn't matter as much either if we hid it in a helper
like the attached patch and I didn't have to read it twice. ;)
On Tue, May 2, 2023 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Has anyone seen any actual code generation difference between:
>
>         return (long)ptr >= 0;
>
> and
>
>         return !((unsigned long)ptr & (1UL << (BITS_PER_LONG - 1)));

No, as far as I know, they both generate the same code.

> It's longer, but I'd rather read the explicit "check bit 63" than the
> positive/negative address space thing. I certainly grok both, but have
> to think through the "(long)ptr >= 0" check every time.

I'm very much the other way. I think it's much clearer to say "check
the sign bit". Doing the explicit bit check means that I have to look
at what the bit number is, and that is a much more complex expression.

In fact, I'd find it easier to read

        return !((unsigned long)ptr & (1UL << 63));

just because then you go "Oh, checking bit 63" without having to parse
the expression. But even then I find the '!' is easy to miss, so you
really have to parse it.

But:

> I guess it also wouldn't matter as much either if we hid it in a helper
> like the attached patch and I didn't have to read it twice. ;)

Yeah, I think that's a good solution.

            Linus
On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > I guess it also wouldn't matter as much either if we hid it in a helper
> > like the attached patch and I didn't have to read it twice. ;)
>
> Yeah, I think that's a good solution.

Hmm. And as I was rebasing the patch to fix up my patch, I realized
that the current -git top-of-tree state is actually broken.

That

        #define access_ok(addr, size)                                   \
        ({                                                              \
                WARN_ON_IN_IRQ();                                       \
                likely(__access_ok(untagged_addr(addr), size));         \
        })

is actually *wrong* in two ways.

Now, in my original patch, I added a comment about how that
"WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.

I ended up going back in time to see why it was added, and I think it
was added because we used to access 'current' in access_ok(), due to
it using that user_addr_max() thing:

        likely(!__range_not_ok(addr, size, user_addr_max()));

but that was all removed by the set_fs() removal by Christoph Hellwig.

So there is actually nothing task-related in "access_ok()" any more,
and any warning about using it in the wrong context is just bogus. That
warning simply shouldn't exist any more (or maybe it should be in a
different place, like the actual user copy functions).

But that's actually not the problem with the above.

No, the problem is that probably *because* "access_ok()" has that
warning, not all users use "access_ok()" at all. We have places that
use "__access_ok()" instead. Like copy_from_nmi().

So now copy_from_nmi() doesn't do the untagging, so if you were to use
tagged pointers for the stack, you'd not get stack traces.

End result: I think that

 (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems

 (b) the current "use untagged_addr() in access_ok()" model is also
     broken, and my patch - which was meant to just improve code
     generation - actually fixes this.

           Linus
On 5/2/23 13:14, Linus Torvalds wrote:
> No, the problem is that probably *because* "access_ok()" has that
> warning, not all users use "access_ok()" at all. We have places that
> use "__access_ok()" instead. Like copy_from_nmi().
>
> So now copy_from_nmi() doesn't do the untagging, so if you were to use
> tagged pointers for the stack, you'd not get stack traces.
>
> End result: I think that
>
>  (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems
>
>  (b) the current "use untagged_addr() in access_ok()" model is also broken

Ugh, yes. The fallout seems limited to (probably) perf and tracing
poking at user stack frames. But, yes, it definitely looks broken
there.

While I bet we could shoehorn the existing tlbstate checks into the
__access_ok() sites, I also vastly prefer the high-bit checks in
access_ok() instead. The less state we have to consult, the better.

Once the WARN_ON_IN_IRQ() is gone, it seems like it's just a matter of
collapsing __access_ok() into access_ok() and converting the (~3)
callers.
On Tue, May 2, 2023 at 5:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> The fallout seems limited to (probably) perf and tracing poking at user
> stack frames. But, yes, it definitely looks broken there.

So I ended up doing more cleanups - moving the 64-bit specific code to
'uaccess_64.h' etc.

And in the process I found another broken thing:
__untagged_addr_remote() is very very buggy.

The reason?

        long sign = addr >> 63;

that does *not* do at all what '__untagged_addr()' does, because while
'sign' is a signed long, 'addr' is an *unsigned* long.

So the actual shift ends up being done as an unsigned shift, and then
just the result is assigned to a signed variable.

End result? 'sign' ends up being 0 for user space (intentional) or 1
for kernel space (not intentional). It's not 0 or ~0 as in
'__untagged_addr()', which uses a signed shift in the inline asm
('sar').

That said, while this is all horribly buggy, I'm not actually 100% sure
that we care too much about untagging kernel addresses. So it's an
error case that doesn't get marked as an error. So I guess in
*practice* the horribly buggy code is actually just a small harmless
bug, and causes anybody who passes in an invalid (kernel) address to
look like a possibly valid user address after all.

HOWEVER.

Why does it do that "shift-by-63" game there, instead of making
tlbstate_untag_mask just have bit #63 always set?

Untagging a kernel address is not a sensible operation, so the only
thing you want is to *keep* a kernel address as a bad address. You
don't have to actually keep it a *valid* kernel address, it just should
not become a (possibly valid) user address after untagging.

Hmmm? Am I missing something?

            Linus
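(A tiny userspace demonstration of the type trap Linus describes. It
assumes GCC/Clang's arithmetic right shift for signed values, which is
the behavior the kernel relies on:)

#include <stdio.h>

int main(void)
{
	unsigned long addr = 0xffff888000000000UL;	/* a kernel address */

	long sign = addr >> 63;		/* shift done as unsigned:  1 */
	long fixed = (long)addr >> 63;	/* arithmetic shift:       ~0 */

	printf("buggy: %ld, intended: %ld\n", sign, fixed);	/* 1, -1 */
	return 0;
}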
On Tue, May 02, 2023 at 01:14:33PM -0700, Linus Torvalds wrote:
> On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > > I guess it also wouldn't matter as much either if we hid it in a helper
> > > like the attached patch and I didn't have to read it twice. ;)
> >
> > Yeah, I think that's a good solution.
>
> Hmm. And as I was rebasing the patch to fix up my patch, I realized
> that the current -git top-of-tree state is actually broken.
>
> That
>
>         #define access_ok(addr, size)                                   \
>         ({                                                              \
>                 WARN_ON_IN_IRQ();                                       \
>                 likely(__access_ok(untagged_addr(addr), size));         \
>         })
>
> is actually *wrong* in two ways.
>
> Now, in my original patch, I added a comment about how that
> "WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.
>
> I ended up going back in time to see why it was added, and I think it
> was added because we used to access 'current' in access_ok(), due to
> it using that user_addr_max() thing:
>
>         likely(!__range_not_ok(addr, size, user_addr_max()));
>
> but that was all removed by the set_fs() removal by Christoph Hellwig.

So I had a poke around, trying to figure out where it came from, and
yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame
pointer") is the reason that WARN_ON_IN_IRQ() thing got added.
On Tue, May 2, 2023 at 6:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And in the process I found another broken thing:
> __untagged_addr_remote() is very very buggy.
>
> The reason?
>
>         long sign = addr >> 63;
>
> that does *not* do at all what '__untagged_addr()' does, because while
> 'sign' is a signed long, 'addr' is an *unsigned* long.
>
> So the actual shift ends up being done as an unsigned shift, and then
> just the result is assigned to a signed variable.
>
> End result? 'sign' ends up being 0 for user space (intentional) or 1
> for kernel space (not intentional).

Looking around, this same bug used to exist for the normal (non-remote)
case too, until it was accidentally fixed when changing that to use
inline asm and the alternatives code.

At that point the non-remote case got an explicit 'sar' instruction,
and the result really was ~0 for kernel-mode addresses.

> Why does it do that "shift-by-63" game there, instead of making
> tlbstate_untag_mask just have bit #63 always set?

And it turns out that bit #63 really _is_ always set, so I think the
solution to this all is to remove the sign games in untag_addr()
entirely.

Untagging a kernel address will "corrupt" it, but it will stay a kernel
address (well, it will stay a "high bit set" address), which is all we
care about anyway.

If somebody actually tries to untag a kernel address, that would be a
bug anyway, as far as I can tell.

So I'm going to just remove the 'sign' games entirely. They are
completely broken in 'untagged_addr_remote()', they _used_ to be
completely broken in 'untagged_addr()', and it looks like it's all
unnecessary.

           Linus
On Wed, May 3, 2023 at 1:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I had a poke around, trying to figure out where it came from, and
> yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame
> pointer") is the reason that WARN_ON_IN_IRQ() thing got added.

I added a pointer to that commit in the commit message, and took the
above to mean 'Ack' from you.

             Linus
On 5/3/23 09:38, Linus Torvalds wrote:
>> Why does it do that "shift-by-63" game there, instead of making
>> tlbstate_untag_mask just have bit #63 always set?
> And it turns out that bit #63 really _is_ always set, so I think the
> solution to this all is to remove the sign games in untag_addr()
> entirely.

Yes, there are only two possible values right now, both of which have
bit 63 set:

        LAM off: mm->context.untag_mask = -1UL;
        LAM on:  mm->context.untag_mask = ~GENMASK(62, 57);

> Untagging a kernel address will "corrupt" it, but it will stay a
> kernel address (well, it will stay a "high bit set" address), which is
> all we care about anyway.
>
> If somebody actually tries to untag a kernel address, that would be a
> bug anyway, as far as I can tell.

Is it a bug? The do_madvise() path, for instance, is passing a value in
there that came right from userspace.

> So I'm going to just remove the 'sign' games entirely. They are
> completely broken in 'untagged_addr_remote()', they _used_ to be
> completely broken in 'untagged_addr()', and it looks like it's all
> unnecessary.

Yes, it looks completely superfluous.
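(The untag operation itself is just an AND with that per-mm mask, and
because bit 63 is set in both possible mask values, a kernel address
can never come out looking like a user address. A standalone
illustration, with a userspace stand-in for the kernel's GENMASK:)

#include <stdio.h>

#define GENMASK(h, l) (((~0UL) << (l)) & (~0UL >> (63 - (h))))

int main(void)
{
	unsigned long lam_off = -1UL;			/* untag is a no-op */
	unsigned long lam_u57 = ~GENMASK(62, 57);	/* clears bits 62:57 */

	/* A user pointer carrying tag 0x2a in bits 62:57: */
	unsigned long tagged = 0x00007f0000001000UL | (0x2aUL << 57);

	printf("%#lx\n", tagged & lam_off);	/* 0x54007f0000001000 */
	printf("%#lx\n", tagged & lam_u57);	/* 0x00007f0000001000 */
	return 0;
}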
On Wed, May 3, 2023 at 9:45 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/3/23 09:38, Linus Torvalds wrote:
> > If somebody actually tries to untag a kernel address, that would be a
> > bug anyway, as far as I can tell.
>
> Is it a bug? The do_madvise() path, for instance, is passing a value in
> there that came right from userspace.

That's still a "user address" - just not a *valid* one.

So we do not want to mask the high bit off - because that is what will
catch people later doing things like vma address range comparisons on
it and notice "that's not a valid address" - but it's also not a
"kernel address" that we need to preserve as such.

So yeah, it's a bit confusing in that it's _also_ true that "kernel
addresses have the high bit set" and "user addresses have the high bit
clear", and I'm basically using two different semantics for "kernel
address".

IOW: what I mean by "it's not valid to do 'untagged_addr()' on a kernel
address" is that you can't take a (valid) kernel address, do
'untagged_addr()' on it, and expect it to still work as a kernel
address.

But at the same time you *are* supposed to be able to use
'untagged_addr()' on a - untrusted and possibly invalid - user pointer,
and it's supposed to end up having the tag bits clear and still be
usable as a user pointer.

And yet still also be caught by any validity checks (ie a high bit set
would never be a valid user pointer, not even after doing
'untagged_addr()' on it).

            Linus
On Wed, May 3, 2023 at 9:38 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm going to just remove the 'sign' games entirely. They are
> completely broken in 'untagged_addr_remote()', they _used_ to be
> completely broken in 'untagged_addr()', and it looks like it's all
> unnecessary.

Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch.

I think it's all good, but it would be really nice to hear it's been
tested on a setup that actually has LAM (simulator? or maybe there is
actual hw inside Intel).

And any other commentary is appreciated,

             Linus
On Wed, May 03, 2023 at 10:54:38AM -0700, Linus Torvalds wrote:
> On Wed, May 3, 2023 at 9:38 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I'm going to just remove the 'sign' games entirely. They are
> > completely broken in 'untagged_addr_remote()', they _used_ to be
> > completely broken in 'untagged_addr()', and it looks like it's all
> > unnecessary.
>
> Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch.
>
> I think it's all good, but it would be really nice to hear it's been
> tested on a setup that actually has LAM (simulator? or maybe there is
> actual hw inside Intel).
>
> And any other commentary is appreciated,

Looks sane from a first reading. But I'll try and have another look
tomorrow.

Also, per how 2's complement is constructed, 0 has to be in the
positive space for it to be 'half'. And typically 0 is included in N,
and the traditional 'positive integers' set is then written as N \ {0},
but that's somewhat modern and a lot of variation exists -- although
typically books tend to specify where they stand on that issue.

I suppose that's a very long-winded way of saying: yes, of course 0 is
a positive number :-)
On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I suppose that's a very long-winded way of saying: yes, of course 0 is
> a positive number :-)

Well, I do consider it as such, but I guess I took all my math classes
when people still considered negative, zero and positive to be three
disjoint sets, and 'non-negative' was required for rigor.

Some googling around says that a lot of people still think that, and
that it might even be language-specific. I think the commit commentary
about "ok, strictly non-negative" might still be relevant. At least to
some people, and at least for sticklers.

Also, I do consider 0 to be part of ℕ, although I wouldn't consider
that to be an argument about "positive" at all. The argument for 0 in ℕ
would be that without it, you don't have an identity element for
addition, which would arguably make natural numbers kind of broken as a
set.

           Linus
On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote:
> > Why does it do that "shift-by-63" game there, instead of making
> > tlbstate_untag_mask just have bit #63 always set?
>
> And it turns out that bit #63 really _is_ always set, so I think the
> solution to this all is to remove the sign games in untag_addr()
> entirely.

Untagging a kernel pointer with LAM enabled will land it in the
canonical hole, which is safe as it leads to #GP on dereference.

> Untagging a kernel address will "corrupt" it, but it will stay a
> kernel address (well, it will stay a "high bit set" address), which is
> all we care about anyway.

The interesting case to consider is untagging a kernel pointer when
LAM_U48 is enabled (not part of the current LAM enabling). LAM_U48
would make the untagging mask wider -- ~GENMASK(62, 48). With 5-level
paging and LAM_SUP enabled (also not supported yet), untagging a kernel
pointer may transform it into another valid kernel pointer.

So we cannot rely on #GP as a backstop here. The kernel has to exclude
kernel pointers by other means. It can be fun to debug.

Looks like you are not worried about this kind of corruption. Hm.

Maybe it's worth indicating in the helper name that it is intended only
for userspace addresses? Rename it to untagged_uaddr() or something?
On 5/3/23 23:28, Kirill A. Shutemov wrote:
>> Untagging a kernel address will "corrupt" it, but it will stay a
>> kernel address (well, it will stay a "high bit set" address), which is
>> all we care about anyway.
> The interesting case to consider is untagging a kernel pointer when
> LAM_U48 is enabled (not part of the current LAM enabling). LAM_U48
> would make the untagging mask wider -- ~GENMASK(62, 48). With 5-level
> paging and LAM_SUP enabled (also not supported yet), untagging a
> kernel pointer may transform it into another valid kernel pointer.
>
> So we cannot rely on #GP as a backstop here. The kernel has to exclude
> kernel pointers by other means. It can be fun to debug.

Yeah, I have the feeling that we're really going to need a pair of
untagging functions once we get to doing kernel LAM, for a _bunch_ of
reasons.

Just as a practical matter, I think we should OR bits into the mask on
the kernel side, effectively:

        unsigned long untag_kernel_addr(unsigned long addr)
        {
                return addr | kernel_untag_mask;
        }

and kernel_untag_mask should have bit 63 *clear*.

That way the pointers that have gone through untagging won't look
silly. If you untag VMALLOC_END or something, it'll still look like
the addresses we have in mm.rst.

Also, it'll be impossible to have the mask turn a userspace address
into a kernel one.

Last, we can add some debugging in there, probably conditional on some
mm debugging options, like:

        if (WARN_ON_ONCE(!valid_user_address(addr)))
                return 0;

It's kinda like "void __user *" versus "void *". The __user ones can
*absolutely* point anywhere, user or kernel. That's why we can't WARN()
in the untagged_addr() function that takes user pointers.

But "void *" can only point to the kernel. It has totally different
rules.

We should probably also do something like the attached patch sooner
rather than later. 'untag_mask' really is a misleading name for a mask
that gets applied only to user addresses.
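(A standalone sketch of why OR-ing works out the way Dave describes.
The mask value and the vmalloc-range pointer below are hypothetical,
chosen to mimic a LAM_SUP-style layout with tag bits 62:57; note how
untagging restores the canonical all-ones top bits without ever
touching bit 63:)

#include <stdio.h>

#define GENMASK(h, l) (((~0UL) << (l)) & (~0UL >> (63 - (h))))

/* Hypothetical kernel-side mask: tag bits 62:57, bit 63 clear. */
static const unsigned long kernel_untag_mask = GENMASK(62, 57);

static unsigned long untag_kernel_addr(unsigned long addr)
{
	return addr | kernel_untag_mask;
}

int main(void)
{
	unsigned long vmalloc_ptr = 0xffffc90000000000UL;	/* illustrative */

	/* A tagged kernel pointer carries arbitrary bits in 62:57: */
	unsigned long tagged = (vmalloc_ptr & ~GENMASK(62, 57)) | (0x15UL << 57);

	/* Prints 0xabffc90000000000 -> 0xffffc90000000000: */
	printf("%#lx -> %#lx\n", tagged, untag_kernel_addr(tagged));
	return 0;
}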
On Wed, May 3, 2023 at 11:28 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Untagging a kernel pointer with LAM enabled will land it in the
> canonical hole, which is safe as it leads to #GP on dereference.

You are entirely missing the point.

The GP fault does *NOT MATTER*.

Think of 'untagged_addr()' as a companion to - but absolutely *NOT* a
replacement for - 'access_ok()'.

You have three distinct cases:

 (a) user-supplied valid user address

 (b) user-supplied invalid user address (ie high bit set)

 (c) actual kernel address

and 'untagged_addr()' and 'access_ok()' work on the same basic input
domain: cases (a) and (b).

And the important thing for 'untagged_addr()' is that in case (a) it
needs to remove the tag bits, and in case (b) it needs to *keep* the
address as an invalid user address.

Note that it does not need to keep the value the *same*. Nobody cares.
And also note that the resulting pointer may or may not GP-fault.
Nobody cares.

Just to hit this home with a very explicit example, think of the case
where the kernel config for address masking isn't even enabled, and
'untagged_addr()' is a 1:1 map with no changes. Doing 'untagged_addr()'
on a valid user address results in a valid user address. And doing it
on an invalid user address results in an invalid user address. GOOD.

Note that doing 'untagged_addr()' on an invalid user access does NOT IN
ANY WAY make that invalid address somehow "safe" and cause a GP fault.
Sure, it _might_ GP-fault. Or it might not. It might point to random
kernel data.

Verification of the address is simply not the job of 'untagged_addr()'.
Never has been, never will be, and fundamentally *cannot* be, since for
the foreseeable future 'untagged_addr()' is a no-op for every single
user outside of Intel.

Verification is separate. The verification is never "let's randomly
just access this pointer and see if it gets a GP-fault". No. We have a
user pointer, it needs *checking*. It needs to have something like "use
lookup_vma() to look up the vma that is associated with that address".
But it could also be something like "just do the range check in GUP".

And that's why we "keep an invalid user address as an invalid address":
because that *separate* stage of verifying the address needs to still
show that it's invalid.

Now, sometimes the "verification" might actually be "access_ok()"
itself, but honestly, if the address is used for an actual access, then
it shouldn't have gone through the 'untagged_addr()' thing at all. It
should just have been used as-is for the access. So normally
'untagged_addr()' does not get used *together* with 'access_ok()',
although that should obviously also work.

End result: all that really matters on x86-64 is that 'untagged_addr()'
must keep the high bit as-is. That's the "this is not a valid user
address" bit. That's very much explicit in my current series, of
course, but even without my current series it was implicitly the truth
with those LAM patches (particularly considering that 'untagged_addr()'
didn't actually do the "keep kernel addresses as-is" that it *thought*
it did, due to the signed type confusion).

So what about that (c) case? It doesn't matter. It's simply
fundamentally wrong for the kernel to pass an actual kernel address to
'untagged_addr()' and expect something useful back. It's a nonsensical
thing to do, and it's a huge bug.

So for the (c) case, the fact that the result would be useless and
*usually* GP-fault on access is a good thing. But it's not a
requirement, it's more of a "oh, cool, it's good that the nonsensical
operation causes quick failures".

So in that case, the GP fault is a "feature", but not a requirement.
Again, the normal 'untagged_addr()' case (of not changing the pointer
at all) obviously does *not* cause the kernel pointer corruption, but
maybe we could even have a debug mode. We could literally make
'untagged_addr()' do something like

        #ifdef CONFIG_DEBUG_NONLAM
        // Make sure nobody tries to use untagged_addr() on non-user addresses
        #define untagged_addr(x) ((x) | (long)(x)>>63)
        #endif

except obviously with the "get the types right and use 'x' only once"
thing (so the above #define is buggy, and purely for conceptual
documentation purposes).

See?

Side note: one day, maybe we want to use address tagging *inside* the
kernel. However, that will not use 'untagged_addr()'. That would use
some *other* model for cleaning up kernel pointers when necessary.

Even on x86-64, the tagging rules for kernel and user space are
entirely different, in that user-space LAM rules are "U48" or "U57",
while kernel LAM rules depend on the paging mode, and the two are *not*
tied together.

But more importantly, from a kernel perspective: regardless of hardware
implementations like that, the notion of masking bits off an untrusted
user pointer is simply completely *different* from the notion of
masking off bits of our own kernel pointers.

You can see this in the kernel everywhere: user pointers should be
statically always user pointers, and should look something like

        struct iocb __user *user_iocb

which is very very different from a kernel pointer that doesn't have
that "__user" thing. Again, on some architectures, the *EXACT*SAME* bit
pattern pointer value may point to different things, because user
accesses are simply in a different address space entirely.

So if/when KASAN starts using LAM inside the kernel, it will do its own
things. It had better not use "untagged_addr()".

             Linus
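(For concreteness, a single-evaluation version of that debug idea could
use a GCC statement expression as below. The name is hypothetical and
this is still purely a sketch, not in-tree code; the cast to 'long' is
what buys the arithmetic shift, relying on GCC's sign-extending right
shift as kernel code does:)

#include <assert.h>

/* Leaves user addresses (bit 63 clear) alone; smears kernel addresses
 * to all-ones so any later use fails loudly. Evaluates 'x' once. */
#define untagged_addr_debug(x) ({					\
	unsigned long __ua = (unsigned long)(x);			\
	__ua | (unsigned long)((long)__ua >> 63);			\
})

int main(void)
{
	assert(untagged_addr_debug(0x00007fffdeadbeefUL) == 0x00007fffdeadbeefUL);
	assert(untagged_addr_debug(0xffff888000000000UL) == ~0UL);
	return 0;
}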
On Thu, May 04, 2023 at 08:25:58AM -0700, Dave Hansen wrote:
> Yeah, I have the feeling that we're really going to need a pair of
> untagging functions once we get to doing kernel LAM, for a _bunch_ of
> reasons.

There's already arch_kasan_reset_tag() used on ARM64 (and two more
helpers to set/get the tag). I don't see a reason to add new ones.

> Just as a practical matter, I think we should OR bits into the mask on
> the kernel side, effectively:
>
>         unsigned long untag_kernel_addr(unsigned long addr)
>         {
>                 return addr | kernel_untag_mask;
>         }
>
> and kernel_untag_mask should have bit 63 *clear*.
>
> That way the pointers that have gone through untagging won't look
> silly. If you untag VMALLOC_END or something, it'll still look like
> the addresses we have in mm.rst.
>
> Also, it'll be impossible to have the mask turn a userspace address
> into a kernel one.
>
> Last, we can add some debugging in there, probably conditional on some
> mm debugging options, like:
>
>         if (WARN_ON_ONCE(!valid_user_address(addr)))
>                 return 0;
>
> It's kinda like "void __user *" versus "void *". The __user ones can
> *absolutely* point anywhere, user or kernel. That's why we can't WARN()
> in the untagged_addr() function that takes user pointers.
>
> But "void *" can only point to the kernel. It has totally different
> rules.
>
> We should probably also do something like the attached patch sooner
> rather than later. 'untag_mask' really is a misleading name for a mask
> that gets applied only to user addresses.

A bit too verbose to my liking, but okay. Maybe _uaddr() instead of
_user_addr()?
On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Looks sane from a first reading. But I'll try and have another look
> tomorrow.

Well, I didn't hear back any more, and I was merging architecture code,
so I merged my branch too.

It does fix at least two bugs, even if those bugs are not necessarily
all that noticeable.

I guess we'll see if it introduces new ones...

             Linus
On Fri, May 05, 2023 at 12:56:31PM -0700, Linus Torvalds wrote:
> On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Looks sane from a first reading. But I'll try and have another look
> > tomorrow.
>
> Well, I didn't hear back any more, and I was merging architecture
> code, so I merged my branch too.
>
> It does fix at least two bugs, even if those bugs are not necessarily
> all that noticeable.
>
> I guess we'll see if it introduces new ones...

The branch passed LAM selftests fine.
Hi Linus,

>  static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
> -			       struct pt_regs *regs, int trapnr)
> +			       struct pt_regs *regs, int trapnr,
> +			       unsigned long fault_address)
>  {
> -	WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?");
> +	WARN_ONCE(trapnr == X86_TRAP_GP && !gp_fault_address_ok(fault_address),
> +		  "General protection fault in user access. Non-canonical address?");
>  	return ex_handler_default(fixup, regs);
>  }

Shouldn't ex_handler_copy() be fixed in the same way? Looks like it's
still possible for a tagged userspace address to be passed to it and
trigger a warning.

Alex
On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote:
>
> Shouldn't ex_handler_copy() be fixed in the same way?

I don't think ex_handler_copy() is actually reachable any more.

The only thing ex_handler_copy() did was to set %ax to the fault type.
It was used by the strange copy_user_generic_unrolled() that had a
special machine check case for "don't do the tail if we get
X86_TRAP_MC".

But that was always bogus. The MC case in question was for the
__copy_user_nocache function, and the machine check case was for the
*destination*, which wasn't in user space at all. So instead of looking
at "what was the trap number", the code should have just noticed
"trapped on the destination", and stopped for *that* reason.

See commit 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
and in particular, see the comment there about writes on the
destination:

         * An exception on a write means that we're
         * done, but we need to update the count
         * depending on where in the unrolled loop
         * we were.

but yeah, I never removed the actual now-unused _ASM_EXTABLE_CPY case.

Honestly, I had no way of even testing the code. I doubt anybody does.

There are a couple of users:

 - rdma mis-uses it for regular kernel-to-kernel copies that don't
   fault (because rdma wants the "nocache" case). So it can't fault at
   all.

 - a couple of GPU drivers mis-use it similarly to rdma, but at least
   with a user source in the form of __copy_from_user_inatomic_nocache()

 - _copy_from_iter_flushcache uses it for pmem and dax, because it
   wants that machine check handling for non-volatile memory

So two of the three users don't actually *have* the MC case at all on
the destination. And the third case - the actual "copy using uncached
accesses in order to get errors on the destination" case - depends on
hardware that I'm not convinced really exists any more.

Oh, and I think there's some odd ntb mis-use too.

I'd love for somebody to actually test the machine check case, but the
old code was completely insane, with odd "we handle _one_ case of
4-byte alignment, but not actually the general case", so I wonder if
the MC case ever really worked in reality. And as mentioned, I am not
convinced the hardware is available.

              Linus
On Fri, Jun 16, 2023 at 6:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote:
> >
> > Shouldn't ex_handler_copy() be fixed in the same way?
>
> I don't think ex_handler_copy() is actually reachable any more.
>
> The only thing ex_handler_copy() did was to set %ax to the fault type.
>
> It was used by the strange copy_user_generic_unrolled() that had a
> special machine check case for "don't do the tail if we get
> X86_TRAP_MC".

Oh, I see, thanks!

I am using a downstream kernel (way before 034ff37d3407) that still has
_ASM_EXTABLE_CPY. Looks like I just need to adjust ex_handler_copy() in
my code.