Message ID | 20230424212130.590684-1-dave.hansen@linux.intel.com |
---|---|
State | New |
Headers |
From: Dave Hansen <dave.hansen@linux.intel.com>
To: torvalds@linux-foundation.org
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, keescook@chromium.org, akpm@linux-foundation.org, rick.p.edgecombe@intel.com, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [GIT PULL] x86/shstk for 6.4
Date: Mon, 24 Apr 2023 14:21:30 -0700
Message-Id: <20230424212130.590684-1-dave.hansen@linux.intel.com>
Series |
[GIT,PULL] x86/shstk for 6.4
|
|
Pull-request
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_shstk_for_6.4
Message
Dave Hansen
April 24, 2023, 9:21 p.m. UTC
Hi Linus,

Please pull x86/shstk for 6.4. This is the long-awaited Shadow Stack support. It is the most highly desired hardware security feature in a long time. Both AMD and Intel have (compatible) implementations. It has been present on the Intel side since the 11th-gen CPUs, but it had a few stumbles on the way and is a bit tardy.

The trickiest part of this whole thing (IMNSHO) was that shadow stacks exist in a permission grey area. A shadow stack PTE literally says Write=0, but some instructions _can_ write to it. The PTEs also can't be read-only, so they can't be COW'd. They are oddballs.

The Write=0,Dirty=1 PTE permissions also mean that the dirty bit cannot be used as freely as before. Those two things combine to create a fair amount of PTE management churn.

A few more things you should know:

 1. There is a non-trivial amount of core mm churn. It has acks from mm folks and I hope it's no surprise to Andrew. These add a VMA argument to pte_mkwrite(). There is a new user in Andrew's pile[1*] which will need to get fixed up[2*] before this gets merged with the mm tree.

 2. There has been an unusual snarl of userspace compatibility issues with shadow stacks [3*]. While the move to new arch_prctl() values helped, we can still envision scenarios where this old code might bite us. The plan is to try to ban any problematic apps from using shadow stack if anything comes up in practice. We should obviously be on the lookout for these.

 3. This conflicts with the LAM code which is coming in x86/mm. I'll discuss the resolution when I send x86/mm.

[1*] 717f95b494ac36 ("mm: don't check VMA write permissions if the PTE/PMD indicates write permissions")
[2*] https://lore.kernel.org/all/20230419182136.112974-1-broonie@kernel.org/
[3*] https://lore.kernel.org/lkml/CAHk-=wgP5mk3poVeejw16Asbid0ghDt4okHnWaWKLBkRhQntRA@mail.gmail.com/

--

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_shstk_for_6.4

for you to fetch changes up to 11c95c77eef6d56c1ef9f55d8afd83ceb6d99996:

  x86/shstk: Enforce only whole copies for ssp_set() (2023-03-27 17:55:51 -0700)

----------------------------------------------------------------
Add x86 shadow stack support.
Includes:
 * New arch_prctl() ABI for enable/disable/lock/debug
 * Plumbing to deal with the new Write=0,Dirty=1 PTE permissions
 * Core mm changes that allow some PTE write functions to take a VMA so they can tell if the VMA is for a shadow stack
 * Normal old selftest and documentation

----------------------------------------------------------------
Mike Rapoport (1):
      x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (37):
      Documentation/x86: Add CET shadow stack description
      x86/shstk: Add Kconfig option for shadow stack
      x86/cpufeatures: Add CPU feature flags for shadow stacks
      x86/cpufeatures: Enable CET CR4 bit for shadow stack
      x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
      x86/fpu: Add helper for modifying xstate
      x86/traps: Move control protection handler to separate file
      x86/shstk: Add user control-protection fault handler
      x86/mm: Remove _PAGE_DIRTY from kernel RO pages
      x86/mm: Move pmd_write(), pud_write() up in the file
      mm: Introduce pte_mkwrite_kernel()
      s390/mm: Introduce pmd_mkwrite_kernel()
      mm: Make pte_mkwrite() take a VMA
      x86/mm: Introduce _PAGE_SAVED_DIRTY
      x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
      x86/mm: Start actually marking _PAGE_SAVED_DIRTY
      x86/mm: Check shadow stack page fault errors
      x86/mm: Teach pte_mkwrite() about stack memory
      mm: Add guard pages around a shadow stack.
      mm/mmap: Add shadow stack pages to memory accounting
      mm: Don't allow write GUPs to shadow stack memory
      x86/mm: Introduce MAP_ABOVE4G
      mm: Warn on shadow stack memory in wrong vma
      x86/mm: Warn if create Write=0,Dirty=1 with raw prot
      x86: Introduce userspace API for shadow stack
      x86/shstk: Add user-mode shadow stack support
      x86/shstk: Handle thread shadow stack
      x86/shstk: Introduce routines modifying shstk
      x86/shstk: Handle signals for shadow stack
      x86/shstk: Introduce map_shadow_stack syscall
      x86/shstk: Support WRSS for userspace
      x86: Expose thread features in /proc/$PID/status
      x86/shstk: Wire in shadow stack interface
      selftests/x86: Add shadow stack test
      x86: Add PTRACE interface for shadow stack
      x86/shstk: Add ARCH_SHSTK_STATUS
      x86/shstk: Enforce only whole copies for ssp_set()

Yu-cheng Yu (3):
      mm: Move VM_UFFD_MINOR_BIT from 37 to 38
      mm: Introduce VM_SHADOW_STACK for shadow stack memory
      mm: Re-introduce vm_flags to do_mmap()

 Documentation/filesystems/proc.rst | 1 +
 Documentation/mm/arch_pgtable_helpers.rst | 9 +-
 Documentation/x86/index.rst | 1 +
 Documentation/x86/shstk.rst | 179 ++++
 arch/alpha/include/asm/pgtable.h | 6 +-
 arch/arc/include/asm/hugepage.h | 2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h | 7 +-
 arch/arm/include/asm/pgtable-3level.h | 7 +-
 arch/arm/include/asm/pgtable.h | 2 +-
 arch/arm/kernel/signal.c | 2 +-
 arch/arm64/include/asm/pgtable.h | 9 +-
 arch/arm64/kernel/signal.c | 2 +-
 arch/arm64/kernel/signal32.c | 2 +-
 arch/arm64/mm/trans_pgd.c | 4 +-
 arch/csky/include/asm/pgtable.h | 2 +-
 arch/hexagon/include/asm/pgtable.h | 2 +-
 arch/ia64/include/asm/pgtable.h | 2 +-
 arch/loongarch/include/asm/pgtable.h | 4 +-
 arch/m68k/include/asm/mcf_pgtable.h | 2 +-
 arch/m68k/include/asm/motorola_pgtable.h | 6 +-
 arch/m68k/include/asm/sun3_pgtable.h | 6 +-
 arch/microblaze/include/asm/pgtable.h | 2 +-
 arch/mips/include/asm/pgtable.h | 6 +-
 arch/nios2/include/asm/pgtable.h | 2 +-
 arch/openrisc/include/asm/pgtable.h | 2 +-
 arch/parisc/include/asm/pgtable.h | 6 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +-
 arch/riscv/include/asm/pgtable.h | 6 +-
 arch/s390/include/asm/hugetlb.h | 4 +-
 arch/s390/include/asm/pgtable.h | 14 +-
 arch/s390/mm/pageattr.c | 4 +-
 arch/sh/include/asm/pgtable_32.h | 10 +-
 arch/sparc/include/asm/pgtable_32.h | 2 +-
 arch/sparc/include/asm/pgtable_64.h | 6 +-
 arch/sparc/kernel/signal32.c | 2 +-
 arch/sparc/kernel/signal_64.c | 2 +-
 arch/um/include/asm/pgtable.h | 2 +-
 arch/x86/Kconfig | 24 +
 arch/x86/Kconfig.assembler | 5 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 arch/x86/include/asm/cpufeatures.h | 2 +
 arch/x86/include/asm/disabled-features.h | 16 +-
 arch/x86/include/asm/fpu/api.h | 9 +
 arch/x86/include/asm/fpu/regset.h | 7 +-
 arch/x86/include/asm/fpu/sched.h | 3 +-
 arch/x86/include/asm/fpu/types.h | 16 +-
 arch/x86/include/asm/fpu/xstate.h | 6 +-
 arch/x86/include/asm/idtentry.h | 2 +-
 arch/x86/include/asm/mmu_context.h | 2 +
 arch/x86/include/asm/pgtable.h | 322 +++++++++-
 arch/x86/include/asm/pgtable_types.h | 56 +-
 arch/x86/include/asm/processor.h | 8 +
 arch/x86/include/asm/shstk.h | 38 ++
 arch/x86/include/asm/special_insns.h | 13 +
 arch/x86/include/asm/tlbflush.h | 3 +-
 arch/x86/include/asm/trap_pf.h | 2 +
 arch/x86/include/asm/traps.h | 12 +
 arch/x86/include/uapi/asm/mman.h | 4 +
 arch/x86/include/uapi/asm/prctl.h | 12 +
 arch/x86/kernel/Makefile | 4 +
 arch/x86/kernel/cet.c | 152 ++++++
 arch/x86/kernel/cpu/common.c | 35 +-
 arch/x86/kernel/cpu/cpuid-deps.c | 1 +
 arch/x86/kernel/cpu/proc.c | 23 +
 arch/x86/kernel/fpu/core.c | 54 +-
 arch/x86/kernel/fpu/regset.c | 81 +++
 arch/x86/kernel/fpu/xstate.c | 90 ++-
 arch/x86/kernel/idt.c | 2 +-
 arch/x86/kernel/process.c | 21 +-
 arch/x86/kernel/process_64.c | 9 +-
 arch/x86/kernel/ptrace.c | 12 +
 arch/x86/kernel/shstk.c | 499 +++++++++++++++++
 arch/x86/kernel/signal.c | 1 +
 arch/x86/kernel/signal_32.c | 2 +-
 arch/x86/kernel/signal_64.c | 8 +-
 arch/x86/kernel/sys_x86_64.c | 6 +-
 arch/x86/kernel/traps.c | 87 ---
 arch/x86/mm/fault.c | 22 +
 arch/x86/mm/pat/set_memory.c | 4 +-
 arch/x86/mm/pgtable.c | 38 ++
 arch/x86/xen/enlighten_pv.c | 2 +-
 arch/x86/xen/mmu_pv.c | 2 +-
 arch/x86/xen/xen-asm.S | 2 +-
 arch/xtensa/include/asm/pgtable.h | 2 +-
 fs/aio.c | 2 +-
 fs/proc/array.c | 6 +
 fs/proc/task_mmu.c | 3 +
 include/asm-generic/hugetlb.h | 4 +-
 include/linux/mm.h | 65 ++-
 include/linux/mman.h | 4 +
 include/linux/pgtable.h | 14 +
 include/linux/proc_fs.h | 2 +
 include/linux/syscalls.h | 1 +
 include/uapi/asm-generic/siginfo.h | 3 +-
 include/uapi/asm-generic/unistd.h | 2 +-
 include/uapi/linux/elf.h | 2 +
 ipc/shm.c | 2 +-
 kernel/sys_ni.c | 1 +
 mm/debug_vm_pgtable.c | 16 +-
 mm/gup.c | 2 +-
 mm/huge_memory.c | 7 +-
 mm/hugetlb.c | 4 +-
 mm/internal.h | 4 +-
 mm/memory.c | 5 +-
 mm/migrate_device.c | 2 +-
 mm/mmap.c | 10 +-
 mm/mprotect.c | 2 +-
 mm/nommu.c | 4 +-
 mm/userfaultfd.c | 2 +-
 mm/util.c | 2 +-
 tools/testing/selftests/x86/Makefile | 2 +-
 tools/testing/selftests/x86/test_shadow_stack.c | 695 ++++++++++++++++++++++++
 116 files changed, 2615 insertions(+), 314 deletions(-)
 create mode 100644 Documentation/x86/shstk.rst
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c
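For readers who want to see what the new ABI looks like from userspace, here is a minimal editorial sketch, not code from the series; the constant names follow the new arch/x86/include/uapi/asm/prctl.h, but the numeric values used here are assumptions and may differ in your tree.

    /*
     * Editorial sketch of the arch_prctl() enable flow described above,
     * not an example shipped with the series.  ARCH_SHSTK_ENABLE and
     * ARCH_SHSTK_SHSTK values below are assumed.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef ARCH_SHSTK_ENABLE
    #define ARCH_SHSTK_ENABLE	0x5001		/* assumed value */
    #define ARCH_SHSTK_SHSTK	(1ULL << 0)	/* assumed value */
    #endif

    int main(void)
    {
    	/* Ask the kernel to set up and enable a shadow stack for this thread. */
    	if (syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
    		perror("arch_prctl(ARCH_SHSTK_ENABLE)");
    		return 1;
    	}

    	/*
    	 * Per "x86: Expose thread features in /proc/$PID/status", the enabled
    	 * feature should now be visible in this process's status file.
    	 */
    	printf("shadow stack requested for pid %d\n", getpid());
    	return 0;
    }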
Comments
On Mon, Apr 24, 2023 at 2:21 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Please pull x86/shstk for 6.4. This is the long-awaited Shadow Stack support. It is the most highly desired hardware security feature in a long time. Both AMD and Intel have (compatible) implementations. It has been present on the Intel side since the 11th-gen CPUs, but it had a few stumbles on the way and is a bit tardy.

Just letting you know that this is still in my queue, but every time I look at it, I go "I need to spend more time on understanding this".

So this is one of those pull requests that will be done next week, after I have gotten everything else off my plate, and I have time to look more closely into it.

If it can be chopped up into smaller pieces ("this is just the preliminary work part which is all obvious and clear") that might help me, but I'll get around to it eventually regardless.

              Linus
On Fri, 2023-04-28 at 11:17 -0700, Linus Torvalds wrote:
> If it can be chopped up into smaller pieces ("this is just the preliminary work part which is all obvious and clear") that might help me, but I'll get around to it eventually regardless.

Looking at it with that in mind, I can see how the key bits have gotten lost in the noise. I've taken a stab at reordering it. I think it is better at separating out and tagging the boring parts from the thornier ones. The rebase was almost clean, and changes are pretty much in the commits where they were originally. The robots are still checking it for bisectability, so please consider this branch for review only.

The tags are as follows (in chronological order):

generic_mm
----------
Core MM refactoring in prep for shadow stack memory.

shadow_stack_prep
-----------------
Add Kconfig options and defines needed by later patches.

saved_dirty
-----------
Implementation of the "Saved Dirty Bit". The HW has shadow stack as a weird PTE bit combination: "Write=0,Dirty=1". So the x86 mm code keeps track of the HW dirty bit in a SW bit when HW Dirty=1 memory gets write-protected, in order to not inadvertently create shadow stack memory.

shadow_stack_mem_boring
-----------------------
Shadow stack memory patches that are fairly mechanical.

shadow_stack_mem_thorny
-----------------------
The nitty gritty of shadow stack memory support.

shadow_stack_core
-----------------
These are the non-MM parts of the shadow stack implementation. It implements new ABI around shadow stacks (clone, signals, enabling, etc).

shadow_stack_ptrace
-------------------
Support for ptracers to work with/around shadow stack.

You might find generic_mm, shadow_stack_prep, shadow_stack_mem_boring and shadow_stack_ptrace to be the most ordinary. And saved_dirty, shadow_stack_mem_thorny and shadow_stack_core to be more interesting.

Kindly placed on a host where the tags can be viewed in the commit log by Dave:
https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=shadow_stack_ptrace
On 4/28/23 17:26, Edgecombe, Rick P wrote:
> Kindly placed on a host where the tags can be viewed in the commit log by Dave:
> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=shadow_stack_ptrace

I hope that reordering makes it more obvious and clear what's going on. But if that turns out to still be too big of a pile to get through for 6.4, I do think the "shadow_stack_mem_thorny" tag in the reordered pile is a good stopping point. I'd be happy to prepare another pull request for the pile up to there if it sounds sane to everyone else.
On Fri, Apr 28, 2023 at 5:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> I'd be happy to prepare another pull request for the pile up to there if it sounds sane to everyone else.

So I'm going through the original pull request now - I was really hoping to have been able to do that earlier, but there kept being all these small pending other issues.

And I'm about a quarter in, haven't even gotten to the meat yet, and I've already found a bug.

Commit 74fd30bd28e4 ("mm: Make pte_mkwrite() take a VMA") seems to be just completely broken at least on arm 3-level page tables. Spot the problem when I distill it down to just a few lines:

    -PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
    +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
    +{
    +       pmd_val(pmd) |= L_PMD_SECT_RDONLY;
    +       return pmd;
    +}

and I think this whole area was done entirely incorrectly - and that incorrect approach is then why this bug happened.

I think the first patch should have just completely mindlessly renamed every architecture "pmd/pte_mkwrite()" function as "pmd/pte_mkwrite_kernel()", and added a single

    #ifndef pte_mkwrite
    static inline pte_t pte_mkwrite(pte_t pte)
    {
        return pte_mkwrite_kernel(pte);
    }
    #endif

in <linux/pgtable.h>.

IOW, it should make it *completely* obvious that absolutely no semantic changes happened, and all that happened was that it added that single wrapper layer. And all those architecture changes would be trivial one-liners.

The only possibly subtle thing would be that some existing

    #include <asm/pgtable.h>

include might need to be changed to

    #include <linux/pgtable.h>

instead, because we do have a number of users that seem to include just the bare asm version. From some quick grepping, I don't think any of them use 'pte_dirty()' or 'pmd_dirty()', but it needs *checking*.

Then, the "add vma" thing would change the above wrapper to literally just have the 'vma', but to not use it, and still just call 'pte_mkwrite_kernel(pte)'. Again, absolutely obvious that there are zero semantic changes to any architectures that don't care.

And finally - for the only architecture that *does* care, ie x86, do implement the new pte_mkwrite(vma,pte), and do a

    #define pte_mkwrite pte_mkwrite

to let the generic header know that there's now an architecture-specific version for it, and it shouldn't do that wrapper that just falls back on the "kernel" version.

End result: all those architectures that do *not* want the vma argument don't need to do any extra work, and they just implement the old version, and the only thing that happened was that it was renamed.

Because I really don't want to pull this series as-is, when I found what looks like a "this broke an architecture that DOES NOT EVEN CARE" bug in the series.

And yes, my bad for not getting to this earlier to notice this. Or alternatively - your bad for not going through this with a fine comb like I started doing.

              Linus
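To put the layering Linus is describing in one place, here is an editorial sketch, not the actual patches; the argument order and helper names follow the wording in the emails above.

    /*
     * Editorial sketch of the suggested layering, not merged code.
     *
     * Step 1: every architecture's pte_mkwrite() is mindlessly renamed to
     * pte_mkwrite_kernel() and <linux/pgtable.h> grows a trivial fallback.
     * Step 2: the fallback gains a vma argument that it simply ignores.
     */

    /* <linux/pgtable.h> */
    #ifndef pte_mkwrite
    static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
    {
    	/* Architectures that don't care about the vma never look at it. */
    	return pte_mkwrite_kernel(pte);
    }
    #endif

    /* arch/x86/include/asm/pgtable.h -- the one architecture that does care: */
    pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
    #define pte_mkwrite pte_mkwrite	/* suppress the generic fallback */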
On Sat, May 6, 2023 at 12:34 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And I'm about a quarter in, haven't even gotten to the meat yet, and I've already found a bug.

Ok, so despite this I'm going a bit further, just to familiarize myself with this even if I can't pull it.

And this is horrendous: pte_wrprotect() doing

    if (pte_dirty(pte))
        pte = pte_mksaveddirty(pte);

which actually has two fundamental problems:

 (a) it shouldn't be a conditional thing at all, it should just be bit operations

 (b) "pte_dirty()" isn't even the right thing to use, since it includes the SW dirty bit.

so what this *should* do is likely to just do

    pte.val |= ((pte.val >> _PAGE_BIT_DIRTY) & 1) << _PAGE_BIT_SOFT_DIRTY;
    pte.val &= ~_PAGE_DIRTY;

which the compiler should be able to turn into a nice unconditional series. Smaller than any conditional.

(You could simplify the above by just using fixed bit numbers - the above is written with two shifts just to work with "any bit pair", but obviously it could be written to be simpler and more straightforward by just shifting the bit right into place)

I think the compilers may be able to do all that simplification for you even with a

    if (pte.val & _PAGE_DIRTY)
        pte.val |= _PAGE_SOFT_DIRTY;
    pte.val &= ~_PAGE_DIRTY;

but that is only true when there are *not* things like those cpu_feature_enabled() tests in between those operations.

So I strongly suspect that all those

    if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))

tests around this area are only making things worse. You're replacing a couple of simple bit operations with jump-arounds (and then doing the bit operations anyway on one side). And you are limiting the compiler from doing obvious simplifications in the process.

Conditionals are really bad, even when they can be turned into static jumps. As static jumps they just cause odd code flow, and lack of testing. And compiler barriers. All these bit operations are actually cheaper - and would get more test coverage - if they are just done unconditionally with a couple of shift-and-mask operations.

Now, my reaction here is

 - the whole shadow stack notion of "dirty but not writable is a magic marker" is *DISGUSTING*. It's wrong.

   Whatever Intel designer that came up with that - instead of just picking another bit for the *HARDWARE* to check - should be ashamed.

   Now we have to pick a software bit instead, and play games for this. BAD BAD BAD.

   I'm assuming this is something where Microsoft went "we already don't have that, and we want all the sw bits for sw, so do this". But from a design standpoint it's just nasty.

 - But if we have to play those games, just *play* them. Do it all unconditionally, and make the x86-64 rules be that "dirty but not writable" is something we should never have.

   Having two different rules, and conditionals for them, is both more complex for maintainers, *and* for compilers.

So just make that _PAGE_BIT_SOFT_DIRTY be unconditional, and let's just live with it. But make the bit operations efficient.

Maybe I'm missing something, and the people who have been working on this have a really good reason for this mess. But it really looks horrible to me.

So maybe you can explain in small words why I'm wrong, but right now my feeling is that not only did I find an arm bug in the series (trivially fixed with a one-liner, but worrying, and triggered by the series being done in a particularly fragile way). But I also think there are some x86 things that are just not done the way they should have been done.

But again: maybe I don't understand some of the problems.

              Linus
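For reference, here is the branch-free version Linus is sketching, written out as a compilable helper; this is an editorial illustration, the bit positions are placeholders, and in the series the software bit is _PAGE_SAVED_DIRTY rather than the legacy soft-dirty bit.

    #include <stdint.h>

    /*
     * Editorial sketch of the "just do the bit operations unconditionally"
     * suggestion above -- not kernel code.  Bit positions are placeholders.
     */
    #define _PAGE_BIT_DIRTY		6	/* hardware dirty bit (placeholder) */
    #define _PAGE_BIT_SAVED_DIRTY	11	/* software copy (placeholder) */
    #define _PAGE_DIRTY		(1ULL << _PAGE_BIT_DIRTY)

    static inline uint64_t wrprotect_move_dirty(uint64_t pteval)
    {
    	/* Copy the HW dirty bit into the SW "saved dirty" bit... */
    	pteval |= ((pteval >> _PAGE_BIT_DIRTY) & 1) << _PAGE_BIT_SAVED_DIRTY;
    	/* ...and clear the HW dirty bit, so Write=0,Dirty=1 never appears. */
    	pteval &= ~_PAGE_DIRTY;
    	return pteval;
    }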
On Sat, 2023-05-06 at 12:34 -0700, Linus Torvalds wrote:
> End result: all those architectures that do *not* want the vma argument don't need to do any extra work, and they just implement the old version, and the only thing that happened was that it was renamed.
>
> Because I really don't want to pull this series as-is, when I found what looks like a "this broke an architecture that DOES NOT EVEN CARE" bug in the series.
>
> And yes, my bad for not getting to this earlier to notice this.
>
> Or alternatively - your bad for not going through this with a fine comb like I started doing.

Oof, yes that definitely looks like a bug.

Yes, the ifdef solution would be a less error prone way to pull off the addition of the VMA. I think I did try something like your suggestion during development. My (maybe misguided) concern was that pte_mkwrite_kernel() might not make semantic sense for all architectures, since not all architectures were using pte_mkwrite() on kernel memory. Like I know a lot of architectures don't do read-only memory in the kernel even though they do it in userspace. I also think it would still be leaving things in a slightly worse place than we started, by having the actual guts of the pte_mkwrite()'s harder to grep for.

I'm surprised this was missed in automated testing, since the consequence of breaking these should have been pretty immediately obvious. Between that and all the times I and others looked at it and still failed to spot it, maybe the less error prone solution is better.
On Sat, 2023-05-06 at 13:09 -0700, Linus Torvalds wrote:
> On Sat, May 6, 2023 at 12:34 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > And I'm about a quarter in, haven't even gotten to the meat yet, and I've already found a bug.
>
> Ok, so despite this I'm going a bit further, just to familiarize myself with this even if I can't pull it.
>
> And this is horrendous: pte_wrprotect() doing
>
>     if (pte_dirty(pte))
>         pte = pte_mksaveddirty(pte);
>
> which actually has two fundamental problems:
>
>  (a) it shouldn't be a conditional thing at all, it should just be bit operations
>
>  (b) "pte_dirty()" isn't even the right thing to use, since it includes the SW dirty bit.

pte_dirty() actually doesn't check the SW dirty bit, but this probably just re-enforces the over complexity here.

> [ snip ]
>
> Now, my reaction here is
>
>  - the whole shadow stack notion of "dirty but not writable is a magic marker" is *DISGUSTING*. It's wrong.
>
>    Whatever Intel designer that came up with that - instead of just picking another bit for the *HARDWARE* to check - should be ashamed.
>
>    Now we have to pick a software bit instead, and play games for this. BAD BAD BAD.
>
>    I'm assuming this is something where Microsoft went "we already don't have that, and we want all the sw bits for sw, so do this". But from a design standpoint it's just nasty.

I can't argue against this.

>  - But if we have to play those games, just *play* them. Do it all unconditionally, and make the x86-64 rules be that "dirty but not writable" is something we should never have.
>
>    Having two different rules, and conditionals for them, is both more complex for maintainers, *and* for compilers.
>
>    So just make that _PAGE_BIT_SOFT_DIRTY be unconditional, and let's just live with it. But make the bit operations efficient.
>
>    Maybe I'm missing something, and the people who have been working on this have a really good reason for this mess. But it really looks horrible to me.
>
>    So maybe you can explain in small words why I'm wrong, but right now my feeling is that not only did I find an arm bug in the series (trivially fixed with a one-liner, but worrying, and triggered by the series being done in a particularly fragile way). But I also think there are some x86 things that are just not done the way they should have been done.
>
>    But again: maybe I don't understand some of the problems.

Reflecting on these points, I think there might have been some amount of wishful thinking around wanting to draw lines around the ugliness and keep it at bay. I'm swayed by your argument to bite the bullet for the reduced complexity and increased testing.

The only caveat that I could point out would be: with the existing solution we retain the ability to compile out saved-dirty. It's a bit thorny, and if anything goes wrong, the whole thing can easily be disabled. So there might be an argument that having two sets of logic is a better thing to start out with, even if the long term solution is saved-dirty all-the-time. I don't want to argue it too strongly though, because it's clear now how you can look at this and think, this can't be right. And in any case I didn't justify having the two modes in any of the logs.
On Sat, May 6, 2023 at 5:10 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> I think I did try something like your suggestion during development. My (maybe misguided) concern was that pte_mkwrite_kernel() might not make semantic sense for all architectures since not all architectures were using pte_mkwrite() on kernel memory.

Yeah, I agree, it's not a great name.

How about just calling it "pte_mkwrite_novma()" instead? That makes semantic sense for all architectures: it's basically saying "this is the version of mkwrite that doesn't take a vma".

And then it also makes complete sense to say

    #define pte_mkwrite(vma,pte) pte_mkwrite_novma(pte)

(or the "inline function" version of the same just to make sure the 'vma' argument is evaluated) on the architectures that really don't care about the vma.

And it also makes sense for the x86 kernel use that doesn't have a vma...

So I think "novma" is more closely related to what the semantics actually are, and the "kernel" thing is just a "on x86-64, we have this special case where we don't have vma's and don't worry about losing the dirty bit".

Hmm?

              Linus
On Sat, May 6, 2023 at 5:18 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Sat, 2023-05-06 at 13:09 -0700, Linus Torvalds wrote:
> >
> > (b) "pte_dirty()" isn't even the right thing to use, since it includes the SW dirty bit.
>
> pte_dirty() actually doesn't check the SW dirty bit

It really does. See commit "x86/mm: Start actually marking _PAGE_SAVED_DIRTY", where the patch is:

    -static inline int pte_dirty(pte_t pte)
    +static inline bool pte_dirty(pte_t pte)
     {
    -       return pte_flags(pte) & _PAGE_DIRTY;
    +       return pte_flags(pte) & _PAGE_DIRTY_BITS;
     }

and that _PAGE_DIRTY_BITS is a combination of _PAGE_DIRTY | _PAGE_SAVED_DIRTY.

So that means that the

    if (pte_dirty(pte))
        pte = pte_mksaveddirty(pte);

in pte_wrprotect() is just nonsensical, and basically says "if either the real dirty or the SW dirty bit is set, set the SW dirty bit". But that's entirely redundant wrt the old state of the dirty bit. In reality it should just 'or' the HW dirty bit into the SW dirty bit and be done with it.

Of course, maybe I confused the issue by talking about HW dirty and SW dirty, because we *also* have that entirely *other* legacy "SOFT_DIRTY" bit that is different from the new SW dirty bit ("SAVED_DIRTY"). Which certainly doesn't make things clearer.

The "soft" in the soft dirty bit isn't about software (although it's obviously done *in* software); it's literally "soft" not as in software, but as in just "not a hard bit": you can clear it for tracking which pages people write to.

So we end up with two bits that are maintained by software: SOFT and SAVED, and then it's really easy to just call the "SAVED" bit the "SW dirty" bit as I did, as opposed to the "HW dirty" bit, because it really is a secondary version of the *hard* HW dirty bit.

I have this fleeting suspicion that we could actually merge the SOFT and SAVED bits, in that the "SOFT" bit doesn't make much sense if the pte is read-only, and the SAVED bit in turn doesn't matter if the pte is RW. But that's a separate headache, not worth worrying about now.

              Linus
On Sat, 2023-05-06 at 17:38 -0700, Linus Torvalds wrote:
> So that means that the
>
>     if (pte_dirty(pte))
>         pte = pte_mksaveddirty(pte);
>
> in pte_wrprotect() is just nonsensical, and basically says "if either the real dirty or the SW dirty bit is set, set the SW dirty bit". But that's entirely redundant wrt the old state of the dirty bit.
>
> In reality it should just 'or' the HW dirty bit into the SW dirty bit and be done with it.
>
> Of course, maybe I confused the issue by talking about HW dirty and SW dirty, because we *also* have that entirely *other* legacy "SOFT_DIRTY" bit that is different from the new SW dirty bit ("SAVED_DIRTY").

Sorry, I did think you meant the old _PAGE_SOFT_DIRTY when you were talking about the SW dirty bit here. Yea, if only _PAGE_SAVED_DIRTY is set, and not _PAGE_DIRTY, then it's pointless to do pte_mksaveddirty() here. So I guess you were pointing out an example of the general wrongness you elaborated on. I thought you were saying it was a functional bug.
On Sat, 2023-05-06 at 17:19 -0700, Linus Torvalds wrote:
> So I think "novma" is more closely related to what the semantics actually are, and the "kernel" thing is just a "on x86-64, we have this special case where we don't have vma's and don't worry about losing the dirty bit".
>
> Hmm?

Yea, it seems ok. To me what pte_mkwrite_novma() really does is make a PTE writable in the conventional sense. Unfortunately, now that "writable" is starting to get overloaded, there is no good name specific to the original type of writable.

I don't know if I'm bikeshedding here, but what pte_mkwrite(pte, vma) kind of does now is remove a PTE's protection in a general sense. After using it, userspace can change the mapping. Sometimes with normal writes for that type of VMA, sometimes with shadow stack accesses for another. So I wonder about leaving pte_mkwrite(pte), and creating pte_mkunprotected(pte, vma) or pte_mkmodifiable(pte, vma). This makes more sense to me, but then I guess every other architecture has to wade through this concept to deal with the simpler writable pattern they care about. It also no longer matches VM_WRITE. I'll stick with pte_mkwrite_novma(), unless you like that better.

BTW, I forgot to mention that there is another architecture (maybe 2) that is expected to use this refactor for implementing their shadow stacks. So FWIW, this churn is not just for x86.
On 5/6/23 13:09, Linus Torvalds wrote:
> Now, my reaction here is
>
>  - the whole shadow stack notion of "dirty but not writable is a magic marker" is *DISGUSTING*. It's wrong.
>
>    Whatever Intel designer that came up with that - instead of just picking another bit for the *HARDWARE* to check - should be ashamed.
>
>    Now we have to pick a software bit instead, and play games for this. BAD BAD BAD.
>
>    I'm assuming this is something where Microsoft went "we already don't have that, and we want all the sw bits for sw, so do this". But from a design standpoint it's just nasty.

Heh, I won't name names. But, yeah, it was something like that.

>  - But if we have to play those games, just *play* them. Do it all unconditionally, and make the x86-64 rules be that "dirty but not writable" is something we should never have.

There's a wrinkle to enforcing that universally. From the SDM's "ACCESSED AND DIRTY FLAGS" section:

	If software on one logical processor writes to a page while
	software on another logical processor concurrently clears the
	R/W flag in the paging-structure entry that maps the page,
	execution on some processors may result in the entry's dirty
	flag being set.

This behavior is gone on shadow stack CPUs, but it does exist on older ones.

We could theoretically stop being exposed to it by transitioning all PTE operations that today do:

	1. RW => RO (usually more than one)
	2. TLB flush

to instead take a trip through Present=0 first:

	1. RW => Present=0
	2. TLB flush
	3. Present=0 => RO

Similar to what we do for doing Dirty=1->0.

We could probably tolerate the cost for some of the users like ksm. But I can't think of a way to do it without making fork() suffer. fork() of course modifies the PTE (RW->RO) and flushes the TLB now. But there would need to be a Present=0 PTE in there somewhere before the TLB flush. That fundamentally means there needs to be a second look at the PTEs and some fault handling for folks that do read-only accesses to the PTEs during the Present=0 window.

That said, there are some places like pte_mksaveddirty() and pte_clear_saveddirty() that are doing _extra_ things on shadow stack systems. That stuff could be made the common case without functionally breaking any old systems.

So, the rule would be something like:

	The *kernel* will never itself create Write=0,Dirty=1 PTEs

That won't prevent the hardware from still being able to do it behind our backs on older CPUs. But it does avoid a few of the special cases.
On Mon, May 8, 2023 at 3:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> There's a wrinkle to enforcing that universally. From the SDM's "ACCESSED AND DIRTY FLAGS" section:
>
>	If software on one logical processor writes to a page while
>	software on another logical processor concurrently clears the
>	R/W flag in the paging-structure entry that maps the page,
>	execution on some processors may result in the entry's dirty
>	flag being set.

I was actually wondering about that. I had this memory that we've done special things in the past to make sure that the dirty bit is guaranteed stable (ie the whole "ptep_clear()" dance). But I wasn't sure.

> This behavior is gone on shadow stack CPUs

Ok, so Intel has actually tightened up the rules on setting dirty, and now guarantees that it will set dirty only if the pte is actually writable?

> We could probably tolerate the cost for some of the users like ksm. But I can't think of a way to do it without making fork() suffer. fork() of course modifies the PTE (RW->RO) and flushes the TLB now. But there would need to be a Present=0 PTE in there somewhere before the TLB flush.

Yeah, we don't want to make fork() any worse than it already is. No question about that.

But if we make the rule be that having the exact dirty bit vs rw bit semantics only matters for CPUs that do the shadow stack thing, and on *those* CPU's it's ok to not go through the dance, can we then come up with a sequence that works for everybody?

> So, the rule would be something like:
>
>	The *kernel* will never itself create Write=0,Dirty=1 PTEs
>
> That won't prevent the hardware from still being able to do it behind our backs on older CPUs. But it does avoid a few of the special cases.

Right. So looking at the fork() case as a nasty example, right now we have ptep_set_wrprotect() on the source pte of a fork(), which atomically just clears the WRITE bit (and thus guarantees that dirty bits cannot get lost, simply because it doesn't matter if some other CPU atomically sets another bit concurrently).

On the destination we don't have any races with concurrent accesses, and just do an entirely non-atomic

    pte = pte_wrprotect(pte);

and then eventually (after other bit games) do

    set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);

and basically you're saying that there is no possible common sequence for that ptep_set_wrprotect() that doesn't penalize some case.

Hmm.

Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be an atomic clear_bit(), which turns into just

    lock andb $-3, (%reg)

and I guess that would inevitably become a horror of a cmpxchg loop when you need to move the dirty bit to the SW dirty bit on CPU's where the dirty bit can come in late.

How very very horrid.

              Linus
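For concreteness, here is what such a cmpxchg loop can look like; this is an editorial sketch in the style of the series (try_cmpxchg() and pte_wrprotect() are real kernel helpers), not the exact code that was merged.

    /*
     * Editorial sketch of the "cmpxchg loop" alternative to the lock-andb
     * fast path discussed above -- simplified, not the merged code.
     */
    static inline void ptep_set_wrprotect(struct mm_struct *mm,
    				      unsigned long addr, pte_t *ptep)
    {
    	pte_t old_pte, new_pte;

    	old_pte = READ_ONCE(*ptep);
    	do {
    		/*
    		 * pte_wrprotect() clears Write and folds a hardware Dirty
    		 * bit into the software saved-dirty bit, so a racing
    		 * hardware A/D update cannot leave Write=0,Dirty=1 behind.
    		 */
    		new_pte = pte_wrprotect(old_pte);
    	} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
    }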
On Mon, May 8, 2023 at 4:31 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be an atomic clear_bit(), which turns into just
>
>     lock andb $-3, (%reg)
>
> and I guess that would inevitably become a horror of a cmpxchg loop when you need to move the dirty bit to the SW dirty bit on CPU's where the dirty bit can come in late.
>
> How very very horrid.

Oh, maybe not.

So the nice thing here is that we actually *do* have the old value (although we don't pass it in to ptep_set_wrprotect()), so while the

    lock andb $-3, (%reg)

looks simple and efficient, I do think that it wouldn't necessarily be any worse to replace it with a

    lock cmpxchg %new,%old,(%reg)

and a well-predicted branch for the (very very unlikely) failure case. So maybe unifying these two areas wouldn't be too bad.

In fact, looking at this code a bit more made me realize that we probably should always have special-cased the common "is the source MM single-threaded" case. Pretty much nobody does "fork()" from a threaded app, because it's not portable anyway.

So:

 - the nasty racy case is already exceedingly rare, and we're wasting huge amounts of time on using a locked access here when in 99% of all cases we shouldn't do that at all!

 - we would probably be *much* better off with a "if (mm->count == 1)" test that goes off and does *not* do the atomic case (which also doesn't have any worries about dirty bits). I'll take a well-predicted conditional branch over an atomic op any day of the week

 - once you do that, the nasty racy case isn't even in the critical path any more, so it might as well do the unconditional "cmpxchg" loop that works on all CPU's.

And yes, old CPU's may still set the dirty bit *later*, but we can indeed make the rule be that the *kernel* never sets that "read-only and dirty" combination, and then on new CPU's the HW guarantees it doesn't set it either.

How does that sound? I think we'd potentially speed up fork() quite noticeably by not doing the locked accesses.

Maybe I just think I'm clever, but I'm actually very stupid and am not thinking of some obvious problem case. For example, maybe the mm count doesn't end up being 1 commonly after all, because we're sharing it with the idle thread or something like that. Or maybe there's something even more obviously wrong with the above idea.

              Linus
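An editorial sketch of that fast-path idea follows (hypothetical helper name, not kernel code); the rest of the thread digs into why a bare reference-count check is not sufficient on its own without some extra serialization.

    /*
     * Editorial sketch of the "skip the atomic op when the forking mm has a
     * single user" idea -- hypothetical helper, not merged code.
     */
    static inline void fork_wrprotect_src_pte(struct mm_struct *src_mm,
    					  unsigned long addr, pte_t *ptep)
    {
    	if (atomic_read(&src_mm->mm_users) == 1) {
    		/* No other user of this mm: a plain non-atomic update is fine. */
    		set_pte_at(src_mm, addr, ptep, pte_wrprotect(*ptep));
    		return;
    	}

    	/* Racy case: fall back to the atomic (cmpxchg-style) wrprotect. */
    	ptep_set_wrprotect(src_mm, addr, ptep);
    }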
On 5/8/23 16:31, Linus Torvalds wrote:
> On Mon, May 8, 2023 at 3:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
...
>> This behavior is gone on shadow stack CPUs
>
> Ok, so Intel has actually tightened up the rules on setting dirty, and now guarantees that it will set dirty only if the pte is actually writable?

Yep:

	Specifically, a processor that supports CET will never set the
	dirty flag in a paging-structure entry in which the R/W flag is
	clear.

and this was _absolutely_ one of the things the hardware folks did for the benefit of software.

As for the mm->users==1 optimization, seems like something sane to explore. I can't think of any ways off the top of my head that it would break, but I'll go take a closer look.
On 5/8/23 16:47, Linus Torvalds wrote:
> - we would probably be *much* better off with a "if (mm->count == 1)" test that goes off and does *not* do the atomic case (which also doesn't have any worries about dirty bits). I'll take a well-predicted conditional branch over an atomic op any day of the week

Were you really thinking of mm->count==1, or did you mean mm->mm_users==1? I _think_ the only clue that things like ptrace() and kthread_use_mm() are poking around in the page tables is when mm->mm_users>1. They don't bump mm->count.

Most mmget() (or get_task_mm()->atomic_inc(&mm->mm_users)) callers are obviously harmless for our purposes, like proc_pid_statm(). There are others like idxd_copy_cr()->copy_to_user() that are less obviously OK. They're OK if they fault during a fork() because the fault will end up stuck on mmap_read(mm) waiting for fork() to release its write. But the worry is if those kthread_use_mm() users *don't* fault:

	CPU-1				CPU-2
	fork()
	// mm->mm_users==1
	ptep_set_wrprotect()
	^ non-atomic
					kthread_use_mm()
					// mm->mm_users==2
					copy_to_user()
					// page walker sets Dirty=1

There's always a race there because mm->mm_users can always get bumped after the fork()er checks it. Is there something that fixes this race that I'm missing?

We can probably do something more comprehensive to make sure that mm->mm_users isn't bumped during fork(), but it'll be a bit more complicated than just checking mm->mm_users in fork().
On Fri, May 12, 2023 at 12:34 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Were you really thinking of mm->count==1, or did you mean mm->mm_users==1?

Yeah, I meant mm_users. And I was thinking that if it is 1, then it is stable - kind of like how we do our optimization with file descriptors.

But you're right to worry about possibly other users incrementing the mm_users count somehow - or just using "mmgrab()" to not increment it, but be able to still fault on it etc.

> There's always a race there because mm->mm_users can always get bumped after the fork()er checks it.

Ugh. I was assuming that if they don't already have a reference to the mm, they can't get one (because the '1' comes from 'current->mm', and nobody else has that mm).

But maybe that's just not true. Looking around, we have things like

    pages->source_mm = current->mm;
    mmgrab(pages->source_mm);

that creates a ref to the mm with a grab, and then later it gets used.

So maybe the "no races can happen" is limited to *both* mm_users and mm_count being 1. What would increment it in that case, when 'current' is obviously busy forking?

That "both are one" might still be the common case for fork(). Hmm?

              Linus
On Sun, May 07, 2023 at 04:24:24PM +0000, Edgecombe, Rick P wrote:
>
> BTW, I forgot to mention that there is another architecture (maybe 2) that is expected to use this refactor for implementing their shadow stacks. So FWIW, this churn is not just for x86.

That's right, one of them is RISC-V.

RISC-V control-flow integrity: https://github.com/riscv/riscv-cfi

Since RISC-V PTEs have 3 separate bits for read, write and execute, the write-only (R=0, W=1, X=0) encoding had been reserved, and CPUs supporting this extension will treat this reserved encoding as shadow stack. It doesn't get as messy as in the case of x86 (due to the overloading of the dirty bit), but it still will need a pte helper which marks a page "shadow stack writeable" or "regular writeable" depending on the vma. I plan to use this refactor for RISC-V shadow stack as well.

RISC-V CFI RFC: https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Note: This is still a WIP. As the spec gets into a stable state, I'll post a v2.

On my patch, pte helper discussion and suggestion to converge with x86 on pte_mkwrite:
https://lore.kernel.org/lkml/7693247c-a55d-a375-3621-1b07115a9d99@redhat.com/

-Deepak
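To illustrate what such a VMA-dependent helper ends up looking like for an architecture with a native shadow stack PTE encoding, here is a hypothetical sketch (not from the RISC-V CFI RFC); VM_SHADOW_STACK and pte_mkwrite_novma() come from the series above, while pte_mkwrite_shstk() is an assumed per-arch helper.

    /*
     * Hypothetical illustration of a VMA-aware pte_mkwrite() -- not the
     * RISC-V patches.  pte_mkwrite_shstk() is assumed to set the
     * architecture's shadow stack encoding (e.g. R=0,W=1,X=0 on RISC-V).
     */
    static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
    {
    	if (vma->vm_flags & VM_SHADOW_STACK)
    		return pte_mkwrite_shstk(pte);

    	return pte_mkwrite_novma(pte);	/* conventional writable */
    }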
On 5/12/23 14:55, Linus Torvalds wrote:
>> There's always a race there because mm->mm_users can always get bumped after the fork()er checks it.
>
> Ugh. I was assuming that if they don't already have a reference to the mm, they can't get one (because the '1' comes from 'current->mm', and nobody else has that mm).
>
> But maybe that's just not true. Looking around, we have things like
>
>     pages->source_mm = current->mm;
>     mmgrab(pages->source_mm);
>
> that creates a ref to the mm with a grab, and then later it gets used.
>
> So maybe the "no races can happen" is limited to *both* mm_users and mm_count being 1. What would increment it in that case, when 'current' is obviously busy forking?
>
> That "both are one" might still be the common case for fork(). Hmm?

get_task_mm() is the devil here. It goes right from having a task_struct to bumping ->mm_users, no ->mm_count needed. It also bumps ->mm_users while holding task_lock(), which means we can't do something simple like take mmap_lock in there to avoid racing with fork().

I did hack something together that seems to work for fork() and get_task_mm(). Basically, we let get_task_mm()'s legacy behavior be the fast path. But it diverts over to a slow path if get_task_mm() notices that an mm's refcounts and mmap_lock are consistent with a fork() happening elsewhere. The slow path releases the task_lock() and acquires mmap_lock so it can sleep until the (potential) fork() is finished.

On the other side, the fork() code checks ->mm_users and ->mm_count. It can now depend on them being stable because it holds mmap_lock and it diverts the get_task_mm() callers over to the slow path.

This works for two important cases:

 1. get_task_mm() callers, since they now conditionally use mmap_lock
 2. mmgrab() -> mmget_not_zero() users that later take the mmap_lock

I'm also fairly sure it misses some cases outside of those two. The patch is also quite ugly. The "->task_doing_fast_fork" mechanism is pure hackery, for instance. This seems to stay on the fast paths pretty well, even with 'top' or some other /proc poking going on.

In the end, this is balancing the extra cost of the get_task_mm() slow path with reduced atomic cost in the fork() path. It looks promising so far.

Is this worth pursuing?
On 5/15/23 14:36, Dave Hansen wrote:
> Is this worth pursuing?
... and with the actual patch this time
On Mon, May 15, 2023 at 2:36 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> I did hack something together that seems to work for fork() and get_task_mm(). Basically, we let get_task_mm()'s legacy behavior be the fast path.

So I don't like the patch very much, and I think you're being too blinded by the mm_count.

It's not the mm_count itself that is the problem, after all. The problem is literally "another CPU is using this MM for its TLB fills", which is what can cause those dirty bits. My mm_count suggestion was just a simplistic - and clearly overly so - approximation for that not being the case.

So it's not actually "get_task_mm()" that is the problem - it's the transition through "switch_mm_irqs_off()" that is the true barrier here (whether through task switching or using it through switch_mm()).

And basically no uses of "get_task_mm()" lead to that "switch_mm()" path. The most common case by far for get_task_mm() is just for various /proc things that want the mm struct for stats or other random information (ie things like "ps" will do a *lot* of get_task_mm() to read the command lines, for example, but won't actually _switch_ into the resulting mm).

So I don't really love your patch - not because I think it's necessarily wrong, but because I think it was just a mistake to concentrate too much on the counts as the "thing to protect".

IOW, let's take a step back. The mm_count idea was just broken. It seemed simple, but it turned out that the only thing simple was me ;)

No, I think we should look at mm switching instead, and I think the obvious main culprit here is kthread_use_mm().

Yes, there are other uses of "switch_mm()" out there, but they tend to be pretty special (ie we have EFI code that switches temporarily to the EFI page tables and then switches back, but that will never switch to some *new* context).

And the good news is that we're actually walking away from that kthread_use_mm() use, mostly because a lot of the kthread users of user space state have been so hugely problematic. So io_uring switched to "real user threads in kernel space" instead, and this merge window vhost did too.

So I think instead of my - clearly racy - idea to use mm_count, we could just use proper serialization with kthread_use_mm(). We could even introduce a fork-specific lock that just says "you can't do that during fork()", which honestly is not going to hurt *anything*, because already nobody does.

And once you *do* have that serialization, at that point you can then use mm_count as a heuristic, knowing that you won't race with kthread_use_mm(). Because while kthread_use_mm() does remain, it's for things like the GPU drivers and some other special driver infrastructure that has some async user space access, which wouldn't make sense during fork() anyway.

So I *think* fork() could do something like this:

    struct fork_cookie;	// dummy type purely for type checking

    static struct fork_cookie *is_singe_threaded(void)
    {
        struct mm_struct *mm = current->mm;

        mutex_lock(&mm->fork_lock);
        if (atomic_read(&mm->mm_users) > 1 ||
            atomic_read(&mm->mm_count) > 1) {
                mutex_unlock(&mm->fork_lock);
                return NULL;
        }
        return (struct fork_cookie *)mm;
    }

    static void release_single_threaded(struct fork_cookie *cookie)
    {
        if (cookie) {
                struct mm_struct *mm = (struct mm_struct *)cookie;
                mutex_unlock(&mm->fork_lock);
        }
    }

or whatever. Then you just put

    const struct fork_cookie *single_threaded = is_singe_threaded();
    ...
    release_single_threaded(single_threaded);

around the whole copy_mm() in fork, and use that "single_threaded" cookie to decide whether you do the optimized fork or not.

And all you need to do in kthread_[un]use_mm() is to increment/decrement the mm_users under that same fork_lock.

We can still race with *other* possible uses of the mm_count, but not with the ones that matter - the ones that use the mm for mapping.

(The whole idle thread use-case needs some thinking about when the 'fork()' itself bounces from core to core, and might leave idle threads in its wake. But I don't think any speculative idle thread accesses could possibly mark page tables dirty???)

I dunno. The above is wild handwaving. My hands are waving so hard that I'm generating a bit of lift. I may well be missing some other case - AGAIN.

What do you think? A lock that basically will never show even a hint of contention under real loads seems like a fairly cheap thing to serialize on.

              Linus
On Mon, May 15, 2023 at 3:40 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>         mutex_lock(&mm->fork_lock);
>         if (atomic_read(&mm->mm_users) > 1 ||
>             atomic_read(&mm->mm_count) > 1) {

Again, just to clarify: at this point the mm_users/count checks are purely a heuristic.

If they are both 1, then we know we're *currently* the only user, and the fork_lock means that no new users that matter can come in.

If we race with things like /proc accesses that have raised either of those counts, and we happen to say "ok, we're not alone in the universe", then that just means that we fall back to the slow thread-safe path. Unfortunate, but not a huge deal. The slow path is what we do right now, and while it might be a bit slower with a 'lock cmpxchg' instead of a 'lock andb', it shouldn't be some *insanely* much slower case.

So what we really want to do here is an optimistic and cheap "can we do the fast case" check for the presumably common situation where we're really just a bog-standard single-threaded fork. IOW, the mm counts aren't somehow semantically important, we're literally just looking for a cheap test for a common case.

That's my argument, at least. There may be flaws in that argument, like some memory ordering with some exit() case that decrements mm_users, but those should very much happen only after the exit() has already stopped using the page tables, so I don't think those can actually matter.

Keyword here being "I don't think".

              Linus
On Mon, May 15, 2023 at 3:40 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> So I *think* fork() could do something like this:
>
>     struct fork_cookie;	// dummy type purely for type checking
>     static struct fork_cookie *is_singe_threaded(void)
>     {
>         struct mm_struct *mm = current->mm;
>         mutex_lock(&mm->fork_lock);

Actually, let's not even bother with a new fork_lock.

I for some reason thought that fork only took the mmap_lock for reading (because it kind of does just read the old VM data), but when I actually look at dup_mmap() I notice that what it does is actually just

    if (mmap_write_lock_killable(oldmm)) ..

and so I think we can just use that as the lock.

So then all we'd need is to use mmap_read_lock(mm) in kthread_use_mm() around the mmgrab.

I don't think we even need it in kthread_unuse_mm(), because *decrementing* the mm counters isn't even something we need to worry about.

How does that sound?

              Linus
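A sketch of that suggestion (hypothetical, not a patch): kthread_use_mm() takes the mm's mmap_lock for read around the point where it becomes a user of the mm, so dup_mmap(), which already holds mmap_write_lock(oldmm), can trust a "counts == 1" check for the duration of the copy.

    /*
     * Editorial sketch of the proposal above -- not what kthread_use_mm()
     * actually does today.  The only change is the mmap_read_lock() pair
     * around the point where the kthread becomes a user of the mm.
     */
    void kthread_use_mm(struct mm_struct *mm)
    {
    	/*
    	 * dup_mmap() holds mmap_write_lock(oldmm), so taking the lock for
    	 * read here means a fork() relying on a single-user fast path
    	 * cannot be in flight while we become a new user of this mm.
    	 */
    	mmap_read_lock(mm);
    	mmgrab(mm);
    	mmap_read_unlock(mm);

    	/* ... rest of the existing function: task_lock(), switch_mm(), etc. ... */
    }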
On 5/16/23 13:38, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 3:40 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> So I *think* fork() could do something like this:
>>
>>     struct fork_cookie;	// dummy type purely for type checking
>>     static struct fork_cookie *is_singe_threaded(void)
>>     {
>>         struct mm_struct *mm = current->mm;
>>         mutex_lock(&mm->fork_lock);
>
> Actually, let's not even bother with a new fork_lock.
>
> I for some reason thought that fork only took the mmap_lock for reading (because it kind of does just read the old VM data), but when I actually look at dup_mmap() I notice that what it does is actually just
>
>     if (mmap_write_lock_killable(oldmm)) ..
>
> and so I think we can just use that as the lock.
>
> So then all we'd need is to use mmap_read_lock(mm) in kthread_use_mm() around the mmgrab.
>
> I don't think we even need it in kthread_unuse_mm(), because *decrementing* the mm counters isn't even something we need to worry about.
>
> How does that sound?

Sounds promising. I was really hesitant to bring a new lock into the world.

I need to digest what you wrote yesterday and convince myself a little more that doing it at kthread_unuse_mm() is sufficient, but that's my problem. ;)
On Mon, May 15, 2023 at 02:22:55PM -0700, Deepak Gupta wrote:
> On Sun, May 07, 2023 at 04:24:24PM +0000, Edgecombe, Rick P wrote:
> > BTW, I forgot to mention that there is another architecture (maybe 2) that is expected to use this refactor for implementing their shadow stacks. So FWIW, this churn is not just for x86.
>
> That's right, one of them is RISC-V.

Also arm64.

> RISC-V control-flow integrity: https://github.com/riscv/riscv-cfi
>
> Since RISC-V PTEs have 3 separate bits for read, write and execute, the write-only (R=0, W=1, X=0) encoding had been reserved, and CPUs supporting this extension will treat this reserved encoding as shadow stack. It doesn't get as messy as in the case of x86 (due to the overloading of the dirty bit), but it still will need a pte helper which marks a page "shadow stack writeable" or "regular writeable" depending on the vma.

For arm64, GCS (our shadow stack equivalent) is built on top of another extension that allows us to assign arbitrary meanings to four of the bits (they become an index into an array of actual permissions), so we might be able to avoid having to look at the VMA, though we might want to in future in order to make better use of the other features of the indirection extension.