[GIT,PULL] x86/shstk for 6.4

Message ID 20230424212130.590684-1-dave.hansen@linux.intel.com
State New
Series [GIT,PULL] x86/shstk for 6.4

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_shstk_for_6.4

Message

Dave Hansen April 24, 2023, 9:21 p.m. UTC
  Hi Linus,

Please pull x86/shstk for 6.4.  This is the long-awaited Shadow Stack
support.  It is the most highly desired hardware security feature in a
long time.  Both AMD and Intel have (compatible) implementations.  It
has been present on the Intel side since the 11th-gen CPUs, but it had a
few stumbles on the way and is a bit tardy.

The trickiest part of this whole thing (IMNHO) was that shadow stacks
exist in a permission grey area.  A shadow stack PTE literally says
Write=0, but some instructions _can_ write to it.  The PTEs also can't
be read-only so they can't be COW'd.  They are oddballs.

The Write=0,Dirty=1 PTE permissions also mean that the dirty bit can not
be used as freely as before.  Those two things combine to create a fair
amount of PTE management churn.
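
To make the oddball combination concrete, here is a minimal sketch of
the check (the series' actual helper differs in details such as
feature gating, and the name here is just for illustration):

    /* x86 shadow stack PTE: Write=0,Dirty=1 */
    static inline bool pte_is_shstk(pte_t pte)
    {
            return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) ==
                   _PAGE_DIRTY;
    }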

A few more things you should know:

 1. There is a non-trivial amount of core mm churn.  It has acks from
    mm folks and I hope it's no surprise to Andrew.  These patches add
    a VMA argument to pte_mkwrite(); a sketch of the signature change
    follows the references below.  There is a new user in Andrew's
    pile[1*] which will need to get fixed up[2*] before this gets
    merged with the mm tree.
 2. There has been an unusual snarl of userspace compatibility issues
    with shadow stacks [3*]. While the move to new arch_prctl() values
    helped, we can still envision scenarios where this old code might
    bite us. The plan is to try to ban any problematic apps from using
    shadow stack if anything comes up in practice.  We should obviously
    be on the lookout for these.
 3. This conflicts with the LAM code which is coming in x86/mm.  I'll
    discuss the resolution when I send x86/mm.

 [1*] 717f95b494ac36 ("mm: don't check VMA write permissions if the PTE/PMD indicates write permissions")
 [2*] https://lore.kernel.org/all/20230419182136.112974-1-broonie@kernel.org/
 [3*] https://lore.kernel.org/lkml/CAHk-=wgP5mk3poVeejw16Asbid0ghDt4okHnWaWKLBkRhQntRA@mail.gmail.com/
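
For reference, the signature change in sketch form (argument order as
in the series' arm conversion; prototypes simplified):

    /* before */
    pte_t pte_mkwrite(pte_t pte);

    /* after: the VMA lets pte_mkwrite() tell shadow stack VMAs apart */
    pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);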

--

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_shstk_for_6.4

for you to fetch changes up to 11c95c77eef6d56c1ef9f55d8afd83ceb6d99996:

  x86/shstk: Enforce only whole copies for ssp_set() (2023-03-27 17:55:51 -0700)

----------------------------------------------------------------
Add x86 shadow stack support.  Includes:
 * New arch_prctl() ABI for enable/disable/lock/debug
 * Plumbing to deal with the new Write=0,Dirty=1 PTE permissions
 * Core mm changes that allow some PTE write functions to take
   a VMA so they can tell if the VMA is for a shadow stack
 * Normal old selftest and documentation

----------------------------------------------------------------
Mike Rapoport (1):
      x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (37):
      Documentation/x86: Add CET shadow stack description
      x86/shstk: Add Kconfig option for shadow stack
      x86/cpufeatures: Add CPU feature flags for shadow stacks
      x86/cpufeatures: Enable CET CR4 bit for shadow stack
      x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
      x86/fpu: Add helper for modifying xstate
      x86/traps: Move control protection handler to separate file
      x86/shstk: Add user control-protection fault handler
      x86/mm: Remove _PAGE_DIRTY from kernel RO pages
      x86/mm: Move pmd_write(), pud_write() up in the file
      mm: Introduce pte_mkwrite_kernel()
      s390/mm: Introduce pmd_mkwrite_kernel()
      mm: Make pte_mkwrite() take a VMA
      x86/mm: Introduce _PAGE_SAVED_DIRTY
      x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
      x86/mm: Start actually marking _PAGE_SAVED_DIRTY
      x86/mm: Check shadow stack page fault errors
      x86/mm: Teach pte_mkwrite() about stack memory
      mm: Add guard pages around a shadow stack.
      mm/mmap: Add shadow stack pages to memory accounting
      mm: Don't allow write GUPs to shadow stack memory
      x86/mm: Introduce MAP_ABOVE4G
      mm: Warn on shadow stack memory in wrong vma
      x86/mm: Warn if create Write=0,Dirty=1 with raw prot
      x86: Introduce userspace API for shadow stack
      x86/shstk: Add user-mode shadow stack support
      x86/shstk: Handle thread shadow stack
      x86/shstk: Introduce routines modifying shstk
      x86/shstk: Handle signals for shadow stack
      x86/shstk: Introduce map_shadow_stack syscall
      x86/shstk: Support WRSS for userspace
      x86: Expose thread features in /proc/$PID/status
      x86/shstk: Wire in shadow stack interface
      selftests/x86: Add shadow stack test
      x86: Add PTRACE interface for shadow stack
      x86/shstk: Add ARCH_SHSTK_STATUS
      x86/shstk: Enforce only whole copies for ssp_set()

Yu-cheng Yu (3):
      mm: Move VM_UFFD_MINOR_BIT from 37 to 38
      mm: Introduce VM_SHADOW_STACK for shadow stack memory
      mm: Re-introduce vm_flags to do_mmap()

 Documentation/filesystems/proc.rst              |   1 +
 Documentation/mm/arch_pgtable_helpers.rst       |   9 +-
 Documentation/x86/index.rst                     |   1 +
 Documentation/x86/shstk.rst                     | 179 ++++++
 arch/alpha/include/asm/pgtable.h                |   6 +-
 arch/arc/include/asm/hugepage.h                 |   2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h       |   7 +-
 arch/arm/include/asm/pgtable-3level.h           |   7 +-
 arch/arm/include/asm/pgtable.h                  |   2 +-
 arch/arm/kernel/signal.c                        |   2 +-
 arch/arm64/include/asm/pgtable.h                |   9 +-
 arch/arm64/kernel/signal.c                      |   2 +-
 arch/arm64/kernel/signal32.c                    |   2 +-
 arch/arm64/mm/trans_pgd.c                       |   4 +-
 arch/csky/include/asm/pgtable.h                 |   2 +-
 arch/hexagon/include/asm/pgtable.h              |   2 +-
 arch/ia64/include/asm/pgtable.h                 |   2 +-
 arch/loongarch/include/asm/pgtable.h            |   4 +-
 arch/m68k/include/asm/mcf_pgtable.h             |   2 +-
 arch/m68k/include/asm/motorola_pgtable.h        |   6 +-
 arch/m68k/include/asm/sun3_pgtable.h            |   6 +-
 arch/microblaze/include/asm/pgtable.h           |   2 +-
 arch/mips/include/asm/pgtable.h                 |   6 +-
 arch/nios2/include/asm/pgtable.h                |   2 +-
 arch/openrisc/include/asm/pgtable.h             |   2 +-
 arch/parisc/include/asm/pgtable.h               |   6 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h    |   2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h    |   4 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h    |   2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h    |   2 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h    |   2 +-
 arch/riscv/include/asm/pgtable.h                |   6 +-
 arch/s390/include/asm/hugetlb.h                 |   4 +-
 arch/s390/include/asm/pgtable.h                 |  14 +-
 arch/s390/mm/pageattr.c                         |   4 +-
 arch/sh/include/asm/pgtable_32.h                |  10 +-
 arch/sparc/include/asm/pgtable_32.h             |   2 +-
 arch/sparc/include/asm/pgtable_64.h             |   6 +-
 arch/sparc/kernel/signal32.c                    |   2 +-
 arch/sparc/kernel/signal_64.c                   |   2 +-
 arch/um/include/asm/pgtable.h                   |   2 +-
 arch/x86/Kconfig                                |  24 +
 arch/x86/Kconfig.assembler                      |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl          |   1 +
 arch/x86/include/asm/cpufeatures.h              |   2 +
 arch/x86/include/asm/disabled-features.h        |  16 +-
 arch/x86/include/asm/fpu/api.h                  |   9 +
 arch/x86/include/asm/fpu/regset.h               |   7 +-
 arch/x86/include/asm/fpu/sched.h                |   3 +-
 arch/x86/include/asm/fpu/types.h                |  16 +-
 arch/x86/include/asm/fpu/xstate.h               |   6 +-
 arch/x86/include/asm/idtentry.h                 |   2 +-
 arch/x86/include/asm/mmu_context.h              |   2 +
 arch/x86/include/asm/pgtable.h                  | 322 +++++++++--
 arch/x86/include/asm/pgtable_types.h            |  56 +-
 arch/x86/include/asm/processor.h                |   8 +
 arch/x86/include/asm/shstk.h                    |  38 ++
 arch/x86/include/asm/special_insns.h            |  13 +
 arch/x86/include/asm/tlbflush.h                 |   3 +-
 arch/x86/include/asm/trap_pf.h                  |   2 +
 arch/x86/include/asm/traps.h                    |  12 +
 arch/x86/include/uapi/asm/mman.h                |   4 +
 arch/x86/include/uapi/asm/prctl.h               |  12 +
 arch/x86/kernel/Makefile                        |   4 +
 arch/x86/kernel/cet.c                           | 152 ++++++
 arch/x86/kernel/cpu/common.c                    |  35 +-
 arch/x86/kernel/cpu/cpuid-deps.c                |   1 +
 arch/x86/kernel/cpu/proc.c                      |  23 +
 arch/x86/kernel/fpu/core.c                      |  54 +-
 arch/x86/kernel/fpu/regset.c                    |  81 +++
 arch/x86/kernel/fpu/xstate.c                    |  90 ++-
 arch/x86/kernel/idt.c                           |   2 +-
 arch/x86/kernel/process.c                       |  21 +-
 arch/x86/kernel/process_64.c                    |   9 +-
 arch/x86/kernel/ptrace.c                        |  12 +
 arch/x86/kernel/shstk.c                         | 499 +++++++++++++++++
 arch/x86/kernel/signal.c                        |   1 +
 arch/x86/kernel/signal_32.c                     |   2 +-
 arch/x86/kernel/signal_64.c                     |   8 +-
 arch/x86/kernel/sys_x86_64.c                    |   6 +-
 arch/x86/kernel/traps.c                         |  87 ---
 arch/x86/mm/fault.c                             |  22 +
 arch/x86/mm/pat/set_memory.c                    |   4 +-
 arch/x86/mm/pgtable.c                           |  38 ++
 arch/x86/xen/enlighten_pv.c                     |   2 +-
 arch/x86/xen/mmu_pv.c                           |   2 +-
 arch/x86/xen/xen-asm.S                          |   2 +-
 arch/xtensa/include/asm/pgtable.h               |   2 +-
 fs/aio.c                                        |   2 +-
 fs/proc/array.c                                 |   6 +
 fs/proc/task_mmu.c                              |   3 +
 include/asm-generic/hugetlb.h                   |   4 +-
 include/linux/mm.h                              |  65 ++-
 include/linux/mman.h                            |   4 +
 include/linux/pgtable.h                         |  14 +
 include/linux/proc_fs.h                         |   2 +
 include/linux/syscalls.h                        |   1 +
 include/uapi/asm-generic/siginfo.h              |   3 +-
 include/uapi/asm-generic/unistd.h               |   2 +-
 include/uapi/linux/elf.h                        |   2 +
 ipc/shm.c                                       |   2 +-
 kernel/sys_ni.c                                 |   1 +
 mm/debug_vm_pgtable.c                           |  16 +-
 mm/gup.c                                        |   2 +-
 mm/huge_memory.c                                |   7 +-
 mm/hugetlb.c                                    |   4 +-
 mm/internal.h                                   |   4 +-
 mm/memory.c                                     |   5 +-
 mm/migrate_device.c                             |   2 +-
 mm/mmap.c                                       |  10 +-
 mm/mprotect.c                                   |   2 +-
 mm/nommu.c                                      |   4 +-
 mm/userfaultfd.c                                |   2 +-
 mm/util.c                                       |   2 +-
 tools/testing/selftests/x86/Makefile            |   2 +-
 tools/testing/selftests/x86/test_shadow_stack.c | 695 ++++++++++++++++++++++++
 116 files changed, 2615 insertions(+), 314 deletions(-)
 create mode 100644 Documentation/x86/shstk.rst
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c
  

Comments

Linus Torvalds April 28, 2023, 6:17 p.m. UTC | #1
On Mon, Apr 24, 2023 at 2:21 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Please pull x86/shstk for 6.4.  This is the long-awaited Shadow Stack
> support.  It is the most highly desired hardware security feature in a
> long time.  Both AMD and Intel have (compatible) implementations.  It
> has been present on the Intel side since the 11th-gen CPUs, but it had a
> few stumbles on the way and is a bit tardy.

Just letting you know that this is still in my queue, but every time I
look at it, I go "I need to spend more time on understanding this".

So this is one of those pull requests that will be done next week,
after I have gotten everything else off my plate, and I have time to
look more closely into it.

If it can be chopped up into smaller pieces ("this is just the
preliminary work part which is all obvious and clear") that might help
me, but I'll get around to it eventually regardless.

                 Linus
  
Edgecombe, Rick P April 29, 2023, 12:26 a.m. UTC | #2
On Fri, 2023-04-28 at 11:17 -0700, Linus Torvalds wrote:
> If it can be chopped up into smaller pieces ("this is just the
> preliminary work part which is all obvious and clear") that might help
> me, but I'll get around to it eventually regardless.

Looking at it with that in mind, I can see how the key bits have gotten
lost in the noise. I've taken a stab at reordering it. I think it is
better at separating out and tagging the boring parts from the thornier
ones.

The rebase was almost clean, and changes are pretty much in the commits
where they were originally. The robots are still checking it for
bisectability, so please consider this branch for review only.

The tags are as follows (in chronological order):

generic_mm
----------
Core MM refactoring in prep for shadow stack memory.

shadow_stack_prep
-----------------
Add Kconfigs and defines needed by later patches

saved_dirty
-----------
Implementation of the "Saved Dirty Bit". The HW has shadow stack as a
weird PTE bit combination: "Write=0,Dirty=1". So the x86 mm code keeps
track of the HW dirty bit in a SW bit when HW Dirty=1 memory gets
write-protected, in order to not inadvertently create shadow stack
memory.
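
Roughly, as a simplified sketch (the real helpers also handle feature
checks and the PMD case):

    static inline pte_t pte_wrprotect(pte_t pte)
    {
            /*
             * Move a set Dirty bit over to SavedDirty before clearing
             * Write, so that Write=0,Dirty=1 (shadow stack) is never
             * created by accident.
             */
            if (pte_dirty(pte))
                    pte = pte_mksaveddirty(pte);
            return pte_clear_flags(pte, _PAGE_RW);
    }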

shadow_stack_mem_boring
-----------------------
Shadow stack memory patches that are fairly mechanical.

shadow_stack_mem_thorny
-----------------------
The nitty gritty of shadow stack memory support.

shadow_stack_core
-----------------
This is the non-MM parts of the shadow stack implementation. It
implements new ABI around shadow stacks (clone, signals, enabling,
etc).

shadow_stack_ptrace
-------------------
Support for ptracers to work with/around shadow stack.


You might find generic_mm, shadow_stack_prep, shadow_stack_mem_boring
and shadow_stack_ptrace to be the most ordinary. And saved_dirty, 
shadow_stack_mem_thorny and shadow_stack_core to be more interesting.

Kindly placed on a host where the tags can be viewed in the commit log
by Dave:
https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=shadow_stack_ptrace
  
Dave Hansen April 29, 2023, 12:40 a.m. UTC | #3
On 4/28/23 17:26, Edgecombe, Rick P wrote:
> Kindly placed on a host where the tags can be viewed in the commit log
> by Dave:
> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=shadow_stack_ptrace

I hope that reordering makes it more obvious and clear what's going on.
But if that turns out to still be too big of a pile to get through for
6.4, I do think the "shadow_stack_mem_thorny" in the reordered pile is a
good stopping point.

I'd be happy to prepare another pull request for the pile up to there if
it sounds sane to everyone else.
  
Linus Torvalds May 6, 2023, 7:34 p.m. UTC | #4
On Fri, Apr 28, 2023 at 5:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> I'd be happy to prepare another pull request for the pile up to there if
> it sounds sane to everyone else.

So I'm going through the original pull request now - I was really
hoping to have been able to do that earlier, but there kept being all
these small pending other issues.

And I'm about a quarter in, haven't even gotten to the meat yet, and
I've already found a bug.

Commit 74fd30bd28e4 ("mm: Make pte_mkwrite() take a VMA") seems to be
just completely broken at least on arm 3-level page tables:

Spot the problem when I distill it down to just a few lines:

    -PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
    +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
    +{
    +       pmd_val(pmd) |= L_PMD_SECT_RDONLY;
    +       return pmd;
    +}

and I think this whole area was done entirely incorrectly - and that
incorrect approach is then why this bug happened.

I think the first patch should just have completely mindlessly
renamed every architecture "pmd/pte_mkwrite()" function as
"pmd/pte_mkwrite_kernel()", and added a single

    #ifndef pte_mkwrite
    static inline pte_t pte_mkwrite(pte_t pte)
    { return pte_mkwrite_kernel(pte); }
    #endif

in <linux/pgtable.h>.

IOW, it should make it *completely* obvious that absolutely no
semantic changes happened, and all that happened is that it added that
single wrapper layer. And all those architecture changes would be
trivial one-liners.

The only possibly subtle thing would be that some existing

   #include <asm/pgtable.h>

include might need to be changed to

   #include <linux/pgtable.h>

instead, because we do have a number of users that seem to include
just the bare asm version.

From some quick grepping, I don't think any of them use 'pte_mkwrite()'
or 'pmd_mkwrite()', but it needs *checking*.

Then, the "add vma" thing would change the above wrapper to literally
just have the 'vma', but to not use it, and still just call
'pte_mkwrite_kernel(pte)'.

Again, absolutely obvious that there are zero semantic changes to any
architectures that don't care.

And finally - for the only architecture that *does* care, ie x86, do
implement the new pte_mkwrite(vma,pte), and do a

  #define pte_mkwrite pte_mkwrite

to let the generic header know that there's now an
architecture-specific version for it, and it shouldn't do that wrapper
that just falls back on the "kernel" version.

End result: all those architectures that do *not* want the vma
argument don't need to do any extra work, and they just implement the
old version, and the only thing that happened was that it was renamed.

Because I really don't want to pull this series as-is, when I found
what looks like a "this broke an architecture that DOES NOT EVEN CARE"
bug in the series.

And yes, my bad for not getting to this earlier to notice this.

Or alternatively - your bad for not going through this with a fine
comb like I started doing.

             Linus
  
Linus Torvalds May 6, 2023, 8:09 p.m. UTC | #5
On Sat, May 6, 2023 at 12:34 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I'm about a quarter in, haven't even gotten to the meat yet, and
> I've already found a bug.

Ok, so despite this I'm going a bit further, just to familiarize
myself with this even if I can't pull it.

And this is horrendous: pte_wrprotect() doing

    if (pte_dirty(pte))
        pte = pte_mksaveddirty(pte);

which actually has two fundamental problems:

 (a) it shouldn't be a conditional thing at all, it should just be bit
operations

 (b) "pte_dirty()" isn't even the right thing to use, since it
includes the SW dirty bit.

so what this *should* do is likely to just do

   pte.val |= ((pte.val >> _PAGE_BIT_DIRTY) & 1) << _PAGE_BIT_SOFT_DIRTY;
   pte.val &= ~_PAGE_DIRTY;

which the compiler should be able to turn into a nice unconditional
instruction sequence. Smaller than any conditional.

(You could simplify the above by just using fixed bit numbers - the
above is written with two shifts just to work with "any bit pair", but
obviously it could be written to be simpler and more straightforward
by just shifting the bit right into place)

I think the compilers may be able to do all that simplification
for you even with a

    if (pte.val & _PAGE_DIRTY)
        pte.val |= _PAGE_SOFT_DIRTY;
    pte.val &= ~_PAGE_DIRTY;

but that is only true when there are *not* things like those
cpu_feature_enabled() tests in between those operations.

So I strongly suspect that all those

     if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))

around this area are only making things worse. You're replacing a
couple of simple bit operations with jump-arounds (and then doing the
bit operations anyway on one side). And you are keeping the compiler
from doing obvious simplifications in the process.

Conditionals are really bad, even when they can be turned into static jumps.

As static jumps they just cause odd code flow, and lack of testing.
And compiler barriers.

All these bit operations are actually cheaper - and would get more
test coverage - if they are just done unconditionally with a couple of
shift-and-mask operations.

Now, my reaction here is

 - the whole shadow stack notion of "dirty but not writable is a magic
marker" is *DISGUSTING*. It's wrong.

   Whatever Intel designer that came up with that - instead of just
picking another bit for the *HARDWARE* to check - should be ashamed.

   Now we have to pick a software bit instead, and play games for
this. BAD BAD BAD.

   I'm assuming this is something where Microsoft went "we already
don't have that, and we want all the sw bits for sw, so do this". But
from a design standpoint it's just nasty.

 - But if we have to play those games, just *play* them. Do it all
unconditionally, and make the x86-64 rules be that "dirty but not
writable" is something we should never have.

   Having two different rules, and conditionals for them, is both more
complex for maintainers, *and* for compilers.

So just make that _PAGE_BIT_SOFT_DIRTY be unconditional, and let's
just live with it. But make the bit operations efficient.

Maybe I'm missing something, and the people who have been working on
this have a really good reason for this mess. But it really looks
horrible to me.

So maybe you can explain in small words why I'm wrong, but right now
my feeling is that not only did I find an arm bug in the series
(trivially fixed with a one-liner, but worrying, and triggered by the
series being done in a particularly fragile way).

But I also think there are some x86 things that are just not done the
way they should have been done.

But again: maybe I don't understand some of the problems.

                   Linus
  
Edgecombe, Rick P May 7, 2023, 12:10 a.m. UTC | #6
On Sat, 2023-05-06 at 12:34 -0700, Linus Torvalds wrote:
> End result: all those architectures that do *not* want the vma
> argument don't need to do any extra work, and they just implement the
> old version, and the only thing that happened was that it was renamed.
>
> Because I really don't want to pull this series as-is, when I found
> what looks like a "this broke an architecture that DOES NOT EVEN CARE"
> bug in the series.
>
> And yes, my bad for not getting to this earlier to notice this.
>
> Or alternatively - your bad for not going through this with a fine
> comb like I started doing.

Oof, yes that definitely looks like a bug. Yes, the ifdef solution
would be a less error prone way to pull off the addition of the VMA. 

I think I did try something like your suggestion during development. My
(maybe misguided) concern was that pte_mkwrite_kernel() might not make
semantic sense for all architectures since not all architectures were
using pte_mkwrite() on kernel memory. Like I know a lot of
architectures don't do read-only memory in the kernel even though they
do it in userspace.

I also think it would still be leaving things in a slightly worse place
than we started by having the actual guts of the pte_mkwrite()'s harder
to grep for.

I'm surprised this was missed in automated testing, since the
consequence of breaking these should have been pretty immediately
obvious. Between that and all the times I and others looked at it
and still missed it, maybe the less error-prone solution is better.
  
Edgecombe, Rick P May 7, 2023, 12:18 a.m. UTC | #7
On Sat, 2023-05-06 at 13:09 -0700, Linus Torvalds wrote:
> On Sat, May 6, 2023 at 12:34 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And I'm about a quarter in, haven't even gotten to the meat yet, and
> > I've already found a bug.
>
> Ok, so despite this I'm going a bit further, just to familiarize
> myself with this even if I can't pull it.
>
> And this is horrendous: pte_wrprotect() doing
>
>     if (pte_dirty(pte))
>         pte = pte_mksaveddirty(pte);
>
> which actually has two fundamental problems:
>
>  (a) it shouldn't be a conditional thing at all, it should just be bit
> operations
>
>  (b) "pte_dirty()" isn't even the right thing to use, since it
> includes the SW dirty bit.

pte_dirty() actually doesn't check the SW dirty bit, but this
probably just reinforces the over-complexity here.

>
[ snip ]
>
> Now, my reaction here is
>
>  - the whole shadow stack notion of "dirty but not writable is a magic
> marker" is *DISGUSTING*. It's wrong.
>
>    Whatever Intel designer that came up with that - instead of just
> picking another bit for the *HARDWARE* to check - should be ashamed.
>
>    Now we have to pick a software bit instead, and play games for
> this. BAD BAD BAD.
>
>    I'm assuming this is something where Microsoft went "we already
> don't have that, and we want all the sw bits for sw, so do this". But
> from a design standpoint it's just nasty.

I can't argue against this.

>
>  - But if we have to play those games, just *play* them. Do it all
> unconditionally, and make the x86-64 rules be that "dirty but not
> writable" is something we should never have.
>
>    Having two different rules, and conditionals for them, is both more
> complex for maintainers, *and* for compilers.
>
> So just make that _PAGE_BIT_SOFT_DIRTY be unconditional, and let's
> just live with it. But make the bit operations efficient.
>
> Maybe I'm missing something, and the people who have been working on
> this have a really good reason for this mess. But it really looks
> horrible to me.
>
> So maybe you can explain in small words why I'm wrong, but right now
> my feeling is that not only did I find an arm bug in the series
> (trivially fixed with a one-liner, but worrying, and triggered by the
> series being done in a particularly fragile way).
>
> But I also think there are some x86 things that are just not done the
> way they should have been done.
>
> But again: maybe I don't understand some of the problems.
>

Reflecting on these points, I think there might have been some amount
of wishful thinking around wanting to draw lines around the ugliness
and keep it at bay.

I'm swayed by your argument to bite the bullet for the reduced
complexity and increased testing. The only caveat that I could point
out would be: with the existing solution we retain the ability to
compile out saved-dirty. It's a bit thorny and if anything goes wrong,
the whole thing can easily be disabled. So there might be an argument
that having two sets of logic is a better thing to start out with, even
if the long term solution is saved-dirty all-the-time.

I don't want to argue it too strongly though, because it's clear now
how you can look at this and think, this can't be right. And in any
case I didn't justify having the two modes in any of the logs.
  
Linus Torvalds May 7, 2023, 12:19 a.m. UTC | #8
On Sat, May 6, 2023 at 5:10 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> I think I did try something like your suggestion during development. My
> (maybe misguided) concern was that pte_mkwrite_kernel() might not make
> semantic sense for all architectures since not all architectures were
> using pte_mkwrite() on kernel memory.

Yeah, I agree, it's not a great name.

How about just calling it "pte_mkwrite_novma()" instead?

That makes semantic sense for all architectures: it's basically saying
"this is the version of mkwrite that doesn't take a vma".

And then it also makes complete sense to say

    #define pte_mkwrite(vma,pte) pte_mkwrite_novma(pte)

(or the "inline function" version of the same just to make sure the
'vma' argument is evaluated) on the architectures that really don't
care about the vma.
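
Ie. something along these lines (a sketch only, keeping the argument
order of the #define above):

    #ifndef pte_mkwrite
    static inline pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
    {
            return pte_mkwrite_novma(pte);
    }
    #endif

    /* an arch that wants the vma then does: #define pte_mkwrite pte_mkwrite */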

And it also makes sense for the x86 kernel use that doesn't have a vma...

So I think "novma" is more closely related to what the semantics
actually are, and the "kernel" thing is just a "on x86-64, we have
this special case where we don't have vma's and don't worry about
losing the dirty bit".

Hmm?

           Linus
  
Linus Torvalds May 7, 2023, 12:38 a.m. UTC | #9
On Sat, May 6, 2023 at 5:18 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Sat, 2023-05-06 at 13:09 -0700, Linus Torvalds wrote:
> >
> >  (b) "pte_dirty()" isn't even the right thing to use, since it
> > includes the SW dirty bit.
>
> pte_dirty() actually doesn't check the SW dirty bit

It really does.

See commit "x86/mm: Start actually marking _PAGE_SAVED_DIRTY", where
the patch is:

  -static inline int pte_dirty(pte_t pte)
  +static inline bool pte_dirty(pte_t pte)
   {
  -     return pte_flags(pte) & _PAGE_DIRTY;
  +     return pte_flags(pte) & _PAGE_DIRTY_BITS;
   }

and that  _PAGE_DIRTY_BITS is a combination of _PAGE_DIRTY | _PAGE_SAVED_DIRTY.
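
Written out, the series defines:

    #define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY)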

So that means that the

        if (pte_dirty(pte))
                pte = pte_mksaveddirty(pte);

in pte_wrprotect() is just nonsensical, and basically says "if either
the real dirty or the SW dirty bit is set, set the SW dirty bit". But
that's entirely redundant wrt the old state of the dirty bit.

In reality it should just 'or' the HW dirty bit into the SW dirty bit
and be done with it.

Of course, maybe I confused the issue by talking about HW dirty and SW
dirty, because we *also* have that entirely *other* legacy
"SOFT_DIRTY" bit that is different from the new SW dirty bit
("SAVED_DIRTY").

Which certainly doesn't make things clearer.  The "soft" in the soft
dirty bit isn't about software (although it's obviously done *in*
software); it's literally "soft" not as in software, but as in just
"not a hard bit": you can clear it for tracking which pages people
write to.

So we end up with two bits that are maintained by software: SOFT and
SAVED, and then it's really easy to just call the "SAVED" bit the "SW
dirty" bit as I did as opposed to the "HW dirty" bit, because it
really is a secondary very of the *hard* HW dirty bit.

I have this fleeting suspicion that we could actually merge the SOFT
and SAVED bits, in that the "SOFT" bit doesn't make much sense if the
pte is read-only, and the SAVED bit in turn doesn't matter if the pte
is RW.

But that's a separate headache, not worth worrying about now.

           Linus
  
Edgecombe, Rick P May 7, 2023, 3:57 p.m. UTC | #10
On Sat, 2023-05-06 at 17:38 -0700, Linus Torvalds wrote:
> So that means that the
> 
>         if (pte_dirty(pte))
>                 pte = pte_mksaveddirty(pte);
> 
> in pte_wrprotect() is just nonsensical, and basically says "if either
> the real dirty or the SW dirty bit is set, set the SW dirty bit". But
> that's entirely redundant wrt the old state of the dirty bit.
> 
> It reality should just 'or' the HW dirty bit into the SW dirty bit
> and
> be done with it.
> 
> Of course, maybe I confused the issue by talking about HW dirty and
> SW
> dirty, because we *also* have that entirely *other* legacy
> "SOFT_DIRTY" bit that is different from the new SW dirty bit
> ("SAVED_DIRTY").

Sorry, I did think you meant the old _PAGE_SOFT_DIRTY when you were
talking about the SW dirty bit here.

Yea, if only _PAGE_SAVED_DIRTY is set, and not _PAGE_DIRTY, then it's
pointless to do pte_mksaveddirty() here. So I guess you were pointing
out an example of the general wrongness you elaborated on. I thought
you were saying it was a functional bug.
  
Edgecombe, Rick P May 7, 2023, 4:24 p.m. UTC | #11
On Sat, 2023-05-06 at 17:19 -0700, Linus Torvalds wrote:
> So I think "novma" is more closely related to what the semantics
> actually are, and the "kernel" thing is just a "on x86-64, we have
> this special case where we don't have vma's and don't worry about
> losing the dirty bit".
> 
> Hmm?

Yea, it seems ok.

To me what pte_mkwrite_novma() really does is make a PTE writable in
the conventional sense. Unfortunately, now that "writable" is starting
to get overloaded, there is no good name specific to the original type
of writable.

I don't know if I'm bikeshedding here, but what pte_mkwrite(pte, vma)
kind of does now is remove a PTE's protection in a general sense. After
using it, userspace can change the mapping. Sometimes with normal
writes for the that type of VMA, sometimes with shadow stack accesses
for another. So I wonder about leaving pte_mkwrite(pte), and creating
pte_mkunprotected(pte, vma) or pte_mkmodifiable(pte, vma). This makes
more sense to me, but then I guess every other architecture has to wade
through this concept to deal with the simpler writable pattern they
care about. It also no longer matches VM_WRITE. I'll stick with
pte_mkwrite_novma(), unless you like that better.

BTW, I forgot to mention that there is another architecture (maybe 2)
that is expected to use this refactor for implementing their shadow
stacks. So FWIW, this churn is not just for x86.
  
Dave Hansen May 8, 2023, 10:57 p.m. UTC | #12
On 5/6/23 13:09, Linus Torvalds wrote:
> Now, my reaction here is
> 
>  - the whole shadow stack notion of "dirty but not writable is a magic
> marker" is *DISGUSTING*. It's wrong.
> 
>    Whatever Intel designer that came up with that - instead of just
> picking another bit for the *HARDWARE* to check - should be ashamed.
>
>    Now we have to pick a software bit instead, and play games for
> this. BAD BAD BAD.
> 
>    I'm assuming this is something where Microsoft went "we already
> don't have that, and we want all the sw bits for sw, so do this". But
> from a design standpoint it's just nasty.

Heh, I won't name names.  But, yeah, it was something like that.

>  - But if we have to play those games, just *play* them. Do it all
> unconditionally, and make the x86-64 rules be that "dirty but not
> writable" is something we should never have.

There's a wrinkle to enforcing that universally.  From the SDM's
"ACCESSED AND DIRTY FLAGS" section:

	If software on one logical processor writes to a page while
	software on another logical processor concurrently clears the
	R/W flag in the paging-structure entry that maps the page,
	execution on some processors may result in the entry’s dirty
	flag being set.

This behavior is gone on shadow stack CPUs, but it does exist on older
ones.  We could theoretically stop being exposed to it by transitioning
all PTE operations that today do:

	1. RW => RO (usually more than one)
	2. TLB flush

to instead take a trip through Present=0 first:

	1. RW => Present=0
	2. TLB flush
	3. Present=0 => RO

Similar to what we do for doing Dirty=1->0.
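
For a single PTE, that detour would look roughly like this
(illustrative only, not code from the series; assume mm/vma/addr/ptep
describe the page in question):

    pte_t pte;

    pte = ptep_get_and_clear(mm, addr, ptep);        /* RW -> Present=0 */
    flush_tlb_range(vma, addr, addr + PAGE_SIZE);    /* kill stale RW TLB entries */
    set_pte_at(mm, addr, ptep, pte_wrprotect(pte));  /* Present=0 -> RO */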

We could probably tolerate the cost for some of the users like ksm.  But
I can't think of a way to do it without making fork() suffer.  fork() of
course modifies the PTE (RW->RO) and flushes the TLB now.  But there
would need to be a Present=0 PTE in there somewhere before the TLB flush.

That fundamentally means there needs to be a second look at the PTEs and
some fault handling for folks that do read-only accesses to the PTEs
during the Present=0 window.

That said, there are some places like:

	pte_mksaveddirty()
and
	pte_clear_saveddirty()

that are doing _extra_ things on shadow stack systems.  That stuff could
be made the common case without functionally breaking any old systems.

So, the rule would be something like:

	The *kernel* will never itself create Write=0,Dirty=1 PTEs

That won't prevent the hardware from still being able to do it behind
our backs on older CPUs.  But it does avoid a few of the special cases.
  
Linus Torvalds May 8, 2023, 11:31 p.m. UTC | #13
On Mon, May 8, 2023 at 3:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> There's a wrinkle to enforcing that universally.  From the SDM's
> "ACCESSED AND DIRTY FLAGS" section:
>
>         If software on one logical processor writes to a page while
>         software on another logical processor concurrently clears the
>         R/W flag in the paging-structure entry that maps the page,
>         execution on some processors may result in the entry’s dirty
>         flag being set.

I was actually wondering about that.

I had this memory that we've done special things in the past to make
sure that the dirty bit is guaranteed stable (ie the whole
"ptep_clear()" dance). But I wasn't sure.

> This behavior is gone on shadow stack CPUs

Ok, so Intel has actually tightened up the rules on setting dirty, and
now guarantees that it will set dirty only if the pte is actually
writable?

> We could probably tolerate the cost for some of the users like ksm.  But
> I can't think of a way to do it without making fork() suffer.  fork() of
> course modifies the PTE (RW->RO) and flushes the TLB now.  But there
> would need to be a Present=0 PTE in there somewhere before the TLB flush.

Yeah, we don't want to make fork() any worse than it already is.  No
question about that.

But if we make the rule be that having the exact dirty bit vs rw bit
semantics only matters for CPUs that do the shadow stack thing, and on
*those* CPU's it's ok to not go through the dance, can we then come up
with a sequence that works for everybody?

> So, the rule would be something like:
>
>         The *kernel* will never itself create Write=0,Dirty=1 PTEs
>
> That won't prevent the hardware from still being able to do it behind
> our backs on older CPUs.  But it does avoid a few of the special cases.

Right. So looking at the fork() case as a nasty example, right now we have

        ptep_set_wrprotect()

on the source pte of a fork(), which atomically just clears the WRITE
bit (and thus guarantees that dirty bits cannot get lost, simply
because it doesn't matter if some other CPU atomically sets another
bit concurrently).

On the destination we don't have any races with concurrent accesses,
and just do entirely non-atomic

                pte = pte_wrprotect(pte);

and then eventually (after other bit games) do

        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);

and basically you're saying that there is no possible common sequence
for that ptep_set_wrprotect() that doesn't penalize some case.

Hmm.

Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be
an atomic clear_bit(), which turns into just

        lock andb $-3, (%reg)

and I guess that would inevitably become a horror of a cmpxchg loop
when you need to move the dirty bit to the SW dirty on CPU's where the
dirty bit can come in late.

How very very horrid.

                     Linus
  
Linus Torvalds May 8, 2023, 11:47 p.m. UTC | #14
On Mon, May 8, 2023 at 4:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be
> an atomic clear_bit(), which turns into just
>
>         lock andb $-3, (%reg)
>
> and I guess that would inevitably become a horror of a cmpxchg loop
> when you need to move the dirty bit to the SW dirty on CPU's where the
> dirty bit can come in late.
>
> How very very horrid.

Oh, maybe not.

So the nice thing here is that we actually *do* have the old value
(although we don't pass it in to ptep_set_wrprotect()), so while the

        lock andb $-3,(%reg)

looks simple and efficient, I do think that it wouldn't necessarily be
any worse to replace it with a

        lock cmpxchg %new,%old,(%reg)

and a well-predicted branch for the (very very unlikely) failure case.

So maybe unifying these two areas wouldn't be too bad.
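
As a rough C sketch of that cmpxchg variant (close to what the series
does for the shadow stack case, with the cpu_feature_enabled() checks
elided):

    static inline void ptep_set_wrprotect(struct mm_struct *mm,
                                          unsigned long addr, pte_t *ptep)
    {
            pte_t old_pte, new_pte;

            old_pte = READ_ONCE(*ptep);
            do {
                    /* pte_wrprotect() moves Dirty over to SavedDirty */
                    new_pte = pte_wrprotect(old_pte);
            } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
    }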

In fact, looking at this code a bit more made me realize that we
probably should always have special-cased the common "is the source MM
single-threaded" case.

Pretty much nobody does "fork()" from a threaded app, because it's not
portable anyway.

So:

 - the nasty racy case is already exceedingly rare, and we're wasting
huge amounts of time on using a locked access here when in 99% of all
cases we shouldn't do that at all!

 - we would probably be *much* better off with a "if (mm->count == 1)"
test that goes off and does *not* do the atomic case (which also
doesn't have any worries about dirty bits). I'll take a well-predicted
conditional branch over an atomic op any day of the week

 - once you do that, the nasty racy case isn't even in the critical
path any more, so it might as well do the unconditional "cmpxchg" loop
that works on all CPU's.

And yes, old CPU's may still set the dirty bit *later*, but we can
indeed make the rule be that the *kernel* never sets that "read-only
and dirty" combination, and then on new CPU's the HW guarantees it
doesn't set it either.

How does that sound? I think we'd potentially speed up fork() quite
noticeably by not doing the locked accesses.

Maybe I just think I'm clever, but I'm actually very stupid and am not
thinking of some obvious problem case.

For example, maybe the mm count doesn't end up being 1 commonly after
all, because we're sharing it with the idle thread or something like
that.

Or maybe there's something even more obviously wrong with the above idea.

                  Linus
  
Dave Hansen May 9, 2023, 12:07 a.m. UTC | #15
On 5/8/23 16:31, Linus Torvalds wrote:
> On Mon, May 8, 2023 at 3:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
...
>> This behavior is gone on shadow stack CPUs
> 
> Ok, so Intel has actually tightened up the rules on setting dirty, and
> now guarantees that it will set dirty only if the pte is actually
> writable?

Yep:

	Specifically, a processor that supports CET will never set the
	dirty flag in a paging-structure entry in which the R/W flag is
	clear.

and this was _absolutely_ one of the things the hardware folks did for
the benefit of software.

As for the mm->users==1 optimization, seems like something sane to
explore.  I can't think of any ways off the top of my head that it would
break, but I'll go take a closer look.
  
Dave Hansen May 12, 2023, 5:34 p.m. UTC | #16
On 5/8/23 16:47, Linus Torvalds wrote:
>  - we would probably be *much* better off with a "if (mm->count == 1)"
> test that goes off and does *not* do the atomic case (which also
> doesn't have any worries about dirty bits). I'll take a well-predicted
> conditional branch over an atomic op any day of the week

Were you really thinking of mm->count==1, or did you mean
mm->mm_users==1?  I _think_ the only clue that things like ptrace() and
kthread_use_mm() are poking around in the page tables is when
mm->mm_users>1.  They don't bump mm->count.

Most mmget() (or get_task_mm()->atomic_inc(&mm->mm_users)) callers are
obviously harmless for our purposes, like proc_pid_statm().

There are others like idxd_copy_cr()->copy_to_user() that are less
obviously OK.  They're OK if they fault during a fork() because the
fault will end up stuck on mmap_read(mm) waiting for fork() to release
its write.

But the worry is if those kthread_use_mm() users *don't* fault:

	CPU-1			CPU-2
	fork()
	// mm->mm_users==1
	ptep_set_wrprotect()
	  ^ non-atomic
				kthread_use_mm()
				// mm->mm_users==2
				copy_to_user()
				// page walker sets Dirty=1

There's always a race there because mm->mm_users can always get bumped
after the fork()er checks it.

Is there something that fixes this race that I'm missing?

We can probably do something more comprehensive to make sure that
mm->mm_users isn't bumped during fork(), but it'll be a bit more
complicated than just checking mm->mm_users in fork().
  
Linus Torvalds May 12, 2023, 9:55 p.m. UTC | #17
On Fri, May 12, 2023 at 12:34 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Were you really thinking of mm->count==1, or did you mean
> mm->mm_users==1?

Yeah, I meant mm_users.

And I was thinking that if it is 1, then it is stable - kind of like
how we do our optimization with file descriptors.

But you're right to worry about possibly other users incrementing the
mm_users count somehow - or just using "mmgrab()" to not increment it,
but be able to still fault on it etc.

> There's always a race there because mm->mm_users can always get bumped
> after the fork()er checks it.

Ugh. I was assuming that if they don't already have a reference to the
mm, they can't get one (because the '1' comes from 'current->mm', and
nobody else has that mm).

But maybe that's just not true. Looking around, we have things like

        pages->source_mm = current->mm;
        mmgrab(pages->source_mm);

that creates a ref to the mm with a grab, and then later it gets used.

So maybe the "no races can happen" is limited to *both* mm_users and
mm_count being 1. What would increment it in that case, when 'current'
is obviously busy forking?

That "both are one" might still be the common case for fork(). Hmm?

                Linus
  
Deepak Gupta May 15, 2023, 9:22 p.m. UTC | #18
On Sun, May 07, 2023 at 04:24:24PM +0000, Edgecombe, Rick P wrote:
>
>BTW, I forgot to mention that there is another architecture (maybe 2)
>that is expected to use this refactor for implementing their shadow
>stacks. So FWIW, this churn is not just for x86.
>

That's right, one of them is RISC-V.

RISC-V control-flow integrity: https://github.com/riscv/riscv-cfi

Since RISC-V PTEs have 3 separate bits for read, write and execute,
write-only (R=0, W=1, X=0) encodings had been reserved, and thus a CPU
supporting this extension will treat this reserved encoding as shadow
stack.

It doesn't get as messy as in the case of x86 (due to the overloading
of the dirty bit), but it still will need a pte helper which marks a
page "shadow stack writeable" or "regular writeable" depending on the
vma.

I plan to use this re-factor for RISC-V shadow stack as well.

RISC-V CFI RFC
https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Note: This is still a WIP. As the spec gets into a stable state, I'll post a v2.

The pte helper discussion on my patch, and the suggestion to converge
with x86 on pte_mkwrite:
https://lore.kernel.org/lkml/7693247c-a55d-a375-3621-1b07115a9d99@redhat.com/

-Deepak
  
Dave Hansen May 15, 2023, 9:36 p.m. UTC | #19
On 5/12/23 14:55, Linus Torvalds wrote:
>> There's always a race there because mm->mm_users can always get bumped
>> after the fork()er checks it.
> Ugh. I was assuming that if they don't already have a reference to the
> mm, they can't get one (becasue the '1' comes from 'current->mm', and
> nobody else has that mm).
> 
> But maybe that's just not true. Looking around, we have things like
> 
>         pages->source_mm = current->mm;
>         mmgrab(pages->source_mm);
> 
> that creates a ref to the mm with a grab, and then later it gets used.
> 
> So maybe the "no races can happen" is limited to *both* mm_users and
> mm_count being 1. What would increment it in that case, when 'current'
> is obviously busy forking?
> 
> That "both are one" might still be the common case for fork(). Hmm?

get_task_mm() is the devil here.  It goes right from having a
task_struct to bumping ->mm_users, no ->mm_count needed.  It also bumps
->mm_users while holding task_lock(), which means we can't do something
simple like take mmap_lock in there to avoid racing with fork().

I did hack something together that seems to work for fork() and
get_task_mm().  Basically, we let get_task_mm()'s legacy behavior be
the fast path.  But it diverts over to a slow path if get_task_mm()
notices that an mm's refcounts and mmap_lock are consistent with a
fork() happening elsewhere.

The slow path releases the task_lock() and acquires mmap_lock so it can
sleep until the (potential) fork() is finished.

On the other side, the fork() code checks ->mm_users and ->mm_count.  It
can now depend on them being stable because it holds mmap_lock and it
diverts the get_task_mm() callers over to the slow path.

This works for two important cases:

1. get_task_mm() callers since they now conditionally use mmap_lock
2. mmgrab() -> mmget_not_zero() users that later take the mmap_lock

I'm also fairly sure it misses some cases outside of those two.  The
patch is also quite ugly.  The "->task_doing_fast_fork" mechanism is
pure hackery, for instance.

This seems to stay on the fast paths pretty well, even with 'top' or
some other /proc poking going on.  In the end, this is balancing the
extra cost of the get_task_mm() slow path with reduced atomic cost in
the fork() path.  It looks promising so far.

Is this worth pursuing?
  
Dave Hansen May 15, 2023, 9:37 p.m. UTC | #20
On 5/15/23 14:36, Dave Hansen wrote:
> Is this worth pursuing?

... and with the actual patch this time
  
Linus Torvalds May 15, 2023, 10:40 p.m. UTC | #21
On Mon, May 15, 2023 at 2:36 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> I did hack something together that seems to work for fork() and
> get_task_mm().  Basically, we let get_task_mm()'s legacy behavior be
> the fast path.

So I don't like the patch very much, and I think you're being too
blinded by the mm_count.

It's not the mm_count itself that is the problem, after all. The
problem is literally "another CPU is using this MM for its TLB fills",
which is what can cause those dirty bits.

My mm_count suggestion was just a simplistic - and clearly overly so -
approximation for that not being the case.

So it's not actually "get_task_mm()" that is the problem - it's the
transition through "switch_mm_irqs_off()" that is the true barrier
here (whether through task switching or using it through switch_mm()).

And basically no uses of "get_task_mm()" lead to that "switch_mm()"
path. The most common case by far for get_task_mm() is just for
various /proc things that want the mm struct for stats or other random
information (ie things like "ps" will do a *lot* of get_task_mm() to
read the command lines, for example, but won't actually _switch_ into
the resulting mm).

So I don't really love  your patch - not because I think it's
necessarily wrong, but because I think it was just a mistake to
concentrate too much on the counts as the "thing to protect".

IOW, let's take a step back. The mm_count idea was just broken. It
seemed simple, but it turned out that the only thing simple was me ;)

No, I think we should look at mm switching instead, and I think the
obvious main culprit here is kthread_use_mm().  Yes, there are other
uses of "switch_mm()" out there, but they tend to be pretty special
(ie we have EFI code that switches temporarily to the EFI page tables
and then switches back, but that will never switch to some *new*
context).

And the good news is that we're actually walking away from that
kthread_use_mm()  use, mostly because a lot of the kthread users of
user space state have been so hugely problematic. So io_uring switched
to "real user threads in kernel space" instead, and this merge window
vhost did too.

So I think instead of my - clearly racy - idea to use mm_count, we
could just use proper serialization with kthread_use_mm(). We could
even introduce a fork-specific lock that just says "you can't do that
during fork()", which honestly is not going to hurt *anything*,
because already nobody does.

And once you *do* have that serialization, at that point you can then
use mm_count as a heuristic, knowing that you won't race with
kthread_use_mm().

Because while kthread_use_mm() does remain, it's for things like the
GPU drivers and some other special driver infrastructure that has some
async user space access, which wouldn't make sense during fork()
anyway.

So I *think* fork() could do something like this:

  struct fork_cookie; // dummy type purely for type checking
  static struct fork_cookie *is_single_threaded(void)
  {
        struct mm_struct *mm = current->mm;
        mutex_lock(&mm->fork_lock);
        if (atomic_read(&mm->mm_users) > 1 ||
            atomic_read(&mm->mm_count) > 1) {
                mutex_unlock(&mm->fork_lock);
                return NULL;
        }
        return (struct fork_cookie *)mm;
  }

  static void release_single_threaded(struct fork_cookie *cookie)
  {
        if (cookie) {
                struct mm_struct *mm = (struct mm_struct *)cookie;
                mutex_unlock(&mm->fork_lock);
        }
  }

or whatever.

Then you just put

        const struct fork_cookie *single_threaded = is_single_threaded();
        ...
        release_single_threaded(single_threaded);

around the whole copy_mm() in fork, and use that "single_threaded"
cookie to decide whether you do the optimized fork or not.

And all you need to do in kthread_[un]use_mm() is to
increment/decrement the mm_users under that same fork_lock.

We can still race with *other* possible uses of the mm_count, but not
with the ones that matter - the ones that use the mm for mapping.

(The whole idle thread use-case needs some thinking about when the
'fork()' itself bounces from core to core, and might leave idle
threads in its wake. But I don't think any speculative idle thread
accesses could possibly mark page tables dirty???)

I dunno. The above is wild handwaving. My hands are waving so hard
that I'm generating a bit of lift. I may well be missing some other
case - AGAIN.

What do you think? A lock that basically will never show even a hint
of contention under real loads seems like a fairly cheap thing to
serialize on.

             Linus
  
Linus Torvalds May 15, 2023, 11:02 p.m. UTC | #22
On Mon, May 15, 2023 at 3:40 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>         mutex_lock(&mm->fork_lock);
>         if (atomic_read(&mm->mm_users) > 1 ||
>             atomic_read(&mm->mm_count) > 1) {

Again, just to clarify: at this point the mm_users/count checks are
purely a heuristic. If they are both 1, then we know we're *currently*
the only user, and the fork_lock means that no new users that matter
can come in.

If we race with things like /proc accesses that have raised either of
those counts, and we happen to say "ok, we're not alone in the
universe", then that just means that we fall back to the slow
thread-safe path. Unfortunate, but not a huge deal.

The slow path is what we do right now, and while it might be a bit
slower with a 'lock cmpxchg' instead of a 'lock andb', it shouldn't be
some *insanely* much slower case. So what we really want to do here is
an optimistic and cheap "can we do the fast case" for the presumably
common situation where we're really just a bog-standard
single-threaded fork.

IOW, the mm counts aren't somehow semantically important, we're
literally just looking for a cheap test for a common case.

That's my argument, at least. There may be flaws in that argument,
like some memory ordering with some exit() case that decrements
mm_users, but those should very much happen only after the exit() has
already stopped using the page tables, so I don't think those can
actually matter.

Keyword here being "I don't think".

               Linus
  
Linus Torvalds May 16, 2023, 8:38 p.m. UTC | #23
On Mon, May 15, 2023 at 3:40 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I *think* fork() could do something like this:
>
>   struct fork_cookie; // dummy type purely for type checking
>   static struct fork_cookie *is_single_threaded(void)
>   {
>         struct mm_struct *mm = current->mm;
>         mutex_lock(&mm->fork_lock);

Actually, let's not even bother with a new fork_lock.

I for some reason thought that fork only took the mmap_lock for
reading (because it kind of does just read the old VM data), but when
I actually look at dup_mmap() I notice that what it does is actually
just

        if (mmap_write_lock_killable(oldmm)) ..

and so I think we can just use that as the lock.

So then all we'd need is to use mmap_read_lock(mm) in kthread_use_mm()
around the mmgrab.

I don't think we even need it in kthread_unuse_mm(), because
*decrementing* the mm counters isn't even something we need to worry
about.

How does that sound?

                Linus
  
Dave Hansen May 16, 2023, 8:42 p.m. UTC | #24
On 5/16/23 13:38, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 3:40 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> So I *think* fork() could do something like this:
>>
>>   struct fork_cookie; // dummy type purely for type checking
>>   static struct fork_cookie *is_single_threaded(void)
>>   {
>>         struct mm_struct *mm = current->mm;
>>         mutex_lock(&mm->fork_lock);
> Actually, let's not even bother with a new fork_lock.
> 
> I for some reason thought that fork only took the mmap_lock for
> reading (because it kind of does just read the old VM data), but when
> I actually look at dup_mmap() I notice that what it does is actually
> just
> 
>         if (mmap_write_lock_killable(oldmm)) ..
> 
> and so I think we can just use that as the lock.
> 
> So then all we'd need is to use mmap_read_lock(mm) in kthread_use_mm()
> around the mmgrab.
> 
> I don't think we even need it in kthread_unuse_mm(), because
> *decrementing* the mm counters isn't even something we need to worry
> about.
> 
> How does that sound?

Sounds promising.  I was really hesitant to bring a new lock into the world.

I need to digest what you wrote yesterday and convince myself a little
more that doing it at kthread_unuse_mm() is sufficient, but that's my
problem. ;)
  
Mark Brown May 25, 2023, 4:20 p.m. UTC | #25
On Mon, May 15, 2023 at 02:22:55PM -0700, Deepak Gupta wrote:
> On Sun, May 07, 2023 at 04:24:24PM +0000, Edgecombe, Rick P wrote:

> > BTW, I forgot to mention that there is another architecture (maybe 2)
> > that is expected to use this refactor for implementing their shadow
> > stacks. So FWIW, this churn is not just for x86.

> That's right, one of them is RISC-V.

Also arm64.

> RISC-V control-flow integrity: https://github.com/riscv/riscv-cfi

> Since RISC-V PTE have 3 separate bits for read, write and execute. Write
> only (R=0, W=1, X=0) encodings had been reserved and thus cpu supporting
> this extension will treat this reserved encoding as shadow stack.

> It doesn't get messy as in case of x86 (due to overloading of dirty bit),
> but it still will need pte helper which marks a page "shadow stack
> writeable" or "regular writeable" depending on vma.

For arm64, GCS (our shadow stack equivalent) is built on top of another
extension that allows us to assign arbitrary meanings to four of the
bits (they become an index into an array of actual permissions) so we
might be able to avoid having to look at the VMA, though we might want
to in future in order to make better use of the other features of the
indirection extension.