diff mbox series

[v5,18/39] mm: Handle faultless write upgrades for shstk

Message ID	20230119212317.8324-19-rick.p.edgecombe@intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Rick Edgecombe <rick.p.edgecombe@intel.com> To: x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>, Andy Lutomirski <luto@kernel.org>, Balbir Singh <bsingharora@gmail.com>, Borislav Petkov <bp@alien8.de>, Cyrill Gorcunov <gorcunov@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, Eugene Syromiatnikov <esyr@redhat.com>, Florian Weimer <fweimer@redhat.com>, "H . J . Lu" <hjl.tools@gmail.com>, Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>, Kees Cook <keescook@chromium.org>, Mike Kravetz <mike.kravetz@oracle.com>, Nadav Amit <nadav.amit@gmail.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Machek <pavel@ucw.cz>, Peter Zijlstra <peterz@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, Weijiang Yang <weijiang.yang@intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, John Allen <john.allen@amd.com>, kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com Cc: rick.p.edgecombe@intel.com, David Hildenbrand <david@redhat.com>, Yu-cheng Yu <yu-cheng.yu@intel.com> Subject: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk Date: Thu, 19 Jan 2023 13:22:56 -0800 Message-Id: <20230119212317.8324-19-rick.p.edgecombe@intel.com> In-Reply-To: <20230119212317.8324-1-rick.p.edgecombe@intel.com> References: <20230119212317.8324-1-rick.p.edgecombe@intel.com> Precedence: bulk
Series	Shadow stacks for userspace \| [v5,00/39] Shadow stacks for userspace [v5,01/39] Documentation/x86: Add CET shadow stack description [v5,02/39] x86/shstk: Add Kconfig option for shadow stack [v5,03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks [v5,04/39] x86/cpufeatures: Enable CET CR4 bit for shadow stack [v5,05/39] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states [v5,06/39] x86/fpu: Add helper for modifying xstate [v5,07/39] x86: Add user control-protection fault handler [v5,08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages [v5,09/39] x86/mm: Move pmd_write(), pud_write() up in the file [v5,10/39] x86/mm: Introduce _PAGE_COW [v5,11/39] x86/mm: Update pte_modify for _PAGE_COW [v5,12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_D… [v5,13/39] x86/mm: Start actually marking _PAGE_COW [v5,14/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 [v5,15/39] mm: Introduce VM_SHADOW_STACK for shadow stack memory [v5,16/39] x86/mm: Check shadow stack page fault errors [v5,17/39] x86/mm: Update maybe_mkwrite() for shadow stack [v5,18/39] mm: Handle faultless write upgrades for shstk [v5,19/39] mm: Fixup places that call pte_mkwrite() directly [v5,20/39] mm: Add guard pages around a shadow stack. [v5,21/39] mm/mmap: Add shadow stack pages to memory accounting [v5,22/39] mm: Re-introduce vm_flags to do_mmap() [v5,23/39] mm: Don't allow write GUPs to shadow stack memory [v5,24/39] x86/mm: Introduce MAP_ABOVE4G [v5,25/39] mm: Warn on shadow stack memory in wrong vma [v5,26/39] x86: Introduce userspace API for shadow stack [v5,27/39] x86/shstk: Add user-mode shadow stack support [v5,28/39] x86/shstk: Handle thread shadow stack [v5,29/39] x86/shstk: Introduce routines modifying shstk [v5,30/39] x86/shstk: Handle signals for shadow stack [v5,31/39] x86/shstk: Introduce map_shadow_stack syscall [v5,32/39] x86/shstk: Support WRSS for userspace [v5,33/39] x86: Expose thread features in /proc/$PID/status [v5,34/39] x86/shstk: Wire in shadow stack interface [v5,35/39] selftests/x86: Add shadow stack test [v5,36/39] x86/fpu: Add helper for initing features [v5,37/39] x86: Add PTRACE interface for shadow stack [v5,38/39] x86/shstk: Add ARCH_SHSTK_UNLOCK [v5,39/39] x86/shstk: Add ARCH_SHSTK_STATUS

Commit Message

Edgecombe, Rick P Jan. 19, 2023, 9:22 p.m. UTC

  The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Since shadow stack memory can be changed from userspace, is both
VM_SHADOW_STACK and VM_WRITE. But it should not be made conventionally
writable (i.e. pte_mkwrite()). So some code that calls pte_mkwrite() needs
to be adjusted.

One such case is when memory is made writable without an actual write
fault. This happens in some mprotect operations, and also prot_numa faults.
In both cases code checks whether it should be made (conventionally)
writable by calling vma_wants_manual_pte_write_upgrade().

One way to fix this would be have code actually check if memory is also
VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But since
most memory won't be shadow stack, just have simpler logic and skip this
optimization by changing vma_wants_manual_pte_write_upgrade() to not
return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
cases of this type.

Cc: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

v5:
 - Update solution after the recent removal of pte_savedwrite()

v4:
 - Add "why" to comments in code (Peterz)

Yu-cheng v25:
 - Move is_shadow_stack_mapping() to a separate line.

Yu-cheng v24:
 - Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Hildenbrand Jan. 23, 2023, 9:50 a.m. UTC | #1

On 19.01.23 22:22, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
> 
> Since shadow stack memory can be changed from userspace, is both
> VM_SHADOW_STACK and VM_WRITE. But it should not be made conventionally
> writable (i.e. pte_mkwrite()). So some code that calls pte_mkwrite() needs
> to be adjusted.
> 
> One such case is when memory is made writable without an actual write
> fault. This happens in some mprotect operations, and also prot_numa faults.
> In both cases code checks whether it should be made (conventionally)
> writable by calling vma_wants_manual_pte_write_upgrade().
> 
> One way to fix this would be have code actually check if memory is also
> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But since
> most memory won't be shadow stack, just have simpler logic and skip this
> optimization by changing vma_wants_manual_pte_write_upgrade() to not
> return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
> cases of this type.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---

Instead of having these x86-shadow stack details all over the MM space, 
was the option explored to handle this more in arch specific code?

IIUC, one way to get it working would be

1) Have a SW "shadowstack" PTE flag.
2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".

pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions based 
on the "shadowstack" PTE flag and hide all these details from core-mm.

When mapping a shadowstack page (new page, migration, swapin, ...), 
which can be obtained by looking at the VMA flags, the first thing you'd 
do is set the "shadowstack" PTE flag.

Edgecombe, Rick P Jan. 23, 2023, 8:47 p.m. UTC | #2

On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
> On 19.01.23 22:22, Rick Edgecombe wrote:
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> > 
> > Since shadow stack memory can be changed from userspace, is both
> > VM_SHADOW_STACK and VM_WRITE. But it should not be made
> > conventionally
> > writable (i.e. pte_mkwrite()). So some code that calls
> > pte_mkwrite() needs
> > to be adjusted.
> > 
> > One such case is when memory is made writable without an actual
> > write
> > fault. This happens in some mprotect operations, and also prot_numa
> > faults.
> > In both cases code checks whether it should be made
> > (conventionally)
> > writable by calling vma_wants_manual_pte_write_upgrade().
> > 
> > One way to fix this would be have code actually check if memory is
> > also
> > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
> > since
> > most memory won't be shadow stack, just have simpler logic and skip
> > this
> > optimization by changing vma_wants_manual_pte_write_upgrade() to
> > not
> > return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
> > cases of this type.
> > 
> > Cc: David Hildenbrand <david@redhat.com>
> > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > Tested-by: John Allen <john.allen@amd.com>
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> 
> Instead of having these x86-shadow stack details all over the MM
> space, 
> was the option explored to handle this more in arch specific code?
> 
> IIUC, one way to get it working would be
> 
> 1) Have a SW "shadowstack" PTE flag.
> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".

I don't think that idea came up. So vma->vm_page_prot would have the SW
shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
Write=0,Dirty=1 part. It seems like it should work.

> 
> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
> based 
> on the "shadowstack" PTE flag and hide all these details from core-
> mm.
> 
> When mapping a shadowstack page (new page, migration, swapin, ...), 
> which can be obtained by looking at the VMA flags, the first thing
> you'd 
> do is set the "shadowstack" PTE flag.

I guess the downside is that it uses an extra software bit. But the
other positive is that it's less error prone, so that someone writing
core-mm code won't introduce a change that makes shadow stack VMAs
Write=1 if they don't know to also check for VM_SHADOW_STACK.

David Hildenbrand Jan. 24, 2023, 4:24 p.m. UTC | #3

On 23.01.23 21:47, Edgecombe, Rick P wrote:
> On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
>> On 19.01.23 22:22, Rick Edgecombe wrote:
>>> The x86 Control-flow Enforcement Technology (CET) feature includes
>>> a new
>>> type of memory called shadow stack. This shadow stack memory has
>>> some
>>> unusual properties, which requires some core mm changes to function
>>> properly.
>>>
>>> Since shadow stack memory can be changed from userspace, is both
>>> VM_SHADOW_STACK and VM_WRITE. But it should not be made
>>> conventionally
>>> writable (i.e. pte_mkwrite()). So some code that calls
>>> pte_mkwrite() needs
>>> to be adjusted.
>>>
>>> One such case is when memory is made writable without an actual
>>> write
>>> fault. This happens in some mprotect operations, and also prot_numa
>>> faults.
>>> In both cases code checks whether it should be made
>>> (conventionally)
>>> writable by calling vma_wants_manual_pte_write_upgrade().
>>>
>>> One way to fix this would be have code actually check if memory is
>>> also
>>> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
>>> since
>>> most memory won't be shadow stack, just have simpler logic and skip
>>> this
>>> optimization by changing vma_wants_manual_pte_write_upgrade() to
>>> not
>>> return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
>>> cases of this type.
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>>> Tested-by: John Allen <john.allen@amd.com>
>>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> ---
>>
>> Instead of having these x86-shadow stack details all over the MM
>> space,
>> was the option explored to handle this more in arch specific code?
>>
>> IIUC, one way to get it working would be
>>
>> 1) Have a SW "shadowstack" PTE flag.
>> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".
> 
> I don't think that idea came up. So vma->vm_page_prot would have the SW
> shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
> Write=0,Dirty=1 part. It seems like it should work.
> 

Right, if we include it in vma->vm_page_prot, we'd immediately let 
mk_pte() just handle that.

Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma instead 
of the vma->vm_page_prot. Let's see if we can avoid that for now.

>>
>> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
>> based
>> on the "shadowstack" PTE flag and hide all these details from core-
>> mm.
>>
>> When mapping a shadowstack page (new page, migration, swapin, ...),
>> which can be obtained by looking at the VMA flags, the first thing
>> you'd
>> do is set the "shadowstack" PTE flag.
> 
> I guess the downside is that it uses an extra software bit. But the
> other positive is that it's less error prone, so that someone writing
> core-mm code won't introduce a change that makes shadow stack VMAs
> Write=1 if they don't know to also check for VM_SHADOW_STACK.

Right. And I think this mimics the what I would have expected HW to 
provide: a dedicated HW bit, not somehow mangling this into semantics of 
existing bits.

Roughly speaking: if we abstract it that way and get all of the "how to 
set it writable now?" out of core-MM, it not only is cleaner and less 
error prone, it might even allow other architectures that implement 
something comparable (e.g., using a dedicated HW bit) to actually reuse 
some of that work. Otherwise most of that "shstk" is really just x86 
specific ...

I guess the only cases we have to special case would be page pinning 
code where pte_write() would indicate that the PTE is writable (well, it 
is, just not by "ordinary CPU instruction" context directly): but you do 
that already, so ... :)

Sorry for stumbling over that this late, I only started looking into 
this when you CCed me on that one patch.

Edgecombe, Rick P Jan. 24, 2023, 6:14 p.m. UTC | #4

On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote:
> On 23.01.23 21:47, Edgecombe, Rick P wrote:
> > On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
> > > On 19.01.23 22:22, Rick Edgecombe wrote:
> > > > The x86 Control-flow Enforcement Technology (CET) feature
> > > > includes
> > > > a new
> > > > type of memory called shadow stack. This shadow stack memory
> > > > has
> > > > some
> > > > unusual properties, which requires some core mm changes to
> > > > function
> > > > properly.
> > > > 
> > > > Since shadow stack memory can be changed from userspace, is
> > > > both
> > > > VM_SHADOW_STACK and VM_WRITE. But it should not be made
> > > > conventionally
> > > > writable (i.e. pte_mkwrite()). So some code that calls
> > > > pte_mkwrite() needs
> > > > to be adjusted.
> > > > 
> > > > One such case is when memory is made writable without an actual
> > > > write
> > > > fault. This happens in some mprotect operations, and also
> > > > prot_numa
> > > > faults.
> > > > In both cases code checks whether it should be made
> > > > (conventionally)
> > > > writable by calling vma_wants_manual_pte_write_upgrade().
> > > > 
> > > > One way to fix this would be have code actually check if memory
> > > > is
> > > > also
> > > > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
> > > > since
> > > > most memory won't be shadow stack, just have simpler logic and
> > > > skip
> > > > this
> > > > optimization by changing vma_wants_manual_pte_write_upgrade()
> > > > to
> > > > not
> > > > return true for VM_SHADOW_STACK_MEMORY. This will simply handle
> > > > all
> > > > cases of this type.
> > > > 
> > > > Cc: David Hildenbrand <david@redhat.com>
> > > > Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> > > > Tested-by: John Allen <john.allen@amd.com>
> > > > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > > Reviewed-by: Kirill A. Shutemov <
> > > > kirill.shutemov@linux.intel.com>
> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > ---
> > > 
> > > Instead of having these x86-shadow stack details all over the MM
> > > space,
> > > was the option explored to handle this more in arch specific
> > > code?
> > > 
> > > IIUC, one way to get it working would be
> > > 
> > > 1) Have a SW "shadowstack" PTE flag.
> > > 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when
> > > "write=0".
> > 
> > I don't think that idea came up. So vma->vm_page_prot would have
> > the SW
> > shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
> > Write=0,Dirty=1 part. It seems like it should work.
> > 
> 
> Right, if we include it in vma->vm_page_prot, we'd immediately let 
> mk_pte() just handle that.
> 
> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma
> instead 
> of the vma->vm_page_prot. Let's see if we can avoid that for now.
> 
> > > 
> > > pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
> > > based
> > > on the "shadowstack" PTE flag and hide all these details from
> > > core-
> > > mm.
> > > 
> > > When mapping a shadowstack page (new page, migration, swapin,
> > > ...),
> > > which can be obtained by looking at the VMA flags, the first
> > > thing
> > > you'd
> > > do is set the "shadowstack" PTE flag.
> > 
> > I guess the downside is that it uses an extra software bit. But the
> > other positive is that it's less error prone, so that someone
> > writing
> > core-mm code won't introduce a change that makes shadow stack VMAs
> > Write=1 if they don't know to also check for VM_SHADOW_STACK.
> 
> Right. And I think this mimics the what I would have expected HW to 
> provide: a dedicated HW bit, not somehow mangling this into semantics
> of 
> existing bits.

Yea.

> 
> Roughly speaking: if we abstract it that way and get all of the "how
> to 
> set it writable now?" out of core-MM, it not only is cleaner and
> less 
> error prone, it might even allow other architectures that implement 
> something comparable (e.g., using a dedicated HW bit) to actually
> reuse 
> some of that work. Otherwise most of that "shstk" is really just x86 
> specific ...
> 
> I guess the only cases we have to special case would be page pinning 
> code where pte_write() would indicate that the PTE is writable (well,
> it 
> is, just not by "ordinary CPU instruction" context directly): but you
> do 
> that already, so ... :)
> 
> Sorry for stumbling over that this late, I only started looking into 
> this when you CCed me on that one patch.

Sorry for not calling more attention to it earlier. Appreciate your
comments.

Previously versions of this series had changed some of these
pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma.
This way an x86 implementation could use the VM_SHADOW_STACK vma flag
to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback
was that in some of these code paths "maybe" isn't really an option, it
*needs* to make it writable. Even though the logic was the same, the
name of the function made it look wrong.

But another option could be to change pte_mkwrite() to take a vma. This
would save using another software bit on x86, but instead requires a
small change to each arch's pte_mkwrite().

x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but
maybe it could additionally warn if the vma is not writable. It also
seems more aligned with your changes to stop taking hints from PTE bits
and just look at the VMA? (I'm thinking about the dropping of the dirty
check in GUP and dropping pte_saved_write())

David Hildenbrand Jan. 25, 2023, 9:27 a.m. UTC | #5

On 24.01.23 19:14, Edgecombe, Rick P wrote:
> On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote:
>> On 23.01.23 21:47, Edgecombe, Rick P wrote:
>>> On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
>>>> On 19.01.23 22:22, Rick Edgecombe wrote:
>>>>> The x86 Control-flow Enforcement Technology (CET) feature
>>>>> includes
>>>>> a new
>>>>> type of memory called shadow stack. This shadow stack memory
>>>>> has
>>>>> some
>>>>> unusual properties, which requires some core mm changes to
>>>>> function
>>>>> properly.
>>>>>
>>>>> Since shadow stack memory can be changed from userspace, is
>>>>> both
>>>>> VM_SHADOW_STACK and VM_WRITE. But it should not be made
>>>>> conventionally
>>>>> writable (i.e. pte_mkwrite()). So some code that calls
>>>>> pte_mkwrite() needs
>>>>> to be adjusted.
>>>>>
>>>>> One such case is when memory is made writable without an actual
>>>>> write
>>>>> fault. This happens in some mprotect operations, and also
>>>>> prot_numa
>>>>> faults.
>>>>> In both cases code checks whether it should be made
>>>>> (conventionally)
>>>>> writable by calling vma_wants_manual_pte_write_upgrade().
>>>>>
>>>>> One way to fix this would be have code actually check if memory
>>>>> is
>>>>> also
>>>>> VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
>>>>> since
>>>>> most memory won't be shadow stack, just have simpler logic and
>>>>> skip
>>>>> this
>>>>> optimization by changing vma_wants_manual_pte_write_upgrade()
>>>>> to
>>>>> not
>>>>> return true for VM_SHADOW_STACK_MEMORY. This will simply handle
>>>>> all
>>>>> cases of this type.
>>>>>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>>>>> Tested-by: John Allen <john.allen@amd.com>
>>>>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>>>> Reviewed-by: Kirill A. Shutemov <
>>>>> kirill.shutemov@linux.intel.com>
>>>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>>> ---
>>>>
>>>> Instead of having these x86-shadow stack details all over the MM
>>>> space,
>>>> was the option explored to handle this more in arch specific
>>>> code?
>>>>
>>>> IIUC, one way to get it working would be
>>>>
>>>> 1) Have a SW "shadowstack" PTE flag.
>>>> 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when
>>>> "write=0".
>>>
>>> I don't think that idea came up. So vma->vm_page_prot would have
>>> the SW
>>> shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
>>> Write=0,Dirty=1 part. It seems like it should work.
>>>
>>
>> Right, if we include it in vma->vm_page_prot, we'd immediately let
>> mk_pte() just handle that.
>>
>> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma
>> instead
>> of the vma->vm_page_prot. Let's see if we can avoid that for now.
>>
>>>>
>>>> pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
>>>> based
>>>> on the "shadowstack" PTE flag and hide all these details from
>>>> core-
>>>> mm.
>>>>
>>>> When mapping a shadowstack page (new page, migration, swapin,
>>>> ...),
>>>> which can be obtained by looking at the VMA flags, the first
>>>> thing
>>>> you'd
>>>> do is set the "shadowstack" PTE flag.
>>>
>>> I guess the downside is that it uses an extra software bit. But the
>>> other positive is that it's less error prone, so that someone
>>> writing
>>> core-mm code won't introduce a change that makes shadow stack VMAs
>>> Write=1 if they don't know to also check for VM_SHADOW_STACK.
>>
>> Right. And I think this mimics the what I would have expected HW to
>> provide: a dedicated HW bit, not somehow mangling this into semantics
>> of
>> existing bits.
> 
> Yea.
> 
>>
>> Roughly speaking: if we abstract it that way and get all of the "how
>> to
>> set it writable now?" out of core-MM, it not only is cleaner and
>> less
>> error prone, it might even allow other architectures that implement
>> something comparable (e.g., using a dedicated HW bit) to actually
>> reuse
>> some of that work. Otherwise most of that "shstk" is really just x86
>> specific ...
>>
>> I guess the only cases we have to special case would be page pinning
>> code where pte_write() would indicate that the PTE is writable (well,
>> it
>> is, just not by "ordinary CPU instruction" context directly): but you
>> do
>> that already, so ... :)
>>
>> Sorry for stumbling over that this late, I only started looking into
>> this when you CCed me on that one patch.
> 
> Sorry for not calling more attention to it earlier. Appreciate your
> comments.
> 
> Previously versions of this series had changed some of these
> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma.
> This way an x86 implementation could use the VM_SHADOW_STACK vma flag
> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback
> was that in some of these code paths "maybe" isn't really an option, it
> *needs* to make it writable. Even though the logic was the same, the
> name of the function made it look wrong.
> 
> But another option could be to change pte_mkwrite() to take a vma. This
> would save using another software bit on x86, but instead requires a
> small change to each arch's pte_mkwrite().

I played with that idea shortly as well, but discarded it. I was not 
able to convince myself that it wouldn't be required to pass in the VMA 
as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... 
which would end up fairly ugly (or even impossible in thing slike GUP-fast).

For example, I wonder how we'd be handling stuff like do_numa_page() 
cleanly correctly, where we use pte_modify() + pte_mkwrite(), and either 
call might set the PTE writable and maintain dirty bit ...

Having that said, maybe it could work with only a single saved-dirty bit 
and passing in the VMA for pte_mkwrite() only.

pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty bit 
to the soft-dirty bit instead, resulting in 
"writable=0,dirty=0,saved-dirty=1",

pte_dirty() would return dirty==1||saved-dirty==1.

pte_mkdirty() would set either set dirty=1 or saved-dirty=1, depending 
on the writable bit.

pte_mkclean() would clean both bits.

pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"

pte_mkwrite() would act according to the VMA, and in addition, merge the 
saved-dirty bit into the dirty bit.

pte_modify() and mk_pte() .... would require more thought ...


Further, ptep_modify_prot_commit() might have to be adjusted to properly 
flush in all relevant cases IIRC.

> 
> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but
> maybe it could additionally warn if the vma is not writable. It also
> seems more aligned with your changes to stop taking hints from PTE bits
> and just look at the VMA? (I'm thinking about the dropping of the dirty
> check in GUP and dropping pte_saved_write())

The soft-shstk bit wouldn't be a hint, it would be logically changing 
the "type" of the PTE such that any other PTE functions can do the right 
thing without having to consume the VMA.

Edgecombe, Rick P Jan. 25, 2023, 6:43 p.m. UTC | #6

On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > Roughly speaking: if we abstract it that way and get all of the
> > > "how
> > > to
> > > set it writable now?" out of core-MM, it not only is cleaner and
> > > less
> > > error prone, it might even allow other architectures that
> > > implement
> > > something comparable (e.g., using a dedicated HW bit) to actually
> > > reuse
> > > some of that work. Otherwise most of that "shstk" is really just
> > > x86
> > > specific ...
> > > 
> > > I guess the only cases we have to special case would be page
> > > pinning
> > > code where pte_write() would indicate that the PTE is writable
> > > (well,
> > > it
> > > is, just not by "ordinary CPU instruction" context directly): but
> > > you
> > > do
> > > that already, so ... :)
> > > 
> > > Sorry for stumbling over that this late, I only started looking
> > > into
> > > this when you CCed me on that one patch.
> > 
> > Sorry for not calling more attention to it earlier. Appreciate your
> > comments.
> > 
> > Previously versions of this series had changed some of these
> > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > vma.
> > This way an x86 implementation could use the VM_SHADOW_STACK vma
> > flag
> > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > feedback
> > was that in some of these code paths "maybe" isn't really an
> > option, it
> > *needs* to make it writable. Even though the logic was the same,
> > the
> > name of the function made it look wrong.
> > 
> > But another option could be to change pte_mkwrite() to take a vma.
> > This
> > would save using another software bit on x86, but instead requires
> > a
> > small change to each arch's pte_mkwrite().
> 
> I played with that idea shortly as well, but discarded it. I was not 
> able to convince myself that it wouldn't be required to pass in the
> VMA 
> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... 
> which would end up fairly ugly (or even impossible in thing slike
> GUP-fast).
> 
> For example, I wonder how we'd be handling stuff like do_numa_page() 
> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> either 
> call might set the PTE writable and maintain dirty bit ...

pte_modify() is handled like this currently:

https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/

There has been a couple iterations on that. The current solution is to
do the Dirty->SavedDirty fixup if needed after the new prots are added.

Of course pte_modify() can't know whether you are are attempting to
create a shadow stack PTE with the prot you are passing in. But the
callers today explicitly call pte_mkwrite() after filling in the other
bits with pte_modify(). Today this patch causes the pte_mkwrite() to be
skipped and another fault may be required in the mprotect() and numa
cases, but if we change pte_mkwrite() to take a VMA we can just make it
shadow stack to start.

It might be worth mentioning, there was a suggestion in the past to try
to have the shadow stack bits come out of vm_get_page_prot(), but MM
code would then try to map the zero page as (shadow stack) writable
when there was a normal (non-shadow stack) read access. So I had to
abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
to make it shadow stack.

> 
> Having that said, maybe it could work with only a single saved-dirty
> bit 
> and passing in the VMA for pte_mkwrite() only.
> 
> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
> bit 
> to the soft-dirty bit instead, resulting in 
> "writable=0,dirty=0,saved-dirty=1",
> 
> pte_dirty() would return dirty==1||saved-dirty==1.
> 
> pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
> depending 
> on the writable bit.
> 
> pte_mkclean() would clean both bits.
> 
> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
> 
> pte_mkwrite() would act according to the VMA, and in addition, merge
> the 
> saved-dirty bit into the dirty bit.
> 
> pte_modify() and mk_pte() .... would require more thought ...

Not sure I'm following what the mk_pte() problem would be. You mean if
Write=0,Dirty=1 is manually added to the prot?

Shouldn't people generally use the pte_mkwrite() helpers unless they
are drawing from a prot that was already created with the helpers or
vm_get_page_prot()? I think they can't manually create prot's from bits
in core mm code, right? And x86 arch code already has to be aware of
shadow stack. It's a bit of an assumption I guess, but I think maybe
not too crazy of one?

> 
> 
> Further, ptep_modify_prot_commit() might have to be adjusted to
> properly 
> flush in all relevant cases IIRC.

Sorry, I'm not following. Can you elaborate? There is an adjustment
made in pte_flags_need_flush().

> 
> > 
> > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
> > but
> > maybe it could additionally warn if the vma is not writable. It
> > also
> > seems more aligned with your changes to stop taking hints from PTE
> > bits
> > and just look at the VMA? (I'm thinking about the dropping of the
> > dirty
> > check in GUP and dropping pte_saved_write())
> 
> The soft-shstk bit wouldn't be a hint, it would be logically
> changing 
> the "type" of the PTE such that any other PTE functions can do the
> right 
> thing without having to consume the VMA.

Yea, true.

Thanks for your comments and ideas here, I'll give the:
pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
...solution a try.

Edgecombe, Rick P Jan. 26, 2023, 12:59 a.m. UTC | #7

On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
> Thanks for your comments and ideas here, I'll give the:
> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> ...solution a try.

Well, it turns out there are some pte_mkwrite() callers in other arch's
that operate on kernel memory and don't have a VMA. So it needed a new 
function that can be overridden in arch code. I ended up with x86
versions of these, like this:
pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_WRITE))
		return pte;

	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);

	return pte_mkwrite(pte);
}

pte_t pte_mkwrite_vma(pte_t pte, struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK)
		return pte_mkwrite_shstk(pte);

	return pte_mkwrite(pte);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_WRITE))
		return pmd;

	if (vma->vm_flags & VM_SHADOW_STACK)
		return pmd_mkwrite_shstk(pmd);

	return pmd_mkwrite(pmd);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

All the other pte_mkdirty()s, etc remain the same.

Previously, there was a suggestion to not override the
maybe_mkwrite()'s and put the logic in core MM by having a generic
version of pte_mkwrite_shstk() that does nothing. But given what we are
trying to do with pte_mkwrite_vma() it seemed better to hide all the
shadow stack PTE changes in arch code again.

After the changes, the only shadow stack specific bits in core mm are
the bit in GUP to require FOLL_FORCE, the memory accounting, and these
warnings:

https://lore.kernel.org/lkml/20230119212317.8324-26-rick.p.edgecombe@intel.com/

David Hildenbrand Jan. 26, 2023, 8:46 a.m. UTC | #8

On 26.01.23 01:59, Edgecombe, Rick P wrote:
> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
>> Thanks for your comments and ideas here, I'll give the:
>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>> ...solution a try.
> 
> Well, it turns out there are some pte_mkwrite() callers in other arch's
> that operate on kernel memory and don't have a VMA. So it needed a new

Why not pass in NULL as VMA then and document the semantics? The less 
similarly named but slightly different functions, the better :)

David Hildenbrand Jan. 26, 2023, 8:57 a.m. UTC | #9

On 25.01.23 19:43, Edgecombe, Rick P wrote:
> On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
>>>> Roughly speaking: if we abstract it that way and get all of the
>>>> "how
>>>> to
>>>> set it writable now?" out of core-MM, it not only is cleaner and
>>>> less
>>>> error prone, it might even allow other architectures that
>>>> implement
>>>> something comparable (e.g., using a dedicated HW bit) to actually
>>>> reuse
>>>> some of that work. Otherwise most of that "shstk" is really just
>>>> x86
>>>> specific ...
>>>>
>>>> I guess the only cases we have to special case would be page
>>>> pinning
>>>> code where pte_write() would indicate that the PTE is writable
>>>> (well,
>>>> it
>>>> is, just not by "ordinary CPU instruction" context directly): but
>>>> you
>>>> do
>>>> that already, so ... :)
>>>>
>>>> Sorry for stumbling over that this late, I only started looking
>>>> into
>>>> this when you CCed me on that one patch.
>>>
>>> Sorry for not calling more attention to it earlier. Appreciate your
>>> comments.
>>>
>>> Previously versions of this series had changed some of these
>>> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
>>> vma.
>>> This way an x86 implementation could use the VM_SHADOW_STACK vma
>>> flag
>>> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
>>> feedback
>>> was that in some of these code paths "maybe" isn't really an
>>> option, it
>>> *needs* to make it writable. Even though the logic was the same,
>>> the
>>> name of the function made it look wrong.
>>>
>>> But another option could be to change pte_mkwrite() to take a vma.
>>> This
>>> would save using another software bit on x86, but instead requires
>>> a
>>> small change to each arch's pte_mkwrite().
>>
>> I played with that idea shortly as well, but discarded it. I was not
>> able to convince myself that it wouldn't be required to pass in the
>> VMA
>> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ...
>> which would end up fairly ugly (or even impossible in thing slike
>> GUP-fast).
>>
>> For example, I wonder how we'd be handling stuff like do_numa_page()
>> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
>> either
>> call might set the PTE writable and maintain dirty bit ...
> 
> pte_modify() is handled like this currently:
> 
> https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/
> 
> There has been a couple iterations on that. The current solution is to
> do the Dirty->SavedDirty fixup if needed after the new prots are added.
> 
> Of course pte_modify() can't know whether you are are attempting to
> create a shadow stack PTE with the prot you are passing in. But the
> callers today explicitly call pte_mkwrite() after filling in the other
> bits with pte_modify().

See below on my MAP_PRIVATE vs. MAP_SHARED comment.

> Today this patch causes the pte_mkwrite() to be
> skipped and another fault may be required in the mprotect() and numa
> cases, but if we change pte_mkwrite() to take a VMA we can just make it
> shadow stack to start.
> 
> It might be worth mentioning, there was a suggestion in the past to try
> to have the shadow stack bits come out of vm_get_page_prot(), but MM
> code would then try to map the zero page as (shadow stack) writable
> when there was a normal (non-shadow stack) read access. So I had to
> abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
> to make it shadow stack.

Thanks, do you have a pointer?

> 
>>
>> Having that said, maybe it could work with only a single saved-dirty
>> bit
>> and passing in the VMA for pte_mkwrite() only.
>>
>> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
>> bit
>> to the soft-dirty bit instead, resulting in
>> "writable=0,dirty=0,saved-dirty=1",
>>
>> pte_dirty() would return dirty==1||saved-dirty==1.
>>
>> pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
>> depending
>> on the writable bit.
>>
>> pte_mkclean() would clean both bits.
>>
>> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
>>
>> pte_mkwrite() would act according to the VMA, and in addition, merge
>> the
>> saved-dirty bit into the dirty bit.
>>
>> pte_modify() and mk_pte() .... would require more thought ...
> 
> Not sure I'm following what the mk_pte() problem would be. You mean if
> Write=0,Dirty=1 is manually added to the prot?
> 
> Shouldn't people generally use the pte_mkwrite() helpers unless they
> are drawing from a prot that was already created with the helpers or
> vm_get_page_prot()?

pte_mkwrite() is mostly only used (except for writenotify ...) for 
MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory, 
vma->vm_page_prot in a VM_WRITE mapping already contains the write 
permissions. pte_mkwrite() is not necessary (again, unless writenotify 
is active).

I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you 
didn't stumble over that issue yet? Because I don't see how it could 
work with MAP_SHARED VMAs.


The other thing I had in mind was that we have to make sure that we're 
not accidentally setting "Write=0,Dirty=1" in mk_pte() / pte_modify().

Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect 
using pte_modify(), we have to make sure to move the dirty bit to the 
saved_dirty bit.

> I think they can't manually create prot's from bits
> in core mm code, right? And x86 arch code already has to be aware of
> shadow stack. It's a bit of an assumption I guess, but I think maybe
> not too crazy of one?

I think that's true. Arch code is supposed to deal with that IIRC.

> 
>>
>>
>> Further, ptep_modify_prot_commit() might have to be adjusted to
>> properly
>> flush in all relevant cases IIRC.
> 
> Sorry, I'm not following. Can you elaborate? There is an adjustment
> made in pte_flags_need_flush().

Note that I did not fully review all bits of this patch set, just 
throwing out what was on my mind. If already handled, great.

> 
>>
>>>
>>> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
>>> but
>>> maybe it could additionally warn if the vma is not writable. It
>>> also
>>> seems more aligned with your changes to stop taking hints from PTE
>>> bits
>>> and just look at the VMA? (I'm thinking about the dropping of the
>>> dirty
>>> check in GUP and dropping pte_saved_write())
>>
>> The soft-shstk bit wouldn't be a hint, it would be logically
>> changing
>> the "type" of the PTE such that any other PTE functions can do the
>> right
>> thing without having to consume the VMA.
> 
> Yea, true.
> 
> Thanks for your comments and ideas here, I'll give the:
> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> ...solution a try.

Good!

Edgecombe, Rick P Jan. 26, 2023, 8:16 p.m. UTC | #10

On Thu, 2023-01-26 at 09:57 +0100, David Hildenbrand wrote:
> On 25.01.23 19:43, Edgecombe, Rick P wrote:
> > On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > > > Roughly speaking: if we abstract it that way and get all of
> > > > > the
> > > > > "how
> > > > > to
> > > > > set it writable now?" out of core-MM, it not only is cleaner
> > > > > and
> > > > > less
> > > > > error prone, it might even allow other architectures that
> > > > > implement
> > > > > something comparable (e.g., using a dedicated HW bit) to
> > > > > actually
> > > > > reuse
> > > > > some of that work. Otherwise most of that "shstk" is really
> > > > > just
> > > > > x86
> > > > > specific ...
> > > > > 
> > > > > I guess the only cases we have to special case would be page
> > > > > pinning
> > > > > code where pte_write() would indicate that the PTE is
> > > > > writable
> > > > > (well,
> > > > > it
> > > > > is, just not by "ordinary CPU instruction" context directly):
> > > > > but
> > > > > you
> > > > > do
> > > > > that already, so ... :)
> > > > > 
> > > > > Sorry for stumbling over that this late, I only started
> > > > > looking
> > > > > into
> > > > > this when you CCed me on that one patch.
> > > > 
> > > > Sorry for not calling more attention to it earlier. Appreciate
> > > > your
> > > > comments.
> > > > 
> > > > Previously versions of this series had changed some of these
> > > > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > > > vma.
> > > > This way an x86 implementation could use the VM_SHADOW_STACK
> > > > vma
> > > > flag
> > > > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > > > feedback
> > > > was that in some of these code paths "maybe" isn't really an
> > > > option, it
> > > > *needs* to make it writable. Even though the logic was the
> > > > same,
> > > > the
> > > > name of the function made it look wrong.
> > > > 
> > > > But another option could be to change pte_mkwrite() to take a
> > > > vma.
> > > > This
> > > > would save using another software bit on x86, but instead
> > > > requires
> > > > a
> > > > small change to each arch's pte_mkwrite().
> > > 
> > > I played with that idea shortly as well, but discarded it. I was
> > > not
> > > able to convince myself that it wouldn't be required to pass in
> > > the
> > > VMA
> > > as well for things like pte_dirty(), pte_mkdirty(), pte_write(),
> > > ...
> > > which would end up fairly ugly (or even impossible in thing slike
> > > GUP-fast).
> > > 
> > > For example, I wonder how we'd be handling stuff like
> > > do_numa_page()
> > > cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> > > either
> > > call might set the PTE writable and maintain dirty bit ...
> > 
> > pte_modify() is handled like this currently:
> > 
> > 
https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/
> > 
> > There has been a couple iterations on that. The current solution is
> > to
> > do the Dirty->SavedDirty fixup if needed after the new prots are
> > added.
> > 
> > Of course pte_modify() can't know whether you are are attempting to
> > create a shadow stack PTE with the prot you are passing in. But the
> > callers today explicitly call pte_mkwrite() after filling in the
> > other
> > bits with pte_modify().
> 
> See below on my MAP_PRIVATE vs. MAP_SHARED comment.

Yep, MAP_SHARED support was dropped with the reboot of the series. It
did have some problems IIRC.

Now shadow stack memory creation is tightly controlled. Either created
via special syscall or automatically with a new thread.

> 
> > Today this patch causes the pte_mkwrite() to be
> > skipped and another fault may be required in the mprotect() and
> > numa
> > cases, but if we change pte_mkwrite() to take a VMA we can just
> > make it
> > shadow stack to start.
> > 
> > It might be worth mentioning, there was a suggestion in the past to
> > try
> > to have the shadow stack bits come out of vm_get_page_prot(), but
> > MM
> > code would then try to map the zero page as (shadow stack) writable
> > when there was a normal (non-shadow stack) read access. So I had to
> > abandon that approach and rely on explicit calls to
> > pte_mkwrite/shstk()
> > to make it shadow stack.
> 
> Thanks, do you have a pointer?

I never posted it because it didn't work out. This was the comment that
prompted the exploration in that direction:

https://lore.kernel.org/lkml/8065c333-0911-04a2-f91e-7c2e0cc7ec51@intel.com/

Shadow stack memory also used to not be VM_WRITE (VM_SHADOW_STACK
only), but this was changed for other reasons. In v2 there were some
updates to how shadow stack memory was handled, and the cover letter
had a writeup of the reasons and general design:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

> 
> > 
> > > 
> > > Having that said, maybe it could work with only a single saved-
> > > dirty
> > > bit
> > > and passing in the VMA for pte_mkwrite() only.
> > > 
> > > pte_wrprotect() would detect "writable=0,dirty=1" and move the
> > > dirty
> > > bit
> > > to the soft-dirty bit instead, resulting in
> > > "writable=0,dirty=0,saved-dirty=1",
> > > 
> > > pte_dirty() would return dirty==1||saved-dirty==1.
> > > 
> > > pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
> > > depending
> > > on the writable bit.
> > > 
> > > pte_mkclean() would clean both bits.
> > > 
> > > pte_write() would detect "writable == 1 || (writable==0 &&
> > > dirty==1)"
> > > 
> > > pte_mkwrite() would act according to the VMA, and in addition,
> > > merge
> > > the
> > > saved-dirty bit into the dirty bit.
> > > 
> > > pte_modify() and mk_pte() .... would require more thought ...
> > 
> > Not sure I'm following what the mk_pte() problem would be. You mean
> > if
> > Write=0,Dirty=1 is manually added to the prot?
> > 
> > Shouldn't people generally use the pte_mkwrite() helpers unless
> > they
> > are drawing from a prot that was already created with the helpers
> > or
> > vm_get_page_prot()?
> 
> pte_mkwrite() is mostly only used (except for writenotify ...) for 
> MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory, 
> vma->vm_page_prot in a VM_WRITE mapping already contains the write 
> permissions. pte_mkwrite() is not necessary (again, unless
> writenotify 
> is active).

Oh, interesting.

> 
> I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you 
> didn't stumble over that issue yet? Because I don't see how it could 
> work with MAP_SHARED VMAs.

Yep, it doesn't support MAP_SHARED.

> 
> 
> The other thing I had in mind was that we have to make sure that
> we're 
> not accidentally setting "Write=0,Dirty=1" in mk_pte() /
> pte_modify().
> 
> Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect 
> using pte_modify(), we have to make sure to move the dirty bit to
> the 
> saved_dirty bit.

For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come
from anywhere. I guess the MAP_SHARED case is a little less bounded. We
could maybe add a warning for this case.

For the pte_modify() case, this does happen. There are two scenarios
considered:
1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as
Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't
support it.
2. A Write=1,Dirty=1 PTE gets write protected. This does happen because
the Write=0 prot comes from protection_map, and pte_modify() would
leave the Dirty=1 bit alone. The main case I know of is mprotect(). It
is handled by changes to pte_modify() by doing the Dirty->SoftDirty
fixup if needed.

So pte_modify()s job should not be too tricky. What you can't do with
it though, is create shadow stack PTEs. But it is ok for our uses
because of the explicit mkwrite().

> 
> > I think they can't manually create prot's from bits
> > in core mm code, right? And x86 arch code already has to be aware
> > of
> > shadow stack. It's a bit of an assumption I guess, but I think
> > maybe
> > not too crazy of one?
> 
> I think that's true. Arch code is supposed to deal with that IIRC.
> 
> > 
> > > 
> > > 
> > > Further, ptep_modify_prot_commit() might have to be adjusted to
> > > properly
> > > flush in all relevant cases IIRC.
> > 
> > Sorry, I'm not following. Can you elaborate? There is an adjustment
> > made in pte_flags_need_flush().
> 
> Note that I did not fully review all bits of this patch set, just 
> throwing out what was on my mind. If already handled, great.
> 
> > 
> > > 
> > > > 
> > > > x86's pte_mkwrite() would then be pretty close to
> > > > maybe_mkwrite(),
> > > > but
> > > > maybe it could additionally warn if the vma is not writable. It
> > > > also
> > > > seems more aligned with your changes to stop taking hints from
> > > > PTE
> > > > bits
> > > > and just look at the VMA? (I'm thinking about the dropping of
> > > > the
> > > > dirty
> > > > check in GUP and dropping pte_saved_write())
> > > 
> > > The soft-shstk bit wouldn't be a hint, it would be logically
> > > changing
> > > the "type" of the PTE such that any other PTE functions can do
> > > the
> > > right
> > > thing without having to consume the VMA.
> > 
> > Yea, true.
> > 
> > Thanks for your comments and ideas here, I'll give the:
> > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > ...solution a try.
> 
> Good!
>

Edgecombe, Rick P Jan. 26, 2023, 8:19 p.m. UTC | #11

On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
> On 26.01.23 01:59, Edgecombe, Rick P wrote:
> > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
> > > Thanks for your comments and ideas here, I'll give the:
> > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > > ...solution a try.
> > 
> > Well, it turns out there are some pte_mkwrite() callers in other
> > arch's
> > that operate on kernel memory and don't have a VMA. So it needed a
> > new
> 
> Why not pass in NULL as VMA then and document the semantics? The
> less 
> similarly named but slightly different functions, the better :)

Hmm. The x86 and generic versions should probably have the same
semantics, so then if you pass a NULL, it would do a regular
pte_mkwrite() I guess?

I see another benefit of requiring the vma argument, such that raw
pte_mkwrite()s are less likely to appear in core MM code. But I think
the NULL is awkward because it's not obvious, to me at least, what the
implications of that should be.

So it will be confusing to read in the NULL cases for the other archs.
We also have some warnings to catch miss cases in the PTE tear down
code, so the scenario of new code accidentally marking shadow stack
PTEs as writable is not totally unchecked.

The three functions that do slightly different things are:

pte_mkwrite():
Makes a PTE conventionally writable, only takes a PTE. Very clear that
it is a low level helper and what it does.

maybe_mkwrite():
Might make a PTE writable if the VMA allows it.

pte_mkwrite_vma():
Makes a PTE writable in a specific way depending on the VMA

I wonder if the name pte_mkwrite_vma() is maybe just not clear enough.
It takes a VMA, yes, but what does it do with it?

What if it was called pte_mkwrite_type() instead? Some arch's have
additional types of writable memory and this function creates them. Of
course they also have the normal type of writable memory, and
pte_mkwrite() creates that like usual. Doesn't it seem more readable?

David Hildenbrand Jan. 27, 2023, 4:12 p.m. UTC | #12

On 26.01.23 21:19, Edgecombe, Rick P wrote:
> On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
>> On 26.01.23 01:59, Edgecombe, Rick P wrote:
>>> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
>>>> Thanks for your comments and ideas here, I'll give the:
>>>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>>>> ...solution a try.
>>>
>>> Well, it turns out there are some pte_mkwrite() callers in other
>>> arch's
>>> that operate on kernel memory and don't have a VMA. So it needed a
>>> new
>>
>> Why not pass in NULL as VMA then and document the semantics? The
>> less
>> similarly named but slightly different functions, the better :)
> 
> Hmm. The x86 and generic versions should probably have the same
> semantics, so then if you pass a NULL, it would do a regular
> pte_mkwrite() I guess?
> 
> I see another benefit of requiring the vma argument, such that raw
> pte_mkwrite()s are less likely to appear in core MM code. But I think
> the NULL is awkward because it's not obvious, to me at least, what the
> implications of that should be.
> 
> So it will be confusing to read in the NULL cases for the other archs.
> We also have some warnings to catch miss cases in the PTE tear down
> code, so the scenario of new code accidentally marking shadow stack
> PTEs as writable is not totally unchecked.
> 
> The three functions that do slightly different things are:
> 
> pte_mkwrite():
> Makes a PTE conventionally writable, only takes a PTE. Very clear that
> it is a low level helper and what it does.
> 
> maybe_mkwrite():
> Might make a PTE writable if the VMA allows it.
> 
> pte_mkwrite_vma():
> Makes a PTE writable in a specific way depending on the VMA
> 
> I wonder if the name pte_mkwrite_vma() is maybe just not clear enough.
> It takes a VMA, yes, but what does it do with it?
> 
> What if it was called pte_mkwrite_type() instead? Some arch's have
> additional types of writable memory and this function creates them. Of
> course they also have the normal type of writable memory, and
> pte_mkwrite() creates that like usual. Doesn't it seem more readable?

The issue is, the more variants we provide the easier it is to make 
mistakes and introduce new buggy code.

It's tempting to simply use pte_mkwrite() and call it a day, where 
people actually should use pte_mkwrite_vma().

Then, they at least have to investigate what to do about the second VMA 
parameter.

David Hildenbrand Jan. 27, 2023, 4:19 p.m. UTC | #13

> 
> Now shadow stack memory creation is tightly controlled. Either created
> via special syscall or automatically with a new thread.

Good, it would be valuable to document that somewhere ("Neve rapplies to 
VM_SHARED|VM_MAYSHARE VMAs").

[...]

>>
>> The other thing I had in mind was that we have to make sure that
>> we're
>> not accidentally setting "Write=0,Dirty=1" in mk_pte() /
>> pte_modify().
>>
>> Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect
>> using pte_modify(), we have to make sure to move the dirty bit to
>> the
>> saved_dirty bit.
> 
> For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come
> from anywhere. I guess the MAP_SHARED case is a little less bounded. We
> could maybe add a warning for this case.

Right, Write=0,Dirty=1  shouldn't apply at that point if shstk are 
always wrprotected as default.

> 
> For the pte_modify() case, this does happen. There are two scenarios
> considered:
> 1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as
> Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't
> support it.

It should simply set the saved_dirty bit I guess. But I don't think 
pte_modify() is actually supposed to set PTEs dirty (primary goal is to 
change protection IIRC).

> 2. A Write=1,Dirty=1 PTE gets write protected. This does happen because
> the Write=0 prot comes from protection_map, and pte_modify() would
> leave the Dirty=1 bit alone. The main case I know of is mprotect(). It
> is handled by changes to pte_modify() by doing the Dirty->SoftDirty
> fixup if needed.

Right, we'd have to move the dirty bit to the saved_dirty bit. (we have 
to handle soft-dirty, too, whenever setting the PTE dirty -- either via 
the dirty bit or via the saved_dirty bit)

> 
> So pte_modify()s job should not be too tricky. What you can't do with
> it though, is create shadow stack PTEs. But it is ok for our uses
> because of the explicit mkwrite().

I think you are correct.

Edgecombe, Rick P Jan. 28, 2023, 12:51 a.m. UTC | #14

On Fri, 2023-01-27 at 17:12 +0100, David Hildenbrand wrote:
> On 26.01.23 21:19, Edgecombe, Rick P wrote:
> > On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
> > > On 26.01.23 01:59, Edgecombe, Rick P wrote:
> > > > On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
> > > > > Thanks for your comments and ideas here, I'll give the:
> > > > > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > > > > ...solution a try.
> > > > 
> > > > Well, it turns out there are some pte_mkwrite() callers in
> > > > other
> > > > arch's
> > > > that operate on kernel memory and don't have a VMA. So it
> > > > needed a
> > > > new
> > > 
> > > Why not pass in NULL as VMA then and document the semantics? The
> > > less
> > > similarly named but slightly different functions, the better :)
> > 
> > Hmm. The x86 and generic versions should probably have the same
> > semantics, so then if you pass a NULL, it would do a regular
> > pte_mkwrite() I guess?
> > 
> > I see another benefit of requiring the vma argument, such that raw
> > pte_mkwrite()s are less likely to appear in core MM code. But I
> > think
> > the NULL is awkward because it's not obvious, to me at least, what
> > the
> > implications of that should be.
> > 
> > So it will be confusing to read in the NULL cases for the other
> > archs.
> > We also have some warnings to catch miss cases in the PTE tear down
> > code, so the scenario of new code accidentally marking shadow stack
> > PTEs as writable is not totally unchecked.
> > 
> > The three functions that do slightly different things are:
> > 
> > pte_mkwrite():
> > Makes a PTE conventionally writable, only takes a PTE. Very clear
> > that
> > it is a low level helper and what it does.
> > 
> > maybe_mkwrite():
> > Might make a PTE writable if the VMA allows it.
> > 
> > pte_mkwrite_vma():
> > Makes a PTE writable in a specific way depending on the VMA
> > 
> > I wonder if the name pte_mkwrite_vma() is maybe just not clear
> > enough.
> > It takes a VMA, yes, but what does it do with it?
> > 
> > What if it was called pte_mkwrite_type() instead? Some arch's have
> > additional types of writable memory and this function creates them.
> > Of
> > course they also have the normal type of writable memory, and
> > pte_mkwrite() creates that like usual. Doesn't it seem more
> > readable?
> 
> The issue is, the more variants we provide the easier it is to make 
> mistakes and introduce new buggy code.
> 
> It's tempting to simply use pte_mkwrite() and call it a day, where 
> people actually should use pte_mkwrite_vma().
> 
> Then, they at least have to investigate what to do about the second
> VMA 
> parameter.

Ok, I'll give it a spin. So far it looks ok. The downside is the giant
tree-wide pte_mkwrite() signature change, but once that is over with
there are other advantages. Like getting rid of maybe_mkwrite()'s
awareness of shadow stack so the logic is more centralized. Please let
me know if you don't feel comfortable with a suggested-by credit tag.

Thanks,
Rick

David Hildenbrand Jan. 31, 2023, 8:46 a.m. UTC | #15

On 28.01.23 01:51, Edgecombe, Rick P wrote:
> On Fri, 2023-01-27 at 17:12 +0100, David Hildenbrand wrote:
>> On 26.01.23 21:19, Edgecombe, Rick P wrote:
>>> On Thu, 2023-01-26 at 09:46 +0100, David Hildenbrand wrote:
>>>> On 26.01.23 01:59, Edgecombe, Rick P wrote:
>>>>> On Wed, 2023-01-25 at 10:43 -0800, Rick Edgecombe wrote:
>>>>>> Thanks for your comments and ideas here, I'll give the:
>>>>>> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>>>>>> ...solution a try.
>>>>>
>>>>> Well, it turns out there are some pte_mkwrite() callers in
>>>>> other
>>>>> arch's
>>>>> that operate on kernel memory and don't have a VMA. So it
>>>>> needed a
>>>>> new
>>>>
>>>> Why not pass in NULL as VMA then and document the semantics? The
>>>> less
>>>> similarly named but slightly different functions, the better :)
>>>
>>> Hmm. The x86 and generic versions should probably have the same
>>> semantics, so then if you pass a NULL, it would do a regular
>>> pte_mkwrite() I guess?
>>>
>>> I see another benefit of requiring the vma argument, such that raw
>>> pte_mkwrite()s are less likely to appear in core MM code. But I
>>> think
>>> the NULL is awkward because it's not obvious, to me at least, what
>>> the
>>> implications of that should be.
>>>
>>> So it will be confusing to read in the NULL cases for the other
>>> archs.
>>> We also have some warnings to catch miss cases in the PTE tear down
>>> code, so the scenario of new code accidentally marking shadow stack
>>> PTEs as writable is not totally unchecked.
>>>
>>> The three functions that do slightly different things are:
>>>
>>> pte_mkwrite():
>>> Makes a PTE conventionally writable, only takes a PTE. Very clear
>>> that
>>> it is a low level helper and what it does.
>>>
>>> maybe_mkwrite():
>>> Might make a PTE writable if the VMA allows it.
>>>
>>> pte_mkwrite_vma():
>>> Makes a PTE writable in a specific way depending on the VMA
>>>
>>> I wonder if the name pte_mkwrite_vma() is maybe just not clear
>>> enough.
>>> It takes a VMA, yes, but what does it do with it?
>>>
>>> What if it was called pte_mkwrite_type() instead? Some arch's have
>>> additional types of writable memory and this function creates them.
>>> Of
>>> course they also have the normal type of writable memory, and
>>> pte_mkwrite() creates that like usual. Doesn't it seem more
>>> readable?
>>
>> The issue is, the more variants we provide the easier it is to make
>> mistakes and introduce new buggy code.
>>
>> It's tempting to simply use pte_mkwrite() and call it a day, where
>> people actually should use pte_mkwrite_vma().
>>
>> Then, they at least have to investigate what to do about the second
>> VMA
>> parameter.
> 
> Ok, I'll give it a spin. So far it looks ok. The downside is the giant
> tree-wide pte_mkwrite() signature change, but once that is over with
> there are other advantages. Like getting rid of maybe_mkwrite()'s
> awareness of shadow stack so the logic is more centralized. Please let
> me know if you don't feel comfortable with a suggested-by credit tag.

Sure ...

but I reconsidered :)

Maybe there is a cleaner way to do it and avoid the "NULL" argument.

What about having (while you're going over everything already):

pte_mkwrite(pte, vma)
pte_mkwrite_kernel(pte)

The latter would only be used in that arch code where we're working on 
kernel pgtables. We already have pte_offset_kernel() and 
pte_alloc_kernel_track(), so it's not too weird.

Edgecombe, Rick P Jan. 31, 2023, 11:33 p.m. UTC | #16

On Tue, 2023-01-31 at 09:46 +0100, David Hildenbrand wrote:
> Sure ...
> 
> but I reconsidered :)
> 
> Maybe there is a cleaner way to do it and avoid the "NULL" argument.
> 
> What about having (while you're going over everything already):
> 
> pte_mkwrite(pte, vma)
> pte_mkwrite_kernel(pte)
> 
> The latter would only be used in that arch code where we're working
> on 
> kernel pgtables. We already have pte_offset_kernel() and 
> pte_alloc_kernel_track(), so it's not too weird.

Hmm, one downside is the "mk" part might lead people to guess
pte_mkwrite_kernel() would make it writable AND a kernel page (like
U/S=0 on x86). Instead of being a mkwrite() that's useful for setting
on kernel PTEs.

The other problem is that one of NULL passers is not for kernel memory.
huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't be
created with MAP_HUGETLB, so it is not needed. Using
pte_mkwrite_kernel() would look weird in this case, but making
huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
huge_pte_mkwrite() take a VMA is the better of those two options. Or
keep the NULL semantics...  Any thoughts?

David Hildenbrand Feb. 1, 2023, 9:03 a.m. UTC | #17

On 01.02.23 00:33, Edgecombe, Rick P wrote:
> On Tue, 2023-01-31 at 09:46 +0100, David Hildenbrand wrote:
>> Sure ...
>>
>> but I reconsidered :)
>>
>> Maybe there is a cleaner way to do it and avoid the "NULL" argument.
>>
>> What about having (while you're going over everything already):
>>
>> pte_mkwrite(pte, vma)
>> pte_mkwrite_kernel(pte)
>>
>> The latter would only be used in that arch code where we're working
>> on
>> kernel pgtables. We already have pte_offset_kernel() and
>> pte_alloc_kernel_track(), so it's not too weird.
> 
> Hmm, one downside is the "mk" part might lead people to guess
> pte_mkwrite_kernel() would make it writable AND a kernel page (like
> U/S=0 on x86). Instead of being a mkwrite() that's useful for setting
> on kernel PTEs.

At least I wouldn't worry about that too much. We handle nowhere in 
common code user vs. supervisor access that way explicitly (e.g., 
mkkernel), and it wouldn't even apply on architectures where we cannot 
make such a decision on a per-PTE basis.

> 
> The other problem is that one of NULL passers is not for kernel memory.
> huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't be
> created with MAP_HUGETLB, so it is not needed. Using
> pte_mkwrite_kernel() would look weird in this case, but making
> huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
> huge_pte_mkwrite() take a VMA is the better of those two options. Or
> keep the NULL semantics...  Any thoughts?

Well, the reason would be consistency. From a core-mm point of view it 
makes sense to handle this all consistency, even if the single user 
(x86) wouldn't strictly require it right now.

I'd just pass in the VMA and call it a day :)

Edgecombe, Rick P Feb. 1, 2023, 5:32 p.m. UTC | #18

On Wed, 2023-02-01 at 10:03 +0100, David Hildenbrand wrote:
> > 
> > The other problem is that one of NULL passers is not for kernel
> > memory.
> > huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't
> > be
> > created with MAP_HUGETLB, so it is not needed. Using
> > pte_mkwrite_kernel() would look weird in this case, but making
> > huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
> > huge_pte_mkwrite() take a VMA is the better of those two options.
> > Or
> > keep the NULL semantics...  Any thoughts?
> 
> Well, the reason would be consistency. From a core-mm point of view
> it 
> makes sense to handle this all consistency, even if the single user 
> (x86) wouldn't strictly require it right now.
> 
> I'd just pass in the VMA and call it a day :)

Ok, I'll give it a spin.

David Hildenbrand Feb. 1, 2023, 6:03 p.m. UTC | #19

On 01.02.23 18:32, Edgecombe, Rick P wrote:
> On Wed, 2023-02-01 at 10:03 +0100, David Hildenbrand wrote:
>>>
>>> The other problem is that one of NULL passers is not for kernel
>>> memory.
>>> huge_pte_mkwrite() calls pte_mkwrite(). Shadow stack memory can't
>>> be
>>> created with MAP_HUGETLB, so it is not needed. Using
>>> pte_mkwrite_kernel() would look weird in this case, but making
>>> huge_pte_mkwrite() take a VMA would be for no reason. Maybe making
>>> huge_pte_mkwrite() take a VMA is the better of those two options.
>>> Or
>>> keep the NULL semantics...  Any thoughts?
>>
>> Well, the reason would be consistency. From a core-mm point of view
>> it
>> makes sense to handle this all consistency, even if the single user
>> (x86) wouldn't strictly require it right now.
>>
>> I'd just pass in the VMA and call it a day :)
> 
> Ok, I'll give it a spin.

It would be good to get more opinions on that, but I'm afraid we won't 
get more deep down in this thread :)

diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e15d2fc04007..139a682d243b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2181,7 +2181,7 @@  static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 	 */
 	if (vma->vm_flags & VM_SHARED)
 		return vma_wants_writenotify(vma, vma->vm_page_prot);
-	return !!(vma->vm_flags & VM_WRITE);
+	return (vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_SHADOW_STACK);
 
 }
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,