[v17,19/35] arch/mm: Export direct {un,}map functions

Message ID 20240222-gunyah-v17-19-1e9da6763d38@quicinc.com
State New
Headers
Series Drivers for Gunyah hypervisor |

Commit Message

Elliot Berman Feb. 22, 2024, 11:16 p.m. UTC
  Firmware and hypervisor drivers can donate system heap memory to their
respective firmware/hypervisor entities. Those drivers should unmap the
pages from the kernel's logical map before doing so.

Export can_set_direct_map, set_direct_map_invalid_noflush, and
set_direct_map_default_noflush.

Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 arch/arm64/mm/pageattr.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Christoph Hellwig Feb. 23, 2024, 7:09 a.m. UTC | #1
On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> Firmware and hypervisor drivers can donate system heap memory to their
> respective firmware/hypervisor entities. Those drivers should unmap the
> pages from the kernel's logical map before doing so.
> 
> Export can_set_direct_map, set_direct_map_invalid_noflush, and
> set_direct_map_default_noflush.

Err, not they should not.  And not using such super low-level interfaces
from modular code.
  
Elliot Berman Feb. 24, 2024, 12:37 a.m. UTC | #2
On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > Firmware and hypervisor drivers can donate system heap memory to their
> > respective firmware/hypervisor entities. Those drivers should unmap the
> > pages from the kernel's logical map before doing so.
> > 
> > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > set_direct_map_default_noflush.
> 
> Err, not they should not.  And not using such super low-level interfaces
> from modular code.

Hi Cristoph,
 
We've observed a few times that Linux can unintentionally access a page
we've unmapped from host's stage 2 page table via an unaligned load from
an adjacent page. The stage 2 is managed by Gunyah. There are few
scenarios where even though we allocate and own a page from buddy,
someone else could try to access the page without going through the
hypervisor driver. One such instance we know about is
load_unaligned_zeropad() via pathlookup_at() [1].
 
load_unaligned_zeropad() could be called near the end of a page. If the
next page isn't mapped by the kernel in the stage one page tables, then
the access from to the unmapped page from load_unaligned_zeropad() will
land in __do_kernel_fault(), call fixup_exception(), and fill the
remainder of the load with zeroes. If the page in question is mapped in
stage 1 but was unmapped from stage 2, then the access lands back in
Linux in do_sea(), leading to a panic().
 
Our preference would be to add fixup_exception() to S2 PTW errors for
two reasons:
1. It's cheaper to do performance wise: we've already manipulated S2
   page table and prevent intentional access to the page because
   pKVM/Gunyah drivers know that access to the page has been lost.
2. Page-granular S1 mappings only happen on arm64 with rodata=full.
 
In an off-list discussion with the Android pkvm folks, their preference
was to have the pages unmapped from stage 1. I've gone with that
approach to get started but welcome discussion on the best approach.
 
The Android (downstream) implementation of arm64 pkvm is currently
implementing a hack where s2 ptw faults are given back to the host as s1
ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
allowing the kernel to fixup the exception.
 
arm64 pKVM will also face this issue when implementing guest_memfd or
when donating more memory to the hyp for s2 page tables, etc. As far as
I can tell, this isn't an issue for arm64 pKVM today because memory
isn't being dynamically donated to the hypervisor.
 
Thanks,
Elliot

[1]:
path_lookupat+0x340/0x3228
filename_lookup+0xbc/0x1c0
__arm64_sys_newfstatat+0xb0/0x4a0
invoke_syscall+0x58/0x118
  
Christoph Hellwig Feb. 26, 2024, 11:06 a.m. UTC | #3
The point is that we can't we just allow modules to unmap data from
the kernel mapping, no matter how noble your intentions are.
  
David Hildenbrand Feb. 26, 2024, 11:53 a.m. UTC | #4
On 26.02.24 12:06, Christoph Hellwig wrote:
> The point is that we can't we just allow modules to unmap data from
> the kernel mapping, no matter how noble your intentions are.

I absolutely agree.
  
Elliot Berman Feb. 26, 2024, 5:27 p.m. UTC | #5
On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> On 26.02.24 12:06, Christoph Hellwig wrote:
> > The point is that we can't we just allow modules to unmap data from
> > the kernel mapping, no matter how noble your intentions are.
> 
> I absolutely agree.
> 

Hi David and Chirstoph,

Are your preferences that we should make Gunyah builtin only or should add
fixing up S2 PTW errors (or something else)?

Also, do you extend that preference to modifying S2 mappings? This would
require any hypervisor driver that supports confidential compute
usecases to only ever be builtin.

Is your concern about unmapping data from kernel mapping, then module
being unloaded, and then having no way to recover the mapping? Would a
permanent module be better? The primary reason we were wanting to have
it as module was to avoid having driver in memory if you're not a Gunyah
guest.

Thanks,
Elliot
  
David Hildenbrand Feb. 27, 2024, 9:49 a.m. UTC | #6
On 26.02.24 18:27, Elliot Berman wrote:
> On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
>> On 26.02.24 12:06, Christoph Hellwig wrote:
>>> The point is that we can't we just allow modules to unmap data from
>>> the kernel mapping, no matter how noble your intentions are.
>>
>> I absolutely agree.
>>
> 
> Hi David and Chirstoph,
> 
> Are your preferences that we should make Gunyah builtin only or should add
> fixing up S2 PTW errors (or something else)?

Having that built into the kernel certainly does sound better than 
exposing that functionality to arbitrary OOT modules. But still, this 
feels like it is using a "too-low-level" interface.

> 
> Also, do you extend that preference to modifying S2 mappings? This would
> require any hypervisor driver that supports confidential compute
> usecases to only ever be builtin.
> 
> Is your concern about unmapping data from kernel mapping, then module
> being unloaded, and then having no way to recover the mapping? Would a
> permanent module be better? The primary reason we were wanting to have
> it as module was to avoid having driver in memory if you're not a Gunyah
> guest.

What I didn't grasp from this patch description: is the area where a 
driver would unmap/remap that memory somehow known ahead of time and 
limited?

How would the driver obtain that memory it would try to unmap/remap the 
direct map of? Simply allocate some pages and then unmap the direct map?

For example, we do have mm/secretmem.c, where we unmap the directmap on 
allocation and remap when freeing a page. A nice abstraction on 
alloc/free, so one cannot really do a lot of harm.

Further, we enlightened the remainder of the system about secretmem, 
such that we can detect that the directmap is no longer there. As one 
example, see the secretmem_active() check in kernel/power/hibernate.c.

A similar abstraction would make sense (I remember a discussion about 
having secretmem functionality in guest_memfd, would that help?), but 
the question is "which" memory you want to unmap the direct map of, and 
how the driver became "owner" of that memory such that it would really 
be allowed to mess with the directmap.
  
Elliot Berman March 1, 2024, 1:35 a.m. UTC | #7
On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote:
> On 26.02.24 18:27, Elliot Berman wrote:
> > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> > > On 26.02.24 12:06, Christoph Hellwig wrote:
> > > > The point is that we can't we just allow modules to unmap data from
> > > > the kernel mapping, no matter how noble your intentions are.
> > > 
> > > I absolutely agree.
> > > 
> > 
> > Hi David and Chirstoph,
> > 
> > Are your preferences that we should make Gunyah builtin only or should add
> > fixing up S2 PTW errors (or something else)?
> 
> Having that built into the kernel certainly does sound better than exposing
> that functionality to arbitrary OOT modules. But still, this feels like it
> is using a "too-low-level" interface.
> 

What are your thoughts about fixing up the stage-2 fault instead? I
think this gives mmu-based isolation a slight speed boost because we
avoid modifying kernel mapping. The hypervisor driver (KVM or Gunyah)
knows that the page isn't mapped. Whether we get S2 or S1 fault, the
kernel is likely going to crash, except in the rare case where we want
to fix the exception. In that case, we can modify the S2 fault handler
to call fixup_exception() when appropriate.

> > 
> > Also, do you extend that preference to modifying S2 mappings? This would
> > require any hypervisor driver that supports confidential compute
> > usecases to only ever be builtin.
> > 
> > Is your concern about unmapping data from kernel mapping, then module
> > being unloaded, and then having no way to recover the mapping? Would a
> > permanent module be better? The primary reason we were wanting to have
> > it as module was to avoid having driver in memory if you're not a Gunyah
> > guest.
> 
> What I didn't grasp from this patch description: is the area where a driver
> would unmap/remap that memory somehow known ahead of time and limited?
> 
> How would the driver obtain that memory it would try to unmap/remap the
> direct map of? Simply allocate some pages and then unmap the direct map?

That's correct.

> 
> For example, we do have mm/secretmem.c, where we unmap the directmap on
> allocation and remap when freeing a page. A nice abstraction on alloc/free,
> so one cannot really do a lot of harm.
> 
> Further, we enlightened the remainder of the system about secretmem, such
> that we can detect that the directmap is no longer there. As one example,
> see the secretmem_active() check in kernel/power/hibernate.c.
> 

I'll take a look at this. guest_memfd might be able to use PM notifiers here
instead, but I'll dig in the archives to see why secretmem isn't using that.

> A similar abstraction would make sense (I remember a discussion about having
> secretmem functionality in guest_memfd, would that help?), but the question
> is "which" memory you want to unmap the direct map of, and how the driver
> became "owner" of that memory such that it would really be allowed to mess
> with the directmap.
  
Quentin Perret March 4, 2024, 1:10 p.m. UTC | #8
On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote:
> On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > > Firmware and hypervisor drivers can donate system heap memory to their
> > > respective firmware/hypervisor entities. Those drivers should unmap the
> > > pages from the kernel's logical map before doing so.
> > > 
> > > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > > set_direct_map_default_noflush.
> > 
> > Err, not they should not.  And not using such super low-level interfaces
> > from modular code.
> 
> Hi Cristoph,
>  
> We've observed a few times that Linux can unintentionally access a page
> we've unmapped from host's stage 2 page table via an unaligned load from
> an adjacent page. The stage 2 is managed by Gunyah. There are few
> scenarios where even though we allocate and own a page from buddy,
> someone else could try to access the page without going through the
> hypervisor driver. One such instance we know about is
> load_unaligned_zeropad() via pathlookup_at() [1].
>  
> load_unaligned_zeropad() could be called near the end of a page. If the
> next page isn't mapped by the kernel in the stage one page tables, then
> the access from to the unmapped page from load_unaligned_zeropad() will
> land in __do_kernel_fault(), call fixup_exception(), and fill the
> remainder of the load with zeroes. If the page in question is mapped in
> stage 1 but was unmapped from stage 2, then the access lands back in
> Linux in do_sea(), leading to a panic().
>  
> Our preference would be to add fixup_exception() to S2 PTW errors for
> two reasons:
> 1. It's cheaper to do performance wise: we've already manipulated S2
>    page table and prevent intentional access to the page because
>    pKVM/Gunyah drivers know that access to the page has been lost.
> 2. Page-granular S1 mappings only happen on arm64 with rodata=full.
>  
> In an off-list discussion with the Android pkvm folks, their preference
> was to have the pages unmapped from stage 1. I've gone with that
> approach to get started but welcome discussion on the best approach.
>  
> The Android (downstream) implementation of arm64 pkvm is currently
> implementing a hack where s2 ptw faults are given back to the host as s1
> ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
> allowing the kernel to fixup the exception.
>  
> arm64 pKVM will also face this issue when implementing guest_memfd or
> when donating more memory to the hyp for s2 page tables, etc. As far as
> I can tell, this isn't an issue for arm64 pKVM today because memory
> isn't being dynamically donated to the hypervisor.

FWIW pKVM already donates memory dynamically to the hypervisor, to store
e.g. guest VM metadata and page-tables, and we've never seen that
problem as far as I can recall.

A key difference is that pKVM injects a data abort back into the kernel
in case of a stage-2 fault, so the whole EXTABLE trick/hack in
load_unaligned_zeropad() should work fine out of the box.

As discussed offline, Gunyah injecting an SEA into the kernel is
questionable, but I understand that the architecture is a bit lacking in
this department, and that's probably the next best thing.

Could the Gunyah driver allocate from a CMA region instead? That would
surely simplify unmapping from EL1 stage-1 (similar to how drivers
usually donate memory to TZ).

Thanks,
Quentin
  

Patch

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 924843f1f661b..a9bd84588c98a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -32,6 +32,7 @@  bool can_set_direct_map(void)
 	return rodata_full || debug_pagealloc_enabled() ||
 	       arm64_kfence_can_set_direct_map();
 }
+EXPORT_SYMBOL_GPL(can_set_direct_map);
 
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
@@ -176,6 +177,7 @@  int set_direct_map_invalid_noflush(struct page *page)
 				   (unsigned long)page_address(page),
 				   PAGE_SIZE, change_page_range, &data);
 }
+EXPORT_SYMBOL_GPL(set_direct_map_invalid_noflush);
 
 int set_direct_map_default_noflush(struct page *page)
 {
@@ -191,6 +193,7 @@  int set_direct_map_default_noflush(struct page *page)
 				   (unsigned long)page_address(page),
 				   PAGE_SIZE, change_page_range, &data);
 }
+EXPORT_SYMBOL_GPL(set_direct_map_default_noflush);
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)