[RFC,00/14] Prevent cross-cache attacks in the SLUB allocator

Message ID 20230915105933.495735-1-matteorizzo@google.com

Message

Matteo Rizzo Sept. 15, 2023, 10:59 a.m. UTC
The goal of this patch series is to deterministically prevent cross-cache
attacks in the SLUB allocator.

Use-after-free bugs are normally exploited by making the memory allocator
reuse the victim object's memory for an object of a different type. This
creates a type confusion, which is a very powerful attack primitive.

There are generally two ways to create such type confusions in the kernel:
one way is to make SLUB reuse the freed object's address for another object
of a different type which lives in the same slab cache. This only works in
slab caches that can contain objects of different types (i.e. the kmalloc
caches) and the attacker is limited to objects that belong to the same size
class as the victim object.

The other way is to use a "cross-cache attack": make SLUB return the page
containing the victim object to the page allocator and then make it use the
same page for a different slab cache or other objects that contain
attacker-controlled data. This gives attackers access to all objects rather
than just the ones in the same size class as the target and lets attackers
target objects allocated from dedicated caches such as struct file.

This series prevents cross-cache attacks by making sure that once a virtual
address is used for a slab cache, it's never reused for anything except
other slabs in that cache.
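
The invariant described above can be modelled in userspace. The following is
only a sketch under stated assumptions, not the series' implementation: the
names slab_va_alloc() and slab_va_free(), the fixed one-page slab size and the
array-based freelists are all hypothetical. It shows a virtual range going
back to the cache that owned it, never to a shared pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u
#define MAX_CACHES      4
#define MAX_FREE        16

/* Hypothetical model: each cache remembers the virtual ranges it has freed. */
struct cache_model {
	uintptr_t free_ranges[MAX_FREE];
	size_t nr_free;
};

/* Arbitrary start of the modelled slab region; only ever grows. */
static uintptr_t region_next = 0x100000;
static struct cache_model caches[MAX_CACHES];

/*
 * Pick a virtual range for a new slab of `cache`: reuse one of this cache's
 * own freed ranges if there is one, otherwise carve out fresh address space.
 */
static uintptr_t slab_va_alloc(int cache)
{
	assert(cache >= 0 && cache < MAX_CACHES);
	struct cache_model *c = &caches[cache];

	if (c->nr_free > 0)
		return c->free_ranges[--c->nr_free];

	uintptr_t va = region_next;
	region_next += MODEL_PAGE_SIZE;
	return va;
}

/*
 * Free a slab. In the real design the physical page could go back to the
 * page allocator; in this model only the virtual range is tracked, and it
 * stays owned by the same cache.
 */
static void slab_va_free(int cache, uintptr_t va)
{
	assert(cache >= 0 && cache < MAX_CACHES);
	struct cache_model *c = &caches[cache];

	assert(c->nr_free < MAX_FREE);
	c->free_ranges[c->nr_free++] = va;
}
```

In this model, an address freed by cache 0 is never handed to cache 1, so a
dangling pointer into a cache 0 slab can only ever alias another cache 0 slab
(or unmapped address space).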


Jann Horn (13):
  mm/slub: add is_slab_addr/is_slab_page helpers
  mm/slub: move kmem_cache_order_objects to slab.h
  mm: use virt_to_slab instead of folio_slab
  mm/slub: create folio_set/clear_slab helpers
  mm/slub: pass additional args to alloc_slab_page
  mm/slub: pass slab pointer to the freeptr decode helper
  security: introduce CONFIG_SLAB_VIRTUAL
  mm/slub: add the slab freelists to kmem_cache
  x86: Create virtual memory region for SLUB
  mm/slub: allocate slabs from virtual memory
  mm/slub: introduce the deallocated_pages sysfs attribute
  mm/slub: sanity-check freepointers
  security: add documentation for SLAB_VIRTUAL

Matteo Rizzo (1):
  mm/slub: don't try to dereference invalid freepointers

 Documentation/arch/x86/x86_64/mm.rst       |   4 +-
 Documentation/security/self-protection.rst | 102 ++++
 arch/x86/include/asm/page_64.h             |  10 +
 arch/x86/include/asm/pgtable_64_types.h    |  21 +
 arch/x86/mm/init_64.c                      |  19 +-
 arch/x86/mm/kaslr.c                        |   9 +
 arch/x86/mm/mm_internal.h                  |   4 +
 arch/x86/mm/physaddr.c                     |  10 +
 include/linux/slab.h                       |   8 +
 include/linux/slub_def.h                   |  25 +-
 init/main.c                                |   1 +
 kernel/resource.c                          |   2 +-
 lib/slub_kunit.c                           |   4 +
 mm/memcontrol.c                            |   2 +-
 mm/slab.h                                  | 145 +++++
 mm/slab_common.c                           |  21 +-
 mm/slub.c                                  | 641 +++++++++++++++++++--
 mm/usercopy.c                              |  12 +-
 security/Kconfig.hardening                 |  16 +
 19 files changed, 977 insertions(+), 79 deletions(-)


base-commit: 46a9ea6681907a3be6b6b0d43776dccc62cad6cf
  

Comments

Dave Hansen Sept. 15, 2023, 3:19 p.m. UTC | #1
On 9/15/23 03:59, Matteo Rizzo wrote:
> The goal of this patch series is to deterministically prevent cross-cache
> attacks in the SLUB allocator.

What's the cost?
  
Lameter, Christopher Sept. 15, 2023, 4:30 p.m. UTC | #2
On Fri, 15 Sep 2023, Dave Hansen wrote:

> On 9/15/23 03:59, Matteo Rizzo wrote:
>> The goal of this patch series is to deterministically prevent cross-cache
>> attacks in the SLUB allocator.
>
> What's the cost?

The only thing that I see is 1-2% on kernel compilations (and "more on 
machines with lots of cores")?

Having a virtualized slab subsystem could enable other things:

- The page order calculation could be simplified since vmalloc can stitch 
arbitrary base pages together to form larger contiguous virtual segments. 
So just use f.e. order 5 or so for all slabs to reduce contention?

- Maybe we could make slab pages movable (if we can ensure that slab 
objects are not touched somehow). At the least, a stop_machine run could 
be used to move batches of slab memory.

- Maybe we can avoid allocating page structs somehow for slab memory? 
Looks like this is taking a step into that direction. The metadata storage 
of the slab allocator could be reworked and optimized better.

Problems:

- Overhead due to more TLB lookups

- Larger amounts of TLBs are used for the OS. Currently we are trying to 
use the maximum mappable TLBs to reduce their numbers. This presumably 
means using 4K TLBs for all slab access.

- Memory may not be physically contiguous which may be required by 
some drivers doing DMA.
  
Matteo Rizzo Sept. 18, 2023, 12:08 p.m. UTC | #3
On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Fri, 15 Sep 2023, Dave Hansen wrote:
>
> > What's the cost?
>
> The only thing that I see is 1-2% on kernel compilations (and "more on
> machines with lots of cores")?

I used kernel compilation time (wall clock time) as a benchmark while
preparing the series. Lower is better.

Intel Skylake, 112 cores:

      LABEL    | COUNT |   MIN   |   MAX   |   MEAN  |  MEDIAN | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
               |       | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%

AMD Milan, 256 cores:

    LABEL      | COUNT |   MIN   |   MAX   |   MEAN  |  MEDIAN | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
               |       | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

Are there any specific benchmarks that you would be interested in seeing or
that are usually used for SLUB?

> Problems:
>
> - Overhead due to more TLB lookups
>
> - Larger amounts of TLBs are used for the OS. Currently we are trying to
> use the maximum mappable TLBs to reduce their numbers. This presumably
> means using 4K TLBs for all slab access.

Yes, we are using 4K pages for the slab mappings which is going to increase
TLB pressure. I also tried writing a version of the patch that uses 2M
pages which had slightly better performance, but that had its own problems.
For example most slabs are much smaller than 2M, so we would need to create
and map multiple slabs at once and we wouldn't be able to release the
physical memory until all slabs in the 2M page are unused which increases
fragmentation.
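
The trade-off with 2M mappings can be put in numbers. This is a small sketch
assuming 4K base pages and power-of-two slab orders; the helper names are
hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_PAGE_SIZE 4096UL
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/* Size in bytes of a slab of the given page order (order 0 = one 4K page). */
static unsigned long slab_bytes(unsigned int order)
{
	return BASE_PAGE_SIZE << order;
}

/* Number of slabs of this order that would share a single 2M mapping. */
static unsigned long slabs_per_huge_page(unsigned int order)
{
	assert(slab_bytes(order) <= HUGE_PAGE_SIZE);
	return HUGE_PAGE_SIZE / slab_bytes(order);
}

/*
 * The physical 2M page can be released only when every slab that lives in
 * it is unused; a single live slab pins the whole 2M.
 */
static int huge_page_releasable(const unsigned char *slab_in_use,
				unsigned long nr_slabs)
{
	for (unsigned long i = 0; i < nr_slabs; i++)
		if (slab_in_use[i])
			return 0;
	return 1;
}
```

For example, slabs_per_huge_page(0) is 512 and slabs_per_huge_page(3) is 64,
so one long-lived order-3 slab can keep 63 other slabs' worth of physical
memory from being returned.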

> - Memory may not be physically contiguous which may be required by some
> drivers doing DMA.

In the current implementation each slab is backed by physically contiguous
memory, but different slabs that are adjacent in virtual memory might not
be physically contiguous. Treating objects allocated from two different
slabs as one contiguous chunk of memory is probably wrong anyway, right?

--
Matteo
  
Ingo Molnar Sept. 18, 2023, 5:39 p.m. UTC | #4
* Matteo Rizzo <matteorizzo@google.com> wrote:

> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
> <cl@os.amperecomputing.com> wrote:
> >
> > On Fri, 15 Sep 2023, Dave Hansen wrote:
> >
> > > What's the cost?
> >
> > The only thing that I see is 1-2% on kernel compilations (and "more on
> > machines with lots of cores")?
> 
> I used kernel compilation time (wall clock time) as a benchmark while
> preparing the series. Lower is better.
> 
> Intel Skylake, 112 cores:
> 
>       LABEL    | COUNT |   MIN   |   MAX   |   MEAN  |  MEDIAN | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
>                |       | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%
> 
> AMD Milan, 256 cores:
> 
>     LABEL      | COUNT |   MIN   |   MAX   |   MEAN  |  MEDIAN | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
>                |       | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

That's sadly a rather substantial overhead for a compiler/linker workload 
that is dominantly user-space: a kernel build is about 90% user-time and 
10% system-time:

   $ perf stat --null make -j64 vmlinux
   ...

   Performance counter stats for 'make -j64 vmlinux':

        59.840704481 seconds time elapsed

      2000.774537000 seconds user
       219.138280000 seconds sys

What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between 
user-space execution and kernel-space execution?

Thanks,

	Ingo
  
Linus Torvalds Sept. 18, 2023, 6:05 p.m. UTC | #5
On Mon, 18 Sept 2023 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> user-space execution and kernel-space execution?

... and equally importantly, what about DMA?

Or what about the fixed-size slabs (aka kmalloc?) What's the point of
"never re-use the same address for a different slab", when the *same*
slab will contain different kinds of allocations anyway?

I think the whole "make it one single compile-time option" model is
completely and fundamentally broken.

                     Linus
  
Matteo Rizzo Sept. 19, 2023, 1:42 p.m. UTC | #6
On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> user-space execution and kernel-space execution?
>

Same benchmark as before (compiling a kernel on a system running the patched
kernel):

Intel Skylake:

      LABEL    | COUNT |   MIN    |   MAX    |   MEAN   |  MEDIAN  | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock     |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
               |       | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
system time    |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
               |       | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
user time      |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
               |       | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

AMD Milan:

      LABEL    | COUNT |   MIN    |   MAX    |   MEAN   |  MEDIAN  | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock     |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
               |       | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
system time    |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
               |       | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
user time      |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
               |       | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%


I'm not exactly sure why user time increases by almost 1% on Milan; it could
be TLB contention.

--
Matteo
  
Matteo Rizzo Sept. 19, 2023, 3:48 p.m. UTC | #7
On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> ... and equally importantly, what about DMA?

I'm not exactly sure what you mean by this, I don't think this should
affect the performance of DMA.

> Or what about the fixed-size slabs (aka kmalloc?) What's the point of
> "never re-use the same address for a different slab", when the *same*
> slab will contain different kinds of allocations anyway?

There are a number of patches out there (for example the random_kmalloc
series which recently got merged into v6.6) which attempt to segregate
kmalloc'd objects into different caches to make exploitation harder.
Another thing that we would like to have in the future is to segregate
objects by type (like XNU's kalloc_type
https://security.apple.com/blog/towards-the-next-generation-of-xnu-memory-safety/)
which makes exploiting use-after-free by type confusion much harder or
impossible.

All of these mitigations can be bypassed very easily if the attacker can
mount a cross-cache attack, which is what this series attempts to prevent.
This is not only theoretical, we've seen attackers use this all the time in
kCTF/kernelCTF submissions (for example
https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/).

> I think the whole "make it one single compile-time option" model is
> completely and fundamentally broken.

Wouldn't making this toggleable at boot time or runtime make performance
even worse?

--
Matteo
  
Dave Hansen Sept. 19, 2023, 3:56 p.m. UTC | #8
On 9/19/23 06:42, Matteo Rizzo wrote:
> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
>> user-space execution and kernel-space execution?
>>
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):

Thanks for running those.  One more situation that comes to mind is how
this will act under memory pressure.  Will some memory pressure make
contention on 'slub_kworker_lock' visible or make the global TLB flushes
less bearable?

In any case, none of this looks _catastrophic_.  It's surely a cost that
some folks will pay.

But I really do think it needs to be more dynamic.  There are a _couple_
of reasons for this.  If it's only a compile-time option, it's never
going to get turned on except for maybe ChromeOS and the datacenter
folks that are paranoid.  I suspect the distros will never turn it on.

A lot of questions get easier if you can disable/enable it at runtime.
For instance, what do you do if the virtual area fills up?  You _could_
just go back to handing out direct map addresses.  Less secure?  Yep.
But better than crashing (for some folks).

It also opens up the door to do this per-slab.  That alone would be a
handy debugging option.
  
Dave Hansen Sept. 19, 2023, 4:02 p.m. UTC | #9
On 9/19/23 08:48, Matteo Rizzo wrote:
>> I think the whole "make it one single compile-time option" model is
>> completely and fundamentally broken.
> Wouldn't making this toggleable at boot time or runtime make performance
> even worse?

Maybe.

But you can tolerate even more of a performance impact from a feature if
the people that don't care can actually disable it.

There are also plenty of ways to minimize the overhead of switching it
on and off at runtime.  Static branches are your best friend here.
  
Kees Cook Sept. 19, 2023, 5:56 p.m. UTC | #10
On September 19, 2023 9:02:07 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>On 9/19/23 08:48, Matteo Rizzo wrote:
>>> I think the whole "make it one single compile-time option" model is
>>> completely and fundamentally broken.
>> Wouldn't making this toggleable at boot time or runtime make performance
>> even worse?
>
>Maybe.
>
>But you can tolerate even more of a performance impact from a feature if
>the people that don't care can actually disable it.
>
>There are also plenty of ways to minimize the overhead of switching it
>on and off at runtime.  Static branches are your best friend here.

Let's start with a boot time on/off toggle (no per-slab, no switch on
out-of-space, etc). That should be sufficient for initial ease of use for
testing, etc. But yes, using static_branch will nicely DTRT here.
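
A boot-time toggle of this kind can be sketched in userspace as follows. This
is an illustration only: the "slab_virtual=" parameter name, the
parse_slab_virtual() and choose_slab_backing() helpers and the enum are all
hypothetical, and a plain bool stands in for what would be a static key in
the kernel, so the runtime-patching benefit of static branches is not
modelled here.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for a kernel static key; the feature defaults to off. */
static bool slab_virtual_enabled = false;

/* True if the character ends a command-line token. */
static bool token_end(char ch)
{
	return ch == '\0' || ch == ' ';
}

/*
 * Parse a hypothetical "slab_virtual=on|off" token from a command line.
 * Returns 0 if the parameter was found with a valid value, -1 otherwise.
 * The flag is left untouched when -1 is returned.
 */
static int parse_slab_virtual(const char *cmdline)
{
	static const char key[] = "slab_virtual=";
	const char *p = strstr(cmdline, key);

	if (!p)
		return -1;
	p += sizeof(key) - 1;

	if (strncmp(p, "on", 2) == 0 && token_end(p[2])) {
		slab_virtual_enabled = true;
		return 0;
	}
	if (strncmp(p, "off", 3) == 0 && token_end(p[3])) {
		slab_virtual_enabled = false;
		return 0;
	}
	return -1;
}

enum slab_backing { SLAB_DIRECT_MAP, SLAB_VIRTUAL_REGION };

/* The allocation path branches once on the flag set at boot. */
static enum slab_backing choose_slab_backing(void)
{
	if (slab_virtual_enabled)
		return SLAB_VIRTUAL_REGION;
	return SLAB_DIRECT_MAP;
}
```

With a real static key the branch in choose_slab_backing() would be patched
into a jump or a no-op when the key is flipped, which is why the disabled
case can be made nearly free.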
  
Linus Torvalds Sept. 19, 2023, 6:49 p.m. UTC | #11
On Tue, 19 Sept 2023 at 08:48, Matteo Rizzo <matteorizzo@google.com> wrote:
>
> On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > ... and equally importantly, what about DMA?
>
> I'm not exactly sure what you mean by this, I don't think this should
> affect the performance of DMA.

I was more worried about just basic correctness.

We've traditionally had a lot of issues with using virtual addresses
for dma, simply because we've got random drivers, and I'm not entirely
convinced that your "virt_to_phys()" update will catch it all.

IOW, even on x86-64 - which is hopefully better than most
architectures because it already has that double mapping issue - we
have things like

        unsigned long paddr = (unsigned long)vaddr - __PAGE_OFFSET;

in other places than just the __phys_addr() code.

The one place I grepped for looks to be just boot-time AMD memory
encryption, so wouldn't be any slab allocation, but ...
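
The correctness concern can be shown with a userspace model. It is a sketch
under assumptions: MODEL_SLAB_BASE, the four-entry page table and the
function names are made up for illustration, and only MODEL_PAGE_OFFSET
mirrors a real value (the default 4-level direct-map base on x86-64).
Open-coded subtraction works for direct-map addresses but silently yields a
wrong physical address for anything in a separate slab region.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE   4096ULL
/* Default x86-64 direct-map base with 4-level paging. */
#define MODEL_PAGE_OFFSET 0xffff888000000000ULL
/* Made-up base and size of a separate slab virtual region. */
#define MODEL_SLAB_BASE   0xffffa00000000000ULL
#define MODEL_SLAB_PAGES  4

/* Toy page table: physical address backing each page of the slab region. */
static const uint64_t slab_page_phys[MODEL_SLAB_PAGES] = {
	0x7a000, 0x12000, 0x9f000, 0x33000,
};

static int in_slab_region(uint64_t vaddr)
{
	return vaddr >= MODEL_SLAB_BASE &&
	       vaddr < MODEL_SLAB_BASE + MODEL_SLAB_PAGES * MODEL_PAGE_SIZE;
}

/* Open-coded translation of the kind found outside __phys_addr(). */
static uint64_t naive_virt_to_phys(uint64_t vaddr)
{
	return vaddr - MODEL_PAGE_OFFSET;
}

/* Region-aware translation: look up slab addresses, subtract otherwise. */
static uint64_t model_virt_to_phys(uint64_t vaddr)
{
	if (in_slab_region(vaddr)) {
		uint64_t off = vaddr - MODEL_SLAB_BASE;

		return slab_page_phys[off / MODEL_PAGE_SIZE] +
		       off % MODEL_PAGE_SIZE;
	}
	return vaddr - MODEL_PAGE_OFFSET;
}
```

Both functions agree on direct-map addresses. For MODEL_SLAB_BASE + 0x1010
the region-aware version returns 0x12010 in this model, while the naive
subtraction returns an unrelated value without any error, which is the kind
of bug a driver doing DMA would hit.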

               Linus
  
Ingo Molnar Sept. 20, 2023, 7:44 a.m. UTC | #12
* Matteo Rizzo <matteorizzo@google.com> wrote:

> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> > user-space execution and kernel-space execution?
> >
> 
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):
> 
> Intel Skylake:
> 
>       LABEL    | COUNT |   MIN    |   MAX    |   MEAN   |  MEDIAN  | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock     |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
>                |       | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
> system time    |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
> SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
>                |       | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
> user time      |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
> SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
>                |       | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

These Skylake figures are a bit counter-intuitive: how does an increase of 
only +0.08% in user time (which accounts for 89.5% of execution), combined 
with a +1.42% increase in system time (which consumes only 10.5% of CPU 
capacity), result in a +0.85% increase in wall-clock time?

There might be hidden factors at work in the DMA space, as Linus suggested?

Or perhaps wall-clock time is dominated by the single-threaded final link 
of the kernel, a phase that might be disproportionately hurt by these 
changes?

(Stddev seems low enough for this not to be a measurement artifact.)

The AMD Milan figures are more intuitive:

> AMD Milan:
> 
>       LABEL    | COUNT |   MIN    |   MAX    |   MEAN   |  MEDIAN  | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock     |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
>                |       | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
> system time    |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
> SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
>                |       | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
> user time      |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
> SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
>                |       | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%
>
> 
> I'm not exactly sure why user time increases by almost 1% on Milan, it 
> could be TLB contention.

The other worrying aspect is the increase of +6.15% of system time ... 
which is roughly in line with what we'd expect from a +1.79% increase in 
wall-clock time.
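
The comparison can be checked with a few lines of arithmetic. The sketch
below assumes that, at unchanged parallelism, wall-clock time should scale
roughly with total CPU time (user + system); the helper names are
hypothetical and the inputs are the MEAN columns from the tables above.

```c
#include <assert.h>

/* Relative change in percent when going from `before` to `after`. */
static double pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Relative change in percent of total CPU time (user + system). */
static double cpu_time_pct_change(double user_n, double sys_n,
				  double user_y, double sys_y)
{
	return pct_change(user_n + sys_n, user_y + sys_y);
}
```

With the Skylake means, cpu_time_pct_change(3118.143, 360.922, 3120.762,
366.062) is about 0.22%, well below the +0.85% wall-clock increase. With the
Milan means, cpu_time_pct_change(2386.343, 520.803, 2408.325, 552.825) is
about 1.86%, close to the +1.79% wall-clock increase.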

Thanks,

	Ingo
  
Vlastimil Babka Sept. 20, 2023, 8:49 a.m. UTC | #13
On 9/18/23 14:08, Matteo Rizzo wrote:
> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
>> Problems:
>>
>> - Overhead due to more TLB lookups
>>
>> - Larger amounts of TLBs are used for the OS. Currently we are trying to
>> use the maximum mappable TLBs to reduce their numbers. This presumably
>> means using 4K TLBs for all slab access.
> 
> Yes, we are using 4K pages for the slab mappings which is going to increase
> TLB pressure. I also tried writing a version of the patch that uses 2M
> pages which had slightly better performance, but that had its own problems.
> For example most slabs are much smaller than 2M, so we would need to create
> and map multiple slabs at once and we wouldn't be able to release the
> physical memory until all slabs in the 2M page are unused which increases
> fragmentation.

At the last LSF/MM [1] we basically discarded direct map fragmentation
avoidance as solving something that turns out to be insignificant, with the
exception of kernel code. As kernel code is unlikely to be allocated from
kmem caches due to W^X, we can hopefully assume it's also insignificant for
the virtual slab area.

[1] https://lwn.net/Articles/931406/