Message ID | 20230915105933.495735-1-matteorizzo@google.com
Subject    | [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator
From       | Matteo Rizzo <matteorizzo@google.com>
Date       | Fri, 15 Sep 2023 10:59:19 +0000
Series | Prevent cross-cache attacks in the SLUB allocator
Message
Matteo Rizzo
Sept. 15, 2023, 10:59 a.m. UTC
The goal of this patch series is to deterministically prevent cross-cache
attacks in the SLUB allocator.

Use-after-free bugs are normally exploited by making the memory allocator
reuse the victim object's memory for an object of a different type. This
creates a type confusion, which is a very powerful attack primitive.

There are generally two ways to create such type confusions in the kernel.
One way is to make SLUB reuse the freed object's address for another object
of a different type which lives in the same slab cache. This only works in
slab caches that can contain objects of different types (i.e. the kmalloc
caches), and the attacker is limited to objects that belong to the same
size class as the victim object.

The other way is to use a "cross-cache attack": make SLUB return the page
containing the victim object to the page allocator and then make it use the
same page for a different slab cache or other objects that contain
attacker-controlled data. This gives attackers access to all objects rather
than just the ones in the same size class as the target, and lets attackers
target objects allocated from dedicated caches such as struct file.

This patch series prevents cross-cache attacks by making sure that once a
virtual address is used for a slab cache it's never reused for anything
except other slabs in that cache (see the sketch after the diffstat below).

Jann Horn (13):
  mm/slub: add is_slab_addr/is_slab_page helpers
  mm/slub: move kmem_cache_order_objects to slab.h
  mm: use virt_to_slab instead of folio_slab
  mm/slub: create folio_set/clear_slab helpers
  mm/slub: pass additional args to alloc_slab_page
  mm/slub: pass slab pointer to the freeptr decode helper
  security: introduce CONFIG_SLAB_VIRTUAL
  mm/slub: add the slab freelists to kmem_cache
  x86: Create virtual memory region for SLUB
  mm/slub: allocate slabs from virtual memory
  mm/slub: introduce the deallocated_pages sysfs attribute
  mm/slub: sanity-check freepointers
  security: add documentation for SLAB_VIRTUAL

Matteo Rizzo (1):
  mm/slub: don't try to dereference invalid freepointers

 Documentation/arch/x86/x86_64/mm.rst       |   4 +-
 Documentation/security/self-protection.rst | 102 ++++
 arch/x86/include/asm/page_64.h             |  10 +
 arch/x86/include/asm/pgtable_64_types.h    |  21 +
 arch/x86/mm/init_64.c                      |  19 +-
 arch/x86/mm/kaslr.c                        |   9 +
 arch/x86/mm/mm_internal.h                  |   4 +
 arch/x86/mm/physaddr.c                     |  10 +
 include/linux/slab.h                       |   8 +
 include/linux/slub_def.h                   |  25 +-
 init/main.c                                |   1 +
 kernel/resource.c                          |   2 +-
 lib/slub_kunit.c                           |   4 +
 mm/memcontrol.c                            |   2 +-
 mm/slab.h                                  | 145 +++++
 mm/slab_common.c                           |  21 +-
 mm/slub.c                                  | 641 +++++++++++++++++++--
 mm/usercopy.c                              |  12 +-
 security/Kconfig.hardening                 |  16 +
 19 files changed, 977 insertions(+), 79 deletions(-)

base-commit: 46a9ea6681907a3be6b6b0d43776dccc62cad6cf
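As a toy illustration of that invariant (this is not code from the series;
toy_cache and every name in it are invented): each cache reserves its own
virtual range, a freed slab can return its physical page to the system via
MADV_DONTNEED, but the virtual address only ever goes back onto the owning
cache's freelist, so another cache can never receive it.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SLAB_SIZE       4096
#define SLABS_PER_CACHE 64

struct toy_cache {
        char *base;             /* this cache's private virtual range */
        unsigned int next;      /* bump index for never-used slabs */
        void *freelist[SLABS_PER_CACHE];
        int nfree;
};

static int toy_cache_init(struct toy_cache *c)
{
        /* Reserve virtual address space up front; physical pages are
         * only faulted in when a slab is actually written. */
        c->base = mmap(NULL, (size_t)SLAB_SIZE * SLABS_PER_CACHE,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        c->next = 0;
        c->nfree = 0;
        return c->base == MAP_FAILED ? -1 : 0;
}

static void *toy_slab_alloc(struct toy_cache *c)
{
        if (c->nfree)                   /* reuse only this cache's slabs */
                return c->freelist[--c->nfree];
        if (c->next == SLABS_PER_CACHE)
                return NULL;            /* the real series grows the region */
        return c->base + (size_t)SLAB_SIZE * c->next++;
}

static void toy_slab_free(struct toy_cache *c, void *slab)
{
        /* Drop the physical page, but the virtual address stays owned
         * by this cache forever: no cross-cache reuse is possible. */
        madvise(slab, SLAB_SIZE, MADV_DONTNEED);
        c->freelist[c->nfree++] = slab;
}

int main(void)
{
        struct toy_cache a, b;
        void *p;

        if (toy_cache_init(&a) || toy_cache_init(&b))
                return 1;
        p = toy_slab_alloc(&a);
        toy_slab_free(&a, p);
        /* Cache b can never be handed p's address, even though p is free. */
        printf("a reuses %p: %s\n", p, toy_slab_alloc(&a) == p ? "yes" : "no");
        return 0;
}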
Comments
On 9/15/23 03:59, Matteo Rizzo wrote:
> The goal of this patch series is to deterministically prevent cross-cache
> attacks in the SLUB allocator.

What's the cost?
On Fri, 15 Sep 2023, Dave Hansen wrote:

> On 9/15/23 03:59, Matteo Rizzo wrote:
>> The goal of this patch series is to deterministically prevent cross-cache
>> attacks in the SLUB allocator.
>
> What's the cost?

The only thing that I see is 1-2% on kernel compilations (and "more on
machines with lots of cores")?

Having a virtualized slab subsystem could enable other things:

- The page order calculation could be simplified, since vmalloc can stitch
  arbitrary base pages together to form larger contiguous virtual segments.
  So just use e.g. order 5 or so for all slabs to reduce contention? (See
  the sketch after this message.)

- Maybe we could make slab pages movable (if we can ensure that slab
  objects are not touched somehow; at least stop_machine could be used to
  move batches of slab memory).

- Maybe we can avoid allocating page structs somehow for slab memory?
  Looks like this is taking a step in that direction. The metadata storage
  of the slab allocator could be reworked and optimized better.

Problems:

- Overhead due to more TLB lookups.

- More TLB entries are used for the OS. Currently we try to use the
  largest mappable page sizes to keep the number of entries down. This
  presumably means using 4K TLB entries for all slab access.

- Memory may not be physically contiguous, which may be required by some
  drivers doing DMA.
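On the "stitching" point, here is a rough kernel-style sketch of the
mechanism, not code from this series (make_virtual_slab() is a made-up
name, and teardown via vunmap() plus freeing the pages is elided): vmap()
can present independent order-0 pages as one virtually-contiguous slab,
which is what would make a single larger order plausible for all caches.

#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Back a 2^order-page slab with independent order-0 allocations. */
static void *make_virtual_slab(unsigned int order)
{
	unsigned int i, nr = 1u << order;
	struct page **pages;
	void *virt = NULL;

	pages = kmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	/* No physically contiguous order-N block is needed... */
	for (i = 0; i < nr; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto out;
	}

	/* ...because vmap() stitches the base pages together into one
	 * contiguous virtual range, mapped with 4K kernel PTEs. */
	virt = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
out:
	if (!virt)
		while (i--)
			__free_page(pages[i]);
	kfree(pages);	/* the mapping does not need the array afterwards */
	return virt;
}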
On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Fri, 15 Sep 2023, Dave Hansen wrote:
>
> > What's the cost?
>
> The only thing that I see is 1-2% on kernel compilations (and "more on
> machines with lots of cores")?

I used kernel compilation time (wall clock time) as a benchmark while
preparing the series. Lower is better.

Intel Skylake, 112 cores:

LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
               |       | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%

AMD Milan, 256 cores:

LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
               |       | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

Are there any specific benchmarks that you would be interested in seeing,
or that are usually used for SLUB?

> Problems:
>
> - Overhead due to more TLB lookups.
>
> - More TLB entries are used for the OS. Currently we try to use the
>   largest mappable page sizes to keep the number of entries down. This
>   presumably means using 4K TLB entries for all slab access.

Yes, we are using 4K pages for the slab mappings, which is going to
increase TLB pressure. I also tried writing a version of the patch that
uses 2M pages, which had slightly better performance but its own problems.
For example, most slabs are much smaller than 2M, so we would need to
create and map multiple slabs at once, and we couldn't release the
physical memory until all slabs in the 2M page were unused, which
increases fragmentation.

> - Memory may not be physically contiguous, which may be required by some
>   drivers doing DMA.

In the current implementation each slab is backed by physically contiguous
memory, but different slabs that are adjacent in virtual memory might not
be physically contiguous. Treating objects allocated from two different
slabs as one contiguous chunk of memory is probably wrong anyway, right?

--
Matteo
* Matteo Rizzo <matteorizzo@google.com> wrote:

> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
> <cl@os.amperecomputing.com> wrote:
> >
> > On Fri, 15 Sep 2023, Dave Hansen wrote:
> >
> > > What's the cost?
> >
> > The only thing that I see is 1-2% on kernel compilations (and "more on
> > machines with lots of cores")?
>
> I used kernel compilation time (wall clock time) as a benchmark while
> preparing the series. Lower is better.
>
> Intel Skylake, 112 cores:
>
> LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
>                |       | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%
>
> AMD Milan, 256 cores:
>
> LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
>                |       | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

That's sadly a rather substantial overhead for a compiler/linker workload
that is dominantly user-space: a kernel build is about 90% user-time and
10% system-time:

  $ perf stat --null make -j64 vmlinux
  ...
  Performance counter stats for 'make -j64 vmlinux':

      59.840704481 seconds time elapsed
    2000.774537000 seconds user
     219.138280000 seconds sys

What's the split of the increase in overhead due to SLAB_VIRTUAL=y,
between user-space execution and kernel-space execution?

Thanks,

	Ingo
On Mon, 18 Sept 2023 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y,
> between user-space execution and kernel-space execution?

... and equally importantly, what about DMA?

Or what about the fixed-size slabs (aka kmalloc)? What's the point of
"never re-use the same address for a different slab", when the *same*
slab will contain different kinds of allocations anyway?

I think the whole "make it one single compile-time option" model is
completely and fundamentally broken.

                Linus
On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y,
> between user-space execution and kernel-space execution?

Same benchmark as before (compiling a kernel on a system running the
patched kernel):

Intel Skylake:

LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock     |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
               |       | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
system time    |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
               |       | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
user time      |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
               |       | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

AMD Milan:

LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock     |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
               |       | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
system time    |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
               |       | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
user time      |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
               |       | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%

I'm not exactly sure why user time increases by almost 1% on Milan; it
could be TLB contention.

--
Matteo
On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> ... and equally importantly, what about DMA?

I'm not exactly sure what you mean by this; I don't think this should
affect the performance of DMA.

> Or what about the fixed-size slabs (aka kmalloc)? What's the point of
> "never re-use the same address for a different slab", when the *same*
> slab will contain different kinds of allocations anyway?

There are a number of patches out there (for example the random_kmalloc
series which recently got merged into v6.6) which attempt to segregate
kmalloc'd objects into different caches to make exploitation harder.
Another thing that we would like to have in the future is to segregate
objects by type (like XNU's kalloc_type,
https://security.apple.com/blog/towards-the-next-generation-of-xnu-memory-safety/),
which makes exploiting use-after-free by type confusion much harder or
impossible.

All of these mitigations can be bypassed very easily if the attacker can
mount a cross-cache attack, which is what this series attempts to prevent.
This is not only theoretical; we've seen attackers use this all the time
in kCTF/kernelCTF submissions (for example
https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/).

> I think the whole "make it one single compile-time option" model is
> completely and fundamentally broken.

Wouldn't making this toggleable at boot time or runtime make performance
even worse?

--
Matteo
On 9/19/23 06:42, Matteo Rizzo wrote:
> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>> What's the split of the increase in overhead due to SLAB_VIRTUAL=y,
>> between user-space execution and kernel-space execution?
>
> Same benchmark as before (compiling a kernel on a system running the
> patched kernel):

Thanks for running those. One more situation that comes to mind is how
this will act under memory pressure. Will some memory pressure make
contention on 'slub_kworker_lock' visible or make the global TLB flushes
less bearable?

In any case, none of this looks _catastrophic_. It's surely a cost that
some folks will pay. But I really do think it needs to be more dynamic.
There are a _couple_ of reasons for this.

If it's only a compile-time option, it's never going to get turned on
except for maybe ChromeOS and the datacenter folks that are paranoid. I
suspect the distros will never turn it on.

A lot of questions get easier if you can disable/enable it at runtime.
For instance, what do you do if the virtual area fills up? You _could_
just go back to handing out direct map addresses. Less secure? Yep. But
better than crashing (for some folks).

It also opens up the door to do this per-slab. That alone would be a
handy debugging option.
On 9/19/23 08:48, Matteo Rizzo wrote:
>> I think the whole "make it one single compile-time option" model is
>> completely and fundamentally broken.
>
> Wouldn't making this toggleable at boot time or runtime make performance
> even worse?

Maybe. But you can tolerate even more of a performance impact from a
feature if the people that don't care can actually disable it.

There are also plenty of ways to minimize the overhead of switching it on
and off at runtime. Static branches are your best friend here.
On September 19, 2023 9:02:07 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
> On 9/19/23 08:48, Matteo Rizzo wrote:
>>> I think the whole "make it one single compile-time option" model is
>>> completely and fundamentally broken.
>>
>> Wouldn't making this toggleable at boot time or runtime make performance
>> even worse?
>
> Maybe.
>
> But you can tolerate even more of a performance impact from a feature if
> the people that don't care can actually disable it.
>
> There are also plenty of ways to minimize the overhead of switching it
> on and off at runtime. Static branches are your best friend here.

Let's start with a boot-time on/off toggle (no per-slab, no switch on
out-of-space, etc). That should be sufficient for initial ease of use for
testing, etc. But yes, using static_branch will nicely DTRT here.
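For reference, a minimal sketch of that pattern using the stock
static-branch API; the key name, the helper, and the "slab_virtual" boot
parameter are hypothetical, not taken from this series:

#include <linux/init.h>
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(slab_virtual_enabled);

/* Hypothetical "slab_virtual" boot parameter flips the key once. */
static int __init slab_virtual_setup(char *str)
{
	static_branch_enable(&slab_virtual_enabled);
	return 1;
}
__setup("slab_virtual", slab_virtual_setup);

static inline bool slab_use_virtual(void)
{
	/* Compiles to a patched NOP/jump, so the check costs almost
	 * nothing on the allocation fast path when the feature is off. */
	return static_branch_unlikely(&slab_virtual_enabled);
}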
On Tue, 19 Sept 2023 at 08:48, Matteo Rizzo <matteorizzo@google.com> wrote:
>
> On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > ... and equally importantly, what about DMA?
>
> I'm not exactly sure what you mean by this; I don't think this should
> affect the performance of DMA.

I was more worried about just basic correctness. We've traditionally had a
lot of issues with using virtual addresses for DMA, simply because we've
got random drivers, and I'm not entirely convinced that your
"virt_to_phys()" update will catch it all.

IOW, even on x86-64 - which is hopefully better than most architectures
because it already has that double mapping issue - we have things like

	unsigned long paddr = (unsigned long)vaddr - __PAGE_OFFSET;

in other places than just the __phys_addr() code.

The one place I grepped for looks to be just boot-time AMD memory
encryption, so wouldn't be any slab allocation, but ...

                Linus
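To make the hazard concrete, a hedged sketch (checked_virt_to_phys() is a
made-up helper, not something the series adds): PAGE_OFFSET arithmetic is
only valid for direct-map addresses, so once slabs live in a vmap'd
region, any driver doing its own vaddr-to-paddr math would need to go
through the page tables instead.

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static phys_addr_t checked_virt_to_phys(const void *vaddr)
{
	/* vmap()/vmalloc() addresses are not offset-mapped: subtracting
	 * PAGE_OFFSET from them yields garbage, not a physical address,
	 * so translate through the page tables. */
	if (is_vmalloc_addr(vaddr))
		return page_to_phys(vmalloc_to_page(vaddr)) +
		       offset_in_page(vaddr);

	/* Direct-map case, where the linear arithmetic is valid. */
	return virt_to_phys((void *)vaddr);
}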
* Matteo Rizzo <matteorizzo@google.com> wrote:

> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > What's the split of the increase in overhead due to SLAB_VIRTUAL=y,
> > between user-space execution and kernel-space execution?
>
> Same benchmark as before (compiling a kernel on a system running the
> patched kernel):
>
> Intel Skylake:
>
> LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock     |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
>                |       | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
> system time    |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
> SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
>                |       | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
> user time      |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
> SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
>                |       | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

These Skylake figures are a bit counter-intuitive: how does an increase of
only +0.08% user-time - which dominates 89.5% of execution - combined with
a +1.42% increase in system time that consumes only 10.5% of CPU capacity,
result in a +0.85% increase in wall-clock time? (A rough arithmetic check
follows this message.)

There might be hidden factors at work in the DMA space, as Linus
suggested. Or perhaps wall-clock time is dominated by the single-threaded
final link phase of the kernel, which might be disproportionately hurt by
these changes?

(Stddev seems low enough for this not to be a measurement artifact.)

The AMD Milan figures are more intuitive:

> AMD Milan:
>
> LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock     |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
>                |       | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
> system time    |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
> SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
>                |       | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
> user time      |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
> SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
>                |       | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%
>
> I'm not exactly sure why user time increases by almost 1% on Milan; it
> could be TLB contention.

The other worrying aspect is the increase of +6.15% of system time, which
is roughly in line with what we'd expect from a +1.79% increase in
wall-clock time.

Thanks,

	Ingo
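A back-of-the-envelope version of that sanity check, derived from the
Skylake tables above and assuming a perfectly parallel build (which a real
one is not):

  user share  = 3118.143 / (3118.143 + 360.922)  ~ 89.5%
  sys share   =  360.922 / 3479.065              ~ 10.5%

  expected wall-clock delta ~ 0.895*0.08% + 0.105*1.42% ~ 0.22%
  observed wall-clock delta = +0.85%

Under that idealized model roughly 0.6 percentage points are unexplained
by the CPU-time split alone, which is consistent with a serial phase such
as the final link (or idle/flush time) absorbing the rest.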
On 9/18/23 14:08, Matteo Rizzo wrote:
> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
>> Problems:
>>
>> - Overhead due to more TLB lookups.
>>
>> - More TLB entries are used for the OS. Currently we try to use the
>>   largest mappable page sizes to keep the number of entries down. This
>>   presumably means using 4K TLB entries for all slab access.
>
> Yes, we are using 4K pages for the slab mappings, which is going to
> increase TLB pressure. I also tried writing a version of the patch that
> uses 2M pages, which had slightly better performance but its own
> problems. For example, most slabs are much smaller than 2M, so we would
> need to create and map multiple slabs at once, and we couldn't release
> the physical memory until all slabs in the 2M page were unused, which
> increases fragmentation.

At last LSF/MM [1] we basically discarded direct map fragmentation
avoidance as solving something that turns out to be insignificant, with
the exception of kernel code. As kernel code is unlikely to be allocated
from kmem caches due to W^X, we can hopefully assume it's also
insignificant for the virtual slab area.

[1] https://lwn.net/Articles/931406/