Message ID: 20221107201317.324457-1-jannh@google.com
State: New
From: Jann Horn <jannh@google.com>
To: Kees Cook <keescook@chromium.org>, linux-hardening@vger.kernel.org, kernel-hardening@lists.openwall.com
Cc: Greg KH <gregkh@linuxfoundation.org>, Linus Torvalds <torvalds@linuxfoundation.org>, Seth Jenkins <sethjenkins@google.com>, "Eric W. Biederman" <ebiederm@xmission.com>, Andy Lutomirski <luto@kernel.org>, linux-kernel@vger.kernel.org
Subject: [PATCH] exit: Put an upper limit on how often we can oops
Date: Mon, 7 Nov 2022 21:13:17 +0100
Series: exit: Put an upper limit on how often we can oops
Commit Message
Jann Horn
Nov. 7, 2022, 8:13 p.m. UTC
Many Linux systems are configured to not panic on oops; but allowing an
attacker to oops the system **really** often can make even bugs that look
completely unexploitable exploitable (like NULL dereferences and such) if
each crash elevates a refcount by one or a lock is taken in read mode, and
this causes a counter to eventually overflow.
The most interesting counters for this are 32 bits wide (like open-coded
refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
platforms is just 16 bits, but probably nobody cares about 32-bit platforms
that much nowadays.)
So let's panic the system if the kernel is constantly oopsing.
The speed of oopsing 2^32 times probably depends on several factors, like
how long the stack trace is and which unwinder you're using; an empirically
important one is whether your console is showing a graphical environment or
a text console that oopses will be printed to.
In a quick single-threaded benchmark, it looks like oopsing in a vfork()
child with a very short stack trace only takes ~510 microseconds per run
when a graphical console is active; but switching to a text console that
oopses are printed to slows it down around 87x, to ~45 milliseconds per
run.
(Adding more threads makes this faster, but the actual oops printing
happens under &die_lock on x86, so you can maybe speed this up by a factor
of around 2 and then any further improvement gets eaten up by lock
contention.)
It looks like it would take around 8-12 days to overflow a 32-bit counter
with repeated oopsing on a multi-core x86 system running a graphical
environment; both Seth (with a distro kernel on normal hardware in a
standard configuration) and I (in an x86 VM) got numbers in that ballpark.
12 days isn't *that* short on a desktop system, and you'd likely need much
longer on a typical server system (assuming that people don't run graphical
desktop environments on their servers), and this is a *very* noisy and
violent approach to exploiting the kernel; and it also seems to take orders
of magnitude longer on some machines, probably because stuff like EFI
pstore will slow it down a ton if that's active.
Signed-off-by: Jann Horn <jannh@google.com>
---
I picked 10000 here to also provide safety for the ldsem code on 32-bit
systems, but you could also argue that the real fix there is to make
ldsem more robust, and that the limit should be something like 2^31...
An alternative approach would be to always let make_task_dead() take the
do_task_dead() path and never exit; but that would probably be a more
disruptive change?
@Kees should this go through your tree? (After waiting a while for
the inevitable bikeshedding on whether the default limit should be closer
to 10000 or 2^31.)
Documentation/admin-guide/sysctl/kernel.rst | 7 +++++++
include/linux/panic.h | 1 +
kernel/exit.c | 22 +++++++++++++++++++++
kernel/sysctl.c | 7 +++++++
4 files changed, 37 insertions(+)
base-commit: f0c4d9fc9cc9462659728d168387191387e903cc
Comments
On Mon, Nov 7, 2022 at 12:13 PM Jann Horn <jannh@google.com> wrote:
>
> I picked 10000 here to also provide safety for the ldsem code on 32-bit
> systems, but you could also argue that the real fix there is to make
> ldsem more robust, and that the limit should be something like 2^31...
>
> An alternative approach would be to always let make_task_dead() take the
> do_task_dead() path and never exit; but that would probably be a more
> disruptive change?

It might be more disruptive, but it might also be a better idea in some
respects: one of the bigger issues we've had with oopses in inconvenient
places is when they then cause even more problems in the exit path
(because the initial oops was horrid).

I'd honestly prefer something higher than 10000, but hey...

I would also prefer something where that legacy 'ldsem' was based on our
current legacy 'struct semaphore' rather than the half-way optimized
'rwsem'. The difference being that 'struct rwsem' tries to be clever and
uses atomic operations, while we long ago decided that anybody who uses
the bad old 'struct semaphore' can just use spinlocks and non-atomic
logic.

It's kind of silly how we try to stuff things into one 'sem->count'
value, when we could just have separate readers and writers counts. And
the only reason we do that is because those kinds of things *do* matter
for contended locks and the rwsem code has it, but I really think the
ldsem code could just always take the spinlock that it already takes in
the slowpath, and just skip any other atomics.

And it shouldn't have a wait_lock thing and two different wait queues -
it should have one wait queue, use that wait queue's spinlock *as* the
lock for the semaphore operations, and put readers at the end, and
writers at the beginning as exclusive waiters.

So that ldesc_sem thing is just historical garbage in so many ways. It's
basically a modified copy of an old version of our rwsem, and hasn't
even been simplified for its intended use, nor has it been updated to
improvements by the actual rwsem code. Worst of both worlds, in other
words.

Oh well. I don't think anybody really cares about the ldsem code, which
is why it is like it is, and probably will remain that way forever.

I guess 10000 is fine - small enough to test for, big enough that if
somebody really hits it, they only have themselves to blame.

Linus
On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> +oops_limit
> +==========
> +
> +Number of kernel oopses after which the kernel should panic when
> +``panic_on_oops`` is not set.

Rather than introduce this separate oops_limit, how about making
panic_on_oops (and maybe all panic_on_*) take the limit value(s) instead
of being Boolean? I think this would preserve the current behavior at
panic_on_oops = 0 and panic_on_oops = 1, but would introduce your
desired behavior at panic_on_oops = 10000. We can make 10000 the new
default. If a distro overrides panic_on_oops, it probably sets it to 1
like RHEL does.

Are there distros explicitly setting panic_on_oops to 0? If so, that
could be a reason to introduce the separate oops_limit.

I'm not advocating one way or the other - I just felt this should be
explicitly mentioned and decided on.

Alexander
On Mon, Nov 7, 2022 at 10:15 PM Solar Designer <solar@openwall.com> wrote:
> On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> > +oops_limit
> > +==========
> > +
> > +Number of kernel oopses after which the kernel should panic when
> > +``panic_on_oops`` is not set.
>
> Rather than introduce this separate oops_limit, how about making
> panic_on_oops (and maybe all panic_on_*) take the limit value(s) instead
> of being Boolean? I think this would preserve the current behavior at
> panic_on_oops = 0 and panic_on_oops = 1, but would introduce your
> desired behavior at panic_on_oops = 10000. We can make 10000 the new
> default. If a distro overrides panic_on_oops, it probably sets it to 1
> like RHEL does.
>
> Are there distros explicitly setting panic_on_oops to 0? If so, that
> could be a reason to introduce the separate oops_limit.
>
> I'm not advocating one way or the other - I just felt this should be
> explicitly mentioned and decided on.

I think at least internally in the kernel, it probably works better to
keep those two concepts separate? For example, sparc has a function
die_nmi() that uses panic_on_oops to determine whether the system
should panic when a watchdog detects a lockup.
From: Jann Horn
> Sent: 07 November 2022 20:13
>
> Many Linux systems are configured to not panic on oops; but allowing an
> attacker to oops the system **really** often can make even bugs that look
> completely unexploitable exploitable (like NULL dereferences and such) if
> each crash elevates a refcount by one or a lock is taken in read mode, and
> this causes a counter to eventually overflow.
>
> The most interesting counters for this are 32 bits wide (like open-coded
> refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
> platforms is just 16 bits, but probably nobody cares about 32-bit platforms
> that much nowadays.)
>
> So let's panic the system if the kernel is constantly oopsing.

I think you are pretty much guaranteed to run out of memory
(or at least KVA) before any 32bit counter wraps.

That is probably even harder to diagnose than a refcount wrap!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Nov 8, 2022 at 10:26 AM David Laight <David.Laight@aculab.com> wrote:
> > Many Linux systems are configured to not panic on oops; but allowing an
> > attacker to oops the system **really** often can make even bugs that look
> > completely unexploitable exploitable (like NULL dereferences and such) if
> > each crash elevates a refcount by one or a lock is taken in read mode, and
> > this causes a counter to eventually overflow.
> >
> > The most interesting counters for this are 32 bits wide (like open-coded
> > refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
> > platforms is just 16 bits, but probably nobody cares about 32-bit platforms
> > that much nowadays.)
> >
> > So let's panic the system if the kernel is constantly oopsing.
>
> I think you are pretty much guaranteed to run out of memory
> (or at least KVA) before any 32bit counter wraps.

Not if you repeatedly take a reference and then oops without dropping
the reference, and the oops path cleans up all the resources that were
allocated for the crashing tasks. In that case, each oops increments
the reference count by 1 without causing memory allocation.

(Also, as a sidenote: To store 2^32 densely packed pointers, you just
need around 8 bytes * (2^32) = 32 GiB of RAM. So on a workstation or
server with a decent amount of RAM, there can already be cases where you
can overflow a 32-bit reference counter with legitimate references -
see <https://bugs.chromium.org/p/project-zero/issues/detail?id=809>.
Another example that needs more RAM is
<https://bugs.chromium.org/p/project-zero/issues/detail?id=1752>, that
needs ~140 GiB. Still probably within the realm of what a beefy server
might have nowadays?

Kernel virtual address space is not a meaningful limit on x86-64 - even
with 4-level paging, the kernel has a 64 TiB virtual memory area (the
direct mapping) that is used for slab allocations and such, see
<https://www.kernel.org/doc/html/latest/x86/x86_64/mm.html>. With
5-level paging it's even more, 32 PiB.)
On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> @Kees should this go through your tree? (After waiting a while for
> the inevitable bikeshedding on whether the default limit should be closer
> to 10000 or 2^31.)

Sure, yeah. I can take it.
On Mon, Nov 07, 2022 at 10:48:20PM +0100, Jann Horn wrote:
> On Mon, Nov 7, 2022 at 10:15 PM Solar Designer <solar@openwall.com> wrote:
> > On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> > > +oops_limit
> > > +==========
> > > +
> > > +Number of kernel oopses after which the kernel should panic when
> > > +``panic_on_oops`` is not set.
> >
> > Rather than introduce this separate oops_limit, how about making
> > panic_on_oops (and maybe all panic_on_*) take the limit value(s) instead
> > of being Boolean? I think this would preserve the current behavior at
> > panic_on_oops = 0 and panic_on_oops = 1, but would introduce your
> > desired behavior at panic_on_oops = 10000. We can make 10000 the new
> > default. If a distro overrides panic_on_oops, it probably sets it to 1
> > like RHEL does.
> >
> > Are there distros explicitly setting panic_on_oops to 0? If so, that
> > could be a reason to introduce the separate oops_limit.
> >
> > I'm not advocating one way or the other - I just felt this should be
> > explicitly mentioned and decided on.
>
> I think at least internally in the kernel, it probably works better to
> keep those two concepts separate? For example, sparc has a function
> die_nmi() that uses panic_on_oops to determine whether the system
> should panic when a watchdog detects a lockup.

Internally, yes, the kernel should keep "panic_on_oops" to mean "panic
_NOW_ on oops?" but I would agree with Solar -- this is a counter as far
as userspace is concerned. "Panic on Oops" after 1 oops, 2 oopses, etc.
I would like to see this for panic_on_warn too, actually.
On Tue, Nov 08, 2022 at 09:24:40AM -0800, Kees Cook wrote:
> On Mon, Nov 07, 2022 at 10:48:20PM +0100, Jann Horn wrote:
> > On Mon, Nov 7, 2022 at 10:15 PM Solar Designer <solar@openwall.com> wrote:
> > > On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> > > > +oops_limit
> > > > +==========
> > > > +
> > > > +Number of kernel oopses after which the kernel should panic when
> > > > +``panic_on_oops`` is not set.
> > >
> > > Rather than introduce this separate oops_limit, how about making
> > > panic_on_oops (and maybe all panic_on_*) take the limit value(s) instead
> > > of being Boolean? I think this would preserve the current behavior at
> > > panic_on_oops = 0 and panic_on_oops = 1, but would introduce your
> > > desired behavior at panic_on_oops = 10000. We can make 10000 the new
> > > default. If a distro overrides panic_on_oops, it probably sets it to 1
> > > like RHEL does.
> > >
> > > Are there distros explicitly setting panic_on_oops to 0? If so, that
> > > could be a reason to introduce the separate oops_limit.
> > >
> > > I'm not advocating one way or the other - I just felt this should be
> > > explicitly mentioned and decided on.
> >
> > I think at least internally in the kernel, it probably works better to
> > keep those two concepts separate? For example, sparc has a function
> > die_nmi() that uses panic_on_oops to determine whether the system
> > should panic when a watchdog detects a lockup.
>
> Internally, yes, the kernel should keep "panic_on_oops" to mean "panic
> _NOW_ on oops?" but I would agree with Solar -- this is a counter as far
> as userspace is concerned. "Panic on Oops" after 1 oops, 2 oopses, etc.
> I would like to see this for panic_on_warn too, actually.

Hm, in looking at this more closely, I think it does make sense as you
already have it. The count is for the panic_on_oops=0 case, so even in
userspace, trying to remap that doesn't make a bunch of sense. So, yes,
let's keep this as-is.
From: Jann Horn
> Sent: 08 November 2022 14:53
>
> On Tue, Nov 8, 2022 at 10:26 AM David Laight <David.Laight@aculab.com> wrote:
> > > Many Linux systems are configured to not panic on oops; but allowing an
> > > attacker to oops the system **really** often can make even bugs that look
> > > completely unexploitable exploitable (like NULL dereferences and such) if
> > > each crash elevates a refcount by one or a lock is taken in read mode, and
> > > this causes a counter to eventually overflow.
> > >
> > > The most interesting counters for this are 32 bits wide (like open-coded
> > > refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
> > > platforms is just 16 bits, but probably nobody cares about 32-bit platforms
> > > that much nowadays.)
> > >
> > > So let's panic the system if the kernel is constantly oopsing.
> >
> > I think you are pretty much guaranteed to run out of memory
> > (or at least KVA) before any 32bit counter wraps.
>
> Not if you repeatedly take a reference and then oops without dropping
> the reference, and the oops path cleans up all the resources that were
> allocated for the crashing tasks. In that case, each oops increments
> the reference count by 1 without causing memory allocation.

I'd have thought that the kernel stack and process areas couldn't
be freed because they might contain 'live data'.
There is also the much smaller pid_t structure.

Of course I might be wrong...
But I'm sure /proc/pid/stack is valid for an oopsed process.

David
On Wed, Nov 9, 2022 at 10:04 AM David Laight <David.Laight@aculab.com> wrote:
> From: Jann Horn
> > Sent: 08 November 2022 14:53
> >
> > On Tue, Nov 8, 2022 at 10:26 AM David Laight <David.Laight@aculab.com> wrote:
> > > > Many Linux systems are configured to not panic on oops; but allowing an
> > > > attacker to oops the system **really** often can make even bugs that look
> > > > completely unexploitable exploitable (like NULL dereferences and such) if
> > > > each crash elevates a refcount by one or a lock is taken in read mode, and
> > > > this causes a counter to eventually overflow.
> > > >
> > > > The most interesting counters for this are 32 bits wide (like open-coded
> > > > refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
> > > > platforms is just 16 bits, but probably nobody cares about 32-bit platforms
> > > > that much nowadays.)
> > > >
> > > > So let's panic the system if the kernel is constantly oopsing.
> > >
> > > I think you are pretty much guaranteed to run out of memory
> > > (or at least KVA) before any 32bit counter wraps.
> >
> > Not if you repeatedly take a reference and then oops without dropping
> > the reference, and the oops path cleans up all the resources that were
> > allocated for the crashing tasks. In that case, each oops increments
> > the reference count by 1 without causing memory allocation.
>
> I'd have thought that the kernel stack and process areas couldn't
> be freed because they might contain 'live data'.
> There is also the much smaller pid_t structure.
>
> Of course I might be wrong...
> But I'm sure /proc/pid/stack is valid for an oopsed process.

No. It might be in the edge case where the process oopses, then the
kernel tries to exit, then it oopses again, and the kernel decides that
that process is a hazardous mess and can't be cleaned up. But in the
general case, oopsed processes don't have /proc/$pid/stack anymore,
they go through the normal exit path, and they get reaped normally by
their parent.
I'll add to this by noting that it is highly oops dependent. Depending on what locks and refcounts you had taken at the moment you oops'd, it may not be possible to clean up the process e.g. if you're holding your own mmap lock at the moment you oops you're liable to deadlock in __mmput. But there are certainly empirical cases (not all too isolated ones) where the kernel really *is* able to clean up the entire process.
On Tue, Nov 08, 2022 at 11:38:22AM -0800, Kees Cook wrote:
> On Tue, Nov 08, 2022 at 09:24:40AM -0800, Kees Cook wrote:
> > On Mon, Nov 07, 2022 at 10:48:20PM +0100, Jann Horn wrote:
> > > On Mon, Nov 7, 2022 at 10:15 PM Solar Designer <solar@openwall.com> wrote:
> > > > On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> > > > > +oops_limit
> > > > > +==========
> > > > > +
> > > > > +Number of kernel oopses after which the kernel should panic when
> > > > > +``panic_on_oops`` is not set.
> > > >
> > > > Rather than introduce this separate oops_limit, how about making
> > > > panic_on_oops (and maybe all panic_on_*) take the limit value(s) instead
> > > > of being Boolean? I think this would preserve the current behavior at
> > > > panic_on_oops = 0 and panic_on_oops = 1, but would introduce your
> > > > desired behavior at panic_on_oops = 10000. We can make 10000 the new
> > > > default. If a distro overrides panic_on_oops, it probably sets it to 1
> > > > like RHEL does.
> > > >
> > > > Are there distros explicitly setting panic_on_oops to 0? If so, that
> > > > could be a reason to introduce the separate oops_limit.
> > > >
> > > > I'm not advocating one way or the other - I just felt this should be
> > > > explicitly mentioned and decided on.
> > >
> > > I think at least internally in the kernel, it probably works better to
> > > keep those two concepts separate? For example, sparc has a function
> > > die_nmi() that uses panic_on_oops to determine whether the system
> > > should panic when a watchdog detects a lockup.
> >
> > Internally, yes, the kernel should keep "panic_on_oops" to mean "panic
> > _NOW_ on oops?" but I would agree with Solar -- this is a counter as far
> > as userspace is concerned. "Panic on Oops" after 1 oops, 2 oopses, etc.
> > I would like to see this for panic_on_warn too, actually.
>
> Hm, in looking at this more closely, I think it does make sense as you
> already have it. The count is for the panic_on_oops=0 case, so even in
> userspace, trying to remap that doesn't make a bunch of sense. So, yes,
> let's keep this as-is.

I don't follow your logic there - maybe you got confused? Yes, as
proposed the count is for panic_on_oops=0, but that's just weird - first
you kind of request no panic with panic_on_oops=0, then override that
with oops_limit=10000. I think it is more natural to request
panic_on_oops=10000 in one step. Also, I think it is more natural to
preserve panic_on_oops=0's meaning of no panic on Oops. To me, about
the only reason to introduce the override is if we want to literally
override a distro's explicit default of panic_on_oops=0.

Alexander
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 98d1b198b2b4c..09713f25b3d62 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -667,6 +667,13 @@ This is the default behavior.
 an oops event is detected.
 
 
+oops_limit
+==========
+
+Number of kernel oopses after which the kernel should panic when
+``panic_on_oops`` is not set.
+
+
 osrelease, ostype & version
 ===========================
 
diff --git a/include/linux/panic.h b/include/linux/panic.h
index c7759b3f20452..5b3e029fe1eb0 100644
--- a/include/linux/panic.h
+++ b/include/linux/panic.h
@@ -21,6 +21,7 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
+extern int oops_limit;
 extern unsigned long panic_on_taint;
 extern bool panic_on_taint_nousertaint;
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 35e0a31a0315c..827ceffbfa432 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -872,8 +872,17 @@ void __noreturn do_exit(long code)
 	do_task_dead();
 }
 
+/*
+ * The default value should be high enough to not crash a system that randomly
+ * crashes its kernel from time to time, but low enough to at least not permit
+ * overflowing 32-bit refcounts or the ldsem writer count.
+ */
+int oops_limit = 10000;
+
 void __noreturn make_task_dead(int signr)
 {
+	static atomic_t oops_count = ATOMIC_INIT(0);
+
 	/*
 	 * Take the task off the cpu after something catastrophic has
 	 * happened.
@@ -897,6 +906,19 @@ void __noreturn make_task_dead(int signr)
 		preempt_count_set(PREEMPT_ENABLED);
 	}
 
+	/*
+	 * Every time the system oopses, if the oops happens while a reference
+	 * to an object was held, the reference leaks.
+	 * If the oops doesn't also leak memory, repeated oopsing can cause
+	 * reference counters to wrap around (if they're not using refcount_t).
+	 * This means that repeated oopsing can make unexploitable-looking bugs
+	 * exploitable through repeated oopsing.
+	 * To make sure this can't happen, place an upper bound on how often the
+	 * kernel may oops without panic().
+	 */
+	if (atomic_inc_return(&oops_count) >= READ_ONCE(oops_limit))
+		panic("Oopsed too often (oops_limit is %d)", oops_limit);
+
 	/*
 	 * We're taking recursive faults here in make_task_dead. Safest is to just
 	 * leave this task alone and wait for reboot.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 188c305aeb8b7..63370aa4c078f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1866,6 +1866,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "oops_limit",
+		.data		= &oops_limit,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{
 		.procname	= "panic_print",
 		.data		= &panic_print,