Message ID | 20230124185110.143857-14-ebiggers@kernel.org |
---|---|
State | New |
Headers |
From: Eric Biggers <ebiggers@kernel.org>
To: stable@vger.kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kees Cook <keescook@chromium.org>, SeongJae Park <sj@kernel.org>, Seth Jenkins <sethjenkins@google.com>, Jann Horn <jannh@google.com>, "Eric W . Biederman" <ebiederm@xmission.com>, linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org, Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH 5.15 13/20] exit: Put an upper limit on how often we can oops
Date: Tue, 24 Jan 2023 10:51:03 -0800
Message-Id: <20230124185110.143857-14-ebiggers@kernel.org>
In-Reply-To: <20230124185110.143857-1-ebiggers@kernel.org>
References: <20230124185110.143857-1-ebiggers@kernel.org>
List-ID: <linux-kernel.vger.kernel.org> |
Series | Backport oops_limit to 5.15 |
Commit Message
Eric Biggers
Jan. 24, 2023, 6:51 p.m. UTC
From: Jann Horn <jannh@google.com>

commit d4ccd54d28d3c8598e2354acc13e28c060961dbb upstream.

Many Linux systems are configured to not panic on oops; but allowing an
attacker to oops the system **really** often can make even bugs that look
completely unexploitable exploitable (like NULL dereferences and such) if
each crash elevates a refcount by one or a lock is taken in read mode, and
this causes a counter to eventually overflow.

The most interesting counters for this are 32 bits wide (like open-coded
refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
platforms is just 16 bits, but probably nobody cares about 32-bit platforms
that much nowadays.)

So let's panic the system if the kernel is constantly oopsing.

The speed of oopsing 2^32 times probably depends on several factors, like
how long the stack trace is and which unwinder you're using; an empirically
important one is whether your console is showing a graphical environment or
a text console that oopses will be printed to.
In a quick single-threaded benchmark, it looks like oopsing in a vfork()
child with a very short stack trace only takes ~510 microseconds per run
when a graphical console is active; but switching to a text console that
oopses are printed to slows it down around 87x, to ~45 milliseconds per
run.
(Adding more threads makes this faster, but the actual oops printing
happens under &die_lock on x86, so you can maybe speed this up by a factor
of around 2 and then any further improvement gets eaten up by lock
contention.)

It looks like it would take around 8-12 days to overflow a 32-bit counter
with repeated oopsing on a multi-core X86 system running a graphical
environment; both me (in an X86 VM) and Seth (with a distro kernel on
normal hardware in a standard configuration) got numbers in that ballpark.

12 days aren't *that* short on a desktop system, and you'd likely need much
longer on a typical server system (assuming that people don't run graphical
desktop environments on their servers), and this is a *very* noisy and
violent approach to exploiting the kernel; and it also seems to take orders
of magnitude longer on some machines, probably because stuff like EFI
pstore will slow it down a ton if that's active.

Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |  8 ++++
 kernel/exit.c                               | 43 +++++++++++++++++++++
 2 files changed, 51 insertions(+)
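The timing figures in the commit message can be sanity-checked with a little arithmetic. The sketch below is not part of the patch: `days_to_reach()` and `print_estimates()` are hypothetical helpers, and the inputs (~510 microseconds and ~45 milliseconds per oops, a speedup of about 2 from extra threads) are the rough numbers quoted in the commit message, taken at face value.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Days needed to trigger `count` oopses when each one takes
 * `usec_per_oops` microseconds and parallelism gives an assumed
 * `speedup` factor. Hypothetical helper, not kernel code.
 */
static double days_to_reach(double count, double usec_per_oops, double speedup)
{
	assert(usec_per_oops > 0.0 && speedup > 0.0);
	return count * usec_per_oops / speedup / 1e6 / 86400.0;
}

/* Print a few estimates based on the figures from the commit message. */
static void print_estimates(void)
{
	const double wrap32 = 4294967296.0;	/* 2^32 increments */

	printf("graphical console, 1 thread:  %.1f days\n",
	       days_to_reach(wrap32, 510.0, 1.0));
	printf("graphical console, ~2x speed: %.1f days\n",
	       days_to_reach(wrap32, 510.0, 2.0));
	printf("text console, 1 thread:       %.1f days\n",
	       days_to_reach(wrap32, 45000.0, 1.0));
	printf("default oops_limit (10000):   %.1f seconds\n",
	       days_to_reach(10000.0, 510.0, 1.0) * 86400.0);
}
```

Calling `print_estimates()` from a `main()` gives roughly 25 days single-threaded and under 13 days with a 2x speedup for a 32-bit wrap on a graphical console, the same order of magnitude as the 8-12 days reported. By contrast, the default `oops_limit` of 10000 is reached after about 5 seconds of continuous oopsing at that rate.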
Comments
On 25/01/23 12:21 am, Eric Biggers wrote:
> From: Jann Horn <jannh@google.com>
>
> commit d4ccd54d28d3c8598e2354acc13e28c060961dbb upstream.
>
> [...]

Hi,

Thanks for the backports.

I have tried backporting the oops_limit patches to LTS 5.15.y and had a
similar set of patches, just want to add a note here on an alternate way for
backporting this patch without resolving conflicts manually:

Here is the sequence:

* Patch 12: [panic: Separate sysctl logic from CONFIG_SMP]
  --> Cherry-pick Commit: 05ea0424f0e2 ("exit: Move oops specific logic from
  do_exit into make_task_dead") upstream
  --> Cherry-pick Commit: de77c3a5b95c ("exit: Move force_uaccess back into
  do_exit") upstream
* Patch 13 which is Commit: d4ccd54d28d3 ("exit: Put an upper limit on how
  often we can oops") upstream, will be a clean cherry-pick.

The benefit may be making future backports simpler in make_task_dead().

This was the only difference, so your backport looks good to me.

Regards,
Harshit
Hi Harshit,

On Wed, Jan 25, 2023 at 07:39:10PM +0530, Harshit Mogalapalli wrote:
>
> Thanks for the backports.
>
> I have tried backporting the oops_limit patches to LTS 5.15.y and had a
> similar set of patches, just want to add a note here on an alternate way for
> backporting this patch without resolving conflicts manually:
>
> Here is the sequence:
>
> * Patch 12: [panic: Separate sysctl logic from CONFIG_SMP]
> --> Cherry-pick Commit: 05ea0424f0e2 ("exit: Move oops specific logic from
> do_exit into make_task_dead") upstream
> --> Cherry-pick Commit: de77c3a5b95c ("exit: Move force_uaccess back into
> do_exit") upstream
> * Patch 13 which is Commit: d4ccd54d28d3 ("exit: Put an upper limit on how
> often we can oops") upstream, will be a clean cherry-pick.
>
> The benefit may be making future backports simpler in make_task_dead().
>
> This was the only difference, so your backport looks good to me.
>

It's certainly an option. The reason why I didn't do it that way is to reduce
the impact of any potential bugs where do_exit() is still called when the new
make_task_dead() function should be used instead. With my series, the effect is
just that oops_limit won't take effect in such cases. If we also backported
commit 05ea0424f0e2 ("exit: Move oops specific logic from do_exit into
make_task_dead"), then do_exit() would lose various other things, such as
panicking when called from an interrupt handler. That would increase the chance
of regressions, unless we made absolutely sure that everywhere that should be
using make_task_dead() is indeed using it instead of do_exit().

Commit 0e25498f8cd4 ("exit: Add and use make_task_dead."), which I backported,
did the vast majority of conversions to make_task_dead().

Some architectures still have uses of do_exit() that got cleaned up later,
though. It seems it was mostly unreachable code, and some cases that should
have been doing something else such as BUG() or sending a signal to userspace.
So, generally not super important cases.

Still, getting all that would bring in many more patches. We could do that, but
since this is already a 20-patch series, I wanted to limit the scope a bit.
These extra patches could always be backported later on top of this if desired.

- Eric
Hi Eric,

On 26/01/23 12:14 am, Eric Biggers wrote:
> Hi Harshit,
>
> On Wed, Jan 25, 2023 at 07:39:10PM +0530, Harshit Mogalapalli wrote:
>>
>> Thanks for the backports.
>>
>> I have tried backporting the oops_limit patches to LTS 5.15.y and had a
>> similar set of patches, just want to add a note here on an alternate way for
>> backporting this patch without resolving conflicts manually:
>>
>> Here is the sequence:
>>
>> * Patch 12: [panic: Separate sysctl logic from CONFIG_SMP]
>> --> Cherry-pick Commit: 05ea0424f0e2 ("exit: Move oops specific logic from
>> do_exit into make_task_dead") upstream
>> --> Cherry-pick Commit: de77c3a5b95c ("exit: Move force_uaccess back into
>> do_exit") upstream
>> * Patch 13 which is Commit: d4ccd54d28d3 ("exit: Put an upper limit on how
>> often we can oops") upstream, will be a clean cherry-pick.
>>
>> The benefit may be making future backports simpler in make_task_dead().
>>
>> This was the only difference, so your backport looks good to me.
>>
>
> It's certainly an option. The reason why I didn't do it that way is to reduce
> the impact of any potential bugs where do_exit() is still called when the new
> make_task_dead() function should be used instead. With my series, the effect is
> just that oops_limit won't take effect in such cases. If we also backported
> commit 05ea0424f0e2 ("exit: Move oops specific logic from do_exit into
> make_task_dead"), then do_exit() would lose various other things, such as
> panicking when called from an interrupt handler. That would increase the chance
> of regressions, unless we made absolutely sure that everywhere that should be
> using make_task_dead() is indeed using it instead of do_exit().
>
> Commit 0e25498f8cd4 ("exit: Add and use make_task_dead."), which I backported,
> did the vast majority of conversions to make_task_dead().
>
> Some architectures still have uses of do_exit() that got cleaned up later,
> though. It seems it was mostly unreachable code, and some cases that should
> have been doing something else such as BUG() or sending a signal to userspace.
> So, generally not super important cases.
>

Thanks a lot for explaining!

> Still, getting all that would bring in many more patches. We could do that, but
> since this is already a 20-patch series, I wanted to limit the scope a bit.
> These extra patches could always be backported later on top of this if desired.
>

Sure.

Regards,
Harshit

> - Eric
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 609b891754081..b6e68d6f297e5 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -671,6 +671,14 @@ This is the default behavior.
 an oops event is detected.
 
 
+oops_limit
+==========
+
+Number of kernel oopses after which the kernel should panic when
+``panic_on_oops`` is not set. Setting this to 0 or 1 has the same effect
+as setting ``panic_on_oops=1``.
+
+
 osrelease, ostype & version
 ===========================
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d1a507fd4bae..172d7f835f801 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -69,6 +69,33 @@
 #include <asm/unistd.h>
 #include <asm/mmu_context.h>
 
+/*
+ * The default value should be high enough to not crash a system that randomly
+ * crashes its kernel from time to time, but low enough to at least not permit
+ * overflowing 32-bit refcounts or the ldsem writer count.
+ */
+static unsigned int oops_limit = 10000;
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table kern_exit_table[] = {
+	{
+		.procname       = "oops_limit",
+		.data           = &oops_limit,
+		.maxlen         = sizeof(oops_limit),
+		.mode           = 0644,
+		.proc_handler   = proc_douintvec,
+	},
+	{ }
+};
+
+static __init int kernel_exit_sysctls_init(void)
+{
+	register_sysctl_init("kernel", kern_exit_table);
+	return 0;
+}
+late_initcall(kernel_exit_sysctls_init);
+#endif
+
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
@@ -879,10 +906,26 @@ EXPORT_SYMBOL_GPL(do_exit);
 
 void __noreturn make_task_dead(int signr)
 {
+	static atomic_t oops_count = ATOMIC_INIT(0);
+
 	/*
 	 * Take the task off the cpu after something catastrophic has
 	 * happened.
 	 */
+
+	/*
+	 * Every time the system oopses, if the oops happens while a reference
+	 * to an object was held, the reference leaks.
+	 * If the oops doesn't also leak memory, repeated oopsing can cause
+	 * reference counters to wrap around (if they're not using refcount_t).
+	 * This means that repeated oopsing can make unexploitable-looking bugs
+	 * exploitable through repeated oopsing.
+	 * To make sure this can't happen, place an upper bound on how often the
+	 * kernel may oops without panic().
+	 */
+	if (atomic_inc_return(&oops_count) >= READ_ONCE(oops_limit))
+		panic("Oopsed too often (kernel.oops_limit is %d)", oops_limit);
+
 	do_exit(signr);
 }
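
The check added to make_task_dead() can be modeled in userspace to see how the limit behaves at small values. This is only a sketch using C11 `<stdatomic.h>`, not kernel code: `struct oops_model`, `oops_would_panic()` and `oopses_until_panic()` are hypothetical names, and `atomic_fetch_add(...) + 1` stands in for the kernel's `atomic_inc_return()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the static oops_count and the oops_limit sysctl. */
struct oops_model {
	atomic_int count;
	unsigned int limit;
};

/*
 * Record one oops and report whether the modeled kernel would panic.
 * The signed counter is compared against the unsigned limit, which is
 * what the usual C conversions do for the comparison in the patch.
 */
static bool oops_would_panic(struct oops_model *m)
{
	int count;

	assert(m);
	/* atomic_fetch_add() returns the old value, so add 1 to get the new one. */
	count = atomic_fetch_add(&m->count, 1) + 1;
	return (unsigned int)count >= m->limit;
}

/* Number of oopses the model tolerates before it would panic. */
static unsigned int oopses_until_panic(unsigned int limit)
{
	struct oops_model m = { .limit = limit };
	unsigned int n = 0;

	atomic_init(&m.count, 0);
	do {
		n++;
	} while (!oops_would_panic(&m));
	return n;
}
```

Under this model, `oopses_until_panic(0)` and `oopses_until_panic(1)` both return 1, which matches the documentation's statement that a limit of 0 or 1 behaves like `panic_on_oops=1`. With the default of 10000, the model panics on the 10000th oops.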