Message ID | 20230127162409.2505312-1-elver@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp923015wrn; Fri, 27 Jan 2023 08:29:55 -0800 (PST) X-Google-Smtp-Source: AK7set+eTR0JrgoAP/8+RAlShZJMM+cfz/LUQ3QplQOUOB9WbcZtuogppZnzHL+YGjGtEIWh99ra X-Received: by 2002:a05:6402:530a:b0:49f:f60d:1692 with SMTP id eo10-20020a056402530a00b0049ff60d1692mr15514439edb.24.1674836995386; Fri, 27 Jan 2023 08:29:55 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1674836995; cv=none; d=google.com; s=arc-20160816; b=aztZ6sgWvUCTx0w7wsaOHA7S8Q3YaeoLQcrSIeKVL4dn2bs5F6qO2Cfq7/B5M8iq9F h5IAPJMZ4AGDBVkcjoWhmwc+hWPHt0E7ZkoO4SrQvv5WfgY4pLn39t2gPXFm429+0/zT PXfZ+b7OXbbS7blAyEnWPYnpznv1HmitQP7JgRO+/tU4VD1TBHMS9ifg69BaBV9+EBDR zMWS+a9Tzl6dAgBKYZk7wDOgk5zWgvFfdZWiKl6Knh+qLVx4AJjcP/UYkfEYn2Q0F3WY K0+kGOBu6p/4R2t8t004upevEC5nD9IcJza2GmlpV9lM+2HsN3hCi5oggHku9yDDvsNs F9Ww== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:mime-version:date :dkim-signature; bh=+ZA1LVGRnTSLVUFMxUOcf61woTjOoeoiKjydCG1Ukv0=; b=Xq0iH0VHqDRPUqcWR7TYKPYyOwM+niZlyToL1t4nyd+z6qcEUN4CAMRJ9GD6zEAzlL JzknKmVCpfBuSbyO7jcRNvvvC2LMwItvwS5VlEKHFSTsMQqpQxBELPTf+VwDb9weIRhu wY4z5QF7FRpyB9sWSWzWpx2cjrmMA/XHSpsjby2jtDe/e/mv22L0EIR3UZ2/U2Ah/Ikt HWCQVwqmbx8jAcBP8ZFvppGAZLxHZprwMvZBD5zmfvbv8kb6OYv4y2+NfOPXjd2EmtP7 bCVLe0+Gy+pA58NtkXreRYDf19eX/L/Bb4ka5AbEivY4i3RmbKHVyBVpK3VgR8Dx2Y1q VcSQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=fYvcrw82; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c7-20020a0564021f8700b0049eeaaa3f27si5845236edc.433.2023.01.27.08.29.31; Fri, 27 Jan 2023 08:29:55 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=fYvcrw82; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234851AbjA0QZb (ORCPT <rfc822;lekhanya01809@gmail.com> + 99 others); Fri, 27 Jan 2023 11:25:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52370 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234914AbjA0QYj (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 27 Jan 2023 11:24:39 -0500 Received: from mail-ej1-x64a.google.com (mail-ej1-x64a.google.com [IPv6:2a00:1450:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EABBC84183 for <linux-kernel@vger.kernel.org>; Fri, 27 Jan 2023 08:24:32 -0800 (PST) Received: by mail-ej1-x64a.google.com with SMTP id kt9-20020a1709079d0900b00877f4df5aceso3729773ejc.21 for <linux-kernel@vger.kernel.org>; Fri, 27 Jan 2023 08:24:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:from:to:cc:subject :date:message-id:reply-to; bh=+ZA1LVGRnTSLVUFMxUOcf61woTjOoeoiKjydCG1Ukv0=; b=fYvcrw82fbpBOjdM7Tufn3QIumdGltcGOVcJOL3P91Mb+xN6AWlMaU0wegl4Bp5KLz oNHmAmw9YROlzBPiV0oWvxpRdsFJcDiqZ7aSNfalpbRwE0u7DdLVObfFpbFnn2ofMWSZ dn48cTiy+XGJ1NtCdRitiY7fn40MvVkGGeg+HdRJvaxsy1nKpxfFjKR+73oBXa+uENDb ninwsim5zsfTBvw+FSH4bUOiwB32kJknPOY3XrWX6tGbwgDQlPNsCfNi8C5fU4tsQn0A lu9VjPCECckHi2eIpzFGrrg+7W32eKTyjhD0TsX0d5tXIrnuh2a9yftLUCIgSPMKhYqV mYGQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=+ZA1LVGRnTSLVUFMxUOcf61woTjOoeoiKjydCG1Ukv0=; b=qKcy2+4BnmPwEelzH5Kd4sZQANmYkfzyuQc0IIf0N4zeQBJw6yq+3f0pcszoUwtIF4 zRBo4hLTMOuh0j2DgWikE65Ertzu26kLwvWSWfr2GUDh19NFzoDP4NiZe6iO7NXAlvIl CDGKNz4IMNPtbHvrggzQtVTBmadvjmKtgxaJHYhUz1PO1WO9Ir+qC687Ehq7B9qLiEvD mcjVs7e/FON2rjpK9+fGoqwACyQtGET7BLbRZBTppDIiIojXDk0tMlkDwEiWXzKRN8Ce lLxC30j4O4a1xjcqf+zuBTKr2XwzZJcRfQgdGDh9MwTHP/IGQLW5GBB3TiOmxFJ/bqgO ZXkA== X-Gm-Message-State: AO0yUKXgrYYYd2E7ZkUglHfujArT5PjY/Hiz8Fx8BUhcxXiJpA+YWhDt hCTNdS+BBw550zHsTKYTLOYbElET+g== X-Received: from elver.muc.corp.google.com ([2a00:79e0:9c:201:19a:e00e:4f5d:59d8]) (user=elver job=sendgmr) by 2002:a05:6402:b63:b0:4a0:e4bd:6ea4 with SMTP id cb3-20020a0564020b6300b004a0e4bd6ea4mr1477810edb.35.1674836671255; Fri, 27 Jan 2023 08:24:31 -0800 (PST) Date: Fri, 27 Jan 2023 17:24:09 +0100 Mime-Version: 1.0 X-Mailer: git-send-email 2.39.1.456.gfc5497dd1b-goog Message-ID: <20230127162409.2505312-1-elver@google.com> Subject: [PATCH v2] perf: Allow restricted kernel breakpoints on user addresses From: Marco Elver <elver@google.com> To: elver@google.com, Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, Dmitry Vyukov <dvyukov@google.com>, kasan-dev@googlegroups.com, Jann Horn <jannh@google.com>, Thomas Gleixner <tglx@linutronix.de>, Andrey Konovalov <andreyknvl@gmail.com> Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1756193877284744698?= X-GMAIL-MSGID: =?utf-8?q?1756193877284744698?= |
Series |
[v2] perf: Allow restricted kernel breakpoints on user addresses
|
|
Commit Message
Marco Elver
Jan. 27, 2023, 4:24 p.m. UTC
Allow the creation of restricted breakpoint perf events that also fire in the kernel (perf_event_attr::exclude_kernel=0), if: 1. No sample information is requested; samples may contain IPs, registers, or other information that may disclose kernel addresses. 2. The breakpoint (viz. data watchpoint) is on a user address. The rules constrain the allowable perf events such that no sensitive kernel information can be disclosed. Despite no explicit kernel information disclosure, the following questions may need answers: 1. Q: Is obtaining information that the kernel accessed a particular user's known memory location revealing new information? A: Given the kernel's user space ABI, there should be no "surprise accesses" to user space memory in the first place. 2. Q: Does causing breakpoints on user memory accesses by the kernel potentially impact timing in a sensitive way? A: Since hardware breakpoints trigger regardless of the state of perf_event_attr::exclude_kernel, but are filtered in the perf subsystem, this possibility already exists independent of the proposed change. Motivation: Data breakpoints on user addresses that also fire in the kernel provide complete coverage to track and debug accesses, not just in user space but also through the kernel. For example, tracking where user space invokes syscalls with pointers to specific memory. Breakpoints can be used for more complex dynamic analysis, such as race detection, memory-safety error detection, or data-flow analysis. Larger deployment by linking such dynamic analysis into binaries in production only becomes possible when no additional capabilities are required by unprivileged users. To improve coverage, it should then also be possible to enable breakpoints on user addresses that fire in the kernel with no additional capabilities. Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> --- Changelog ~~~~~~~~~ v2: * Commit message (motivation, more explanation). * Apply ack. v1: https://lkml.kernel.org/r/20220902100057.404817-1-elver@google.com * Rebase. RFC: https://lkml.kernel.org/r/20220601093502.364142-1-elver@google.com --- include/linux/perf_event.h | 8 +------- kernel/events/core.c | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+), 7 deletions(-)
Comments
Hi Marco, Apologies for having not replies on v1... On Fri, Jan 27, 2023 at 05:24:09PM +0100, Marco Elver wrote: > Allow the creation of restricted breakpoint perf events that also fire > in the kernel (perf_event_attr::exclude_kernel=0), if: > > 1. No sample information is requested; samples may contain IPs, > registers, or other information that may disclose kernel addresses. > > 2. The breakpoint (viz. data watchpoint) is on a user address. I think there's a potential problem here w.r.t. what constitutes a "user address". Below, the patch assumes that any address which access_ok() is happy with is a user address, but that's not always the case, and it's not necessarily always safe to allow watchpoints on such addresses. For example, UEFI runtime services may live in low adddresses below TASK_SIZE_MAX, and there are times when we run code in an idmap (or other low-half mapping) when we cannot safely take an exception for things like idle, suspend, kexec, pagetable rewriting on arm64, etc. So I think this may introduce functional issues (e.g. a mechanism to crash the kernel) in addition to any potential information disclosure, and I would not want this to be generally available to unprivileged users. Most of those happen in kernel threads, but they can also happen in the context of user threads (e.g. if triggering suspend/idle via sysfs), so special care will be needed, as above. > The rules constrain the allowable perf events such that no sensitive > kernel information can be disclosed. > > Despite no explicit kernel information disclosure, the following > questions may need answers: > > 1. Q: Is obtaining information that the kernel accessed a particular > user's known memory location revealing new information? > > A: Given the kernel's user space ABI, there should be no "surprise > accesses" to user space memory in the first place. I think that may be true for userspace, but not true for other transient mappings in the low half of the address space. Ignoring the functional concern above, for idmap'd code this would at least provide a mechanism to probe for the phyiscal address of that code (and by extension, reveal the phyiscal location of the entire kernel). > 2. Q: Does causing breakpoints on user memory accesses by the kernel > potentially impact timing in a sensitive way? > > A: Since hardware breakpoints trigger regardless of the state of > perf_event_attr::exclude_kernel, but are filtered in the perf > subsystem, this possibility already exists independent of the > proposed change. Hmm... arm64's HW breakpoints and watchpoints have HW privilege filters, so I'm not sure the above statement is generally/necessarily true. > Motivation: Data breakpoints on user addresses that also fire in the > kernel provide complete coverage to track and debug accesses, not just > in user space but also through the kernel. For example, tracking where > user space invokes syscalls with pointers to specific memory. > > Breakpoints can be used for more complex dynamic analysis, such as race > detection, memory-safety error detection, or data-flow analysis. Larger > deployment by linking such dynamic analysis into binaries in production > only becomes possible when no additional capabilities are required by > unprivileged users. To improve coverage, it should then also be possible > to enable breakpoints on user addresses that fire in the kernel with no > additional capabilities. I can understand the argument for watchpoints (modulo my concerns above), but there's no need to support instruction breakpoints, right? i.e. there's no legitimate reason for a user to want to monitor a given user address system-wide, regardless of what's running? IIUC this only makes sense for watchpoints, and only in the context of a given task. Thanks, Mark. > Acked-by: Dmitry Vyukov <dvyukov@google.com> > Signed-off-by: Marco Elver <elver@google.com> > --- > > Changelog > ~~~~~~~~~ > > v2: > * Commit message (motivation, more explanation). > * Apply ack. > > v1: https://lkml.kernel.org/r/20220902100057.404817-1-elver@google.com > * Rebase. > > RFC: https://lkml.kernel.org/r/20220601093502.364142-1-elver@google.com > --- > include/linux/perf_event.h | 8 +------- > kernel/events/core.c | 38 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 39 insertions(+), 7 deletions(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index c6a3bac76966..a95a6b889b00 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1463,13 +1463,7 @@ static inline int perf_is_paranoid(void) > return sysctl_perf_event_paranoid > -1; > } > > -static inline int perf_allow_kernel(struct perf_event_attr *attr) > -{ > - if (sysctl_perf_event_paranoid > 1 && !perfmon_capable()) > - return -EACCES; > - > - return security_perf_event_open(attr, PERF_SECURITY_KERNEL); > -} > +extern int perf_allow_kernel(struct perf_event_attr *attr); > > static inline int perf_allow_cpu(struct perf_event_attr *attr) > { > diff --git a/kernel/events/core.c b/kernel/events/core.c > index d56328e5080e..0f1fc9aef294 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -3174,6 +3174,12 @@ static int perf_event_modify_attr(struct perf_event *event, > return -EOPNOTSUPP; > } > > + if (!event->attr.exclude_kernel) { > + err = perf_allow_kernel(attr); > + if (err) > + return err; > + } > + > WARN_ON_ONCE(event->ctx->parent_ctx); > > mutex_lock(&event->child_mutex); > @@ -12289,6 +12295,38 @@ perf_check_permission(struct perf_event_attr *attr, struct task_struct *task) > return is_capable || ptrace_may_access(task, ptrace_mode); > } > > +/* > + * Check if unprivileged users are allowed to set up breakpoints on user > + * addresses that also count when the kernel accesses them. > + */ > +static bool perf_allow_kernel_breakpoint(struct perf_event_attr *attr) > +{ > + if (attr->type != PERF_TYPE_BREAKPOINT) > + return false; > + > + /* > + * The sample may contain IPs, registers, or other information that may > + * disclose kernel addresses or timing information. Disallow any kind of > + * additional sample information. > + */ > + if (attr->sample_type) > + return false; > + > + /* > + * Only allow kernel breakpoints on user addresses. > + */ > + return access_ok((void __user *)(unsigned long)attr->bp_addr, attr->bp_len); > +} > + > +int perf_allow_kernel(struct perf_event_attr *attr) > +{ > + if (sysctl_perf_event_paranoid > 1 && !perfmon_capable() && > + !perf_allow_kernel_breakpoint(attr)) > + return -EACCES; > + > + return security_perf_event_open(attr, PERF_SECURITY_KERNEL); > +} > + > /** > * sys_perf_event_open - open a performance event, associate it to a task/cpu > * > -- > 2.39.1.456.gfc5497dd1b-goog >
On Fri, 27 Jan 2023 at 19:14, Mark Rutland <mark.rutland@arm.com> wrote: > > Hi Marco, > > Apologies for having not replies on v1... > > On Fri, Jan 27, 2023 at 05:24:09PM +0100, Marco Elver wrote: > > Allow the creation of restricted breakpoint perf events that also fire > > in the kernel (perf_event_attr::exclude_kernel=0), if: > > > > 1. No sample information is requested; samples may contain IPs, > > registers, or other information that may disclose kernel addresses. > > > > 2. The breakpoint (viz. data watchpoint) is on a user address. > > I think there's a potential problem here w.r.t. what constitutes a "user > address". Below, the patch assumes that any address which access_ok() is happy > with is a user address, but that's not always the case, and it's not > necessarily always safe to allow watchpoints on such addresses. Isn't that a deficiency with access_ok()? https://www.kernel.org/doc/html/latest/core-api/mm-api.html#c.access_ok "Checks if a pointer to a block of memory in user space is valid. [...]" > For example, UEFI runtime services may live in low adddresses below > TASK_SIZE_MAX, and there are times when we run code in an idmap (or other > low-half mapping) when we cannot safely take an exception for things like idle, > suspend, kexec, pagetable rewriting on arm64, etc. > > So I think this may introduce functional issues (e.g. a mechanism to crash the > kernel) in addition to any potential information disclosure, and I would not > want this to be generally available to unprivileged users. > > Most of those happen in kernel threads, but they can also happen in the context > of user threads (e.g. if triggering suspend/idle via sysfs), so special care > will be needed, as above. These are good points. > > The rules constrain the allowable perf events such that no sensitive > > kernel information can be disclosed. > > > > Despite no explicit kernel information disclosure, the following > > questions may need answers: > > > > 1. Q: Is obtaining information that the kernel accessed a particular > > user's known memory location revealing new information? > > > > A: Given the kernel's user space ABI, there should be no "surprise > > accesses" to user space memory in the first place. > > I think that may be true for userspace, but not true for other transient > mappings in the low half of the address space. Ignoring the functional concern > above, for idmap'd code this would at least provide a mechanism to probe for > the phyiscal address of that code (and by extension, reveal the phyiscal > location of the entire kernel). This again feels like a deficiency with access_ok(). Is there a better primitive than access_ok(), or can we have something that gives us the guarantee that whatever it says is "ok" is a userspace address? > > 2. Q: Does causing breakpoints on user memory accesses by the kernel > > potentially impact timing in a sensitive way? > > > > A: Since hardware breakpoints trigger regardless of the state of > > perf_event_attr::exclude_kernel, but are filtered in the perf > > subsystem, this possibility already exists independent of the > > proposed change. > > Hmm... arm64's HW breakpoints and watchpoints have HW privilege filters, so I'm > not sure the above statement is generally/necessarily true. Right, I can see this being a valid concern on those architectures that do support HW privilege filters. > > Motivation: Data breakpoints on user addresses that also fire in the > > kernel provide complete coverage to track and debug accesses, not just > > in user space but also through the kernel. For example, tracking where > > user space invokes syscalls with pointers to specific memory. > > > > Breakpoints can be used for more complex dynamic analysis, such as race > > detection, memory-safety error detection, or data-flow analysis. Larger > > deployment by linking such dynamic analysis into binaries in production > > only becomes possible when no additional capabilities are required by > > unprivileged users. To improve coverage, it should then also be possible > > to enable breakpoints on user addresses that fire in the kernel with no > > additional capabilities. > > I can understand the argument for watchpoints (modulo my concerns above), but > there's no need to support instruction breakpoints, right? i.e. there's no > legitimate reason for a user to want to monitor a given user address > system-wide, regardless of what's running? > > IIUC this only makes sense for watchpoints, and only in the context of a given > task. Right, there shouldn't be a need for instruction breakpoints, the kernel shouldn't be executing user code. Thanks, -- Marco
On Mon, Jan 30, 2023 at 08:00:00AM +0100, Marco Elver wrote: > On Fri, 27 Jan 2023 at 19:14, Mark Rutland <mark.rutland@arm.com> wrote: > > > > Hi Marco, > > > > Apologies for having not replies on v1... > > > > On Fri, Jan 27, 2023 at 05:24:09PM +0100, Marco Elver wrote: > > > Allow the creation of restricted breakpoint perf events that also fire > > > in the kernel (perf_event_attr::exclude_kernel=0), if: > > > > > > 1. No sample information is requested; samples may contain IPs, > > > registers, or other information that may disclose kernel addresses. > > > > > > 2. The breakpoint (viz. data watchpoint) is on a user address. > > > > I think there's a potential problem here w.r.t. what constitutes a "user > > address". Below, the patch assumes that any address which access_ok() is happy > > with is a user address, but that's not always the case, and it's not > > necessarily always safe to allow watchpoints on such addresses. > > Isn't that a deficiency with access_ok()? > > https://www.kernel.org/doc/html/latest/core-api/mm-api.html#c.access_ok > "Checks if a pointer to a block of memory in user space is valid. [...]" Arguably yes, but it's not really solvable in the current API design. One issue is that this is contextual, and access_ok() is implicitly limited to some scenarios but not others. It's not meant to work for arbitrarty pointers in arbitrary contexts (as e.g. it has no way of distinguishing an idmap from userspace). We largely don't take implicit context into account in access_ok(), other than the tag removal stuff we do on arm64 (and on x86 for LAM), and I don't think anyone was all that happy about extending it for that. > > For example, UEFI runtime services may live in low adddresses below > > TASK_SIZE_MAX, and there are times when we run code in an idmap (or other > > low-half mapping) when we cannot safely take an exception for things like idle, > > suspend, kexec, pagetable rewriting on arm64, etc. > > > > So I think this may introduce functional issues (e.g. a mechanism to crash the > > kernel) in addition to any potential information disclosure, and I would not > > want this to be generally available to unprivileged users. > > > > Most of those happen in kernel threads, but they can also happen in the context > > of user threads (e.g. if triggering suspend/idle via sysfs), so special care > > will be needed, as above. > > These are good points. > > > > The rules constrain the allowable perf events such that no sensitive > > > kernel information can be disclosed. > > > > > > Despite no explicit kernel information disclosure, the following > > > questions may need answers: > > > > > > 1. Q: Is obtaining information that the kernel accessed a particular > > > user's known memory location revealing new information? > > > > > > A: Given the kernel's user space ABI, there should be no "surprise > > > accesses" to user space memory in the first place. > > > > I think that may be true for userspace, but not true for other transient > > mappings in the low half of the address space. Ignoring the functional concern > > above, for idmap'd code this would at least provide a mechanism to probe for > > the phyiscal address of that code (and by extension, reveal the phyiscal > > location of the entire kernel). > > This again feels like a deficiency with access_ok(). Is there a better > primitive than access_ok(), or can we have something that gives us the > guarantee that whatever it says is "ok" is a userspace address? I don't think so, since this is contextual and temporal -- a helper can't give a single correct answert in all cases because it could change. In the cases we switch to another mapping, we could try to ensure that we enable/disable potentially unsafe watchpoints/breakpoints. Taking a look at arm64, our idmap code might actually be ok, since we usually mask all the DAIF bits (and the 'D' or 'Debug' bit masks HW breakpoints/watchpoints). For EFI we largely switch to another thread (but not always), so that would need some auditing. So if this only needs to work in per-task mode rather than system-wide mode, I reckon we can have some save/restore logic around those special cases where we transiently install a mapping, which would protect us. For the threads that run with special mappings in the low half, I'm not sure what to do. If we've ruled out system-wide monitoring I believe those would be protected from unprivileged users. Thanks, Mark. > > > 2. Q: Does causing breakpoints on user memory accesses by the kernel > > > potentially impact timing in a sensitive way? > > > > > > A: Since hardware breakpoints trigger regardless of the state of > > > perf_event_attr::exclude_kernel, but are filtered in the perf > > > subsystem, this possibility already exists independent of the > > > proposed change. > > > > Hmm... arm64's HW breakpoints and watchpoints have HW privilege filters, so I'm > > not sure the above statement is generally/necessarily true. > > Right, I can see this being a valid concern on those architectures > that do support HW privilege filters. > > > > Motivation: Data breakpoints on user addresses that also fire in the > > > kernel provide complete coverage to track and debug accesses, not just > > > in user space but also through the kernel. For example, tracking where > > > user space invokes syscalls with pointers to specific memory. > > > > > > Breakpoints can be used for more complex dynamic analysis, such as race > > > detection, memory-safety error detection, or data-flow analysis. Larger > > > deployment by linking such dynamic analysis into binaries in production > > > only becomes possible when no additional capabilities are required by > > > unprivileged users. To improve coverage, it should then also be possible > > > to enable breakpoints on user addresses that fire in the kernel with no > > > additional capabilities. > > > > I can understand the argument for watchpoints (modulo my concerns above), but > > there's no need to support instruction breakpoints, right? i.e. there's no > > legitimate reason for a user to want to monitor a given user address > > system-wide, regardless of what's running? > > > > IIUC this only makes sense for watchpoints, and only in the context of a given > > task. > > Right, there shouldn't be a need for instruction breakpoints, the > kernel shouldn't be executing user code. > > Thanks, > -- Marco
On Mon, 30 Jan 2023 at 11:46, Mark Rutland <mark.rutland@arm.com> wrote: [...] > > This again feels like a deficiency with access_ok(). Is there a better > > primitive than access_ok(), or can we have something that gives us the > > guarantee that whatever it says is "ok" is a userspace address? > > I don't think so, since this is contextual and temporal -- a helper can't give > a single correct answert in all cases because it could change. That's fair, but unfortunate. Just curious: would copy_from_user_nofault() reliably fail if it tries to access one of those mappings but where access_ok() said "ok"? Though that would probably restrict us to only creating watchpoints for addresses that are actually mapped in the task. > In the cases we switch to another mapping, we could try to ensure that we > enable/disable potentially unsafe watchpoints/breakpoints. That seems it'd be too hard to reason that it's 100% safe, everywhere, on every arch. I'm still convinced we can prohibit creation of such watchpoints in the first place, but need something other than access_ok(). > Taking a look at arm64, our idmap code might actually be ok, since we usually > mask all the DAIF bits (and the 'D' or 'Debug' bit masks HW > breakpoints/watchpoints). For EFI we largely switch to another thread (but not > always), so that would need some auditing. > > So if this only needs to work in per-task mode rather than system-wide mode, I > reckon we can have some save/restore logic around those special cases where we > transiently install a mapping, which would protect us. It should only work in per-task mode. > For the threads that run with special mappings in the low half, I'm not sure > what to do. If we've ruled out system-wide monitoring I believe those would be > protected from unprivileged users. Can the task actually access those special mappings, or is it only accessible by the kernel? Thanks, -- Marco
On Wed, 1 Feb 2023 at 10:34, Marco Elver <elver@google.com> wrote: > > On Mon, 30 Jan 2023 at 11:46, Mark Rutland <mark.rutland@arm.com> wrote: > [...] > > > This again feels like a deficiency with access_ok(). Is there a better > > > primitive than access_ok(), or can we have something that gives us the > > > guarantee that whatever it says is "ok" is a userspace address? > > > > I don't think so, since this is contextual and temporal -- a helper can't give > > a single correct answert in all cases because it could change. > > That's fair, but unfortunate. Just curious: would > copy_from_user_nofault() reliably fail if it tries to access one of > those mappings but where access_ok() said "ok"? I also wonder if these special mappings are ever accessible in a user task context? If yes, can a racing process_vm_readv/writev mess with these special mappings? We could use copy_from_user() to probe that the watchpoint address is legit. But I think the memory can be potentially PROT_NONE but still legit, so copy_from_user() won't work for these corner cases. > Though that would probably restrict us to only creating watchpoints > for addresses that are actually mapped in the task. > > > In the cases we switch to another mapping, we could try to ensure that we > > enable/disable potentially unsafe watchpoints/breakpoints. > > That seems it'd be too hard to reason that it's 100% safe, everywhere, > on every arch. I'm still convinced we can prohibit creation of such > watchpoints in the first place, but need something other than > access_ok(). > > > Taking a look at arm64, our idmap code might actually be ok, since we usually > > mask all the DAIF bits (and the 'D' or 'Debug' bit masks HW > > breakpoints/watchpoints). For EFI we largely switch to another thread (but not > > always), so that would need some auditing. > > > > So if this only needs to work in per-task mode rather than system-wide mode, I > > reckon we can have some save/restore logic around those special cases where we > > transiently install a mapping, which would protect us. > > It should only work in per-task mode. > > > For the threads that run with special mappings in the low half, I'm not sure > > what to do. If we've ruled out system-wide monitoring I believe those would be > > protected from unprivileged users. > > Can the task actually access those special mappings, or is it only > accessible by the kernel? > > Thanks, > -- Marco
On Wed, Feb 01, 2023 at 10:33:40AM +0100, Marco Elver wrote: > On Mon, 30 Jan 2023 at 11:46, Mark Rutland <mark.rutland@arm.com> wrote: > [...] > > > This again feels like a deficiency with access_ok(). Is there a better > > > primitive than access_ok(), or can we have something that gives us the > > > guarantee that whatever it says is "ok" is a userspace address? > > > > I don't think so, since this is contextual and temporal -- a helper can't give > > a single correct answert in all cases because it could change. One thing I just realised to note -- these mappings are installed in a distinct set of page tables that the kernel transiently switches to within the context of a task, they're not inside the same page tables as userspace associated with that task. So you can have distinct mappings at the same VA at different times. > That's fair, but unfortunate. Yup. :) > Just curious: would copy_from_user_nofault() reliably fail if it tries to > access one of those mappings but where access_ok() said "ok"? Generally, no. Most architectures don't have special instructions for accessing user memory specifically and are reliant on people not making uaccesses while such mappings are installed. That's generally enforced by mutual exclusion; userspace can't issue any new syscalls within the context of that task since it isn't executing while the special mappings are installed, and usually IRQs would be disabled, preventing IPIs and such. There *might* be a latent issue with interruptible EFI runtime services. On arm64, yes. Our uacccess routines including copy_from_user_nofault() use out `LDTR` and `STTR` instructions, which use the same permissions as accesses from userspace, and we create the special mappings without user access permissions, so any uaccess to those will fault. There are some special cases (e.g. the futex code), but those are never invoked in a context where the special mappings are in place. > Though that would probably restrict us to only creating watchpoints > for addresses that are actually mapped in the task. As above, since this is contextual and temporal, that wouldn't actually protect us. Consider a user task with something mapped at 0xCAFEF00D: * access_ok(0xCAFEF00D, 1) is true * copy_from_user_nofault(dst, 0xCAFEF00D, 1) succeeds without faulting. ... so we would be able to install a watchpoint. However, after this the task might *transiently* use a different mapping (e.g. the idmap), which could have an unrelated mapping at 0xCAFEF00D (for which copy_from_user_nofault() would fault). > > In the cases we switch to another mapping, we could try to ensure that we > > enable/disable potentially unsafe watchpoints/breakpoints. > > That seems it'd be too hard to reason that it's 100% safe, everywhere, > on every arch. I'm still convinced we can prohibit creation of such > watchpoints in the first place, but need something other than > access_ok(). As above, I don't think that can be an ahead-of-time check. If we want the watchpoints to fire on kernel-mode accesses to user memory, we need a temporal boundary around when userspace mappings are transiently switched with other mappings. While that's arch specific, there are relatively few places that do that switch. > > Taking a look at arm64, our idmap code might actually be ok, since we usually > > mask all the DAIF bits (and the 'D' or 'Debug' bit masks HW > > breakpoints/watchpoints). For EFI we largely switch to another thread (but not > > always), so that would need some auditing. > > > > So if this only needs to work in per-task mode rather than system-wide mode, I > > reckon we can have some save/restore logic around those special cases where we > > transiently install a mapping, which would protect us. > > It should only work in per-task mode. Ok, that makes the problem much simpler; with that in mind arm64 might already be safe today. That rules out a user task trying to monitor a kthread, which is the common case (e.g. most EFI RTS calls or use of the idmap for idle). There are a few rare cases where we do this within the context of a user task. In those cases we're already doing a bunch of work to transiently switch page tables and other state, so we could add some hooks to transiently disable watchpoints and call those at the same time. > > For the threads that run with special mappings in the low half, I'm not sure > > what to do. If we've ruled out system-wide monitoring I believe those would be > > protected from unprivileged users. > > Can the task actually access those special mappings, or is it only > accessible by the kernel? They're only accessible by the kernel, and are not accessible by a uaccess or actual userspace access. As above, they're in a distinct set of page tables (so not accessible from other threads within the same process), and they're mapped with kernel permissions, so the uaccess routines should fault. Thanks, Mark.
Hi Dmitry, We raced to reply here, so there's more detail in my reply to Marco. I'm providing minimal detail here, sorry for being terse! :) On Wed, Feb 01, 2023 at 10:53:44AM +0100, Dmitry Vyukov wrote: > On Wed, 1 Feb 2023 at 10:34, Marco Elver <elver@google.com> wrote: > > > > On Mon, 30 Jan 2023 at 11:46, Mark Rutland <mark.rutland@arm.com> wrote: > > [...] > > > > This again feels like a deficiency with access_ok(). Is there a better > > > > primitive than access_ok(), or can we have something that gives us the > > > > guarantee that whatever it says is "ok" is a userspace address? > > > > > > I don't think so, since this is contextual and temporal -- a helper can't give > > > a single correct answert in all cases because it could change. > > > > That's fair, but unfortunate. Just curious: would > > copy_from_user_nofault() reliably fail if it tries to access one of > > those mappings but where access_ok() said "ok"? > > I also wonder if these special mappings are ever accessible in a user > task context? No. The special mappings are actually distinct page tables from the user page tables, so whenever userspace is executing and can issue a syscall, the user page tables are installed. The special mappings are only installed for transient periods within the context of a user task. There *might* be some latent issues with work happening in IPI context (e.g. perf user backtrace) on some architectures. > If yes, can a racing process_vm_readv/writev mess with these special mappings? No; those happen in task context, and cannot be invoked within the critical section where the page tables with the special mappings are installed. > We could use copy_from_user() to probe that the watchpoint address is > legit. But I think the memory can be potentially PROT_NONE but still > legit, so copy_from_user() won't work for these corner cases. Please see my other reply; ahead-of-time checks cannot help here. An address might be a legitimate user address and *also* transiently be a special mapping (since the two aare in entirely separate page tables). Thanks, Mark.
On Wed, 1 Feb 2023 at 12:54, Mark Rutland <mark.rutland@arm.com> wrote: > > Hi Dmitry, > > We raced to reply here, so there's more detail in my reply to Marco. I'm > providing minimal detail here, sorry for being terse! :) > > On Wed, Feb 01, 2023 at 10:53:44AM +0100, Dmitry Vyukov wrote: > > On Wed, 1 Feb 2023 at 10:34, Marco Elver <elver@google.com> wrote: > > > > > > On Mon, 30 Jan 2023 at 11:46, Mark Rutland <mark.rutland@arm.com> wrote: > > > [...] > > > > > This again feels like a deficiency with access_ok(). Is there a better > > > > > primitive than access_ok(), or can we have something that gives us the > > > > > guarantee that whatever it says is "ok" is a userspace address? > > > > > > > > I don't think so, since this is contextual and temporal -- a helper can't give > > > > a single correct answert in all cases because it could change. > > > > > > That's fair, but unfortunate. Just curious: would > > > copy_from_user_nofault() reliably fail if it tries to access one of > > > those mappings but where access_ok() said "ok"? > > > > I also wonder if these special mappings are ever accessible in a user > > task context? > > No. The special mappings are actually distinct page tables from the user page > tables, so whenever userspace is executing and can issue a syscall, the user > page tables are installed. > > The special mappings are only installed for transient periods within the > context of a user task. There *might* be some latent issues with work happening > in IPI context (e.g. perf user backtrace) on some architectures. > > > If yes, can a racing process_vm_readv/writev mess with these special mappings? > > No; those happen in task context, and cannot be invoked within the critical > section where the page tables with the special mappings are installed. > > > We could use copy_from_user() to probe that the watchpoint address is > > legit. But I think the memory can be potentially PROT_NONE but still > > legit, so copy_from_user() won't work for these corner cases. > > Please see my other reply; ahead-of-time checks cannot help here. An address > might be a legitimate user address and *also* transiently be a special mapping > (since the two aare in entirely separate page tables). This brings more clarity. Thanks for the explanations. If addresses overlap, then it seems that the kernel must disable all watchpoints while the mapping is installed. This patch tries to relax checks, but CAP_ADMIN can install such watchpoints today. And they can unintentionally break kernel, or produce false watchpoint triggers. And if all watchpoints are disabled while the mapping is installed, then this patch should be OK, right?
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index c6a3bac76966..a95a6b889b00 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1463,13 +1463,7 @@ static inline int perf_is_paranoid(void) return sysctl_perf_event_paranoid > -1; } -static inline int perf_allow_kernel(struct perf_event_attr *attr) -{ - if (sysctl_perf_event_paranoid > 1 && !perfmon_capable()) - return -EACCES; - - return security_perf_event_open(attr, PERF_SECURITY_KERNEL); -} +extern int perf_allow_kernel(struct perf_event_attr *attr); static inline int perf_allow_cpu(struct perf_event_attr *attr) { diff --git a/kernel/events/core.c b/kernel/events/core.c index d56328e5080e..0f1fc9aef294 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3174,6 +3174,12 @@ static int perf_event_modify_attr(struct perf_event *event, return -EOPNOTSUPP; } + if (!event->attr.exclude_kernel) { + err = perf_allow_kernel(attr); + if (err) + return err; + } + WARN_ON_ONCE(event->ctx->parent_ctx); mutex_lock(&event->child_mutex); @@ -12289,6 +12295,38 @@ perf_check_permission(struct perf_event_attr *attr, struct task_struct *task) return is_capable || ptrace_may_access(task, ptrace_mode); } +/* + * Check if unprivileged users are allowed to set up breakpoints on user + * addresses that also count when the kernel accesses them. + */ +static bool perf_allow_kernel_breakpoint(struct perf_event_attr *attr) +{ + if (attr->type != PERF_TYPE_BREAKPOINT) + return false; + + /* + * The sample may contain IPs, registers, or other information that may + * disclose kernel addresses or timing information. Disallow any kind of + * additional sample information. + */ + if (attr->sample_type) + return false; + + /* + * Only allow kernel breakpoints on user addresses. + */ + return access_ok((void __user *)(unsigned long)attr->bp_addr, attr->bp_len); +} + +int perf_allow_kernel(struct perf_event_attr *attr) +{ + if (sysctl_perf_event_paranoid > 1 && !perfmon_capable() && + !perf_allow_kernel_breakpoint(attr)) + return -EACCES; + + return security_perf_event_open(attr, PERF_SECURITY_KERNEL); +} + /** * sys_perf_event_open - open a performance event, associate it to a task/cpu *