Message ID: 20221108220651.24492-1-revest@chromium.org
Headers:
From: Florent Revest <revest@chromium.org>
To: bpf@vger.kernel.org
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, kpsingh@kernel.org, jackmanb@google.com, markowsky@google.com, mark.rutland@arm.com, mhiramat@kernel.org, rostedt@goodmis.org, xukuohai@huawei.com, linux-kernel@vger.kernel.org, Florent Revest <revest@chromium.org>
Subject: [RFC 0/1] BPF tracing for arm64 using fprobe
Date: Tue, 8 Nov 2022 23:06:50 +0100
Message-Id: <20221108220651.24492-1-revest@chromium.org>
Series: BPF tracing for arm64 using fprobe
Message
Florent Revest
Nov. 8, 2022, 10:06 p.m. UTC
Hi!

With this RFC, I'd like to revive the conversation between BPF, ARM and tracing folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on arm64.

Current status of BPF tracing
=============================

On currently supported architectures (like x86), BPF tracing programs are called from a JITted BPF trampoline, itself called from the ftrace patch site thanks to the ftrace "direct call" API. (Or from the end of the ftrace trampoline if an ftrace ops is also tracing that function, but this is transparent to BPF.)

Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can already be used for struct ops programs), but Xu's attempts at getting ftrace direct call support [2][3] on arm64 have been unsuccessful so far, so we still do not support BPF tracing programs. This prompted me to try a different approach. I'd like to collect feedback on it here.

Why not direct calls?
=====================

Mark and Steven have not been too keen on getting direct calls on arm64 because:
- working around the BL instruction's limited range introduces complexity [4]
- it's difficult to get reliable stacktraces right with direct calls [5]
- direct calls are complex to maintain on the arch/ftrace side [5]

In the absence of ftrace direct call support, BPF tracing programs would need to be called from an ftrace ops instead. Note that the BPF callback signature would have to be different, so we can't reuse trampolines (direct-called callbacks receive arguments in registers, whereas ftrace ops callbacks receive arguments in a struct ftrace_regs pointer).

Why fprobe?
===========

Ftrace ops per se only expose an API to hook before a function. There are two systems built on top of ftrace ops that also allow hooking the function exit: fprobe (using rethook) and the function graph tracer. There are plans from Masami and Steven to unify these two systems but, as they stand, only fprobe gives enough flexibility to implement BPF tracing.

In order not to reinvent the wheel, if direct calls aren't available on the arch, BPF could leverage fprobe to hook before and after the traced function. Note that return hooking is implemented a bit differently than it is in BPF trampolines. Instead of keeping arguments on a stack frame and calling the traced function, rethook saves arguments in a memory pool and returns to the traced function with a hijacked return pointer, so that its ret jumps back to the rethook trampoline.

What about performance?
=======================

In its current state, an fprobe callback on arm64 is very expensive because:
1- the ftrace trampoline saves all registers (including many unnecessary ones)
2- it calls ftrace_ops_list_func, which iterates over all ops and is very slow
3- the fprobe ops unconditionally hooks a rethook
4- rethook grabs memory from a freelist, which is slow under high contention

However, all of the above points are currently being addressed:
1- by Mark's series to save argument registers only [6]
2- by Mark's series to call single ops directly [7]
3- by Masami's patch to skip rethooks if not needed [8]
4- Masami said the rethook freelist would be replaced by a per-task stack as part of its unification with the function graph tracer [9]

I measured the costs of BPF on different approaches on my RPi4 here: [10]
tl;dr: the BPF "bench" takes a performance hit of:
- 28.6% w/ BPF tracing on direct calls (best case scenario, for reference) [11]
- 66.8% w/ BPF on kprobe (just for reference)
- 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
- 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]

On top of Mark's and Masami's existing and planned work, BPF tracing programs called from an fprobe callback get much closer to the performance of direct calls, but there is still a hit. If we want to optimize even further, we could potentially go one step further and JIT the fprobe callbacks. This would let us unroll the argument loop and use direct calls instead of indirect calls. However, this would introduce quite some complexity, and I expect the performance impact to be fairly minimal. (arm64 doesn't have retpolines, so it doesn't suffer too much from indirect calls anyway.)

Two other performance discrepancies I can think of, which would remain anyway, are:
1- the ftrace trampoline saves all argument registers, while BPF trampolines can get away with only saving the number of arguments that are actually used by that function (thanks to BTF info); consequently, we need to convert the ftrace_regs structure into a BPF "ctx" array, which inevitably takes some cycles
2- when a fexit program is attached, going through the rethook trampoline means that we need to save argument registers a second time, while BPF trampolines can just rely on arguments kept on the stack

How to apply this RFC?
======================

This RFC only brings up for discussion the eventual patch that would touch the BPF subsystem, because the ftrace and fprobe optimizations it is built on are less controversial and already on the way. However, this patch is meant to apply on top of Mark's and Masami's work. If you'd like to run this patch, you can use my branch. [13]

Links
=====

1: https://lore.kernel.org/bpf/20220711144722.2100039-1-xukuohai@huawei.com/
2: https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
3: https://lore.kernel.org/bpf/20220913162732.163631-1-xukuohai@huaweicloud.com/
4: https://lore.kernel.org/bpf/Yo4xb2w+FHhUtJNw@FVFF77S0Q05N/
5: https://lore.kernel.org/bpf/YzR5WSLux4mmFIXg@FVFF77S0Q05N/
6: https://lore.kernel.org/all/20221103170520.931305-1-mark.rutland@arm.com/
7: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
8: https://lore.kernel.org/all/166792255429.919356.14116090269057513181.stgit@devnote3/T/#m9d43fbdc91f48b03d644be77ac18017963a08df5
9: https://lore.kernel.org/bpf/20221024220008.48780b0f58903afed2dc8d4a@kernel.org/
10: https://paste.debian.net/1260011/
11: https://github.com/FlorentRevest/linux/commits/bpf-direct-calls
12: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-slow
13: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-rfc

Florent Revest (1):
  bpf: Invoke tracing progs using fprobe on archs without direct call

 include/linux/bpf.h     |   5 ++
 kernel/bpf/trampoline.c | 120 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 121 insertions(+), 4 deletions(-)
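For readers unfamiliar with the fprobe API referenced in the "Why fprobe?" section above, the sketch below shows how a module registers entry and exit handlers. It is a minimal illustration based on the ~v6.1 fprobe interface; handler signatures have changed in later kernels and may differ from what this RFC ultimately uses.

  #include <linux/fprobe.h>
  #include <linux/module.h>

  /* Called just before the traced function body runs. */
  static void my_entry_handler(struct fprobe *fp, unsigned long entry_ip,
                               struct pt_regs *regs)
  {
  }

  /* Called when the traced function returns, via the rethook trampoline
   * described above. */
  static void my_exit_handler(struct fprobe *fp, unsigned long entry_ip,
                              struct pt_regs *regs)
  {
  }

  static struct fprobe my_fprobe = {
          .entry_handler = my_entry_handler,
          .exit_handler  = my_exit_handler,
  };

  static int __init my_fprobe_init(void)
  {
          /* Attach to the given symbol (glob patterns are also accepted). */
          return register_fprobe(&my_fprobe, "vfs_read", NULL);
  }

  static void __exit my_fprobe_exit(void)
  {
          unregister_fprobe(&my_fprobe);
  }

  module_init(my_fprobe_init);
  module_exit(my_fprobe_exit);
  MODULE_LICENSE("GPL");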
Comments
On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote: > > Hi! > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on > arm64. > > Current status of BPF tracing > ============================= > > On currently supported architectures (like x86), BPF tracing programs are > called from a JITted BPF trampoline, itself called from the ftrace patch site > thanks to the ftrace "direct call" API. (or from the end of the ftrace > trampoline if a ftrace ops is also tracing that function, but this is > transparent to BPF) > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be > used for struct ops programs already), but Xu's attempts at getting ftrace > direct calls support [2][3] on arm64 have been unsucessful so far so we still > do not support BPF tracing programs. This prompted me to try a different > approach. I'd like to collect feedback on it here. > > Why not direct calls ? > ====================== > > Mark and Steven have not been too keen on getting direct calls on arm64 because: > - working around BL instruction's limited range introduces complexity [4] > - it's difficult to get reliable stacktraces right with direct calls [5] > - direct calls are complex to maintain on the arch/ftrace side [5] > > In the absence of ftrace direct calls support, BPF tracing programs would need > to be called from an ftrace ops instead. Note that the BPF callback signature > would have to be different, so we can't re-use trampolines (direct called > callbacks receive arguments in registers whereas ftrace ops callbacks receive > arguments in a struct ftrace_regs pointer) > > Why fprobe ? > ============ > > Ftrace ops per-se only expose an API to hook before a function. There are two > systems built on top of ftrace ops that also allow hooking the function exit: > fprobe (using rethook) and the function graph tracer. There are plans from > Masami and Steven to unify these two systems but, as they stand, only fprobe > gives enough flexibility to implement BPF tracing. > > In order not to reinvent the wheel, if direct calls aren't available on the > arch, BPF could leverage fprobe to hook before and after the traced function. > Note that return hooking is implemented a bit differently than it is in BPF > trampolines. Instead of keeping arguments on a stack frame and calling the > traced function, rethook saves arguments in a memory pool and returns to the > traced function with a hijacked return pointer that will have its ret jump back > to the rethook trampoline. > > What about performances ? 
> ========================= > > In its current state, a fprobe callback on arm64 is very expensive because: > 1- the ftrace trampoline saves all registers (including many unnecessary ones) > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow > 3- the fprobe ops unconditionally hooks a rethook > 4- rethook grabs memory from a freelist which is slow under high contention > > However, all the above points are currently being addressed: > 1- by Mark's series to save argument registers only [6] > 2- by Mark's series to call single ops directly [7] > 3- by Masami's patch to skip rethooks if not needed [8] > 4- Masami said the rethook freelist would be replaced by a per-task stack as > part of its unification with the function graph tracer [9] > > I measured the costs of BPF on different approaches on my RPi4 here: [10] > tl;dr: the BPF "bench" takes a performance hit of: > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11] > - 66.8% w/ BPF on kprobe (just for reference) > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12] > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13] Even with all optimization the performance overhead is not acceptable. It feels to me that folks are still thinking about bpf trampoline as a tracing facility. It's a lot more than that. It needs to run 24/7 with zero overhead. It needs to replace the kernel functions and be invoked millions times a second until the system is rebooted. In this environment every nanosecond counts. Even if the fprobe side was completely free the patch 1 has so much overhead in copy of bpf_cookie, regs, etc that it's a non-starter for these use cases. There are several other fundamental issues in this approach because of fprobe/ftrace. It has ftrace_test_recursion_trylock and disables preemption. Both are deal breakers. bpf trampoline has to allow recursion in some cases. See __bpf_prog_enter*() flavors. bpf trampoline also has to use migrate_disable instead of preemption and rcu_read_lock() in some cases and rcu_read_lock_trace() in others. bpf trampoline must never allocate memory or grab locks. All of these mandatory features exclude fprobe, ftrace, rethook from possible options. Let's figure out how to address concerns with direct calls: > - working around BL instruction's limited range introduces complexity [4] > - it's difficult to get reliable stacktraces right with direct calls [5] > - direct calls are complex to maintain on the arch/ftrace side [5]
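For context on the constraints Alexei lists above, here is a heavily simplified model of the guards a JITted BPF trampoline places around a fentry program. It is loosely based on the __bpf_prog_enter()/__bpf_prog_exit() helpers in kernel/bpf/trampoline.c, but it is a sketch rather than the actual kernel code, and the sleepable/trace-RCU flavors differ.

  /* Simplified model only; stats, error paths and the several
   * __bpf_prog_enter*() flavors are omitted. */
  static u64 prog_enter(struct bpf_prog *prog)
  {
          rcu_read_lock();
          migrate_disable();              /* not preempt_disable() */
          if (this_cpu_inc_return(*(prog->active)) != 1) {
                  /* Recursion into the same program on this CPU: the
                   * trampoline skips the program when 0 is returned,
                   * but still calls prog_exit() to undo the guards. */
                  return 0;
          }
          return 1;       /* the real code returns a start timestamp */
  }

  static void prog_exit(struct bpf_prog *prog, u64 start)
  {
          this_cpu_dec(*(prog->active));
          migrate_enable();
          rcu_read_unlock();
  }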
Hi Florent, On Tue, 8 Nov 2022 23:06:50 +0100 Florent Revest <revest@chromium.org> wrote: > Hi! > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on > arm64. > > Current status of BPF tracing > ============================= > > On currently supported architectures (like x86), BPF tracing programs are > called from a JITted BPF trampoline, itself called from the ftrace patch site > thanks to the ftrace "direct call" API. (or from the end of the ftrace > trampoline if a ftrace ops is also tracing that function, but this is > transparent to BPF) > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be > used for struct ops programs already), but Xu's attempts at getting ftrace > direct calls support [2][3] on arm64 have been unsucessful so far so we still > do not support BPF tracing programs. This prompted me to try a different > approach. I'd like to collect feedback on it here. > > Why not direct calls ? > ====================== > > Mark and Steven have not been too keen on getting direct calls on arm64 because: > - working around BL instruction's limited range introduces complexity [4] > - it's difficult to get reliable stacktraces right with direct calls [5] > - direct calls are complex to maintain on the arch/ftrace side [5] > > In the absence of ftrace direct calls support, BPF tracing programs would need > to be called from an ftrace ops instead. Note that the BPF callback signature > would have to be different, so we can't re-use trampolines (direct called > callbacks receive arguments in registers whereas ftrace ops callbacks receive > arguments in a struct ftrace_regs pointer) > > Why fprobe ? > ============ > > Ftrace ops per-se only expose an API to hook before a function. There are two > systems built on top of ftrace ops that also allow hooking the function exit: > fprobe (using rethook) and the function graph tracer. There are plans from > Masami and Steven to unify these two systems but, as they stand, only fprobe > gives enough flexibility to implement BPF tracing. > > In order not to reinvent the wheel, if direct calls aren't available on the > arch, BPF could leverage fprobe to hook before and after the traced function. > Note that return hooking is implemented a bit differently than it is in BPF > trampolines. Instead of keeping arguments on a stack frame and calling the > traced function, rethook saves arguments in a memory pool and returns to the > traced function with a hijacked return pointer that will have its ret jump back > to the rethook trampoline. Yeah, that is designed to not change the task's stack, but it makes another list-based stack. But it eventually replaced by the per-task shadow stack. > > What about performances ? 
> ========================= > > In its current state, a fprobe callback on arm64 is very expensive because: > 1- the ftrace trampoline saves all registers (including many unnecessary ones) > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow > 3- the fprobe ops unconditionally hooks a rethook > 4- rethook grabs memory from a freelist which is slow under high contention > > However, all the above points are currently being addressed: > 1- by Mark's series to save argument registers only [6] > 2- by Mark's series to call single ops directly [7] > 3- by Masami's patch to skip rethooks if not needed [8] > 4- Masami said the rethook freelist would be replaced by a per-task stack as > part of its unification with the function graph tracer [9] > > I measured the costs of BPF on different approaches on my RPi4 here: [10] > tl;dr: the BPF "bench" takes a performance hit of: > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11] > - 66.8% w/ BPF on kprobe (just for reference) > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12] > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13] Great! thanks for measuring that. I've checked your tree[13] and basically it is good for me. - rethook: fprobe: Use ftrace_regs instead of pt_regs This patch may break other arch which is trying to implement rethook, so can you break it down to patches for fprobe, rethook and arch specific one? > > On top of Mark's and Masami's existing and planned work, BPF tracing programs > called from a fprobe callback become much closer to the performances of direct > calls but there's still a hit. If we want to try optimizing even further, we > could potentially go even one step further by JITting the fprobe callbacks. Would this mean JITting fprobe or callbacks? > This would let us unroll the argument loop and use direct calls instead of > indirect calls. However this would introduce quite some complexity and I expect > the performance impact should be fairly minimal. (arm64 doesn't have repotlines > so it doesn't suffer too much from indirect calls anyway) > > Two other performance discrepancies I can think of, that would stay anyway are: > 1- the ftrace trampoline saves all argument registers while BPF trampolines can > get away with only saving the number of arguments that are actually used by > that function (thanks to BTF info), consequentially, we need to convert the > ftrace_regs structure into a BPF "ctx" array which inevitably takes some > cycles Have you checked the root cause by perf? I'm not sure that makes difference so much. > 2- When a fexit program is attached, going through the rethook trampoline means > that we need to save argument registers a second time while BPF trampolines > can just rely on arguments kept on the stack I think we can make a new trampoline eventually just copying some required arguments on the shadow stack. Thanks! > > How to apply this RFC ? > ======================= > > This RFC only brings up to discussion the eventual patch that would > touch the BPF subsystem because the ftrace and fprobe optimizations it > is built on are not as controversial and already on the way. However, > this patch is meant to apply on top of Mark's and Masami's work. If > you'd like to run this patch you can use my branch. 
[13] > > Links > ===== > > 1: https://lore.kernel.org/bpf/20220711144722.2100039-1-xukuohai@huawei.com/ > 2: https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/ > 3: https://lore.kernel.org/bpf/20220913162732.163631-1-xukuohai@huaweicloud.com/ > 4: https://lore.kernel.org/bpf/Yo4xb2w+FHhUtJNw@FVFF77S0Q05N/ > 5: https://lore.kernel.org/bpf/YzR5WSLux4mmFIXg@FVFF77S0Q05N/ > 6: https://lore.kernel.org/all/20221103170520.931305-1-mark.rutland@arm.com/ > 7: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops > 8: https://lore.kernel.org/all/166792255429.919356.14116090269057513181.stgit@devnote3/T/#m9d43fbdc91f48b03d644be77ac18017963a08df5 > 9: https://lore.kernel.org/bpf/20221024220008.48780b0f58903afed2dc8d4a@kernel.org/ > 10: https://paste.debian.net/1260011/ > 11: https://github.com/FlorentRevest/linux/commits/bpf-direct-calls > 12: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-slow > 13: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-rfc > > Florent Revest (1): > bpf: Invoke tracing progs using fprobe on archs without direct call > > include/linux/bpf.h | 5 ++ > kernel/bpf/trampoline.c | 120 ++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 121 insertions(+), 4 deletions(-) > > -- > 2.38.1.431.g37b22c650d-goog >
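To make the ftrace_regs-to-ctx conversion being discussed concrete, here is a hypothetical sketch of the argument-copy step. It is illustrative only and is not the code from the RFC patch; in particular, ftrace_get_regs() can return NULL when full registers were not saved, which is ignored here.

  /* Hypothetical helper: copy the first nr_args function arguments out
   * of ftrace_regs into the u64 ctx[] layout BPF tracing programs expect. */
  static void fprobe_regs_to_bpf_ctx(struct ftrace_regs *fregs, u64 *ctx,
                                     unsigned int nr_args)
  {
          struct pt_regs *regs = ftrace_get_regs(fregs);
          unsigned int i;

          for (i = 0; i < nr_args; i++)
                  ctx[i] = regs_get_kernel_argument(regs, i);
  }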
On Wed, 16 Nov 2022 18:41:26 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote: > > > > Hi! > > > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing > > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on > > arm64. > > > > Current status of BPF tracing > > ============================= > > > > On currently supported architectures (like x86), BPF tracing programs are > > called from a JITted BPF trampoline, itself called from the ftrace patch site > > thanks to the ftrace "direct call" API. (or from the end of the ftrace > > trampoline if a ftrace ops is also tracing that function, but this is > > transparent to BPF) > > > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be > > used for struct ops programs already), but Xu's attempts at getting ftrace > > direct calls support [2][3] on arm64 have been unsucessful so far so we still > > do not support BPF tracing programs. This prompted me to try a different > > approach. I'd like to collect feedback on it here. > > > > Why not direct calls ? > > ====================== > > > > Mark and Steven have not been too keen on getting direct calls on arm64 because: > > - working around BL instruction's limited range introduces complexity [4] > > - it's difficult to get reliable stacktraces right with direct calls [5] > > - direct calls are complex to maintain on the arch/ftrace side [5] > > > > In the absence of ftrace direct calls support, BPF tracing programs would need > > to be called from an ftrace ops instead. Note that the BPF callback signature > > would have to be different, so we can't re-use trampolines (direct called > > callbacks receive arguments in registers whereas ftrace ops callbacks receive > > arguments in a struct ftrace_regs pointer) > > > > Why fprobe ? > > ============ > > > > Ftrace ops per-se only expose an API to hook before a function. There are two > > systems built on top of ftrace ops that also allow hooking the function exit: > > fprobe (using rethook) and the function graph tracer. There are plans from > > Masami and Steven to unify these two systems but, as they stand, only fprobe > > gives enough flexibility to implement BPF tracing. > > > > In order not to reinvent the wheel, if direct calls aren't available on the > > arch, BPF could leverage fprobe to hook before and after the traced function. > > Note that return hooking is implemented a bit differently than it is in BPF > > trampolines. Instead of keeping arguments on a stack frame and calling the > > traced function, rethook saves arguments in a memory pool and returns to the > > traced function with a hijacked return pointer that will have its ret jump back > > to the rethook trampoline. > > > > What about performances ? 
> > ========================= > > > > In its current state, a fprobe callback on arm64 is very expensive because: > > 1- the ftrace trampoline saves all registers (including many unnecessary ones) > > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow > > 3- the fprobe ops unconditionally hooks a rethook > > 4- rethook grabs memory from a freelist which is slow under high contention > > > > However, all the above points are currently being addressed: > > 1- by Mark's series to save argument registers only [6] > > 2- by Mark's series to call single ops directly [7] > > 3- by Masami's patch to skip rethooks if not needed [8] > > 4- Masami said the rethook freelist would be replaced by a per-task stack as > > part of its unification with the function graph tracer [9] > > > > I measured the costs of BPF on different approaches on my RPi4 here: [10] > > tl;dr: the BPF "bench" takes a performance hit of: > > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11] > > - 66.8% w/ BPF on kprobe (just for reference) > > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12] > > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13] > > Even with all optimization the performance overhead is not acceptable. > It feels to me that folks are still thinking about bpf trampoline > as a tracing facility. > It's a lot more than that. It needs to run 24/7 with zero overhead. > It needs to replace the kernel functions and be invoked > millions times a second until the system is rebooted. > In this environment every nanosecond counts. > > Even if the fprobe side was completely free the patch 1 has so much > overhead in copy of bpf_cookie, regs, etc that it's a non-starter > for these use cases. > > There are several other fundamental issues in this approach > because of fprobe/ftrace. > It has ftrace_test_recursion_trylock and disables preemption. > Both are deal breakers. I talked with Florent about this offline. ftrace_test_recursion_trylock() is required for generic ftrace use because user callback can call a function which can be traced by ftrace. This means it can cause an infinite loop. However, if user can ensure to check it by itself, I can add a flag to avoid that trylock. (Of course, you can shoot your foot.) I thought the preemption disabling was for accessing per-cpu, but it is needed for rethook to get an object from an RCU protected list. Thus when we move on the per-task shadow stack, it can be removed too. > > bpf trampoline has to allow recursion in some cases. > See __bpf_prog_enter*() flavors. > > bpf trampoline also has to use migrate_disable instead of preemption > and rcu_read_lock() in some cases and rcu_read_lock_trace() in others. Is rcu_read_lock() better than preempt_disable()? > > bpf trampoline must never allocate memory or grab locks. Note that ftrace_test_recursion_trylock() is just a bit operation per-task, not taking a lock (nor atomic). Thank you, > > All of these mandatory features exclude fprobe, ftrace, rethook > from possible options. > > Let's figure out how to address concerns with direct calls: > > > - working around BL instruction's limited range introduces complexity [4] > > - it's difficult to get reliable stacktraces right with direct calls [5] > > - direct calls are complex to maintain on the arch/ftrace side [5]
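For reference, the usual pattern around ftrace_test_recursion_trylock() in an ftrace callback looks roughly like the sketch below (see include/linux/trace_recursion.h for the real interface).

  static void my_callback(unsigned long ip, unsigned long parent_ip,
                          struct ftrace_ops *op, struct ftrace_regs *fregs)
  {
          int bit;

          /* Per-task, per-context recursion guard: a bit operation,
           * not a lock. */
          bit = ftrace_test_recursion_trylock(ip, parent_ip);
          if (bit < 0)
                  return; /* recursion in the same context, bail out */

          /* ... actual callback work ... */

          ftrace_test_recursion_unlock(bit);
  }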
On Thu, Nov 17, 2022 at 5:34 AM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Wed, 16 Nov 2022 18:41:26 -0800 > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote: > > > > > > Hi! > > > > > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing > > > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on > > > arm64. > > > > > > Current status of BPF tracing > > > ============================= > > > > > > On currently supported architectures (like x86), BPF tracing programs are > > > called from a JITted BPF trampoline, itself called from the ftrace patch site > > > thanks to the ftrace "direct call" API. (or from the end of the ftrace > > > trampoline if a ftrace ops is also tracing that function, but this is > > > transparent to BPF) > > > > > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be > > > used for struct ops programs already), but Xu's attempts at getting ftrace > > > direct calls support [2][3] on arm64 have been unsucessful so far so we still > > > do not support BPF tracing programs. This prompted me to try a different > > > approach. I'd like to collect feedback on it here. > > > > > > Why not direct calls ? > > > ====================== > > > > > > Mark and Steven have not been too keen on getting direct calls on arm64 because: > > > - working around BL instruction's limited range introduces complexity [4] > > > - it's difficult to get reliable stacktraces right with direct calls [5] > > > - direct calls are complex to maintain on the arch/ftrace side [5] > > > > > > In the absence of ftrace direct calls support, BPF tracing programs would need > > > to be called from an ftrace ops instead. Note that the BPF callback signature > > > would have to be different, so we can't re-use trampolines (direct called > > > callbacks receive arguments in registers whereas ftrace ops callbacks receive > > > arguments in a struct ftrace_regs pointer) > > > > > > Why fprobe ? > > > ============ > > > > > > Ftrace ops per-se only expose an API to hook before a function. There are two > > > systems built on top of ftrace ops that also allow hooking the function exit: > > > fprobe (using rethook) and the function graph tracer. There are plans from > > > Masami and Steven to unify these two systems but, as they stand, only fprobe > > > gives enough flexibility to implement BPF tracing. > > > > > > In order not to reinvent the wheel, if direct calls aren't available on the > > > arch, BPF could leverage fprobe to hook before and after the traced function. > > > Note that return hooking is implemented a bit differently than it is in BPF > > > trampolines. Instead of keeping arguments on a stack frame and calling the > > > traced function, rethook saves arguments in a memory pool and returns to the > > > traced function with a hijacked return pointer that will have its ret jump back > > > to the rethook trampoline. > > > > > > What about performances ? 
> > > ========================= > > > > > > In its current state, a fprobe callback on arm64 is very expensive because: > > > 1- the ftrace trampoline saves all registers (including many unnecessary ones) > > > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow > > > 3- the fprobe ops unconditionally hooks a rethook > > > 4- rethook grabs memory from a freelist which is slow under high contention > > > > > > However, all the above points are currently being addressed: > > > 1- by Mark's series to save argument registers only [6] > > > 2- by Mark's series to call single ops directly [7] > > > 3- by Masami's patch to skip rethooks if not needed [8] > > > 4- Masami said the rethook freelist would be replaced by a per-task stack as > > > part of its unification with the function graph tracer [9] > > > > > > I measured the costs of BPF on different approaches on my RPi4 here: [10] > > > tl;dr: the BPF "bench" takes a performance hit of: > > > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11] > > > - 66.8% w/ BPF on kprobe (just for reference) > > > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12] > > > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13] > > > > Even with all optimization the performance overhead is not acceptable. > > It feels to me that folks are still thinking about bpf trampoline > > as a tracing facility. > > It's a lot more than that. It needs to run 24/7 with zero overhead. > > It needs to replace the kernel functions and be invoked > > millions times a second until the system is rebooted. > > In this environment every nanosecond counts. > > > > Even if the fprobe side was completely free the patch 1 has so much > > overhead in copy of bpf_cookie, regs, etc that it's a non-starter > > for these use cases. > > > > There are several other fundamental issues in this approach > > because of fprobe/ftrace. > > It has ftrace_test_recursion_trylock and disables preemption. > > Both are deal breakers. > > I talked with Florent about this offline. > ftrace_test_recursion_trylock() is required for generic ftrace > use because user callback can call a function which can be > traced by ftrace. This means it can cause an infinite loop. > However, if user can ensure to check it by itself, I can add a > flag to avoid that trylock. (Of course, you can shoot your foot.) > > I thought the preemption disabling was for accessing per-cpu, > but it is needed for rethook to get an object from an RCU > protected list. > Thus when we move on the per-task shadow stack, it can be > removed too. There might not be a task available where bpf trampoline is running. rcu protection might not be there either. Really we're just scratching the surface of all the issues why fprobe is not usable.
On Wed, 16 Nov 2022 18:41:26 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Even with all optimization the performance overhead is not acceptable.
> It feels to me that folks are still thinking about bpf trampoline
> as a tracing facility.
> It's a lot more than that. It needs to run 24/7 with zero overhead.

It obviously doesn't have zero overhead.

And correctness and maintainability trump micro-optimizations.

> It needs to replace the kernel functions and be invoked

What do you mean by "replace the kernel functions"? You mean an existing kernel function can be replaced by a bpf program? Like live patching?

This seems rather dangerous, and how does one know that their system has integrity? Is there a feature to sign bpf programs before they can be added?

Also, it may be time to bring in the lawyers. If a bpf program can replace an existing kernel function, then it has definitely passed the "user space" exception to the GPL, where user space must use the system call interface. By injecting executable code into the kernel, especially something that replaces kernel functionality, it becomes arguably derived from the kernel itself. And the BPF program must be GPL.

Allowing proprietary BPF programs to replace kernel functionality looks like a clear violation and circumvention of the GPL. But I could be mistaken. As I said, it's time to bring in the lawyers on this one.

> millions times a second until the system is rebooted.
> In this environment every nanosecond counts.
>
> Even if the fprobe side was completely free the patch 1 has so much
> overhead in copy of bpf_cookie, regs, etc that it's a non-starter
> for these use cases.
>
> There are several other fundamental issues in this approach
> because of fprobe/ftrace.
> It has ftrace_test_recursion_trylock and disables preemption.
> Both are deal breakers.

Please explain why? The recursion protection lock is simply a bit operation on the task struct, used to protect against recursion within the same context. Without it, recursion will likely happen, and the only hint of it is that the system triple faults and reboots. If you are only hooking one function, then it is easy to figure this out. But with the multi work being done, that is no longer the case.

Hooking functions is *extremely* intrusive. And protection against errors is a must-have, not an option.

>
> bpf trampoline has to allow recursion in some cases.
> See __bpf_prog_enter*() flavors.

The recursion lock allows recursion, but not within the same context. That is, interrupt against normal context is fine. But really, you should not have it within the same context. How do you verify that you do not run out of stack?

>
> bpf trampoline also has to use migrate_disable instead of preemption
> and rcu_read_lock() in some cases and rcu_read_lock_trace() in others.
>
> bpf trampoline must never allocate memory or grab locks.

Neither should fprobes / ftrace.

-- Steve

>
> All of these mandatory features exclude fprobe, ftrace, rethook
> from possible options.
>
> Let's figure out how to address concerns with direct calls:
On 11/17/22 12:16 PM, Steven Rostedt wrote: > On Wed, 16 Nov 2022 18:41:26 -0800 > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > >> Even with all optimization the performance overhead is not acceptable. >> It feels to me that folks are still thinking about bpf trampoline >> as a tracing facility. >> It's a lot more than that. It needs to run 24/7 with zero overhead. > > It obviously doesn't have zero overhead. > > And correctness and maintainability trumps micro-optimizations. During the bpf office hours today Mark Rutland and Florent had some great ideas about how to wire things up. I'm sure Mark will need some time to write it all down but it was a fun call. > >> It needs to replace the kernel functions and be invoked > > What do you mean by "replace the kernel functions"? You mean an existing > kernel function can be replaced by a bpf program? Like live patching? > > This seems rather dangerous, and how does one know that their system has > integrity? Is there a feature to sign bpf programs before they can be added? > > Also, it may be time to bring in the lawyers. If a bpf program can replace > an existing kernel function, then it has definitely passed the "user space" > exception to the GPL, where user space must use the system call interface. > By injecting executable code into the kernel, especially something that > replaces kernel functionality, it becomes arguably derived from the kernel > itself. And the BPF program must be GPL. > > Allowing proprietary BPF programs to replace kernel functionality looks > like a clear violation and circumvention of the GPL. But I could be > mistaken. As I said, it's time to bring in the lawyers on this one. https://docs.kernel.org/bpf/bpf_licensing.html answers most of your questions. It was reviewed by lawyers and also discussed pretty extensively on the lists. The short answer to your concerns is that you can't replace kernel functions from proprietary BPF programs. The LSM and TCP congestion control features intentionally have GPL only support functions in the way. bpf_probe_read_kernel() is also GPL only and massively limits the things that can be done from proprietary code. This list of helpers is pretty current and details which ones are GPL only: https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#helpers I know there's a long and glorious history of collaboration around these parts of bpf and ftrace. I really hope this time around we all come away feeling like the technical discussion made both projects better. Mark and Florent today certainly made me think that was the direction we were headed. Along these lines, I'm also hoping to avoid diving into old debates and alarmist conclusions about GPL compliance and signed bpf programs. Or, if some part of those old debates is no longer valid, can we split it off into a well researched separate thread and focus on technical bits here? -chris
On Thu, 17 Nov 2022 16:55:12 -0500 Chris Mason <clm@meta.com> wrote: > On 11/17/22 12:16 PM, Steven Rostedt wrote: > > On Wed, 16 Nov 2022 18:41:26 -0800 > > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > >> Even with all optimization the performance overhead is not acceptable. > >> It feels to me that folks are still thinking about bpf trampoline > >> as a tracing facility. > >> It's a lot more than that. It needs to run 24/7 with zero overhead. > > > > It obviously doesn't have zero overhead. > > > > And correctness and maintainability trumps micro-optimizations. > > During the bpf office hours today Mark Rutland and Florent had some > great ideas about how to wire things up. I'm sure Mark will need some > time to write it all down but it was a fun call. That's good to hear. > > > > >> It needs to replace the kernel functions and be invoked > > > > What do you mean by "replace the kernel functions"? You mean an existing > > kernel function can be replaced by a bpf program? Like live patching? > > > > This seems rather dangerous, and how does one know that their system has > > integrity? Is there a feature to sign bpf programs before they can be added? > > > > Also, it may be time to bring in the lawyers. If a bpf program can replace > > an existing kernel function, then it has definitely passed the "user space" > > exception to the GPL, where user space must use the system call interface. > > By injecting executable code into the kernel, especially something that > > replaces kernel functionality, it becomes arguably derived from the kernel > > itself. And the BPF program must be GPL. > > > > Allowing proprietary BPF programs to replace kernel functionality looks > > like a clear violation and circumvention of the GPL. But I could be > > mistaken. As I said, it's time to bring in the lawyers on this one. > > https://docs.kernel.org/bpf/bpf_licensing.html answers most of your > questions. It was reviewed by lawyers and also discussed pretty > extensively on the lists. > > The short answer to your concerns is that you can't replace kernel > functions from proprietary BPF programs. The LSM and TCP congestion > control features intentionally have GPL only support functions in the > way. bpf_probe_read_kernel() is also GPL only and massively limits the > things that can be done from proprietary code. ^^^^^^^^^^^^^^^^^ That's the part I wanted to hear. But just the fact of replacing a kernel function with BPF code seems a bit concerning. > > This list of helpers is pretty current and details which ones are GPL only: > > https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#helpers > > I know there's a long and glorious history of collaboration around these > parts of bpf and ftrace. I really hope this time around we all come > away feeling like the technical discussion made both projects better. > Mark and Florent today certainly made me think that was the direction we > were headed. > > Along these lines, I'm also hoping to avoid diving into old debates and > alarmist conclusions about GPL compliance and signed bpf programs. Or, Not alarmist, but concern as being able to modify what a kernel function can do is not something I take lightly. -- Steve > if some part of those old debates is no longer valid, can we split > it off into a well researched separate thread and focus on technical > bits here?
On Thu, Nov 17, 2022 at 04:55:12PM -0500, Chris Mason wrote: > On 11/17/22 12:16 PM, Steven Rostedt wrote: > > On Wed, 16 Nov 2022 18:41:26 -0800 > > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > > > Even with all optimization the performance overhead is not acceptable. > > > It feels to me that folks are still thinking about bpf trampoline > > > as a tracing facility. > > > It's a lot more than that. It needs to run 24/7 with zero overhead. > > > > It obviously doesn't have zero overhead. > > > > And correctness and maintainability trumps micro-optimizations. > > During the bpf office hours today Mark Rutland and Florent had some > great ideas about how to wire things up. I'm sure Mark will need some > time to write it all down but it was a fun call. I'd hoped to write that up today, but I haven't had enough time yet, so I'll try to write up that proposal next week. The rough idea was to *somehow* rejig the per-callsite ftrace_ops code I've been working on to permit (but not require) the use of custom trampolines. As mentioned during the call I want to ensure that this doesn't adversely affect regular ftrace usage, and I'd also like to ensure that the regular ftrace code is able to gain form those changes (without the need for trampolines). AFAICT, that's likely to require some rework to the way direct functions are managed. The WIP code for per-callsite ftrace_ops is at: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops To be clear, my comments were purely about the *mechanism* we end up implementing. I do have concerns w.r.t. overriding arbitrary parts of the kernel. Thanks, Mark.
Hi Alexei,

On Thu, Nov 17, 2022 at 08:50:12AM -0800, Alexei Starovoitov wrote:
> There might not be a task available where bpf trampoline is running.

I'm not sure what you mean by "there might not be a task available"; do you mean that there might not be space in the per-task shadow stack, or that the BPF program can be invoked in an IRQ context?

> rcu protection might not be there either.

We've spent a lot of time reworking entry/idle sequences with noinstr, so any time BPF can be invoked, we should have a regular kernel environment available, with RCU watching (but not necessarily in an RCU read-side critical section). If BPF is being invoked without RCU watching, that is a bug that needs to be fixed.

Do you have a specific example in mind?

Thanks,
Mark.
On Thu, Nov 17, 2022 at 05:40:30PM -0500, Steven Rostedt wrote: > On Thu, 17 Nov 2022 16:55:12 -0500 > Chris Mason <clm@meta.com> wrote: > > > On 11/17/22 12:16 PM, Steven Rostedt wrote: > > The short answer to your concerns is that you can't replace kernel > > functions from proprietary BPF programs. The LSM and TCP congestion > > control features intentionally have GPL only support functions in the > > way. bpf_probe_read_kernel() is also GPL only and massively limits the > > things that can be done from proprietary code. > > ^^^^^^^^^^^^^^^^^ > > That's the part I wanted to hear. But just the fact of replacing a kernel > function with BPF code seems a bit concerning. > > This list of helpers is pretty current and details which ones are GPL only: > > > > https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#helpers > > > > I know there's a long and glorious history of collaboration around these > > parts of bpf and ftrace. I really hope this time around we all come > > away feeling like the technical discussion made both projects better. > > Mark and Florent today certainly made me think that was the direction we > > were headed. > > > > Along these lines, I'm also hoping to avoid diving into old debates and > > alarmist conclusions about GPL compliance and signed bpf programs. Or, > > Not alarmist, but concern as being able to modify what a kernel function can > do is not something I take lightly. FWIW, given that the aim here seems to be to expose all kernel internals to be overridden arbitrarily, I'm also concerned that there's a huge surface area for issues with maintainability, robustness/correctness, and security. I really don't want to be stuck in a position where someone argues that all kernel internal functions are ABI and need to stay around as-is to be hooked by eBPF, and I hope that we all agree that there are no guarantees on that front. Thanks, Mark.
On Fri, 18 Nov 2022 16:34:50 +0000 Mark Rutland <mark.rutland@arm.com> wrote: > > Not alarmist, but concern as being able to modify what a kernel function can > > do is not something I take lightly. > > FWIW, given that the aim here seems to be to expose all kernel internals to be > overridden arbitrarily, I'm also concerned that there's a huge surface area for > issues with maintainability, robustness/correctness, and security. > > I really don't want to be stuck in a position where someone argues that all > kernel internal functions are ABI and need to stay around as-is to be hooked by > eBPF, and I hope that we all agree that there are no guarantees on that front. My biggest concern is changing functionality of arbitrary functions by BPF. I would much rather limit what functions BPF could change with some annotation. int __bpf_modify foo() { ... } That way if somethings not working, you can see directly in the code that the function could be modified by a BPF program, instead of getting some random bug report because a function returned an unexpected result that the code of that function could never produce. -- Steve
On 11/18/22 11:45 AM, Steven Rostedt wrote: > On Fri, 18 Nov 2022 16:34:50 +0000 > Mark Rutland <mark.rutland@arm.com> wrote: > >>> Not alarmist, but concern as being able to modify what a kernel function can >>> do is not something I take lightly. >> >> FWIW, given that the aim here seems to be to expose all kernel internals to be >> overridden arbitrarily, I'm also concerned that there's a huge surface area for >> issues with maintainability, robustness/correctness, and security. >> >> I really don't want to be stuck in a position where someone argues that all >> kernel internal functions are ABI and need to stay around as-is to be hooked by >> eBPF, and I hope that we all agree that there are no guarantees on that front. > For ABI concerns, I don't think we're doing anything fundamentally different from x86 here. So unless our ARM users are substantially more exciting than the x86 crowd, it should all fall under the discussion from maintainer's summit: https://lwn.net/Articles/908464/ > My biggest concern is changing functionality of arbitrary functions by BPF. > I would much rather limit what functions BPF could change with some > annotation. > > int __bpf_modify foo() > { > ... > } > > > That way if somethings not working, you can see directly in the code that > the function could be modified by a BPF program, instead of getting some > random bug report because a function returned an unexpected result that the > code of that function could never produce. > The good news is that BPF generally confines the function replacement through struct ops interfaces. There are also explicit allow lists to limit functions where you can do return value overrides etc, so I think it's fair to say these concerns are already baked in. I'm sure they can be improved over the long term, but I don't think that's related to this set of functionality on ARM. -chris
On Fri, 18 Nov 2022 12:44:00 -0500 Chris Mason <clm@meta.com> wrote: > > My biggest concern is changing functionality of arbitrary functions by BPF. > > I would much rather limit what functions BPF could change with some > > annotation. > > > > int __bpf_modify foo() > > { > > ... > > } > > > > > > That way if somethings not working, you can see directly in the code that > > the function could be modified by a BPF program, instead of getting some > > random bug report because a function returned an unexpected result that the > > code of that function could never produce. > > > > The good news is that BPF generally confines the function replacement > through struct ops interfaces. What struct ops interfaces? > There are also explicit allow lists to > limit functions where you can do return value overrides etc, so I think Where are these lists. > it's fair to say these concerns are already baked in. I'm sure they can How do I know that a function return was modified by BPF? If I'm debugging something, is it obvious to the developer that is debugging an issue (perhaps unaware of what BPF programs are loaded on the users machine), that the return of a function was tweaked by BPF and that could be the source of the bug? > be improved over the long term, but I don't think that's related to this > set of functionality on ARM. I disagree. These issues may have been added to x86, but perhaps we should take a deeper look at them again before extending them to other architectures. -- Steve
On 11/18/22 1:06 PM, Steven Rostedt wrote: > On Fri, 18 Nov 2022 12:44:00 -0500 > Chris Mason <clm@meta.com> wrote: > >>> My biggest concern is changing functionality of arbitrary functions by BPF. >>> I would much rather limit what functions BPF could change with some >>> annotation. >>> >>> int __bpf_modify foo() >>> { >>> ... >>> } >>> >>> >>> That way if somethings not working, you can see directly in the code that >>> the function could be modified by a BPF program, instead of getting some >>> random bug report because a function returned an unexpected result that the >>> code of that function could never produce. >>> >> >> The good news is that BPF generally confines the function replacement >> through struct ops interfaces. > > What struct ops interfaces? https://lwn.net/Articles/811631/ > >> There are also explicit allow lists to >> limit functions where you can do return value overrides etc, so I think > > Where are these lists. Some of the original features: https://lwn.net/Articles/811631/ It has changed and expanded since then, but hopefully you get the idea. > >> it's fair to say these concerns are already baked in. I'm sure they can > > How do I know that a function return was modified by BPF? If I'm debugging > something, is it obvious to the developer that is debugging an issue > (perhaps unaware of what BPF programs are loaded on the users machine), > that the return of a function was tweaked by BPF and that could be the > source of the bug? > >> be improved over the long term, but I don't think that's related to this >> set of functionality on ARM. > > I disagree. These issues may have been added to x86, but perhaps we should > take a deeper look at them again before extending them to other > architectures. Honestly, I think a large scale architecture review of every BPF feature and decision over the last 10 years is just the wrong bar for this patch series. From my understanding, Mark and Florent have some changes planned that'll improve ftrace, livepatching, and bpf. Lets talk about those, and tackle any long term improvements you'd like to make to BPF in other patch series. -chris
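As a rough illustration of the struct ops interfaces referenced above: a BPF program implements a kernel-defined ops table (here tcp_congestion_ops) rather than patching arbitrary functions. The sketch below is trimmed from the pattern used in the BPF selftests (bpf_cubic/bpf_dctcp); a real congestion control must implement all mandatory callbacks, which are omitted here.

  /* BPF side, libbpf-style; trimmed and illustrative only. */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL"; /* GPL-only helpers are required */

  SEC("struct_ops/my_cc_init")
  void BPF_PROG(my_cc_init, struct sock *sk)
  {
          /* initialise per-socket congestion control state */
  }

  SEC("struct_ops/my_cc_undo_cwnd")
  __u32 BPF_PROG(my_cc_undo_cwnd, struct sock *sk)
  {
          return 10; /* placeholder; a real implementation computes this */
  }

  SEC(".struct_ops")
  struct tcp_congestion_ops my_cc = {
          .init      = (void *)my_cc_init,
          .undo_cwnd = (void *)my_cc_undo_cwnd,
          .name      = "bpf_my_cc",
  };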
On Fri, Nov 18, 2022 at 01:06:08PM -0500, Steven Rostedt wrote:
> How do I know that a function return was modified by BPF? If I'm debugging
> something, is it obvious to the developer that is debugging an issue
> (perhaps unaware of what BPF programs are loaded on the users machine),
> that the return of a function was tweaked by BPF and that could be the
> source of the bug?

Have it taint the kernel if something is overridden ;-) Then we can all
ignore the report until it comes back without taint.
On Fri, Nov 18, 2022 at 7:52 PM Chris Mason <clm@meta.com> wrote:
>
> On 11/18/22 1:06 PM, Steven Rostedt wrote:
> > On Fri, 18 Nov 2022 12:44:00 -0500
> > Chris Mason <clm@meta.com> wrote:
> >

(adding this back here)

> >>>> On Fri, 18 Nov 2022 16:34:50 +0000
> >>>> Mark Rutland <mark.rutland@arm.com> wrote:
> >>>> FWIW, given that the aim here seems to be to expose all kernel internals to be
> >>>> overridden arbitrarily, I'm also concerned that there's a huge surface area for
> >>>> issues with maintainability, robustness/correctness, and security.

This is not all kernel internals, just the functions allowed for error
injection.

> >>>> I really don't want to be stuck in a position where someone argues that all
> >>>> kernel internal functions are ABI and need to stay around as-is to be hooked by
> >>>> eBPF, and I hope that we all agree that there are no guarantees on that front.

Yes, BPF provides no guarantee that kernel functions will remain stable
(similar to tracepoints and kprobes).

> >>>> Thanks,
> >>>> Mark.
> >>>>
> >>> My biggest concern is changing functionality of arbitrary functions by BPF.
> >>> I would much rather limit what functions BPF could change with some
> >>> annotation.
> >>>
> >>> int __bpf_modify foo()
> >>> {
> >>>	...
> >>> }

This annotation already exists, i.e. ALLOW_ERROR_INJECTION

Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
values of kernel functions using kprobes and the failure injection
framework [1] for functions annotated with ALLOW_ERROR_INJECTION.

BPF just provides another way to do the same thing with "modify
return" programs and this also respects the error injection list [2]
and users can *only* attach these programs to the functions annotated
with ALLOW_ERROR_INJECTION.

[1] https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?id=f4c4ca70dedc1bce8e7b1648e652aa9be1d3fcd7#n14948

> >>>
> >>>
> >>> That way if somethings not working, you can see directly in the code that
> >>> the function could be modified by a BPF program, instead of getting some
> >>> random bug report because a function returned an unexpected result that the
> >>> code of that function could never produce.
> >>>
> >>
> >> The good news is that BPF generally confines the function replacement
> >> through struct ops interfaces.
> >
> > What struct ops interfaces?
>
> https://lwn.net/Articles/811631/
>
> >
> >> There are also explicit allow lists to
> >> limit functions where you can do return value overrides etc, so I think
> >
> > Where are these lists.
>
> Some of the original features:
>
> https://lwn.net/Articles/811631/

I think you meant: https://lwn.net/Articles/740146/ ?

>
> It has changed and expanded since then, but hopefully you get the idea.
>
> >
> >> it's fair to say these concerns are already baked in. I'm sure they can
> >
> > How do I know that a function return was modified by BPF? If I'm debugging

You can list the BPF programs that are loaded in the kernel with

# bpftool prog list

Also, the BPF programs show up in call stacks when you are debugging.

> > something, is it obvious to the developer that is debugging an issue
> > (perhaps unaware of what BPF programs are loaded on the users machine),
> > that the return of a function was tweaked by BPF and that could be the
> > source of the bug?
> >
> >> be improved over the long term, but I don't think that's related to this
> >> set of functionality on ARM.
There are workloads and applications (e.g. https://kubearmor.io/) that
already use BPF Tracing and LSM programs and are currently blocked on
their ARM server deployments.

This may be obvious, but I want to reiterate that while the attachment
points are not UAPI and users have to tolerate kernel function changes,
they do expect the core loading and attachment mechanisms to exist
(i.e. the ability to use LSM and tracing programs).

> >
> > I disagree. These issues may have been added to x86, but perhaps we should
> > take a deeper look at them again before extending them to other
> > architectures.
>
> Honestly, I think a large scale architecture review of every BPF feature
> and decision over the last 10 years is just the wrong bar for this patch
> series.

+1

>
> From my understanding, Mark and Florent have some changes planned
> that'll improve ftrace, livepatching, and bpf. Lets talk about those,
> and tackle any long term improvements you'd like to make to BPF in other
> patch series.

+1

- KP

>
> -chris
>
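For readers following along, the "modify return" programs KP describes are written against libbpf roughly as in the sketch below. This is only an illustration, not part of the patch set under discussion: the file and program names are made up, and the section name, BPF_PROG() macro and vmlinux.h usage assume a reasonably recent clang/libbpf toolchain. should_failslab() is one of the functions already annotated with ALLOW_ERROR_INJECTION() in the kernel.

  /* fail_slab.bpf.c - illustrative sketch only */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #ifndef ENOMEM
  #define ENOMEM 12
  #endif

  char LICENSE[] SEC("license") = "GPL";

  /*
   * fmod_ret programs may only attach to functions annotated with
   * ALLOW_ERROR_INJECTION(); should_failslab() is one such function.
   * Returning non-zero here makes the slab allocation path behave as
   * if the allocation had failed.
   */
  SEC("fmod_ret/should_failslab")
  int BPF_PROG(fail_slab_alloc, struct kmem_cache *s, gfp_t gfpflags)
  {
          return -ENOMEM;
  }

Loading and attaching such a program (e.g. through a libbpf skeleton) requires root or equivalent capabilities, and a kernel built with the error-injection support discussed later in the thread.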
On Mon, Nov 21, 2022 at 02:47:10PM +0100, KP Singh wrote:

> > > How do I know that a function return was modified by BPF? If I'm debugging
>
> You can list the BPF programs that are loaded in the kernel with
>
> # bpftool prog list

Only when you have access to the machine; most cases it's people sending
random splats by email.

> Also, the BPF programs show up in call stacks when you are debugging.

Only when it splats inside the BPF part, not when it splats after
because BPF changed semantics of a function.
On Mon, Nov 21, 2022 at 3:17 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 21, 2022 at 02:47:10PM +0100, KP Singh wrote:
>
> > > > How do I know that a function return was modified by BPF? If I'm debugging
> >
> > You can list the BPF programs that are loaded in the kernel with
> >
> > # bpftool prog list
>
> Only when you have access to the machine; most cases it's people sending
> random splats by email.

Good point. What about having information about loaded BPF programs in the
kernel stack traces and sharing bytecode, somehow, like in crash dumps?

>
> > Also, the BPF programs show up in call stacks when you are debugging.
>
> Only when it splats inside the BPF part, not when it splats after
> because BPF changed semantics of a function.
>
>
On Mon, 21 Nov 2022 11:09:21 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 18, 2022 at 01:06:08PM -0500, Steven Rostedt wrote:
> > How do I know that a function return was modified by BPF? If I'm debugging
> > something, is it obvious to the developer that is debugging an issue
> > (perhaps unaware of what BPF programs are loaded on the users machine),
> > that the return of a function was tweaked by BPF and that could be the
> > source of the bug?
>
> Have it taint the kernel if something is overridden ;-) Then we can all
> ignore the report until it comes back without taint.

Hmm, indeed. BTW, error injection should set that too.

Thanks,
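For context, the taint suggestion (including the note that error injection should set it too) would amount to something like the sketch below in the helper that performs kprobe-based return overrides. The body of bpf_override_return() is reproduced roughly from kernel/trace/bpf_trace.c; the add_taint() line is the hypothetical addition, reusing the existing TAINT_USER flag purely for illustration (a dedicated taint flag may well be preferable).

  BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
  {
          regs_set_return_value(regs, rc);
          override_function_with_return(regs);
          /* hypothetical: make overridden returns visible in bug reports */
          add_taint(TAINT_USER, LOCKDEP_STILL_OK);
          return 0;
  }

A similar hook would be needed in the BPF trampoline path that handles fmod_ret programs for the taint to cover both mechanisms.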
On Mon, 21 Nov 2022 14:47:10 +0100
KP Singh <kpsingh@kernel.org> wrote:

> This annotation already exists, i.e. ALLOW_ERROR_INJECTION
>
> Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
> values of kernel functions using kprobes and the failure injection
> framework [1] for functions annotated with ALLOW_ERROR_INJECTION.
>
> BPF just provides another way to do the same thing with "modify
> return" programs and this also respects the error injection list [2]
> and users can *only* attach these programs to the functions annotated
> with ALLOW_ERROR_INJECTION.

WAIT!

Looking at the Kconfigs, I see

CONFIG_FUNCTION_ERROR_INJECTION is set when
CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.

And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.

There's no way to turn it off on x86 except by disabling kprobes!

WTF!

I don't want a kernel that can add error injection just because kprobes is
enabled. There's two kinds of kprobes. One that is for visibility only (for
tracing) and one that can be used for functional changes. I want the
visibility without the ability to change the kernel. The visibility portion
is very useful for security, where as the modifying one can be used to
circumvent security.

As kprobes are set in most production environments, so is error injection.
Do we really want error injection enabled on production environments?
I don't.

I think we need this patch ASAP!

-- Steve

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c3c0b077ade3..9ee72d8860c3 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1874,8 +1874,14 @@ config NETDEV_NOTIFIER_ERROR_INJECT
 	  If unsure, say N.
 
 config FUNCTION_ERROR_INJECTION
-	def_bool y
+	bool "Fault-injections of functions"
 	depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
+	help
+	  Add fault injections into various functions that are annotated with
+	  ALLOW_ERROR_INJECTION() in the kernel. BPF may also modify the return
+	  value of these functions. This is useful to test error paths of code.
+
+	  If unsure, say N
 
 config FAULT_INJECTION
 	bool "Fault-injection framework"
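For reference, the annotation being discussed is a per-function opt-in in the kernel sources, and the Kconfig option above only controls whether those annotations have any effect. Schematically (the real logic is elided; should_failslab() in the slab allocator is the canonical example of an annotated function, and ERRNO is one of the error-injection types defined in include/asm-generic/error-injection.h):

  #include <linux/error-injection.h>

  noinline bool should_failslab(struct kmem_cache *s, gfp_t gfpflags)
  {
          /* real fault-injection logic elided */
          return false;
  }
  /* ERRNO tells the error-injection core the injected value is an errno */
  ALLOW_ERROR_INJECTION(should_failslab, ERRNO);

The BPF verifier consults the resulting list (see [2] earlier in the thread) before allowing a "modify return" program to attach.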
On Mon, Nov 21, 2022 at 4:15 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 21 Nov 2022 14:47:10 +0100
> KP Singh <kpsingh@kernel.org> wrote:
>
> > This annotation already exists, i.e. ALLOW_ERROR_INJECTION
> >
> > Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
> > values of kernel functions using kprobes and the failure injection
> > framework [1] for functions annotated with ALLOW_ERROR_INJECTION.
> >
> > BPF just provides another way to do the same thing with "modify
> > return" programs and this also respects the error injection list [2]
> > and users can *only* attach these programs to the functions annotated
> > with ALLOW_ERROR_INJECTION.
>
> WAIT!
>
> Looking at the Kconfigs, I see
>
> CONFIG_FUNCTION_ERROR_INJECTION is set when
> CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.
>
> And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.
>
> There's no way to turn it off on x86 except by disabling kprobes!
>
> WTF!
>
> I don't want a kernel that can add error injection just because kprobes is
> enabled. There's two kinds of kprobes. One that is for visibility only (for
> tracing) and one that can be used for functional changes. I want the
> visibility without the ability to change the kernel. The visibility portion
> is very useful for security, where as the modifying one can be used to
> circumvent security.

I am not sure how they can circumvent security since this needs root /
root equivalent permissions. Fault injection is actually a very useful
debugging tool.

>
> As kprobes are set in most production environments, so is error injection.
> Do we really want error injection enabled on production environments?
> I don't.
>
> I think we need this patch ASAP!
>
> -- Steve
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index c3c0b077ade3..9ee72d8860c3 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1874,8 +1874,14 @@ config NETDEV_NOTIFIER_ERROR_INJECT
>           If unsure, say N.
>
>  config FUNCTION_ERROR_INJECTION
> -       def_bool y
> +       bool "Fault-injections of functions"
>         depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
> +       help
> +         Add fault injections into various functions that are annotated with
> +         ALLOW_ERROR_INJECTION() in the kernel. BPF may also modify the return
> +         value of these functions. This is useful to test error paths of code.
> +
> +         If unsure, say N
>
>  config FAULT_INJECTION
>         bool "Fault-injection framework"
On Mon, 21 Nov 2022 16:29:54 +0100
KP Singh <kpsingh@kernel.org> wrote:

> I am not sure how they can circumvent security since this needs root /
> root equivalent permissions. Fault injection is actually a very useful
> debugging tool.

On ChromeOS, we even consider root untrusted and lock down pretty much all
privileged activities (like loading modules and such).

As you said. It's a good debugging tool. Not something to allow in
production environments. Or at the very least, allow admins to disable it.

-- Steve
On Mon, Nov 21, 2022 at 7:15 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 21 Nov 2022 14:47:10 +0100
> KP Singh <kpsingh@kernel.org> wrote:
>
> > This annotation already exists, i.e. ALLOW_ERROR_INJECTION
> >
> > Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
> > values of kernel functions using kprobes and the failure injection
> > framework [1] for functions annotated with ALLOW_ERROR_INJECTION.
> >
> > BPF just provides another way to do the same thing with "modify
> > return" programs and this also respects the error injection list [2]
> > and users can *only* attach these programs to the functions annotated
> > with ALLOW_ERROR_INJECTION.
>
> WAIT!
>
> Looking at the Kconfigs, I see
>
> CONFIG_FUNCTION_ERROR_INJECTION is set when
> CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.
>
> And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.
>
> There's no way to turn it off on x86 except by disabling kprobes!
>
> WTF!
>
> I don't want a kernel that can add error injection just because kprobes is
> enabled. There's two kinds of kprobes. One that is for visibility only (for
> tracing) and one that can be used for functional changes. I want the
> visibility without the ability to change the kernel. The visibility portion
> is very useful for security, where as the modifying one can be used to
> circumvent security.
>
> As kprobes are set in most production environments, so is error injection.
> Do we really want error injection enabled on production environments?

We absolutely want it enabled in production.

> I don't.

Speak for yourself, because your employer thinks otherwise.
On Mon, 21 Nov 2022 07:40:11 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >
> > As kprobes are set in most production environments, so is error injection.
> > Do we really want error injection enabled on production environments?
>
> We absolutely want it enabled in production.
>
> > I don't.
>
> Speak for yourself, because your employer thinks otherwise.

And speak for yourself! Just because you want it in production, does not
mean that everyone else is forced to do so too.

It's a DEBUG feature. It's in Kconfig.debug. Why is it enabled by default???

-- Steve
On Mon, Nov 21, 2022 at 10:45:48AM -0500, Steven Rostedt wrote:
> And speak for yourself! Just because you want it in production, does not
> mean that everyone else is forced to do so too.

Yes, I don't want it enabled in my enterprise kernels either because I
don't want bug reports for stuff which people should clearly not do in
production.

It is as simple as that.
On Mon, 21 Nov 2022, KP Singh wrote:

> > Looking at the Kconfigs, I see
> >
> > CONFIG_FUNCTION_ERROR_INJECTION is set when
> > CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.
> >
> > And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.
> >
> > There's no way to turn it off on x86 except by disabling kprobes!
> >
> > WTF!
> >
> > I don't want a kernel that can add error injection just because kprobes is
> > enabled. There's two kinds of kprobes. One that is for visibility only (for
> > tracing) and one that can be used for functional changes. I want the
> > visibility without the ability to change the kernel. The visibility portion
> > is very useful for security, where as the modifying one can be used to
> > circumvent security.
>
> I am not sure how they can circumvent security since this needs root /
> root equivalent permissions. Fault injection is actually a very useful
> debugging tool.

There are environments where root is untrusted (e.g. secure boot), and
there is a whole mechanism in kernel for dealing with that (all the
CONFIG_SECURITY_LOCKDOWN_LSM handling).

Seems like error injection should be wired up into lockdown handling at
minimum.
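Wiring this into lockdown, as suggested, would look roughly like the sketch below: a security_locked_down() check in the paths that arm an override (fmod_ret attach, kprobe-based fault injection). security_locked_down() and the lockdown LSM are existing infrastructure; LOCKDOWN_ERROR_INJECTION is a hypothetical reason used here only for illustration, and an existing reason (or a new one) would have to be agreed on during review.

  #include <linux/security.h>

  /* hypothetical gate: returns 0 if the override may proceed */
  static int error_injection_locked_down(void)
  {
          return security_locked_down(LOCKDOWN_ERROR_INJECTION);
  }

Callers would refuse to attach the program (or register the fault-injection point) when this returns -EPERM, mirroring how other debug facilities are already gated when CONFIG_SECURITY_LOCKDOWN_LSM is enforcing.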