Message ID | 20231012114550.152846-1-asavkov@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:2908:b0:403:3b70:6f57 with SMTP id ib8csp1164949vqb; Thu, 12 Oct 2023 04:47:59 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGEu3F1RLFBO+EEOaI/Qd/W2RAcb5vA8PzWqkpmyyODpy3tkkXGMQMr0nLDYKz+cjiwSUtS X-Received: by 2002:a05:6871:80d:b0:1d0:e371:db33 with SMTP id q13-20020a056871080d00b001d0e371db33mr26945728oap.3.1697111279218; Thu, 12 Oct 2023 04:47:59 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697111279; cv=none; d=google.com; s=arc-20160816; b=qmKcbe9N9iW7dSBPjcSZGwtHXPIbcfcr/b0NMKKN7iYvEHsZVUwCQ1JpM9dzkms+N6 98TNV6DZbIf0HzeHvKy1S/P2I+bE1B4wZNoeLx5ADyKjYfmPN1x1RN41vDLL1zAyhVlZ oD3ySLfTsEMMDAv7l9fs+Ei1L9iIwaykpeMjjihUAsDMLrpz1zekBa0FZK0krtkWRDxD f34/MDanhY+fcUbSgRhjouZ1/tsOYePp1XPK5/LQJwydjfJFoHUyanCmBUCwjW/ugL4f a03BwMtn2QXS839ixMeRVMmjkMgcTShAshG8ulz21SCtZWqQy5yokCXhRh9llwvfkPLg QUDA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=WIegbc1FLrrwIh2JY6As7DXUlzCdo7e3rD7xAJ0V3OI=; fh=BLnU+INhokr5VegtWHnIHuQcLgGbp60hDLZrry+Hnzk=; b=QFFzDpIK1h4okjsSI3YnLJi60Vc6KV/Odg3+812Ry+KzUqlorw/eokVGXLTWsnAwuW gS0CVC+9Tp6sBmyqbNgrVqS+9sF+sFylVvf2LhCStRNaj+m4C8VdXE3up1VCuFGKu5I3 1la5fM4Yl5VJXHiip/qQjLk/1DCVyRSJaJiCKeE+sXKud6Eg3vwQyy6CCA78mqmCOtc7 2EG4FHpdGs+95F2TUg8PWLGRxN3QU+x13oPYJEHUCm7T83yVq5FtpbeUsm+tdeguJusP ieTayYiRgR7BSYUHt7aPlpN72izf+gzIgzzOeFL5q7AiQOiUNRfgLK4G34Zdc+X4dm8w 6B0A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=fEGvh8AW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. [2620:137:e000::3:7]) by mx.google.com with ESMTPS id 202-20020a6302d3000000b005859da6172bsi2054815pgc.727.2023.10.12.04.47.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Oct 2023 04:47:59 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=fEGvh8AW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 2F746822CC5E; Thu, 12 Oct 2023 04:47:58 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1378741AbjJLLrr (ORCPT <rfc822;rua109.linux@gmail.com> + 19 others); Thu, 12 Oct 2023 07:47:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44990 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1378536AbjJLLrK (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 12 Oct 2023 07:47:10 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 47ABDFD for <linux-kernel@vger.kernel.org>; Thu, 12 Oct 2023 04:46:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1697111185; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=WIegbc1FLrrwIh2JY6As7DXUlzCdo7e3rD7xAJ0V3OI=; b=fEGvh8AWT8Xh+s5on7gGO2X9pYrZ1KlhD4CDKy4RSqFmjIay0LA7ZKyBNCQIZtMmL5QbWF obaL8Y26vL2lbsSHc61drDW/SouA8MhjXKMEOtq5W06m26QubS+d2//D0X8I1/LGVRqEl7 /hrAHlH4HIgcv1L7a3Va2T28KgLiOgY= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-648-3jWqN8JuMCGfw_MDG6rsJg-1; Thu, 12 Oct 2023 07:46:10 -0400 X-MC-Unique: 3jWqN8JuMCGfw_MDG6rsJg-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B1A3185530F; Thu, 12 Oct 2023 11:46:09 +0000 (UTC) Received: from alecto.usersys.redhat.com (unknown [10.43.17.26]) by smtp.corp.redhat.com (Postfix) with ESMTP id D6D462157F5A; Thu, 12 Oct 2023 11:46:07 +0000 (UTC) From: Artem Savkov <asavkov@redhat.com> To: Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, bpf@vger.kernel.org, netdev@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org>, Masami Hiramatsu <mhiramat@kernel.org>, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de>, linux-rt-users@vger.kernel.org, Jiri Olsa <jolsa@kernel.org>, Artem Savkov <asavkov@redhat.com> Subject: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t Date: Thu, 12 Oct 2023 13:45:50 +0200 Message-ID: <20231012114550.152846-1-asavkov@redhat.com> In-Reply-To: <20231005123413.GA488417@alecto.usersys.redhat.com> References: <20231005123413.GA488417@alecto.usersys.redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,SPF_NONE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Thu, 12 Oct 2023 04:47:58 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1779550157038336106 X-GMAIL-MSGID: 1779550157038336106 |
Series |
[RFC,bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
|
|
Commit Message
Artem Savkov
Oct. 12, 2023, 11:45 a.m. UTC
linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
for lazy preemption")) that adds an extra member to struct trace_entry.
This causes the offset of args field in struct trace_event_raw_sys_enter
be different from the one in struct syscall_trace_enter:
struct trace_event_raw_sys_enter {
struct trace_entry ent; /* 0 12 */
/* XXX last struct has 3 bytes of padding */
/* XXX 4 bytes hole, try to pack */
long int id; /* 16 8 */
long unsigned int args[6]; /* 24 48 */
/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
char __data[]; /* 72 0 */
/* size: 72, cachelines: 2, members: 4 */
/* sum members: 68, holes: 1, sum holes: 4 */
/* paddings: 1, sum paddings: 3 */
/* last cacheline: 8 bytes */
};
struct syscall_trace_enter {
struct trace_entry ent; /* 0 12 */
/* XXX last struct has 3 bytes of padding */
int nr; /* 12 4 */
long unsigned int args[]; /* 16 0 */
/* size: 16, cachelines: 1, members: 3 */
/* paddings: 1, sum paddings: 3 */
/* last cacheline: 16 bytes */
};
This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
test_profiler testcase because max_ctx_offset is calculated based on the
former struct, while off on the latter:
10488 if (is_tracepoint || is_syscall_tp) {
10489 int off = trace_event_get_offsets(event->tp_event);
10490
10491 if (prog->aux->max_ctx_offset > off)
10492 return -EACCES;
10493 }
What bpf program is actually getting is a pointer to struct
syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
the problem by aligning struct syscall_tp_t with with struct
syscall_trace_(enter|exit) and changing the tests to use these structs
to dereference context.
Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
kernel/trace/trace_syscalls.c | 4 ++--
tools/testing/selftests/bpf/progs/profiler.inc.h | 2 +-
tools/testing/selftests/bpf/progs/test_vmlinux.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)
Comments
On Thu, 12 Oct 2023 13:45:50 +0200 Artem Savkov <asavkov@redhat.com> wrote: > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support > for lazy preemption")) that adds an extra member to struct trace_entry. > This causes the offset of args field in struct trace_event_raw_sys_enter > be different from the one in struct syscall_trace_enter: > > struct trace_event_raw_sys_enter { > struct trace_entry ent; /* 0 12 */ > > /* XXX last struct has 3 bytes of padding */ > /* XXX 4 bytes hole, try to pack */ > > long int id; /* 16 8 */ > long unsigned int args[6]; /* 24 48 */ > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > char __data[]; /* 72 0 */ > > /* size: 72, cachelines: 2, members: 4 */ > /* sum members: 68, holes: 1, sum holes: 4 */ > /* paddings: 1, sum paddings: 3 */ > /* last cacheline: 8 bytes */ > }; > > struct syscall_trace_enter { > struct trace_entry ent; /* 0 12 */ > > /* XXX last struct has 3 bytes of padding */ > > int nr; /* 12 4 */ > long unsigned int args[]; /* 16 0 */ > > /* size: 16, cachelines: 1, members: 3 */ > /* paddings: 1, sum paddings: 3 */ > /* last cacheline: 16 bytes */ > }; > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf > test_profiler testcase because max_ctx_offset is calculated based on the > former struct, while off on the latter: > > 10488 if (is_tracepoint || is_syscall_tp) { > 10489 int off = trace_event_get_offsets(event->tp_event); > 10490 > 10491 if (prog->aux->max_ctx_offset > off) > 10492 return -EACCES; > 10493 } > > What bpf program is actually getting is a pointer to struct > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes > the problem by aligning struct syscall_tp_t with with struct > syscall_trace_(enter|exit) and changing the tests to use these structs > to dereference context. > > Signed-off-by: Artem Savkov <asavkov@redhat.com> Thanks for doing a proper fix. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> -- Steve
On Thu, Oct 12, 2023 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, 12 Oct 2023 13:45:50 +0200 > Artem Savkov <asavkov@redhat.com> wrote: > > > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support > > for lazy preemption")) that adds an extra member to struct trace_entry. > > This causes the offset of args field in struct trace_event_raw_sys_enter > > be different from the one in struct syscall_trace_enter: > > > > struct trace_event_raw_sys_enter { > > struct trace_entry ent; /* 0 12 */ > > > > /* XXX last struct has 3 bytes of padding */ > > /* XXX 4 bytes hole, try to pack */ > > > > long int id; /* 16 8 */ > > long unsigned int args[6]; /* 24 48 */ > > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > > char __data[]; /* 72 0 */ > > > > /* size: 72, cachelines: 2, members: 4 */ > > /* sum members: 68, holes: 1, sum holes: 4 */ > > /* paddings: 1, sum paddings: 3 */ > > /* last cacheline: 8 bytes */ > > }; > > > > struct syscall_trace_enter { > > struct trace_entry ent; /* 0 12 */ > > > > /* XXX last struct has 3 bytes of padding */ > > > > int nr; /* 12 4 */ > > long unsigned int args[]; /* 16 0 */ > > > > /* size: 16, cachelines: 1, members: 3 */ > > /* paddings: 1, sum paddings: 3 */ > > /* last cacheline: 16 bytes */ > > }; > > > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf > > test_profiler testcase because max_ctx_offset is calculated based on the > > former struct, while off on the latter: > > > > 10488 if (is_tracepoint || is_syscall_tp) { > > 10489 int off = trace_event_get_offsets(event->tp_event); > > 10490 > > 10491 if (prog->aux->max_ctx_offset > off) > > 10492 return -EACCES; > > 10493 } > > > > What bpf program is actually getting is a pointer to struct > > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes > > the problem by aligning struct syscall_tp_t with with struct > > syscall_trace_(enter|exit) and changing the tests to use these structs > > to dereference context. > > > > Signed-off-by: Artem Savkov <asavkov@redhat.com> > I think these changes make sense regardless, can you please resend the patch without RFC tag so that our CI can run tests for it? > Thanks for doing a proper fix. > > Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> But looking at [0] and briefly reading some of the discussions you, Steven, had. I'm just wondering if it would be best to avoid increasing struct trace_entry altogether? It seems like preempt_count is actually a 4-bit field in trace context, so it doesn't seem like we really need to allocate an entire byte for both preempt_count and preempt_lazy_count. Why can't we just combine them and not waste 8 extra bytes for each trace event in a ring buffer? [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71 > > -- Steve
For the novice with the RT kernel, Could somebody tell me what kernel this bug was first introduced in and what kernel we need to install to get the fix? This could be the issue we have been experiencing in the Linuxcnc community with excessive RT network latency (mostly with realtek NIC's). I had flagged Lazy preemption as being the possible changes causing this issue. I thought Lazy Preemption was added Circa kernel 5.09 as it affected Debian Bullseye on Kernel 5.10 which coincided with when we first observed the problem. Thanks in anticipation. Rod Webster Rod Webster 1300 896 832 +61 435 765 611 VMN® www.vmn.com.au Sole Queensland Distributor On Thu, 12 Oct 2023 at 21:46, Artem Savkov <asavkov@redhat.com> wrote: > > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support > for lazy preemption")) that adds an extra member to struct trace_entry. > This causes the offset of args field in struct trace_event_raw_sys_enter > be different from the one in struct syscall_trace_enter: > > struct trace_event_raw_sys_enter { > struct trace_entry ent; /* 0 12 */ > > /* XXX last struct has 3 bytes of padding */ > /* XXX 4 bytes hole, try to pack */ > > long int id; /* 16 8 */ > long unsigned int args[6]; /* 24 48 */ > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > char __data[]; /* 72 0 */ > > /* size: 72, cachelines: 2, members: 4 */ > /* sum members: 68, holes: 1, sum holes: 4 */ > /* paddings: 1, sum paddings: 3 */ > /* last cacheline: 8 bytes */ > }; > > struct syscall_trace_enter { > struct trace_entry ent; /* 0 12 */ > > /* XXX last struct has 3 bytes of padding */ > > int nr; /* 12 4 */ > long unsigned int args[]; /* 16 0 */ > > /* size: 16, cachelines: 1, members: 3 */ > /* paddings: 1, sum paddings: 3 */ > /* last cacheline: 16 bytes */ > }; > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf > test_profiler testcase because max_ctx_offset is calculated based on the > former struct, while off on the latter: > > 10488 if (is_tracepoint || is_syscall_tp) { > 10489 int off = trace_event_get_offsets(event->tp_event); > 10490 > 10491 if (prog->aux->max_ctx_offset > off) > 10492 return -EACCES; > 10493 } > > What bpf program is actually getting is a pointer to struct > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes > the problem by aligning struct syscall_tp_t with with struct > syscall_trace_(enter|exit) and changing the tests to use these structs > to dereference context. > > Signed-off-by: Artem Savkov <asavkov@redhat.com> > --- > kernel/trace/trace_syscalls.c | 4 ++-- > tools/testing/selftests/bpf/progs/profiler.inc.h | 2 +- > tools/testing/selftests/bpf/progs/test_vmlinux.c | 4 ++-- > 3 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c > index de753403cdafb..9c581d6da843a 100644 > --- a/kernel/trace/trace_syscalls.c > +++ b/kernel/trace/trace_syscalls.c > @@ -556,7 +556,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re > { > struct syscall_tp_t { > struct trace_entry ent; > - unsigned long syscall_nr; > + int syscall_nr; > unsigned long args[SYSCALL_DEFINE_MAXARGS]; > } __aligned(8) param; > int i; > @@ -661,7 +661,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg > { > struct syscall_tp_t { > struct trace_entry ent; > - unsigned long syscall_nr; > + int syscall_nr; > unsigned long ret; > } __aligned(8) param; > > diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h > index f799d87e87002..897061930cb76 100644 > --- a/tools/testing/selftests/bpf/progs/profiler.inc.h > +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h > @@ -609,7 +609,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write, > } > > SEC("tracepoint/syscalls/sys_enter_kill") > -int tracepoint__syscalls__sys_enter_kill(struct trace_event_raw_sys_enter* ctx) > +int tracepoint__syscalls__sys_enter_kill(struct syscall_trace_enter* ctx) > { > struct bpf_func_stats_ctx stats_ctx; > > diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c b/tools/testing/selftests/bpf/progs/test_vmlinux.c > index 4b8e37f7fd06c..78b23934d9f8f 100644 > --- a/tools/testing/selftests/bpf/progs/test_vmlinux.c > +++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c > @@ -16,12 +16,12 @@ bool kprobe_called = false; > bool fentry_called = false; > > SEC("tp/syscalls/sys_enter_nanosleep") > -int handle__tp(struct trace_event_raw_sys_enter *args) > +int handle__tp(struct syscall_trace_enter *args) > { > struct __kernel_timespec *ts; > long tv_nsec; > > - if (args->id != __NR_nanosleep) > + if (args->nr != __NR_nanosleep) > return 0; > > ts = (void *)args->args[0]; > -- > 2.41.0 > >
On Thu, Oct 12, 2023 at 04:32:51PM -0700, Andrii Nakryiko wrote: > On Thu, Oct 12, 2023 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Thu, 12 Oct 2023 13:45:50 +0200 > > Artem Savkov <asavkov@redhat.com> wrote: > > > > > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support > > > for lazy preemption")) that adds an extra member to struct trace_entry. > > > This causes the offset of args field in struct trace_event_raw_sys_enter > > > be different from the one in struct syscall_trace_enter: > > > > > > struct trace_event_raw_sys_enter { > > > struct trace_entry ent; /* 0 12 */ > > > > > > /* XXX last struct has 3 bytes of padding */ > > > /* XXX 4 bytes hole, try to pack */ > > > > > > long int id; /* 16 8 */ > > > long unsigned int args[6]; /* 24 48 */ > > > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > > > char __data[]; /* 72 0 */ > > > > > > /* size: 72, cachelines: 2, members: 4 */ > > > /* sum members: 68, holes: 1, sum holes: 4 */ > > > /* paddings: 1, sum paddings: 3 */ > > > /* last cacheline: 8 bytes */ > > > }; > > > > > > struct syscall_trace_enter { > > > struct trace_entry ent; /* 0 12 */ > > > > > > /* XXX last struct has 3 bytes of padding */ > > > > > > int nr; /* 12 4 */ > > > long unsigned int args[]; /* 16 0 */ > > > > > > /* size: 16, cachelines: 1, members: 3 */ > > > /* paddings: 1, sum paddings: 3 */ > > > /* last cacheline: 16 bytes */ > > > }; > > > > > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf > > > test_profiler testcase because max_ctx_offset is calculated based on the > > > former struct, while off on the latter: > > > > > > 10488 if (is_tracepoint || is_syscall_tp) { > > > 10489 int off = trace_event_get_offsets(event->tp_event); > > > 10490 > > > 10491 if (prog->aux->max_ctx_offset > off) > > > 10492 return -EACCES; > > > 10493 } > > > > > > What bpf program is actually getting is a pointer to struct > > > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes > > > the problem by aligning struct syscall_tp_t with with struct > > > syscall_trace_(enter|exit) and changing the tests to use these structs > > > to dereference context. > > > > > > Signed-off-by: Artem Savkov <asavkov@redhat.com> > > > > I think these changes make sense regardless, can you please resend the > patch without RFC tag so that our CI can run tests for it? Ok, didn't know it was set up like that. > > Thanks for doing a proper fix. > > > > Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> > > But looking at [0] and briefly reading some of the discussions you, > Steven, had. I'm just wondering if it would be best to avoid > increasing struct trace_entry altogether? It seems like preempt_count > is actually a 4-bit field in trace context, so it doesn't seem like we > really need to allocate an entire byte for both preempt_count and > preempt_lazy_count. Why can't we just combine them and not waste 8 > extra bytes for each trace event in a ring buffer? > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71 I agree that avoiding increase in struct trace_entry size would be very desirable, but I have no knowledge whether rt developers had reasons to do it like this. Nevertheless I think the issue with verifier running against a wrong struct still needs to be addressed.
On Fri, 13 Oct 2023 08:01:34 +0200 Artem Savkov <asavkov@redhat.com> wrote: > > But looking at [0] and briefly reading some of the discussions you, > > Steven, had. I'm just wondering if it would be best to avoid > > increasing struct trace_entry altogether? It seems like preempt_count > > is actually a 4-bit field in trace context, so it doesn't seem like we > > really need to allocate an entire byte for both preempt_count and > > preempt_lazy_count. Why can't we just combine them and not waste 8 > > extra bytes for each trace event in a ring buffer? > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71 > > I agree that avoiding increase in struct trace_entry size would be very > desirable, but I have no knowledge whether rt developers had reasons to > do it like this. > > Nevertheless I think the issue with verifier running against a wrong > struct still needs to be addressed. Correct. My Ack is based on the current way things are done upstream. It was just that linux-rt showed the issue, where the code was not as robust as it should have been. To me this was a correctness issue, not an issue that had to do with how things are done in linux-rt. As for the changes in linux-rt, they are not upstream yet. I'll have my comments on that code when that happens. -- Steve
On Fri, Oct 13, 2023 at 7:00 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 13 Oct 2023 08:01:34 +0200 > Artem Savkov <asavkov@redhat.com> wrote: > > > > But looking at [0] and briefly reading some of the discussions you, > > > Steven, had. I'm just wondering if it would be best to avoid > > > increasing struct trace_entry altogether? It seems like preempt_count > > > is actually a 4-bit field in trace context, so it doesn't seem like we > > > really need to allocate an entire byte for both preempt_count and > > > preempt_lazy_count. Why can't we just combine them and not waste 8 > > > extra bytes for each trace event in a ring buffer? > > > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71 > > > > I agree that avoiding increase in struct trace_entry size would be very > > desirable, but I have no knowledge whether rt developers had reasons to > > do it like this. > > > > Nevertheless I think the issue with verifier running against a wrong > > struct still needs to be addressed. > > Correct. My Ack is based on the current way things are done upstream. > It was just that linux-rt showed the issue, where the code was not as > robust as it should have been. To me this was a correctness issue, not > an issue that had to do with how things are done in linux-rt. I think we should at least add some BUILD_BUG_ON() that validates offsets in syscall_tp_t matches the ones in syscall_trace_enter and syscall_trace_exit, to fail more loudly if there is any mismatch in the future. WDYT? > > As for the changes in linux-rt, they are not upstream yet. I'll have my > comments on that code when that happens. Ah, ok, cool. I'd appreciate you cc'ing bpf@vger.kernel.org in that discussion, thank you! > > -- Steve
On Fri, 13 Oct 2023 12:43:18 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > Correct. My Ack is based on the current way things are done upstream. > > It was just that linux-rt showed the issue, where the code was not as > > robust as it should have been. To me this was a correctness issue, not > > an issue that had to do with how things are done in linux-rt. > > I think we should at least add some BUILD_BUG_ON() that validates > offsets in syscall_tp_t matches the ones in syscall_trace_enter and > syscall_trace_exit, to fail more loudly if there is any mismatch in > the future. WDYT? If you want to, feel free to send a patch. > > > > > As for the changes in linux-rt, they are not upstream yet. I'll have my > > comments on that code when that happens. > > Ah, ok, cool. I'd appreciate you cc'ing bpf@vger.kernel.org in that > discussion, thank you! If I remember ;-) -- Steve
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index de753403cdafb..9c581d6da843a 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -556,7 +556,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re { struct syscall_tp_t { struct trace_entry ent; - unsigned long syscall_nr; + int syscall_nr; unsigned long args[SYSCALL_DEFINE_MAXARGS]; } __aligned(8) param; int i; @@ -661,7 +661,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg { struct syscall_tp_t { struct trace_entry ent; - unsigned long syscall_nr; + int syscall_nr; unsigned long ret; } __aligned(8) param; diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h index f799d87e87002..897061930cb76 100644 --- a/tools/testing/selftests/bpf/progs/profiler.inc.h +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h @@ -609,7 +609,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write, } SEC("tracepoint/syscalls/sys_enter_kill") -int tracepoint__syscalls__sys_enter_kill(struct trace_event_raw_sys_enter* ctx) +int tracepoint__syscalls__sys_enter_kill(struct syscall_trace_enter* ctx) { struct bpf_func_stats_ctx stats_ctx; diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c b/tools/testing/selftests/bpf/progs/test_vmlinux.c index 4b8e37f7fd06c..78b23934d9f8f 100644 --- a/tools/testing/selftests/bpf/progs/test_vmlinux.c +++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c @@ -16,12 +16,12 @@ bool kprobe_called = false; bool fentry_called = false; SEC("tp/syscalls/sys_enter_nanosleep") -int handle__tp(struct trace_event_raw_sys_enter *args) +int handle__tp(struct syscall_trace_enter *args) { struct __kernel_timespec *ts; long tv_nsec; - if (args->id != __NR_nanosleep) + if (args->nr != __NR_nanosleep) return 0; ts = (void *)args->args[0];