Message ID | 20230810123905.1531061-1-zhengyejian1@huawei.com |
---|---|
State | New |
Headers |
From: Zheng Yejian <zhengyejian1@huawei.com>
To: rostedt@goodmis.org, mhiramat@kernel.org
Cc: laijs@cn.fujitsu.com, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, zhengyejian1@huawei.com
Subject: [PATCH] tracing: Fix race when concurrently splice_read trace_pipe
Date: Thu, 10 Aug 2023 20:39:05 +0800
Message-ID: <20230810123905.1531061-1-zhengyejian1@huawei.com>
X-Mailer: git-send-email 2.25.1 |
Series |
tracing: Fix race when concurrently splice_read trace_pipe
|
|
Commit Message
Zheng Yejian
Aug. 10, 2023, 12:39 p.m. UTC
When trace_pipe and per_cpu/cpu*/trace_pipe are splice_read concurrently,
more data is read out than expected.

The root cause is that in tracing_splice_read_pipe(), the entry is found
outside the locks, so it may be read by multiple readers or consumed by
another reader while it is being printed.

To fix it, find the entry only after the locks are held.
Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
---
kernel/trace/trace.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
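The pattern the patch removes, looking up the next entry before the locks are taken, is a classic check-then-act race. Below is a minimal user-space sketch of that pattern; it is not the kernel code, and the item buffer, thread count and names are made up for illustration. Two readers peek at a shared cursor outside the mutex, so both can pick the same item and it gets handed out twice, which is the duplicated-event symptom reported in the thread. Moving the peek under the lock, as the patch does with trace_find_next_entry_inc(), closes the window.

```c
#include <pthread.h>
#include <stdio.h>

#define N_ITEMS 100000

static int next_item;                 /* shared cursor, stand-in for the ring buffer position */
static int consumed[N_ITEMS];         /* how many times each item was handed to a reader */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *reader(void *arg)
{
	(void)arg;
	for (;;) {
		int item = next_item; /* BUG: the "find next entry" happens outside the lock */

		if (item >= N_ITEMS)
			return NULL;

		pthread_mutex_lock(&lock);
		consumed[item]++;     /* "print" the entry */
		next_item = item + 1; /* consume it */
		pthread_mutex_unlock(&lock);
	}
}

int main(void)
{
	pthread_t t1, t2;
	int dups = 0;

	pthread_create(&t1, NULL, reader, NULL);
	pthread_create(&t2, NULL, reader, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	for (int i = 0; i < N_ITEMS; i++)
		if (consumed[i] > 1)
			dups++;

	/* With the peek moved inside the mutex, dups stays at 0. */
	printf("items handed out more than once: %d\n", dups);
	return 0;
}
```

Compile with `gcc -pthread`; the duplicate count is timing dependent, but with two readers it is normally well above zero.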
Comments
On Thu, 10 Aug 2023 20:39:05 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> there are more data being read out than expected.
>
> The root cause is that in tracing_splice_read_pipe(), an entry is found
> outside locks, it may be read by multiple readers or consumed by other
> reader as starting printing it.
>
> To fix it, change to find entry after holding locks.
>
> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
> ---
>  kernel/trace/trace.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index b8870078ef58..f169d33b948f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>  	if (ret <= 0)
>  		goto out_err;
>
> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
> +	trace_event_read_lock();
> +	trace_access_lock(iter->cpu_file);
> +
> +	if (!trace_find_next_entry_inc(iter)) {

It seems you skips '!iter->ent' check. Is there any reason for this change?

Thank you,

> +		trace_access_unlock(iter->cpu_file);
> +		trace_event_read_unlock();
>  		ret = -EFAULT;
>  		goto out_err;
>  	}
>
> -	trace_event_read_lock();
> -	trace_access_lock(iter->cpu_file);
> -
>  	/* Fill as many pages as possible. */
>  	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
>  		spd.pages[i] = alloc_page(GFP_KERNEL);
> --
> 2.25.1
>
On 2023/8/11 19:42, Masami Hiramatsu (Google) wrote:
> On Thu, 10 Aug 2023 20:39:05 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
>
>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>> there are more data being read out than expected.
>>
>> The root cause is that in tracing_splice_read_pipe(), an entry is found
>> outside locks, it may be read by multiple readers or consumed by other
>> reader as starting printing it.
>>
>> To fix it, change to find entry after holding locks.
>>
>> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
>> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
>> ---
>>  kernel/trace/trace.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index b8870078ef58..f169d33b948f 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>>  	if (ret <= 0)
>>  		goto out_err;
>>
>> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
>> +	trace_event_read_lock();
>> +	trace_access_lock(iter->cpu_file);
>> +
>> +	if (!trace_find_next_entry_inc(iter)) {
>
> It seems you skips '!iter->ent' check. Is there any reason for this change?

IIUC, 'iter->ent' may be the entry that was found but not consumed
in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
may have being consumed, so we may should find a new 'iter->ent' before
printing it in tracing_fill_pipe_page(), see following reduced codes:

  tracing_splice_read_pipe() {
    if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find entry here
      ... ...
    }
    tracing_fill_pipe_page() {
      for (;;) {
        ... ...
        ret = print_trace_line(iter);  // 2. print entry
        ... ...
        if (!trace_find_next_entry_inc()) {  // 3. find next entry
          ... ...
          break;
        }
      }

--

Thanks,
Zheng Yejian

>
> Thank you,
>
>> +		trace_access_unlock(iter->cpu_file);
>> +		trace_event_read_unlock();
>>  		ret = -EFAULT;
>>  		goto out_err;
>>  	}
>>
>> -	trace_event_read_lock();
>> -	trace_access_lock(iter->cpu_file);
>> -
>>  	/* Fill as many pages as possible. */
>>  	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
>>  		spd.pages[i] = alloc_page(GFP_KERNEL);
>> --
>> 2.25.1
>>
>
>
On Fri, 11 Aug 2023 20:37:07 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> On 2023/8/11 19:42, Masami Hiramatsu (Google) wrote:
> > On Thu, 10 Aug 2023 20:39:05 +0800
> > Zheng Yejian <zhengyejian1@huawei.com> wrote:
> >
> >> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> >> there are more data being read out than expected.
> >>
> >> The root cause is that in tracing_splice_read_pipe(), an entry is found
> >> outside locks, it may be read by multiple readers or consumed by other
> >> reader as starting printing it.
> >>
> >> To fix it, change to find entry after holding locks.
> >>
> >> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
> >> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
> >> ---
> >>  kernel/trace/trace.c | 10 ++++++----
> >>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> >> index b8870078ef58..f169d33b948f 100644
> >> --- a/kernel/trace/trace.c
> >> +++ b/kernel/trace/trace.c
> >> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> >>  	if (ret <= 0)
> >>  		goto out_err;
> >>
> >> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
> >> +	trace_event_read_lock();
> >> +	trace_access_lock(iter->cpu_file);
> >> +
> >> +	if (!trace_find_next_entry_inc(iter)) {
> >
> > It seems you skips '!iter->ent' check. Is there any reason for this change?
>
> IIUC, 'iter->ent' may be the entry that was found but not consumed
> in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
> may have being consumed, so we may should find a new 'iter->ent' before
> printing it in tracing_fill_pipe_page(), see following reduced codes:

And if it wasn't consumed? We just lost it?

>
> tracing_splice_read_pipe() {
>   if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find entry here
>     ... ...
>   }
>   tracing_fill_pipe_page() {
>     for (;;) {
>       ... ...
>       ret = print_trace_line(iter);  // 2. print entry
>       ... ...

You missed:

	count = trace_seq_used(&iter->seq) - save_len;
	if (rem < count) {
		rem = 0;
		iter->seq.seq.len = save_len;

Where the above just threw away what was printed in the above
"print_trace_line()", and it never went to console.

		break;
	}

-- Steve

>       if (!trace_find_next_entry_inc()) {  // 3. find next entry
>         ... ...
>         break;
>       }
>     }
>
> --
>
> Thanks,
> Zheng Yejian
>
> >
> > Thank you,
> >
> >> +		trace_access_unlock(iter->cpu_file);
> >> +		trace_event_read_unlock();
> >>  		ret = -EFAULT;
> >>  		goto out_err;
> >>  	}
> >>
> >> -	trace_event_read_lock();
> >> -	trace_access_lock(iter->cpu_file);
> >> -
> >>  	/* Fill as many pages as possible. */
> >>  	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
> >>  		spd.pages[i] = alloc_page(GFP_KERNEL);
> >> --
> >> 2.25.1
> >>
> >
> >
On Thu, 10 Aug 2023 20:39:05 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> there are more data being read out than expected.

Honestly the real fix is to prevent that use case. We should probably have
access to trace_pipe lock all the per_cpu trace_pipes too.

-- Steve
On 2023/8/12 03:25, Steven Rostedt wrote:
> On Thu, 10 Aug 2023 20:39:05 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
>
>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>> there are more data being read out than expected.

Sorry, I didn't make clear here. It not just read more but also lost
some data. My case is that, for example:
  1) Inject 3 events into ring_buffer: event1, event2, event3;
  2) Concurrently splice_read through trace_pipes;
  3) Then actually read out: event1, event3, event3. No event2, but 2 event3.

>
> Honestly the real fix is to prevent that use case. We should probably have
> access to trace_pipe lock all the per_cpu trace_pipes too.

Yes, we could do that, but would it seem not that effective?
because per_cpu trace_pipe only read its own ring_buffer and not race
with ring_buffers in other cpus.

>
> -- Steve
>
On 2023/8/12 03:24, Steven Rostedt wrote:
> On Fri, 11 Aug 2023 20:37:07 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
>
>> On 2023/8/11 19:42, Masami Hiramatsu (Google) wrote:
>>> On Thu, 10 Aug 2023 20:39:05 +0800
>>> Zheng Yejian <zhengyejian1@huawei.com> wrote:
>>>
>>>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>>>> there are more data being read out than expected.
>>>>
>>>> The root cause is that in tracing_splice_read_pipe(), an entry is found
>>>> outside locks, it may be read by multiple readers or consumed by other
>>>> reader as starting printing it.
>>>>
>>>> To fix it, change to find entry after holding locks.
>>>>
>>>> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
>>>> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
>>>> ---
>>>>  kernel/trace/trace.c | 10 ++++++----
>>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>>>> index b8870078ef58..f169d33b948f 100644
>>>> --- a/kernel/trace/trace.c
>>>> +++ b/kernel/trace/trace.c
>>>> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>>>>  	if (ret <= 0)
>>>>  		goto out_err;
>>>>
>>>> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
>>>> +	trace_event_read_lock();
>>>> +	trace_access_lock(iter->cpu_file);
>>>> +
>>>> +	if (!trace_find_next_entry_inc(iter)) {
>>>
>>> It seems you skips '!iter->ent' check. Is there any reason for this change?
>>
>> IIUC, 'iter->ent' may be the entry that was found but not consumed
>> in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
>> may have being consumed, so we may should find a new 'iter->ent' before
>> printing it in tracing_fill_pipe_page(), see following reduced codes:
>
> And if it wasn't consumed? We just lost it?

If 'iter->ent' was not consumed, trace_find_next_entry_inc() will find
it again, will it?

-- Zheng Yejian

>
>>
>> tracing_splice_read_pipe() {
>>   if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find entry here
>>     ... ...
>>   }
>>   tracing_fill_pipe_page() {
>>     for (;;) {
>>       ... ...
>>       ret = print_trace_line(iter);  // 2. print entry
>>       ... ...
>
> You missed:
>
> 	count = trace_seq_used(&iter->seq) - save_len;
> 	if (rem < count) {
> 		rem = 0;
> 		iter->seq.seq.len = save_len;
>
> Where the above just threw away what was printed in the above
> "print_trace_line()", and it never went to console.
>
> 		break;
> 	}
>

Thanks for pointing this out!

-- Zheng Yejian

> -- Steve
>
>
>>       if (!trace_find_next_entry_inc()) {  // 3. find next entry
>>         ... ...
>>         break;
>>       }
>>     }
>>
>> --
>>
>> Thanks,
>> Zheng Yejian
>>
>>>
>>> Thank you,
>>>
>>>> +		trace_access_unlock(iter->cpu_file);
>>>> +		trace_event_read_unlock();
>>>>  		ret = -EFAULT;
>>>>  		goto out_err;
>>>>  	}
>>>>
>>>> -	trace_event_read_lock();
>>>> -	trace_access_lock(iter->cpu_file);
>>>> -
>>>>  	/* Fill as many pages as possible. */
>>>>  	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
>>>>  		spd.pages[i] = alloc_page(GFP_KERNEL);
>>>> --
>>>> 2.25.1
>>>>
>>>
>>>
>
>
On 2023/8/12 03:25, Steven Rostedt wrote:
> On Thu, 10 Aug 2023 20:39:05 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
>
>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>> there are more data being read out than expected.
>
> Honestly the real fix is to prevent that use case. We should probably have
> access to trace_pipe lock all the per_cpu trace_pipes too.
>
> -- Steve
>

Hi~

Reproduction testcase is show as below, it can always reproduce the
issue in v5.10, and after this patch, the testcase passed.

In v5.10, when run `cat trace_pipe > /tmp/myfile &`, it call
sendfile() to transmit data from trace_pipe into /tmp/myfile. And in
kernel, .splice_read() of trace_pipe is called then the issue is
reproduced.

However in the newest v6.5, this reproduction case didn't run into the
.splice_read() of trace_pipe, because after commit 97ef77c52b78 ("fs:
check FMODE_LSEEK to control internal pipe splicing"), non-seekable
trace_pipe cannot be sendfile-ed.

``` repro.sh
#!/bin/bash

do_test() {
	local trace_dir=/sys/kernel/tracing
	local trace=${trace_dir}/trace
	local old_trace_lines
	local new_trace_lines
	local tempfiles
	local testlog="trace pipe concurrency issue"
	local pipe_pids
	local i
	local write_cnt=1000
	local read_cnt=0
	local nr_cpu=`nproc`

	# 1. At first, clear all ring buffer
	echo > ${trace}

	# 2. Count how many lines in trace file now
	old_trace_lines=`cat ${trace} | wc -l`

	# 3. Close water mark so that reader can read as event comes
	echo 0 > ${trace_dir}/buffer_percent

	# 4. Read percpu trace_pipes into local file on background.
	#    Splice read must be used under command 'cat' so that the racy
	#    issue can be reproduced !!!
	i=0
	while [ ${i} -lt ${nr_cpu} ]; do
		tempfiles[${i}]=/tmp/percpu_trace_pipe_${i}
		cat ${trace_dir}/per_cpu/cpu${i}/trace_pipe > ${tempfiles[${i}]} &
		pipe_pids[${i}]=$!
		let i=i+1
	done

	# 5. Read main trace_pipe into local file on background.
	#    The same, splice read must be used to reproduce the issue !!!
	tempfiles[${i}]=/tmp/main_trace_pipe
	cat ${trace_dir}/trace_pipe > ${tempfiles[${i}]} &
	pipe_pids[${i}]=$!

	echo "Take a break, let readers run."
	sleep 3

	# 6. Write events into ring buffer through trace_marker, so that
	#    hungry readers start racing these events.
	i=0
	while [ ${i} -lt ${write_cnt} ]; do
		echo "${testlog} <${i}>" > ${trace_dir}/trace_marker
		let i=i+1
	done

	# 7. Wait until all events being consumed
	new_trace_lines=`cat ${trace} | wc -l`
	while [ "${new_trace_lines}" != "${old_trace_lines}" ]; do
		new_trace_lines=`cat ${trace} | wc -l`
		sleep 1
	done
	echo "All written events have been consumed."

	# 8. Kill all readers and count the events readed
	i=0
	while [ ${i} -lt ${#pipe_pids[*]} ]; do
		local num
		kill -9 ${pipe_pids[${i}]}
		wait ${pipe_pids[${i}]}
		num=`cat ${tempfiles[${i}]} | grep "${testlog}" | wc -l`
		let read_cnt=read_cnt+num
		let i=i+1
	done

	# 9. Expect to read events as much as write
	if [ "${read_cnt}" != "${write_cnt}" ]; then
		echo "Test fail: write ${write_cnt} but read ${read_cnt} !!!"
		return 1
	fi

	# 10. Clean temp files if test success
	i=0
	while [ ${i} -lt ${#tempfiles[*]} ]; do
		rm ${tempfiles[${i}]}
		let i=i+1
	done

	return 0
}

do_test
```

-- Zheng Yejian
On Sat, 12 Aug 2023 09:45:52 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> On 2023/8/12 03:25, Steven Rostedt wrote:
> > On Thu, 10 Aug 2023 20:39:05 +0800
> > Zheng Yejian <zhengyejian1@huawei.com> wrote:
> >
> >> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> >> there are more data being read out than expected.
>
> Sorry, I didn't make clear here. It not just read more but also lost
> some data. My case is that, for example:
>   1) Inject 3 events into ring_buffer: event1, event2, event3;
>   2) Concurrently splice_read through trace_pipes;
>   3) Then actually read out: event1, event3, event3. No event2, but 2 event3.
>
> >
> > Honestly the real fix is to prevent that use case. We should probably have
> > access to trace_pipe lock all the per_cpu trace_pipes too.
>
> Yes, we could do that, but would it seem not that effective?
> because per_cpu trace_pipe only read its own ring_buffer and not race
> with ring_buffers in other cpus.

I think Steve said that only one of below is usable.

 - Read trace_pipe

or

 - Read per_cpu/cpu*/trace_pipe concurrently

And I think this makes sense, especially if you use splice (this *moves*
the page from the ring_buffer to other pipe).

Thank you,

>
> >
> > -- Steve
> >
>
On Sat, 12 Aug 2023 10:22:43 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> > And if it wasn't consumed? We just lost it?
>
> If 'iter->ent' was not consumed, trace_find_next_entry_inc() will find
> it again, will it?

No it will not, because it did the "inc". That means the next time it is
called, it will find the next item to read, thinking it already "consumed"
the last one.

-- Steve
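For readers following the "inc" discussion: the distinction is between a lookup that merely peeks at the next entry and one that also advances the iterator. Below is a simplified model of the two flavours; the struct layout and helper names are invented for illustration, the real functions operate on struct trace_iterator in kernel/trace/trace.c.

```c
#include <stddef.h>

struct entry {
	int id;
};

struct iter {
	struct entry *ents;   /* stand-in for the ring buffer contents */
	int count;
	int pos;              /* index of the next entry to hand out */
	struct entry *ent;    /* entry currently being printed */
};

/* Peek: return the next entry without advancing the iterator. */
static struct entry *find_next_entry(struct iter *it)
{
	return it->pos < it->count ? &it->ents[it->pos] : NULL;
}

/*
 * Peek and advance ("inc"): the entry is considered consumed, so a later
 * call returns the following entry, never this one again.
 */
static struct entry *find_next_entry_inc(struct iter *it)
{
	struct entry *e = find_next_entry(it);

	if (e) {
		it->ent = e;
		it->pos++;
	}
	return e;
}
```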
On Sat, 12 Aug 2023 15:38:12 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> Reproduction testcase is show as below, it can always reproduce the
> issue in v5.10, and after this patch, the testcase passed.
>
> In v5.10, when run `cat trace_pipe > /tmp/myfile &`, it call
> sendfile() to transmit data from trace_pipe into /tmp/myfile. And in
> kernel, .splice_read() of trace_pipe is called then the issue is
> reproduced.
>
> However in the newest v6.5, this reproduction case didn't run into the
> .splice_read() of trace_pipe, because after commit 97ef77c52b78 ("fs:
> check FMODE_LSEEK to control internal pipe splicing"), non-seekable
> trace_pipe cannot be sendfile-ed.

So the test case cannot be run because the "sendfile" on the
trace_pipe now fails?

I'm not sure if this is considered a user space regression or not, but I
figured I'd let the interested parties know. I don't know what tools out
there records trace_pipe via sendfile, but there might be some.

-- Steve
On Sat, 12 Aug 2023 at 18:13, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> So the test case cannot be run because the "sendfile" on the
> trace_pipe now fails?

So sendfile() has always required a seekable source, because it -
intentionally - doesn't do the "splice to pipe and then splice from pipe
to destination".

It just does a "splice from source, splice result to destination".

That sounds like "obviously the right thing to do", but it means that
there is now no buffer with any extended lifetime between the two
operations.

And that's a big deal. Without that buffer (ie pipe) in the middle, if the
splice to destination fails - or is partial - and the system call is
interrupted by a signal, then the source splice data is now gone, gone, gone.

So it only works if the source can then re-create the data - ie if the
source is seekable. In that case, if the destination cannot accept all the
data, the sendfile can just go back and read again from the last
successfully written position.

And if you are a data stream that can't seek, then that "from last
successfully written position" doesn't work, and any partial writes will
just have dropped the data.

This seekability test *used* to be to test that the source was either a
regular file or a block device. Now it literally checks "can I seek".

It looks like the trace_pipe file is *claiming* to be a regular file - so
sendfile() used to be ok with it - but cannot actually seek - so sendfile
would silently drop data.

Now sendfile says "I'm sorry, Dave, I'm afraid I can't do that" rather
than silently causing data loss.

Now, this is not a user-visible regression in this case, because "cat"
will always fall back on just regular read/write when sendfile fails.

So all that changed for 'trace_pipe' is that a buggy splice now no longer
triggers the bug that it used to (which the patch in question was also
trying to fix).

End result: things in many ways work better this way. So it really looks
like it never really worked before either. There was *both* the dropped
data bug because "trace_pipe" lied about being a regular file, *and*
apparently this trace_pipe race bug that Zheng Yejian tried to fix.

Of course, some destinations always accept a full write, so maybe we could
relax the "source must be seekable" to be "source must be seekable, _OR_
destination must be something that never returns partial writes".

So sendfile to a POSIX-compliant regular file could always work (ignoring
filesystem full situations, and at that point I think we can say "yeah,
we're done, no recovering from that in sendfile()").

So if somebody really *really* want sendfile to work for this case, then you

 (a) do need to fix the race in tracing_splice_read_pipe (which you
should do anyway, since you can obviously always use splice() itself,
not sendfile()).

AND

 (b) figure out when 'splice_write()' will always succeed fully and
convince people that we can do that "extended version" of sendfile().

Hmm?

              Linus
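As an illustration of the fallback behaviour described above, the sketch below tries sendfile() first and drops back to a plain read()/write() loop when the kernel refuses the non-seekable source. The paths, the buffer size and the exact errno are assumptions made for the example; the point is only the sendfile-then-fallback structure that cat relies on.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

int main(void)
{
	const char *src = "/sys/kernel/tracing/trace_pipe";  /* non-seekable stream */
	int in = open(src, O_RDONLY);
	int out = open("/tmp/myfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	char buf[65536];
	ssize_t n;

	if (in < 0 || out < 0) {
		perror("open");
		return 1;
	}

	n = sendfile(out, in, NULL, sizeof(buf));
	if (n < 0) {
		/* Kernel refused the non-seekable source; copy by hand instead. */
		fprintf(stderr, "sendfile: %s, falling back to read/write\n",
			strerror(errno));
		while ((n = read(in, buf, sizeof(buf))) > 0)
			if (write(out, buf, n) != n)
				break;
	} else {
		/* Seekable source: keep using sendfile() until EOF. */
		while ((n = sendfile(out, in, NULL, sizeof(buf))) > 0)
			;
	}

	close(in);
	close(out);
	return 0;
}
```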
On Sun, 13 Aug 2023 09:41:33 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So if somebody really *really* want sendfile to work for this case, then you
>
> (a) do need to fix the race in tracing_splice_read_pipe (which you
> should do anyway, since you can obviously always use splice() itself,
> not sendfile()).
>
> AND
>
> (b) figure out when 'splice_write()' will always succeed fully and
> convince people that we can do that "extended version" of sendfile().

No big deal. I really don't care about splicing trace_pipe anyway. That
was added by others, but the trace_pipe_raw is what really should be used.

Zheng's race needs to be fixed regardless, but I just wanted to make sure
this wasn't some kind of hidden regression, as there were patches to
trace_pipe to make sendfile get fixed in the past.

  a29054d9478d0 ("tracing: Fix crash from reading trace_pipe with sendfile")

But that too was triggered by cat. If cat no longer uses sendfile for
trace_pipe, I'm fine with it.

-- Steve
On Sat, 12 Aug 2023 10:22:43 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> >>>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> >>>> index b8870078ef58..f169d33b948f 100644
> >>>> --- a/kernel/trace/trace.c
> >>>> +++ b/kernel/trace/trace.c
> >>>> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> >>>>  	if (ret <= 0)
> >>>>  		goto out_err;
> >>>>
> >>>> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
> >>>> +	trace_event_read_lock();
> >>>> +	trace_access_lock(iter->cpu_file);
> >>>> +
> >>>> +	if (!trace_find_next_entry_inc(iter)) {
> >>>
> >>> It seems you skips '!iter->ent' check. Is there any reason for this change?
> >>
> >> IIUC, 'iter->ent' may be the entry that was found but not consumed
> >> in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
> >> may have being consumed, so we may should find a new 'iter->ent' before
> >> printing it in tracing_fill_pipe_page(), see following reduced codes:
> >
> > And if it wasn't consumed? We just lost it?
>
> If 'iter->ent' was not consumed, trace_find_next_entry_inc() will find
> it again, will it?
>
> -- Zheng Yejian
>
> >
> >>
> >> tracing_splice_read_pipe() {
> >>   if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find entry here
> >>     ... ...
> >>   }
> >>   tracing_fill_pipe_page() {
> >>     for (;;) {
> >>       ... ...
> >>       ret = print_trace_line(iter);  // 2. print entry
> >>       ... ...
> >
> > You missed:
> >
> > 	count = trace_seq_used(&iter->seq) - save_len;
> > 	if (rem < count) {
> > 		rem = 0;
> > 		iter->seq.seq.len = save_len;
> >
> > Where the above just threw away what was printed in the above
> > "print_trace_line()", and it never went to console.
> >
> > 		break;
> > 	}
> >
>
> Thanks for pointing this out!

Just to get this moving again, I believe we should add a ref count to
trace_pipe and the per_cpu trace_pipes, where if they are opened, nothing
else can read it.

Opening trace_pipe locks all per_cpu ref counts, if any of them are open,
then the trace_pipe open will fail (and releases any ref counts it had
taken).

Opening a per_cpu trace_pipe will up the ref count for just that CPU
buffer. This will allow multiple tasks to read different per_cpu
trace_pipe files, but will prevent the main trace_pipe file from being
opened.

Does that work for this?

-- Steve
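A user-space model of the proposed open()-time policy might look like the sketch below. All names and the locking scheme are hypothetical; an in-kernel version would hang the counters off the trace_array and have the .open callbacks return -EBUSY instead of a bool.

```c
#include <pthread.h>
#include <stdbool.h>

#define NR_CPUS 8

static pthread_mutex_t pipe_lock = PTHREAD_MUTEX_INITIALIZER;
static int per_cpu_ref[NR_CPUS];   /* readers of per_cpu/cpuN/trace_pipe */

/* Opening the main trace_pipe takes every per-CPU ref; it fails (and gives
 * back whatever it already took) if any per-CPU pipe is in use. */
static bool open_trace_pipe(void)
{
	pthread_mutex_lock(&pipe_lock);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (per_cpu_ref[cpu]) {
			while (cpu--)
				per_cpu_ref[cpu]--;
			pthread_mutex_unlock(&pipe_lock);
			return false;
		}
		per_cpu_ref[cpu]++;
	}
	pthread_mutex_unlock(&pipe_lock);
	return true;
}

/* Opening a per-CPU trace_pipe only needs its own CPU to be free, so readers
 * on different CPUs can coexist, but the main trace_pipe cannot be opened
 * while any of them is held. */
static bool open_percpu_trace_pipe(int cpu)
{
	bool ok;

	pthread_mutex_lock(&pipe_lock);
	ok = (per_cpu_ref[cpu] == 0);
	if (ok)
		per_cpu_ref[cpu]++;
	pthread_mutex_unlock(&pipe_lock);
	return ok;
}

int main(void)
{
	bool a = open_percpu_trace_pipe(0);  /* succeeds */
	bool b = open_trace_pipe();          /* fails: cpu0 is already held */

	return (a && !b) ? 0 : 1;
}
```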
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b8870078ef58..f169d33b948f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
 	if (ret <= 0)
 		goto out_err;
 
-	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
+	trace_event_read_lock();
+	trace_access_lock(iter->cpu_file);
+
+	if (!trace_find_next_entry_inc(iter)) {
+		trace_access_unlock(iter->cpu_file);
+		trace_event_read_unlock();
 		ret = -EFAULT;
 		goto out_err;
 	}
 
-	trace_event_read_lock();
-	trace_access_lock(iter->cpu_file);
-
 	/* Fill as many pages as possible. */
 	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
 		spd.pages[i] = alloc_page(GFP_KERNEL);