Message ID | 20230414082300.34798-1-adrian.hunter@intel.com |
---|---|
From | Adrian Hunter <adrian.hunter@intel.com> |
To | Peter Zijlstra <peterz@infradead.org> |
Cc | Ingo Molnar <mingo@redhat.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Ian Rogers <irogers@google.com>, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org |
Subject | [PATCH RFC 0/5] perf: Add ioctl to emit sideband events |
Date | Fri, 14 Apr 2023 11:22:55 +0300 |
Series | perf: Add ioctl to emit sideband events |
Message
Adrian Hunter
April 14, 2023, 8:22 a.m. UTC
Hi

Here is a stab at adding an ioctl for sideband events.

This is to overcome races when reading the same information
from /proc.

To keep it simple, the ioctl is limited to emitting existing
sideband events (fork, namespaces, comm, mmap) to an already
active context.

There are not yet any perf tools patches at this stage.

Adrian Hunter (5):
  perf: Add ioctl to emit sideband events
  perf: Add fork to the sideband ioctl
  perf: Add namespaces to the sideband ioctl
  perf: Add comm to the sideband ioctl
  perf: Add mmap to the sideband ioctl

 include/uapi/linux/perf_event.h |  19 ++-
 kernel/events/core.c            | 315 +++++++++++++++++++++++++++++++++-------
 2 files changed, 280 insertions(+), 54 deletions(-)

Regards
Adrian
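For orientation, below is a minimal user-space sketch of how a tool might drive such an ioctl on an already running event fd. The request name, ioctl number and argument layout are illustrative assumptions only; the actual UAPI is whatever the series adds to include/uapi/linux/perf_event.h.

```c
/*
 * Hypothetical sketch only: the real request name, ioctl number and
 * argument structure are defined by the patch series, not by this example.
 */
#include <sys/ioctl.h>
#include <linux/types.h>

/* Assumed layout and number, for illustration; perf ioctls use the '$' magic. */
struct emit_sideband_arg {
	__u32 type;	/* e.g. a fork, namespaces, comm or mmap record */
	__u32 pid;	/* task whose current state should be emitted */
};
#define PERF_EVENT_IOC_EMIT_SIDEBAND	_IOW('$', 12, struct emit_sideband_arg)

/*
 * Ask the kernel to emit current sideband state for @pid into the
 * ring buffer of an already active event (@perf_fd).
 */
static int emit_sideband(int perf_fd, __u32 type, __u32 pid)
{
	struct emit_sideband_arg arg = { .type = type, .pid = pid };

	return ioctl(perf_fd, PERF_EVENT_IOC_EMIT_SIDEBAND, &arg);
}
```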
Comments
On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> Hi
>
> Here is a stab at adding an ioctl for sideband events.
>
> This is to overcome races when reading the same information
> from /proc.

What races? Are you talking about reading old state in /proc the kernel
delivering a sideband event for the new state, and then you writing the
old state out?

Surely that's something perf tool can fix without kernel changes?
On Mon, Apr 17, 2023 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> > Hi
> >
> > Here is a stab at adding an ioctl for sideband events.
> >
> > This is to overcome races when reading the same information
> > from /proc.
>
> What races? Are you talking about reading old state in /proc the kernel
> delivering a sideband event for the new state, and then you writing the
> old state out?
>
> Surely that's something perf tool can fix without kernel changes?

So my reading is that during event synthesis there are races between
reading the different /proc files. There is still, I believe, a race
in with perf record/top with uid filtering which reminds me of this.
The uid filtering race is that we scan /proc to find processes (pids)
for a uid, we then synthesize the maps for each of these pids but if a
pid starts or exits we either error out or don't sample that pid. I
believe the error out behavior is easy to hit 100% of the time making
uid mode of limited use.

This may be for something other than synthesis, but for synthesis a
few points are:
- as servers get bigger and consequently more jobs get consolidated
on them, synthesis is slow (hence --num-thread-synthesize) and also
the events dominate the perf.data file - perhaps >90% of the file
size, and a lot of that will be for processes with no samples in them.
Another issue here is that all those file descriptors don't come for
free in the kernel.
- BPF has buildid+offset stack traces that remove the need for
synthesis by having more expensive stack generation. I believe this is
unpopular as adding this as a variant for every kind of event would be
hard, but perhaps we can do some low-hanging fruit like instructions
and cycles.
- I believe Jiri looked at doing synthesis with BPF. Perhaps we could
do something similar to the off-cpu and tail-synthesize, where more
things happen at the tail end of perf. Off-cpu records data in maps
that it then synthesizes into samples.

There is also a long standing issue around not sampling munmap (or
mremap) that causes plenty of issues. Perhaps if we had less mmap in
the perf.data file we could add these.

Thanks,
Ian
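As a side note on the uid-filtering failure mode described above, the usual tool-side mitigation is to treat a pid that disappears mid-scan as non-fatal rather than aborting the whole synthesis pass. A rough sketch of that idea follows; it is not the actual perf tool code, and the helper name is purely illustrative.

```c
/*
 * Sketch: walk /proc and skip processes that exit mid-scan instead of
 * failing the whole synthesis pass. Not the actual perf tool code.
 */
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static void synthesize_pid(long pid)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%ld/maps", pid);
	f = fopen(path, "r");
	if (!f) {
		if (errno == ENOENT || errno == ESRCH)
			return;		/* pid exited between readdir() and open(): skip */
		perror(path);		/* anything else is a real error */
		return;
	}
	/* ... read maps and emit synthesized MMAP events ... */
	fclose(f);
}

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *d;

	if (!proc)
		return 1;
	while ((d = readdir(proc)) != NULL) {
		if (isdigit((unsigned char)d->d_name[0]))
			synthesize_pid(strtol(d->d_name, NULL, 10));
	}
	closedir(proc);
	return 0;
}
```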
On 17/04/23 14:02, Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>> Hi
>>
>> Here is a stab at adding an ioctl for sideband events.
>>
>> This is to overcome races when reading the same information
>> from /proc.
>
> What races? Are you talking about reading old state in /proc the kernel
> delivering a sideband event for the new state, and then you writing the
> old state out?
>
> Surely that's something perf tool can fix without kernel changes?

Yes, and it was a bit of a brain fart not to realise that.

There may still be corner cases, where different kinds of events are
interdependent, perhaps NAMESPACES events vs MMAP events could
have ordering issues.

Putting that aside, the ioctl may be quicker than reading from
/proc. I could get some numbers and see what people think.
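For the simple race Peter describes, the tool-side fix amounts to an ordering rule: enable the events (so kernel sideband records begin flowing) before snapshotting /proc, and never let snapshot data overwrite state already learned from a kernel record (or, equivalently, order everything by timestamp). The sketch below shows one simple way to express that rule; it is illustrative only and not perf tool code.

```c
/*
 * Sketch of one tool-side ordering policy, assuming the /proc snapshot is
 * taken only after the events were enabled. Illustrative names, not perf code.
 */
enum state_src { FROM_PROC_SNAPSHOT, FROM_KERNEL_SIDEBAND };

struct task_state {
	enum state_src src;
	/* comm, mmaps, namespaces, ... would live here */
};

static void update_task(struct task_state *ts, enum state_src src)
{
	/*
	 * Simple policy: a kernel sideband record may replace snapshot data,
	 * but the (potentially stale) snapshot never replaces sideband data.
	 */
	if (ts->src == FROM_KERNEL_SIDEBAND && src == FROM_PROC_SNAPSHOT)
		return;
	ts->src = src;
	/* ... copy in the new comm/mmap/namespace data ... */
}
```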
On 17/04/23 19:37, Ian Rogers wrote:
> On Mon, Apr 17, 2023 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Here is a stab at adding an ioctl for sideband events.
>>>
>>> This is to overcome races when reading the same information
>>> from /proc.
>>
>> What races? Are you talking about reading old state in /proc the kernel
>> delivering a sideband event for the new state, and then you writing the
>> old state out?
>>
>> Surely that's something perf tool can fix without kernel changes?
>
> So my reading is that during event synthesis there are races between
> reading the different /proc files. There is still, I believe, a race
> in with perf record/top with uid filtering which reminds me of this.
> The uid filtering race is that we scan /proc to find processes (pids)
> for a uid, we then synthesize the maps for each of these pids but if a
> pid starts or exits we either error out or don't sample that pid. I
> believe the error out behavior is easy to hit 100% of the time making
> uid mode of limited use.
>
> This may be for something other than synthesis, but for synthesis a
> few points are:
> - as servers get bigger and consequently more jobs get consolidated
> on them, synthesis is slow (hence --num-thread-synthesize) and also
> the events dominate the perf.data file - perhaps >90% of the file
> size, and a lot of that will be for processes with no samples in them.

Note also, for hardware tracing, it isn't generally possible to know
that during tracing, and figuring it out afterwards and working
backwards may not be feasible.

> Another issue here is that all those file descriptors don't come for
> free in the kernel.
> - BPF has buildid+offset stack traces that remove the need for
> synthesis by having more expensive stack generation. I believe this is
> unpopular as adding this as a variant for every kind of event would be
> hard, but perhaps we can do some low-hanging fruit like instructions
> and cycles.
> - I believe Jiri looked at doing synthesis with BPF. Perhaps we could
> do something similar to the off-cpu and tail-synthesize, where more
> things happen at the tail end of perf. Off-cpu records data in maps
> that it then synthesizes into samples.
>
> There is also a long standing issue around not sampling munmap (or
> mremap) that causes plenty of issues. Perhaps if we had less mmap in
> the perf.data file we could add these.
>
> Thanks,
> Ian
On 18/04/23 09:18, Adrian Hunter wrote:
> On 17/04/23 14:02, Peter Zijlstra wrote:
>> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Here is a stab at adding an ioctl for sideband events.
>>>
>>> This is to overcome races when reading the same information
>>> from /proc.
>>
>> What races? Are you talking about reading old state in /proc the kernel
>> delivering a sideband event for the new state, and then you writing the
>> old state out?
>>
>> Surely that's something perf tool can fix without kernel changes?
>
> Yes, and it was a bit of a brain fart not to realise that.
>
> There may still be corner cases, where different kinds of events are
> interdependent, perhaps NAMESPACES events vs MMAP events could
> have ordering issues.
>
> Putting that aside, the ioctl may be quicker than reading from
> /proc. I could get some numbers and see what people think.
>

Here's a result with a quick hack to use the ioctl but without
handling the buffer becoming full (hence the -m4M)

 # ps -e | wc -l
 1171
 # perf.old stat -- perf.old record -o old.data --namespaces -a true
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 1.095 MB old.data (100 samples) ]

  Performance counter stats for 'perf.old record -o old.data --namespaces -a true':

            498.15 msec task-clock          #    0.987 CPUs utilized
               126      context-switches    #  252.935 /sec
                64      cpu-migrations      #  128.475 /sec
              4396      page-faults         #    8.825 K/sec
        1927096347      cycles              #    3.868 GHz
        4563059399      instructions        #    2.37  insn per cycle
         914232559      branches            #    1.835 G/sec
           6618052      branch-misses       #    0.72% of all branches
        9633787105      slots               #   19.339 G/sec
        4394300990      topdown-retiring    #     38.8% Retiring
        3693815286      topdown-bad-spec    #     32.6% Bad Speculation
        1692356927      topdown-fe-bound    #     14.9% Frontend Bound
        1544151518      topdown-be-bound    #     13.6% Backend Bound

       0.504636742 seconds time elapsed

       0.158237000 seconds user
       0.340625000 seconds sys

 # perf.old stat -- perf.new record -o new.data -m4M --namespaces -a true
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 1.095 MB new.data (103 samples) ]

  Performance counter stats for 'perf.new record -o new.data -m4M --namespaces -a true':

            386.61 msec task-clock          #    0.988 CPUs utilized
               100      context-switches    #  258.658 /sec
                65      cpu-migrations      #  168.128 /sec
              4935      page-faults         #   12.765 K/sec
        1495905137      cycles              #    3.869 GHz
        3647660473      instructions        #    2.44  insn per cycle
         735822370      branches            #    1.903 G/sec
           5765668      branch-misses       #    0.78% of all branches
        7477722620      slots               #   19.342 G/sec
        3415835954      topdown-retiring    #     39.5% Retiring
        2748625759      topdown-bad-spec    #     31.8% Bad Speculation
        1221594670      topdown-fe-bound    #     14.1% Frontend Bound
        1256150733      topdown-be-bound    #     14.5% Backend Bound

       0.391472763 seconds time elapsed

       0.141207000 seconds user
       0.246277000 seconds sys

 # ls -lh old.data
 -rw------- 1 root root 1.2M Apr 18 13:19 old.data
 # ls -lh new.data
 -rw------- 1 root root 1.2M Apr 18 13:19 new.data
 #
On Tue, Apr 18, 2023 at 6:36 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 18/04/23 09:18, Adrian Hunter wrote:
> > On 17/04/23 14:02, Peter Zijlstra wrote:
> >> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> >>> Hi
> >>>
> >>> Here is a stab at adding an ioctl for sideband events.
> >>>
> >>> This is to overcome races when reading the same information
> >>> from /proc.
> >>
> >> What races? Are you talking about reading old state in /proc the kernel
> >> delivering a sideband event for the new state, and then you writing the
> >> old state out?
> >>
> >> Surely that's something perf tool can fix without kernel changes?
> >
> > Yes, and it was a bit of a brain fart not to realise that.
> >
> > There may still be corner cases, where different kinds of events are
> > interdependent, perhaps NAMESPACES events vs MMAP events could
> > have ordering issues.
> >
> > Putting that aside, the ioctl may be quicker than reading from
> > /proc. I could get some numbers and see what people think.
> >
>
> Here's a result with a quick hack to use the ioctl but without
> handling the buffer becoming full (hence the -m4M)
>
> # ps -e | wc -l
> 1171
> # perf.old stat -- perf.old record -o old.data --namespaces -a true
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.095 MB old.data (100 samples) ]
>
>  Performance counter stats for 'perf.old record -o old.data --namespaces -a true':
>
>            498.15 msec task-clock          #    0.987 CPUs utilized
>               126      context-switches    #  252.935 /sec
>                64      cpu-migrations      #  128.475 /sec
>              4396      page-faults         #    8.825 K/sec
>        1927096347      cycles              #    3.868 GHz
>        4563059399      instructions        #    2.37  insn per cycle
>         914232559      branches            #    1.835 G/sec
>           6618052      branch-misses       #    0.72% of all branches
>        9633787105      slots               #   19.339 G/sec
>        4394300990      topdown-retiring    #     38.8% Retiring
>        3693815286      topdown-bad-spec    #     32.6% Bad Speculation
>        1692356927      topdown-fe-bound    #     14.9% Frontend Bound
>        1544151518      topdown-be-bound    #     13.6% Backend Bound
>
>       0.504636742 seconds time elapsed
>
>       0.158237000 seconds user
>       0.340625000 seconds sys
>
> # perf.old stat -- perf.new record -o new.data -m4M --namespaces -a true
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.095 MB new.data (103 samples) ]
>
>  Performance counter stats for 'perf.new record -o new.data -m4M --namespaces -a true':
>
>            386.61 msec task-clock          #    0.988 CPUs utilized
>               100      context-switches    #  258.658 /sec
>                65      cpu-migrations      #  168.128 /sec
>              4935      page-faults         #   12.765 K/sec
>        1495905137      cycles              #    3.869 GHz
>        3647660473      instructions        #    2.44  insn per cycle
>         735822370      branches            #    1.903 G/sec
>           5765668      branch-misses       #    0.78% of all branches
>        7477722620      slots               #   19.342 G/sec
>        3415835954      topdown-retiring    #     39.5% Retiring
>        2748625759      topdown-bad-spec    #     31.8% Bad Speculation
>        1221594670      topdown-fe-bound    #     14.1% Frontend Bound
>        1256150733      topdown-be-bound    #     14.5% Backend Bound
>
>       0.391472763 seconds time elapsed
>
>       0.141207000 seconds user
>       0.246277000 seconds sys
>
> # ls -lh old.data
> -rw------- 1 root root 1.2M Apr 18 13:19 old.data
> # ls -lh new.data
> -rw------- 1 root root 1.2M Apr 18 13:19 new.data
> #

Cool, so the headline is a ~20% or 1 billion instruction reduction in
perf startup overhead?

Thanks,
Ian
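For reference, rough arithmetic on the quoted perf stat runs (the figures come from the output above, not from new measurements): instructions drop from 4,563,059,399 to 3,647,660,473, about 915 million fewer, or roughly 20%; elapsed time drops from 0.505 s to 0.391 s (about 22%) and sys time from 0.341 s to 0.246 s (about 28%), while the perf.data size stays the same at about 1.1 MB.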