[v3,4/7] perf record: Track sideband events for all CPUs when tracing selected CPUs

Message ID 20230722093219.174898-5-yangjihong1@huawei.com
State New
Series perf record: Track sideband events for all CPUs when tracing selected CPUs

Commit Message

Yang Jihong July 22, 2023, 9:32 a.m. UTC
User space tasks can migrate between CPUs, so sideband events need to be
tracked on all CPUs.

A specific scenario is as follows:

         CPU0                                 CPU1
  perf record -C 0 start
                              taskA is created and starts executing
                                -> PERF_RECORD_COMM and PERF_RECORD_MMAP
                                   events are delivered only to CPU1
                              ......
                                |
                          migrate to CPU0
                                |
  Running on CPU0    <----------/
  ...

  perf record -C 0 stop

perf now samples the PC of taskA on CPU0, but it never recorded the
PERF_RECORD_COMM and PERF_RECORD_MMAP events of taskA, because they were
delivered only to CPU1. Therefore, the comm and symbols of taskA cannot be
resolved.
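
The effect can be illustrated with a toy model. This is not perf code: the
types and the function below are hypothetical and only mimic the idea that
every event carries the CPU it was delivered on, and that the recorder keeps
COMM events either from the traced CPU only or from every CPU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types are hypothetical and are not perf code. */
enum toy_type { TOY_COMM, TOY_SAMPLE };

struct toy_event {
	enum toy_type type;
	int cpu;          /* CPU the event was delivered on */
	int tid;          /* task the event belongs to */
	const char *comm; /* task name, set only for TOY_COMM */
};

/*
 * Return the last comm recorded for tid, or NULL if no COMM event for it
 * was kept. COMM events are kept from traced_cpu only, unless
 * sideband_all_cpus is true, in which case they are kept from every CPU.
 */
static const char *toy_resolve_comm(const struct toy_event *ev, size_t n,
				    int tid, int traced_cpu,
				    bool sideband_all_cpus)
{
	const char *comm = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ev[i].type != TOY_COMM || ev[i].tid != tid)
			continue;
		if (sideband_all_cpus || ev[i].cpu == traced_cpu)
			comm = ev[i].comm;
	}
	return comm;
}
```

With a COMM event for taskA delivered on CPU1 and a sample taken on CPU0,
the lookup returns NULL when only CPU0 is recorded, and "taskA" when
sideband is kept for all CPUs.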

The solution is to record sideband events for all CPUs when tracing
selected CPUs. Because this changes the default behavior, document it in
the perf record man page.
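
At the syscall level the idea can be sketched roughly as below. This is a
minimal illustration, not what perf itself executes: the helper names are
hypothetical, only a subset of the attr bits of a user-space dummy event is
set, and CPUs are assumed to be numbered 0..nr_cpus-1 and all online.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fill a perf_event_attr for a user-space-only dummy event that carries
 * sideband records (mmap, comm, task). Hypothetical helper, subset of bits.
 */
static void sideband_attr_init(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_SOFTWARE;
	attr->size = sizeof(*attr);
	attr->config = PERF_COUNT_SW_DUMMY;
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
			    PERF_SAMPLE_TIME | PERF_SAMPLE_CPU |
			    PERF_SAMPLE_IDENTIFIER;
	attr->inherit = 1;
	attr->exclude_kernel = 1;
	attr->exclude_hv = 1;
	attr->mmap = 1;
	attr->comm = 1;
	attr->task = 1;
	attr->sample_id_all = 1;
	attr->exclude_guest = 1;
	attr->mmap2 = 1;
	attr->comm_exec = 1;
}

/*
 * Open the dummy event for all tasks (pid -1) on CPUs 0..nr_cpus-1.
 * Returns how many descriptors were stored in fds; stops at the first
 * failure (for example, insufficient privileges).
 */
static int sideband_open_all_cpus(int *fds, int nr_cpus)
{
	struct perf_event_attr attr;
	int cpu;

	sideband_attr_init(&attr);
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		long fd = syscall(SYS_perf_event_open, &attr, -1, cpu, -1,
				  PERF_FLAG_FD_CLOEXEC);
		if (fd < 0)
			break;
		fds[cpu] = (int)fd;
	}
	return cpu;
}
```

The sampling event itself would still be opened only on the CPUs given
with -C; only the dummy event is opened on every CPU.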

The resulting sys_perf_event_open calls are as follows:

  # perf --debug verbose=3 record -e cpu-clock -C 1 true
  <SNIP>
  Opening: cpu-clock
  ------------------------------------------------------------
  perf_event_attr:
    type                             1 (PERF_TYPE_SOFTWARE)
    size                             136
    config                           0 (PERF_COUNT_SW_CPU_CLOCK)
    { sample_period, sample_freq }   4000
    sample_type                      IP|TID|TIME|CPU|PERIOD|IDENTIFIER
    read_format                      ID|LOST
    disabled                         1
    inherit                          1
    freq                             1
    sample_id_all                    1
    exclude_guest                    1
  ------------------------------------------------------------
  sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 5
  Opening: dummy:u
  ------------------------------------------------------------
  perf_event_attr:
    type                             1 (PERF_TYPE_SOFTWARE)
    size                             136
    config                           0x9 (PERF_COUNT_SW_DUMMY)
    { sample_period, sample_freq }   1
    sample_type                      IP|TID|TIME|CPU|IDENTIFIER
    read_format                      ID|LOST
    inherit                          1
    exclude_kernel                   1
    exclude_hv                       1
    mmap                             1
    comm                             1
    task                             1
    sample_id_all                    1
    exclude_guest                    1
    mmap2                            1
    comm_exec                        1
    ksymbol                          1
    bpf_event                        1
  ------------------------------------------------------------
  sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 6
  sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 7
  sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 9
  sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 10
  sys_perf_event_open: pid -1  cpu 4  group_fd -1  flags 0x8 = 11
  sys_perf_event_open: pid -1  cpu 5  group_fd -1  flags 0x8 = 12
  sys_perf_event_open: pid -1  cpu 6  group_fd -1  flags 0x8 = 13
  sys_perf_event_open: pid -1  cpu 7  group_fd -1  flags 0x8 = 14
  <SNIP>

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
---
 tools/perf/Documentation/perf-record.txt |  3 +++
 tools/perf/builtin-record.c              | 14 +++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)
  

Comments

Adrian Hunter July 31, 2023, 11:08 a.m. UTC | #1
On 22/07/23 12:32, Yang Jihong wrote:
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 680396c56bd1..dac53ece51ab 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -388,6 +388,9 @@ comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-
>  In per-thread mode with inheritance mode on (default), samples are captured only when
>  the thread executes on the designated CPUs. Default is to monitor all CPUs.
>  
> +User space tasks can migrate between CPUs, so when tracing selected CPUs,
> +a dummy event is created to track sideband for all CPUs.
> +
>  -B::
>  --no-buildid::
>  Do not save the build ids of binaries in the perf.data files. This skips
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 3ff9d972225e..4e8e97928f05 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -912,6 +912,7 @@ static int record__config_tracking_events(struct record *rec)
>  {
>  	struct record_opts *opts = &rec->opts;
>  	struct evlist *evlist = rec->evlist;
> +	bool system_wide = false;
>  	struct evsel *evsel;
>  
>  	/*
> @@ -921,7 +922,18 @@ static int record__config_tracking_events(struct record *rec)
>  	 */
>  	if (opts->target.initial_delay || target__has_cpu(&opts->target) ||
>  	    perf_pmus__num_core_pmus() > 1) {
> -		evsel = evlist__findnew_tracking_event(evlist, false);
> +
> +		/*
> +		 * User space tasks can migrate between CPUs, so when tracing
> +		 * selected CPUs, sideband for all CPUs is still needed.
> +		 *
> +		 * If all (non-dummy) evsel have exclude_user,
> +		 * system_wide is not needed.
> +		 */
> +		if (!!opts->target.cpu_list && !opts->all_kernel)

Not everyone uses all-kernel.  Can we check the evsels are either dummy
or exclude_user?

> +			system_wide = true;
> +
> +		evsel = evlist__findnew_tracking_event(evlist, system_wide);
>  		if (!evsel)
>  			return -ENOMEM;
>
  
Yang Jihong July 31, 2023, 12:38 p.m. UTC | #2
Hello,

On 2023/7/31 19:08, Adrian Hunter wrote:
> On 22/07/23 12:32, Yang Jihong wrote:
>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>> index 680396c56bd1..dac53ece51ab 100644
>> --- a/tools/perf/Documentation/perf-record.txt
>> +++ b/tools/perf/Documentation/perf-record.txt
>> @@ -388,6 +388,9 @@ comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-
>>   In per-thread mode with inheritance mode on (default), samples are captured only when
>>   the thread executes on the designated CPUs. Default is to monitor all CPUs.
>>   
>> +User space tasks can migrate between CPUs, so when tracing selected CPUs,
>> +a dummy event is created to track sideband for all CPUs.
>> +
>>   -B::
>>   --no-buildid::
>>   Do not save the build ids of binaries in the perf.data files. This skips
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 3ff9d972225e..4e8e97928f05 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -912,6 +912,7 @@ static int record__config_tracking_events(struct record *rec)
>>   {
>>   	struct record_opts *opts = &rec->opts;
>>   	struct evlist *evlist = rec->evlist;
>> +	bool system_wide = false;
>>   	struct evsel *evsel;
>>   
>>   	/*
>> @@ -921,7 +922,18 @@ static int record__config_tracking_events(struct record *rec)
>>   	 */
>>   	if (opts->target.initial_delay || target__has_cpu(&opts->target) ||
>>   	    perf_pmus__num_core_pmus() > 1) {
>> -		evsel = evlist__findnew_tracking_event(evlist, false);
>> +
>> +		/*
>> +		 * User space tasks can migrate between CPUs, so when tracing
>> +		 * selected CPUs, sideband for all CPUs is still needed.
>> +		 *
>> +		 * If all (non-dummy) evsel have exclude_user,
>> +		 * system_wide is not needed.
>> +		 */
>> +		if (!!opts->target.cpu_list && !opts->all_kernel)
> 
> Not everyone uses all-kernel.  Can we check the evsels are either dummy
> or exclude_user?
For perf record, exclude_user of every evsel is set in evsel__config(),
and record__config_tracking_events() runs before evsel__config().

So at this point it seems that only opts->all_kernel can be used to tell
whether the evsels will have exclude_user set.

void evsel__config()
{
   ...
   if (opts->all_kernel) {
     attr->exclude_kernel = 0;
     attr->exclude_user   = 1;
   }
   ...
}

Thanks,
Yang
  
Adrian Hunter July 31, 2023, 1:01 p.m. UTC | #3
On 31/07/23 15:38, Yang Jihong wrote:
> Hello,
> 
> On 2023/7/31 19:08, Adrian Hunter wrote:
>> On 22/07/23 12:32, Yang Jihong wrote:
>>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>>> index 680396c56bd1..dac53ece51ab 100644
>>> --- a/tools/perf/Documentation/perf-record.txt
>>> +++ b/tools/perf/Documentation/perf-record.txt
>>> @@ -388,6 +388,9 @@ comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-
>>>   In per-thread mode with inheritance mode on (default), samples are captured only when
>>>   the thread executes on the designated CPUs. Default is to monitor all CPUs.
>>>
>>> +User space tasks can migrate between CPUs, so when tracing selected CPUs,
>>> +a dummy event is created to track sideband for all CPUs.
>>> +
>>>   -B::
>>>   --no-buildid::
>>>   Do not save the build ids of binaries in the perf.data files. This skips
>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>> index 3ff9d972225e..4e8e97928f05 100644
>>> --- a/tools/perf/builtin-record.c
>>> +++ b/tools/perf/builtin-record.c
>>> @@ -912,6 +912,7 @@ static int record__config_tracking_events(struct record *rec)
>>>   {
>>>       struct record_opts *opts = &rec->opts;
>>>       struct evlist *evlist = rec->evlist;
>>> +    bool system_wide = false;
>>>       struct evsel *evsel;
>>>
>>>       /*
>>> @@ -921,7 +922,18 @@ static int record__config_tracking_events(struct record *rec)
>>>        */
>>>       if (opts->target.initial_delay || target__has_cpu(&opts->target) ||
>>>           perf_pmus__num_core_pmus() > 1) {
>>> -        evsel = evlist__findnew_tracking_event(evlist, false);
>>> +
>>> +        /*
>>> +         * User space tasks can migrate between CPUs, so when tracing
>>> +         * selected CPUs, sideband for all CPUs is still needed.
>>> +         *
>>> +         * If all (non-dummy) evsel have exclude_user,
>>> +         * system_wide is not needed.
>>> +         */
>>> +        if (!!opts->target.cpu_list && !opts->all_kernel)
>>
>> Not everyone uses all-kernel.  Can we check the evsels are either dummy
>> or exclude_user?
> For perf_record, exclude_user of all evsels is set in evsel__config(), and record__config_tracking_events() is before evsel__config().
> 
> Uh..., it seems that only opts->all_kernel can be used to check exclude_user of evsels.
> 
> void evsel__config()
> {
>   ...
>   if (opts->all_kernel) {
>     attr->exclude_kernel = 0;
>     attr->exclude_user   = 1;
>   }
>   ...
> }

The parser updates attr in accordance with ":k" etc.  I guess 
opts->all_kernel or opts->all_user override that as well.
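
A minimal sketch of that ordering, using hypothetical stand-in types rather
than perf's real structures. The all_kernel branch follows the
evsel__config() excerpt quoted above; the all_user branch and the modifier
handling are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-ins for perf's attr and record opts. */
struct sketch_attr {
	bool exclude_user;
	bool exclude_kernel;
};

struct sketch_opts {
	bool all_user;
	bool all_kernel;
};

/* Step 1: the event parser applies a modifier such as ":u" or ":k". */
static void sketch_apply_modifier(struct sketch_attr *attr, char modifier)
{
	if (modifier == 'u')
		attr->exclude_kernel = true;
	else if (modifier == 'k')
		attr->exclude_user = true;
}

/* Step 2: --all-user / --all-kernel are applied later and win. */
static void sketch_apply_opts(struct sketch_attr *attr,
			      const struct sketch_opts *opts)
{
	if (opts->all_user) {
		attr->exclude_kernel = true;
		attr->exclude_user = false;
	}
	if (opts->all_kernel) {
		attr->exclude_kernel = false;
		attr->exclude_user = true;
	}
}
```

In this model, an event parsed with ":u" and then configured with
all_kernel set ends up with exclude_user set and exclude_kernel cleared,
so a check made before step 2 cannot rely on the attr bits alone.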
  
Yang Jihong July 31, 2023, 2:28 p.m. UTC | #4
Hello,

On 2023/7/31 21:01, Adrian Hunter wrote:
> On 31/07/23 15:38, Yang Jihong wrote:
>> Hello,
>>
>> On 2023/7/31 19:08, Adrian Hunter wrote:
>>> On 22/07/23 12:32, Yang Jihong wrote:
>>>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>>>> index 680396c56bd1..dac53ece51ab 100644
>>>> --- a/tools/perf/Documentation/perf-record.txt
>>>> +++ b/tools/perf/Documentation/perf-record.txt
>>>> @@ -388,6 +388,9 @@ comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-
>>>>    In per-thread mode with inheritance mode on (default), samples are captured only when
>>>>    the thread executes on the designated CPUs. Default is to monitor all CPUs.
>>>>
>>>> +User space tasks can migrate between CPUs, so when tracing selected CPUs,
>>>> +a dummy event is created to track sideband for all CPUs.
>>>> +
>>>>    -B::
>>>>    --no-buildid::
>>>>    Do not save the build ids of binaries in the perf.data files. This skips
>>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>>> index 3ff9d972225e..4e8e97928f05 100644
>>>> --- a/tools/perf/builtin-record.c
>>>> +++ b/tools/perf/builtin-record.c
>>>> @@ -912,6 +912,7 @@ static int record__config_tracking_events(struct record *rec)
>>>>    {
>>>>        struct record_opts *opts = &rec->opts;
>>>>        struct evlist *evlist = rec->evlist;
>>>> +    bool system_wide = false;
>>>>        struct evsel *evsel;
>>>>
>>>>        /*
>>>> @@ -921,7 +922,18 @@ static int record__config_tracking_events(struct record *rec)
>>>>         */
>>>>        if (opts->target.initial_delay || target__has_cpu(&opts->target) ||
>>>>            perf_pmus__num_core_pmus() > 1) {
>>>> -        evsel = evlist__findnew_tracking_event(evlist, false);
>>>> +
>>>> +        /*
>>>> +         * User space tasks can migrate between CPUs, so when tracing
>>>> +         * selected CPUs, sideband for all CPUs is still needed.
>>>> +         *
>>>> +         * If all (non-dummy) evsel have exclude_user,
>>>> +         * system_wide is not needed.
>>>> +         */
>>>> +        if (!!opts->target.cpu_list && !opts->all_kernel)
>>>
>>> Not everyone uses all-kernel.  Can we check the evsels are either dummy
>>> or exclude_user?
>> For perf_record, exclude_user of all evsels is set in evsel__config(), and record__config_tracking_events() is before evsel__config().
>>
>> Uh..., it seems that only opts->all_kernel can be used to check exclude_user of evsels.
>>
>> void evsel__config()
>> {
>>    ...
>>    if (opts->all_kernel) {
>>      attr->exclude_kernel = 0;
>>      attr->exclude_user   = 1;
>>    }
>>    ...
>> }
> 
> The parser updates attr in accordance with ":k" etc.  I guess
Yes, the ":k" situation also needs to be considered.

> opts->all_kernel or opts->all_user override that as well.
Yes, opts->all_kernel and opts->all_user will overwrite the original 
attr, see [1].

We may need to check all_user, all_kernel, and exclude_user of the
non-dummy evsels at the same time:

if ((all_user && one_non_dummy_exist) ||
     (!all_user && !all_kernel && one_non_dummy_without_exclude_user)) {
     system_wide = true;
}
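
Spelled out as a self-contained function, the condition above could look
like the sketch below. The struct and function names are hypothetical
stand-ins, not perf's evsel API; the two flags are computed by walking the
event list once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the evsel fields this check needs. */
struct sketch_evsel {
	bool is_dummy;
	bool exclude_user;
};

/*
 * Decide whether sideband must be tracked on all CPUs:
 *  - with all_user, any non-dummy evsel will sample user space;
 *  - with neither all_user nor all_kernel, it depends on whether some
 *    non-dummy evsel still has exclude_user cleared.
 */
static bool sketch_need_system_wide(const struct sketch_evsel *evsels,
				    size_t n, bool all_user, bool all_kernel)
{
	bool one_non_dummy_exist = false;
	bool one_non_dummy_without_exclude_user = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (evsels[i].is_dummy)
			continue;
		one_non_dummy_exist = true;
		if (!evsels[i].exclude_user)
			one_non_dummy_without_exclude_user = true;
	}

	return (all_user && one_non_dummy_exist) ||
	       (!all_user && !all_kernel &&
		one_non_dummy_without_exclude_user);
}
```

For a list holding one kernel-only evsel plus a dummy, this returns false
without any override and true once all_user is set.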

[1]
# perf --debug verbose=2 record -e cpu-clock:u --all-kernel true
<SNIP>
------------------------------------------------------------
perf_event_attr:
   type                             1 (PERF_TYPE_SOFTWARE)
   size                             136
   config                           0 (PERF_COUNT_SW_CPU_CLOCK)
   { sample_period, sample_freq }   4000
   sample_type                      IP|TID|TIME|PERIOD
   read_format                      ID|LOST
   disabled                         1
   inherit                          1
   exclude_user                     1
   exclude_hv                       1
   mmap                             1
   comm                             1
   freq                             1
   enable_on_exec                   1
   task                             1
   sample_id_all                    1
   exclude_guest                    1
   mmap2                            1
   comm_exec                        1
   ksymbol                          1
   bpf_event                        1
------------------------------------------------------------
<SNIP>

Thanks,
Yang
  

Patch

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 680396c56bd1..dac53ece51ab 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -388,6 +388,9 @@  comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-
 In per-thread mode with inheritance mode on (default), samples are captured only when
 the thread executes on the designated CPUs. Default is to monitor all CPUs.
 
+User space tasks can migrate between CPUs, so when tracing selected CPUs,
+a dummy event is created to track sideband for all CPUs.
+
 -B::
 --no-buildid::
 Do not save the build ids of binaries in the perf.data files. This skips
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3ff9d972225e..4e8e97928f05 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -912,6 +912,7 @@  static int record__config_tracking_events(struct record *rec)
 {
 	struct record_opts *opts = &rec->opts;
 	struct evlist *evlist = rec->evlist;
+	bool system_wide = false;
 	struct evsel *evsel;
 
 	/*
@@ -921,7 +922,18 @@  static int record__config_tracking_events(struct record *rec)
 	 */
 	if (opts->target.initial_delay || target__has_cpu(&opts->target) ||
 	    perf_pmus__num_core_pmus() > 1) {
-		evsel = evlist__findnew_tracking_event(evlist, false);
+
+		/*
+		 * User space tasks can migrate between CPUs, so when tracing
+		 * selected CPUs, sideband for all CPUs is still needed.
+		 *
+		 * If all (non-dummy) evsel have exclude_user,
+		 * system_wide is not needed.
+		 */
+		if (!!opts->target.cpu_list && !opts->all_kernel)
+			system_wide = true;
+
+		evsel = evlist__findnew_tracking_event(evlist, system_wide);
 		if (!evsel)
 			return -ENOMEM;