[v1,3/3] perf arm64: Support virtual CPU ID for kvm-stat

Message ID 20221105072311.8214-4-leo.yan@linaro.org
State New
Series: KVM: arm64: Support tracing virtual CPU ID

Commit Message

Leo Yan Nov. 5, 2022, 7:23 a.m. UTC
  Since the two trace events kvm_entry_v2/kvm_exit_v2 have been added, we
can use the "vcpu_id" field in these events to determine the virtual CPU
ID.  To keep backward compatibility, we still need to rely on the trace
events kvm_entry/kvm_exit for old kernels.

This patch adds the Arm64 functions setup_kvm_events_tp() and
arm64__setup_kvm_tp().  By detecting the event nodes under the sysfs
folder, perf can dynamically register the trace events
kvm_entry_v2/kvm_exit_v2 when the kernel provides them; otherwise, it
falls back to the events kvm_entry/kvm_exit for backward compatibility.

Let cpu_isa_init() invoke arm64__setup_kvm_tp(); this allows the command
"perf kvm stat report" to also set up the trace events dynamically.

Before:

  # perf kvm stat report --vcpu 27

  Analyze events for all VMs, VCPU 27:

               VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  Total Samples:0, Total events handled time:0.00us.

After:

  # perf kvm stat report --vcpu 27

  Analyze events for all VMs, VCPU 27:

               VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

                 SYS64        808    98.54%    91.24%      0.00us    303.76us      3.46us ( +-  13.54% )
                   WFx         10     1.22%     7.79%      0.00us     69.48us     23.91us ( +-  25.91% )
                   IRQ          2     0.24%     0.97%      0.00us     22.64us     14.82us ( +-  52.77% )

  Total Samples:820, Total events handled time:3068.28us.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/arch/arm64/util/kvm-stat.c | 54 ++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 5 deletions(-)
  

Comments

Marc Zyngier Nov. 5, 2022, 1:28 p.m. UTC | #1
On Sat, 05 Nov 2022 07:23:11 +0000,
Leo Yan <leo.yan@linaro.org> wrote:
> 
> Since the two trace events kvm_entry_v2/kvm_exit_v2 have been added, we
> can use the "vcpu_id" field in these events to determine the virtual CPU
> ID.  To keep backward compatibility, we still need to rely on the trace
> events kvm_entry/kvm_exit for old kernels.
> 
> This patch adds the Arm64 functions setup_kvm_events_tp() and
> arm64__setup_kvm_tp().  By detecting the event nodes under the sysfs
> folder, perf can dynamically register the trace events
> kvm_entry_v2/kvm_exit_v2 when the kernel provides them; otherwise, it
> falls back to the events kvm_entry/kvm_exit for backward compatibility.
> 
> Let cpu_isa_init() invoke arm64__setup_kvm_tp(); this allows the command
> "perf kvm stat report" to also set up the trace events dynamically.
> 
> Before:
> 
>   # perf kvm stat report --vcpu 27
> 
>   Analyze events for all VMs, VCPU 27:
> 
>                VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> 
>   Total Samples:0, Total events handled time:0.00us.
>
> After:
> 
>   # perf kvm stat report --vcpu 27
> 
>   Analyze events for all VMs, VCPU 27:
> 
>                VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> 
>                  SYS64        808    98.54%    91.24%      0.00us    303.76us      3.46us ( +-  13.54% )
>                    WFx         10     1.22%     7.79%      0.00us     69.48us     23.91us ( +-  25.91% )
>                    IRQ          2     0.24%     0.97%      0.00us     22.64us     14.82us ( +-  52.77% )
> 
>   Total Samples:820, Total events handled time:3068.28us.

Please educate me: how useful is it to filter on a vcpu number across
all VMs? What sense does it even make?

Conversely, what would be the purpose of filtering on a 5th thread of
any process irrespective of what the process does? To me, this is the
same level of non-sense.

AFAICT, this is just piling more arbitrary data extraction for no
particular reason other than "just because we can", and there is
absolutely no guarantee that this is fit for anyone else's purpose.

I'd rather you have a generic tracepoint taking the vcpu as a context
and a BPF program that spits out the information people actually need,
keeping things out of the kernel. Or even a tracehook (like the
scheduler does), and let people load a module to dump whatever
information they please.
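
For illustration, the kind of out-of-kernel extraction described above
could be as small as a bpftrace one-liner; the sketch below assumes a
tracepoint that exposes a vcpu_id field, which arm64's current kvm_exit
does not provide:

  # bpftrace -e 'tracepoint:kvm:kvm_exit { @exits[args->vcpu_id] = count(); }'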

But randomly adding new tracepoints to output a semi-useless field
without any consideration for future-proofing? No, thank you.

Thanks,

	M.
  
Leo Yan Nov. 7, 2022, 2:47 p.m. UTC | #2
On Sat, Nov 05, 2022 at 01:28:40PM +0000, Marc Zyngier wrote:

[...]

> > Before:
> > 
> >   # perf kvm stat report --vcpu 27
> > 
> >   Analyze events for all VMs, VCPU 27:
> > 
> >                VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> > 
> >   Total Samples:0, Total events handled time:0.00us.
> >
> > After:
> > 
> >   # perf kvm stat report --vcpu 27
> > 
> >   Analyze events for all VMs, VCPU 27:
> > 
> >                VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> > 
> >                  SYS64        808    98.54%    91.24%      0.00us    303.76us      3.46us ( +-  13.54% )
> >                    WFx         10     1.22%     7.79%      0.00us     69.48us     23.91us ( +-  25.91% )
> >                    IRQ          2     0.24%     0.97%      0.00us     22.64us     14.82us ( +-  52.77% )
> > 
> >   Total Samples:820, Total events handled time:3068.28us.
> 
> Please educate me: how useful is it to filter on a vcpu number across
> all VMs? What sense does it even make?

Now "perf kvm" tool is not sophisticated since it doesn't capture VMID
and virtual CPU ID together.

One use case is to pin a program to a specific virtual CPU with taskset
inside the VM; this way we can check whether any bottleneck is caused by
VM entry/exit.  I have to say, though, that filtering only on the VCPU ID
is inaccurate; we should consider tracing the VMID and the VCPU ID
together in a later enhancement.
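
For example, a rough sketch of that workflow (the vCPU number and the
workload name are only illustrative):

  (guest) $ taskset -c 27 ./my_workload        # pin the workload to vCPU 27

  (host)  # perf kvm stat record -a -- sleep 10
  (host)  # perf kvm stat report --vcpu 27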

> Conversely, what would be the purpose of filtering on a 5th thread of
> any process irrespective of what the process does? To me, this is the
> same level of non-sense.

I agree.

> AFAICT, this is just piling more arbitrary data extraction for no
> particular reason other than "just because we can", and there is
> absolutely no guarantee that this is fit for anyone else's purpose.
> 
> I'd rather you have a generic tracepoint taking the vcpu as a context
> and a BPF program that spits out the information people actually need,
> keeping things out of the kernel. Or even a tracehook (like the
> scheduler does), and let people load a module to dump whatever
> information they please.

Actually I considered three options:

Option 1: Simply add new versions of the trace events to record more
info.  This is not flexible, and we even run the risk of adding yet more
versions if we later find that more data should be traced.

On the other hand, this approach is straightforward and the
implementation is simple.  This is the main reason why I finally chose to
add new trace events.

Option 2: Use kprobes to dynamically insert tracepoints.  But this means
the user must have the corresponding vmlinux file; otherwise the perf
tool might inject the tracepoint at an incorrect address.  This is the
main reason I didn't use kprobes to add dynamic tracepoints.

Option 3: As you suggested, bind the KVM tracepoints to an eBPF program
and let the eBPF program record the perf events.

When I reviewed Arm64's kvm_entry / kvm_exit trace events, I found they
don't carry the vcpu context in their arguments, which means I would need
to add new trace events to get access to the "vcpu" context.

Options 1 and 3 both need new trace events; option 1 is the more
straightforward solution, and this is why it was chosen in the current
patch set.

I now realize that I made a mistake: we can actually modify the trace
event definitions for kvm_entry / kvm_exit.  Note that we only modify the
trace events' arguments; this changes the trace functions' prototypes but
does not break the ABI (the format exposed to user space stays exactly
the same).  The changes below demonstrate what I'm proposing:

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 94d33e296e10..16f6b61abfec 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -917,7 +917,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
                /**************************************************************
                 * Enter the guest
                 */
-               trace_kvm_entry(*vcpu_pc(vcpu));
+               trace_kvm_entry(vcpu);
                guest_timing_enter_irqoff();
 
                ret = kvm_arm_vcpu_enter_exit(vcpu);
diff --git a/arch/arm64/kvm/trace_arm.h b/arch/arm64/kvm/trace_arm.h
index 33e4e7dd2719..9df4fd30093c 100644
--- a/arch/arm64/kvm/trace_arm.h
+++ b/arch/arm64/kvm/trace_arm.h
@@ -12,15 +12,15 @@
  * Tracepoints for entry/exit to guest
  */
 TRACE_EVENT(kvm_entry,
-       TP_PROTO(unsigned long vcpu_pc),
-       TP_ARGS(vcpu_pc),
+       TP_PROTO(struct kvm_vcpu *vcpu),
+       TP_ARGS(vcpu),
 
        TP_STRUCT__entry(
                __field(        unsigned long,  vcpu_pc         )
        ),
 
        TP_fast_assign(
-               __entry->vcpu_pc                = vcpu_pc;
+               __entry->vcpu_pc                = *vcpu_pc(vcpu);
        ),
 
        TP_printk("PC: 0x%016lx", __entry->vcpu_pc)

Please let me know your opinion; if you don't object, I can move forward
with this approach.

> But randomly adding new tracepoints to output a semi-useless field
> without any consideration for future-proofing? No, thank you.

Okay.  Thanks for the review!

Leo
  
Marc Zyngier Nov. 7, 2022, 3:39 p.m. UTC | #3
On Mon, 07 Nov 2022 14:47:10 +0000,
Leo Yan <leo.yan@linaro.org> wrote:
> 
> On Sat, Nov 05, 2022 at 01:28:40PM +0000, Marc Zyngier wrote:
> 
> [...]
> 
> > > Before:
> > > 
> > >   # perf kvm stat report --vcpu 27
> > > 
> > >   Analyze events for all VMs, VCPU 27:
> > > 
> > >                VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> > > 
> > >   Total Samples:0, Total events handled time:0.00us.
> > >
> > > After:
> > > 
> > >   # perf kvm stat report --vcpu 27
> > > 
> > >   Analyze events for all VMs, VCPU 27:
> > > 
> > >                VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> > > 
> > >                  SYS64        808    98.54%    91.24%      0.00us    303.76us      3.46us ( +-  13.54% )
> > >                    WFx         10     1.22%     7.79%      0.00us     69.48us     23.91us ( +-  25.91% )
> > >                    IRQ          2     0.24%     0.97%      0.00us     22.64us     14.82us ( +-  52.77% )
> > > 
> > >   Total Samples:820, Total events handled time:3068.28us.
> > 
> > Please educate me: how useful is it to filter on a vcpu number across
> > all VMs? What sense does it even make?
> 
> Now "perf kvm" tool is not sophisticated since it doesn't capture VMID
> and virtual CPU ID together.

VMID is not a relevant indicator anyway, as it can change at any
point. But that's only to show that everybody has a different view on
what they need to collect. At which point, we need to provide an
infrastructure for data extraction, and not the data itself.

> One use case is to pin a program to a specific virtual CPU with taskset
> inside the VM; this way we can check whether any bottleneck is caused by
> VM entry/exit.  I have to say, though, that filtering only on the VCPU
> ID is inaccurate; we should consider tracing the VMID and the VCPU ID
> together in a later enhancement.
> 
> > Conversely, what would be the purpose of filtering on a 5th thread of
> > any process irrespective of what the process does? To me, this is the
> > same level of non-sense.
> 
> I agree.
> 
> > AFAICT, this is just piling more arbitrary data extraction for no
> > particular reason other than "just because we can", and there is
> > absolutely no guarantee that this is fit for anyone else's purpose.
> > 
> > I'd rather you have a generic tracepoint taking the vcpu as a context
> > and a BPF program that spits out the information people actually need,
> > keeping things out of the kernel. Or even a tracehook (like the
> > scheduler does), and let people load a module to dump whatever
> > information they please.
> 
> Actually I considered three options:
> 
> Option 1: Simply add new versions of the trace events to record more
> info.  This is not flexible, and we even run the risk of adding yet more
> versions if we later find that more data should be traced.
> 
> On the other hand, this approach is straightforward and the
> implementation is simple.  This is the main reason why I finally chose
> to add new trace events.

But that doesn't scale at all.

> 
> Option 2: Use kprobes to dynamically insert tracepoints.  But this means
> the user must have the corresponding vmlinux file; otherwise the perf
> tool might inject the tracepoint at an incorrect address.  This is the
> main reason I didn't use kprobes to add dynamic tracepoints.
>
> Option 3: As you suggested, bind the KVM tracepoints to an eBPF program
> and let the eBPF program record the perf events.
> 
> When I reviewed Arm64's kvm_entry / kvm_exit trace events, I found they
> don't carry the vcpu context in their arguments, which means I would
> need to add new trace events to get access to the "vcpu" context.

I'm not opposed to adding new trace{point,hook}s if you demonstrate
that they are generic enough or will never need to evolve.

> 
> Options 1 and 3 both need new trace events; option 1 is the more
> straightforward solution, and this is why it was chosen in the current
> patch set.
> 
> I now realize that I made a mistake: we can actually modify the trace
> event definitions for kvm_entry / kvm_exit.  Note that we only modify
> the trace events' arguments; this changes the trace functions'
> prototypes but does not break the ABI (the format exposed to user space
> stays exactly the same).  The changes below demonstrate what I'm
> proposing:
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 94d33e296e10..16f6b61abfec 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -917,7 +917,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>                 /**************************************************************
>                  * Enter the guest
>                  */
> -               trace_kvm_entry(*vcpu_pc(vcpu));
> +               trace_kvm_entry(vcpu);
>                 guest_timing_enter_irqoff();
>  
>                 ret = kvm_arm_vcpu_enter_exit(vcpu);
> diff --git a/arch/arm64/kvm/trace_arm.h b/arch/arm64/kvm/trace_arm.h
> index 33e4e7dd2719..9df4fd30093c 100644
> --- a/arch/arm64/kvm/trace_arm.h
> +++ b/arch/arm64/kvm/trace_arm.h
> @@ -12,15 +12,15 @@
>   * Tracepoints for entry/exit to guest
>   */
>  TRACE_EVENT(kvm_entry,
> -       TP_PROTO(unsigned long vcpu_pc),
> -       TP_ARGS(vcpu_pc),
> +       TP_PROTO(struct kvm_vcpu *vcpu),
> +       TP_ARGS(vcpu),
>  
>         TP_STRUCT__entry(
>                 __field(        unsigned long,  vcpu_pc         )
>         ),
>  
>         TP_fast_assign(
> -               __entry->vcpu_pc                = vcpu_pc;
> +               __entry->vcpu_pc                = *vcpu_pc(vcpu);
>         ),
>  
>         TP_printk("PC: 0x%016lx", __entry->vcpu_pc)
> 
> Please let me know your opinion; if you don't object, I can move forward
> with this approach.

I have no issue with this if this doesn't change anything else.

And if you can make use of this with a BPF program and get to the same
result as your initial patch, then please submit it for inclusion in
the kernel as an example. We can then point people to it next time
this crops up (probably before Xmas).

Thanks,

	M.
  
Leo Yan Nov. 8, 2022, 11:49 a.m. UTC | #4
On Mon, Nov 07, 2022 at 03:39:07PM +0000, Marc Zyngier wrote:

[...]

> > > Please educate me: how useful is it to filter on a vcpu number across
> > > all VMs? What sense does it even make?
> > 
> > Now "perf kvm" tool is not sophisticated since it doesn't capture VMID
> > and virtual CPU ID together.
> 
> VMID is not a relevant indicator anyway, as it can change at any
> point.

Thanks for the correction.  IIUC, the VMID is not fixed for a virtual
machine; it can be re-allocated after an overflow.

> But that's only to show that everybody has a different view on
> what they need to collect. At which point, we need to provide an
> infrastructure for data extraction, and not the data itself.

Totally agree.

[...]

> > Option 3: As you suggested, bind the KVM tracepoints to an eBPF
> > program and let the eBPF program record the perf events.
> > 
> > When I reviewed Arm64's kvm_entry / kvm_exit trace events, I found
> > they don't carry the vcpu context in their arguments, which means I
> > would need to add new trace events to get access to the "vcpu"
> > context.
> 
> I'm not opposed to adding new trace{point,hook}s if you demonstrate
> that they are generic enough or will never need to evolve.
> 
> > 
> > Options 1 and 3 both need new trace events; option 1 is the more
> > straightforward solution, and this is why it was chosen in the current
> > patch set.
> > 
> > I now realize that I made a mistake: we can actually modify the trace
> > event definitions for kvm_entry / kvm_exit.  Note that we only modify
> > the trace events' arguments; this changes the trace functions'
> > prototypes but does not break the ABI (the format exposed to user
> > space stays exactly the same).  The changes below demonstrate what I'm
> > proposing:
> > 
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 94d33e296e10..16f6b61abfec 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -917,7 +917,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >                 /**************************************************************
> >                  * Enter the guest
> >                  */
> > -               trace_kvm_entry(*vcpu_pc(vcpu));
> > +               trace_kvm_entry(vcpu);
> >                 guest_timing_enter_irqoff();
> >  
> >                 ret = kvm_arm_vcpu_enter_exit(vcpu);
> > diff --git a/arch/arm64/kvm/trace_arm.h b/arch/arm64/kvm/trace_arm.h
> > index 33e4e7dd2719..9df4fd30093c 100644
> > --- a/arch/arm64/kvm/trace_arm.h
> > +++ b/arch/arm64/kvm/trace_arm.h
> > @@ -12,15 +12,15 @@
> >   * Tracepoints for entry/exit to guest
> >   */
> >  TRACE_EVENT(kvm_entry,
> > -       TP_PROTO(unsigned long vcpu_pc),
> > -       TP_ARGS(vcpu_pc),
> > +       TP_PROTO(struct kvm_vcpu *vcpu),
> > +       TP_ARGS(vcpu),
> >  
> >         TP_STRUCT__entry(
> >                 __field(        unsigned long,  vcpu_pc         )
> >         ),
> >  
> >         TP_fast_assign(
> > -               __entry->vcpu_pc                = vcpu_pc;
> > +               __entry->vcpu_pc                = *vcpu_pc(vcpu);
> >         ),
> >  
> >         TP_printk("PC: 0x%016lx", __entry->vcpu_pc)
> > 
> > Please let me know your opinion; if you don't object, I can move
> > forward with this approach.
> 
> I have no issue with this if this doesn't change anything else.

Thanks for confirmation.

> And if you can make use of this with a BPF program and get to the same
> result as your initial patch, then please submit it for inclusion in
> the kernel as an example. We can then point people to it next time
> this crops up (probably before Xmas).

Will do.

Just a heads up: if everything goes well, I will place the eBPF program
in the folder tools/perf/examples/bpf/, so that it is easy to integrate
and release with the perf tool.

Thanks,
Leo
  

Patch

diff --git a/tools/perf/arch/arm64/util/kvm-stat.c b/tools/perf/arch/arm64/util/kvm-stat.c
index 73d18e0ed6f6..1ba54ce3d7d8 100644
--- a/tools/perf/arch/arm64/util/kvm-stat.c
+++ b/tools/perf/arch/arm64/util/kvm-stat.c
@@ -3,6 +3,7 @@ 
 #include <memory.h>
 #include "../../../util/evsel.h"
 #include "../../../util/kvm-stat.h"
+#include "../../../util/tracepoint.h"
 #include "arm64_exception_types.h"
 #include "debug.h"
 
@@ -10,18 +11,28 @@  define_exit_reasons_table(arm64_exit_reasons, kvm_arm_exception_type);
 define_exit_reasons_table(arm64_trap_exit_reasons, kvm_arm_exception_class);
 
 const char *kvm_trap_exit_reason = "esr_ec";
-const char *vcpu_id_str = "id";
+const char *vcpu_id_str = "vcpu_id";
 const int decode_str_len = 20;
 const char *kvm_exit_reason = "ret";
-const char *kvm_entry_trace = "kvm:kvm_entry";
-const char *kvm_exit_trace = "kvm:kvm_exit";
+const char *kvm_entry_trace;
+const char *kvm_exit_trace;
 
-const char *kvm_events_tp[] = {
+#define NR_TPS	2
+
+static const char *kvm_events_tp_v1[NR_TPS + 1] = {
 	"kvm:kvm_entry",
 	"kvm:kvm_exit",
 	NULL,
 };
 
+static const char *kvm_events_tp_v2[NR_TPS + 1] = {
+	"kvm:kvm_entry_v2",
+	"kvm:kvm_exit_v2",
+	NULL,
+};
+
+const char *kvm_events_tp[NR_TPS + 1];
+
 static void event_get_key(struct evsel *evsel,
 			  struct perf_sample *sample,
 			  struct event_key *key)
@@ -78,8 +89,41 @@  const char * const kvm_skip_events[] = {
 	NULL,
 };
 
-int cpu_isa_init(struct perf_kvm_stat *kvm, const char *cpuid __maybe_unused)
+static int arm64__setup_kvm_tp(struct perf_kvm_stat *kvm)
 {
+	const char **kvm_events, **events_ptr;
+	int i, nr_tp = 0;
+
+	if (is_valid_tracepoint("kvm:kvm_entry_v2")) {
+		kvm_events = kvm_events_tp_v2;
+		kvm_entry_trace = "kvm:kvm_entry_v2";
+		kvm_exit_trace = "kvm:kvm_exit_v2";
+	} else {
+		kvm_events = kvm_events_tp_v1;
+		kvm_entry_trace = "kvm:kvm_entry";
+		kvm_exit_trace = "kvm:kvm_exit";
+	}
+
+	for (events_ptr = kvm_events; *events_ptr; events_ptr++) {
+		if (!is_valid_tracepoint(*events_ptr))
+			return -1;
+		nr_tp++;
+	}
+
+	for (i = 0; i < nr_tp; i++)
+		kvm_events_tp[i] = kvm_events[i];
+	kvm_events_tp[i] = NULL;
+
 	kvm->exit_reasons_isa = "arm64";
 	return 0;
 }
+
+int setup_kvm_events_tp(struct perf_kvm_stat *kvm)
+{
+	return arm64__setup_kvm_tp(kvm);
+}
+
+int cpu_isa_init(struct perf_kvm_stat *kvm, const char *cpuid __maybe_unused)
+{
+	return arm64__setup_kvm_tp(kvm);
+}