Message ID | 20231129095055.88060-1-likexu@tencent.com |
---|---|
State | New |
Headers |
From: Like Xu <like.xu.linux@gmail.com> (X-Google-Original-From: Like Xu <likexu@tencent.com>)
To: Sean Christopherson <seanjc@google.com>, Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@linux.intel.com>, Paolo Bonzini <pbonzini@redhat.com>, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH] KVM: x86/pmu: Prevent any host user from enabling PEBS for profiling guest
Date: Wed, 29 Nov 2023 17:50:55 +0800
Message-ID: <20231129095055.88060-1-likexu@tencent.com>
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0 |
Series | KVM: x86/pmu: Prevent any host user from enabling PEBS for profiling guest |
Commit Message
Like Xu
Nov. 29, 2023, 9:50 a.m. UTC
From: Like Xu <likexu@tencent.com>

Stop using PEBS counters on the host to profile a guest. Limit the set of enabled PEBS counters to only those counters that are enabled from the guest PEBS emulation perspective.

If a perf-record agent on the host uses perf-tools events such as "cpu-cycles:GP" (G for attr.exclude_host, P for the maximum precise event level) to capture guest performance events, the guest will hang. This is because the Intel DS-based PEBS buffer is addressed using the 64-bit linear address of the current {p/v}CPU context, taken from MSR_IA32_DS_AREA.

Any perf user who uses PEBS counters to profile a guest from the host is, in perf/core implementation terms, setting bits in cpuc->intel_ctrl_guest_mask and arr[pebs_enable].guest, much like the guest PEBS emulation does. But the subsequent PEBS memory write, regardless of whether guest PEBS is enabled, can overshoot guest entry and corrupt guest memory.

Profiling a guest via the PEBS-DS buffer on the host is not supported at this time. Fix this by filtering the real configured value of arr[pebs_enable].guest with the emulated state of the guest-enabled PEBS counters, provided there are no cross-mapped PEBS counters.

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/events/intel/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
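A minimal userspace sketch may make the trigger concrete: the perf-tools modifier string "cpu-cycles:GP" maps roughly to a guest-only (attr.exclude_host) event requested at the maximum precise level. This sketch is illustrative only and is not part of the patch; the sample period and CPU choice are arbitrary.

/*
 * Illustrative only: a "guest-only, maximum precision" cycles event,
 * roughly what the perf-tools modifiers ":GP" request on the host.
 */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_guest_only_pebs_cycles(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100003;	/* arbitrary sampling period */
	attr.exclude_host = 1;		/* "G": count only while a guest is running */
	attr.precise_ip = 3;		/* "P": ask for the most precise level, i.e. PEBS */

	/* system-wide sampling on one CPU: pid = -1, flags = 0 */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

With such an event active on the host, intel_guest_get_msrs() would, before this fix, carry the corresponding PEBS enable bit into the guest-entry MSR image even though the guest itself never enabled PEBS.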
Comments
On 2023-11-29 4:50 a.m., Like Xu wrote: > From: Like Xu <likexu@tencent.com> > > Stop using PEBS counters on host to profiling guest. Limit the range of > enabled PEBS counters to only those counters enabled from the guest PEBS > emulation perspective. > > If there is a perf-record agent on host that uses perf-tools events like > "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event counter) > to capture guest performance events, then the guest will be hanged. This is > because Intel DS-based PEBS buffer is addressed using the 64-bit linear > address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. > > Any perf user using PEBS counters to profile guest on host is, in perf/core > implementation details, trying to set bits on cpuc->intel_ctrl_guest_mask > and arr[pebs_enable].guest, much like the guest PEBS emulation behaviour. > But the subsequent PEBS memory write, regardless of whether guest PEBS is > enabled, can overshoot guest entry and corrupt guest memory. > > Profiling guest via PEBS-DS buffer on host is not supported at this time. > Fix this by filtering the real configured value of arr[pebs_enable].guest > with the emulated state of guest enabled PEBS counters, under the condition > of none cross-mapped PEBS counters. So the counter will be silently disabled. The user never knows why nothing is sampled. Since we don't support the case, profiling guest via PEBS-DS buffer on host. Maybe we should error out when creating the event. For example (not tested), diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 3871267d3237..24b90c70737f 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct perf_event *event) if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT) return -EINVAL; + /* Profiling guest via PEBS-DS buffer on host is not supported. */ + if (event->attr.exclude_host) + return -EINVAL; + if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) { event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; if (!(event->attr.sample_type & Thanks, Kan > > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") > Signed-off-by: Like Xu <likexu@tencent.com> > --- > arch/x86/events/intel/core.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index a08f794a0e79..17afd504c35b 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) > .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, > }; > > + /* In any case, clear guest PEBS bits first. */ > + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; > + > if (arr[pebs_enable].host) { > /* Disable guest PEBS if host PEBS is enabled. */ > arr[pebs_enable].guest = 0; > } else { > /* Disable guest PEBS thoroughly for cross-mapped PEBS counters. */ > arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask; > - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; > + > + /* Prevent any host user from enabling PEBS for profiling guest. 
*/ > + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & kvm_pmu->global_ctrl); > + > /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */ > arr[global_ctrl].guest |= arr[pebs_enable].guest; > } > > base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
On 29/11/2023 10:38 pm, Liang, Kan wrote: > > > On 2023-11-29 4:50 a.m., Like Xu wrote: >> From: Like Xu <likexu@tencent.com> >> >> Stop using PEBS counters on host to profiling guest. Limit the range of >> enabled PEBS counters to only those counters enabled from the guest PEBS >> emulation perspective. >> >> If there is a perf-record agent on host that uses perf-tools events like >> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event counter) >> to capture guest performance events, then the guest will be hanged. This is >> because Intel DS-based PEBS buffer is addressed using the 64-bit linear >> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >> >> Any perf user using PEBS counters to profile guest on host is, in perf/core >> implementation details, trying to set bits on cpuc->intel_ctrl_guest_mask >> and arr[pebs_enable].guest, much like the guest PEBS emulation behaviour. >> But the subsequent PEBS memory write, regardless of whether guest PEBS is >> enabled, can overshoot guest entry and corrupt guest memory. >> >> Profiling guest via PEBS-DS buffer on host is not supported at this time. >> Fix this by filtering the real configured value of arr[pebs_enable].guest >> with the emulated state of guest enabled PEBS counters, under the condition >> of none cross-mapped PEBS counters. > > So the counter will be silently disabled. The user never knows why > nothing is sampled. > Since we don't support the case, profiling guest via PEBS-DS buffer on > host. Maybe we should error out when creating the event. For example > (not tested), Test failed. > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index 3871267d3237..24b90c70737f 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct perf_event > *event) > if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == > INTEL_FIXED_VLBR_EVENT) > return -EINVAL; > > + /* Profiling guest via PEBS-DS buffer on host is not supported. */ > + if (event->attr.exclude_host) > + return -EINVAL; > + Guest PEBS emulation also sets this bit, a typical call stack looks like: intel_pmu_hw_config+0x441/0x4d0 hsw_hw_config+0x12/0xa0 x86_pmu_event_init+0x98/0x370 perf_try_init_event+0x47/0x130 perf_event_alloc+0x446/0xeb0 perf_event_create_kernel_counter+0x38/0x190 pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] kvm_pmu_handle_event+0x1a6/0x310 [kvm] vcpu_enter_guest+0x1388/0x19b0 [kvm] vcpu_run+0x117/0x6c0 [kvm] kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] kvm_vcpu_ioctl+0x301/0x6e0 [kvm] Alternatively, this path is taken when using PEBS-via-PT to profile guests on host. The status of the guest can only be queried in the NMI handler and the func intel_guest_get_msrs() in the perf/core context, where it's easier and more centrally to review this part of changes that affects vPMU for corner cases. Maybe adding print info on the perf-tool side would help. For perf-tool users, it will get 0 number of sample for "cpu-cycles:GP" events, just like other uncounted perf-tool events. 
> if (!(event->attr.freq || (event->attr.wakeup_events && > !event->attr.watermark))) { > event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; > if (!(event->attr.sample_type & > > > Thanks, > Kan > >> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") >> Signed-off-by: Like Xu <likexu@tencent.com> >> --- >> arch/x86/events/intel/core.c | 8 +++++++- >> 1 file changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >> index a08f794a0e79..17afd504c35b 100644 >> --- a/arch/x86/events/intel/core.c >> +++ b/arch/x86/events/intel/core.c >> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) >> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >> }; >> >> + /* In any case, clear guest PEBS bits first. */ >> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >> + >> if (arr[pebs_enable].host) { >> /* Disable guest PEBS if host PEBS is enabled. */ >> arr[pebs_enable].guest = 0; >> } else { >> /* Disable guest PEBS thoroughly for cross-mapped PEBS counters. */ >> arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask; >> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >> + >> + /* Prevent any host user from enabling PEBS for profiling guest. */ >> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & kvm_pmu->global_ctrl); >> + >> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */ >> arr[global_ctrl].guest |= arr[pebs_enable].guest; >> } >> >> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
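The call stack quoted above shows why a bare attr.exclude_host check also trips over KVM's own counters: the in-kernel events that KVM creates for guest PMU emulation are guest-only by construction. The following is a paraphrased sketch of that kind of in-kernel event creation, illustrative only and not the actual KVM source; the precise_ip choice for guest PEBS counters is an assumption.

/*
 * Illustrative sketch, not verbatim KVM code. It shows why
 * "if (event->attr.exclude_host) return -EINVAL;" would also reject
 * the counters KVM itself creates while emulating the guest PMU.
 */
#include <linux/perf_event.h>

static struct perf_event *sketch_create_guest_counter(u64 config, bool pebs)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= config,
		.pinned		= true,
		.exclude_idle	= true,
		.exclude_host	= 1,		/* count only while the guest runs */
		.precise_ip	= pebs ? 1 : 0,	/* assumption: guest PEBS requests a precise event */
	};

	/* in-kernel counter bound to the vCPU task, no overflow handler in this sketch */
	return perf_event_create_kernel_counter(&attr, -1, current, NULL, NULL);
}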
On 2023-11-30 2:29 a.m., Like Xu wrote: > On 29/11/2023 10:38 pm, Liang, Kan wrote: >> >> >> On 2023-11-29 4:50 a.m., Like Xu wrote: >>> From: Like Xu <likexu@tencent.com> >>> >>> Stop using PEBS counters on host to profiling guest. Limit the range of >>> enabled PEBS counters to only those counters enabled from the guest PEBS >>> emulation perspective. >>> >>> If there is a perf-record agent on host that uses perf-tools events like >>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>> counter) >>> to capture guest performance events, then the guest will be hanged. >>> This is >>> because Intel DS-based PEBS buffer is addressed using the 64-bit linear >>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>> >>> Any perf user using PEBS counters to profile guest on host is, in >>> perf/core >>> implementation details, trying to set bits on >>> cpuc->intel_ctrl_guest_mask >>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>> behaviour. >>> But the subsequent PEBS memory write, regardless of whether guest >>> PEBS is >>> enabled, can overshoot guest entry and corrupt guest memory. >>> >>> Profiling guest via PEBS-DS buffer on host is not supported at this >>> time. >>> Fix this by filtering the real configured value of >>> arr[pebs_enable].guest >>> with the emulated state of guest enabled PEBS counters, under the >>> condition >>> of none cross-mapped PEBS counters. >> >> So the counter will be silently disabled. The user never knows why >> nothing is sampled. >> Since we don't support the case, profiling guest via PEBS-DS buffer on >> host. Maybe we should error out when creating the event. For example >> (not tested), > > Test failed. > >> >> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >> index 3871267d3237..24b90c70737f 100644 >> --- a/arch/x86/events/intel/core.c >> +++ b/arch/x86/events/intel/core.c >> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct perf_event >> *event) >> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >> INTEL_FIXED_VLBR_EVENT) >> return -EINVAL; >> >> + /* Profiling guest via PEBS-DS buffer on host is not >> supported. */ >> + if (event->attr.exclude_host) >> + return -EINVAL; >> + > > Guest PEBS emulation also sets this bit, a typical call stack looks like: > > intel_pmu_hw_config+0x441/0x4d0 > hsw_hw_config+0x12/0xa0 > x86_pmu_event_init+0x98/0x370 > perf_try_init_event+0x47/0x130 > perf_event_alloc+0x446/0xeb0 > perf_event_create_kernel_counter+0x38/0x190 > pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] > kvm_pmu_handle_event+0x1a6/0x310 [kvm] > vcpu_enter_guest+0x1388/0x19b0 [kvm] > vcpu_run+0x117/0x6c0 [kvm] > kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] > kvm_vcpu_ioctl+0x301/0x6e0 [kvm] > Oh right, the event from the KVM guest is also exclude_host. So we should only error out with the non-KVM exclude_host PEBS event. > Alternatively, this path is taken when using PEBS-via-PT to profile > guests on host. There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. Seems we just need to distinguish a KVM event and a normal host event. I don't have a better way to do it except using event->overflow_handler_context, which is NULL for a normal host event. 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index a968708ed1fb..c93a2aaff7c3 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct perf_event *event) if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT) return -EINVAL; + /* + * Profiling guest via PEBS-DS buffer on host is not supported. + * The event->overflow_handler_context is to distinguish a KVM + * event and a normal host event. + */ + if (event->attr.exclude_host && + !is_pebs_pt(event) && + !event->overflow_handler_context) + return -EINVAL; + if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) { event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; if (!(event->attr.sample_type & > > The status of the guest can only be queried in the NMI handler and the func > intel_guest_get_msrs() in the perf/core context, where it's easier and more > centrally to review this part of changes that affects vPMU for corner > cases. > > Maybe adding print info on the perf-tool side would help. > > For perf-tool users, it will get 0 number of sample for "cpu-cycles:GP" > events, > just like other uncounted perf-tool events. perf-tool would never know such details, e.g., whether the platform supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is because of an unsupported hardware or nothing sampled in the guest. Thanks, Kan > >> if (!(event->attr.freq || (event->attr.wakeup_events && >> !event->attr.watermark))) { >> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >> if (!(event->attr.sample_type & >> >> >> Thanks, >> Kan >> >>> >>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>> emulation for extended PEBS") >>> Signed-off-by: Like Xu <likexu@tencent.com> >>> --- >>> arch/x86/events/intel/core.c | 8 +++++++- >>> 1 file changed, 7 insertions(+), 1 deletion(-) >>> >>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >>> index a08f794a0e79..17afd504c35b 100644 >>> --- a/arch/x86/events/intel/core.c >>> +++ b/arch/x86/events/intel/core.c >>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>> *intel_guest_get_msrs(int *nr, void *data) >>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>> }; >>> + /* In any case, clear guest PEBS bits first. */ >>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>> + >>> if (arr[pebs_enable].host) { >>> /* Disable guest PEBS if host PEBS is enabled. */ >>> arr[pebs_enable].guest = 0; >>> } else { >>> /* Disable guest PEBS thoroughly for cross-mapped PEBS >>> counters. */ >>> arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask; >>> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >>> + >>> + /* Prevent any host user from enabling PEBS for profiling >>> guest. */ >>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>> kvm_pmu->global_ctrl); >>> + >>> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs >>> for guest */ >>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>> } >>> >>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
On 30/11/2023 11:49 pm, Liang, Kan wrote: > > > On 2023-11-30 2:29 a.m., Like Xu wrote: >> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>> >>> >>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>> From: Like Xu <likexu@tencent.com> >>>> >>>> Stop using PEBS counters on host to profiling guest. Limit the range of >>>> enabled PEBS counters to only those counters enabled from the guest PEBS >>>> emulation perspective. >>>> >>>> If there is a perf-record agent on host that uses perf-tools events like >>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>> counter) >>>> to capture guest performance events, then the guest will be hanged. >>>> This is >>>> because Intel DS-based PEBS buffer is addressed using the 64-bit linear >>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>> >>>> Any perf user using PEBS counters to profile guest on host is, in >>>> perf/core >>>> implementation details, trying to set bits on >>>> cpuc->intel_ctrl_guest_mask >>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>> behaviour. >>>> But the subsequent PEBS memory write, regardless of whether guest >>>> PEBS is >>>> enabled, can overshoot guest entry and corrupt guest memory. >>>> >>>> Profiling guest via PEBS-DS buffer on host is not supported at this >>>> time. >>>> Fix this by filtering the real configured value of >>>> arr[pebs_enable].guest >>>> with the emulated state of guest enabled PEBS counters, under the >>>> condition >>>> of none cross-mapped PEBS counters. >>> >>> So the counter will be silently disabled. The user never knows why >>> nothing is sampled. >>> Since we don't support the case, profiling guest via PEBS-DS buffer on >>> host. Maybe we should error out when creating the event. For example >>> (not tested), >> >> Test failed. >> >>> >>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >>> index 3871267d3237..24b90c70737f 100644 >>> --- a/arch/x86/events/intel/core.c >>> +++ b/arch/x86/events/intel/core.c >>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct perf_event >>> *event) >>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>> INTEL_FIXED_VLBR_EVENT) >>> return -EINVAL; >>> >>> + /* Profiling guest via PEBS-DS buffer on host is not >>> supported. */ >>> + if (event->attr.exclude_host) >>> + return -EINVAL; >>> + >> >> Guest PEBS emulation also sets this bit, a typical call stack looks like: >> >> intel_pmu_hw_config+0x441/0x4d0 >> hsw_hw_config+0x12/0xa0 >> x86_pmu_event_init+0x98/0x370 >> perf_try_init_event+0x47/0x130 >> perf_event_alloc+0x446/0xeb0 >> perf_event_create_kernel_counter+0x38/0x190 >> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >> vcpu_enter_guest+0x1388/0x19b0 [kvm] >> vcpu_run+0x117/0x6c0 [kvm] >> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >> > > Oh right, the event from the KVM guest is also exclude_host. > So we should only error out with the non-KVM exclude_host PEBS event. > >> Alternatively, this path is taken when using PEBS-via-PT to profile >> guests on host. > > There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. > > Seems we just need to distinguish a KVM event and a normal host event. > I don't have a better way to do it except using > event->overflow_handler_context, which is NULL for a normal host event. The assumption that event->overflow_handler_context == NULL is unsafe, considering .overflow_handler_context hook has its acclaimed generality. 
I understand your motivation very well, but this is not the right move (based on my previous history of being sprayed by Peter). For perf/core, in-kernel perf_events should be treated equally, and the semantics of kvm_pmu should only be accounted when a perf/core API is only used for guest-only path. In this case for KVM perf_events, intel_guest_get_msrs() and x86_pmu_handle_guest_pebs() have this context. > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index a968708ed1fb..c93a2aaff7c3 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct perf_event > *event) > if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == > INTEL_FIXED_VLBR_EVENT) > return -EINVAL; > > + /* > + * Profiling guest via PEBS-DS buffer on host is not supported. > + * The event->overflow_handler_context is to distinguish a KVM > + * event and a normal host event. > + */ > + if (event->attr.exclude_host && > + !is_pebs_pt(event) && > + !event->overflow_handler_context) > + return -EINVAL; > + > if (!(event->attr.freq || (event->attr.wakeup_events && > !event->attr.watermark))) { > event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; > if (!(event->attr.sample_type & > >> >> The status of the guest can only be queried in the NMI handler and the func >> intel_guest_get_msrs() in the perf/core context, where it's easier and more >> centrally to review this part of changes that affects vPMU for corner >> cases. >> >> Maybe adding print info on the perf-tool side would help. >> >> For perf-tool users, it will get 0 number of sample for "cpu-cycles:GP" >> events, >> just like other uncounted perf-tool events. > > perf-tool would never know such details, e.g., whether the platform > supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is > because of an unsupported hardware or nothing sampled in the guest. It kind of looks like this use case to me: perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload # ICX # Total Lost Samples: 0 # # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' # Event count (approx.): 0 A end-user has to check if the event-umask combination is supported or not, or nothing sampled for the workload. Is there any room for improvement in perf-tool to reduce the pain of this part ? If any, the same thing could be applied to cpu-cycles:GP, isn't it ? > > Thanks, > Kan >> >>> if (!(event->attr.freq || (event->attr.wakeup_events && >>> !event->attr.watermark))) { >>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>> if (!(event->attr.sample_type & >>> >>> >>> Thanks, >>> Kan >>> >>>> >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>> emulation for extended PEBS") >>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>> --- >>>> arch/x86/events/intel/core.c | 8 +++++++- >>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >>>> index a08f794a0e79..17afd504c35b 100644 >>>> --- a/arch/x86/events/intel/core.c >>>> +++ b/arch/x86/events/intel/core.c >>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>> *intel_guest_get_msrs(int *nr, void *data) >>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>> }; >>>> + /* In any case, clear guest PEBS bits first. */ >>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>> + >>>> if (arr[pebs_enable].host) { >>>> /* Disable guest PEBS if host PEBS is enabled. 
*/ >>>> arr[pebs_enable].guest = 0; >>>> } else { >>>> /* Disable guest PEBS thoroughly for cross-mapped PEBS >>>> counters. */ >>>> arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>> + >>>> + /* Prevent any host user from enabling PEBS for profiling >>>> guest. */ >>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>> kvm_pmu->global_ctrl); >>>> + >>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs >>>> for guest */ >>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>> } >>>> >>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
On 2023-11-30 10:59 p.m., Like Xu wrote: > On 30/11/2023 11:49 pm, Liang, Kan wrote: >> >> >> On 2023-11-30 2:29 a.m., Like Xu wrote: >>> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>>> >>>> >>>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>>> From: Like Xu <likexu@tencent.com> >>>>> >>>>> Stop using PEBS counters on host to profiling guest. Limit the >>>>> range of >>>>> enabled PEBS counters to only those counters enabled from the guest >>>>> PEBS >>>>> emulation perspective. >>>>> >>>>> If there is a perf-record agent on host that uses perf-tools events >>>>> like >>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>>> counter) >>>>> to capture guest performance events, then the guest will be hanged. >>>>> This is >>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit >>>>> linear >>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>>> >>>>> Any perf user using PEBS counters to profile guest on host is, in >>>>> perf/core >>>>> implementation details, trying to set bits on >>>>> cpuc->intel_ctrl_guest_mask >>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>>> behaviour. >>>>> But the subsequent PEBS memory write, regardless of whether guest >>>>> PEBS is >>>>> enabled, can overshoot guest entry and corrupt guest memory. >>>>> >>>>> Profiling guest via PEBS-DS buffer on host is not supported at this >>>>> time. >>>>> Fix this by filtering the real configured value of >>>>> arr[pebs_enable].guest >>>>> with the emulated state of guest enabled PEBS counters, under the >>>>> condition >>>>> of none cross-mapped PEBS counters. >>>> >>>> So the counter will be silently disabled. The user never knows why >>>> nothing is sampled. >>>> Since we don't support the case, profiling guest via PEBS-DS buffer on >>>> host. Maybe we should error out when creating the event. For example >>>> (not tested), >>> >>> Test failed. >>> >>>> >>>> diff --git a/arch/x86/events/intel/core.c >>>> b/arch/x86/events/intel/core.c >>>> index 3871267d3237..24b90c70737f 100644 >>>> --- a/arch/x86/events/intel/core.c >>>> +++ b/arch/x86/events/intel/core.c >>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct perf_event >>>> *event) >>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>> INTEL_FIXED_VLBR_EVENT) >>>> return -EINVAL; >>>> >>>> + /* Profiling guest via PEBS-DS buffer on host is not >>>> supported. */ >>>> + if (event->attr.exclude_host) >>>> + return -EINVAL; >>>> + >>> >>> Guest PEBS emulation also sets this bit, a typical call stack looks >>> like: >>> >>> intel_pmu_hw_config+0x441/0x4d0 >>> hsw_hw_config+0x12/0xa0 >>> x86_pmu_event_init+0x98/0x370 >>> perf_try_init_event+0x47/0x130 >>> perf_event_alloc+0x446/0xeb0 >>> perf_event_create_kernel_counter+0x38/0x190 >>> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >>> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >>> vcpu_enter_guest+0x1388/0x19b0 [kvm] >>> vcpu_run+0x117/0x6c0 [kvm] >>> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >>> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >>> >> >> Oh right, the event from the KVM guest is also exclude_host. >> So we should only error out with the non-KVM exclude_host PEBS event. >> >>> Alternatively, this path is taken when using PEBS-via-PT to profile >>> guests on host. >> >> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. >> >> Seems we just need to distinguish a KVM event and a normal host event. >> I don't have a better way to do it except using >> event->overflow_handler_context, which is NULL for a normal host event. 
> > The assumption that event->overflow_handler_context == NULL is unsafe, > considering .overflow_handler_context hook has its acclaimed generality. > Yes, I agree it's very hacky. What we need here is a way to distinguish the KVM guest request from the others. How about the PF_VCPU flag? if (event->attr.exclude_host && !is_pebs_pt(event) && (!(event->attach_state & PERF_ATTACH_TASK) || !(current->flags & PF_VCPU)) return -EINVAL; > I understand your motivation very well, but this is not the right move > (based on my previous history of being sprayed by Peter). For perf/core, > in-kernel perf_events should be treated equally, and the semantics of > kvm_pmu should only be accounted when a perf/core API is only used for > guest-only path. In this case for KVM perf_events, intel_guest_get_msrs() > and x86_pmu_handle_guest_pebs() have this context. > >> >> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >> index a968708ed1fb..c93a2aaff7c3 100644 >> --- a/arch/x86/events/intel/core.c >> +++ b/arch/x86/events/intel/core.c >> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct perf_event >> *event) >> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >> INTEL_FIXED_VLBR_EVENT) >> return -EINVAL; >> >> + /* >> + * Profiling guest via PEBS-DS buffer on host is not supported. >> + * The event->overflow_handler_context is to distinguish a KVM >> + * event and a normal host event. >> + */ >> + if (event->attr.exclude_host && >> + !is_pebs_pt(event) && >> + !event->overflow_handler_context) >> + return -EINVAL; >> + >> if (!(event->attr.freq || (event->attr.wakeup_events && >> !event->attr.watermark))) { >> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >> if (!(event->attr.sample_type & >> >>> >>> The status of the guest can only be queried in the NMI handler and >>> the func >>> intel_guest_get_msrs() in the perf/core context, where it's easier >>> and more >>> centrally to review this part of changes that affects vPMU for corner >>> cases. >>> >>> Maybe adding print info on the perf-tool side would help. >>> >>> For perf-tool users, it will get 0 number of sample for "cpu-cycles:GP" >>> events, >>> just like other uncounted perf-tool events. >> >> perf-tool would never know such details, e.g., whether the platform >> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is >> because of an unsupported hardware or nothing sampled in the guest. > > It kind of looks like this use case to me: > > perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload # ICX > > # Total Lost Samples: 0 > # > # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' > # Event count (approx.): 0 > > A end-user has to check if the event-umask combination is supported or not, > or nothing sampled for the workload. Is there any room for improvement in > perf-tool to reduce the pain of this part ? If any, the same thing could > be applied > to cpu-cycles:GP, isn't it ? I don't think we can expect the end user knows such details. Most of them may even don't know what's PEBS-via-DS. 
Thanks, Kan > >> >> Thanks, >> Kan >>> >>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>> !event->attr.watermark))) { >>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>> if (!(event->attr.sample_type & >>>> >>>> >>>> Thanks, >>>> Kan >>>> >>>>> >>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>>> emulation for extended PEBS") >>>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>>> --- >>>>> arch/x86/events/intel/core.c | 8 +++++++- >>>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/arch/x86/events/intel/core.c >>>>> b/arch/x86/events/intel/core.c >>>>> index a08f794a0e79..17afd504c35b 100644 >>>>> --- a/arch/x86/events/intel/core.c >>>>> +++ b/arch/x86/events/intel/core.c >>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>>> *intel_guest_get_msrs(int *nr, void *data) >>>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>>> }; >>>>> + /* In any case, clear guest PEBS bits first. */ >>>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>>> + >>>>> if (arr[pebs_enable].host) { >>>>> /* Disable guest PEBS if host PEBS is enabled. */ >>>>> arr[pebs_enable].guest = 0; >>>>> } else { >>>>> /* Disable guest PEBS thoroughly for cross-mapped PEBS >>>>> counters. */ >>>>> arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>>> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>>> + >>>>> + /* Prevent any host user from enabling PEBS for profiling >>>>> guest. */ >>>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>>> kvm_pmu->global_ctrl); >>>>> + >>>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs >>>>> for guest */ >>>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>>> } >>>>> >>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
On 1/12/2023 10:38 pm, Liang, Kan wrote: > > > On 2023-11-30 10:59 p.m., Like Xu wrote: >> On 30/11/2023 11:49 pm, Liang, Kan wrote: >>> >>> >>> On 2023-11-30 2:29 a.m., Like Xu wrote: >>>> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>>>> >>>>> >>>>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>>>> From: Like Xu <likexu@tencent.com> >>>>>> >>>>>> Stop using PEBS counters on host to profiling guest. Limit the >>>>>> range of >>>>>> enabled PEBS counters to only those counters enabled from the guest >>>>>> PEBS >>>>>> emulation perspective. >>>>>> >>>>>> If there is a perf-record agent on host that uses perf-tools events >>>>>> like >>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>>>> counter) >>>>>> to capture guest performance events, then the guest will be hanged. >>>>>> This is >>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit >>>>>> linear >>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>>>> >>>>>> Any perf user using PEBS counters to profile guest on host is, in >>>>>> perf/core >>>>>> implementation details, trying to set bits on >>>>>> cpuc->intel_ctrl_guest_mask >>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>>>> behaviour. >>>>>> But the subsequent PEBS memory write, regardless of whether guest >>>>>> PEBS is >>>>>> enabled, can overshoot guest entry and corrupt guest memory. >>>>>> >>>>>> Profiling guest via PEBS-DS buffer on host is not supported at this >>>>>> time. >>>>>> Fix this by filtering the real configured value of >>>>>> arr[pebs_enable].guest >>>>>> with the emulated state of guest enabled PEBS counters, under the >>>>>> condition >>>>>> of none cross-mapped PEBS counters. >>>>> >>>>> So the counter will be silently disabled. The user never knows why >>>>> nothing is sampled. >>>>> Since we don't support the case, profiling guest via PEBS-DS buffer on >>>>> host. Maybe we should error out when creating the event. For example >>>>> (not tested), >>>> >>>> Test failed. >>>> >>>>> >>>>> diff --git a/arch/x86/events/intel/core.c >>>>> b/arch/x86/events/intel/core.c >>>>> index 3871267d3237..24b90c70737f 100644 >>>>> --- a/arch/x86/events/intel/core.c >>>>> +++ b/arch/x86/events/intel/core.c >>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct perf_event >>>>> *event) >>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>> INTEL_FIXED_VLBR_EVENT) >>>>> return -EINVAL; >>>>> >>>>> + /* Profiling guest via PEBS-DS buffer on host is not >>>>> supported. */ >>>>> + if (event->attr.exclude_host) >>>>> + return -EINVAL; >>>>> + >>>> >>>> Guest PEBS emulation also sets this bit, a typical call stack looks >>>> like: >>>> >>>> intel_pmu_hw_config+0x441/0x4d0 >>>> hsw_hw_config+0x12/0xa0 >>>> x86_pmu_event_init+0x98/0x370 >>>> perf_try_init_event+0x47/0x130 >>>> perf_event_alloc+0x446/0xeb0 >>>> perf_event_create_kernel_counter+0x38/0x190 >>>> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >>>> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >>>> vcpu_enter_guest+0x1388/0x19b0 [kvm] >>>> vcpu_run+0x117/0x6c0 [kvm] >>>> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >>>> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >>>> >>> >>> Oh right, the event from the KVM guest is also exclude_host. >>> So we should only error out with the non-KVM exclude_host PEBS event. >>> >>>> Alternatively, this path is taken when using PEBS-via-PT to profile >>>> guests on host. >>> >>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. 
>>> >>> Seems we just need to distinguish a KVM event and a normal host event. >>> I don't have a better way to do it except using >>> event->overflow_handler_context, which is NULL for a normal host event. >> >> The assumption that event->overflow_handler_context == NULL is unsafe, >> considering .overflow_handler_context hook has its acclaimed generality. >> > > Yes, I agree it's very hacky. > > What we need here is a way to distinguish the KVM guest request from the > others. How about the PF_VCPU flag? > > if (event->attr.exclude_host && > !is_pebs_pt(event) && > (!(event->attach_state & PERF_ATTACH_TASK) || !(current->flags & > PF_VCPU)) > return -EINVAL; Unfortunately, the tests are not passed w/ the above diff. But it's good to know that there is PF_VCPU and more things to play around with. The AMD IBS also takes the precise path, but it puts the recorded values on group of MSRs instead of the linear address-based memory. The root cause of this issue is the hardware limitation, just like what we did in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"), a similar fix should also belong in the context of directly configuring VMX-switch related hardware configuration values. I haven't find a better location than intel_guest_get_msrs(). > > >> I understand your motivation very well, but this is not the right move >> (based on my previous history of being sprayed by Peter). For perf/core, >> in-kernel perf_events should be treated equally, and the semantics of >> kvm_pmu should only be accounted when a perf/core API is only used for >> guest-only path. In this case for KVM perf_events, intel_guest_get_msrs() >> and x86_pmu_handle_guest_pebs() have this context. >> >>> >>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c >>> index a968708ed1fb..c93a2aaff7c3 100644 >>> --- a/arch/x86/events/intel/core.c >>> +++ b/arch/x86/events/intel/core.c >>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct perf_event >>> *event) >>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>> INTEL_FIXED_VLBR_EVENT) >>> return -EINVAL; >>> >>> + /* >>> + * Profiling guest via PEBS-DS buffer on host is not supported. >>> + * The event->overflow_handler_context is to distinguish a KVM >>> + * event and a normal host event. >>> + */ >>> + if (event->attr.exclude_host && >>> + !is_pebs_pt(event) && >>> + !event->overflow_handler_context) >>> + return -EINVAL; >>> + >>> if (!(event->attr.freq || (event->attr.wakeup_events && >>> !event->attr.watermark))) { >>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>> if (!(event->attr.sample_type & >>> >>>> >>>> The status of the guest can only be queried in the NMI handler and >>>> the func >>>> intel_guest_get_msrs() in the perf/core context, where it's easier >>>> and more >>>> centrally to review this part of changes that affects vPMU for corner >>>> cases. >>>> >>>> Maybe adding print info on the perf-tool side would help. >>>> >>>> For perf-tool users, it will get 0 number of sample for "cpu-cycles:GP" >>>> events, >>>> just like other uncounted perf-tool events. >>> >>> perf-tool would never know such details, e.g., whether the platform >>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is >>> because of an unsupported hardware or nothing sampled in the guest. 
>> >> It kind of looks like this use case to me: >> >> perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload # ICX >> >> # Total Lost Samples: 0 >> # >> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >> # Event count (approx.): 0 >> >> A end-user has to check if the event-umask combination is supported or not, >> or nothing sampled for the workload. Is there any room for improvement in >> perf-tool to reduce the pain of this part ? If any, the same thing could >> be applied >> to cpu-cycles:GP, isn't it ? > > I don't think we can expect the end user knows such details. Most of > them may even don't know what's PEBS-via-DS. Maybe the following generic reminder helps: # Total Lost Samples: 0 # Note: the event is not counted or unsupported. # # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' # Event count (approx.): 0 > > Thanks, > Kan > >> >>> >>> Thanks, >>> Kan >>>> >>>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>>> !event->attr.watermark))) { >>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>> if (!(event->attr.sample_type & >>>>> >>>>> >>>>> Thanks, >>>>> Kan >>>>> >>>>>> >>>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>>>> emulation for extended PEBS") >>>>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>>>> --- >>>>>> arch/x86/events/intel/core.c | 8 +++++++- >>>>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>> b/arch/x86/events/intel/core.c >>>>>> index a08f794a0e79..17afd504c35b 100644 >>>>>> --- a/arch/x86/events/intel/core.c >>>>>> +++ b/arch/x86/events/intel/core.c >>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>>>> *intel_guest_get_msrs(int *nr, void *data) >>>>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>>>> }; >>>>>> + /* In any case, clear guest PEBS bits first. */ >>>>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>>>> + >>>>>> if (arr[pebs_enable].host) { >>>>>> /* Disable guest PEBS if host PEBS is enabled. */ >>>>>> arr[pebs_enable].guest = 0; >>>>>> } else { >>>>>> /* Disable guest PEBS thoroughly for cross-mapped PEBS >>>>>> counters. */ >>>>>> arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>>>> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>>>> + >>>>>> + /* Prevent any host user from enabling PEBS for profiling >>>>>> guest. */ >>>>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>>>> kvm_pmu->global_ctrl); >>>>>> + >>>>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs >>>>>> for guest */ >>>>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>>>> } >>>>>> >>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf
On 2023-12-04 3:32 a.m., Like Xu wrote: > > > On 1/12/2023 10:38 pm, Liang, Kan wrote: >> >> >> On 2023-11-30 10:59 p.m., Like Xu wrote: >>> On 30/11/2023 11:49 pm, Liang, Kan wrote: >>>> >>>> >>>> On 2023-11-30 2:29 a.m., Like Xu wrote: >>>>> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>>>>> >>>>>> >>>>>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>>>>> From: Like Xu <likexu@tencent.com> >>>>>>> >>>>>>> Stop using PEBS counters on host to profiling guest. Limit the >>>>>>> range of >>>>>>> enabled PEBS counters to only those counters enabled from the guest >>>>>>> PEBS >>>>>>> emulation perspective. >>>>>>> >>>>>>> If there is a perf-record agent on host that uses perf-tools events >>>>>>> like >>>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>>>>> counter) >>>>>>> to capture guest performance events, then the guest will be hanged. >>>>>>> This is >>>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit >>>>>>> linear >>>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>>>>> >>>>>>> Any perf user using PEBS counters to profile guest on host is, in >>>>>>> perf/core >>>>>>> implementation details, trying to set bits on >>>>>>> cpuc->intel_ctrl_guest_mask >>>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>>>>> behaviour. >>>>>>> But the subsequent PEBS memory write, regardless of whether guest >>>>>>> PEBS is >>>>>>> enabled, can overshoot guest entry and corrupt guest memory. >>>>>>> >>>>>>> Profiling guest via PEBS-DS buffer on host is not supported at this >>>>>>> time. >>>>>>> Fix this by filtering the real configured value of >>>>>>> arr[pebs_enable].guest >>>>>>> with the emulated state of guest enabled PEBS counters, under the >>>>>>> condition >>>>>>> of none cross-mapped PEBS counters. >>>>>> >>>>>> So the counter will be silently disabled. The user never knows why >>>>>> nothing is sampled. >>>>>> Since we don't support the case, profiling guest via PEBS-DS >>>>>> buffer on >>>>>> host. Maybe we should error out when creating the event. For example >>>>>> (not tested), >>>>> >>>>> Test failed. >>>>> >>>>>> >>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>> b/arch/x86/events/intel/core.c >>>>>> index 3871267d3237..24b90c70737f 100644 >>>>>> --- a/arch/x86/events/intel/core.c >>>>>> +++ b/arch/x86/events/intel/core.c >>>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct >>>>>> perf_event >>>>>> *event) >>>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>>> INTEL_FIXED_VLBR_EVENT) >>>>>> return -EINVAL; >>>>>> >>>>>> + /* Profiling guest via PEBS-DS buffer on host is not >>>>>> supported. */ >>>>>> + if (event->attr.exclude_host) >>>>>> + return -EINVAL; >>>>>> + >>>>> >>>>> Guest PEBS emulation also sets this bit, a typical call stack looks >>>>> like: >>>>> >>>>> intel_pmu_hw_config+0x441/0x4d0 >>>>> hsw_hw_config+0x12/0xa0 >>>>> x86_pmu_event_init+0x98/0x370 >>>>> perf_try_init_event+0x47/0x130 >>>>> perf_event_alloc+0x446/0xeb0 >>>>> perf_event_create_kernel_counter+0x38/0x190 >>>>> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >>>>> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >>>>> vcpu_enter_guest+0x1388/0x19b0 [kvm] >>>>> vcpu_run+0x117/0x6c0 [kvm] >>>>> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >>>>> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >>>>> >>>> >>>> Oh right, the event from the KVM guest is also exclude_host. >>>> So we should only error out with the non-KVM exclude_host PEBS event. 
>>>> >>>>> Alternatively, this path is taken when using PEBS-via-PT to profile >>>>> guests on host. >>>> >>>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. >>>> >>>> Seems we just need to distinguish a KVM event and a normal host event. >>>> I don't have a better way to do it except using >>>> event->overflow_handler_context, which is NULL for a normal host event. >>> >>> The assumption that event->overflow_handler_context == NULL is unsafe, >>> considering .overflow_handler_context hook has its acclaimed generality. >>> >> >> Yes, I agree it's very hacky. >> >> What we need here is a way to distinguish the KVM guest request from the >> others. How about the PF_VCPU flag? >> >> if (event->attr.exclude_host && >> !is_pebs_pt(event) && >> (!(event->attach_state & PERF_ATTACH_TASK) || >> !(current->flags & >> PF_VCPU)) >> return -EINVAL; > > Unfortunately, the tests are not passed w/ the above diff. What's the exact failed perf command? Is it because that the KVM guest is mistakenly killed or the host command is not detected? > But it's good to know that there is PF_VCPU and more things to play > around with. > > The AMD IBS also takes the precise path, but it puts the recorded values > on group of MSRs instead of the linear address-based memory. The check is in the intel_pmu_hw_config(). So AMD isn't impacted. > > The root cause of this issue is the hardware limitation, just like what > we did > in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"), > a similar fix should also belong in the context of directly configuring > VMX-switch related hardware configuration values. > The above fix is different than this case. IIRC, It's caused by a ucode issue. So once there is a guest, SW has to explicitly disable the PEBS. Otherwise, the guest crashes even the host doesn't intend to profile the guest. While the case in the patch is apparently a violation of the current rules. I think it's better to detect it and error out early. Thanks, Kan > I haven't find a better location than intel_guest_get_msrs(). > >> >> >>> I understand your motivation very well, but this is not the right move >>> (based on my previous history of being sprayed by Peter). For perf/core, >>> in-kernel perf_events should be treated equally, and the semantics of >>> kvm_pmu should only be accounted when a perf/core API is only used for >>> guest-only path. In this case for KVM perf_events, >>> intel_guest_get_msrs() >>> and x86_pmu_handle_guest_pebs() have this context. >>> >>>> >>>> diff --git a/arch/x86/events/intel/core.c >>>> b/arch/x86/events/intel/core.c >>>> index a968708ed1fb..c93a2aaff7c3 100644 >>>> --- a/arch/x86/events/intel/core.c >>>> +++ b/arch/x86/events/intel/core.c >>>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct perf_event >>>> *event) >>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>> INTEL_FIXED_VLBR_EVENT) >>>> return -EINVAL; >>>> >>>> + /* >>>> + * Profiling guest via PEBS-DS buffer on host is not >>>> supported. >>>> + * The event->overflow_handler_context is to distinguish a KVM >>>> + * event and a normal host event. 
>>>> + */ >>>> + if (event->attr.exclude_host && >>>> + !is_pebs_pt(event) && >>>> + !event->overflow_handler_context) >>>> + return -EINVAL; >>>> + >>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>> !event->attr.watermark))) { >>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>> if (!(event->attr.sample_type & >>>> >>>>> >>>>> The status of the guest can only be queried in the NMI handler and >>>>> the func >>>>> intel_guest_get_msrs() in the perf/core context, where it's easier >>>>> and more >>>>> centrally to review this part of changes that affects vPMU for corner >>>>> cases. >>>>> >>>>> Maybe adding print info on the perf-tool side would help. >>>>> >>>>> For perf-tool users, it will get 0 number of sample for >>>>> "cpu-cycles:GP" >>>>> events, >>>>> just like other uncounted perf-tool events. >>>> >>>> perf-tool would never know such details, e.g., whether the platform >>>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is >>>> because of an unsupported hardware or nothing sampled in the guest. >>> >>> It kind of looks like this use case to me: >>> >>> perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload # >>> ICX >>> >>> # Total Lost Samples: 0 >>> # >>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >>> # Event count (approx.): 0 >>> >>> A end-user has to check if the event-umask combination is supported >>> or not, >>> or nothing sampled for the workload. Is there any room for >>> improvement in >>> perf-tool to reduce the pain of this part ? If any, the same thing could >>> be applied >>> to cpu-cycles:GP, isn't it ? >> >> I don't think we can expect the end user knows such details. Most of >> them may even don't know what's PEBS-via-DS. > > Maybe the following generic reminder helps: > > # Total Lost Samples: 0 > # Note: the event is not counted or unsupported. > # > # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' > # Event count (approx.): 0 > >> >> Thanks, >> Kan >> >>> >>>> >>>> Thanks, >>>> Kan >>>>> >>>>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>>>> !event->attr.watermark))) { >>>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>>> if (!(event->attr.sample_type & >>>>>> >>>>>> >>>>>> Thanks, >>>>>> Kan >>>>>> >>>>>>> >>>>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>>>>> emulation for extended PEBS") >>>>>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>>>>> --- >>>>>>> arch/x86/events/intel/core.c | 8 +++++++- >>>>>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>> b/arch/x86/events/intel/core.c >>>>>>> index a08f794a0e79..17afd504c35b 100644 >>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>>>>> *intel_guest_get_msrs(int *nr, void *data) >>>>>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>>>>> }; >>>>>>> + /* In any case, clear guest PEBS bits first. */ >>>>>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>>>>> + >>>>>>> if (arr[pebs_enable].host) { >>>>>>> /* Disable guest PEBS if host PEBS is enabled. */ >>>>>>> arr[pebs_enable].guest = 0; >>>>>>> } else { >>>>>>> /* Disable guest PEBS thoroughly for cross-mapped PEBS >>>>>>> counters. 
*/ >>>>>>> arr[pebs_enable].guest &= >>>>>>> ~kvm_pmu->host_cross_mapped_mask; >>>>>>> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>>>>> + >>>>>>> + /* Prevent any host user from enabling PEBS for profiling >>>>>>> guest. */ >>>>>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>>>>> kvm_pmu->global_ctrl); >>>>>>> + >>>>>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs >>>>>>> for guest */ >>>>>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>>>>> } >>>>>>> >>>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf >
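As background for the check being debated above: an event string such as "cpu-cycles:GP" is, at the syscall level, just a perf_event_attr with exclude_host = 1 (the :G modifier) and a non-zero precise_ip (the :P modifier). Below is a minimal userspace sketch of opening such an event; the precise_ip value and sample period are illustrative assumptions, and a check like the one proposed above would make this perf_event_open() call fail with EINVAL instead of silently creating a guest-only PEBS event.

#include <linux/perf_event.h>
#include <string.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;	/* "cpu-cycles" */
	attr.exclude_host = 1;			/* ":G": count guest mode only */
	attr.precise_ip = 3;			/* ":P": maximum precision, i.e. a PEBS request (assumed value) */
	attr.sample_period = 100003;		/* arbitrary sampling period */
	attr.disabled = 1;

	/* pid = -1, cpu = 0: a CPU-wide event, similar to what perf kvm record opens */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0)
		perror("perf_event_open");	/* EINVAL once such events are rejected */
	else
		close(fd);
	return 0;
}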
On 4/12/2023 11:19 pm, Liang, Kan wrote: > > > On 2023-12-04 3:32 a.m., Like Xu wrote: >> >> >> On 1/12/2023 10:38 pm, Liang, Kan wrote: >>> >>> >>> On 2023-11-30 10:59 p.m., Like Xu wrote: >>>> On 30/11/2023 11:49 pm, Liang, Kan wrote: >>>>> >>>>> >>>>> On 2023-11-30 2:29 a.m., Like Xu wrote: >>>>>> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>>>>>> >>>>>>> >>>>>>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>>>>>> From: Like Xu <likexu@tencent.com> >>>>>>>> >>>>>>>> Stop using PEBS counters on host to profiling guest. Limit the >>>>>>>> range of >>>>>>>> enabled PEBS counters to only those counters enabled from the guest >>>>>>>> PEBS >>>>>>>> emulation perspective. >>>>>>>> >>>>>>>> If there is a perf-record agent on host that uses perf-tools events >>>>>>>> like >>>>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>>>>>> counter) >>>>>>>> to capture guest performance events, then the guest will be hanged. >>>>>>>> This is >>>>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit >>>>>>>> linear >>>>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>>>>>> >>>>>>>> Any perf user using PEBS counters to profile guest on host is, in >>>>>>>> perf/core >>>>>>>> implementation details, trying to set bits on >>>>>>>> cpuc->intel_ctrl_guest_mask >>>>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>>>>>> behaviour. >>>>>>>> But the subsequent PEBS memory write, regardless of whether guest >>>>>>>> PEBS is >>>>>>>> enabled, can overshoot guest entry and corrupt guest memory. >>>>>>>> >>>>>>>> Profiling guest via PEBS-DS buffer on host is not supported at this >>>>>>>> time. >>>>>>>> Fix this by filtering the real configured value of >>>>>>>> arr[pebs_enable].guest >>>>>>>> with the emulated state of guest enabled PEBS counters, under the >>>>>>>> condition >>>>>>>> of none cross-mapped PEBS counters. >>>>>>> >>>>>>> So the counter will be silently disabled. The user never knows why >>>>>>> nothing is sampled. >>>>>>> Since we don't support the case, profiling guest via PEBS-DS >>>>>>> buffer on >>>>>>> host. Maybe we should error out when creating the event. For example >>>>>>> (not tested), >>>>>> >>>>>> Test failed. >>>>>> >>>>>>> >>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>> b/arch/x86/events/intel/core.c >>>>>>> index 3871267d3237..24b90c70737f 100644 >>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct >>>>>>> perf_event >>>>>>> *event) >>>>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>>>> INTEL_FIXED_VLBR_EVENT) >>>>>>> return -EINVAL; >>>>>>> >>>>>>> + /* Profiling guest via PEBS-DS buffer on host is not >>>>>>> supported. */ >>>>>>> + if (event->attr.exclude_host) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>> >>>>>> Guest PEBS emulation also sets this bit, a typical call stack looks >>>>>> like: >>>>>> >>>>>> intel_pmu_hw_config+0x441/0x4d0 >>>>>> hsw_hw_config+0x12/0xa0 >>>>>> x86_pmu_event_init+0x98/0x370 >>>>>> perf_try_init_event+0x47/0x130 >>>>>> perf_event_alloc+0x446/0xeb0 >>>>>> perf_event_create_kernel_counter+0x38/0x190 >>>>>> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >>>>>> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >>>>>> vcpu_enter_guest+0x1388/0x19b0 [kvm] >>>>>> vcpu_run+0x117/0x6c0 [kvm] >>>>>> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >>>>>> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >>>>>> >>>>> >>>>> Oh right, the event from the KVM guest is also exclude_host. 
>>>>> So we should only error out with the non-KVM exclude_host PEBS event. >>>>> >>>>>> Alternatively, this path is taken when using PEBS-via-PT to profile >>>>>> guests on host. >>>>> >>>>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. >>>>> >>>>> Seems we just need to distinguish a KVM event and a normal host event. >>>>> I don't have a better way to do it except using >>>>> event->overflow_handler_context, which is NULL for a normal host event. >>>> >>>> The assumption that event->overflow_handler_context == NULL is unsafe, >>>> considering .overflow_handler_context hook has its acclaimed generality. >>>> >>> >>> Yes, I agree it's very hacky. >>> >>> What we need here is a way to distinguish the KVM guest request from the >>> others. How about the PF_VCPU flag? >>> >>> if (event->attr.exclude_host && >>> !is_pebs_pt(event) && >>> (!(event->attach_state & PERF_ATTACH_TASK) || >>> !(current->flags & >>> PF_VCPU)) >>> return -EINVAL; >> >> Unfortunately, the tests are not passed w/ the above diff. > > What's the exact failed perf command? [0] one vcpu thread running //usr/bin/yes [1] tools/perf/perf kvm --host --guest \ --guestkallsyms="..." \ --guestvmlinux="..." \ record --raw-samples --kcore \ -e "cpu-cycles:GP" \ -p `pidof cloud-hypervisor` > > Is it because that the KVM guest is mistakenly killed or the host > command is not detected? Theoretically it should error_out, but instead of just printing unsupported info, this command even collects samples pointing to vmx_vmexit, which is also not expected, there should be no samples taken for "cpu-cycles:GP" event. The point at which PF_VCPU is set/cleared is not synchronised or even controllable with the time of perf_event being added and enabled. Going the way of PF_VCPU will bring more complexity. > >> But it's good to know that there is PF_VCPU and more things to play >> around with. >> >> The AMD IBS also takes the precise path, but it puts the recorded values >> on group of MSRs instead of the linear address-based memory. > > The check is in the intel_pmu_hw_config(). So AMD isn't impacted. > >> >> The root cause of this issue is the hardware limitation, just like what >> we did >> in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"), >> a similar fix should also belong in the context of directly configuring >> VMX-switch related hardware configuration values. >> > > The above fix is different than this case. IIRC, It's caused by a ucode > issue. So once there is a guest, SW has to explicitly disable the PEBS. > Otherwise, the guest crashes even the host doesn't intend to profile the > guest. > > While the case in the patch is apparently a violation of the current > rules. I think it's better to detect it and error out early. > > Thanks, > Kan "The host does not intend to profile the guest" implies that at the time 26a4f3c08de4 was introduced, we didn't have the capability to use PEBS to profiling guest, and the error_out message was not prompted to the end-user when vPEBS is not supported, and the fix to prevent guest crashes appears in intel_guest_get_msrs(); Now vPEBS is supported and not enabled, but at this time, we're moving the same logic that fixes guest crashes from intel_guest_get_msrs() to generic intel_pmu_hw_config(). This might not be a good move. For me, it fixes the same issue in the same way (I should have fixed it in c59a1f106f5c). 
If we do more implementation to make PEBS on the host profiling guest, with shared memory and read/write ordering, we don't need to change intel_pmu_hw_config() any more, any of the required change is not out of the scope of intel_guest_get_msrs(). To move forward, this fix won't prevent you from exploring more further fixes in the intel_pmu_hw_config() or elsewhere, could we agree on that ? > >> I haven't find a better location than intel_guest_get_msrs(). >> >>> >>> >>>> I understand your motivation very well, but this is not the right move >>>> (based on my previous history of being sprayed by Peter). For perf/core, >>>> in-kernel perf_events should be treated equally, and the semantics of >>>> kvm_pmu should only be accounted when a perf/core API is only used for >>>> guest-only path. In this case for KVM perf_events, >>>> intel_guest_get_msrs() >>>> and x86_pmu_handle_guest_pebs() have this context. >>>> >>>>> >>>>> diff --git a/arch/x86/events/intel/core.c >>>>> b/arch/x86/events/intel/core.c >>>>> index a968708ed1fb..c93a2aaff7c3 100644 >>>>> --- a/arch/x86/events/intel/core.c >>>>> +++ b/arch/x86/events/intel/core.c >>>>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct perf_event >>>>> *event) >>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>> INTEL_FIXED_VLBR_EVENT) >>>>> return -EINVAL; >>>>> >>>>> + /* >>>>> + * Profiling guest via PEBS-DS buffer on host is not >>>>> supported. >>>>> + * The event->overflow_handler_context is to distinguish a KVM >>>>> + * event and a normal host event. >>>>> + */ >>>>> + if (event->attr.exclude_host && >>>>> + !is_pebs_pt(event) && >>>>> + !event->overflow_handler_context) >>>>> + return -EINVAL; >>>>> + >>>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>>> !event->attr.watermark))) { >>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>> if (!(event->attr.sample_type & >>>>> >>>>>> >>>>>> The status of the guest can only be queried in the NMI handler and >>>>>> the func >>>>>> intel_guest_get_msrs() in the perf/core context, where it's easier >>>>>> and more >>>>>> centrally to review this part of changes that affects vPMU for corner >>>>>> cases. >>>>>> >>>>>> Maybe adding print info on the perf-tool side would help. >>>>>> >>>>>> For perf-tool users, it will get 0 number of sample for >>>>>> "cpu-cycles:GP" >>>>>> events, >>>>>> just like other uncounted perf-tool events. >>>>> >>>>> perf-tool would never know such details, e.g., whether the platform >>>>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is >>>>> because of an unsupported hardware or nothing sampled in the guest. >>>> >>>> It kind of looks like this use case to me: >>>> >>>> perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload # >>>> ICX >>>> >>>> # Total Lost Samples: 0 >>>> # >>>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >>>> # Event count (approx.): 0 >>>> >>>> A end-user has to check if the event-umask combination is supported >>>> or not, >>>> or nothing sampled for the workload. Is there any room for >>>> improvement in >>>> perf-tool to reduce the pain of this part ? If any, the same thing could >>>> be applied >>>> to cpu-cycles:GP, isn't it ? >>> >>> I don't think we can expect the end user knows such details. Most of >>> them may even don't know what's PEBS-via-DS. >> >> Maybe the following generic reminder helps: >> >> # Total Lost Samples: 0 >> # Note: the event is not counted or unsupported. 
>> # >> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >> # Event count (approx.): 0 >> >>> >>> Thanks, >>> Kan >>> >>>> >>>>> >>>>> Thanks, >>>>> Kan >>>>>> >>>>>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>>>>> !event->attr.watermark))) { >>>>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>>>> if (!(event->attr.sample_type & >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> Kan >>>>>>> >>>>>>>> >>>>>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>>>>>> emulation for extended PEBS") >>>>>>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>>>>>> --- >>>>>>>> arch/x86/events/intel/core.c | 8 +++++++- >>>>>>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>>> b/arch/x86/events/intel/core.c >>>>>>>> index a08f794a0e79..17afd504c35b 100644 >>>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>>>>>> *intel_guest_get_msrs(int *nr, void *data) >>>>>>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>>>>>> }; >>>>>>>> + /* In any case, clear guest PEBS bits first. */ >>>>>>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>>>>>> + >>>>>>>> if (arr[pebs_enable].host) { >>>>>>>> /* Disable guest PEBS if host PEBS is enabled. */ >>>>>>>> arr[pebs_enable].guest = 0; >>>>>>>> } else { >>>>>>>> /* Disable guest PEBS thoroughly for cross-mapped PEBS >>>>>>>> counters. */ >>>>>>>> arr[pebs_enable].guest &= >>>>>>>> ~kvm_pmu->host_cross_mapped_mask; >>>>>>>> - arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask; >>>>>>>> + >>>>>>>> + /* Prevent any host user from enabling PEBS for profiling >>>>>>>> guest. */ >>>>>>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>>>>>> kvm_pmu->global_ctrl); >>>>>>>> + >>>>>>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs >>>>>>>> for guest */ >>>>>>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>>>>>> } >>>>>>>> >>>>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf >>
On 2023-12-05 2:24 a.m., Like Xu wrote: > > > On 4/12/2023 11:19 pm, Liang, Kan wrote: >> >> >> On 2023-12-04 3:32 a.m., Like Xu wrote: >>> >>> >>> On 1/12/2023 10:38 pm, Liang, Kan wrote: >>>> >>>> >>>> On 2023-11-30 10:59 p.m., Like Xu wrote: >>>>> On 30/11/2023 11:49 pm, Liang, Kan wrote: >>>>>> >>>>>> >>>>>> On 2023-11-30 2:29 a.m., Like Xu wrote: >>>>>>> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>>>>>>> From: Like Xu <likexu@tencent.com> >>>>>>>>> >>>>>>>>> Stop using PEBS counters on host to profiling guest. Limit the >>>>>>>>> range of >>>>>>>>> enabled PEBS counters to only those counters enabled from the >>>>>>>>> guest >>>>>>>>> PEBS >>>>>>>>> emulation perspective. >>>>>>>>> >>>>>>>>> If there is a perf-record agent on host that uses perf-tools >>>>>>>>> events >>>>>>>>> like >>>>>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>>>>>>> counter) >>>>>>>>> to capture guest performance events, then the guest will be >>>>>>>>> hanged. >>>>>>>>> This is >>>>>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit >>>>>>>>> linear >>>>>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>>>>>>> >>>>>>>>> Any perf user using PEBS counters to profile guest on host is, in >>>>>>>>> perf/core >>>>>>>>> implementation details, trying to set bits on >>>>>>>>> cpuc->intel_ctrl_guest_mask >>>>>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>>>>>>> behaviour. >>>>>>>>> But the subsequent PEBS memory write, regardless of whether guest >>>>>>>>> PEBS is >>>>>>>>> enabled, can overshoot guest entry and corrupt guest memory. >>>>>>>>> >>>>>>>>> Profiling guest via PEBS-DS buffer on host is not supported at >>>>>>>>> this >>>>>>>>> time. >>>>>>>>> Fix this by filtering the real configured value of >>>>>>>>> arr[pebs_enable].guest >>>>>>>>> with the emulated state of guest enabled PEBS counters, under the >>>>>>>>> condition >>>>>>>>> of none cross-mapped PEBS counters. >>>>>>>> >>>>>>>> So the counter will be silently disabled. The user never knows why >>>>>>>> nothing is sampled. >>>>>>>> Since we don't support the case, profiling guest via PEBS-DS >>>>>>>> buffer on >>>>>>>> host. Maybe we should error out when creating the event. For >>>>>>>> example >>>>>>>> (not tested), >>>>>>> >>>>>>> Test failed. >>>>>>> >>>>>>>> >>>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>>> b/arch/x86/events/intel/core.c >>>>>>>> index 3871267d3237..24b90c70737f 100644 >>>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct >>>>>>>> perf_event >>>>>>>> *event) >>>>>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>>>>> INTEL_FIXED_VLBR_EVENT) >>>>>>>> return -EINVAL; >>>>>>>> >>>>>>>> + /* Profiling guest via PEBS-DS buffer on host is not >>>>>>>> supported. 
*/ >>>>>>>> + if (event->attr.exclude_host) >>>>>>>> + return -EINVAL; >>>>>>>> + >>>>>>> >>>>>>> Guest PEBS emulation also sets this bit, a typical call stack looks >>>>>>> like: >>>>>>> >>>>>>> intel_pmu_hw_config+0x441/0x4d0 >>>>>>> hsw_hw_config+0x12/0xa0 >>>>>>> x86_pmu_event_init+0x98/0x370 >>>>>>> perf_try_init_event+0x47/0x130 >>>>>>> perf_event_alloc+0x446/0xeb0 >>>>>>> perf_event_create_kernel_counter+0x38/0x190 >>>>>>> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >>>>>>> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >>>>>>> vcpu_enter_guest+0x1388/0x19b0 [kvm] >>>>>>> vcpu_run+0x117/0x6c0 [kvm] >>>>>>> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >>>>>>> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >>>>>>> >>>>>> >>>>>> Oh right, the event from the KVM guest is also exclude_host. >>>>>> So we should only error out with the non-KVM exclude_host PEBS event. >>>>>> >>>>>>> Alternatively, this path is taken when using PEBS-via-PT to profile >>>>>>> guests on host. >>>>>> >>>>>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. >>>>>> >>>>>> Seems we just need to distinguish a KVM event and a normal host >>>>>> event. >>>>>> I don't have a better way to do it except using >>>>>> event->overflow_handler_context, which is NULL for a normal host >>>>>> event. >>>>> >>>>> The assumption that event->overflow_handler_context == NULL is unsafe, >>>>> considering .overflow_handler_context hook has its acclaimed >>>>> generality. >>>>> >>>> >>>> Yes, I agree it's very hacky. >>>> >>>> What we need here is a way to distinguish the KVM guest request from >>>> the >>>> others. How about the PF_VCPU flag? >>>> >>>> if (event->attr.exclude_host && >>>> !is_pebs_pt(event) && >>>> (!(event->attach_state & PERF_ATTACH_TASK) || >>>> !(current->flags & >>>> PF_VCPU)) >>>> return -EINVAL; >>> >>> Unfortunately, the tests are not passed w/ the above diff. >> >> What's the exact failed perf command? > > [0] one vcpu thread running //usr/bin/yes > [1] tools/perf/perf kvm --host --guest \ > --guestkallsyms="..." \ --guestvmlinux="..." \ > record --raw-samples --kcore \ > -e "cpu-cycles:GP" \ > -p `pidof cloud-hypervisor` > >> >> Is it because that the KVM guest is mistakenly killed or the host >> command is not detected? > > Theoretically it should error_out, but instead of just printing > unsupported info, > this command even collects samples pointing to vmx_vmexit, which is also > not > expected, there should be no samples taken for "cpu-cycles:GP" event. If the event creation is rejected by the kernel, the perf tool will automatically remove the P and give another try. What's the perf tool output with -vvv? Do you see "decreasing precise_ip by one"? If the precise_ip is decreased to 0, the event is a non-pebs which can be created successfully. So you will still see the samples. > > The point at which PF_VCPU is set/cleared is not synchronised or even > controllable with the time of perf_event being added and enabled. > > Going the way of PF_VCPU will bring more complexity. Seems we need a new flag to distinguish the KVM created event. > >> >>> But it's good to know that there is PF_VCPU and more things to play >>> around with. >>> >>> The AMD IBS also takes the precise path, but it puts the recorded values >>> on group of MSRs instead of the linear address-based memory. >> >> The check is in the intel_pmu_hw_config(). So AMD isn't impacted. 
>>
>>> The root cause of this issue is the hardware limitation, just like what
>>> we did
>>> in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"),
>>> a similar fix should also belong in the context of directly configuring
>>> VMX-switch related hardware configuration values.
>>>
>>
>> The above fix is different than this case. IIRC, It's caused by a ucode
>> issue. So once there is a guest, SW has to explicitly disable the PEBS.
>> Otherwise, the guest crashes even the host doesn't intend to profile the
>> guest.
>>
>> While the case in the patch is apparently a violation of the current
>> rules. I think it's better to detect it and error out early.
>>
>> Thanks,
>> Kan
>
> "The host does not intend to profile the guest" implies that at the time
> 26a4f3c08de4 was introduced, we didn't have the capability to use PEBS to
> profiling guest, and the error_out message was not prompted to the end-user
> when vPEBS is not supported, and the fix to prevent guest crashes appears
> in intel_guest_get_msrs();
>
> Now vPEBS is supported and not enabled, but at this time, we're moving
> the same logic that fixes guest crashes from intel_guest_get_msrs() to
> generic intel_pmu_hw_config(). This might not be a good move.
> For me, it fixes the same issue in the same way (I should have fixed
> it in c59a1f106f5c).

The problem is that the current KVM is addicted to silently manipulating
the MSRs. It doesn't care whether the end user is properly informed about
the changes. I don't think that's a proper way.

Let's say a user runs a system-wide PEBS event which tries to collect both
host and guest; they should only get samples from the host, right? How do
they know the reason why there are 0 samples from the guest?

>
> If we do more implementation to make PEBS on the host profiling guest,
> with shared memory and read/write ordering, we don't need to change
> intel_pmu_hw_config() any more, any of the required change is not out
> of the scope of intel_guest_get_msrs().

intel_pmu_hw_config() is where we check whether a feature is supported.
If we later support the host profiling the guest, we should remove the
check from intel_pmu_hw_config().

>
> To move forward, this fix won't prevent you from exploring more further
> fixes in the intel_pmu_hw_config() or elsewhere, could we agree on that ?

I think we need a real fix (from error handling to MSR manipulation),
not just a workaround.

I know the current error handling of perf is not perfect. Some KVM
failures may not be able to be explicitly passed to the end user, e.g.,
PEBS cross-mapping if I recall correctly. But that doesn't mean it's not
important. We should try our best to notify the end user whenever possible.

I think we can patch both intel_pmu_hw_config() and
intel_guest_get_msrs() if necessary.

Thanks,
Kan

>
>>
>>> I haven't find a better location than intel_guest_get_msrs().
>>>
>>>>
>>>>
>>>>> I understand your motivation very well, but this is not the right move
>>>>> (based on my previous history of being sprayed by Peter). For
>>>>> perf/core,
>>>>> in-kernel perf_events should be treated equally, and the semantics of
>>>>> kvm_pmu should only be accounted when a perf/core API is only used for
>>>>> guest-only path. In this case for KVM perf_events,
>>>>> intel_guest_get_msrs()
>>>>> and x86_pmu_handle_guest_pebs() have this context.
>>>>> >>>>>> >>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>> b/arch/x86/events/intel/core.c >>>>>> index a968708ed1fb..c93a2aaff7c3 100644 >>>>>> --- a/arch/x86/events/intel/core.c >>>>>> +++ b/arch/x86/events/intel/core.c >>>>>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct >>>>>> perf_event >>>>>> *event) >>>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>>> INTEL_FIXED_VLBR_EVENT) >>>>>> return -EINVAL; >>>>>> >>>>>> + /* >>>>>> + * Profiling guest via PEBS-DS buffer on host is not >>>>>> supported. >>>>>> + * The event->overflow_handler_context is to distinguish >>>>>> a KVM >>>>>> + * event and a normal host event. >>>>>> + */ >>>>>> + if (event->attr.exclude_host && >>>>>> + !is_pebs_pt(event) && >>>>>> + !event->overflow_handler_context) >>>>>> + return -EINVAL; >>>>>> + >>>>>> if (!(event->attr.freq || (event->attr.wakeup_events && >>>>>> !event->attr.watermark))) { >>>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>>> if (!(event->attr.sample_type & >>>>>> >>>>>>> >>>>>>> The status of the guest can only be queried in the NMI handler and >>>>>>> the func >>>>>>> intel_guest_get_msrs() in the perf/core context, where it's easier >>>>>>> and more >>>>>>> centrally to review this part of changes that affects vPMU for >>>>>>> corner >>>>>>> cases. >>>>>>> >>>>>>> Maybe adding print info on the perf-tool side would help. >>>>>>> >>>>>>> For perf-tool users, it will get 0 number of sample for >>>>>>> "cpu-cycles:GP" >>>>>>> events, >>>>>>> just like other uncounted perf-tool events. >>>>>> >>>>>> perf-tool would never know such details, e.g., whether the platform >>>>>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is >>>>>> because of an unsupported hardware or nothing sampled in the guest. >>>>> >>>>> It kind of looks like this use case to me: >>>>> >>>>> perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload # >>>>> ICX >>>>> >>>>> # Total Lost Samples: 0 >>>>> # >>>>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >>>>> # Event count (approx.): 0 >>>>> >>>>> A end-user has to check if the event-umask combination is supported >>>>> or not, >>>>> or nothing sampled for the workload. Is there any room for >>>>> improvement in >>>>> perf-tool to reduce the pain of this part ? If any, the same thing >>>>> could >>>>> be applied >>>>> to cpu-cycles:GP, isn't it ? >>>> >>>> I don't think we can expect the end user knows such details. Most of >>>> them may even don't know what's PEBS-via-DS. >>> >>> Maybe the following generic reminder helps: >>> >>> # Total Lost Samples: 0 >>> # Note: the event is not counted or unsupported. 
>>> # >>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >>> # Event count (approx.): 0 >>> >>>> >>>> Thanks, >>>> Kan >>>> >>>>> >>>>>> >>>>>> Thanks, >>>>>> Kan >>>>>>> >>>>>>>> if (!(event->attr.freq || >>>>>>>> (event->attr.wakeup_events && >>>>>>>> !event->attr.watermark))) { >>>>>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>>>>> if (!(event->attr.sample_type & >>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Kan >>>>>>>> >>>>>>>>> >>>>>>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>>>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>>>>>>> emulation for extended PEBS") >>>>>>>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>>>>>>> --- >>>>>>>>> arch/x86/events/intel/core.c | 8 +++++++- >>>>>>>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>>>> b/arch/x86/events/intel/core.c >>>>>>>>> index a08f794a0e79..17afd504c35b 100644 >>>>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>>>>>>> *intel_guest_get_msrs(int *nr, void *data) >>>>>>>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>>>>>>> }; >>>>>>>>> + /* In any case, clear guest PEBS bits first. */ >>>>>>>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>>>>>>> + >>>>>>>>> if (arr[pebs_enable].host) { >>>>>>>>> /* Disable guest PEBS if host PEBS is enabled. */ >>>>>>>>> arr[pebs_enable].guest = 0; >>>>>>>>> } else { >>>>>>>>> /* Disable guest PEBS thoroughly for cross-mapped >>>>>>>>> PEBS >>>>>>>>> counters. */ >>>>>>>>> arr[pebs_enable].guest &= >>>>>>>>> ~kvm_pmu->host_cross_mapped_mask; >>>>>>>>> - arr[global_ctrl].guest &= >>>>>>>>> ~kvm_pmu->host_cross_mapped_mask; >>>>>>>>> + >>>>>>>>> + /* Prevent any host user from enabling PEBS for profiling >>>>>>>>> guest. */ >>>>>>>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>>>>>>> kvm_pmu->global_ctrl); >>>>>>>>> + >>>>>>>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when >>>>>>>>> it runs >>>>>>>>> for guest */ >>>>>>>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>>>>>>> } >>>>>>>>> >>>>>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf >>>
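To illustrate the tool-side fallback Kan describes: when the kernel rejects the precise (PEBS) flavour of an event, perf retries with precise_ip reduced by one until the open succeeds or precision reaches zero, which is why a rejected "cpu-cycles:GP" can still end up as a plain non-PEBS event that produces samples. A rough userspace sketch of that retry loop, an illustration only and not the actual tools/perf code:

#include <linux/perf_event.h>
#include <string.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open an event, dropping one level of precision on each failure; roughly
 * what the "decreasing precise_ip by one" message in perf -vvv refers to. */
static int open_with_precise_fallback(struct perf_event_attr *attr, pid_t pid, int cpu)
{
	for (;;) {
		int fd = syscall(__NR_perf_event_open, attr, pid, cpu, -1, 0);

		if (fd >= 0 || attr->precise_ip == 0)
			return fd;
		attr->precise_ip--;	/* retry as a less precise (eventually non-PEBS) event */
		fprintf(stderr, "decreasing precise_ip by one (%u)\n",
			(unsigned int)attr->precise_ip);
	}
}

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_host = 1;		/* ":G" */
	attr.precise_ip = 3;		/* ":P" as originally requested (assumed maximum) */
	attr.sample_period = 100003;
	attr.disabled = 1;

	fd = open_with_precise_fallback(&attr, -1, 0);
	if (fd >= 0) {
		printf("opened with precise_ip=%u\n", (unsigned int)attr.precise_ip);
		close(fd);
	}
	return 0;
}

Run against a kernel that rejects the precise open, such a loop would print the fallback message and still report a successful open at a lower precise_ip, matching the samples Like still observed.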
On 5/12/2023 11:19 pm, Liang, Kan wrote: > > > On 2023-12-05 2:24 a.m., Like Xu wrote: >> >> >> On 4/12/2023 11:19 pm, Liang, Kan wrote: >>> >>> >>> On 2023-12-04 3:32 a.m., Like Xu wrote: >>>> >>>> >>>> On 1/12/2023 10:38 pm, Liang, Kan wrote: >>>>> >>>>> >>>>> On 2023-11-30 10:59 p.m., Like Xu wrote: >>>>>> On 30/11/2023 11:49 pm, Liang, Kan wrote: >>>>>>> >>>>>>> >>>>>>> On 2023-11-30 2:29 a.m., Like Xu wrote: >>>>>>>> On 29/11/2023 10:38 pm, Liang, Kan wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 2023-11-29 4:50 a.m., Like Xu wrote: >>>>>>>>>> From: Like Xu <likexu@tencent.com> >>>>>>>>>> >>>>>>>>>> Stop using PEBS counters on host to profiling guest. Limit the >>>>>>>>>> range of >>>>>>>>>> enabled PEBS counters to only those counters enabled from the >>>>>>>>>> guest >>>>>>>>>> PEBS >>>>>>>>>> emulation perspective. >>>>>>>>>> >>>>>>>>>> If there is a perf-record agent on host that uses perf-tools >>>>>>>>>> events >>>>>>>>>> like >>>>>>>>>> "cpu-cycles:GP" (G for attr.exclude_host, P for max precise event >>>>>>>>>> counter) >>>>>>>>>> to capture guest performance events, then the guest will be >>>>>>>>>> hanged. >>>>>>>>>> This is >>>>>>>>>> because Intel DS-based PEBS buffer is addressed using the 64-bit >>>>>>>>>> linear >>>>>>>>>> address of the current {p/v}CPU context based on MSR_IA32_DS_AREA. >>>>>>>>>> >>>>>>>>>> Any perf user using PEBS counters to profile guest on host is, in >>>>>>>>>> perf/core >>>>>>>>>> implementation details, trying to set bits on >>>>>>>>>> cpuc->intel_ctrl_guest_mask >>>>>>>>>> and arr[pebs_enable].guest, much like the guest PEBS emulation >>>>>>>>>> behaviour. >>>>>>>>>> But the subsequent PEBS memory write, regardless of whether guest >>>>>>>>>> PEBS is >>>>>>>>>> enabled, can overshoot guest entry and corrupt guest memory. >>>>>>>>>> >>>>>>>>>> Profiling guest via PEBS-DS buffer on host is not supported at >>>>>>>>>> this >>>>>>>>>> time. >>>>>>>>>> Fix this by filtering the real configured value of >>>>>>>>>> arr[pebs_enable].guest >>>>>>>>>> with the emulated state of guest enabled PEBS counters, under the >>>>>>>>>> condition >>>>>>>>>> of none cross-mapped PEBS counters. >>>>>>>>> >>>>>>>>> So the counter will be silently disabled. The user never knows why >>>>>>>>> nothing is sampled. >>>>>>>>> Since we don't support the case, profiling guest via PEBS-DS >>>>>>>>> buffer on >>>>>>>>> host. Maybe we should error out when creating the event. For >>>>>>>>> example >>>>>>>>> (not tested), >>>>>>>> >>>>>>>> Test failed. >>>>>>>> >>>>>>>>> >>>>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>>>> b/arch/x86/events/intel/core.c >>>>>>>>> index 3871267d3237..24b90c70737f 100644 >>>>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>>>> @@ -3958,6 +3958,10 @@ static int intel_pmu_hw_config(struct >>>>>>>>> perf_event >>>>>>>>> *event) >>>>>>>>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == >>>>>>>>> INTEL_FIXED_VLBR_EVENT) >>>>>>>>> return -EINVAL; >>>>>>>>> >>>>>>>>> + /* Profiling guest via PEBS-DS buffer on host is not >>>>>>>>> supported. 
*/ >>>>>>>>> + if (event->attr.exclude_host) >>>>>>>>> + return -EINVAL; >>>>>>>>> + >>>>>>>> >>>>>>>> Guest PEBS emulation also sets this bit, a typical call stack looks >>>>>>>> like: >>>>>>>> >>>>>>>> intel_pmu_hw_config+0x441/0x4d0 >>>>>>>> hsw_hw_config+0x12/0xa0 >>>>>>>> x86_pmu_event_init+0x98/0x370 >>>>>>>> perf_try_init_event+0x47/0x130 >>>>>>>> perf_event_alloc+0x446/0xeb0 >>>>>>>> perf_event_create_kernel_counter+0x38/0x190 >>>>>>>> pmc_reprogram_counter.constprop.17+0xd9/0x230 [kvm] >>>>>>>> kvm_pmu_handle_event+0x1a6/0x310 [kvm] >>>>>>>> vcpu_enter_guest+0x1388/0x19b0 [kvm] >>>>>>>> vcpu_run+0x117/0x6c0 [kvm] >>>>>>>> kvm_arch_vcpu_ioctl_run+0x13d/0x4d0 [kvm] >>>>>>>> kvm_vcpu_ioctl+0x301/0x6e0 [kvm] >>>>>>>> >>>>>>> >>>>>>> Oh right, the event from the KVM guest is also exclude_host. >>>>>>> So we should only error out with the non-KVM exclude_host PEBS event. >>>>>>> >>>>>>>> Alternatively, this path is taken when using PEBS-via-PT to profile >>>>>>>> guests on host. >>>>>>> >>>>>>> There is a is_pebs_pt(event), so we can skip the PEBS-via-PT. >>>>>>> >>>>>>> Seems we just need to distinguish a KVM event and a normal host >>>>>>> event. >>>>>>> I don't have a better way to do it except using >>>>>>> event->overflow_handler_context, which is NULL for a normal host >>>>>>> event. >>>>>> >>>>>> The assumption that event->overflow_handler_context == NULL is unsafe, >>>>>> considering .overflow_handler_context hook has its acclaimed >>>>>> generality. >>>>>> >>>>> >>>>> Yes, I agree it's very hacky. >>>>> >>>>> What we need here is a way to distinguish the KVM guest request from >>>>> the >>>>> others. How about the PF_VCPU flag? >>>>> >>>>> if (event->attr.exclude_host && >>>>> !is_pebs_pt(event) && >>>>> (!(event->attach_state & PERF_ATTACH_TASK) || >>>>> !(current->flags & >>>>> PF_VCPU)) >>>>> return -EINVAL; >>>> >>>> Unfortunately, the tests are not passed w/ the above diff. >>> >>> What's the exact failed perf command? >> >> [0] one vcpu thread running //usr/bin/yes >> [1] tools/perf/perf kvm --host --guest \ >> --guestkallsyms="..." \ --guestvmlinux="..." \ >> record --raw-samples --kcore \ >> -e "cpu-cycles:GP" \ >> -p `pidof cloud-hypervisor` >> >>> >>> Is it because that the KVM guest is mistakenly killed or the host >>> command is not detected? >> >> Theoretically it should error_out, but instead of just printing >> unsupported info, >> this command even collects samples pointing to vmx_vmexit, which is also >> not >> expected, there should be no samples taken for "cpu-cycles:GP" event. > > If the event creation is rejected by the kernel, the perf tool will > automatically remove the P and give another try. > What's the perf tool output with -vvv? > Do you see "decreasing precise_ip by one"? > > If the precise_ip is decreased to 0, the event is a non-pebs which can > be created successfully. So you will still see the samples. I'm guessing this happened and the end-users need '-vvv' to notice the details. > >> >> The point at which PF_VCPU is set/cleared is not synchronised or even >> controllable with the time of perf_event being added and enabled. >> >> Going the way of PF_VCPU will bring more complexity. > > Seems we need a new flag to distinguish the KVM created event. I'd love to see you make this happen, and those perf_events callers from ebpf would appreciate the flag. > >> >>> >>>> But it's good to know that there is PF_VCPU and more things to play >>>> around with. 
>>>>
>>>> The AMD IBS also takes the precise path, but it puts the recorded values
>>>> on group of MSRs instead of the linear address-based memory.
>>>
>>> The check is in the intel_pmu_hw_config(). So AMD isn't impacted.
>>>
>>>>
>>>> The root cause of this issue is the hardware limitation, just like what
>>>> we did
>>>> in the commit 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry"),
>>>> a similar fix should also belong in the context of directly configuring
>>>> VMX-switch related hardware configuration values.
>>>>
>>>
>>> The above fix is different than this case. IIRC, It's caused by a ucode
>>> issue. So once there is a guest, SW has to explicitly disable the PEBS.
>>> Otherwise, the guest crashes even the host doesn't intend to profile the
>>> guest.
>>>
>>> While the case in the patch is apparently a violation of the current
>>> rules. I think it's better to detect it and error out early.
>>>
>>> Thanks,
>>> Kan
>>
>> "The host does not intend to profile the guest" implies that at the time
>> 26a4f3c08de4 was introduced, we didn't have the capability to use PEBS to
>> profiling guest, and the error_out message was not prompted to the end-user
>> when vPEBS is not supported, and the fix to prevent guest crashes appears
>> in intel_guest_get_msrs();
>>
>> Now vPEBS is supported and not enabled, but at this time, we're moving
>> the same logic that fixes guest crashes from intel_guest_get_msrs() to
>> generic intel_pmu_hw_config(). This might not be a good move.
>> For me, it fixes the same issue in the same way (I should have fixed
>> it in c59a1f106f5c).
>
> The problem is that the current KVM is addicted to silently manipulating
> the MSRs. It doesn't care whether the end user is properly informed about
> the changes. I don't think that's a proper way.

I have to agree; KVM also has quirks that users are not aware of.

>
> Let's say a user runs a system-wide PEBS event which tries to collect both
> host and guest; they should only get samples from the host, right? How do
> they know the reason why there are 0 samples from the guest?

The same thing happens in contexts where the PMU is not functional, such as
enclaves and SMM. perf-report even has an option that changes the denominator
of all collected samples (it assumes that is the whole story for the executed
payload), which makes the data appear distorted and the conclusions absurd.

>
>> If we do more implementation to make PEBS on the host profiling guest,
>> with shared memory and read/write ordering, we don't need to change
>> intel_pmu_hw_config() any more, any of the required change is not out
>> of the scope of intel_guest_get_msrs().
>
> intel_pmu_hw_config() is where we check whether a feature is supported.
> If we later support the host profiling the guest, we should remove the
> check from intel_pmu_hw_config().
>
>>
>> To move forward, this fix won't prevent you from exploring more further
>> fixes in the intel_pmu_hw_config() or elsewhere, could we agree on that ?
>
> I think we need a real fix (from error handling to MSR manipulation),
> not just a workaround.
>
> I know the current error handling of perf is not perfect. Some KVM
> failures may not be able to be explicitly passed to the end user, e.g.,
> PEBS cross-mapping if I recall correctly. But that doesn't mean it's not
> important. We should try our best to notify the end user whenever possible.

It sounds like we need to add a little more detail to the header of the
samples.
I've also noticed that some users don't use the perf tool but develop
directly on top of the syscall, and it's not that simple to come up with
generic prompt messages, since a bare -EINVAL is not enough for the wider
cases.

>
> I think we can patch both intel_pmu_hw_config() and
> intel_guest_get_msrs() if necessary.

Thank you for your comments. At the very least, KVM needs this minor fix
in intel_guest_get_msrs() to alleviate my guilt. And let's see what we get
on some thread around intel_pmu_hw_config().

>
> Thanks,
> Kan
>>
>>>
>>>> I haven't find a better location than intel_guest_get_msrs().
>>>>
>>>>>
>>>>>
>>>>>> I understand your motivation very well, but this is not the right move
>>>>>> (based on my previous history of being sprayed by Peter). For
>>>>>> perf/core,
>>>>>> in-kernel perf_events should be treated equally, and the semantics of
>>>>>> kvm_pmu should only be accounted when a perf/core API is only used for
>>>>>> guest-only path. In this case for KVM perf_events,
>>>>>> intel_guest_get_msrs()
>>>>>> and x86_pmu_handle_guest_pebs() have this context.
>>>>>>
>>>>>>>
>>>>>>> diff --git a/arch/x86/events/intel/core.c
>>>>>>> b/arch/x86/events/intel/core.c
>>>>>>> index a968708ed1fb..c93a2aaff7c3 100644
>>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>>> @@ -3958,6 +3958,16 @@ static int intel_pmu_hw_config(struct
>>>>>>> perf_event
>>>>>>> *event)
>>>>>>>       if ((event->attr.config & INTEL_ARCH_EVENT_MASK) ==
>>>>>>> INTEL_FIXED_VLBR_EVENT)
>>>>>>>           return -EINVAL;
>>>>>>>
>>>>>>> +    /*
>>>>>>> +     * Profiling guest via PEBS-DS buffer on host is not
>>>>>>> supported.
>>>>>>> +     * The event->overflow_handler_context is to distinguish
>>>>>>> a KVM
>>>>>>> +     * event and a normal host event.
>>>>>>> +     */
>>>>>>> +    if (event->attr.exclude_host &&
>>>>>>> +        !is_pebs_pt(event) &&
>>>>>>> +        !event->overflow_handler_context)
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>>       if (!(event->attr.freq || (event->attr.wakeup_events &&
>>>>>>> !event->attr.watermark))) {
>>>>>>>           event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
>>>>>>>           if (!(event->attr.sample_type &
>>>>>>>
>>>>>>>>
>>>>>>>> The status of the guest can only be queried in the NMI handler and
>>>>>>>> the func
>>>>>>>> intel_guest_get_msrs() in the perf/core context, where it's easier
>>>>>>>> and more
>>>>>>>> centrally to review this part of changes that affects vPMU for
>>>>>>>> corner
>>>>>>>> cases.
>>>>>>>>
>>>>>>>> Maybe adding print info on the perf-tool side would help.
>>>>>>>>
>>>>>>>> For perf-tool users, it will get 0 number of sample for
>>>>>>>> "cpu-cycles:GP"
>>>>>>>> events,
>>>>>>>> just like other uncounted perf-tool events.
>>>>>>>
>>>>>>> perf-tool would never know such details, e.g., whether the platform
>>>>>>> supports PEBS-DS or other PEBS method. It's hard to tell if the 0 is
>>>>>>> because of an unsupported hardware or nothing sampled in the guest.
>>>>>>
>>>>>> It kind of looks like this use case to me:
>>>>>>
>>>>>> perf record -e cycles -e cpu/event=0xf4,umask=0x10/ ./workload    # ICX
>>>>>>
>>>>>> # Total Lost Samples: 0
>>>>>> #
>>>>>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/'
>>>>>> # Event count (approx.): 0
>>>>>>
>>>>>> A end-user has to check if the event-umask combination is supported
>>>>>> or not,
>>>>>> or nothing sampled for the workload. Is there any room for
>>>>>> improvement in
>>>>>> perf-tool to reduce the pain of this part ? If any, the same thing
>>>>>> could
>>>>>> be applied
>>>>>> to cpu-cycles:GP, isn't it ?
>>>>> >>>>> I don't think we can expect the end user knows such details. Most of >>>>> them may even don't know what's PEBS-via-DS. >>>> >>>> Maybe the following generic reminder helps: >>>> >>>> # Total Lost Samples: 0 >>>> # Note: the event is not counted or unsupported. >>>> # >>>> # Samples: 0 of event 'cpu/event=0xf4,umask=0x10/' >>>> # Event count (approx.): 0 >>>> >>>>> >>>>> Thanks, >>>>> Kan >>>>> >>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> Kan >>>>>>>> >>>>>>>>> if (!(event->attr.freq || >>>>>>>>> (event->attr.wakeup_events && >>>>>>>>> !event->attr.watermark))) { >>>>>>>>> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; >>>>>>>>> if (!(event->attr.sample_type & >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Kan >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>>>>>>>> Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR >>>>>>>>>> emulation for extended PEBS") >>>>>>>>>> Signed-off-by: Like Xu <likexu@tencent.com> >>>>>>>>>> --- >>>>>>>>>> arch/x86/events/intel/core.c | 8 +++++++- >>>>>>>>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>>>>>>>> >>>>>>>>>> diff --git a/arch/x86/events/intel/core.c >>>>>>>>>> b/arch/x86/events/intel/core.c >>>>>>>>>> index a08f794a0e79..17afd504c35b 100644 >>>>>>>>>> --- a/arch/x86/events/intel/core.c >>>>>>>>>> +++ b/arch/x86/events/intel/core.c >>>>>>>>>> @@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr >>>>>>>>>> *intel_guest_get_msrs(int *nr, void *data) >>>>>>>>>> .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask, >>>>>>>>>> }; >>>>>>>>>> + /* In any case, clear guest PEBS bits first. */ >>>>>>>>>> + arr[global_ctrl].guest &= ~arr[pebs_enable].guest; >>>>>>>>>> + >>>>>>>>>> if (arr[pebs_enable].host) { >>>>>>>>>> /* Disable guest PEBS if host PEBS is enabled. */ >>>>>>>>>> arr[pebs_enable].guest = 0; >>>>>>>>>> } else { >>>>>>>>>> /* Disable guest PEBS thoroughly for cross-mapped >>>>>>>>>> PEBS >>>>>>>>>> counters. */ >>>>>>>>>> arr[pebs_enable].guest &= >>>>>>>>>> ~kvm_pmu->host_cross_mapped_mask; >>>>>>>>>> - arr[global_ctrl].guest &= >>>>>>>>>> ~kvm_pmu->host_cross_mapped_mask; >>>>>>>>>> + >>>>>>>>>> + /* Prevent any host user from enabling PEBS for profiling >>>>>>>>>> guest. */ >>>>>>>>>> + arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & >>>>>>>>>> kvm_pmu->global_ctrl); >>>>>>>>>> + >>>>>>>>>> /* Set hw GLOBAL_CTRL bits for PEBS counter when >>>>>>>>>> it runs >>>>>>>>>> for guest */ >>>>>>>>>> arr[global_ctrl].guest |= arr[pebs_enable].guest; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> base-commit: 6803fb00772cc50cd59a66bd8caaee5c84b13fcf >>>>
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a08f794a0e79..17afd504c35b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4103,13 +4103,19 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
 		.guest = pebs_mask & ~cpuc->intel_ctrl_host_mask,
 	};
 
+	/* In any case, clear guest PEBS bits first. */
+	arr[global_ctrl].guest &= ~arr[pebs_enable].guest;
+
 	if (arr[pebs_enable].host) {
 		/* Disable guest PEBS if host PEBS is enabled. */
 		arr[pebs_enable].guest = 0;
 	} else {
 		/* Disable guest PEBS thoroughly for cross-mapped PEBS counters. */
 		arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask;
-		arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask;
+
+		/* Prevent any host user from enabling PEBS for profiling guest. */
+		arr[pebs_enable].guest &= (kvm_pmu->pebs_enable & kvm_pmu->global_ctrl);
+
 		/* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
 		arr[global_ctrl].guest |= arr[pebs_enable].guest;
 	}
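To make the effect of the hunk above concrete, here is a standalone sketch of the bit arithmetic with made-up counter masks (bit 0 stands for a counter requested only by a host-side exclude_host PEBS event, bit 1 for a counter the guest itself enabled). It mirrors the masking only and is not kernel code:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical state for one vCPU on one host CPU. */
	uint64_t pebs_guest      = 0x3; /* pebs_mask & ~cpuc->intel_ctrl_host_mask */
	uint64_t global_guest    = 0x3; /* guest view of GLOBAL_CTRL before filtering */
	uint64_t host_pebs       = 0x0; /* no host PEBS user in this scenario */
	uint64_t cross_mapped    = 0x0; /* no cross-mapped counters here */
	uint64_t kvm_pebs_enable = 0x2; /* guest's IA32_PEBS_ENABLE */
	uint64_t kvm_global_ctrl = 0x2; /* guest's IA32_PERF_GLOBAL_CTRL */

	/* In any case, clear guest PEBS bits first. */
	global_guest &= ~pebs_guest;

	if (host_pebs) {
		pebs_guest = 0;               /* host PEBS active: no guest PEBS at all */
	} else {
		pebs_guest &= ~cross_mapped;  /* drop cross-mapped counters */
		pebs_guest &= kvm_pebs_enable & kvm_global_ctrl; /* keep only guest-enabled counters */
		global_guest |= pebs_guest;   /* re-set GLOBAL_CTRL bits for those counters */
	}

	/* Prints 0x2 / 0x2: the host user's bit 0 has been filtered out. */
	printf("pebs_enable.guest=%#llx global_ctrl.guest=%#llx\n",
	       (unsigned long long)pebs_guest, (unsigned long long)global_guest);
	return 0;
}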