Message ID | 20231120221932.213710-1-namhyung@kernel.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:2b07:b0:403:3b70:6f57 with SMTP id io7csp252046vqb; Mon, 20 Nov 2023 14:20:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IEAVRubJCBBBQPs1M4iZtTN/rqc8/UjhQ/YcVNryPJVHvsFd/YSr2KQNzhDNlnC/T64DTrP X-Received: by 2002:a05:6a20:1604:b0:185:a90d:363d with SMTP id l4-20020a056a20160400b00185a90d363dmr7579918pzj.2.1700518804027; Mon, 20 Nov 2023 14:20:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700518804; cv=none; d=google.com; s=arc-20160816; b=u/5c3tgdlbnwimvVwFh2LnvrQTvBnALD+OJtg8rh86x7KJP+CbVCcJm5x+5M+ihEII RACjUWO3dy6JXXIzwyojxbIkzJ6OYSrM3ZLGgQOcub6dRVNqr11pYh4t8i6Q0/E9fIFn hxdquAQPWUc6mE4NE53VzU9fkNLjjjLu0MA+5iN5pjLbLKJEEPd6jQ8r1Q4nsekQTNuI pl16XIhxGMBkLZ4j/WMRKfu0Iqc+jOpHbeNaELt6+72UJXDdd1+PAtm+/eR2jBJEDO3M 9cJZS+vM+oq8yF/ARIJOrxs4OIL1gnNUNytPYQLlkySzuVpCoeoEhtcG9duj/iuQhyWG 1KqA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:sender:dkim-signature; bh=6XC8XlavK49xNjfUz810sKQmeHvQWvE5qIjS3kpF1wE=; fh=b3TeBIs+CttSBZm6Hnbb3kJFExQaS46Vc9Vx9fOC4sY=; b=WnH7uy1smgezywFr4prn23jQwxBrTIgIAKHgdDtm6u4Kt7ZeBr5cJ/ibrZ2tRqY4Rn KufTzpQlXCeVxIlgXQxxze4WLFLbaPGixdEnjktGVU5LI1+bjLLV6lISiqyqHr/dZCb+ NxM6mZYVn6wILEwQ7bLd0AQBN+5IBF8xvbmevnKygrpCd28hHrKWwOJyaxT4kfJuLrT8 2bu7cFAeozg7i/rAp44K00CfBJkd9M1dSmM9Yy7K8f7kYqR7HVd8KEKPeSGUx8YMA3em JUBfvw7o68dGd1hHe4pxWYZdJEU/E8a7zJp9cmc+oT2iWjq7mivNwKuFcov9y3lWb4lb 9SWQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=Mm5lJ0bv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from snail.vger.email (snail.vger.email. 
[2620:137:e000::3:7]) by mx.google.com with ESMTPS id bm22-20020a656e96000000b005bddb7ff530si9477833pgb.124.2023.11.20.14.20.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 20 Nov 2023 14:20:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=Mm5lJ0bv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 006C2802F30A; Mon, 20 Nov 2023 14:20:03 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232620AbjKTWTm (ORCPT <rfc822;heyuhang3455@gmail.com> + 27 others); Mon, 20 Nov 2023 17:19:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41692 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232689AbjKTWTi (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 20 Nov 2023 17:19:38 -0500 Received: from mail-pl1-x633.google.com (mail-pl1-x633.google.com [IPv6:2607:f8b0:4864:20::633]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C374ACA for <linux-kernel@vger.kernel.org>; Mon, 20 Nov 2023 14:19:34 -0800 (PST) Received: by mail-pl1-x633.google.com with SMTP id d9443c01a7336-1cc4f777ab9so38675465ad.0 for <linux-kernel@vger.kernel.org>; Mon, 20 Nov 2023 14:19:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1700518774; x=1701123574; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:sender:from:to:cc:subject:date:message-id:reply-to; bh=6XC8XlavK49xNjfUz810sKQmeHvQWvE5qIjS3kpF1wE=; b=Mm5lJ0bvDxChRHIn951jLh/fx2lDrdn/3fJ1BotetQ0vSkNo2i/D8qsxVEWl0LfwED uEwoNnxwaRZFjmb3y+PrbGY3pXDobRwNUNbJNDgSQ+ZiXS/sQUQOeY2xvQEcFENDJhPy kbVi8/VJ8dz8vBQMTXqMIFpIobxtMOAruHJiKtJWWo2oJduKIpuI0iEtSg2J1IpfUz0D 0eUhnLa4Yhn33355fjyoJQV4XObXSH5gfd730YJzwA2kucmCSnpUl//WOsPZGcIfDXYh NGubz5RxoDCNSTTvg7mDaelhDnnoqaaMdNZt7GA2J1rJ6t9cN4zfWFMYneYRiaZtl/CF eGOQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1700518774; x=1701123574; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:sender:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=6XC8XlavK49xNjfUz810sKQmeHvQWvE5qIjS3kpF1wE=; b=r3B9OTVJlUwBWTWrpfT5bLn8fAErAy32nE+uE/llHOElz1ygiyuEcrSPUucvdQDxdB U6BvRUlvr/jwNIRi2YaAxKXphsgOAYnGVcA6f2jMRTVwi0Z1cJ5LjCmQMDn+UGZcqDpT WhqH6bbz2+cRRAfqRGDwNaFdTw/XjMKoQiR7VVrB0CuF8s2ltcvATvT8ijHRLXq3GAe5 Xw01uA0qACy/jbik/jzw2Kuw31GefJYnYZ8W9RXmmukPZyLyMGgOsb0vccJvvI1oARr0 zKS5Llg4gAB5qZp3iX9U1X6Zymfduj3/T+1vz1/HOOAG486qzeAZTJLco1ae9JxcH9SZ Qykg== X-Gm-Message-State: AOJu0Ywfsil389fUsKTAAHWWwP+qFd0mI/euV0fqo7CBe/e8BbgW++Dr RmTQ6KeEZuwqCXve63PCay0= X-Received: by 2002:a17:902:b486:b0:1ca:e78b:35dc with SMTP id y6-20020a170902b48600b001cae78b35dcmr6445266plr.27.1700518774127; Mon, 20 Nov 2023 14:19:34 -0800 (PST) Received: from bangji.hsd1.ca.comcast.net ([2601:647:6780:42e0:75c6:4212:ae99:93b6]) by smtp.gmail.com with ESMTPSA id 
d4-20020a170902c18400b001c9c47d6cb9sm3528246pld.99.2023.11.20.14.19.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 20 Nov 2023 14:19:33 -0800 (PST) Sender: Namhyung Kim <namhyung@gmail.com> From: Namhyung Kim <namhyung@kernel.org> To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, LKML <linux-kernel@vger.kernel.org>, Ian Rogers <irogers@google.com>, Kan Liang <kan.liang@linux.intel.com>, Mingwei Zhang <mizhang@google.com> Subject: [PATCH 1/3] perf/core: Update perf_adjust_freq_unthr_context() Date: Mon, 20 Nov 2023 14:19:30 -0800 Message-ID: <20231120221932.213710-1-namhyung@kernel.org> X-Mailer: git-send-email 2.43.0.rc1.413.gea7ed67945-goog MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-1.5 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_EF,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Mon, 20 Nov 2023 14:20:03 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783123205647483092 X-GMAIL-MSGID: 1783123205647483092 |
Series | [1/3] perf/core: Update perf_adjust_freq_unthr_context() |
Commit Message
Namhyung Kim
Nov. 20, 2023, 10:19 p.m. UTC
It was unnecessarily disabling and enabling PMUs for each event. It
should be done at the PMU level instead. Add a pmu_ctx->nr_freq counter
to check this at each PMU. As the PMU context has separate active lists
for the pinned and flexible groups, factor out a new function to do the
job.

Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
even if it needs to unthrottle sampling events.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 68 +++++++++++++++++++++++---------------
2 files changed, 43 insertions(+), 26 deletions(-)
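In short, the perf_pmu_disable()/perf_pmu_enable() pair moves from the per-event loop to once per PMU, and the new pmu_ctx->nr_freq counter plus the PERF_PMU_CAP_NO_INTERRUPT check allow whole PMUs to be skipped. A condensed sketch of the resulting tick-time path, taken from the diff at the bottom of this page (locking details kept, the per-event body omitted):

/* Condensed from the diff below: one disable/enable per PMU, not per event. */
static void
perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
{
	struct perf_event_pmu_context *pmu_ctx;

	/* nothing to do if there are no freq events and nothing is throttled */
	if (!(ctx->nr_freq || unthrottle))
		return;

	raw_spin_lock(&ctx->lock);
	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
		/* skip PMUs with nothing to adjust on this tick ... */
		if (!(pmu_ctx->nr_freq || unthrottle))
			continue;
		/* ... and PMUs that cannot interrupt, hence never throttle */
		if (pmu_ctx->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT)
			continue;

		perf_pmu_disable(pmu_ctx->pmu);		/* previously done per event */
		perf_adjust_freq_unthr_events(&pmu_ctx->pinned_active);
		perf_adjust_freq_unthr_events(&pmu_ctx->flexible_active);
		perf_pmu_enable(pmu_ctx->pmu);
	}
	raw_spin_unlock(&ctx->lock);
}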
Comments
On Mon, Nov 20, 2023 at 2:19 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> It was unnecessarily disabling and enabling PMUs for each event. It
> should be done at the PMU level instead. Add a pmu_ctx->nr_freq counter
> to check this at each PMU. As the PMU context has separate active lists
> for the pinned and flexible groups, factor out a new function to do the
> job.
>
> Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
> even if it needs to unthrottle sampling events.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Series:
Reviewed-by: Ian Rogers <irogers@google.com>

Thanks,
Ian
On Mon, Nov 20, 2023, Ian Rogers wrote:
> On Mon, Nov 20, 2023 at 2:19 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > It was unnecessarily disabling and enabling PMUs for each event. It
> > should be done at the PMU level instead. Add a pmu_ctx->nr_freq counter
> > to check this at each PMU. As the PMU context has separate active lists
> > for the pinned and flexible groups, factor out a new function to do the
> > job.
> >
> > Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
> > even if it needs to unthrottle sampling events.
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
>
> Series:
> Reviewed-by: Ian Rogers <irogers@google.com>
>
> Thanks,
> Ian

Can we have "Cc: stable@vger.kernel.org" for the whole series? This
series should be a great performance improvement for all VMs in which
perf sampling events are used without specifying a period. The key point
is that disabling/enabling the PMU in a virtualized environment is super
heavyweight and can reach up to 50% of the CPU time, i.e., when
multiplexing is used in the VM, a vCPU on a pCPU can only use 50% of the
resource; the other half is entirely wasted in host PMU code doing the
enabling/disabling of the PMU.

Thanks.
-Mingwei
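For context, "sampling events without specifying a period" are frequency-mode events: attr.freq is set and attr.sample_freq gives a target sample rate, so the kernel re-computes the period on every tick in perf_adjust_freq_unthr_context(). A minimal, illustrative userspace sketch of such an event; the event type and rate here are arbitrary, not taken from the thread:

/*
 * Illustrative only: a frequency-mode sampling event, i.e. attr.freq = 1
 * with a target rate instead of a fixed sample_period.
 */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_freq_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.freq = 1;			/* frequency mode ... */
	attr.sample_freq = 4000;	/* ... aim for ~4000 samples/sec */
	attr.sample_type = PERF_SAMPLE_IP;
	attr.disabled = 1;

	/* measure the calling task on any CPU; no group, no flags */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}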
On 2023-11-20 5:19 p.m., Namhyung Kim wrote:
> It was unnecessarily disabling and enabling PMUs for each event. It
> should be done at the PMU level instead. Add a pmu_ctx->nr_freq counter
> to check this at each PMU. As the PMU context has separate active lists
> for the pinned and flexible groups, factor out a new function to do the
> job.
>
> Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
> even if it needs to unthrottle sampling events.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Reviewed-by: Kan Liang <kan.liang@linux.intel.com>

Thanks,
Kan
Hi Mingwei,

On Mon, Nov 20, 2023 at 3:24 PM Mingwei Zhang <mizhang@google.com> wrote:
> Can we have "Cc: stable@vger.kernel.org" for the whole series? This
> series should be a great performance improvement for all VMs in which
> perf sampling events are used without specifying a period.

I was not sure if it's ok to have this performance fix in the stable series.

Thanks,
Namhyung
On Tue, Nov 21, 2023, Namhyung Kim wrote:

Hi Namhyung,

> > Can we have "Cc: stable@vger.kernel.org" for the whole series? This
> > series should be a great performance improvement for all VMs in which
> > perf sampling events are used without specifying a period.
>
> I was not sure if it's ok to have this performance fix in the stable series.

A critical performance bug fix is ok to be added to the stable tree; the
requirements are described here:

https://www.kernel.org/doc/Documentation/process/stable-kernel-rules.rst

In particular, this patch satisfies the 2nd sub-bullet of the fourth bullet.

But let me step back. Only this patch needs the stable tag instead of the
whole series. This patch impacts 69 lines of code, so it satisfies the rule
of staying within 100 lines (bullet 3).

I will give it a try and test it today or tomorrow to make sure we satisfy
bullet 2. Once it gets in, bullet 1 will be satisfied as well.

Overall, the intention is to improve PMU performance in VMs as early as we
can, since we don't control the schedule of distro kernel upgrades and we
don't control when end customers upgrade their running kernels. So I presume
even adding it to the stable tree may take years to see the result change.
But if we don't do it, it may take way longer (since it does not contain a
"Fixes" tag as well).

Thanks.
-Mingwei
On Tue, Nov 21, 2023, Mingwei Zhang wrote:
> I will give it a try and test it today or tomorrow to make sure we satisfy
> bullet 2. Once it gets in, bullet 1 will be satisfied as well.

I have tested the code. Yes, profiling results in the VM show that it removes
perf_adjust_freq_unthr_context() as the hot spot. However, when running perf
with enough events in frequency mode to trigger multiplexing, the overall
performance overhead still reaches 60% per CPU (this overhead is invisible to
the vCPU).

At the host level, I have been monitoring the MSR writes and found that the
repeated writes to 0x38f disappeared, indicating that this patch is indeed
working. But on the other hand, I have noticed more frequent overflows and
PMIs. The more frequent overflows show up as writes to MSR 0x390 and reads of
MSR 0x38e. I infer the more frequent PMIs from the much longer execution of
vmx_vmexit() shown in the flamegraph.

Because of the above observations, it seems to me that this patch no longer
satisfies the requirements for Cc'ing "stable@vger.kernel.org". I will double
check and follow up on this one.

Thanks.
-Mingwei
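For reference, the MSR numbers mentioned above are the Intel core-PMU global control/status registers; on Intel, perf_pmu_disable()/perf_pmu_enable() end up writing MSR_CORE_PERF_GLOBAL_CTRL, which is consistent with the repeated 0x38f writes disappearing once the disable/enable pair moves out of the per-event loop. The definitions, as in arch/x86/include/asm/msr-index.h:

/* arch/x86/include/asm/msr-index.h (excerpt) */
#define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e	/* read to see which counters overflowed */
#define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f	/* global enable bits, touched on PMU disable/enable */
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL	0x00000390	/* written to clear overflow status bits */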
On Mon, Nov 20, 2023, Namhyung Kim wrote:
> It was unnecessarily disabling and enabling PMUs for each event. It
> should be done at the PMU level instead. Add a pmu_ctx->nr_freq counter
> to check this at each PMU. As the PMU context has separate active lists
> for the pinned and flexible groups, factor out a new function to do the
> job.
>
> Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
> even if it needs to unthrottle sampling events.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Tested-by: Mingwei Zhang <mizhang@google.com>
Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0367d748fae0..3eb17dc89f5e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -879,6 +879,7 @@ struct perf_event_pmu_context {
 
 	unsigned int			nr_events;
 	unsigned int			nr_cgroups;
+	unsigned int			nr_freq;
 
 	atomic_t			refcount; /* event <-> epc */
 	struct rcu_head			rcu_head;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3eb26c2c6e65..53e2ad73102d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2275,8 +2275,10 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 
 	if (!is_software_event(event))
 		cpc->active_oncpu--;
-	if (event->attr.freq && event->attr.sample_freq)
+	if (event->attr.freq && event->attr.sample_freq) {
 		ctx->nr_freq--;
+		epc->nr_freq--;
+	}
 	if (event->attr.exclusive || !cpc->active_oncpu)
 		cpc->exclusive = 0;
 
@@ -2531,9 +2533,10 @@ event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
 
 	if (!is_software_event(event))
 		cpc->active_oncpu++;
-	if (event->attr.freq && event->attr.sample_freq)
+	if (event->attr.freq && event->attr.sample_freq) {
 		ctx->nr_freq++;
-
+		epc->nr_freq++;
+	}
 	if (event->attr.exclusive)
 		cpc->exclusive = 1;
 
@@ -4096,30 +4099,14 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	}
 }
 
-/*
- * combine freq adjustment with unthrottling to avoid two passes over the
- * events. At the same time, make sure, having freq events does not change
- * the rate of unthrottling as that would introduce bias.
- */
-static void
-perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
+static void perf_adjust_freq_unthr_events(struct list_head *event_list)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
 	u64 now, period = TICK_NSEC;
 	s64 delta;
 
-	/*
-	 * only need to iterate over all events iff:
-	 * - context have events in frequency mode (needs freq adjust)
-	 * - there are events to unthrottle on this cpu
-	 */
-	if (!(ctx->nr_freq || unthrottle))
-		return;
-
-	raw_spin_lock(&ctx->lock);
-
-	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+	list_for_each_entry(event, event_list, active_list) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
 
@@ -4127,8 +4114,6 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 		if (!event_filter_match(event))
 			continue;
 
-		perf_pmu_disable(event->pmu);
-
 		hwc = &event->hw;
 
 		if (hwc->interrupts == MAX_INTERRUPTS) {
@@ -4138,7 +4123,7 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 		}
 
 		if (!event->attr.freq || !event->attr.sample_freq)
-			goto next;
+			continue;
 
 		/*
 		 * stop the event and update event->count
@@ -4160,8 +4145,39 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 			perf_adjust_period(event, period, delta, false);
 
 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
-	next:
-		perf_pmu_enable(event->pmu);
+	}
+}
+
+/*
+ * combine freq adjustment with unthrottling to avoid two passes over the
+ * events. At the same time, make sure, having freq events does not change
+ * the rate of unthrottling as that would introduce bias.
+ */
+static void
+perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	/*
+	 * only need to iterate over all events iff:
+	 * - context have events in frequency mode (needs freq adjust)
+	 * - there are events to unthrottle on this cpu
+	 */
+	if (!(ctx->nr_freq || unthrottle))
+		return;
+
+	raw_spin_lock(&ctx->lock);
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		if (!(pmu_ctx->nr_freq || unthrottle))
+			continue;
+		if (pmu_ctx->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT)
+			continue;
+
+		perf_pmu_disable(pmu_ctx->pmu);
+		perf_adjust_freq_unthr_events(&pmu_ctx->pinned_active);
+		perf_adjust_freq_unthr_events(&pmu_ctx->flexible_active);
+		perf_pmu_enable(pmu_ctx->pmu);
 	}
 
 	raw_spin_unlock(&ctx->lock);