Message ID: 20230123182728.825519-2-kan.liang@linux.intel.com
State: New
From: kan.liang@linux.intel.com
To: peterz@infradead.org, mingo@redhat.com, tglx@linutronix.de, jstultz@google.com, sboyd@kernel.org, linux-kernel@vger.kernel.org
Cc: eranian@google.com, namhyung@kernel.org, ak@linux.intel.com, Kan Liang <kan.liang@linux.intel.com>
Subject: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic
Date: Mon, 23 Jan 2023 10:27:26 -0800
Message-Id: <20230123182728.825519-2-kan.liang@linux.intel.com>
In-Reply-To: <20230123182728.825519-1-kan.liang@linux.intel.com>
References: <20230123182728.825519-1-kan.liang@linux.intel.com>
Series: Convert TSC to monotonic clock for PEBS
Commit Message
Liang, Kan
Jan. 23, 2023, 6:27 p.m. UTC
From: Kan Liang <kan.liang@linux.intel.com>

It's useful to provide an NMI safe function to convert a given time to monotonic. For example, the perf_event subsystem wants to convert the TSC of a PEBS record to a monotonic clock in an NMI handler.

Two existing functions were considered, but neither fulfills the above requirements.
- ktime_get_mono_fast_ns() is NMI safe, but it can only return the current monotonic time, not the monotonic time for a given timestamp.
- get_device_system_crosststamp() can calculate the system time from a given device time, but it is neither fast nor NMI safe.

The new function is a combination of the two. It uses the tk_fast_mono timekeeper to be fast and NMI safe, and the get_time_fn callback to retrieve the given timestamp and its clocksource. History is not supported, since the perf case doesn't really need it. It can be added later once there is a use case.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/timekeeping.h |  9 +++++
 kernel/time/timekeeping.c   | 68 +++++++++++++++++++++++++++++++++++--
 2 files changed, 74 insertions(+), 3 deletions(-)
Comments
On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
> +					  (struct system_counterval_t *sys_counterval,
> +					   void *ctx),
> +					  void *ctx,
> +					  u64 *mono_ns)
> +{
> +	struct system_counterval_t system_counterval;
> +	struct tk_fast *tkf = &tk_fast_mono;
> +	u64 cycles, now, interval_start;
> +	struct tk_read_base *tkr;
> +	unsigned int seq;
> +	int ret;
> +
> +	do {
> +		seq = raw_read_seqcount_latch(&tkf->seq);
> +		tkr = tkf->base + (seq & 0x01);
> +
> +		ret = get_time_fn(&system_counterval, ctx);
> +		if (ret)
> +			return ret;
> +
> +		/*
> +		 * Verify that the clocksource associated with the given
> +		 * timestamp is the same as the currently installed
> +		 * timekeeper clocksource
> +		 */
> +		if (tkr->clock != system_counterval.cs)
> +			return -EOPNOTSUPP;
> +		cycles = system_counterval.cycles;
> +
> +		/*
> +		 * Check whether the given timestamp is on the current
> +		 * timekeeping interval.
> +		 */
> +		now = tk_clock_read(tkr);
> +		interval_start = tkr->cycle_last;
> +		if (!cycle_between(interval_start, cycles, now))
> +			return -EOPNOTSUPP;

So. I've not fully thought this out, but it seems like it would be quite likely that you'd run into the case where the cycle_last value is updated and your earlier TSC timestamp isn't valid for the current interval. The get_device_system_crosststamp() logic has a big chunk of complex code to try to handle this case by interpolating the cycle value back in time. How well does just failing in this case work out?

thanks
-john
On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> It's useful to provide a NMI safe function to convert a given time to
> monotonic. For example, the perf_event subsystem wants to convert a TSC
> of a PEBS record to a monotonic clock in a NMI handler.

Why? That's a postprocessing problem, really.

Thanks,

tglx
On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > It's useful to provide a NMI safe function to convert a given time to
> > monotonic. For example, the perf_event subsystem wants to convert a TSC
> > of a PEBS record to a monotonic clock in a NMI handler.
>
> Why? That's a postprocessing problem, really.
>

Because you want to correlate samples captured by PEBS with samples from applications timestamped with a user-available clock such as CLOCK_MONOTONIC, for instance. When I create a perf_event event and I stipulate that event_attr.clockid=MONOTONIC, I expect all the samples from that event to be timestamped using the same clock source, regardless of PEBS or IBS.

> Thanks,
>
> tglx
On 2023-01-24 2:01 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
>> +					  (struct system_counterval_t *sys_counterval,
>> +					   void *ctx),
>> +					  void *ctx,
>> +					  u64 *mono_ns)
>> +{
>> +	struct system_counterval_t system_counterval;
>> +	struct tk_fast *tkf = &tk_fast_mono;
>> +	u64 cycles, now, interval_start;
>> +	struct tk_read_base *tkr;
>> +	unsigned int seq;
>> +	int ret;
>> +
>> +	do {
>> +		seq = raw_read_seqcount_latch(&tkf->seq);
>> +		tkr = tkf->base + (seq & 0x01);
>> +
>> +		ret = get_time_fn(&system_counterval, ctx);
>> +		if (ret)
>> +			return ret;
>> +
>> +		/*
>> +		 * Verify that the clocksource associated with the given
>> +		 * timestamp is the same as the currently installed
>> +		 * timekeeper clocksource
>> +		 */
>> +		if (tkr->clock != system_counterval.cs)
>> +			return -EOPNOTSUPP;
>> +		cycles = system_counterval.cycles;
>> +
>> +		/*
>> +		 * Check whether the given timestamp is on the current
>> +		 * timekeeping interval.
>> +		 */
>> +		now = tk_clock_read(tkr);
>> +		interval_start = tkr->cycle_last;
>> +		if (!cycle_between(interval_start, cycles, now))
>> +			return -EOPNOTSUPP;
>
> So. I've not fully thought this out, but it seems like it would be
> quite likely that you'd run into the case where the cycle_last value
> is updated and your earlier TSC timestamp isn't valid for the current
> interval. The get_device_system_crosststamp() logic has a big chunk of
> complex code to try to handle this case by interpolating the cycle
> value back in time. How well does just failing in this case work out?
>

For that case, perf falls back to the time captured in the NMI handler, via ktime_get_mono_fast_ns().

The TSC in PEBS is captured by HW when the sample is generated. There should be a small delta compared with the time captured in the NMI handler, but I think that delta is acceptable as a backup solution for most analysis cases. Also, I don't think the case (the cycle_last value being updated during the monitoring) should occur very often either, so I dropped the history support to simplify the function.

Thanks,
Kan
On 2023-01-24 4:10 a.m., Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> It's useful to provide a NMI safe function to convert a given time to
>>> monotonic. For example, the perf_event subsystem wants to convert a TSC
>>> of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.

Besides the reason Stephane mentioned, the current perf tool assumes that the kernel always returns time in the clock it asked for. For example, without the patch, the kernel uses ktime_get_mono_fast_ns() to get the MONOTONIC time and dumps it to user space.

If we want to break that assumption, I think we have to modify the current interfaces and dump extra data to indicate the clock source. That makes the existing interface more complex, preparing and dumping the extra data brings overhead, and we would also force users to update their tools.

Thanks,
Kan

> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.
> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from
> that event to be timestamped using the same clock source, regardless
> of PEBS or IBS.
>
>> Thanks,
>>
>> tglx
On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-01-24 2:01 a.m., John Stultz wrote:
> > On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
> >> +	/*
> >> +	 * Check whether the given timestamp is on the current
> >> +	 * timekeeping interval.
> >> +	 */
> >> +	now = tk_clock_read(tkr);
> >> +	interval_start = tkr->cycle_last;
> >> +	if (!cycle_between(interval_start, cycles, now))
> >> +		return -EOPNOTSUPP;
> >
> > So. I've not fully thought this out, but it seems like it would be
> > quite likely that you'd run into the case where the cycle_last value
> > is updated and your earlier TSC timestamp isn't valid for the current
> > interval. The get_device_system_crosststamp() logic has a big chunk of
> > complex code to try to handle this case by interpolating the cycle
> > value back in time. How well does just failing in this case work out?
>
> For the case, perf falls back to the time captured in the NMI handler, via
> ktime_get_mono_fast_ns().

This feels like *very* subtle behavior. Maybe I'm misunderstanding, but the goal seems to be to have more accurate timestamps on the hw events, and using the captured TSC timestamp avoids the latency of reading the time again. But if every timekeeping update interval (~tick) you transparently get a delayed value due to the fallback, it makes it hard to understand which timestamps are better or worse. The latency between two reads may be real or it may be just bad luck. This doesn't intuitively seem like a great benefit over the more consistent latency of just using the ktime_get_mono_fast() timestamping.

> The TSC in PEBS is captured by HW when the sample was generated. There
> should be a small delta compared with the time captured in the NMI
> handler. But I think the delta should be acceptable as a backup solution
> for most analysis cases. Also, I don't think the case (the
> cycle_last value is updated during the monitoring) should occur very
> often either. So I dropped the history support to simplify the function.

So the reads and this function are *always* used in NMI context? Has this been stressed with things like SMIs to see how it does if interrupted in those cases?

My worry is that (as I bored everyone with earlier), the ktime_get_*_fast_ns() interfaces already have some sharp edges and need a fair amount of thought as to when they should be used. This is sort of compounding that: adding an interface that has further special cases where it can fail makes it difficult to fully understand and easier to accidentally misuse.

My other concern is that interfaces always get stretched and used beyond anything they were initially planned for (see the ktime_get_*fast_ns() interfaces here as an example! :), and in this case the logic seems to have lots of implicit dependencies on the facts of your specific use case, so it seems a bit fragile should folks on other architectures with other constraints try to use it.

So I just want to push a bit to think about how you might be able to extend/generalize the existing get_system_crosststamp() for your purposes, or alternatively find a way to simplify the logic's behavior so it's less tied to specific constraints ("this works most of the time from NMI, but otherwise no promises"). Or at least some better documentation around the code, its uses and its constraints? ("NMI safe" is not the same as "only safe to use from NMI" :)

And apologies if I've misunderstood things here.

thanks
-john
On 2023-01-24 1:43 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>>>> +	/*
>>>> +	 * Check whether the given timestamp is on the current
>>>> +	 * timekeeping interval.
>>>> +	 */
>>>> +	now = tk_clock_read(tkr);
>>>> +	interval_start = tkr->cycle_last;
>>>> +	if (!cycle_between(interval_start, cycles, now))
>>>> +		return -EOPNOTSUPP;
>>> So. I've not fully thought this out, but it seems like it would be
>>> quite likely that you'd run into the case where the cycle_last value
>>> is updated and your earlier TSC timestamp isn't valid for the current
>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>> complex code to try to handle this case by interpolating the cycle
>>> value back in time. How well does just failing in this case work out?
>>
>> For the case, perf falls back to the time captured in the NMI handler, via
>> ktime_get_mono_fast_ns().
> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> but the goal seems to be to have more accurate timestamps on the hw
> events, and using the captured tsc timestamp avoids the measuring
> latency reading the time again. But if every timekeeping update
> interval (~tick) you transparently get a delayed value due to the
> fallback, it makes it hard to understand which timestamps are better
> or worse. The latency between two reads may be real or it may be just
> bad luck. This doesn't intuitively seem like a great benefit over more
> consistent latency of just using the ktime_get_mono_fast()
> timestamping.

Your understanding is correct. We want a more accurate timestamp for the analysis work.

To my understanding, the timekeeping update should not happen very often. If I read the code correctly, it should happen only when adjusting NTP or suspending/resuming. If so, I think the drawback should not impact normal analysis work. I will call out the drawbacks in the comments where the function is used.

>> The TSC in PEBS is captured by HW when the sample was generated. There
>> should be a small delta compared with the time captured in the NMI
>> handler. But I think the delta should be acceptable as a backup solution
>> for most analysis cases. Also, I don't think the case (the
>> cycle_last value is updated during the monitoring) should occur very
>> often either. So I dropped the history support to simplify the function.
>
> So the reads and this function are *always* used in NMI context? Has
> this been stressed with things like SMIs to see how it does if
> interrupted in those cases?

Yes, it's *always* and only used in NMI context.

> My worry is that (as I bored everyone with earlier), the
> ktime_get_*_fast_ns() interfaces already have some sharp edges and
> need a fair amount of thought as to when they should be used. This is
> sort of compounding that: adding an interface that has further special
> cases where it can fail makes it difficult to fully understand and
> easier to accidentally misuse.
>
> My other concern is that interfaces always get stretched and used
> beyond anything they were initially planned for (see the
> ktime_get_*fast_ns() interfaces here as an example! :), and in this
> case the logic seems to have lots of implicit dependencies on the
> facts of your specific use case, so it seems a bit fragile should
> folks on other architectures with other constraints try to use it.
>
> So I just want to push a bit to think about how you might be able to
> extend/generalize the existing get_system_crosststamp() for your
> purposes, or alternatively find a way to simplify the logic's behavior
> so it's less tied to specific constraints ("this works most of the time
> from NMI, but otherwise no promises"). Or at least some better
> documentation around the code, its uses and its constraints? ("NMI
> safe" is not the same as "only safe to use from NMI" :)

Since our usage is fixed (only in NMI), I prefer the latter. I think extending/generalizing the existing function would only make it extremely complex and inefficient. The new function should have the same constraints as the existing ktime_get_mono_fast_ns(). Since perf can live with ktime_get_mono_fast_ns(), there should be no problem with the new function's constraints. I will add more comments to clarify the usage and constraints to avoid abuse of the new function.

Thanks,
Kan
On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-01-24 1:43 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-01-24 2:01 a.m., John Stultz wrote:
> >>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
> >>>> +	/*
> >>>> +	 * Check whether the given timestamp is on the current
> >>>> +	 * timekeeping interval.
> >>>> +	 */
> >>>> +	now = tk_clock_read(tkr);
> >>>> +	interval_start = tkr->cycle_last;
> >>>> +	if (!cycle_between(interval_start, cycles, now))
> >>>> +		return -EOPNOTSUPP;
> >>>
> >>> So. I've not fully thought this out, but it seems like it would be
> >>> quite likely that you'd run into the case where the cycle_last value
> >>> is updated and your earlier TSC timestamp isn't valid for the current
> >>> interval. The get_device_system_crosststamp() logic has a big chunk of
> >>> complex code to try to handle this case by interpolating the cycle
> >>> value back in time. How well does just failing in this case work out?
> >>
> >> For the case, perf falls back to the time captured in the NMI handler, via
> >> ktime_get_mono_fast_ns().
> >
> > This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> > but the goal seems to be to have more accurate timestamps on the hw
> > events, and using the captured tsc timestamp avoids the measuring
> > latency reading the time again. But if every timekeeping update
> > interval (~tick) you transparently get a delayed value due to the
> > fallback, it makes it hard to understand which timestamps are better
> > or worse. The latency between two reads may be real or it may be just
> > bad luck. This doesn't intuitively seem like a great benefit over more
> > consistent latency of just using the ktime_get_mono_fast()
> > timestamping.
>
> Your understanding is correct. We want a more accurate timestamp for the
> analysis work.
>
> To my understanding, the timekeeping update should not happen very often. If

"Often" depends on your timescale.

> I read the code correctly, it should happen only when adjusting NTP or
> suspending/resuming. If so, I think the drawback should not impact
> normal analysis work. I will call out the drawbacks in the comments
> where the function is used.

So the adjustments are done at tick time depending on the current NTP "error" (basically what the kernel tracks as the delta from its sense of what NTP has told us).

Not just at the time when NTP makes an adjustment.

So the window for it to happen is every timekeeping update (which is ~HZ).

> >> The TSC in PEBS is captured by HW when the sample was generated. There
> >> should be a small delta compared with the time captured in the NMI
> >> handler. But I think the delta should be acceptable as a backup solution
> >> for most analysis cases. Also, I don't think the case (the
> >> cycle_last value is updated during the monitoring) should occur very
> >> often either. So I dropped the history support to simplify the function.
> >
> > So the reads and this function are *always* used in NMI context? Has
> > this been stressed with things like SMIs to see how it does if
> > interrupted in those cases?
>
> Yes, it's *always* and only used in NMI context.

Thanks, that is helpful to clarify.

> > My worry is that (as I bored everyone with earlier), the
> > ktime_get_*_fast_ns() interfaces already have some sharp edges and
> > need a fair amount of thought as to when they should be used. This is
> > sort of compounding that: adding an interface that has further special
> > cases where it can fail makes it difficult to fully understand and
> > easier to accidentally misuse.
> >
> > My other concern is that interfaces always get stretched and used
> > beyond anything they were initially planned for (see the
> > ktime_get_*fast_ns() interfaces here as an example! :), and in this
> > case the logic seems to have lots of implicit dependencies on the
> > facts of your specific use case, so it seems a bit fragile should
> > folks on other architectures with other constraints try to use it.
> >
> > So I just want to push a bit to think about how you might be able to
> > extend/generalize the existing get_system_crosststamp() for your
> > purposes, or alternatively find a way to simplify the logic's behavior
> > so it's less tied to specific constraints ("this works most of the time
> > from NMI, but otherwise no promises"). Or at least some better
> > documentation around the code, its uses and its constraints? ("NMI
> > safe" is not the same as "only safe to use from NMI" :)
>
> Since our usage is fixed (only in NMI), I prefer the latter. I think
> extending/generalizing the existing function would only make it
> extremely complex and inefficient. The new function should have the
> same constraints as the existing ktime_get_mono_fast_ns(). Since perf
> can live with ktime_get_mono_fast_ns(), there should be no problem
> with the new function's constraints. I will add more comments to
> clarify the usage and constraints to avoid abuse of the new function.

I agree the existing function is complex, so adding more complexity isn't ideal, but potentially breaking it up or reworking it might be better. Having two similar but different implementations is also a complexity. So I just want to make sure this is well considered. But clearer documentation as a first step will help.

Thanks
-john
On 2023-01-24 3:33 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>>>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>>>>>> +	/*
>>>>>> +	 * Check whether the given timestamp is on the current
>>>>>> +	 * timekeeping interval.
>>>>>> +	 */
>>>>>> +	now = tk_clock_read(tkr);
>>>>>> +	interval_start = tkr->cycle_last;
>>>>>> +	if (!cycle_between(interval_start, cycles, now))
>>>>>> +		return -EOPNOTSUPP;
>>>>> So. I've not fully thought this out, but it seems like it would be
>>>>> quite likely that you'd run into the case where the cycle_last value
>>>>> is updated and your earlier TSC timestamp isn't valid for the current
>>>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>>>> complex code to try to handle this case by interpolating the cycle
>>>>> value back in time. How well does just failing in this case work out?
>>>>>
>>>> For the case, perf falls back to the time captured in the NMI handler, via
>>>> ktime_get_mono_fast_ns().
>>> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
>>> but the goal seems to be to have more accurate timestamps on the hw
>>> events, and using the captured tsc timestamp avoids the measuring
>>> latency reading the time again. But if every timekeeping update
>>> interval (~tick) you transparently get a delayed value due to the
>>> fallback, it makes it hard to understand which timestamps are better
>>> or worse. The latency between two reads may be real or it may be just
>>> bad luck. This doesn't intuitively seem like a great benefit over more
>>> consistent latency of just using the ktime_get_mono_fast()
>>> timestamping.
>> Your understanding is correct. We want a more accurate timestamp for the
>> analysis work.
>>
>> To my understanding, the timekeeping update should not happen very often. If
> "Often" depends on your timescale.
>
>> I read the code correctly, it should happen only when adjusting NTP or
>> suspending/resuming. If so, I think the drawback should not impact
>> normal analysis work. I will call out the drawbacks in the comments
>> where the function is used.
> So the adjustments are done at tick time depending on the current NTP
> "error" (basically what the kernel tracks as the delta from its sense
> of what NTP has told us).
>
> Not just at the time when NTP makes an adjustment.
>
> So the window for it to happen is every timekeeping update (which is ~HZ).

You mean the tkf->base is updated in each tick? If so, that should be a problem.

Does the NTP "error" consistently happen on all platforms? Or just on some platforms, which we can check for and limit the usage?

There are two configurations for PEBS: single PEBS and large PEBS.

For the single PEBS mode, HW triggers an NMI for each record. The TSC is the time at which the record is generated, and we convert it in the same NMI. I don't think the NTP "error" can impact it.

But for the large PEBS mode, HW triggers an NMI only when a fixed number of records have been generated. It's very likely that the base has been updated between the records. That could bring trouble, since we can only fall back to the NMI handler time of the last record. I'm afraid we have to disable the large PEBS mode.

Stephane, what do you think? Is it still useful for you if we only support single PEBS?

Thanks,
Kan
On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <kan.liang@linux.intel.com> wrote: > On 2023-01-24 3:33 p.m., John Stultz wrote: > > On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote: > >> On 2023-01-24 1:43 p.m., John Stultz wrote: > >> I read the code correctly, it should happen only when adjusting NTP or > >> suspending/resuming. If so, I think the drawback should not impact the > >> normal analysis work. I will call out the drwabacks in the comments > >> where the function is used. > > So the adjustments are done at tick time depending on the current NTP > > "error" (basically what the kernel tracks as the delta from its sense > > of what NTP has told us). > > > > Not just at the time when ntp makes an adjustment. > > > > So the window for it to happen is every timekeeping update (which is ~HZ). > > You mean the tkf->base is updated in each tick? Right, every call to timekeeping_update(), which is called for every major state change in time along with regular updates via timekeeping_advance() which is called from update_wall_time() which gets called from the various tick handlers. > If so, that should be a problem. > > Does the NTP "error" consistently happen on all the platforms? Or just > on some platforms which we can check and limit the usage? Basically any platform where adjtimex was ever called. Both offset and freq adjustments will trigger this. The trouble is that the adjustments requested are finer grained than what the actual clocksources mult values can be adjusted for. So the kernel tracks the error internally and will oscillate the clocksource multiplier value to keep the long-term accuracy close to what was requested. > There are two configurations for PEBS, single PEBS, and large PEBS. > For the single PEBS mode, HW triggers a NMI for each record. The TSC is > the time which the record is generated, and we convert it in the same > NMI. I don't think the NTP "error" can impact it. 
If the clocksource is being adjusted in between reads (imagine the
freq is dropped), when using the fast accessors, time may seem to go
backwards between those reads.

In fact, if the same TSC value is used for two calls in a row, you
could see time go backwards or forwards between them if the adjustment
happened in between.

The regular (non-fast) clocksource accessors use the seqcount to
prevent reads taken while the adjustment is being made from being
used (you'll note we always re-read the clocksource within the
seqcount loop), but this makes them not NMI safe.

> But for the large PEBS mode, HW triggers a NMI only when a fixed number
> of records are generated. It's very likely that the base has been
> updated between the records. That could bring troubles, since we can
> only fall back to the NMI handler time of the last record. I'm afraid we
> have to disable the large PEBS mode.

Right, if I recall, this is why the interpolation is done in the
existing get_device_system_crosststamp(). It's obviously never going
to be perfect, as we don't keep the timekeeper state for every tick,
but it might be better than the bailout you have if you didn't luck
into the current interval.

My apologies, as my intention isn't to dissuade you here, but just to
make sure the complexities (and unfortunately there are a lot) are
well worked out and well tested.

thanks
-john
On 2023-01-24 5:40 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 3:33 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>>> I read the code correctly, it should happen only when adjusting NTP or
>>>> suspending/resuming. If so, I think the drawback should not impact the
>>>> normal analysis work. I will call out the drawbacks in the comments
>>>> where the function is used.
>>> So the adjustments are done at tick time depending on the current NTP
>>> "error" (basically what the kernel tracks as the delta from its sense
>>> of what NTP has told us).
>>>
>>> Not just at the time when ntp makes an adjustment.
>>>
>>> So the window for it to happen is every timekeeping update (which is ~HZ).
>>
>> You mean the tkf->base is updated in each tick?
>
> Right, every call to timekeeping_update(), which is called for every
> major state change in time along with regular updates via
> timekeeping_advance(), which is called from update_wall_time(), which
> gets called from the various tick handlers.
>
>> If so, that should be a problem.
>>
>> Does the NTP "error" consistently happen on all the platforms? Or just
>> on some platforms which we can check and limit the usage?
>
> Basically any platform where adjtimex was ever called. Both offset
> and freq adjustments will trigger this. The trouble is that the
> adjustments requested are finer grained than what the actual
> clocksource's mult value can be adjusted for. So the kernel tracks the
> error internally and will oscillate the clocksource multiplier value
> to keep the long-term accuracy close to what was requested.
>
>> There are two configurations for PEBS, single PEBS and large PEBS.
>> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
>> the time which the record is generated, and we convert it in the same
>> NMI.
>> I don't think the NTP "error" can impact it.
>
> If the clocksource is being adjusted in between reads (imagine the
> freq is dropped), when using the fast accessors, time may seem to go
> backwards between those reads.
>
> In fact, if the same TSC value is used for two calls in a row, you
> could see time go backwards or forwards between them if the adjustment
> happened in between.
>
> The regular (non-fast) clocksource accessors use the seqcount to
> prevent reads taken while the adjustment is being made from being
> used (you'll note we always re-read the clocksource within the
> seqcount loop), but this makes them not NMI safe.
>
>> But for the large PEBS mode, HW triggers a NMI only when a fixed number
>> of records are generated. It's very likely that the base has been
>> updated between the records. That could bring troubles, since we can
>> only fall back to the NMI handler time of the last record. I'm afraid we
>> have to disable the large PEBS mode.
>
> Right, if I recall, this is why the interpolation is done in the
> existing get_device_system_crosststamp(). It's obviously never going
> to be perfect, as we don't keep the timekeeper state for every tick,
> but it might be better than the bailout you have if you didn't luck
> into the current interval.
>
> My apologies, as my intention isn't to dissuade you here, but just to
> make sure the complexities (and unfortunately there are a lot) are
> well worked out and well tested.

Thank you very much for your patience and all the details. I really
appreciate it. I think I understand the problem and the concerns now. I
will reevaluate the current design and try to find a proper solution.

Thanks,
Kan
On Tue, Jan 24 2023 at 01:10, Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> It's useful to provide an NMI safe function to convert a given time to
>>> monotonic. For example, the perf_event subsystem wants to convert a TSC
>>> of a PEBS record to a monotonic clock in an NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>
> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.

Sure. Postprocessing can do that if the kernel provides the relevant
conversion information. Absolutely no need for doing this in the kernel.

> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from that event
> to be timestamped using the same clock source, regardless of PEBS or
> IBS.

Just because the kernel can do the conversion, there is no requirement
to actually do so. Instrumentation has to be as lightweight as possible,
and a lot of instrumentation goes the extra mile of doing postprocessing
exactly for this reason.

So if we implement this conversion in the kernel, then there needs to be
a compelling technical reason to do so. So far the provided arguments
are in the 'I want a pony' realm.

Thanks,

        tglx
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..234fa87a846b 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -289,6 +289,15 @@ extern int get_device_system_crosststamp(
 			struct system_time_snapshot *history,
 			struct system_device_crosststamp *xtstamp);
 
+/*
+ * Fast NMI safe way to convert a given timestamp to clock monotonic
+ */
+extern int get_mono_fast_from_given_time(int (*get_time_fn)
+					 (struct system_counterval_t *sys_counterval,
+					  void *ctx),
+					 void *ctx,
+					 u64 *mono_ns);
+
 /*
  * Simultaneously snapshot realtime and monotonic raw clocks
  */
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 5579ead449f2..5bd32b2981fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -431,14 +431,19 @@ static void update_fast_timekeeper(const struct tk_read_base *tkr,
 	memcpy(base + 1, base, sizeof(*base));
 }
 
-static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+static __always_inline u64 fast_tk_get_delta_ns_from_cycles(struct tk_read_base *tkr,
+							    u64 cycles)
 {
-	u64 delta, cycles = tk_clock_read(tkr);
+	u64 delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);
 
-	delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);
 	return timekeeping_delta_to_ns(tkr, delta);
 }
 
+static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+{
+	return fast_tk_get_delta_ns_from_cycles(tkr, tk_clock_read(tkr));
+}
+
 static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
 {
 	struct tk_read_base *tkr;
@@ -1303,6 +1308,63 @@ int get_device_system_crosststamp(int (*get_time_fn)
 }
 EXPORT_SYMBOL_GPL(get_device_system_crosststamp);
 
+/**
+ * get_mono_fast_from_given_time - Fast NMI safe access to convert a given
+ *				   timestamp to clock monotonic
+ * @get_time_fn:	Callback to get the given time and its clocksource
+ * @ctx:		Context passed to get_time_fn()
+ * @mono_ns:		The monotonic time of the given time
+ */
+int notrace get_mono_fast_from_given_time(int (*get_time_fn)
+					  (struct system_counterval_t *sys_counterval,
+					   void *ctx),
+					  void *ctx,
+					  u64 *mono_ns)
+{
+	struct system_counterval_t system_counterval;
+	struct tk_fast *tkf = &tk_fast_mono;
+	u64 cycles, now, interval_start;
+	struct tk_read_base *tkr;
+	unsigned int seq;
+	int ret;
+
+	do {
+		seq = raw_read_seqcount_latch(&tkf->seq);
+		tkr = tkf->base + (seq & 0x01);
+
+		ret = get_time_fn(&system_counterval, ctx);
+		if (ret)
+			return ret;
+
+		/*
+		 * Verify that the clocksource associated with the given
+		 * timestamp is the same as the currently installed
+		 * timekeeper clocksource
+		 */
+		if (tkr->clock != system_counterval.cs)
+			return -EOPNOTSUPP;
+		cycles = system_counterval.cycles;
+
+		/*
+		 * Check whether the given timestamp is on the current
+		 * timekeeping interval.
+		 */
+		now = tk_clock_read(tkr);
+		interval_start = tkr->cycle_last;
+		if (!cycle_between(interval_start, cycles, now))
+			return -EOPNOTSUPP;
+
+		now = ktime_to_ns(tkr->base);
+		now += fast_tk_get_delta_ns_from_cycles(tkr, cycles);
+
+	} while (read_seqcount_latch_retry(&tkf->seq, seq));
+
+	*mono_ns = now;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(get_mono_fast_from_given_time);
+
 /**
  * do_settimeofday64 - Sets the time of day.
  * @ts:		pointer to the timespec64 variable containing the new time