Message ID | 20230213190754.1836051-3-kan.liang@linux.intel.com |
---|---|
State | New |
Headers |
From: kan.liang@linux.intel.com To: tglx@linutronix.de, jstultz@google.com, peterz@infradead.org, mingo@redhat.com, linux-kernel@vger.kernel.org Cc: sboyd@kernel.org, eranian@google.com, namhyung@kernel.org, ak@linux.intel.com, adrian.hunter@intel.com, Kan Liang <kan.liang@linux.intel.com> Subject: [RFC PATCH V2 2/9] perf: Extend ABI to support post-processing monotonic raw conversion Date: Mon, 13 Feb 2023 11:07:47 -0800 Message-Id: <20230213190754.1836051-3-kan.liang@linux.intel.com> In-Reply-To: <20230213190754.1836051-1-kan.liang@linux.intel.com> References: <20230213190754.1836051-1-kan.liang@linux.intel.com> List-ID: <linux-kernel.vger.kernel.org> |
Series | Convert TSC to monotonic raw clock for PEBS |
Commit Message
Liang, Kan
Feb. 13, 2023, 7:07 p.m. UTC
From: Kan Liang <kan.liang@linux.intel.com>

The monotonic raw clock is not affected by NTP/PTP correction. The
calculation of the monotonic raw clock can be done in post-processing,
which reduces the kernel overhead.

Add hw_time to struct perf_event_attr to tell the kernel to dump the
raw HW time to user space. The perf tool will calculate the HW time
in post-processing.
Currently, only the monotonic raw conversion is supported.
Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
HW time can only be provided in a sample by HW. For other types of
records, the user-requested clock is returned as usual. Nothing
is changed.

Add the perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
conversion information. cap_user_time_mono_raw also indicates
whether the monotonic raw conversion information is available.
If yes, the monotonic raw clock can be calculated as
  mono_raw = base + ((cyc - last) * mult + nsec) >> shift

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 21 ++++++++++++++++++---
 kernel/events/core.c            |  7 +++++++
 2 files changed, 25 insertions(+), 3 deletions(-)
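The conversion formula in the commit message can be sketched in C roughly as follows. This is only an illustration, not the patch's code: the struct and its field names are hypothetical and simply mirror the symbols in the formula. It also assumes the intended grouping is base + (((cyc - last) * mult + nsec) >> shift), because with plain C precedence the shift would otherwise apply to the whole sum.

```c
#include <stdint.h>

/* Hypothetical container for the values named in the formula; the field
 * names mirror the formula's symbols, not the actual ABI layout. */
struct mono_raw_conv {
    uint64_t base;   /* monotonic raw time (ns) at the last kernel update */
    uint64_t last;   /* counter value at the last kernel update */
    uint64_t nsec;   /* shifted nanosecond remainder */
    uint32_t mult;   /* cycles-to-ns multiplier */
    uint32_t shift;  /* cycles-to-ns shift */
};

/* Assumed grouping: base + (((cyc - last) * mult + nsec) >> shift).
 * The product can overflow 64 bits for very large deltas; this sketch
 * does not guard against that. */
static uint64_t cyc_to_mono_raw(const struct mono_raw_conv *c, uint64_t cyc)
{
    uint64_t delta = cyc - c->last;

    return c->base + ((delta * c->mult + c->nsec) >> c->shift);
}
```

For example, with mult = 512 and shift = 10 (two cycles per nanosecond), a counter value 200 cycles past `last` maps to `base + 100`.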
Comments
On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> conversion information. The cap_user_time_mono_raw also indicates
> whether the monotonic raw conversion information is available.
> If yes, the clock monotonic raw can be calculated as
> mono_raw = base + ((cyc - last) * mult + nsec) >> shift

Again, I appreciate you reworking and resending this series out, I
know it took some effort.

But oof, I'd really like to make sure we're not exporting timekeeping
internals to userland.

I think Thomas' suggestion of doing the timestamp conversion in
post-processing was more about interpolating collected system times
with the counter (tsc) values captured.

I get that the interpolation can be difficult, as the counter value and
system time can't currently be collected atomically, so potentially there
may be a need for a way to tie the two together (see my previous email's
thought of ktime_get_raw_monotonic_from_timestamp()), but we'd
probably want a clear understanding of the benefit (quantitative
reduction in interpolation error, and what real benefit that brings),
and would also want the driver to generate and share those pairs
rather than having userland have access.

thanks
-john
On 2023-02-13 2:37 p.m., John Stultz wrote:
> But oof, I'd really like to make sure we're not exporting timekeeping
> internals to userland.
>
> I think Thomas' suggestion of doing the timestamp conversion in
> post-processing was more about interpolating collected system times
> with the counter (tsc) values captured.
>

Thomas, could you please clarify your suggestion regarding "the relevant
conversion information" provided by the kernel?
https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/

Is it only the interpolation information or the entire conversion
information (mult, shift, etc.)?

If it's only the interpolation information, user space will lack the
information to handle all the cases. If I understand John's comments
correctly, it could also bring some interpolation error which can only
be addressed by the mult/shift conversion.

If the suggestion is to dump the entire conversion information into
user space, we have to expose the timekeeping internals.

Considering the above difficulties, could we use the kernel conversion?
(The current perf already uses the kernel conversion for monotonic raw.
It should not bring extra overhead.)

Thanks,
Kan
On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> If it's only the interpolation information, the user space will be lack
> of information to handle all the cases. If I understand John's comments
> correctly, it could also bring some interpolation error which can only
> be addressed by the mult/shift conversion.

"Only" is maybe too strong a word. I think having the driver use
kernel timekeeping accessors to pair CLOCK_MONOTONIC_RAW time with
counter values will minimize the error.

But again, it's not yet established that any interpolation error using
existing interfaces is great enough to be problematic here.

The interpolation is pretty easy to do:

do {
    start = readtsc();
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    end = readtsc();
    delta = end - start;
} while (delta > THRESHOLD); // make sure the reads were not preempted
mid = start + (delta + 1) / 2; // round-closest

and that should get you a fairly close matching of a TSC value to a
CLOCK_MONOTONIC_RAW value.

Once you have that mapping you can take a few samples and establish
the linear function.

But that will have some error, so quantifying that error helps
establish why being able to get an atomic mapping of TSC ->
CLOCK_MONOTONIC_RAW would help.

So I really don't think we need to expose the kernel internal values
to userland, but I'm willing to guess the atomic mapping (which the
driver will have access to, not userland) may be helpful for the fine
granularity you want in the trace.

thanks
-john
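A compilable userland version of the bracketing loop above might look like the sketch below. It is not taken from the thread: `read_counter()`, `sample_pair()`, `struct counter_pair` and the `max_tries` bound are hypothetical names added so the loop cannot spin forever. On x86 it reads the TSC with the compiler intrinsic `__rdtsc()`. On other architectures it falls back to CLOCK_MONOTONIC as a stand-in counter, which is purely an assumption to keep the example buildable. The midpoint is computed as start + delta/2 rounded to nearest, which is what the "round-closest" comment describes.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_counter(void) { return __rdtsc(); }
#else
/* Assumption: no TSC here, so use CLOCK_MONOTONIC ns as a stand-in counter. */
static inline uint64_t read_counter(void)
{
    struct timespec t;

    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ull + (uint64_t)t.tv_nsec;
}
#endif

struct counter_pair {
    uint64_t counter;  /* midpoint of the two counter reads */
    uint64_t mono_raw; /* CLOCK_MONOTONIC_RAW, in ns */
};

/*
 * Bracket one clock read between two counter reads. Retry while the window
 * is wider than `threshold` ticks (for example because the reads were
 * preempted), giving up after `max_tries`. Returns 0 on success, -1 otherwise.
 */
static int sample_pair(uint64_t threshold, int max_tries, struct counter_pair *out)
{
    for (int i = 0; i < max_tries; i++) {
        struct timespec ts;
        uint64_t start = read_counter();

        if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0)
            return -1;

        uint64_t end = read_counter();
        uint64_t delta = end - start;

        if (delta <= threshold) {
            out->counter = start + (delta + 1) / 2; /* round to nearest */
            out->mono_raw = (uint64_t)ts.tv_sec * 1000000000ull +
                            (uint64_t)ts.tv_nsec;
            return 0;
        }
    }
    return -1;
}
```

The threshold is left to the caller because the right value depends on the machine. Two pairs taken some time apart are enough to fit a line through them.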
On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> The interpolation is pretty easy to do:
>
> do {
>     start = readtsc();
>     clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>     end = readtsc();
>     delta = end - start;
> } while (delta > THRESHOLD); // make sure the reads were not preempted
> mid = start + (delta + 1) / 2; // round-closest
>
> and be able to get you a fairly close matching of TSC to
> CLOCK_MONOTONIC_RAW value.
>
> Once you have that mapping you can take a few samples and establish
> the linear function.

Right, this is how we do the TSC calibration in the first place, and if
NTP can achieve high correctness over a network, then surely we can do
better locally.

That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
On 2023-02-13 5:22 p.m., John Stultz wrote:
> On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> If it's only the interpolation information, the user space will be lack
>> of information to handle all the cases. If I understand John's comments
>> correctly, it could also bring some interpolation error which can only
>> be addressed by the mult/shift conversion.
>

Thanks for the details, John.

> "Only" is maybe too strong a word. I think having the driver use
> kernel timekeeping accessors to pair CLOCK_MONOTONIC_RAW time with
> counter values will minimize the error.
>

The key motivation for using the TSC in the PEBS record is to get an
accurate timestamp for each record. We definitely want the conversion
to have minimal error.

> But again, it's not yet established that any interpolation error using
> existing interfaces is great enough to be problematic here.
>
> The interpolation is pretty easy to do:
>
> do {
>     start = readtsc();
>     clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>     end = readtsc();
>     delta = end - start;
> } while (delta > THRESHOLD); // make sure the reads were not preempted
> mid = start + (delta + 1) / 2; // round-closest
>

How should the THRESHOLD be chosen? It seems the THRESHOLD value also
affects the accuracy.

> and be able to get you a fairly close matching of TSC to
> CLOCK_MONOTONIC_RAW value.
>
> Once you have that mapping you can take a few samples and establish
> the linear function.
>
> But that will have some error, so quantifying that error helps
> establish why being able to get an atomic mapping of TSC ->
> CLOCK_MONOTONIC_RAW would help.
>
> So I really don't think we need to expose the kernel internal values
> to userland, but I'm willing to guess the atomic mapping (which the
> driver will have access to, not userland) may be helpful for the fine
> granularity you want in the trace.
>

If I understand correctly, the idea is to let the user space tool run
the above interpolation algorithm several times to 'guess' the atomic
mapping, and then use that mapping to convert the TSC from the PEBS
record. Is my understanding correct?

If so, to be honest, I doubt we can get the accuracy we want.

Thanks,
Kan
On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> If I understand correctly, the idea is to let the user space tool run
> the above interpolation algorithm several times to 'guess' the atomic
> mapping, and then use that mapping to convert the TSC from the PEBS
> record. Is my understanding correct?
>
> If so, to be honest, I doubt we can get the accuracy we want.
>

I implemented a simple test to evaluate the error.

I collected TSC -> CLOCK_MONOTONIC_RAW mappings using the above algorithm
at the start and end of the perf command.

         MONO_RAW          TSC
start    89553516545645    223619715214239
end      89562251233830    223641517000376

Here is what I get via the mult/shift conversion from this patch.

         MONO_RAW          TSC
PEBS     89555942691466    223625770878571

Then I used the time information from start and end to create a linear
function and 'guess' the MONO_RAW of the PEBS record from its TSC. I get
89555942692721. That is a 1255 ns difference.
I tried several different PEBS records. The error is ~1000 ns.
I think that is an observable error.

Thanks,
Kan
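The 'guess' described above is a two-point linear interpolation, and the arithmetic can be reproduced with a small C sketch. The helper `interpolate_mono_raw()` and `struct map_point` are hypothetical names, not code from the series. The sketch uses `double` for the slope product, which is accurate enough for a span of a few seconds like this one but is not exact integer arithmetic.

```c
#include <stdint.h>

struct map_point {
    uint64_t tsc;      /* counter value */
    uint64_t mono_raw; /* matching CLOCK_MONOTONIC_RAW value, in ns */
};

/*
 * Map `tsc` onto the straight line through points a and b.
 * Assumes a.tsc < b.tsc and a.tsc <= tsc. The fractional nanosecond
 * is truncated.
 */
static uint64_t interpolate_mono_raw(struct map_point a, struct map_point b,
                                     uint64_t tsc)
{
    double tsc_off = (double)(tsc - a.tsc);
    double ns_span = (double)(b.mono_raw - a.mono_raw);
    double tsc_span = (double)(b.tsc - a.tsc);

    return a.mono_raw + (uint64_t)(tsc_off * ns_span / tsc_span);
}
```

Fed the start and end pairs and the PEBS TSC value 223625770878571 from the tables above, this returns 89555942692721, which is 1255 ns above the mult/shift result of 89555942691466.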
On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
> Right, this is how we do the TSC calibration in the first place, and if
> NTP can achieve high correctness over a network, then surely we can do
> better locally.
>
> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.

If I understand correctly, the TSC calibration is done in the kernel.
The kernel keeps updating the mul/shift. We dump the mul/shift into the
perf mmap page for the user tools.

But for the CLOCKs, the mul/shift are kernel-internal values which
we don't want to expose to user space. If we only apply the scheme
in user space, it brings some observable errors, based on my test
mentioned in the other thread.

Thanks,
Kan
On Tue, Feb 14, 2023 at 7:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
> Right, this is how we do the TSC calibration in the first place, and if
> NTP can achieve high correctness over a network, then surely we can do
> better locally.
>
> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.

Well, CLOCK_MONOTONIC_RAW is at least a fixed function; we don't
change its frequency. Other clocks will likely be adjusted over
their lifetime, so their frequency has to be continually re-derived,
and they aren't ideal for this sort of interpolation.

thanks
-john
On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> If I understand correctly, the TSC calibration is done in the kernel.
> The kernel keeps updating the mul/shift. We dump the mul/shift into the
> perf mmap page for the user tools.

Where is that done in the perf mmap? I wasn't aware.

thanks
-john
On Tue, Feb 14, 2023 at 6:51 AM Liang, Kan <kan.liang@linux.intel.com> wrote: > On 2023-02-13 5:22 p.m., John Stultz wrote: > > On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote: > >> On 2023-02-13 2:37 p.m., John Stultz wrote: > >>> On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote: > >>>> > >>>> From: Kan Liang <kan.liang@linux.intel.com> > >>>> > >>>> The monotonic raw clock is not affected by NTP/PTP correction. The > >>>> calculation of the monotonic raw clock can be done in the > >>>> post-processing, which can reduce the kernel overhead. > >>>> > >>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the > >>>> raw HW time to user space. The perf tool will calculate the HW time > >>>> in post-processing. > >>>> Currently, only supports the monotonic raw conversion. > >>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate > >>>> HW time can only be provided in a sample by HW. For other type of > >>>> records, the user requested clock should be returned as usual. Nothing > >>>> is changed. > >>>> > >>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the > >>>> conversion information. The cap_user_time_mono_raw also indicates > >>>> whether the monotonic raw conversion information is available. > >>>> If yes, the clock monotonic raw can be calculated as > >>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift > >>> > >>> Again, I appreciate you reworking and resending this series out, I > >>> know it took some effort. > >>> > >>> But oof, I'd really like to make sure we're not exporting timekeeping > >>> internals to userland. > >>> > >>> I think Thomas' suggestion of doing the timestamp conversion in > >>> post-processing was more about interpolating collected system times > >>> with the counter (tsc) values captured. > >>> > >> > >> Thomas, could you please clarify your suggestion regarding "the relevant > >> conversion information" provided by the kernel? 
> >> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/ > >> > >> Is it only the interpolation information or the entire conversion > >> information (Mult, shift etc.)? > >> > >> If it's only the interpolation information, the user space will be lack > >> of information to handle all the cases. If I understand John's comments > >> correctly, it could also bring some interpolation error which can only > >> be addressed by the mult/shift conversion. > > > > > Thanks for the details John. > > > "Only" is maybe too strong a word. I think having the driver use > > kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with > > counter values will minimize the error. > > > > The key motivation of using the TSC in the PEBS record is to get an > accurate timestamp of each record. We definitely want the conversion has > minimized error. Yep. > > But again, it's not yet established that any interpolation error using > > existing interfaces is great enough to be problematic here. > > > > The interpoloation is pretty easy to do: > > > > do { > > start= readtsc(); > > clock_gett(CLOCK_MONOTONIC_RAW, &ts); > > end = readtsc(); > > delta = end-start; > > } while (delta > THRESHOLD) // make sure the reads were not preempted > > mid = start + (delta +(delta/2))/2; //round-closest > > > > How to choose the THRESHOLD? It seems the THRESHOLD value also impacts > the accuracy. Maybe by running a number of of these reads and collecting the detlas, then setting THRESHOLD to a standard deviation of the results? (I'm sure there's more sound methods, but I'd have to do some digging to find them) Alternatively you could always take 10 samples and then only do the mapping with the smallest delta value. > > and be able to get you a fairly close matching of TSC to > > CLOCK_MONOTONIC_RAW value. > > > > Once you have that mapping you can take a few samples and establish > > the linear function. 
> > > > But that will have some error, so quantifying that error helps > > establish why being able to get an atomic mapping of TSC -> > > CLOCK_MONOTONIC_RAW would help. > > > > So I really don't think we need to expose the kernel internal values > > to userland, but I'm willing to guess the atomic mapping (which the > > driver will have access to, not userland) may be helpful for the fine > > granularity you want in the trace. > > > > If I understand correctly, the idea is to let the user space tool run > the above interpoloation algorithm several times to 'guess' the atomic > mapping. Using the mapping information to covert the TSC from the PEBS > record. Is my understanding correct? So I think that's what Thomas was suggesting. The next step would probably be to provide a way for the driver to provide atomic TSC->CLOCK_MONOTONIC_RAW samples, so userland can calculate the function itself. So then the problem becomes if X1 and Y1 are exactly mapped, and X2 and Y2 are exactly mapped, then given X3, find Y3. And if that doesn't work, then we would have to see about having the driver do all the conversions. > If so, to be honest, I doubt we can get the accuracy we want. Sure. I just want to make sure its quantified that the pure userland interpolation approach won't work before we go adding in extra in-kernel logic (We'd obviously rather do the logic that can be done in userland in userland) thanks -john
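John's alternative to tuning THRESHOLD (take a fixed number of bracketed reads and keep only the one with the smallest delta) can be sketched roughly as below. This is only an illustration: the struct and function names are hypothetical, and the samples are assumed to have been collected already with the rdtsc / clock_gettime / rdtsc pattern quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bracketed reading: TSC before and after a clock_gettime() call.
 * Field names are illustrative, not taken from the patch series. */
struct tsc_sample {
    uint64_t tsc_start;    /* TSC read just before clock_gettime() */
    uint64_t tsc_end;      /* TSC read just after clock_gettime() */
    uint64_t mono_raw_ns;  /* CLOCK_MONOTONIC_RAW result in ns */
};

/* Return the index of the sample with the smallest (end - start) window.
 * A narrow window bounds how far the guessed TSC can be from the TSC value
 * the kernel actually used, so it is the best candidate for the mapping. */
static size_t pick_tightest(const struct tsc_sample *s, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t d_i = s[i].tsc_end - s[i].tsc_start;
        uint64_t d_best = s[best].tsc_end - s[best].tsc_start;
        if (d_i < d_best)
            best = i;
    }
    return best;
}
```

Picking the minimum avoids having to choose a threshold up front, at the cost of always paying for the full set of reads.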
On 2023-02-14 2:37 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
>>> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
>>>> The interpolation is pretty easy to do:
>>>>
>>>> do {
>>>>     start = readtsc();
>>>>     clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>>>     end = readtsc();
>>>>     delta = end-start;
>>>> } while (delta > THRESHOLD) // make sure the reads were not preempted
>>>> mid = start + (delta +(delta/2))/2; //round-closest
>>>>
>>>> and be able to get you a fairly close matching of TSC to
>>>> CLOCK_MONOTONIC_RAW value.
>>>>
>>>> Once you have that mapping you can take a few samples and establish
>>>> the linear function.
>>>
>>> Right, this is how we do the TSC calibration in the first place, and if
>>> NTP can achieve high correctness over a network, then surely we can do
>>> better locally.
>>>
>>> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
>>
>> If I understand correctly, the TSC calibration is done in the kernel.
>> The kernel keeps updating the mul/shift. We dump the mul/shift into the
>> perf mmap page for the user tools.
>
> Where is that done in the perf mmap? I wasn't aware.

The updating of the mul/shift for sched_clock should be done in
set_cyc2ns_scale() in tsc.c.

The perf user space tool mmaps a page to retrieve the enabling
time/running time from the kernel. On x86 and Arm, the conversion
information from HW time (TSC) to sched_clock/perf_time is also stored
in the page. Please see arch_perf_update_userpage(). In the perf
mmap, it only retrieves the current mul/shift information and writes it
into the page for the user space tool.

This V2 patch series tries to do the same thing for the monotonic raw
conversion. So the kernel internal mul/shift information has to be exposed.

Thanks,
Kan
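For readers following the mul/shift discussion, the conversion quoted from the patch description earlier in the thread (mono_raw = base + ((cyc - last) * mult + nsec) >> shift) can be sketched as follows. The struct is hypothetical and only mirrors the symbols in that formula; it is not the layout the patch or the kernel uses. The sketch also assumes the shift is meant to apply only to the cycle-derived term, which the formula as written does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the conversion state. Field names follow the
 * symbols in the quoted formula, not a real kernel or perf structure. */
struct raw_conv {
    uint64_t base;   /* nanoseconds at the last update */
    uint64_t last;   /* cycle counter value at the last update */
    uint64_t nsec;   /* remainder carried from the last update, in shifted units */
    uint32_t mult;   /* cycles -> shifted-ns multiplier */
    uint32_t shift;  /* right shift that completes the scaling */
};

/* Convert a raw cycle count to nanoseconds using mult/shift scaling.
 * Parenthesised explicitly, because in C '+' binds tighter than '>>'.
 * Note: (cyc - last) * mult can overflow 64 bits if the snapshot is stale. */
static uint64_t cyc_to_mono_raw(const struct raw_conv *c, uint64_t cyc)
{
    assert(cyc >= c->last);
    return c->base + (((cyc - c->last) * c->mult + c->nsec) >> c->shift);
}
```

With mult = 410 and shift = 10 (about 0.4 ns per cycle), 1000 cycles past `last` convert to 400 ns on top of `base`.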
On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> > If I understand correctly, the idea is to let the user space tool run
> > the above interpolation algorithm several times to 'guess' the atomic
> > mapping. Using the mapping information to convert the TSC from the PEBS
> > record. Is my understanding correct?
> >
> > If so, to be honest, I doubt we can get the accuracy we want.
> >
>
> I implemented a simple test to evaluate the error.

Very cool!

> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> at the start and end of perf cmd.
>          MONO_RAW          TSC
> start    89553516545645    223619715214239
> end      89562251233830    223641517000376
>
> Here is what I get via mult/shift conversion from this patch.
>          MONO_RAW          TSC
> PEBS     89555942691466    223625770878571
>
> Then I use the time information from start and end to create a linear
> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> 89555942692721.
> There is a 1255 ns difference.
> I tried several different PEBS records. The error is ~1000ns.
> I think it should be an observable error.

Interesting. That's a good bit higher than I'd expect, as I'd expect a
clock_gettime() call to take a double-digit number of nanoseconds on
average, so the error should be within that.

Can you share your logic?

thanks
-john
On Tue, Feb 14, 2023 at 12:09 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-14 2:37 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> If I understand correctly, the TSC calibration is done in the kernel.
> >> The kernel keeps updating the mul/shift. We dump the mul/shift into the
> >> perf mmap page for the user tools.
> >
> > Where is that done in the perf mmap? I wasn't aware.
>
> The updating of the mul/shift for sched_clock should be done in
> set_cyc2ns_scale() in tsc.c.

Thanks for the pointer!

> The perf user space tool mmaps a page to retrieve the enabling
> time/running time from the kernel. On x86 and Arm, the conversion
> information from HW time (TSC) to sched_clock/perf_time is also stored
> in the page. Please see arch_perf_update_userpage(). In the perf
> mmap, it only retrieves the current mul/shift information and writes it
> into the page for the user space tool.
>
> This V2 patch series tries to do the same thing for the monotonic raw
> conversion. So the kernel internal mul/shift information has to be exposed.

Ugh. Well, I think perf may have made a bad API choice here, so I'm
still going to push back on exposing timekeeping internals to
userland.

But I do suspect that with ways to provide paired TSC/CLOCK_MONOTONIC
values, you should be able to get the same functionality in userland
as if the underlying data was shared.

thanks
-john
On 2023-02-14 3:11 p.m., John Stultz wrote: > On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >> On 2023-02-14 9:51 a.m., Liang, Kan wrote: >>> If I understand correctly, the idea is to let the user space tool run >>> the above interpoloation algorithm several times to 'guess' the atomic >>> mapping. Using the mapping information to covert the TSC from the PEBS >>> record. Is my understanding correct? >>> >>> If so, to be honest, I doubt we can get the accuracy we want. >>> >> >> I implemented a simple test to evaluate the error. > > Very cool! > >> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm >> at the start and end of perf cmd. >> MONO_RAW TSC >> start 89553516545645 223619715214239 >> end 89562251233830 223641517000376 >> >> Here is what I get via mult/shift conversion from this patch. >> MONO_RAW TSC >> PEBS 89555942691466 223625770878571 >> >> Then I use the time information from start and end to create a linear >> function and 'guess' the MONO_RAW of PEBS from the TSC. I get >> 89555942692721. >> There is a 1255 ns difference. >> I tried several different PEBS records. The error is ~1000ns. >> I think it should be an observable error. > > Interesting. That's a good bit higher than I'd expect as I'd expect a > clock_gettime() call to take ~ double digit nanoseconds range on > average, so the error should be within that. > > Can you share your logic? > I run the algorithm right before and after the perf command as below. (The source code of time is attached.) $./time $perf record -e cycles:upp --clockid monotonic_raw $some_workaround $./time The time will dump both MONO_RAW and TSC. That's where "start" and "end" from. The perf command print out both TSC and converted MONO_RAW (using the mul/shift from this patch series). That's where "PEBS" value from. Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC. 
Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
                   (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.

The guessed_MONO_RAW is 89555942692721.
The PEBS_MONO_RAW is 89555942691466.
The difference is 1255.

Is the calculation correct?

Thanks,
Kan

#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <errno.h>

static inline unsigned long rdtsc ()
{
	unsigned long var;
	unsigned int hi, lo;

	asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	var = ((unsigned long long int) hi << 32) | lo;

	return var;
}

typedef unsigned long long u64;

int main()
{
	struct timespec ts;
	u64 start, end, delta, mid;

	do {
		start= rdtsc();
		clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
		end = rdtsc();
		delta = end-start;
	} while (delta > 20000); // make sure the reads were not preempted

	mid = start + (delta +(delta/2))/2; //round-closest

	printf("%llu %llu %llu\n", start, end, delta);
	printf("MONO_RAW: %llu TSC: %llu\n",
	       (u64)ts.tv_sec * 1000000000 + ts.tv_nsec, mid);
}
On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote: > On 2023-02-14 3:11 p.m., John Stultz wrote: > > On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote: > >> On 2023-02-14 9:51 a.m., Liang, Kan wrote: > >>> If I understand correctly, the idea is to let the user space tool run > >>> the above interpoloation algorithm several times to 'guess' the atomic > >>> mapping. Using the mapping information to covert the TSC from the PEBS > >>> record. Is my understanding correct? > >>> > >>> If so, to be honest, I doubt we can get the accuracy we want. > >>> > >> > >> I implemented a simple test to evaluate the error. > > > > Very cool! > > > >> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm > >> at the start and end of perf cmd. > >> MONO_RAW TSC > >> start 89553516545645 223619715214239 > >> end 89562251233830 223641517000376 > >> > >> Here is what I get via mult/shift conversion from this patch. > >> MONO_RAW TSC > >> PEBS 89555942691466 223625770878571 > >> > >> Then I use the time information from start and end to create a linear > >> function and 'guess' the MONO_RAW of PEBS from the TSC. I get > >> 89555942692721. > >> There is a 1255 ns difference. > >> I tried several different PEBS records. The error is ~1000ns. > >> I think it should be an observable error. > > > > Interesting. That's a good bit higher than I'd expect as I'd expect a > > clock_gettime() call to take ~ double digit nanoseconds range on > > average, so the error should be within that. > > > > Can you share your logic? > > > > I run the algorithm right before and after the perf command as below. > (The source code of time is attached.) > > $./time > $perf record -e cycles:upp --clockid monotonic_raw $some_workaround > $./time > > The time will dump both MONO_RAW and TSC. That's where "start" and "end" > from. > The perf command print out both TSC and converted MONO_RAW (using the > mul/shift from this patch series). 
That's where "PEBS" value from. > > Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC. > Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) * > (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW. > > The guessed_MONO_RAW is 89555942692721. > The PEBS_MONO_RAW is 89555942691466. > The difference is 1255. > > Is the calculation correct? Thanks for sharing it. The equation you have there looks ok at a high level for the values you captured (there's small tweaks like doing the mult before the div to make sure you don't hit integer precision issues, but I didn't see that with your results). I've got a todo to try to see how the calculation changes if we do provide atomic TSC/RAW stamps, here but I got a little busy with other work and haven't gotten to it. So my apologies, but I'll try to get back to this soon. thanks -john
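John's remark about doing the multiply before the divide can be made concrete with the numbers quoted above. The product (PEBS_TSC - start_TSC) * (end_MONO_RAW - start_MONO_RAW) is roughly 5.3e19 for this data set, which does not fit in 64 bits, so the sketch below widens to 128 bits first. It assumes a compiler that provides unsigned __int128 (a GCC/Clang extension, not standard C), and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Linear interpolation of MONO_RAW (ns) for a TSC value lying between two
 * (tsc, ns) reference points. Multiplies before dividing so the integer
 * division does not throw away the fractional part of the position. */
static uint64_t interp_mono_raw(uint64_t tsc0, uint64_t ns0,
                                uint64_t tsc1, uint64_t ns1,
                                uint64_t tsc)
{
    assert(tsc1 > tsc0 && tsc >= tsc0 && ns1 >= ns0);
    /* unsigned __int128 is a compiler extension; the 64-bit product
     * would overflow for the values in this thread. */
    unsigned __int128 num = (unsigned __int128)(tsc - tsc0) * (ns1 - ns0);
    return ns0 + (uint64_t)(num / (tsc1 - tsc0));
}
```

Fed the start/end pairs and the PEBS TSC from the table above, this yields 89555942692721, the same guessed value Kan reported, so the ~1255 ns gap is not an integer-precision artifact of the formula.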
Hi John, On 2023-02-17 6:11 p.m., John Stultz wrote: > On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote: >> On 2023-02-14 3:11 p.m., John Stultz wrote: >>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote: >>>>> If I understand correctly, the idea is to let the user space tool run >>>>> the above interpoloation algorithm several times to 'guess' the atomic >>>>> mapping. Using the mapping information to covert the TSC from the PEBS >>>>> record. Is my understanding correct? >>>>> >>>>> If so, to be honest, I doubt we can get the accuracy we want. >>>>> >>>> >>>> I implemented a simple test to evaluate the error. >>> >>> Very cool! >>> >>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm >>>> at the start and end of perf cmd. >>>> MONO_RAW TSC >>>> start 89553516545645 223619715214239 >>>> end 89562251233830 223641517000376 >>>> >>>> Here is what I get via mult/shift conversion from this patch. >>>> MONO_RAW TSC >>>> PEBS 89555942691466 223625770878571 >>>> >>>> Then I use the time information from start and end to create a linear >>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get >>>> 89555942692721. >>>> There is a 1255 ns difference. >>>> I tried several different PEBS records. The error is ~1000ns. >>>> I think it should be an observable error. >>> >>> Interesting. That's a good bit higher than I'd expect as I'd expect a >>> clock_gettime() call to take ~ double digit nanoseconds range on >>> average, so the error should be within that. >>> >>> Can you share your logic? >>> >> >> I run the algorithm right before and after the perf command as below. >> (The source code of time is attached.) >> >> $./time >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround >> $./time >> >> The time will dump both MONO_RAW and TSC. That's where "start" and "end" >> from. 
>> The perf command print out both TSC and converted MONO_RAW (using the >> mul/shift from this patch series). That's where "PEBS" value from. >> >> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC. >> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) * >> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW. >> >> The guessed_MONO_RAW is 89555942692721. >> The PEBS_MONO_RAW is 89555942691466. >> The difference is 1255. >> >> Is the calculation correct? > > Thanks for sharing it. The equation you have there looks ok at a high > level for the values you captured (there's small tweaks like doing the > mult before the div to make sure you don't hit integer precision > issues, but I didn't see that with your results). > > I've got a todo to try to see how the calculation changes if we do > provide atomic TSC/RAW stamps, here but I got a little busy with other > work and haven't gotten to it. > So my apologies, but I'll try to get back to this soon. > Have you got a chance to try the idea? I just want to check whether the userspace interpolation approach works. Should I prepare V3 and go back to the kernel solution? Thanks, Kan
On Wed, Mar 8, 2023 at 10:44 AM Liang, Kan <kan.liang@linux.intel.com> wrote: > On 2023-02-17 6:11 p.m., John Stultz wrote: > > On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote: > >> On 2023-02-14 3:11 p.m., John Stultz wrote: > >>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote: > >>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote: > >>>>> If I understand correctly, the idea is to let the user space tool run > >>>>> the above interpoloation algorithm several times to 'guess' the atomic > >>>>> mapping. Using the mapping information to covert the TSC from the PEBS > >>>>> record. Is my understanding correct? > >>>>> > >>>>> If so, to be honest, I doubt we can get the accuracy we want. > >>>>> > >>>> > >>>> I implemented a simple test to evaluate the error. > >>> > >>> Very cool! > >>> > >>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm > >>>> at the start and end of perf cmd. > >>>> MONO_RAW TSC > >>>> start 89553516545645 223619715214239 > >>>> end 89562251233830 223641517000376 > >>>> > >>>> Here is what I get via mult/shift conversion from this patch. > >>>> MONO_RAW TSC > >>>> PEBS 89555942691466 223625770878571 > >>>> > >>>> Then I use the time information from start and end to create a linear > >>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get > >>>> 89555942692721. > >>>> There is a 1255 ns difference. > >>>> I tried several different PEBS records. The error is ~1000ns. > >>>> I think it should be an observable error. > >>> > >>> Interesting. That's a good bit higher than I'd expect as I'd expect a > >>> clock_gettime() call to take ~ double digit nanoseconds range on > >>> average, so the error should be within that. > >>> > >>> Can you share your logic? > >>> > >> > >> I run the algorithm right before and after the perf command as below. > >> (The source code of time is attached.) 
> >> > >> $./time > >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround > >> $./time > >> > >> The time will dump both MONO_RAW and TSC. That's where "start" and "end" > >> from. > >> The perf command print out both TSC and converted MONO_RAW (using the > >> mul/shift from this patch series). That's where "PEBS" value from. > >> > >> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC. > >> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) * > >> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW. > >> > >> The guessed_MONO_RAW is 89555942692721. > >> The PEBS_MONO_RAW is 89555942691466. > >> The difference is 1255. > >> > >> Is the calculation correct? > > > > Thanks for sharing it. The equation you have there looks ok at a high > > level for the values you captured (there's small tweaks like doing the > > mult before the div to make sure you don't hit integer precision > > issues, but I didn't see that with your results). > > > > I've got a todo to try to see how the calculation changes if we do > > provide atomic TSC/RAW stamps, here but I got a little busy with other > > work and haven't gotten to it. > > So my apologies, but I'll try to get back to this soon. > > > > Have you got a chance to try the idea? > > I just want to check whether the userspace interpolation approach works. > Should I prepare V3 and go back to the kernel solution? Oh, my apologies. I had some other work come up and this fell off my plate. So I spent a little bit of time today adding some trace_printks to the timekeeping code so I could record the actual TSC and timestamps being calculated from CLOCK_MONOTONIC_RAW. 
I did catch one error in the test code, which unfortunately I'm to blame for:
  mid = start + (delta +(delta/2))/2; //round-closest

That should be
  mid = start + (delta +(2/2))/2 //round-closest
or more simply
  mid = start + (delta +1)/2; //round-closest

Generalized rounding should be: (value + (DIV/2))/DIV, but I'm
guessing with two as the divisor, my brain mixed it up and typed
"delta". My apologies!

With that fix, I'm seeing closer to ~500ns of error in the
interpolation, just using the userland sampling. Now, I've also
disabled vsyscalls for this (otherwise I wouldn't be able to
trace_printk), so the error likely would be higher than with
vsyscalls.

Now, part of the error is that:
  start= rdtsc();
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
  end = rdtsc();

Ends up looking like
  start= rdtsc();
  clock_gettime() {
     now = rdtsc();
     delta = now - last;
     ns = (delta * mult) >> shift
  [~midpoint~]
     ts->nsec = base_ns + ns;
     ts->sec = base_sec;
     normalize_ts(ts)
  }
  end = rdtsc();

And so by taking the mid-point we're always a little skewed from where
the tsc was actually read. Looking at the data for my case the tsc
read seems to be ~12% in, so you could instead try:

  delta = end - start;
  p12 = start + ((delta * 12) + (100/2))/100;

With that adjustment, I'm seeing error around ~40ns.

Mind giving that a try?

Now, if you had two snapshots of MONOTONIC_RAW + the TSC value used to
calculate it (maybe the driver accesses this via a special internal
timekeeping interface), in my testing interpolating will give you
sub-ns error. So I think this is workable without exposing quite so
much to userland.

thanks
-john
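The rounding rule John describes, (value + DIV/2)/DIV, and the "N percent into the window" guess can be written as two small helpers. This is a sketch with made-up function names. Note that the original expression (delta + delta/2)/2 works out to about three quarters of delta, so it placed the guess near the 75% point of the window rather than the middle.

```c
#include <assert.h>
#include <stdint.h>

/* Integer division rounded to nearest: (value + div/2) / div. */
static uint64_t div_round_closest(uint64_t value, uint64_t div)
{
    assert(div != 0);
    return (value + div / 2) / div;
}

/* TSC value `percent` of the way from `start` to `end`.
 * percent = 50 is the midpoint; percent = 12 is the p12 guess above. */
static uint64_t tsc_at_percent(uint64_t start, uint64_t end, unsigned percent)
{
    assert(end >= start && percent <= 100);
    return start + div_round_closest((end - start) * percent, 100);
}
```

For a window of 9138 cycles, the midpoint offset is 4569 cycles and the 12% offset is 1097 cycles, whereas (delta + delta/2)/2 would have given 6853.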
On 2023-03-08 8:17 p.m., John Stultz wrote: > On Wed, Mar 8, 2023 at 10:44 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >> On 2023-02-17 6:11 p.m., John Stultz wrote: >>> On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote: >>>> On 2023-02-14 3:11 p.m., John Stultz wrote: >>>>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >>>>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote: >>>>>>> If I understand correctly, the idea is to let the user space tool run >>>>>>> the above interpoloation algorithm several times to 'guess' the atomic >>>>>>> mapping. Using the mapping information to covert the TSC from the PEBS >>>>>>> record. Is my understanding correct? >>>>>>> >>>>>>> If so, to be honest, I doubt we can get the accuracy we want. >>>>>>> >>>>>> >>>>>> I implemented a simple test to evaluate the error. >>>>> >>>>> Very cool! >>>>> >>>>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm >>>>>> at the start and end of perf cmd. >>>>>> MONO_RAW TSC >>>>>> start 89553516545645 223619715214239 >>>>>> end 89562251233830 223641517000376 >>>>>> >>>>>> Here is what I get via mult/shift conversion from this patch. >>>>>> MONO_RAW TSC >>>>>> PEBS 89555942691466 223625770878571 >>>>>> >>>>>> Then I use the time information from start and end to create a linear >>>>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get >>>>>> 89555942692721. >>>>>> There is a 1255 ns difference. >>>>>> I tried several different PEBS records. The error is ~1000ns. >>>>>> I think it should be an observable error. >>>>> >>>>> Interesting. That's a good bit higher than I'd expect as I'd expect a >>>>> clock_gettime() call to take ~ double digit nanoseconds range on >>>>> average, so the error should be within that. >>>>> >>>>> Can you share your logic? >>>>> >>>> >>>> I run the algorithm right before and after the perf command as below. >>>> (The source code of time is attached.) 
>>>> >>>> $./time >>>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround >>>> $./time >>>> >>>> The time will dump both MONO_RAW and TSC. That's where "start" and "end" >>>> from. >>>> The perf command print out both TSC and converted MONO_RAW (using the >>>> mul/shift from this patch series). That's where "PEBS" value from. >>>> >>>> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC. >>>> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) * >>>> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW. >>>> >>>> The guessed_MONO_RAW is 89555942692721. >>>> The PEBS_MONO_RAW is 89555942691466. >>>> The difference is 1255. >>>> >>>> Is the calculation correct? >>> >>> Thanks for sharing it. The equation you have there looks ok at a high >>> level for the values you captured (there's small tweaks like doing the >>> mult before the div to make sure you don't hit integer precision >>> issues, but I didn't see that with your results). >>> >>> I've got a todo to try to see how the calculation changes if we do >>> provide atomic TSC/RAW stamps, here but I got a little busy with other >>> work and haven't gotten to it. >>> So my apologies, but I'll try to get back to this soon. >>> >> >> Have you got a chance to try the idea? >> >> I just want to check whether the userspace interpolation approach works. >> Should I prepare V3 and go back to the kernel solution? > > Oh, my apologies. I had some other work come up and this fell off my plate. > > So I spent a little bit of time today adding some trace_printks to the > timekeeping code so I could record the actual TSC and timestamps being > calculated from CLOCK_MONOTONIC_RAW. 
> > I did catch one error in the test code, which unfortunately I'm to blame for: > mid = start + (delta +(delta/2))/2; //round-closest > > That should be > mid = start + (delta +(2/2))/2 //round-closest > or more simply > mid = start + (delta +1)/2; //round-closest > > Generalized rounding should be: (value + (DIV/2))/DIV), but I'm > guessing with two as the divisor, my brain mixed it up and typed > "delta". My apologies! > > With that fix, I'm seeing closer to ~500ns of error in the > interpolation, just using the userland sampling. Now, I've also > disabled vsyscalls for this (otherwise I wouldn't be able to > trace_printk), so the error likely would be higher than with > vsyscalls. > > Now, part of the error is that: > start= rdtsc(); > clock_gettime(CLOCK_MONOTONIC_RAW, &ts); > end = rdtsc(); > > Ends up looking like > start= rdtsc(); > clock_gettime() { > now = rdtsc(); > delta = now - last; > ns = (delta * mult) >> shift > [~midpoint~] > ts->nsec = base_ns + ns; > ts->sec = base_sec; > normalize_ts(ts) > } > end = rdtsc(); > > And so by taking the mid-point we're always a little skewed from where > the tsc was actually read. Looking at the data for my case the tsc > read seems to be ~12% in, so you could instead try: > > delta = end - start; > p12 = start + ((delta * 12) + (100/2))/100; > > With that adjustment, I'm seeing error around ~40ns. > > Mind giving that a try? I tried both the new mid and p12. The error becomes even larger. With new mid (start + (delta +1)/2), the error is now ~3800ns With p12 adjustment, the error is ~6700ns. Here is how I run the test. $./time $perf record -e cycles:upp --clockid monotonic_raw $some_workaround $./time Here are some raw data. For the first ./time, start: 961886196018 end: 961886215603 MONO_RAW: 341485848531 For the second ./time, start: 986870117783 end: 986870136152 MONO_RAW: 351495432044 Here is the time generated from one PEBS record. 
TSC: 968210217271 PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072 Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is 344019506897. The error is 3825ns. Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831. The error is 6759ns Thanks, Kan > > Now, if you had two snapshots of MONOTONIC_RAW + the TSC value used to > calculate it(maybe the driver access this via a special internal > timekeeping interface), in my testing interpolating will give you > sub-ns error. So I think this is workable without exposing quite so > much to userland. > > thanks > -john
On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@linux.intel.com> wrote: > On 2023-03-08 8:17 p.m., John Stultz wrote: > > So I spent a little bit of time today adding some trace_printks to the > > timekeeping code so I could record the actual TSC and timestamps being > > calculated from CLOCK_MONOTONIC_RAW. > > > > I did catch one error in the test code, which unfortunately I'm to blame for: > > mid = start + (delta +(delta/2))/2; //round-closest > > > > That should be > > mid = start + (delta +(2/2))/2 //round-closest > > or more simply > > mid = start + (delta +1)/2; //round-closest > > > > Generalized rounding should be: (value + (DIV/2))/DIV), but I'm > > guessing with two as the divisor, my brain mixed it up and typed > > "delta". My apologies! > > > > With that fix, I'm seeing closer to ~500ns of error in the > > interpolation, just using the userland sampling. Now, I've also > > disabled vsyscalls for this (otherwise I wouldn't be able to > > trace_printk), so the error likely would be higher than with > > vsyscalls. > > > > Now, part of the error is that: > > start= rdtsc(); > > clock_gettime(CLOCK_MONOTONIC_RAW, &ts); > > end = rdtsc(); > > > > Ends up looking like > > start= rdtsc(); > > clock_gettime() { > > now = rdtsc(); > > delta = now - last; > > ns = (delta * mult) >> shift > > [~midpoint~] > > ts->nsec = base_ns + ns; > > ts->sec = base_sec; > > normalize_ts(ts) > > } > > end = rdtsc(); > > > > And so by taking the mid-point we're always a little skewed from where > > the tsc was actually read. Looking at the data for my case the tsc > > read seems to be ~12% in, so you could instead try: > > > > delta = end - start; > > p12 = start + ((delta * 12) + (100/2))/100; > > > > With that adjustment, I'm seeing error around ~40ns. > > > > Mind giving that a try? > > I tried both the new mid and p12. The error becomes even larger. > > With new mid (start + (delta +1)/2), the error is now ~3800ns > With p12 adjustment, the error is ~6700ns. 
> > > Here is how I run the test. > $./time > $perf record -e cycles:upp --clockid monotonic_raw $some_workaround > $./time > > Here are some raw data. > > For the first ./time, > start: 961886196018 > end: 961886215603 > MONO_RAW: 341485848531 > > For the second ./time, > start: 986870117783 > end: 986870136152 > MONO_RAW: 351495432044 > > Here is the time generated from one PEBS record. > TSC: 968210217271 > PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072 > > Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is > 344019506897. The error is 3825ns. > Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831. > The error is 6759ns Huh. I dunno. That seems wild that the error increased. Just in case something is going astray with the PEBS_MONO_RAW logic, can you apply the hack patch I was using to display the MONOTONIC_RAW values the kernel calculates? https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6 It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace to get the output. thanks -john
> conversion. So the kernel internal mul/shift information has to be exposed.

> Ugh. Well, I think perf may have made a bad API choice here, so I'm
> still going to push back on exposing timekeeping internals to
> userland.

It's not about the perf ABI. The perf mmap mult/offset is for PT, which
always has raw TSCs. Without it the PT decoder couldn't supply wall clock time.

-Andi
On 2023-03-11 12:55 a.m., John Stultz wrote: > On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >> On 2023-03-08 8:17 p.m., John Stultz wrote: >>> So I spent a little bit of time today adding some trace_printks to the >>> timekeeping code so I could record the actual TSC and timestamps being >>> calculated from CLOCK_MONOTONIC_RAW. >>> >>> I did catch one error in the test code, which unfortunately I'm to blame for: >>> mid = start + (delta +(delta/2))/2; //round-closest >>> >>> That should be >>> mid = start + (delta +(2/2))/2 //round-closest >>> or more simply >>> mid = start + (delta +1)/2; //round-closest >>> >>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm >>> guessing with two as the divisor, my brain mixed it up and typed >>> "delta". My apologies! >>> >>> With that fix, I'm seeing closer to ~500ns of error in the >>> interpolation, just using the userland sampling. Now, I've also >>> disabled vsyscalls for this (otherwise I wouldn't be able to >>> trace_printk), so the error likely would be higher than with >>> vsyscalls. >>> >>> Now, part of the error is that: >>> start= rdtsc(); >>> clock_gettime(CLOCK_MONOTONIC_RAW, &ts); >>> end = rdtsc(); >>> >>> Ends up looking like >>> start= rdtsc(); >>> clock_gettime() { >>> now = rdtsc(); >>> delta = now - last; >>> ns = (delta * mult) >> shift >>> [~midpoint~] >>> ts->nsec = base_ns + ns; >>> ts->sec = base_sec; >>> normalize_ts(ts) >>> } >>> end = rdtsc(); >>> >>> And so by taking the mid-point we're always a little skewed from where >>> the tsc was actually read. Looking at the data for my case the tsc >>> read seems to be ~12% in, so you could instead try: >>> >>> delta = end - start; >>> p12 = start + ((delta * 12) + (100/2))/100; >>> >>> With that adjustment, I'm seeing error around ~40ns. >>> >>> Mind giving that a try? >> >> I tried both the new mid and p12. The error becomes even larger. 
>> >> With new mid (start + (delta +1)/2), the error is now ~3800ns >> With p12 adjustment, the error is ~6700ns. >> >> >> Here is how I run the test. >> $./time >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround >> $./time >> >> Here are some raw data. >> >> For the first ./time, >> start: 961886196018 >> end: 961886215603 >> MONO_RAW: 341485848531 >> >> For the second ./time, >> start: 986870117783 >> end: 986870136152 >> MONO_RAW: 351495432044 >> >> Here is the time generated from one PEBS record. >> TSC: 968210217271 >> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072 >> >> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is >> 344019506897. The error is 3825ns. >> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831. >> The error is 6759ns > > Huh. I dunno. That seems wild that the error increased. > > Just in case something is going astray with the PEBS_MONO_RAW logic, > can you apply the hack patch I was using to display the MONOTONIC_RAW > values the kernel calculates? > https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6 > > It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace > to get the output. > $ ./time_3 start: 7358368893806 end: 7358368902944 delta: 9138 MONO_RAW: 2899739790738 MID: 7358368898375 P12: 7358368894903 $ sudo cat /sys/kernel/tracing/trace | grep time_3 time_3-1443 [002] ..... 2899.858936: ktime_get_raw_ts64: JDB: timekeeping_get_delta cycle_now: 7358368897679 time_3-1443 [002] ..... 2899.858937: ktime_get_raw_ts64: JDB: ktime_get_raw_ts64: 2899739790738 The error between MID and cycle_now is -696ns The error between P12 and cycle_now is 2776ns The time_3.c is attached. 
Thanks,
Kan

#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <errno.h>

static inline unsigned long rdtsc ()
{
	unsigned long var;
	unsigned int hi, lo;

	asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	var = ((unsigned long long int) hi << 32) | lo;
	return var;
}

typedef unsigned long long u64;

int main()
{
	struct timespec ts;
	u64 start, end, delta, mid, p12;

	do {
		start= rdtsc();
		clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
		end = rdtsc();
		delta = end-start;
	} while (delta > 20000); // make sure the reads were not preempted

	printf("start: %llu end: %llu delta: %llu\n", start, end, delta);
	printf("MONO_RAW: %llu\n", (u64)ts.tv_sec * 1000000000 + ts.tv_nsec);

	mid = start + (delta + 1)/2; //round-closest
	printf("MID: %llu\n", mid);
	p12 = start + ((delta * 12) + (100/2))/100;
	printf("P12: %llu\n", p12);
}

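The MID/P12 figures in the time_3 output above can be re-derived by hand, which also shows where inside the rdtsc() window the kernel's read landed on this machine. The helpers below are a small sketch for that check only; the names div_round_closest, window_point, read_position_pct and tick_diff are made up for illustration and are not part of time_3.c. Since MID, P12 and cycle_now are all raw TSC values, the differences come out in TSC ticks rather than nanoseconds.

```c
#include <stdint.h>

/* Round-to-nearest unsigned division: (value + div/2) / div. */
static uint64_t div_round_closest(uint64_t value, uint64_t div)
{
	return (value + div / 2) / div;
}

/* Estimated TSC value pct percent of the way through [start, end]. */
static uint64_t window_point(uint64_t start, uint64_t end, uint64_t pct)
{
	return start + div_round_closest((end - start) * pct, 100);
}

/* Where the traced read (cycle_now) sits in the window, in whole percent. */
static uint64_t read_position_pct(uint64_t start, uint64_t end,
				  uint64_t cycle_now)
{
	return div_round_closest((cycle_now - start) * 100, end - start);
}

/* Signed difference a - b between two TSC values, in ticks. */
static int64_t tick_diff(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b);
}
```

With start = 7358368893806 and end = 7358368902944, window_point() gives 7358368898375 at 50% and 7358368894903 at 12%, matching the MID and P12 lines. The traced cycle_now of 7358368897679 is 3873 ticks after start, i.e. roughly 42% into the 9138-tick window, so on this run the midpoint is off by 696 ticks and the 12% point by 2776 ticks.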
On Mon, Mar 13, 2023 at 2:19 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-03-11 12:55 a.m., John Stultz wrote:
> > Just in case something is going astray with the PEBS_MONO_RAW logic,
> > can you apply the hack patch I was using to display the MONOTONIC_RAW
> > values the kernel calculates?
> > https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
> >
> > It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
> > to get the output.
> >
>
> $ ./time_3
> start: 7358368893806 end: 7358368902944 delta: 9138
> MONO_RAW: 2899739790738
> MID: 7358368898375
> P12: 7358368894903
> $ sudo cat /sys/kernel/tracing/trace | grep time_3
> time_3-1443 [002] ..... 2899.858936: ktime_get_raw_ts64:
> JDB: timekeeping_get_delta cycle_now: 7358368897679
> time_3-1443 [002] ..... 2899.858937: ktime_get_raw_ts64:
> JDB: ktime_get_raw_ts64: 2899739790738
>
> The error between MID and cycle_now is -696ns
> The error between P12 and cycle_now is 2776ns

Hey Kan,
  So I'm terribly sorry, I'm a bit underwater right now and haven't
had time to look deeper at this. The MID case you have above looks
closer to what I was seeing but I can't explain why the 12% case is
worse.

Since I feel it's not really fair to object to your patch but not have
the time to work through an alternative with you, I'm going to
withdraw my objection (though others may persist!).
I'd still really prefer if we avoided exposing internal timekeeping
state directly to userland, and it would be good to see some further
exploration in other directions, but there is the existing perf mmap
precedent (even if I dislike it). Sorry I can't be of more help to
find a better approach here. :(

thanks
-john

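For reference, the pure user-space estimate discussed in this thread appears to be a linear interpolation between two (TSC, CLOCK_MONOTONIC_RAW) samples taken before and after the perf run. The thread never shows that code, so the sketch below is an assumption about the method; struct clock_sample and interpolate_ns are hypothetical names. It relies on the unsigned __int128 extension of GCC/Clang to keep the product from overflowing and expects a.tsc <= tsc with a.tsc < b.tsc.

```c
#include <stdint.h>

/* One user-space sample: estimated TSC at the read and the clock value. */
struct clock_sample {
	uint64_t tsc;	/* e.g. the window midpoint from ./time */
	uint64_t ns;	/* CLOCK_MONOTONIC_RAW in nanoseconds */
};

/*
 * Linearly interpolate the clock value for an arbitrary TSC reading
 * (for example the TSC from a PEBS record) between samples a and b.
 * The division is rounded to the nearest nanosecond.
 */
static uint64_t interpolate_ns(struct clock_sample a, struct clock_sample b,
			       uint64_t tsc)
{
	uint64_t span = b.tsc - a.tsc;
	unsigned __int128 num =
		(unsigned __int128)(tsc - a.tsc) * (b.ns - a.ns);

	return a.ns + (uint64_t)((num + span / 2) / span);
}
```

Fed with the midpoints of Kan's two ./time runs (961886205811 and 986870126968) and the PEBS TSC 968210217271, this lands at about 344019506897, which is the guessed PEBS_MONO_RAW reported for the "new mid" case, roughly 3825 ns away from the value derived from the kernel conversion information.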
Hi John,

On 2023-03-18 2:02 a.m., John Stultz wrote:
> Hey Kan,
>   So I'm terribly sorry, I'm a bit underwater right now and haven't
> had time to look deeper at this. The MID case you have above looks
> closer to what I was seeing but I can't explain why the 12% case is
> worse.
>
> Since I feel it's not really fair to object to your patch but not have
> the time to work through an alternative with you, I'm going to
> withdraw my objection (though others may persist!).
> I'd still really prefer if we avoided exposing internal timekeeping
> state directly to userland, and it would be good to see some further
> exploration in other directions, but there is the existing perf mmap
> precedent (even if I dislike it). Sorry I can't be of more help to
> find a better approach here. :(
>

Thank you all the same.

I think we learnt that the pure user-space solution needs more work.
It is not a solution for the monotonic raw conversion for now. I have
no idea how to do the post-processing conversion without the internal
conversion information.

So, for now, there seem to be only two candidate solutions:
- Pure kernel solution (similar to V1).
- Expose the internal conversion information to user space and do
  the post-processing conversion there (V2).

I will ping Thomas in the other thread and see if he has any suggestions.

Thanks,
Kan

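To make the V2 option concrete, here is a rough sketch of the post-processing step, based only on the formula in the comment of the patch below. Several things are assumptions rather than facts from the patch: struct mono_raw_conv is a hypothetical local copy of the proposed time_mono_* fields, the shift is taken to apply only to the (cyc - last) * mult + nsec term, and base is taken to be in nanoseconds. A real reader of the mmap page would also need to take a consistent snapshot of the fields (the page's lock sequence counter exists for that), which is left out here.

```c
#include <stdint.h>

/* Hypothetical snapshot of the proposed time_mono_* fields. */
struct mono_raw_conv {
	uint64_t last;	/* time_mono_last: cycle count at the last update */
	uint32_t mult;	/* time_mono_mult */
	uint32_t shift;	/* time_mono_shift */
	uint64_t nsec;	/* time_mono_nsec: shifted nanosecond remainder */
	uint64_t base;	/* time_mono_base: assumed to be in nanoseconds */
};

/*
 * Convert a hardware timestamp 'cyc' (e.g. a TSC value from a sample)
 * to monotonic raw time, assuming the formula reads as
 *   base + (((cyc - last) * mult + nsec) >> shift)
 * A 128-bit intermediate (GCC/Clang extension) avoids overflow.
 */
static uint64_t cyc_to_mono_raw(const struct mono_raw_conv *c, uint64_t cyc)
{
	unsigned __int128 scaled =
		(unsigned __int128)(cyc - c->last) * c->mult + c->nsec;

	return c->base + (uint64_t)(scaled >> c->shift);
}
```

For example, with last = 1000, mult = 3, shift = 1, nsec = 1 and base = 100, a reading of cyc = 1010 gives 100 + ((10 * 3 + 1) >> 1) = 115.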
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ccb7f5dad59b..9d56fe027f6c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -455,7 +455,8 @@ struct perf_event_attr {
 				inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec : 1, /* event is removed from task on exec */
 				sigtrap : 1, /* send synchronous SIGTRAP on event */
-				__reserved_1 : 26;
+				hw_time : 1, /* generate raw HW time for samples */
+				__reserved_1 : 25;
 
 	union {
 		__u32 wakeup_events; /* wakeup every n events */
@@ -615,7 +616,8 @@ struct perf_event_mmap_page {
 				cap_user_time : 1, /* The time_{shift,mult,offset} fields are used */
 				cap_user_time_zero : 1, /* The time_zero field is used */
 				cap_user_time_short : 1, /* the time_{cycle,mask} fields are used */
-				cap_____res : 58;
+				cap_user_time_mono_raw : 1, /* The time_mono_* fields are used */
+				cap_____res : 57;
 		};
 	};
 
@@ -692,11 +694,24 @@ struct perf_event_mmap_page {
 	__u64 time_cycles;
 	__u64 time_mask;
 
+	/*
+	 * If cap_user_time_mono_raw, the monotonic raw clock can be calculated
+	 * from the hardware clock (e.g. TSC) 'cyc'.
+	 *
+	 * mono_raw = base + ((cyc - last) * mult + nsec) >> shift
+	 *
+	 */
+	__u64 time_mono_last;
+	__u32 time_mono_mult;
+	__u32 time_mono_shift;
+	__u64 time_mono_nsec;
+	__u64 time_mono_base;
+
 	/*
 	 * Hole for extension of the self monitor capabilities
 	 */
 
-	__u8 __reserved[116*8]; /* align to 1k. */
+	__u8 __reserved[112*8]; /* align to 1k. */
 
 	/*
 	 * Control data for the mmap() data buffer.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 380476a934e8..f062cce2dafc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12135,6 +12135,13 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 
 	if (attr->sigtrap && !attr->remove_on_exec)
 		return -EINVAL;
+	if (attr->use_clockid) {
+		/*
+		 * Only support post-processing for the monotonic raw clock
+		 */
+		if (attr->hw_time && (attr->clockid != CLOCK_MONOTONIC_RAW))
+			return -EINVAL;
+	}
 
 out:
 	return ret;