x86/vdso: Use non-serializing instruction rdtsc

Message ID tencent_4DC4468312A1CB2CA34B0215FAD797D11F07@qq.com
State New

Commit Message

Rong Tao May 16, 2023, 6:52 a.m. UTC
  From: Rong Tao <rongtao@cestc.cn>

Replacing rdtscp or 'lfence;rdtsc' with the non-serializable instruction
rdtsc can achieve a 40% performance improvement with only a small loss of
precision.

The RDTSCP instruction is not a serializing instruction, but it does wait
until all previous instructions have executed and all previous loads are
globally visible. The RDTSC instruction is not a serializing instruction.
It does not necessarily wait until all previous instructions have been
executed before reading the counter.

Measure the time consumed by vDSO clock_gettime(); pseudocode:

    count = 1000 * 1000 * 100;
    while (count--)
        clock_gettime(CLOCK_REALTIME, &ts);

Elapsed time comparison:

    Elapsed time (ns) | rdtsc_ordered() |  rdtsc()  | Reduction
    ------------------+-----------------+-----------+-----------
    Physical machine  |  1269147289     | 759067324 |    40%
    Guest OS (KVM)    |  1756615963     | 995823886 |    43%
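
The measurement above can be approximated in user space with the sketch below. It is not the author's actual harness: the helper names `time_clock_gettime_ns` and `reduction_percent` are hypothetical, and the loop is timed with CLOCK_MONOTONIC as an assumption, since the posting does not say how the totals were taken.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Run `calls` back-to-back clock_gettime(CLOCK_REALTIME) reads and return
 * the elapsed time in nanoseconds, measured with CLOCK_MONOTONIC.
 */
static int64_t time_clock_gettime_ns(uint64_t calls)
{
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < calls; i++)
        clock_gettime(CLOCK_REALTIME, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
           (end.tv_nsec - start.tv_nsec);
}

/* Relative reduction in total time, in percent. */
static double reduction_percent(uint64_t before_ns, uint64_t after_ns)
{
    assert(before_ns > 0 && after_ns <= before_ns);
    return 100.0 * (double)(before_ns - after_ns) / (double)before_ns;
}
```

Applied to the totals in the table, `reduction_percent()` gives roughly 40.2% on the physical machine and 43.3% in the guest. Divided by the 10^8 iterations, that is about 12.7 ns versus 7.6 ns per call on bare metal.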

Signed-off-by: Rong Tao <rongtao@cestc.cn>
---
 arch/x86/include/asm/vdso/gettimeofday.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
  

Comments

Dave Hansen May 16, 2023, 2:12 p.m. UTC | #1
On 5/15/23 23:52, Rong Tao wrote:
> Replacing rdtscp or 'lfence;rdtsc' with the non-serializable instruction
> rdtsc can achieve a 40% performance improvement with only a small loss of
> precision.

I think the minimum that can be done in a changelog like this is to
figure out _why_ a RDTSCP was in use.  There are a ton of things that
can make the kernel go faster, but not all of them are a good idea.

I assume that the folks that wrote this had good reason for not using
plain RDTSC.  What were those reasons?
  
Thomas Gleixner May 16, 2023, 2:20 p.m. UTC | #2
Rong!

On Tue, May 16 2023 at 14:52, Rong Tao wrote:
> Replacing rdtscp or 'lfence;rdtsc' with the non-serializable instruction
> rdtsc can achieve a 40% performance improvement with only a small loss of
> precision.

That rdtsc_ordered() is not there to achieve precision. It's there to
guarantee correctness. The correctness requirement is that reading clock
MONOTONIC is strictly monotonic, i.e. there is no way that you can
observe time going backwards. Neither locally nor across CPUs.

As you explained:

> The RDTSC instruction is not a serializing instruction.  It does not
> necessarily wait until all previous instructions have been executed
> before reading the counter.

Q: What guarantees that this does not speculate deep enough to actually
   make time go backwards?

A: Nothing

Conclusion: The fence stays, unless you can prove the contrary under all
circumstances and microarchitecture generations.

Thanks,

        tglx
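
The local half of the requirement stated above (time never goes backwards on a single thread) can be checked with a sketch like the one below. The function names are hypothetical, and the check is single-threaded only, so it says nothing about ordering across CPUs. A clean run proves nothing for other microarchitectures.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC as a single nanosecond count. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Take `samples` back-to-back readings and return how many of them were
 * smaller than their predecessor, i.e. how often time went backwards.
 */
static uint64_t count_backward_steps(uint64_t samples)
{
    uint64_t backwards = 0;
    uint64_t prev = monotonic_ns();

    for (uint64_t i = 0; i < samples; i++) {
        uint64_t now = monotonic_ns();

        if (now < prev)
            backwards++;
        prev = now;
    }
    return backwards;
}
```

On a kernel that keeps the fence, `count_backward_steps()` is expected to return 0 for any sample count.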
  
H. Peter Anvin May 16, 2023, 5:57 p.m. UTC | #3
On May 16, 2023 7:12:34 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>On 5/15/23 23:52, Rong Tao wrote:
>> Replacing rdtscp or 'lfence;rdtsc' with the non-serializable instruction
>> rdtsc can achieve a 40% performance improvement with only a small loss of
>> precision.
>
>I think the minimum that can be done in a changelog like this is to
>figure out _why_ a RDTSCP was in use.  There are a ton of things that
>can make the kernel go faster, but not all of them are a good idea.
>
>I assume that the folks that wrote this had good reason for not using
>plain RDTSC.  What were those reasons?

I believe the motivation is that it is atomic with reading the CPU number.
  
Thomas Gleixner May 16, 2023, 8:39 p.m. UTC | #4
On Tue, May 16 2023 at 10:57, H. Peter Anvin wrote:
> On May 16, 2023 7:12:34 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>>On 5/15/23 23:52, Rong Tao wrote:
>>> Replacing rdtscp or 'lfence;rdtsc' with the non-serializable instruction
>>> rdtsc can achieve a 40% performance improvement with only a small loss of
>>> precision.
>>
>>I think the minimum that can be done in a changelog like this is to
>>figure out _why_ a RDTSCP was in use.  There are a ton of things that
>>can make the kernel go faster, but not all of them are a good idea.
>>
>>I assume that the folks that wrote this had good reason for not using
>>plain RDTSC.  What were those reasons?
>
> I believe the motivation is that it is atomic with reading the CPU number.

Believe belongs in the realm of religion and does not help much to
explain technical issues. :)

rdtsc_ordered() has actually useful comments and also see:
  https://lore.kernel.org/lkml/87ttwc73za.ffs@tglx

The Intel SDM and the AMD APM are both blurry about RDTSC speculation and
we've observed (quite some time ago) situations where the RDTSC value
was clearly from the past solely due to speculation. So we had to bite
the bullet to add the fencing. Preferably RDTSCP or if not available
LFENCE; RDTSC. IIRC the original variant was even CPUID; RDTSC, which is
daft.
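
For illustration, the three read variants named above look roughly like this in user space. This is a sketch using the GCC/Clang intrinsics from <x86intrin.h>, not the kernel's rdtsc_ordered(), which picks a variant at boot via inline asm. The function names are hypothetical, and the code assumes an x86 CPU where RDTSC is permitted in user mode.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (x86 only) */

/* Plain RDTSC: may execute before earlier instructions have completed. */
static inline uint64_t tsc_plain(void)
{
    return __rdtsc();
}

/* LFENCE; RDTSC: the fence waits for earlier instructions to complete
 * locally before the counter is read. */
static inline uint64_t tsc_lfence(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* RDTSCP: waits for earlier instructions and loads by itself, and also
 * returns IA32_TSC_AUX (ignored here). Requires the rdtscp CPU flag. */
static inline uint64_t tsc_rdtscp(void)
{
    unsigned int aux;

    return __rdtscp(&aux);
}
```

Only the last two give the ordering discussed in this thread. `tsc_plain()` corresponds to what the patch proposes.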

The time readout does (simplified):

    do {
           // Wait for the sequence count to become even
           while ((seq = READ_ONCE(vd->seq)) & 1);

           tsc = rdtsc_ordered();
           now = convert(vd, tsc);
    } while (seq != READ_ONCE(vd->seq));

It's obviously more complex than that, but you get the idea.

Now replace RDTSCP with RDTSC and explain what guarantees that
the TSC read isn't speculated ahead of the sequence check.

If it's architecturally guaranteed that this can't happen, I'm more than
happy to use plain RDTSC.

But as I've observed that myself in the past, I'm pretty sure that it is
not guaranteed, at least not on older microarchitectures. If newer ones
make that guarantee then they should have exposed that as a feature bit
in CPUID and clearly documented it in the SDM.

As long as that does not happen, I'm sticking to the correctness first
principle.

Thanks,

        tglx
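
The simplified readout loop above can be written as a self-contained user-space sketch with C11 atomics. Everything here is an assumption for illustration: `struct clock_data`, `clock_update()`, `clock_read()` and `fake_counter()` are hypothetical names, the layout is not the kernel's vdso_data, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the vDSO data page; not the kernel's layout. */
struct clock_data {
    atomic_uint      seq;         /* odd while an update is in progress */
    _Atomic uint64_t base_ns;     /* time at cycle_last, in ns */
    _Atomic uint64_t cycle_last;  /* counter value at the last update */
    _Atomic uint32_t mult;        /* cycles -> ns multiplier */
    _Atomic uint32_t shift;       /* cycles -> ns shift */
};

/* Writer: make seq odd, publish new data, make seq even again. */
static void clock_update(struct clock_data *cd, uint64_t base_ns,
                         uint64_t cycle_last, uint32_t mult, uint32_t shift)
{
    unsigned int seq = atomic_load_explicit(&cd->seq, memory_order_relaxed);

    atomic_store_explicit(&cd->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&cd->base_ns, base_ns, memory_order_relaxed);
    atomic_store_explicit(&cd->cycle_last, cycle_last, memory_order_relaxed);
    atomic_store_explicit(&cd->mult, mult, memory_order_relaxed);
    atomic_store_explicit(&cd->shift, shift, memory_order_relaxed);
    atomic_store_explicit(&cd->seq, seq + 2, memory_order_release);
}

/* Reader: retry until a consistent snapshot was used for the conversion. */
static uint64_t clock_read(struct clock_data *cd,
                           uint64_t (*read_counter)(void))
{
    unsigned int seq;
    uint64_t now;

    do {
        /* Wait for the sequence count to become even. */
        while ((seq = atomic_load_explicit(&cd->seq,
                                           memory_order_acquire)) & 1)
            ;

        /*
         * The atomics order the loads from *cd, but they do not order an
         * unfenced RDTSC inside read_counter(). The counter read has to
         * bring its own ordering, which is the point of this thread.
         */
        uint64_t cycles = read_counter();
        uint64_t delta = cycles -
            atomic_load_explicit(&cd->cycle_last, memory_order_relaxed);

        now = atomic_load_explicit(&cd->base_ns, memory_order_relaxed) +
              ((delta * atomic_load_explicit(&cd->mult,
                                             memory_order_relaxed)) >>
               atomic_load_explicit(&cd->shift, memory_order_relaxed));

        atomic_thread_fence(memory_order_acquire);
    } while (seq != atomic_load_explicit(&cd->seq, memory_order_relaxed));

    return now;
}

/* Deterministic counter so the conversion can be checked by hand. */
static uint64_t fake_counter(void)
{
    return 150;
}
```

With `clock_update(&cd, 1000, 100, 2, 1)` and the fake counter, `clock_read()` computes 1000 + ((150 - 100) * 2 >> 1). In real use, `read_counter` would be a fenced TSC read.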
  
Andy Lutomirski May 16, 2023, 9:53 p.m. UTC | #5
On Mon, May 15, 2023, at 11:52 PM, Rong Tao wrote:
> From: Rong Tao <rongtao@cestc.cn>
>
> Replacing rdtscp or 'lfence;rdtsc' with the non-serializable instruction
> rdtsc can achieve a 40% performance improvement with only a small loss of
> precision.
>
> The RDTSCP instruction is not a serializing instruction, but it does wait
> until all previous instructions have executed and all previous loads are
> globally visible. The RDTSC instruction is not a serializing instruction.
> It does not necessarily wait until all previous instructions have been
> executed before reading the counter.
>
> Measure the time consumed by vDSO clock_gettime(); pseudocode:
>
>     count = 1000 * 1000 * 100;
>     while (count--)
>         clock_gettime(CLOCK_REALTIME, &ts);
>
> Elapsed time comparison:
>
>     Elapsed time (ns) | rdtsc_ordered() |  rdtsc()  | Reduction
>     ------------------+-----------------+-----------+-----------
>     Physical machine  |  1269147289     | 759067324 |    40%
>     Guest OS (KVM)    |  1756615963     | 995823886 |    43%
>
> Signed-off-by: Rong Tao <rongtao@cestc.cn>

Out of curiosity, what happens if you apply that patch and run this thing:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/tree/evil-clock-test.cc

Build it with g++ -O2 and run:

./evil-clock-test -c monotonic

--Andy
  
Rong Tao May 17, 2023, 12:41 a.m. UTC | #6
Thank you all very much for your responses. I ran the evil-clock-test[0]
code provided by Andy, and this patch does cause time read errors and
load errors.

    $ ./evil-clock-test.out -c monotonic
    CPU vendor   : GenuineIntel
    CPU model    : Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
    CPU stepping : 0
    TSC flags    : tsc rdtscp constant_tsc tsc_known_freq tsc_deadline_timer tsc_adjust
    Will test the "CLOCK_MONOTONIC" clock.
    Now test failed  : worst error 255 with 81902816 samples
    Load3 test failed: worst error 384 with 3284297 samples
    Load test passed : margin 32 with 18848374 samples
    Store test failed as expected: worst error 704 with 18213325 samples

Thanks again :)

[0] https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/tree/evil-clock-test.cc
  

Patch

diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h
index 4cf6794f9d68..342d29106208 100644
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -228,7 +228,7 @@  static u64 vread_pvclock(void)
 		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
 			return U64_MAX;
 
-		ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
+		ret = __pvclock_read_cycles(pvti, rdtsc());
 	} while (pvclock_read_retry(pvti, version));
 
 	return ret;
@@ -246,7 +246,7 @@  static inline u64 __arch_get_hw_counter(s32 clock_mode,
 					const struct vdso_data *vd)
 {
 	if (likely(clock_mode == VDSO_CLOCKMODE_TSC))
-		return (u64)rdtsc_ordered();
+		return (u64)rdtsc();
 	/*
 	 * For any memory-mapped vclock type, we need to make sure that gcc
 	 * doesn't cleverly hoist a load before the mode check.  Otherwise we