[v2,4/4] x86/percpu: Use C for percpu read/write accessors

Message ID: 20231004192404.31733-1-ubizjak@gmail.com
State: New

Commit Message

Uros Bizjak Oct. 4, 2023, 7:23 p.m. UTC
The percpu code mostly uses inline assembly. Using segment qualifiers
allows plain C code to be used instead, which enables the compiler to
perform various optimizations (e.g. propagation of memory arguments).
Convert the percpu read and write accessors to C code, so the memory
argument can be propagated to the instruction that uses it.
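
As a minimal sketch of the idea (the per-CPU variable and accessor
below are made up for illustration and are not part of this patch), a
__seg_gs-qualified object can be read with an ordinary C expression:

  typedef unsigned char u8;

  /* Illustration only: stands in for any 1-byte per-CPU variable. */
  extern __seg_gs u8 foo_count;

  unsigned int read_foo_count(void)
  {
          /* may compile to: movzbl %gs:foo_count(%rip),%eax */
          return foo_count;
  }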

Some examples of propagations:

a) into sign/zero extensions:

the code improves from:

 65 8a 05 00 00 00 00    mov    %gs:0x0(%rip),%al
 0f b6 c0                movzbl %al,%eax

to:

 65 0f b6 05 00 00 00    movzbl %gs:0x0(%rip),%eax
 00

and in a similar way for:

    movzbl %gs:0x0(%rip),%edx
    movzwl %gs:0x0(%rip),%esi
    movzbl %gs:0x78(%rbx),%eax

    movslq %gs:0x0(%rip),%rdx
    movslq %gs:(%rdi),%rbx

b) into compares:

the code improves from:

 65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax
 a9 00 00 0f 00          test   $0xf0000,%eax

to:

 65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
 00 00 0f 00

and in a similar way for:

    testl  $0xf0000,%gs:0x0(%rip)
    testb  $0x1,%gs:0x0(%rip)
    testl  $0xff00,%gs:0x0(%rip)

    cmpb   $0x0,%gs:0x0(%rip)
    cmp    %gs:0x0(%rip),%r14d
    cmpw   $0x8,%gs:0x0(%rip)
    cmpb   $0x0,%gs:(%rax)

c) into other insns:

the code improves from:

   1a355:	83 fa ff             	cmp    $0xffffffff,%edx
   1a358:	75 07                	jne    1a361 <...>
   1a35a:	65 8b 15 00 00 00 00 	mov    %gs:0x0(%rip),%edx
   1a361:

to:

   1a35a:	83 fa ff             	cmp    $0xffffffff,%edx
   1a35d:	65 0f 44 15 00 00 00 	cmove  %gs:0x0(%rip),%edx
   1a364:	00

The above propagations result in the following code size
improvements for the current mainline kernel (with the default
config), compiled with

gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

   text    data     bss     dec     hex filename
25508862        4386540  808388 30703790        1d480ae vmlinux-vanilla.o
25500922        4386532  808388 30695842        1d461a2 vmlinux-new.o

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Co-developed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
v2: Rewrite code examples in the commit message.
---
 arch/x86/include/asm/percpu.h | 65 +++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 11 deletions(-)
  

Comments

Linus Torvalds Oct. 4, 2023, 7:42 p.m. UTC | #1
Unrelated reaction..

On Wed, 4 Oct 2023 at 12:24, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> the code improves from:
>
>  65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax
>  a9 00 00 0f 00          test   $0xf0000,%eax
>
> to:
>
>  65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
>  00 00 0f 00

Funky.

Why does gcc generate that full-width load from memory, and not demote
it to a byte test?

IOW, it should not be

  65 f7 05 00 00 00 00 testl  $0xf0000,%gs:0x0(%rip)
  00 00 0f 00

after optimizing it, it should be three bytes shorter at

  65 f6 05 00 00 00 00 testb  $0xf,%gs:0x0(%rip)
  0f

instead (this is "objdump", so it doesn't show that the relocation
entry has changed by +2 to compensate).

Now, doing the access narrowing is a bad idea for stores (because it
can cause subsequent loads to have conflicts in the store buffer), but
for loads it should always be a win to narrow the access.

I wonder why gcc doesn't do it. This is not related to __seg_gs - I
tried it with regular memory accesses too, and gcc kept those as
32-bit accesses too.

And no, the assembler can't optimize that operation either, since I
think changing the testl to a testb would change the 'P' bit in the
resulting eflags, so this is a "the compiler could pick a better
instruction choice" thing.

I'm probably missing some reason why gcc wouldn't do this. But clang
does seem to do this obvious optimization.
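
For example, a reduced test case along these lines (plain memory
rather than %gs; names invented for illustration):

  int m;

  _Bool has_high_nibble(void)
  {
          /*
           * The 0xf0000 mask only covers byte 2 of 'm', so the test
           * could be narrowed to "testb $0xf,m+2(%rip)", yet gcc keeps
           * the 32-bit "testl $0xf0000,m(%rip)" here.
           */
          return m & 0xf0000;
  }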

              Linus
  
Uros Bizjak Oct. 4, 2023, 8:07 p.m. UTC | #2
On Wed, Oct 4, 2023 at 9:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Unrelated reaction..
>
> On Wed, 4 Oct 2023 at 12:24, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > the code improves from:
> >
> >  65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax
> >  a9 00 00 0f 00          test   $0xf0000,%eax
> >
> > to:
> >
> >  65 f7 05 00 00 00 00    testl  $0xf0000,%gs:0x0(%rip)
> >  00 00 0f 00
>
> Funky.
>
> Why does gcc generate that full-width load from memory, and not demote
> it to a byte test?

It does when the LSB is accessed at the same address. For example:

int m;
_Bool foo (void) { return m & 0x0f; }

compiles to:

  0:   f6 05 00 00 00 00 0f    testb  $0xf,0x0(%rip)        # 7 <foo+0x7>

>
> IOW, it should not be
>
>   65 f7 05 00 00 00 00 testl  $0xf0000,%gs:0x0(%rip)
>   00 00 0f 00
>
> after optimizing it, it should be three bytes shorter at
>
>   65 f6 05 00 00 00 00 testb  $0xf,%gs:0x0(%rip)
>   0f
>
> instead (this is "objdump", so it doesn't show that the relocation
> entry has changed by +2 to compensate).
>
> Now, doing the access narrowing is a bad idea for stores (because it
> can cause subsequent loads to have conflicts in the store buffer), but
> for loads it should always be a win to narrow the access.
>
> I wonder why gcc doesn't do it. This is not related to __seg_gs - I
> tried it with regular memory accesses too, and gcc kept those as
> 32-bit accesses too.
>
> And no, the assembler can't optimize that operation either, since I
> think changing the testl to a testb would change the 'P' bit in the
> resulting eflags, so this is a "the compiler could pick a better
> instruction choice" thing.
>
> I'm probably missing some reason why gcc wouldn't do this. But clang
> does seem to do this obvious optimization.

You get a store forwarding stall when you write a bigger operand to
memory and then read part of it, if the smaller part doesn't start at
the same address.
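
I.e., roughly this kind of pattern (a made-up example, not taken from
the kernel):

  /* A wide store followed by a narrower load that starts at a
     different address inside the stored bytes. */
  union {
          unsigned int w;
          unsigned char b[4];
  } u;

  void store_then_partial_load(unsigned int x, unsigned char *out)
  {
          u.w = x;        /* 32-bit store */
          *out = u.b[2];  /* 1-byte load at a different start address */
  }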

Uros.
  
Linus Torvalds Oct. 4, 2023, 8:12 p.m. UTC | #3
On Wed, 4 Oct 2023 at 13:08, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> You get a store forwarding stall when you write a bigger operand to
> memory and then read part of it, if the smaller part doesn't start at
> the same address.

I don't think that has been true for over a decade now.

Afaik, any half-way modern Intel and AMD cores will forward any fully
contained load.

The whole "same address" was a P4 thing, iirc.

          Linus
  
Linus Torvalds Oct. 4, 2023, 8:19 p.m. UTC | #4
On Wed, 4 Oct 2023 at 13:12, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 4 Oct 2023 at 13:08, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > You get a store forwarding stall when you write a bigger operand to
> > memory and then read part of it, if the smaller part doesn't start at
> > the same address.
>
> I don't think that has been true for over a decade now.
>
> Afaik, any half-way modern Intel and AMD cores will forward any fully
> contained load.

https://www.agner.org/optimize/microarchitecture.pdf

See for example pg 136 (Sandy Bridge / Ivy Bridge):

 "Store forwarding works in the following cases:
  ..
  • When a write of 64 bits or less is followed by a read of a smaller
size which is fully contained in the write address range, regardless
of alignment"

and for AMD Zen cores:

  "Store forwarding of a write to a subsequent read works very well in
all cases, including reads from a part of the written data"

So forget the whole "same address" rule. It's simply not true or
relevant any more.

                  Linus
  
Uros Bizjak Oct. 4, 2023, 8:22 p.m. UTC | #5
On Wed, Oct 4, 2023 at 10:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 4 Oct 2023 at 13:12, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, 4 Oct 2023 at 13:08, Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > You get a store forwarding stall when you write a bigger operand to
> > > memory and then read part of it, if the smaller part doesn't start at
> > > the same address.
> >
> > I don't think that has been true for over a decade now.
> >
> > Afaik, any half-way modern Intel and AMD cores will forward any fully
> > contained load.
>
> https://www.agner.org/optimize/microarchitecture.pdf
>
> See for example pg 136 (Sandy Bridge / Ivy Bridge):
>
>  "Store forwarding works in the following cases:
>   ..
>   • When a write of 64 bits or less is followed by a read of a smaller
> size which is fully contained in the write address range, regardless
> of alignment"
>
> and for AMD Zen cores:
>
>   "Store forwarding of a write to a subsequent read works very well in
> all cases, including reads from a part of the written data"
>
> So forget the whole "same address" rule. It's simply not true or
> relevant any more.

No problem then, we will implement the optimization in the compiler.

Thanks,
Uros.
  
Ingo Molnar Oct. 5, 2023, 7:06 a.m. UTC | #6
* Uros Bizjak <ubizjak@gmail.com> wrote:

> The percpu code mostly uses inline assembly. Using segment qualifiers
> allows plain C code to be used instead, which enables the compiler to
> perform various optimizations (e.g. propagation of memory arguments).
> Convert the percpu read and write accessors to C code, so the memory
> argument can be propagated to the instruction that uses it.

>    text    data     bss     dec     hex filename
> 25508862        4386540  808388 30703790        1d480ae vmlinux-vanilla.o
> 25500922        4386532  808388 30695842        1d461a2 vmlinux-new.o

Ok, this all looks like a pretty nice optimization.

As discussed previously, I've created a new tip:x86/percpu topic branch
for this, based on tip:x86/asm that carries the other percpu patches.
This branch will be merged in v6.8, best-case scenario.

Also note that I lowered the version cutoff from GCC 13.1 to 12.1, for
the simple selfish reason to include my own daily systems in test coverage.

Is there any known bug fixed in the GCC 12.1 ... 13.1 version range that
could make this approach problematic?

Thanks,

	Ingo
  
Uros Bizjak Oct. 5, 2023, 7:40 a.m. UTC | #7
On Thu, Oct 5, 2023 at 9:06 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows plain C code to be used instead, which enables the compiler to
> > perform various optimizations (e.g. propagation of memory arguments).
> > Convert the percpu read and write accessors to C code, so the memory
> > argument can be propagated to the instruction that uses it.
>
> >    text    data     bss     dec     hex filename
> > 25508862        4386540  808388 30703790        1d480ae vmlinux-vanilla.o
> > 25500922        4386532  808388 30695842        1d461a2 vmlinux-new.o
>
> Ok, this all looks like a pretty nice optimization.

This is just the beginning; named AS enables several other
optimizations, as can be seen in Nadav's original patch series. If we
want to make the kernel PIE, then we have to get rid of the absolute
address in percpu_stable_op (AKA this_cpu_read_stable). We can use
__raw_cpu_read, but PATCH 7/7 [1] optimizes even further.

> As discussed previously, I've created a new tip:x86/percpu topic branch
> for this, based on tip:x86/asm that carries the other percpu patches.
> This branch will be merged in v6.8, best-case scenario.
>
> Also note that I lowered the version cutoff from GCC 13.1 to 12.1, for
> the simple selfish reason to include my own daily systems in test coverage.
>
> Is there any known bug fixed in the GCC 12.1 ... 13.1 version range that
> could make this approach problematic?

Not that I know of. I have done all of the work with GCC 12.3.1 (the
default Fedora 37 compiler) and additionally tested with GCC 13.2.1
(Fedora 38). I have made the patched kernel the default kernel on my
main workstation, and haven't encountered any problems since I
installed it a week ago.

If there are any problems encountered with the compiler, we (the GCC
compiler authors) can and will fix them promptly. I'd push for all
supported GCC versions, but maybe not just yet ;)

[1] https://lore.kernel.org/lkml/20190823224424.15296-8-namit@vmware.com/

Uros.
  

Patch

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index da451202a1b9..60ea7755c0fe 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -400,13 +400,66 @@  do {									\
 #define this_cpu_read_stable_8(pcp)	percpu_stable_op(8, "mov", pcp)
 #define this_cpu_read_stable(pcp)	__pcpu_size_call_return(this_cpu_read_stable_, pcp)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp)					\
+({									\
+	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));		\
+})
+
+#define __raw_cpu_write(qual, pcp, val)					\
+do {									\
+	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val);	\
+} while (0)
+
+#define raw_cpu_read_1(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val)	__raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val)	__raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_1(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_read_2(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_read_4(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_write_1(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_2(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_4(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp)		__raw_cpu_read(, pcp)
+#define raw_cpu_write_8(pcp, val)	__raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_8(pcp)		__raw_cpu_read(volatile, pcp)
+#define this_cpu_write_8(pcp, val)	__raw_cpu_write(volatile, pcp, val)
+#endif
+
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #define raw_cpu_read_1(pcp)		percpu_from_op(1, , "mov", pcp)
 #define raw_cpu_read_2(pcp)		percpu_from_op(2, , "mov", pcp)
 #define raw_cpu_read_4(pcp)		percpu_from_op(4, , "mov", pcp)
-
 #define raw_cpu_write_1(pcp, val)	percpu_to_op(1, , "mov", (pcp), val)
 #define raw_cpu_write_2(pcp, val)	percpu_to_op(2, , "mov", (pcp), val)
 #define raw_cpu_write_4(pcp, val)	percpu_to_op(4, , "mov", (pcp), val)
+
+#define this_cpu_read_1(pcp)		percpu_from_op(1, volatile, "mov", pcp)
+#define this_cpu_read_2(pcp)		percpu_from_op(2, volatile, "mov", pcp)
+#define this_cpu_read_4(pcp)		percpu_from_op(4, volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val)	percpu_to_op(1, volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val)	percpu_to_op(2, volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val)	percpu_to_op(4, volatile, "mov", (pcp), val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp)		percpu_from_op(8, , "mov", pcp)
+#define raw_cpu_write_8(pcp, val)	percpu_to_op(8, , "mov", (pcp), val)
+
+#define this_cpu_read_8(pcp)		percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)	percpu_to_op(8, volatile, "mov", (pcp), val)
+#endif
+
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #define raw_cpu_add_1(pcp, val)		percpu_add_op(1, , (pcp), val)
 #define raw_cpu_add_2(pcp, val)		percpu_add_op(2, , (pcp), val)
 #define raw_cpu_add_4(pcp, val)		percpu_add_op(4, , (pcp), val)
@@ -432,12 +485,6 @@  do {									\
 #define raw_cpu_xchg_2(pcp, val)	raw_percpu_xchg_op(pcp, val)
 #define raw_cpu_xchg_4(pcp, val)	raw_percpu_xchg_op(pcp, val)
 
-#define this_cpu_read_1(pcp)		percpu_from_op(1, volatile, "mov", pcp)
-#define this_cpu_read_2(pcp)		percpu_from_op(2, volatile, "mov", pcp)
-#define this_cpu_read_4(pcp)		percpu_from_op(4, volatile, "mov", pcp)
-#define this_cpu_write_1(pcp, val)	percpu_to_op(1, volatile, "mov", (pcp), val)
-#define this_cpu_write_2(pcp, val)	percpu_to_op(2, volatile, "mov", (pcp), val)
-#define this_cpu_write_4(pcp, val)	percpu_to_op(4, volatile, "mov", (pcp), val)
 #define this_cpu_add_1(pcp, val)	percpu_add_op(1, volatile, (pcp), val)
 #define this_cpu_add_2(pcp, val)	percpu_add_op(2, volatile, (pcp), val)
 #define this_cpu_add_4(pcp, val)	percpu_add_op(4, volatile, (pcp), val)
@@ -476,8 +523,6 @@  do {									\
  * 32 bit must fall back to generic operations.
  */
 #ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp)			percpu_from_op(8, , "mov", pcp)
-#define raw_cpu_write_8(pcp, val)		percpu_to_op(8, , "mov", (pcp), val)
 #define raw_cpu_add_8(pcp, val)			percpu_add_op(8, , (pcp), val)
 #define raw_cpu_and_8(pcp, val)			percpu_to_op(8, , "and", (pcp), val)
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
@@ -486,8 +531,6 @@  do {									\
 #define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
 #define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval)	percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
 #define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
 #define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
 #define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)