Message ID | 20230627110038.GCZJrBVqu/4BfdyBeN@fat_crate.local |
---|---|
State | New |
Headers | Date: Tue, 27 Jun 2023 13:00:38 +0200 From: Borislav Petkov <bp@alien8.de> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org> Subject: [GIT PULL] x86/misc for 6.5 Message-ID: <20230627110038.GCZJrBVqu/4BfdyBeN@fat_crate.local> |
Series | [GIT,PULL] x86/misc for 6.5 |
Pull-request
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_misc_for_v6.5

Message
Borislav Petkov
June 27, 2023, 11 a.m. UTC
Hi Linus,

please pull the misc pile of updates for 6.5.

Thx.

---

The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:

  Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_misc_for_v6.5

for you to fetch changes up to 5516c89d58283413134f8d26960c6303d5d5bd89:

  x86/lib: Make get/put_user() exception handling a visible symbol (2023-06-02 10:51:46 +0200)

----------------------------------------------------------------
- Remove the local symbols prefix of the get/put_user() exception handling
  symbols so that tools do not get confused by the presence of code belonging
  to the wrong symbol/not belonging to any symbol

- Improve csum_partial()'s performance

- Some improvements to the kcpuid tool

----------------------------------------------------------------
Borislav Petkov (AMD) (1):
      tools/x86/kcpuid: Dump the correct CPUID function in error

Nadav Amit (1):
      x86/lib: Make get/put_user() exception handling a visible symbol

Nathan Chancellor (1):
      x86/csum: Fix clang -Wuninitialized in csum_partial()

Noah Goldstein (1):
      x86/csum: Improve performance of `csum_partial`

Rong Tao (1):
      tools/x86/kcpuid: Add .gitignore

 arch/x86/lib/csum-partial_64.c   | 101 ++++++++----
 arch/x86/lib/getuser.S           |  32 ++--
 arch/x86/lib/putuser.S           |  24 +--
 lib/Kconfig.debug                |  17 ++
 lib/Makefile                     |   1 +
 lib/checksum_kunit.c             | 334 +++++++++++++++++++++++++++++++++++++++
 tools/arch/x86/kcpuid/.gitignore |   1 +
 tools/arch/x86/kcpuid/kcpuid.c   |   7 +-
 8 files changed, 453 insertions(+), 64 deletions(-)
 create mode 100644 lib/checksum_kunit.c
 create mode 100644 tools/arch/x86/kcpuid/.gitignore
Comments
On Tue, 27 Jun 2023 at 04:00, Borislav Petkov <bp@alien8.de> wrote:
>
> - Improve csum_partial()'s performance

Honestly, looking at that patch, my reaction is "why did it get
unrolled in 64-byte chunks, if 40 bytes is the magic value"?

Particularly when there is then that "do a carry op each 32 bytes to
make 32-byte chunks independent and increase ILP". So even the 64-byte
case isn't *actually* doing a 64-byte unrolling, it's really doing two
32-byte unrollings in parallel.

So you have three "magic" values, and the only one that really matters
is likely the 40-byte one.

Yes, yes, 64 bytes is the usual cacheline size, and is "traditional"
for unrolling. But there's nothing really magical about it here.

End result: wouldn't it have been nice to just do 40-byte chunks, and
make the 64-byte "two overlapping 32-byte chunks" be two of the
40-byte chunks.

Something like the (ENTIRELY UNTESTED!) attached patch?

Again: this is *not* tested. I took a quick look at the generated
assembly, and it looked roughly like what I expected it to look like,
but it may be complete garbage.

I added a couple of "likely()" things just because it made the
generated asm look more natural (ie it followed the order of the
source code there), they are otherwise questionable annotations.

Finally: did I already mention that this is completely untested?

               Linus
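[Editor's note: the "two independent 32-byte carry chains" idea is easy to see in portable C. The sketch below is purely illustrative and is not the kernel code; `csum_add64` uses `__int128` as a portable stand-in for what an `addq`/`adcq` pair computes on x86.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 64-bit one's-complement add: sum + v with the carry-out folded
 * back in. This is what an addq/adcq pair computes on x86. */
static uint64_t csum_add64(uint64_t sum, uint64_t v)
{
    unsigned __int128 t = (unsigned __int128)sum + v;
    return (uint64_t)t + (uint64_t)(t >> 64);
}

/* Illustrative version of the "two independent chains" trick: each
 * 64-byte block is summed as two 32-byte halves into separate
 * accumulators, so the two carry chains can execute in parallel. */
static uint64_t csum_64byte_blocks(const uint64_t *p, size_t nblocks)
{
    uint64_t a = 0, b = 0;

    while (nblocks--) {
        a = csum_add64(a, p[0]);
        a = csum_add64(a, p[1]);
        a = csum_add64(a, p[2]);
        a = csum_add64(a, p[3]);
        b = csum_add64(b, p[4]);
        b = csum_add64(b, p[5]);
        b = csum_add64(b, p[6]);
        b = csum_add64(b, p[7]);
        p += 8;
    }
    return csum_add64(a, b);
}
```

The out-of-order machine can interleave the `a` and `b` chains because neither depends on the other's carry; only the final `csum_add64(a, b)` joins them.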
On Tue, 27 Jun 2023 at 13:11, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Finally: did I already mention that this is completely untested?

Oh, this part is buggy:

+       asm("addq %1,%0\n\t"
+           "adcq $0,%0"
+           :"=r" (temp64): "r" (temp64_2));

and it needs to show that 'temp64' is an input too. Dummy me.

The trivial fix is just to make the "=r" be a "+r".

In fact, I should have used "+r" inside update_csum_40b(), but at
least there I did add the proper input constraint, so that one isn't
actively buggy.

And again: I noticed this by looking at the patch one more time. No
actual *testing* has happened. It might still be buggy garbage even
with that "+r". It's just a bit *less* buggy garbage.

I will now go back to my cave and continue pulling stuff, I just had
to do something else for a while. Some people relax with a nice drink
by the pool, I relax by playing around with inline asm.

               Linus
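[Editor's note: for readers less fluent in GCC constraint syntax: `"=r"` tells the compiler the register is write-only, so it may not bother loading `temp64` into it first; `"+r"` marks it read-write. A portable sketch of what the corrected `addq`/`adcq` pair computes, using the GCC/Clang `__builtin_add_overflow` builtin instead of asm:]

```c
#include <assert.h>
#include <stdint.h>

/* The corrected asm from the thread both reads and writes temp64, so
 * its constraint must be "+r" (read-write), not "=r" (write-only):
 *
 *     asm("addq %1,%0\n\t"
 *         "adcq $0,%0"
 *         : "+r" (temp64) : "r" (temp64_2));
 *
 * Portable equivalent of that pair: a 64-bit add whose carry-out is
 * folded back into the sum (end-around carry). */
static uint64_t add_with_carry(uint64_t a, uint64_t b)
{
    uint64_t sum;
    int carry = __builtin_add_overflow(a, b, &sum);

    return sum + (uint64_t)carry;
}
```

With `"=r"`, the compiler is free to put `temp64_2` into an uninitialized register and add garbage to it; the code may even appear to work under light testing, which is exactly why the bug was caught by inspection rather than by running it.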
On Tue, Jun 27, 2023 at 01:26:18PM -0700, Linus Torvalds wrote:
> I will now go back to my cave and continue pulling stuff, I just had
> to do something else for a while. Some people relax with a nice drink
> by the pool, I relax by playing around with inline asm.

And there's a third kind who relax by the pool with a nice drink,
*while* playing around with inline asm. ;-P

Btw, I'll send you a new version of this pull request with this patch
dropped to let folks experiment with it more.

Thx.
On Tue, 27 Jun 2023 at 13:38, Borislav Petkov <bp@alien8.de> wrote:
>
> And there's a third kind who relax by the pool with a nice drink,
> *while* playing around with inline asm. ;-P

That explains a lot.

> Btw, I'll send you a new version of this pull request with this patch
> dropped to let folks experiment with it more.

Oh, I already merged it.

I don't hate the change, I just looked at it and went "I would have
done that differently" and started playing around with it.

There's nothing hugely *wrong* with the code I merged, but I do think
that it did too much inside the inline asm (ie looping inside the asm,
but also initializing values that could have - and should have - just
been given as inputs to the asm).

And the whole "why have two different versions for 40-byte and 64-byte
areas, when you _could_ just do it with one 40-byte one that you then
also just unroll".

So I _think_ my version is nicer and shorter - assuming it works and
there are no other bugs than the one I already noticed - but I don't
think it's a huge deal.

Anyway, before I throw my patch away, I'll just post it with the
trivial fixes to use "+r", and with the "volatile" removed (I add
"volatile" to asms by habit, but this one really isn't volatile).

I just checked that both gcc and clang seem to be happy with it, but
that's the only testing this patch has gotten: it compiles for me.

Do with it what you will.

               Linus
The pull request you sent on Tue, 27 Jun 2023 13:00:38 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_misc_for_v6.5
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4baa098a147d76a9ad1a6867fa14286db52085b6
Thank you!
On Tue, Jun 27, 2023 at 01:49:12PM -0700, Linus Torvalds wrote:
> That explains a lot.
LOL!
That activity has one rule, though: don't send the code on the same
day as the pool visit. :-)
On 6/27/2023 1:11 PM, Linus Torvalds wrote:
> On Tue, 27 Jun 2023 at 04:00, Borislav Petkov <bp@alien8.de> wrote:
>>
>> - Improve csum_partial()'s performance
>
> Honestly, looking at that patch, my reaction is "why did it get
> unrolled in 64-byte chunks, if 40 bytes is the magic value"?
>
> Particularly when there is then that "do a carry op each 32 bytes to
> make 32-byte chunks independent and increase ILP". So even the 64-byte
> case isn't *actually* doing a 64-byte unrolling, it's really doing two
> 32-byte unrollings in parallel.
>
> So you have three "magic" values, and the only one that really matters
> is likely the 40-byte one.
>
> Yes, yes, 64 bytes is the usual cacheline size, and is "traditional"
> for unrolling. But there's nothing really magical about it here.
>
> End result: wouldn't it have been nice to just do 40-byte chunks, and
> make the 64-byte "two overlapping 32-byte chunks" be two of the
> 40-byte chunks.
>
> Something like the (ENTIRELY UNTESTED!) attached patch?
>
> Again: this is *not* tested. I took a quick look at the generated
> assembly, and it looked roughly like what I expected it to look like,
> but it may be complete garbage.
>
> I added a couple of "likely()" things just because it made the
> generated asm look more natural (ie it followed the order of the
> source code there), they are otherwise questionable annotations.
>
> Finally: did I already mention that this is completely untested?

fwiw long flights and pools have a relation; I made a userspace testbench
for this some time ago: https://github.com/fenrus75/csum_partial
in case one would actually WANT to test ;)
On Tue, 27 Jun 2023 at 13:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, before I throw my patch away, I'll just post it with the
> trivial fixes to use "+r", and with the "volatile" removed (I add
> "volatile" to asms by habit, but this one really isn't volatile).

Oh, never mind.

I was about to throw it away, and then I realized that the code
*after* the loop relied on the range having been reduced down to below
64 bytes, and checked for 32/16/8/4 byte ranges.

And my change to make it loop over 80 bytes had made that no longer be
true.

But now I'm committed, and decided to fix that too, and just
re-organize the code to get all the cases right.

And now I'm going to actually boot-test the end result too. Because
life is too short to spend all my time _just_ with merging.

               Linus
On Tue, 27 Jun 2023 at 14:44, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But now I'm committed, and decided to fix that too, and just
> re-organize the code to get all the cases right.
>
> And now I'm going to actually boot-test the end result too. Because
> life is too short to spend all my time _just_ with merging.

Well, it boots. And I clearly have networking. But who knows how much
that is actually using the csum_partial() function? Not me. I'm just
along for the ride.

Anyway, that last version handles the 40-byte special case
differently, in that it might have done some arbitrary number of
80-byte chunks first. But it shouldn't really make a difference - it
does check for >= 80 bytes first, but we're talking two extra
instructions.

And that way the end case is always less than 64 bytes, and so the
tests for 32/16/8 work fine.

And now it's committed to my test tree, so I'm not throwing it away,
but I also won't be working on it any more. If somebody wants to time
it using Arjan's little thing, more power to them.

               Linus
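[Editor's note: the invariant being discussed is that the main loop must leave fewer than 64 bytes before the 32/16/8-byte tail tests, or those tests are no longer exhaustive. A portable illustration of such a tail handler, with a hypothetical name and `__int128` standing in for the x86 carry arithmetic:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add with end-around carry (portable stand-in for addq/adcq). */
static uint64_t add_c(uint64_t a, uint64_t b)
{
    unsigned __int128 t = (unsigned __int128)a + b;
    return (uint64_t)t + (uint64_t)(t >> 64);
}

/* Hypothetical tail handler: REQUIRES len < 64. It peels one 32-byte,
 * one 16-byte, and one 8-byte piece, then folds the last 0..7 bytes
 * zero-padded. If the main loop can exit with >= 64 bytes left, some
 * data is silently skipped - the bug class Linus describes. */
static uint64_t csum_tail(const unsigned char *p, size_t len, uint64_t sum)
{
    uint64_t w;

    if (len >= 32) {
        for (int i = 0; i < 4; i++) {
            memcpy(&w, p + 8 * i, 8);
            sum = add_c(sum, w);
        }
        p += 32; len -= 32;
    }
    if (len >= 16) {
        memcpy(&w, p, 8);     sum = add_c(sum, w);
        memcpy(&w, p + 8, 8); sum = add_c(sum, w);
        p += 16; len -= 16;
    }
    if (len >= 8) {
        memcpy(&w, p, 8);
        sum = add_c(sum, w);
        p += 8; len -= 8;
    }
    if (len) {                  /* 1..7 trailing bytes, zero-padded */
        w = 0;
        memcpy(&w, p, len);
        sum = add_c(sum, w);
    }
    return sum;
}
```

Switching the main loop from 64-byte to 80-byte strides changes the possible remainders from 0..63 to 0..79, which is exactly why the follow-up reorganization was needed.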
On Tue, 27 Jun 2023 at 14:44, Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> fwiw long flights and pools have a relation; I made a userspace testbench
> for this some time ago: https://github.com/fenrus75/csum_partial
> in case one would actually WANT to test ;)

Hmm.

I don't know what the rules are - and some of the functions you test
seem actively buggy (ie not handling alignment right etc).

But on my machine I get:

   02:      8.6 /     10.4 cycles (e29e455e) Upcoming linux kernel version
   04:      8.6 /     10.4 cycles (e29e455e) Specialized to size 40
   06:      7.7 /      9.5 cycles (e29e455e) New version
   22:      8.7 /      9.6 cycles (e29e455e) Odd-alignment handling removed

... which would seem to mean that my code ("New version") is doing well.

It does do worse on the "odd alignment" case:

   03:     15.5 /     17.8 cycles (00006580) Upcoming linux kernel version
   05:     15.5 /     17.8 cycles (00006580) Specialized to size 40
   07:     16.6 /     19.5 cycles (0000bc29) New version
   23:      8.8 /      8.6 cycles (1de29e47) Odd-alignment handling removed

... I just hacked the code into the benchmark without looking too
closely at what is going on, so no guarantees.

               Linus
On Tue, 27 Jun 2023 at 15:25, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I don't know what the rules are - and some of the functions you test
> seem actively buggy (ie not handling alignment right etc).

Oh. And you *only* test the 40-byte case. That seems a bit bogus.

If I change the packet size from 40 to 1500, I get

   02:    185.1 /    186.4 cycles (8b414316) Upcoming linux kernel version
   04:    184.9 /    186.5 cycles (8b414316) Specialized to size 40
   06:    107.3 /    117.2 cycles (8b414316) New version
   22:    185.6 /    186.5 cycles (8b414316) Odd-alignment handling removed

which seems unexpectedly bad for the other versions. But those other
functions have that 64-byte unrolling, rather than the "two 40-byte
loops", so maybe it is real, and my version is actually just that good.

Or maybe it's a sign that my version is some seriously buggy crap, and
it just looks good on the benchmark because it does the wrong thing.

Whatever. Back to the merge window again.

               Linus
On 6/27/2023 3:43 PM, Linus Torvalds wrote:
> On Tue, 27 Jun 2023 at 15:25, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I don't know what the rules are - and some of the functions you test
>> seem actively buggy (ie not handling alignment right etc).
>
> Oh. And you *only* test the 40-byte case. That seems a bit bogus.
>
> If I change the packet size from 40 to 1500, I get
>
>    02:    185.1 /    186.4 cycles (8b414316) Upcoming linux kernel version
>    04:    184.9 /    186.5 cycles (8b414316) Specialized to size 40
>    06:    107.3 /    117.2 cycles (8b414316) New version
>    22:    185.6 /    186.5 cycles (8b414316) Odd-alignment handling removed
>
> which seems unexpectedly bad for the other versions.

I'm not surprised though; running 2 parallel streams (where one stream
has a fixed zero as input, so can run OOO any time) .. can really have
a performance change like this
On Tue, 27 Jun 2023 at 15:51, Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> I'm not surprised though; running 2 parallel streams (where one stream
> has a fixed zero as input, so can run OOO any time) .. can really have
> a performance change like this

How much do people care?

One of the advantages of just having that single "update_csum_40b()"
function is that it's trivial to then manually unroll.

With a 4-way unrolling, I get

   02:    184.0 /    184.5 cycles (8b414316) Upcoming linux kernel version
   04:    184.0 /    184.2 cycles (8b414316) Specialized to size 40
   06:     89.4 /    102.5 cycles (512daed6) New version
   22:    184.6 /    184.4 cycles (8b414316) Odd-alignment handling removed

but doesn't most network hardware do the csum on its own anyway? How
critical is csum_partial(), really?

(The above is obviously your test thing modified for 1500 byte
packets, still. With 40-byte packets, the 4-way unrolling obviously
doesn't help, although it doesn't noticeably hurt either - it's just
one more compare and branch)

               Linus
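[Editor's note: the "single building block, trivially unrolled" structure can be sketched in portable C. These are hypothetical stand-ins for `update_csum_40b()` and its 4-way unroll (the real code is C with x86 inline asm); `add_c` uses `__int128` in place of `adcq`.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add with end-around carry (portable stand-in for addq/adcq). */
static uint64_t add_c(uint64_t a, uint64_t b)
{
    unsigned __int128 t = (unsigned __int128)a + b;
    return (uint64_t)t + (uint64_t)(t >> 64);
}

/* Hypothetical stand-in for update_csum_40b(): fold five 64-bit
 * words (40 bytes) into the running sum. */
static uint64_t update_csum_40b(uint64_t sum, const unsigned char *p)
{
    for (int i = 0; i < 5; i++) {
        uint64_t w;
        memcpy(&w, p + 8 * i, 8);
        sum = add_c(sum, w);
    }
    return sum;
}

/* 4-way unroll: four independent 40-byte chains per 160-byte
 * iteration, joined only at the end, so their carry chains can
 * overlap in the out-of-order core. */
static uint64_t csum_160b_blocks(const unsigned char *p, size_t n,
                                 uint64_t sum)
{
    uint64_t s1 = 0, s2 = 0, s3 = 0;

    while (n--) {
        sum = update_csum_40b(sum, p);
        s1  = update_csum_40b(s1, p + 40);
        s2  = update_csum_40b(s2, p + 80);
        s3  = update_csum_40b(s3, p + 120);
        p += 160;
    }
    return add_c(add_c(sum, s1), add_c(s2, s3));
}
```

For a 40-byte header the unrolled loop body never executes, which is why the unroll costs only one extra compare-and-branch in the small case.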
On 6/27/2023 4:02 PM, Linus Torvalds wrote:
> On Tue, 27 Jun 2023 at 15:51, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>> I'm not surprised though; running 2 parallel streams (where one stream
>> has a fixed zero as input, so can run OOO any time) .. can really have
>> a performance change like this
>
> How much do people care?
>
> One of the advantages of just having that single "update_csum_40b()"
> function is that it's trivial to then manually unroll.
>
> With a 4-way unrolling, I get
>
>    02:    184.0 /    184.5 cycles (8b414316) Upcoming linux kernel version
>    04:    184.0 /    184.2 cycles (8b414316) Specialized to size 40
>    06:     89.4 /    102.5 cycles (512daed6) New version
>    22:    184.6 /    184.4 cycles (8b414316) Odd-alignment handling removed
>
> but doesn't most network hardware do the csum on its own anyway? How
> critical is csum_partial(), really?

the hardware does most cases.. in
https://lore.kernel.org/netdev/20211111181025.2139131-1-eric.dumazet@gmail.com/
Eric kind of implies it's for IPv6 headers in practice
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> but doesn't most network hardware do the csum on its own anyway? How
> critical is csum_partial(), really?

That hardware will checksum the bulk of the packet, but we still need
to checksum the header bits (e.g., 40 bytes or less) when we add and
remove headers in software.

The one exception is Realtek drivers, which seem to come with checksum
offload disabled by default.

Cheers,
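[Editor's note: for context, the value `csum_partial()` feeds into for those headers is the standard RFC 1071 Internet checksum: 16-bit one's-complement sum, carries folded back in, result complemented. A minimal, unoptimized reference version (not the kernel's implementation):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: sum 16-bit big-endian words with
 * end-around carry, then take the one's complement. The kernel's
 * csum_partial() computes the same mathematical result, just with
 * wide 64-bit adds and heavy unrolling. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)data[0] << 8 | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)               /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A handy property of one's-complement arithmetic: checksumming a header whose checksum field already holds the correct value yields 0, which is how receivers verify headers.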