[RFC,0/2] Switch ARM to generic find_bit() API

Message ID 20221020032024.1804535-1-yury.norov@gmail.com
Headers
Series Switch ARM to generic find_bit() API |

Message

Yury Norov Oct. 20, 2022, 3:20 a.m. UTC
  Hi Russell, all,

I'd like to respin a patch that switches ARM to generic find_bit()
functions.

Generic code works on par with arch or better, according to my
testing [1], and with recent improvements merged in v6.1, it should
be even faster.

ARM already uses many generic find_bit() functions - those that it
doesn't implement. So we are talking about migrating a subset of the
API; most of find_bit() family has only generic implementation on ARM.

The only concern about this migration is that ARM code supports
byte-aligned bitmap addresses, while generic code is optimized for
word-aligned bitmaps.

In my practice, I've never seen unaligned bitmaps. But to check that on
ARM, I added a run-time check for bitmap alignment. I gave it run on
several architectures and found nothing.

Can you please check that on your hardware and compare performance of
generic vs arch code for you? If everything is OK, I suggest switching
ARM to generic find_bit() completely.

Thanks,
Yury

[1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Yury Norov (2):
  bitmap: add sanity check function for find_bit()
  arm: drop arch implementation for find_bit() functions

 arch/arm/include/asm/bitops.h |  63 -----------
 arch/arm/kernel/armksyms.c    |  11 --
 arch/arm/lib/Makefile         |   2 +-
 arch/arm/lib/findbit.S        | 193 ----------------------------------
 include/linux/find.h          |  35 ++++++
 lib/Kconfig.debug             |   7 ++
 6 files changed, 43 insertions(+), 268 deletions(-)
 delete mode 100644 arch/arm/lib/findbit.S
  

Comments

Russell King (Oracle) Oct. 20, 2022, 4:51 p.m. UTC | #1
On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> Hi Russell, all,
> 
> I'd like to respin a patch that switches ARM to generic find_bit()
> functions.
> 
> Generic code works on par with arch or better, according to my
> testing [1], and with recent improvements merged in v6.1, it should
> be even faster.
> 
> ARM already uses many generic find_bit() functions - those that it
> doesn't implement. So we are talking about migrating a subset of the
> API; most of find_bit() family has only generic implementation on ARM.
> 
> The only concern about this migration is that ARM code supports
> byte-aligned bitmap addresses, while generic code is optimized for
> word-aligned bitmaps.
> 
> In my practice, I've never seen unaligned bitmaps. But to check that on
> ARM, I added a run-time check for bitmap alignment. I gave it run on
> several architectures and found nothing.
> 
> Can you please check that on your hardware and compare performance of
> generic vs arch code for you? If everything is OK, I suggest switching
> ARM to generic find_bit() completely.
> 
> Thanks,
> Yury
> 
> [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

I _really_ don't want to play around with this stuff right now... 6.0
appears to have a regression on arm32 early on during boot:

[    1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0

Booting 5.19 with the same filesystem works without issue and without
even a fsck, but booting 6.0 always results in some problem that
prevents it booting.

Debugging this is not easy, because there also seems to be something
up with the bloody serial console - sometimes I get nothing, other
times I get nothing more than:

[    2.929502] EXT4-fs error (de

and then the output stops. Is the console no longer synchronous? If it
isn't, that's a huge mistake which can be seen right here with the
partial message output... so I also need to work out how to make the
console output synchronous again.
  
Yury Norov Oct. 20, 2022, 8:02 p.m. UTC | #2
On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote:
> On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> > Hi Russell, all,
> > 
> > I'd like to respin a patch that switches ARM to generic find_bit()
> > functions.
> > 
> > Generic code works on par with arch or better, according to my
> > testing [1], and with recent improvements merged in v6.1, it should
> > be even faster.
> > 
> > ARM already uses many generic find_bit() functions - those that it
> > doesn't implement. So we are talking about migrating a subset of the
> > API; most of find_bit() family has only generic implementation on ARM.
> > 
> > The only concern about this migration is that ARM code supports
> > byte-aligned bitmap addresses, while generic code is optimized for
> > word-aligned bitmaps.
> > 
> > In my practice, I've never seen unaligned bitmaps. But to check that on
> > ARM, I added a run-time check for bitmap alignment. I gave it run on
> > several architectures and found nothing.
> > 
> > Can you please check that on your hardware and compare performance of
> > generic vs arch code for you? If everything is OK, I suggest switching
> > ARM to generic find_bit() completely.
> > 
> > Thanks,
> > Yury
> > 
> > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/
> 
> I _really_ don't want to play around with this stuff right now... 6.0
> appears to have a regression on arm32 early on during boot:
> 
> [    1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0
> 
> Booting 5.19 with the same filesystem works without issue and without
> even a fsck, but booting 6.0 always results in some problem that
> prevents it booting.
> 
> Debugging this is not easy, because there also seems to be something
> up with the bloody serial console - sometimes I get nothing, other
> times I get nothing more than:
> 
> [    2.929502] EXT4-fs error (de
> 
> and then the output stops. Is the console no longer synchronous? If it
> isn't, that's a huge mistake which can be seen right here with the
> partial message output... so I also need to work out how to make the
> console output synchronous again.

Got it.

I you think that EXT4 problems are due to unaligned bitmaps, you can take
1st patch from this series to check.

Thanks,
Yury
  
Russell King (Oracle) Oct. 20, 2022, 10:55 p.m. UTC | #3
On Thu, Oct 20, 2022 at 01:02:07PM -0700, Yury Norov wrote:
> On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote:
> > On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> > > Hi Russell, all,
> > > 
> > > I'd like to respin a patch that switches ARM to generic find_bit()
> > > functions.
> > > 
> > > Generic code works on par with arch or better, according to my
> > > testing [1], and with recent improvements merged in v6.1, it should
> > > be even faster.
> > > 
> > > ARM already uses many generic find_bit() functions - those that it
> > > doesn't implement. So we are talking about migrating a subset of the
> > > API; most of find_bit() family has only generic implementation on ARM.
> > > 
> > > The only concern about this migration is that ARM code supports
> > > byte-aligned bitmap addresses, while generic code is optimized for
> > > word-aligned bitmaps.
> > > 
> > > In my practice, I've never seen unaligned bitmaps. But to check that on
> > > ARM, I added a run-time check for bitmap alignment. I gave it run on
> > > several architectures and found nothing.
> > > 
> > > Can you please check that on your hardware and compare performance of
> > > generic vs arch code for you? If everything is OK, I suggest switching
> > > ARM to generic find_bit() completely.
> > > 
> > > Thanks,
> > > Yury
> > > 
> > > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/
> > 
> > I _really_ don't want to play around with this stuff right now... 6.0
> > appears to have a regression on arm32 early on during boot:
> > 
> > [    1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0
> > 
> > Booting 5.19 with the same filesystem works without issue and without
> > even a fsck, but booting 6.0 always results in some problem that
> > prevents it booting.
> > 
> > Debugging this is not easy, because there also seems to be something
> > up with the bloody serial console - sometimes I get nothing, other
> > times I get nothing more than:
> > 
> > [    2.929502] EXT4-fs error (de
> > 
> > and then the output stops. Is the console no longer synchronous? If it
> > isn't, that's a huge mistake which can be seen right here with the
> > partial message output... so I also need to work out how to make the
> > console output synchronous again.
> 
> Got it.
> 
> I you think that EXT4 problems are due to unaligned bitmaps, you can take
> 1st patch from this series to check.

Got to the bottom of it, it wasn't the bit array functions, it was DMA
API issues.

Okay, I've now tested the generic ops vs my updated optimised ops,
and my version still comes out faster (based on three runs). The
random-filled show less difference, but the sparse bitmaps show a
much better win for my optimised code over the generic code where
they exist:

arm: [  694.614773] find_next_bit:                   40078 ns,    656 iterations
gen: [   88.611973] find_next_bit:                   69943 ns,    655 iterations
arm: [  694.625436] find_next_zero_bit:            3939309 ns, 327025 iterations
gen: [   88.624227] find_next_zero_bit:            5529553 ns, 327026 iterations
arm: [  694.646236] find_first_bit:                7301363 ns,    656 iterations
gen: [   88.645313] find_first_bit:                7589120 ns,    655 iterations

These figures appear to be pretty consistent across the three
runs.

For completness, here's the random-filled results:

arm: [  694.109190] find_next_bit:                 2242618 ns, 163949 iterations
gen: [   88.167340] find_next_bit:                 2632859 ns, 163743 iterations
arm: [  694.117968] find_next_zero_bit:            2049129 ns, 163732 iterations
gen: [   88.176912] find_next_zero_bit:            2844221 ns, 163938 iterations
arm: [  694.151421] find_first_bit:               17778911 ns,  16448 iterations
gen: [   88.211167] find_first_bit:               18596622 ns,  16401 iterations

So, I don't see much reason to switch to the generic ops for
these, not when we have such a significant speedup on the
find_next_* functions for sparse-filled results..