[0/1] Rosebush, a new hash table

Message ID 20240222203726.1101861-1-willy@infradead.org
Headers
Series Rosebush, a new hash table |

Message

Matthew Wilcox Feb. 22, 2024, 8:37 p.m. UTC
  Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
I've written a load of documentation about how it works, mostly in
Documentation/core-api/rosebush.rst but some is dotted through the
rosebush.c file too.

You can see this code as a further exploration of the "Linked lists are
evil" design space.  For the workloads which a hashtable is suited to,
it has lower overhead than either the maple tree or the rhashtable.
It cannot support ranges (so is not a replacement for the maple tree),
but it does have per-bucket locks so is more scalable for write-heavy
workloads.  I suspect one could reimplement the rhashtable API on top
of the rosebush, but I am not interested in doing that work myself.

The per-object overhead is 12 bytes, as opposed to 16 bytes for the
rhashtable and 32 bytes for the maple tree.  The constant overhead is also
small, being just 16 bytes for the struct rosebush.  The exact amount
of memory consumed for a given number of objects is going to depend on
the distribution of hashes; here are some estimated consumptions for
power-of-ten entries distributed evenly over a 32-bit hash space in the
various data structures:

number	xarray	maple	rhash	rosebush
1	3472	272	280	256
10	32272	784	424	256
100	262kB	3600	1864	2080
1000	[1]	34576	17224	16432
10k	[1]	343k	168392	131344
100k	[1]	3.4M	1731272	2101264

As you can see, rosebush and rhashtable are close the whole way.
Rosebush moves in larger chunks because it doubles each time; there's
no actual need to double the bucket size, but that works well with
the slab allocator's existing slabs.  As noted in the documentation,
we could create our own slabs and get closer to the 12 bytes per object
minimum consumption. [2]

Where I expect rosebush to shine is on dependent cache misses.
I've assumed an average chain length of 10 for rhashtable in the above
memory calculations.  That means on average a lookup would take five cache
misses that can't be speculated.  Rosebush does a linear walk of 4-byte
hashes looking for matches, so the CPU can usefully speculate the entire
array of hash values (indeed, we tell it exactly how many we're going to
look at) and then take a single cache miss fetching the correct pointer.
Add that to the cache miss to fetch the bucket and that's just two cache
misses rather than five.

I have not yet converted any code to use the rosebush.  The API is
designed for use by the dcache, and I expect it will evolve as it actually
gets used.  I think there's probably some more refactoring to be done.
I am not aware of any bugs, but the test suite is pitifully small.
The current algorithm grows the buckets more aggressively than the table;
that's probably exactly the wrong thing to do for good performance.

This version is full of debugging printks.  You should probably take
them out if you're going to try to benchmark it.  The regex '^print'
should find them all.  Contributions welcome; I really want to get back
to working on folios, but this felt like an urgent problem to be fixed.

[1] I stopped trying to estimate the memory costs for xarray; I couldn't
be bothered to as it's not a serious competitor for this use case.

[2] We have ideas for improving the maple tree memory consumption for
this kind of workload; a new node type for pivots that fit in 4 bytes and
sparse nodes to avoid putting a NULL entry after each occupied entry.
The maple tree really is optimised for densely packed ranges at the
moment.

Matthew Wilcox (Oracle) (1):
  rosebush: Add new data structure

 Documentation/core-api/index.rst    |   1 +
 Documentation/core-api/rosebush.rst | 135 ++++++
 MAINTAINERS                         |   8 +
 include/linux/rosebush.h            |  41 ++
 lib/Kconfig.debug                   |   3 +
 lib/Makefile                        |   3 +-
 lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
 lib/test_rosebush.c                 | 135 ++++++
 8 files changed, 1032 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/core-api/rosebush.rst
 create mode 100644 include/linux/rosebush.h
 create mode 100644 lib/rosebush.c
 create mode 100644 lib/test_rosebush.c
  

Comments

Peng Zhang Feb. 23, 2024, 11:37 a.m. UTC | #1
在 2024/2/23 04:37, Matthew Wilcox (Oracle) 写道:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.
> 
> You can see this code as a further exploration of the "Linked lists are
> evil" design space.  For the workloads which a hashtable is suited to,
> it has lower overhead than either the maple tree or the rhashtable.
> It cannot support ranges (so is not a replacement for the maple tree),
> but it does have per-bucket locks so is more scalable for write-heavy
> workloads.  I suspect one could reimplement the rhashtable API on top
> of the rosebush, but I am not interested in doing that work myself.
> 
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the
> rhashtable and 32 bytes for the maple tree.  The constant overhead is also
> small, being just 16 bytes for the struct rosebush.  The exact amount
> of memory consumed for a given number of objects is going to depend on
> the distribution of hashes; here are some estimated consumptions for
> power-of-ten entries distributed evenly over a 32-bit hash space in the
> various data structures:
> 
> number	xarray	maple	rhash	rosebush
> 1	3472	272	280	256
> 10	32272	784	424	256
> 100	262kB	3600	1864	2080
> 1000	[1]	34576	17224	16432
> 10k	[1]	343k	168392	131344
> 100k	[1]	3.4M	1731272	2101264
> 
> As you can see, rosebush and rhashtable are close the whole way.
> Rosebush moves in larger chunks because it doubles each time; there's
> no actual need to double the bucket size, but that works well with
> the slab allocator's existing slabs.  As noted in the documentation,
> we could create our own slabs and get closer to the 12 bytes per object
> minimum consumption. [2]
> 
> Where I expect rosebush to shine is on dependent cache misses.
> I've assumed an average chain length of 10 for rhashtable in the above
> memory calculations.  That means on average a lookup would take five cache
> misses that can't be speculated.  Rosebush does a linear walk of 4-byte
> hashes looking for matches, so the CPU can usefully speculate the entire
> array of hash values (indeed, we tell it exactly how many we're going to
> look at) and then take a single cache miss fetching the correct pointer.
> Add that to the cache miss to fetch the bucket and that's just two cache
> misses rather than five.
> 
> I have not yet converted any code to use the rosebush.  The API is
> designed for use by the dcache, and I expect it will evolve as it actually
Hello,

It seems that the advantage of this hash table is that it does not need to
traverse the linked list and has fewer cache misses. I want to know how
much performance improvement is expected if it is applied to dcache?

Thanks,
Peng
> gets used.  I think there's probably some more refactoring to be done.
> I am not aware of any bugs, but the test suite is pitifully small.
> The current algorithm grows the buckets more aggressively than the table;
> that's probably exactly the wrong thing to do for good performance.
> 
> This version is full of debugging printks.  You should probably take
> them out if you're going to try to benchmark it.  The regex '^print'
> should find them all.  Contributions welcome; I really want to get back
> to working on folios, but this felt like an urgent problem to be fixed.
> 
> [1] I stopped trying to estimate the memory costs for xarray; I couldn't
> be bothered to as it's not a serious competitor for this use case.
> 
> [2] We have ideas for improving the maple tree memory consumption for
> this kind of workload; a new node type for pivots that fit in 4 bytes and
> sparse nodes to avoid putting a NULL entry after each occupied entry.
> The maple tree really is optimised for densely packed ranges at the
> moment.
> 
> Matthew Wilcox (Oracle) (1):
>    rosebush: Add new data structure
> 
>   Documentation/core-api/index.rst    |   1 +
>   Documentation/core-api/rosebush.rst | 135 ++++++
>   MAINTAINERS                         |   8 +
>   include/linux/rosebush.h            |  41 ++
>   lib/Kconfig.debug                   |   3 +
>   lib/Makefile                        |   3 +-
>   lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
>   lib/test_rosebush.c                 | 135 ++++++
>   8 files changed, 1032 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/core-api/rosebush.rst
>   create mode 100644 include/linux/rosebush.h
>   create mode 100644 lib/rosebush.c
>   create mode 100644 lib/test_rosebush.c
>
  
Jason A. Donenfeld Feb. 23, 2024, 1:55 p.m. UTC | #2
Hi Matthew,

On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.

If you're interested, WireGuard has some pretty primitive hashtables,
for which maybe Rosebush would be an interesting replacement:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.h#n17
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/ratelimiter.c#n167

In peerlookup.c, note the "At the moment, we limit" comment for an idea
of some of the hairy issues involved in replacing these. But I wouldn't
be entirely opposed to it, if you see some interesting potential for
Rosebush here. That's a very tentative interest -- maybe it won't work
out in the end -- but nonetheless, seeing this piqued my curiosity. If
you're looking to see how this behaves in a place beyond dcache, this
might be something to play with.

Jason
  
Kent Overstreet Feb. 23, 2024, 6:40 p.m. UTC | #3
On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.
> 
> You can see this code as a further exploration of the "Linked lists are
> evil" design space.  For the workloads which a hashtable is suited to,
> it has lower overhead than either the maple tree or the rhashtable.
> It cannot support ranges (so is not a replacement for the maple tree),
> but it does have per-bucket locks so is more scalable for write-heavy
> workloads.  I suspect one could reimplement the rhashtable API on top
> of the rosebush, but I am not interested in doing that work myself.
> 
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the
> rhashtable and 32 bytes for the maple tree.  The constant overhead is also
> small, being just 16 bytes for the struct rosebush.  The exact amount
> of memory consumed for a given number of objects is going to depend on
> the distribution of hashes; here are some estimated consumptions for
> power-of-ten entries distributed evenly over a 32-bit hash space in the
> various data structures:
> 
> number	xarray	maple	rhash	rosebush
> 1	3472	272	280	256
> 10	32272	784	424	256
> 100	262kB	3600	1864	2080
> 1000	[1]	34576	17224	16432
> 10k	[1]	343k	168392	131344
> 100k	[1]	3.4M	1731272	2101264

So I think the interesting numbers to see (besides performance numbers)
are going to be the fill factor numbers under real world use.

It's an interesting technique, I've played around with it a bit
(actually using it in bcachefs for the nocow locks hash table), but no
idea if it makes sense as a general purpose thing yet...

you also mentioned that a motivation was API mismatch between rhashtable
and dcache - could you elaborate on that?
  
Kent Overstreet Feb. 25, 2024, 3:18 a.m. UTC | #4
On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> From: Herbert Xu
> > Sent: 24 February 2024 00:21
> > 
> > On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> > >
> > > Where I expect rosebush to shine is on dependent cache misses.
> > > I've assumed an average chain length of 10 for rhashtable in the above
> > > memory calculations.  That means on average a lookup would take five cache
> > > misses that can't be speculated.  Rosebush does a linear walk of 4-byte
> > 
> > Normally an rhashtable gets resized when it reaches 75% capacity
> > so the average chain length should always be one.
> 
> The average length of non-empty hash chains is more interesting.
> You don't usually search for items in empty chains.
> The only way you'll get all the chains of length one is if you've
> carefully picked the data so that it hashed that way.
> 
> I remember playing around with the elf symbol table for a browser
> and all its shared libraries.
> While the hash function is pretty trivial, it really didn't matter
> whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> chains were always long.

that's a pretty bad hash, even golden ratio hash would be better, but
still bad; you really should be using at least jhash.

you really want a good avalanche effect, because in real world usage
your entropy is often only in a relatively few bits.

when I implemented cuckoo (which is more obviously sensitive to a weak
hash function), I had to go with siphash, even jhash wasn't giving me
great reslts. and looking at the code it's not hard to see why, it's all
adds, and the rotates are byte aligned... you want mixed adds and xors
and the rotates to be more prime-ish.

right idea, just old...

what would be ideal is something more like siphash, but with fewer
rounds, so same number of instructions as jhash. xxhash might fit the
bill, I haven't looked at the code yet...
  
Kent Overstreet Feb. 25, 2024, 3:20 a.m. UTC | #5
On Sun, Feb 25, 2024 at 08:50:36AM +0800, Herbert Xu wrote:
> On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> >
> > > Normally an rhashtable gets resized when it reaches 75% capacity
> > > so the average chain length should always be one.
> > 
> > The average length of non-empty hash chains is more interesting.
> > You don't usually search for items in empty chains.
> > The only way you'll get all the chains of length one is if you've
> > carefully picked the data so that it hashed that way.
> 
> Sure.  But given the 75% capacity, you'd need a really bad hash
> function to get an *average* (not worst-case) chain length of
> 10.
> 
> > I remember playing around with the elf symbol table for a browser
> > and all its shared libraries.
> > While the hash function is pretty trivial, it really didn't matter
> > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > chains were always long.
> 
> Even in the unlikely event of bad luck and everything bunches up
> together, we change theh hash function (through hash_rnd) every
> time we resize so you would expect things to even out after the
> resize event.
> 
> A rehash is also automatically triggered if the worst-case chain
> length exceeds 16.

16!? that's crap, use a decent hash function and 3-5 should be your
worst upper bound.
  
Kent Overstreet Feb. 25, 2024, 5:51 a.m. UTC | #6
On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote:
> On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > > I remember playing around with the elf symbol table for a browser
> > > and all its shared libraries.
> > > While the hash function is pretty trivial, it really didn't matter
> > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > > chains were always long.
> > 
> > that's a pretty bad hash, even golden ratio hash would be better, but
> > still bad; you really should be using at least jhash.
> 
> There's a "fun" effect; essentially the "biased observer" effect which
> leads students to erroneously conclude that the majority of classes are
> oversubscribed.  As somebody observed in this thread, for some usecases
> you only look up hashes which actually exist.
> 
> Task a trivial example where you have four entries unevenly distributed
> between two buckets, three in one bucket and one in the other.  Now 3/4
> of your lookups hit in one bucket and 1/4 in the other bucket.
> Obviously it's not as pronounced if you have 1000 buckets with 1000
> entries randomly distributed between the buckets.  But that distribution
> is not nearly as even as you might expect:
> 
> $ ./distrib
> 0: 362
> 1: 371
> 2: 193
> 3: 57
> 4: 13
> 5: 4
> 
> That's using lrand48() to decide which bucket to use, so not even a
> "quality of hash" problem, just a "your mathematical intuition may not
> be right here".

well, golden ratio hash - hash_32(i, 32)
0: 368
1: 264
2: 368
3: 0

but your distribution actually is accurate in general, golden ratio hash
is relly nice for sequential integers. the actual problem with your test
is that you're testing 100% occupancy - no one does that.

75% occupancy, siphash:
0: 933
1: 60
2: 6
3: 1
4: 0

that looks about right to me.
  
Kent Overstreet Feb. 25, 2024, 6:14 a.m. UTC | #7
On Sun, Feb 25, 2024 at 01:53:35PM +0800, Herbert Xu wrote:
> On Sun, Feb 25, 2024 at 12:51:06AM -0500, Kent Overstreet wrote:
> >
> > but your distribution actually is accurate in general, golden ratio hash
> > is relly nice for sequential integers. the actual problem with your test
> > is that you're testing 100% occupancy - no one does that.
> > 
> > 75% occupancy, siphash:
> > 0: 933
> > 1: 60
> > 2: 6
> > 3: 1
> > 4: 0
> > 
> > that looks about right to me.
> 
> The point is that the worst-case length grows with the size of
> the table so it won't always be 3.  You need to take into account
> the largest table size that you will support.

ok, but - one million entries, siphash, 75% fill factor

0: 472053
1: 354786
2: 132663
3: 33267
4: 6218
5: 884
6: 110
7: 17
8: 2
9: 0

100 million:

0: 51342703
1: 34224025
2: 11413241
3: 2534946
4: 421816
5: 56271
6: 6346
7: 593
8: 56
9: 3
10: 0

it's a log curve - chain length of 16 means you picked a bad hash
function.
  
Kent Overstreet Feb. 25, 2024, 9:48 p.m. UTC | #8
On Sun, Feb 25, 2024 at 02:47:45PM +0000, David Laight wrote:
> From: Kent Overstreet
> > Sent: 25 February 2024 03:19
> ..
> > when I implemented cuckoo (which is more obviously sensitive to a weak
> > hash function), I had to go with siphash, even jhash wasn't giving me
> > great reslts. and looking at the code it's not hard to see why, it's all
> > adds, and the rotates are byte aligned... you want mixed adds and xors
> > and the rotates to be more prime-ish.
> > 
> > right idea, just old...
> > 
> > what would be ideal is something more like siphash, but with fewer
> > rounds, so same number of instructions as jhash. xxhash might fit the
> > bill, I haven't looked at the code yet...
> 
> There is likely to be a point where scanning a list of values
> for the right hash value is faster than executing a hash function
> that is good enough to separate them to separate buckets.

Executing a bunch of alu ops that parallelize nicely is _fast_
(it's all xors/adds/rotates, chacha20 is a good example of how
the modern stuff works).

Also, for 2-way cuckoo there's an xor trick so you don't need to compute
a second hash.

But the key thing about cuckoo is that the loads on lookup _aren't
dependent_ - they can run in parallel. Every cache miss that goes all
the way to DRAM is stupidly expensive, remember - hundreds of
instructions, had you been able to keep the pipeline fed.

> You don't want to scan a linked list because they have horrid
> cache footprints.
> The locking is equally horrid - especially for remove.
> Arrays of pointers ar ethe way forward :-)

Well, maybe; I'm waiting to see fill factor numbers and benchmarks. Fill
factor was my concern when I was playing around with the concept; I used
it in a place where the hash table didn't need to be that big, and the
point was to avoid having to separately allocate the entries (and it
avoids the hassle of tombstone entries with linear/quadratic probing).

The fact that it requires a second dependent load, because buckets are
separately allocated, also looks like a big negative to me. I still
think a good lockless cuckoo hashing implementation ought to beat it.