| Message ID | 20240222203726.1101861-1-willy@infradead.org |
|---|---|
| Headers | From: "Matthew Wilcox (Oracle)" <willy@infradead.org>; To: linux-kernel@vger.kernel.org; Cc: Thomas Graf <tgraf@suug.ch>, Herbert Xu <herbert@gondor.apana.org.au>, netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org, maple-tree@lists.infradead.org, rcu@vger.kernel.org; Subject: [PATCH 0/1] Rosebush, a new hash table; Date: Thu, 22 Feb 2024 20:37:23 +0000 |
| Series | Rosebush, a new hash table |
Message
Matthew Wilcox
Feb. 22, 2024, 8:37 p.m. UTC
Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. I've written a load of documentation about how it works, mostly in Documentation/core-api/rosebush.rst but some is dotted through the rosebush.c file too.

You can see this code as a further exploration of the "Linked lists are evil" design space. For the workloads which a hashtable is suited to, it has lower overhead than either the maple tree or the rhashtable. It cannot support ranges (so is not a replacement for the maple tree), but it does have per-bucket locks so is more scalable for write-heavy workloads. I suspect one could reimplement the rhashtable API on top of the rosebush, but I am not interested in doing that work myself.

The per-object overhead is 12 bytes, as opposed to 16 bytes for the rhashtable and 32 bytes for the maple tree. The constant overhead is also small, being just 16 bytes for the struct rosebush. The exact amount of memory consumed for a given number of objects is going to depend on the distribution of hashes; here are some estimated consumptions for power-of-ten entries distributed evenly over a 32-bit hash space in the various data structures:

 number   xarray    maple     rhash   rosebush
      1     3472      272       280        256
     10    32272      784       424        256
    100    262kB     3600      1864       2080
   1000      [1]    34576     17224      16432
    10k      [1]     343k    168392     131344
   100k      [1]     3.4M   1731272    2101264

As you can see, rosebush and rhashtable are close the whole way. Rosebush moves in larger chunks because it doubles each time; there's no actual need to double the bucket size, but that works well with the slab allocator's existing slabs. As noted in the documentation, we could create our own slabs and get closer to the 12 bytes per object minimum consumption. [2]

Where I expect rosebush to shine is on dependent cache misses. I've assumed an average chain length of 10 for rhashtable in the above memory calculations. That means on average a lookup would take five cache misses that can't be speculated. Rosebush does a linear walk of 4-byte hashes looking for matches, so the CPU can usefully speculate the entire array of hash values (indeed, we tell it exactly how many we're going to look at) and then take a single cache miss fetching the correct pointer. Add that to the cache miss to fetch the bucket and that's just two cache misses rather than five.

I have not yet converted any code to use the rosebush. The API is designed for use by the dcache, and I expect it will evolve as it actually gets used. I think there's probably some more refactoring to be done. I am not aware of any bugs, but the test suite is pitifully small. The current algorithm grows the buckets more aggressively than the table; that's probably exactly the wrong thing to do for good performance.

This version is full of debugging printks. You should probably take them out if you're going to try to benchmark it. The regex '^print' should find them all. Contributions welcome; I really want to get back to working on folios, but this felt like an urgent problem to be fixed.

[1] I stopped trying to estimate the memory costs for xarray; I couldn't be bothered to as it's not a serious competitor for this use case.

[2] We have ideas for improving the maple tree memory consumption for this kind of workload; a new node type for pivots that fit in 4 bytes and sparse nodes to avoid putting a NULL entry after each occupied entry. The maple tree really is optimised for densely packed ranges at the moment.
Matthew Wilcox (Oracle) (1):
  rosebush: Add new data structure

 Documentation/core-api/index.rst    |   1 +
 Documentation/core-api/rosebush.rst | 135 ++++++
 MAINTAINERS                         |   8 +
 include/linux/rosebush.h            |  41 ++
 lib/Kconfig.debug                   |   3 +
 lib/Makefile                        |   3 +-
 lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
 lib/test_rosebush.c                 | 135 ++++++
 8 files changed, 1032 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/core-api/rosebush.rst
 create mode 100644 include/linux/rosebush.h
 create mode 100644 lib/rosebush.c
 create mode 100644 lib/test_rosebush.c
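To make the cache-miss argument in the cover letter concrete, here is a minimal sketch of the kind of bucket layout it describes: a packed array of 4-byte hashes followed by a parallel array of object pointers, so each occupied slot costs 4 + 8 = 12 bytes on a 64-bit kernel, and a lookup scans the hash array before taking a single dependent load for the matching pointer. The names and sizes below are illustrative assumptions, not the actual lib/rosebush.c API; resizing and per-bucket locking are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bucket layout, for illustration only: all of the 4-byte
 * hashes are packed together, followed by a parallel array of pointers.
 * A lookup scans the hash array (which the CPU can speculate through,
 * since the bounds are known up front); only the final pointer fetch is
 * a dependent cache miss.
 */
struct rb_bucket_sketch {
	uint32_t nr;		/* occupied slots */
	uint32_t hashes[15];	/* 4-byte hash per object */
	void *objs[15];		/* pointer per object */
};

static void *rb_bucket_lookup(const struct rb_bucket_sketch *b, uint32_t hash,
			      int (*match)(const void *obj, const void *key),
			      const void *key)
{
	for (uint32_t i = 0; i < b->nr; i++) {
		if (b->hashes[i] != hash)
			continue;
		if (match(b->objs[i], key))	/* the one dependent miss */
			return b->objs[i];
	}
	return NULL;
}
```

The point of the layout is only that the hash comparisons touch a small contiguous array rather than a pointer chain; a chained table of the same size would instead take one unpredictable miss per list node visited.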
Comments
On 2024/2/23 04:37, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. I've written a load of documentation about how it works, mostly in Documentation/core-api/rosebush.rst but some is dotted through the rosebush.c file too.
>
> You can see this code as a further exploration of the "Linked lists are evil" design space. For the workloads which a hashtable is suited to, it has lower overhead than either the maple tree or the rhashtable. It cannot support ranges (so is not a replacement for the maple tree), but it does have per-bucket locks so is more scalable for write-heavy workloads. I suspect one could reimplement the rhashtable API on top of the rosebush, but I am not interested in doing that work myself.
>
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the rhashtable and 32 bytes for the maple tree. The constant overhead is also small, being just 16 bytes for the struct rosebush. The exact amount of memory consumed for a given number of objects is going to depend on the distribution of hashes; here are some estimated consumptions for power-of-ten entries distributed evenly over a 32-bit hash space in the various data structures:
>
>  number   xarray    maple     rhash   rosebush
>       1     3472      272       280        256
>      10    32272      784       424        256
>     100    262kB     3600      1864       2080
>    1000      [1]    34576     17224      16432
>     10k      [1]     343k    168392     131344
>    100k      [1]     3.4M   1731272    2101264
>
> As you can see, rosebush and rhashtable are close the whole way. Rosebush moves in larger chunks because it doubles each time; there's no actual need to double the bucket size, but that works well with the slab allocator's existing slabs. As noted in the documentation, we could create our own slabs and get closer to the 12 bytes per object minimum consumption. [2]
>
> Where I expect rosebush to shine is on dependent cache misses. I've assumed an average chain length of 10 for rhashtable in the above memory calculations. That means on average a lookup would take five cache misses that can't be speculated. Rosebush does a linear walk of 4-byte hashes looking for matches, so the CPU can usefully speculate the entire array of hash values (indeed, we tell it exactly how many we're going to look at) and then take a single cache miss fetching the correct pointer. Add that to the cache miss to fetch the bucket and that's just two cache misses rather than five.
>
> I have not yet converted any code to use the rosebush. The API is designed for use by the dcache, and I expect it will evolve as it actually

Hello,

It seems that the advantage of this hash table is that it does not need to traverse the linked list and has fewer cache misses. I want to know how much performance improvement is expected if it is applied to the dcache?

Thanks,
Peng

> gets used. I think there's probably some more refactoring to be done. I am not aware of any bugs, but the test suite is pitifully small. The current algorithm grows the buckets more aggressively than the table; that's probably exactly the wrong thing to do for good performance.
>
> This version is full of debugging printks. You should probably take them out if you're going to try to benchmark it. The regex '^print' should find them all. Contributions welcome; I really want to get back to working on folios, but this felt like an urgent problem to be fixed.
>
> [1] I stopped trying to estimate the memory costs for xarray; I couldn't be bothered to as it's not a serious competitor for this use case.
>
> [2] We have ideas for improving the maple tree memory consumption for this kind of workload; a new node type for pivots that fit in 4 bytes and sparse nodes to avoid putting a NULL entry after each occupied entry. The maple tree really is optimised for densely packed ranges at the moment.
>
> Matthew Wilcox (Oracle) (1):
>   rosebush: Add new data structure
>
>  Documentation/core-api/index.rst    |   1 +
>  Documentation/core-api/rosebush.rst | 135 ++++++
>  MAINTAINERS                         |   8 +
>  include/linux/rosebush.h            |  41 ++
>  lib/Kconfig.debug                   |   3 +
>  lib/Makefile                        |   3 +-
>  lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
>  lib/test_rosebush.c                 | 135 ++++++
>  8 files changed, 1032 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/core-api/rosebush.rst
>  create mode 100644 include/linux/rosebush.h
>  create mode 100644 lib/rosebush.c
>  create mode 100644 lib/test_rosebush.c
Hi Matthew,

On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. I've written a load of documentation about how it works, mostly in Documentation/core-api/rosebush.rst but some is dotted through the rosebush.c file too.

If you're interested, WireGuard has some pretty primitive hashtables, for which maybe Rosebush would be an interesting replacement:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.h#n17
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/ratelimiter.c#n167

In peerlookup.c, note the "At the moment, we limit" comment for an idea of some of the hairy issues involved in replacing these. But I wouldn't be entirely opposed to it, if you see some interesting potential for Rosebush here. That's a very tentative interest -- maybe it won't work out in the end -- but nonetheless, seeing this piqued my curiosity. If you're looking to see how this behaves in a place beyond dcache, this might be something to play with.

Jason
On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table. I've written a load of documentation about how it works, mostly in Documentation/core-api/rosebush.rst but some is dotted through the rosebush.c file too.
>
> You can see this code as a further exploration of the "Linked lists are evil" design space. For the workloads which a hashtable is suited to, it has lower overhead than either the maple tree or the rhashtable. It cannot support ranges (so is not a replacement for the maple tree), but it does have per-bucket locks so is more scalable for write-heavy workloads. I suspect one could reimplement the rhashtable API on top of the rosebush, but I am not interested in doing that work myself.
>
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the rhashtable and 32 bytes for the maple tree. The constant overhead is also small, being just 16 bytes for the struct rosebush. The exact amount of memory consumed for a given number of objects is going to depend on the distribution of hashes; here are some estimated consumptions for power-of-ten entries distributed evenly over a 32-bit hash space in the various data structures:
>
>  number   xarray    maple     rhash   rosebush
>       1     3472      272       280        256
>      10    32272      784       424        256
>     100    262kB     3600      1864       2080
>    1000      [1]    34576     17224      16432
>     10k      [1]     343k    168392     131344
>    100k      [1]     3.4M   1731272    2101264

So I think the interesting numbers to see (besides performance numbers) are going to be the fill factor numbers under real world use.

It's an interesting technique, I've played around with it a bit (actually using it in bcachefs for the nocow locks hash table), but no idea if it makes sense as a general purpose thing yet...

You also mentioned that a motivation was API mismatch between rhashtable and dcache - could you elaborate on that?
On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> From: Herbert Xu
> > Sent: 24 February 2024 00:21
> >
> > On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> > >
> > > Where I expect rosebush to shine is on dependent cache misses. I've assumed an average chain length of 10 for rhashtable in the above memory calculations. That means on average a lookup would take five cache misses that can't be speculated. Rosebush does a linear walk of 4-byte
> >
> > Normally an rhashtable gets resized when it reaches 75% capacity so the average chain length should always be one.
>
> The average length of non-empty hash chains is more interesting. You don't usually search for items in empty chains. The only way you'll get all the chains of length one is if you've carefully picked the data so that it hashed that way.
>
> I remember playing around with the elf symbol table for a browser and all its shared libraries. While the hash function is pretty trivial, it really didn't matter whether you divided 2^n, 2^n-1 or 'the prime below 2^n', some hash chains were always long.

that's a pretty bad hash, even golden ratio hash would be better, but still bad; you really should be using at least jhash. you really want a good avalanche effect, because in real world usage your entropy is often only in a relatively few bits.

when I implemented cuckoo (which is more obviously sensitive to a weak hash function), I had to go with siphash, even jhash wasn't giving me great results. and looking at the code it's not hard to see why, it's all adds, and the rotates are byte aligned... you want mixed adds and xors and the rotates to be more prime-ish.

right idea, just old...

what would be ideal is something more like siphash, but with fewer rounds, so same number of instructions as jhash. xxhash might fit the bill, I haven't looked at the code yet...
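For reference, the "golden ratio hash" mentioned here is Fibonacci hashing: multiply the key by roughly 2^32/phi and keep the top bits, which is essentially what the kernel's hash_32() helper does. A minimal illustrative sketch follows; the constant matches the kernel's 32-bit golden-ratio value as I understand include/linux/hash.h, but treat the details as an approximation rather than the kernel implementation.

```c
#include <stdint.h>

/*
 * Fibonacci ("golden ratio") hashing: multiply by ~2^32/phi and keep the
 * top bits.  'bits' must be between 1 and 32.
 *
 * It spreads sequential keys nicely, but a single multiply has a weak
 * avalanche: flipping one input bit changes the product by a constant
 * that depends only on which bit flipped, so related keys land in
 * related buckets.  Hence the suggestion above to reach for jhash or
 * siphash when the input entropy sits in only a few bits.
 */
static inline uint32_t golden_ratio_bucket(uint32_t key, unsigned int bits)
{
	return (uint32_t)(key * 0x61C88647u) >> (32 - bits);
}
```

Its appeal is that it costs one multiply and one shift; the argument in this subthread is that spending a few more rounds of mixing is usually worth it to keep chains short.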
On Sun, Feb 25, 2024 at 08:50:36AM +0800, Herbert Xu wrote:
> On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> >
> > > Normally an rhashtable gets resized when it reaches 75% capacity so the average chain length should always be one.
> >
> > The average length of non-empty hash chains is more interesting. You don't usually search for items in empty chains. The only way you'll get all the chains of length one is if you've carefully picked the data so that it hashed that way.
>
> Sure. But given the 75% capacity, you'd need a really bad hash function to get an *average* (not worst-case) chain length of 10.
>
> > I remember playing around with the elf symbol table for a browser and all its shared libraries. While the hash function is pretty trivial, it really didn't matter whether you divided 2^n, 2^n-1 or 'the prime below 2^n', some hash chains were always long.
>
> Even in the unlikely event of bad luck and everything bunches up together, we change the hash function (through hash_rnd) every time we resize so you would expect things to even out after the resize event.
>
> A rehash is also automatically triggered if the worst-case chain length exceeds 16.

16!? that's crap, use a decent hash function and 3-5 should be your worst upper bound.
On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote:
> On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > > I remember playing around with the elf symbol table for a browser and all its shared libraries. While the hash function is pretty trivial, it really didn't matter whether you divided 2^n, 2^n-1 or 'the prime below 2^n', some hash chains were always long.
> >
> > that's a pretty bad hash, even golden ratio hash would be better, but still bad; you really should be using at least jhash.
>
> There's a "fun" effect; essentially the "biased observer" effect which leads students to erroneously conclude that the majority of classes are oversubscribed. As somebody observed in this thread, for some usecases you only look up hashes which actually exist.
>
> Take a trivial example where you have four entries unevenly distributed between two buckets, three in one bucket and one in the other. Now 3/4 of your lookups hit in one bucket and 1/4 in the other bucket. Obviously it's not as pronounced if you have 1000 buckets with 1000 entries randomly distributed between the buckets. But that distribution is not nearly as even as you might expect:
>
> $ ./distrib
> 0: 362
> 1: 371
> 2: 193
> 3: 57
> 4: 13
> 5: 4
>
> That's using lrand48() to decide which bucket to use, so not even a "quality of hash" problem, just a "your mathematical intuition may not be right here".

well, golden ratio hash - hash_32(i, 32):

0: 368
1: 264
2: 368
3: 0

but your distribution actually is accurate in general, golden ratio hash is really nice for sequential integers. the actual problem with your test is that you're testing 100% occupancy - no one does that.

75% occupancy, siphash:

0: 933
1: 60
2: 6
3: 1
4: 0

that looks about right to me.
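Matthew's ./distrib program isn't included in the thread; a minimal reconstruction of the experiment he describes, 1000 entries dropped into 1000 buckets chosen by lrand48() and then a histogram of bucket occupancy, could look like the sketch below. Exact counts will vary with the seed, but the shape matches the output quoted above.

```c
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>

#define NR_BUCKETS 1000
#define NR_ENTRIES 1000

int main(void)
{
	int buckets[NR_BUCKETS] = { 0 };
	int histogram[16] = { 0 };

	srand48(0);			/* any seed; the shape is the same */
	for (int i = 0; i < NR_ENTRIES; i++)
		buckets[lrand48() % NR_BUCKETS]++;

	/* Count how many buckets hold 0, 1, 2, ... entries. */
	for (int i = 0; i < NR_BUCKETS; i++)
		histogram[buckets[i] < 15 ? buckets[i] : 15]++;

	for (int i = 0; i < 16; i++)
		if (histogram[i])
			printf("%d: %d\n", i, histogram[i]);
	return 0;
}
```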
On Sun, Feb 25, 2024 at 01:53:35PM +0800, Herbert Xu wrote:
> On Sun, Feb 25, 2024 at 12:51:06AM -0500, Kent Overstreet wrote:
> >
> > but your distribution actually is accurate in general, golden ratio hash is really nice for sequential integers. the actual problem with your test is that you're testing 100% occupancy - no one does that.
> >
> > 75% occupancy, siphash:
> > 0: 933
> > 1: 60
> > 2: 6
> > 3: 1
> > 4: 0
> >
> > that looks about right to me.
>
> The point is that the worst-case length grows with the size of the table so it won't always be 3. You need to take into account the largest table size that you will support.

ok, but - one million entries, siphash, 75% fill factor:

0: 472053
1: 354786
2: 132663
3: 33267
4: 6218
5: 884
6: 110
7: 17
8: 2
9: 0

100 million:

0: 51342703
1: 34224025
2: 11413241
3: 2534946
4: 421816
5: 56271
6: 6346
7: 593
8: 56
9: 3
10: 0

it's a log curve - a chain length of 16 means you picked a bad hash function.
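For what it's worth, these histograms are what a Poisson model predicts for a well-mixed hash: with n keys in m buckets, a bucket's occupancy is approximately Poisson with lambda = n/m, so P(k) = e^(-lambda) * lambda^k / k!. Assuming the one-million-entry run above covers roughly a million buckets at lambda around 0.75 (an assumption, since the exact table geometry isn't given), the expected counts land very close to the measured ones, and the probability of any single bucket reaching 16 entries is on the order of 10^-16. A quick check, to be linked with -lm:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double lambda = 0.75;	/* 75% fill factor */
	const double buckets = 1e6;	/* assumed number of buckets */
	double p = exp(-lambda);	/* P(occupancy == 0) */

	for (int k = 0; k <= 9; k++) {
		/* Expected number of buckets holding exactly k entries. */
		printf("%d: %.0f\n", k, p * buckets);
		p *= lambda / (k + 1);	/* Poisson recurrence for P(k + 1) */
	}
	return 0;
}
```

It prints roughly 0: 472367, 1: 354275, 2: 132853, 3: 33213, 4: 6228, which lines up with the measured 472053, 354786, 132663, 33267, 6218 above, supporting the point that a worst-case chain of 16 is a hash-quality problem rather than bad luck.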
On Sun, Feb 25, 2024 at 02:47:45PM +0000, David Laight wrote:
> From: Kent Overstreet
> > Sent: 25 February 2024 03:19
> ..
> > when I implemented cuckoo (which is more obviously sensitive to a weak hash function), I had to go with siphash, even jhash wasn't giving me great results. and looking at the code it's not hard to see why, it's all adds, and the rotates are byte aligned... you want mixed adds and xors and the rotates to be more prime-ish.
> >
> > right idea, just old...
> >
> > what would be ideal is something more like siphash, but with fewer rounds, so same number of instructions as jhash. xxhash might fit the bill, I haven't looked at the code yet...
>
> There is likely to be a point where scanning a list of values for the right hash value is faster than executing a hash function that is good enough to separate them to separate buckets.

Executing a bunch of ALU ops that parallelize nicely is _fast_ (it's all xors/adds/rotates, chacha20 is a good example of how the modern stuff works). Also, for 2-way cuckoo there's an xor trick so you don't need to compute a second hash.

But the key thing about cuckoo is that the loads on lookup _aren't dependent_ - they can run in parallel. Every cache miss that goes all the way to DRAM is stupidly expensive, remember - hundreds of instructions, had you been able to keep the pipeline fed.

> You don't want to scan a linked list because they have horrid cache footprints. The locking is equally horrid - especially for remove. Arrays of pointers are the way forward :-)

Well, maybe; I'm waiting to see fill factor numbers and benchmarks. Fill factor was my concern when I was playing around with the concept; I used it in a place where the hash table didn't need to be that big, and the point was to avoid having to separately allocate the entries (and it avoids the hassle of tombstone entries with linear/quadratic probing).

The fact that it requires a second dependent load, because buckets are separately allocated, also looks like a big negative to me. I still think a good lockless cuckoo hashing implementation ought to beat it.
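The "xor trick" for 2-way cuckoo mentioned above is the partial-key scheme popularised by cuckoo filters: the alternate bucket is the current bucket xored with a hash of a small stored fingerprint, so either candidate location can be computed from the other without re-hashing the full key, and a lookup can issue both probes as independent loads. A rough sketch, with the function name and mixing constant chosen purely for illustration:

```c
#include <stdint.h>

/*
 * Partial-key cuckoo hashing: derive the alternate bucket from the current
 * bucket and a stored fingerprint.  The multiplier is an arbitrary odd
 * constant used only to spread the fingerprint bits; the table size must
 * be a power of two (bucket_mask = nr_buckets - 1).
 */
static inline uint32_t alt_bucket(uint32_t bucket, uint32_t fingerprint,
				  uint32_t bucket_mask)
{
	return (bucket ^ (fingerprint * 0x5bd1e995u)) & bucket_mask;
}
```

Because xor is its own inverse, alt_bucket(alt_bucket(b, f, m), f, m) == b when the bucket count is a power of two, which is what lets a displaced entry be kicked between its two homes using only the fingerprint kept in the table, and lets both candidate buckets be prefetched in parallel.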