Message ID | 20221105231055.25953-1-krisman@suse.de |
---|---|
State | New |
Series | sbitmap: Use single per-bitmap counting to wake up queued tags |
Commit Message
Gabriel Krisman Bertazi
Nov. 5, 2022, 11:10 p.m. UTC
sbitmap suffers from code complexity, as demonstrated by recent fixes,
and occasional lost wake-ups on nested I/O completion. The latter happens,
from what I understand, due to the non-atomic nature of the updates to
wait_cnt, which needs to be subtracted and eventually reset when it
reaches zero. This two-step process can miss an update when a nested
completion happens to interrupt the CPU in between the wait_cnt updates.
This is very hard to fix, as shown by the recent changes to this code.
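For illustration only, here is a deliberately simplified sketch of the two-step
pattern being described; the real code in lib/sbitmap.c (visible in the diff at
the end of this page) is more involved, with cmpxchg loops and retries:

	/*
	 * Simplified sketch, not the actual sbitmap code: the decrement
	 * and the reset of wait_cnt are two separate steps, and a nested
	 * completion that interrupts the CPU between them observes an
	 * intermediate state -- the window the paragraph above refers to.
	 */
	if (atomic_dec_return(&ws->wait_cnt) == 0) {
	        wake_up_nr(&ws->wait, wake_batch);
	        atomic_set(&ws->wait_cnt, wake_batch);
	}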
The code complexity arises mostly from the corner cases to avoid missed
wakes in this scenario. In addition, the handling of wake_batch
recalculation, plus its synchronization with sbitmap_queue_wake_up(), is
non-trivial.
This patchset implements the idea originally proposed by Jan [1], which
removes the need for the two-step updates of wait_cnt. This is done by
tracking the number of completions and wakeups in always increasing,
per-bitmap counters. Instead of having to reset the wait_cnt when it
reaches zero, we simply keep counting, and attempt to wake up N threads
in a single wait queue whenever there is enough space for a batch.
Waking up fewer than wake_batch threads shouldn't be a problem, because we
haven't changed the conditions for wake up, and the existing batch
calculation guarantees at least enough remaining completions to wake up
a batch for each queue at any time.
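The resulting wake-up path is short enough to quote in full. This is the
patched sbitmap_queue_wake_up() from the diff at the end of this page, with
comments added to map it back to the description above:

	void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
	{
	        unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
	        struct sbq_wait_state *ws = NULL;
	        unsigned int wakeups;

	        /* Fast path: nobody is queued, don't touch the counters. */
	        if (!atomic_read(&sbq->ws_active))
	                return;

	        atomic_add(nr, &sbq->completion_cnt);
	        wakeups = atomic_read(&sbq->wakeup_cnt);

	        do {
	                /* Not enough completions since the last woken batch. */
	                if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
	                        return;

	                if (!ws) {
	                        ws = sbq_wake_ptr(sbq);
	                        if (!ws)
	                                return;
	                }
	                /* Claim this batch; retry if another CPU claimed it first. */
	        } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
	                                     &wakeups, wakeups + wake_batch));

	        wake_up_nr(&ws->wait, wake_batch);
	}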
Performance-wise, one should expect very similar performance to the
original algorithm for the case where there is no queueing. In both the
old algorithm and this implementation, the first thing is to check
ws_active, which bails out if there is no queueing to be managed. In the
new code, we took care not to account completions and wakeups when there is
no queueing, so we don't pay the cost of atomic operations unnecessarily;
skipping the accounting in that case does not skew the counters.
For more interesting cases, where there is queueing, we need to take
into account the cross-communication of the atomic operations. I've
been benchmarking by running parallel fio jobs against a single hctx
nullb in different hardware queue depth scenarios, and verifying both
IOPS and queueing.
Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
varying only the hardware queue length per test.
IOPS, mean of 5 repetitions (standard deviation in parentheses):

queue size          2                4                8                16                32                64
6.1-rc2      1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
patched      1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
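The fio job file itself isn't included (Chaitanya asks for it below), but a job
along these lines matches the description above; the device path and ioengine
are assumptions, not taken from the cover letter:

	; Hypothetical reconstruction of the described workload, not the
	; author's actual job file; filename and ioengine are guesses.
	[global]
	filename=/dev/nullb0
	ioengine=io_uring
	direct=1
	rw=randwrite
	bs=512
	iodepth=64
	time_based=1
	runtime=60
	group_reporting=1

	[parallel-writers]
	numjobs=20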
The following is a similar experiment, run against a nullb with a single
bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
parallel fio jobs operating on the same device.
IOPS (standard deviation in parentheses):

queue size          2                4                8                16                 32                 64
6.1-rc2      1081.0K (2.3K)   957.2K (1.5K)    1699.1K (5.7K)   6178.2K (124.6K)   12227.9K (37.7K)   13286.6K (92.9K)
patched      1081.8K (2.8K)   1316.5K (5.4K)   2364.4K (1.8K)   6151.4K (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
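For completeness, the two device configurations can be approximated with
null_blk module parameters. These exact invocations are an educated guess from
the description, with hw_queue_depth being the per-column "queue size":

	# Hypothetical setup, inferred from the description above.
	# Single-hctx nullb (first table):
	modprobe null_blk queue_mode=2 submit_queues=1 hw_queue_depth=<queue size>
	# 20 hctx sharing one tag bitmap (second table):
	modprobe null_blk queue_mode=2 submit_queues=20 shared_tag_bitmap=1 hw_queue_depth=<queue size>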
It has also survived blktests and a 12h stress run against nullb. I also
ran the code against an NVMe device and a SCSI SSD, and I didn't observe
performance regression in those. If there are other tests you think I
should run, please let me know and I will follow up with results.
[1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
Cc: Hugh Dickins <hughd@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
include/linux/sbitmap.h | 16 ++++--
lib/sbitmap.c | 122 +++++++++-------------------------------
2 files changed, 37 insertions(+), 101 deletions(-)
Comments
> For more interesting cases, where there is queueing, we need to take > into account the cross-communication of the atomic operations. I've > been benchmarking by running parallel fio jobs against a single hctx > nullb in different hardware queue depth scenarios, and verifying both > IOPS and queueing. > > Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel > jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, > varying only the hardware queue length per test. > > queue size 2 4 8 16 32 64 > 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) > patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) > So if I understand correctly QD 2,4,8 shows clear performance benefit from this patch whereas QD 16, 32, 64 shows drop in performance it that correct ? If my observation is correct then applications with high QD will observe drop in the performance ? Also, please share a table with block size/IOPS/BW/CPU (system/user) /LAT/SLAT with % increase/decrease and document the raw numbers at the end of the cover-letter for completeness along with fio job to others can repeat the experiment... > The following is a similar experiment, ran against a nullb with a single > bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 > parallel fio jobs operating on the same device > > queue size 2 4 8 16 32 64 > 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) > patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) > same here ... > It has also survived blktests and a 12h-stress run against nullb. I also > ran the code against nvme and a scsi SSD, and I didn't observe > performance regression in those. If there are other tests you think I > should run, please let me know and I will follow up with results. > -ck
Chaitanya Kulkarni <chaitanyak@nvidia.com> writes: >> For more interesting cases, where there is queueing, we need to take >> into account the cross-communication of the atomic operations. I've >> been benchmarking by running parallel fio jobs against a single hctx >> nullb in different hardware queue depth scenarios, and verifying both >> IOPS and queueing. >> >> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel >> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, >> varying only the hardware queue length per test. >> >> queue size 2 4 8 16 32 64 >> 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) >> patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) > >> Hi Chaitanya, Thanks for the feedback. > So if I understand correctly > QD 2,4,8 shows clear performance benefit from this patch whereas > QD 16, 32, 64 shows drop in performance it that correct ? > > If my observation is correct then applications with high QD will > observe drop in the performance ? To be honest, I'm not sure. Given the overlap of the standard variation (in parenthesis) with the mean, I'm not sure the observed drop is statistically significant. In my prior analysis, I thought it wasn't. I don't see where a significant difference would come from, to be honest, because the higher the QD, the more likely it is to go through the not-contended path, where sbq->ws_active == 0. This hot path is identical to the existing implementation. > Also, please share a table with block size/IOPS/BW/CPU (system/user) > /LAT/SLAT with % increase/decrease and document the raw numbers at the > end of the cover-letter for completeness along with fio job to others > can repeat the experiment... This was issued against the nullb and the IO size is fixed, matching the device's block size (512b), which is why I am not tracking BW, only IOPS. I'm not sure the BW is still relevant in this scenario. I'll definitely follow up with CPU time and latencies, and share the fio job. I'll also take another look on the significance of the measured values for high QD. Thank you,
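The not-contended path mentioned here is the check at the top of the patched
sbitmap_queue_wake_up(); per the cover letter, the old code performs the
equivalent ws_active test before doing any work as well:

	/* Bail out when nobody is waiting, so the counters are never
	 * touched on the uncontended path. */
	if (!atomic_read(&sbq->ws_active))
	        return;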
On 11/8/22 19:03, Gabriel Krisman Bertazi wrote: > Chaitanya Kulkarni <chaitanyak@nvidia.com> writes: > >>> For more interesting cases, where there is queueing, we need to take >>> into account the cross-communication of the atomic operations. I've >>> been benchmarking by running parallel fio jobs against a single hctx >>> nullb in different hardware queue depth scenarios, and verifying both >>> IOPS and queueing. >>> >>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel >>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, >>> varying only the hardware queue length per test. >>> >>> queue size 2 4 8 16 32 64 >>> 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) >>> patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) >> >>> > > Hi Chaitanya, > > Thanks for the feedback. > >> So if I understand correctly >> QD 2,4,8 shows clear performance benefit from this patch whereas >> QD 16, 32, 64 shows drop in performance it that correct ? >> >> If my observation is correct then applications with high QD will >> observe drop in the performance ? > > To be honest, I'm not sure. Given the overlap of the standard variation > (in parenthesis) with the mean, I'm not sure the observed drop is > statistically significant. In my prior analysis, I thought it wasn't. > > I don't see where a significant difference would come from, to be honest, > because the higher the QD, the more likely it is to go through the > not-contended path, where sbq->ws_active == 0. This hot path is > identical to the existing implementation. > The numbers are taken on the null_blk, with the drop I could see here may end up being different on the real H/W ? and I cannot comment on that since we don't have that data ... Did you repeat the experiment with the real H/W like NVMe SSD ? >> Also, please share a table with block size/IOPS/BW/CPU (system/user) >> /LAT/SLAT with % increase/decrease and document the raw numbers at the >> end of the cover-letter for completeness along with fio job to others >> can repeat the experiment... > > This was issued against the nullb and the IO size is fixed, matching the > device's block size (512b), which is why I am not tracking BW, only > IOPS. I'm not sure the BW is still relevant in this scenario. > > I'll definitely follow up with CPU time and latencies, and share the > fio job. I'll also take another look on the significance of the > measured values for high QD. > Yes, please if CPU usage way higher then we need to know that above numbers are at the cost of the higher CPU, in that case IOPs per core B/W per core matrix can be very useful ? -ck
On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote: > sbitmap suffers from code complexity, as demonstrated by recent fixes, > and eventual lost wake ups on nested I/O completion. The later happens, > from what I understand, due to the non-atomic nature of the updates to > wait_cnt, which needs to be subtracted and eventually reset when equal > to zero. This two step process can eventually miss an update when a > nested completion happens to interrupt the CPU in between the wait_cnt > updates. This is very hard to fix, as shown by the recent changes to > this code. > > The code complexity arises mostly from the corner cases to avoid missed > wakes in this scenario. In addition, the handling of wake_batch > recalculation plus the synchronization with sbq_queue_wake_up is > non-trivial. > > This patchset implements the idea originally proposed by Jan [1], which > removes the need for the two-step updates of wait_cnt. This is done by > tracking the number of completions and wakeups in always increasing, > per-bitmap counters. Instead of having to reset the wait_cnt when it > reaches zero, we simply keep counting, and attempt to wake up N threads > in a single wait queue whenever there is enough space for a batch. > Waking up less than batch_wake shouldn't be a problem, because we > haven't changed the conditions for wake up, and the existing batch > calculation guarantees at least enough remaining completions to wake up > a batch for each queue at any time. > > Performance-wise, one should expect very similar performance to the > original algorithm for the case where there is no queueing. In both the > old algorithm and this implementation, the first thing is to check > ws_active, which bails out if there is no queueing to be managed. In the > new code, we took care to avoid accounting completions and wakeups when > there is no queueing, to not pay the cost of atomic operations > unnecessarily, since it doesn't skew the numbers. > > For more interesting cases, where there is queueing, we need to take > into account the cross-communication of the atomic operations. I've > been benchmarking by running parallel fio jobs against a single hctx > nullb in different hardware queue depth scenarios, and verifying both > IOPS and queueing. > > Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel > jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, > varying only the hardware queue length per test. > > queue size 2 4 8 16 32 64 > 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) > patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) > > The following is a similar experiment, ran against a nullb with a single > bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 > parallel fio jobs operating on the same device > > queue size 2 4 8 16 32 64 > 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) > patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) What's the queue depth of these devices? That's the interesting question here, as it'll tell us if any of these are actually hitting the slower path where you made changes. I suspect you are for the second set of numbers, but not for the first one? Anything that isn't hitting the wait path for tags isn't a very useful test, as I would not expect any changes there.
Jens Axboe <axboe@kernel.dk> writes: > On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote: >> Performance-wise, one should expect very similar performance to the >> original algorithm for the case where there is no queueing. In both the >> old algorithm and this implementation, the first thing is to check >> ws_active, which bails out if there is no queueing to be managed. In the >> new code, we took care to avoid accounting completions and wakeups when >> there is no queueing, to not pay the cost of atomic operations >> unnecessarily, since it doesn't skew the numbers. >> >> For more interesting cases, where there is queueing, we need to take >> into account the cross-communication of the atomic operations. I've >> been benchmarking by running parallel fio jobs against a single hctx >> nullb in different hardware queue depth scenarios, and verifying both >> IOPS and queueing. >> >> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel >> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, >> varying only the hardware queue length per test. >> >> queue size 2 4 8 16 32 64 >> 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) >> patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) >> >> The following is a similar experiment, ran against a nullb with a single >> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 >> parallel fio jobs operating on the same device >> >> queue size 2 4 8 16 32 64 >> 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) >> patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) > > What's the queue depth of these devices? That's the interesting question > here, as it'll tell us if any of these are actually hitting the slower > path where you made changes. > Hi Jens, The hardware queue depth is a parameter being varied in this experiment. Each column of the tables has a different queue depth. Its value is the first line (queue size) of both tables. For instance, looking at the first table, for a device with hardware queue depth=2, 6.1-rc2 gave 1681K IOPS and the patched version gave 1721.8K IOPS. As mentioned, I monitored the size of the sbitmap wqs during the benchmark execution to confirm it was indeed hitting the slow path and queueing. Indeed, I observed less queueing on higher QDs (16,32) and even less for QD=64. For QD<=8, there was extensive queueing present throughout the execution. I should provide the queue size over time alongside the latency numbers. I have to rerun the benchmarks already to collect the information Chaitanya requested. > I suspect you are for the second set of numbers, but not for the first > one? No. both tables show some level of queueing. The shared bitmap in table 2 surely has way more intensive queueing, though. > Anything that isn't hitting the wait path for tags isn't a very useful > test, as I would not expect any changes there. Even when there is less to no queueing (QD=64 in this data), we still enter sbitmap_queue_wake_up and bail out on the first line !wait_active. This is why I think it is important to include QD=64 here. it is less interesting data, as I mentioned, but it shows no regressions of the faspath. Thanks,
On 11/9/22 3:48 PM, Gabriel Krisman Bertazi wrote: > Jens Axboe <axboe@kernel.dk> writes: > >> On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote: >>> Performance-wise, one should expect very similar performance to the >>> original algorithm for the case where there is no queueing. In both the >>> old algorithm and this implementation, the first thing is to check >>> ws_active, which bails out if there is no queueing to be managed. In the >>> new code, we took care to avoid accounting completions and wakeups when >>> there is no queueing, to not pay the cost of atomic operations >>> unnecessarily, since it doesn't skew the numbers. >>> >>> For more interesting cases, where there is queueing, we need to take >>> into account the cross-communication of the atomic operations. I've >>> been benchmarking by running parallel fio jobs against a single hctx >>> nullb in different hardware queue depth scenarios, and verifying both >>> IOPS and queueing. >>> >>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel >>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, >>> varying only the hardware queue length per test. >>> >>> queue size 2 4 8 16 32 64 >>> 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) >>> patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) >>> >>> The following is a similar experiment, ran against a nullb with a single >>> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 >>> parallel fio jobs operating on the same device >>> >>> queue size 2 4 8 16 32 64 >>> 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) >>> patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) >> >> What's the queue depth of these devices? That's the interesting question >> here, as it'll tell us if any of these are actually hitting the slower >> path where you made changes. >> > > Hi Jens, > > The hardware queue depth is a parameter being varied in this experiment. > Each column of the tables has a different queue depth. Its value is the > first line (queue size) of both tables. For instance, looking at the > first table, for a device with hardware queue depth=2, 6.1-rc2 gave > 1681K IOPS and the patched version gave 1721.8K IOPS. > > As mentioned, I monitored the size of the sbitmap wqs during the > benchmark execution to confirm it was indeed hitting the slow path and > queueing. Indeed, I observed less queueing on higher QDs (16,32) and > even less for QD=64. For QD<=8, there was extensive queueing present > throughout the execution. Gotcha, this makes a lot of sense. I just misunderstood the queue size numbers to be queue depth on the IO generation side. Any idea what is reducing performance at the higher end of queue size spectrum? Not that I think it's _really_ that interesting, just curious. > I should provide the queue size over time alongside the latency numbers. > I have to rerun the benchmarks already to collect the information > Chaitanya requested. Thanks. >> I suspect you are for the second set of numbers, but not for the first >> one? > > No. both tables show some level of queueing. The shared bitmap in > table 2 surely has way more intensive queueing, though. Yep, got it now. >> Anything that isn't hitting the wait path for tags isn't a very useful >> test, as I would not expect any changes there. 
> > Even when there is less to no queueing (QD=64 in this data), we still > enter sbitmap_queue_wake_up and bail out on the first line > !wait_active. This is why I think it is important to include QD=64 > here. it is less interesting data, as I mentioned, but it shows no > regressions of the faspath. Yes, the checking for whether we need to hit the slow path should be fast. I will try and run some numbers here too for the pure fast path, no sleeping at all. For most fast storage this is where we'll be for the majority of the time, queue depth on NVMe is generally pretty deep. FWIW, I did run my usual quick peak testing and didn't see any regressions there. I'll do a quick per-core benchmark tomorrow, that's more sensitive to cycle issues. But I like the change in general, and I think your numbers look sane. Would be nice to turn sbitmap queueing into something that's actually sane.
Hi, 在 2022/11/06 7:10, Gabriel Krisman Bertazi 写道: > sbitmap suffers from code complexity, as demonstrated by recent fixes, > and eventual lost wake ups on nested I/O completion. The later happens, > from what I understand, due to the non-atomic nature of the updates to > wait_cnt, which needs to be subtracted and eventually reset when equal > to zero. This two step process can eventually miss an update when a > nested completion happens to interrupt the CPU in between the wait_cnt > updates. This is very hard to fix, as shown by the recent changes to > this code. > > The code complexity arises mostly from the corner cases to avoid missed > wakes in this scenario. In addition, the handling of wake_batch > recalculation plus the synchronization with sbq_queue_wake_up is > non-trivial. > > This patchset implements the idea originally proposed by Jan [1], which > removes the need for the two-step updates of wait_cnt. This is done by > tracking the number of completions and wakeups in always increasing, > per-bitmap counters. Instead of having to reset the wait_cnt when it > reaches zero, we simply keep counting, and attempt to wake up N threads > in a single wait queue whenever there is enough space for a batch. > Waking up less than batch_wake shouldn't be a problem, because we > haven't changed the conditions for wake up, and the existing batch > calculation guarantees at least enough remaining completions to wake up > a batch for each queue at any time. > > Performance-wise, one should expect very similar performance to the > original algorithm for the case where there is no queueing. In both the > old algorithm and this implementation, the first thing is to check > ws_active, which bails out if there is no queueing to be managed. In the > new code, we took care to avoid accounting completions and wakeups when > there is no queueing, to not pay the cost of atomic operations > unnecessarily, since it doesn't skew the numbers. > > For more interesting cases, where there is queueing, we need to take > into account the cross-communication of the atomic operations. I've > been benchmarking by running parallel fio jobs against a single hctx > nullb in different hardware queue depth scenarios, and verifying both > IOPS and queueing. > > Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel > jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, > varying only the hardware queue length per test. > > queue size 2 4 8 16 32 64 > 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) > patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) > > The following is a similar experiment, ran against a nullb with a single > bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 > parallel fio jobs operating on the same device > > queue size 2 4 8 16 32 64 > 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) > patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) > > It has also survived blktests and a 12h-stress run against nullb. I also > ran the code against nvme and a scsi SSD, and I didn't observe > performance regression in those. If there are other tests you think I > should run, please let me know and I will follow up with results. 
> > [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/ > > Cc: Hugh Dickins <hughd@google.com> > Cc: Keith Busch <kbusch@kernel.org> > Cc: Liu Song <liusong@linux.alibaba.com> > Suggested-by: Jan Kara <jack@suse.cz> > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> > --- > include/linux/sbitmap.h | 16 ++++-- > lib/sbitmap.c | 122 +++++++++------------------------------- > 2 files changed, 37 insertions(+), 101 deletions(-) > > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h > index 4d2d5205ab58..d662cf136021 100644 > --- a/include/linux/sbitmap.h > +++ b/include/linux/sbitmap.h > @@ -86,11 +86,6 @@ struct sbitmap { > * struct sbq_wait_state - Wait queue in a &struct sbitmap_queue. > */ > struct sbq_wait_state { > - /** > - * @wait_cnt: Number of frees remaining before we wake up. > - */ > - atomic_t wait_cnt; > - > /** > * @wait: Wait queue. > */ > @@ -138,6 +133,17 @@ struct sbitmap_queue { > * sbitmap_queue_get_shallow() > */ > unsigned int min_shallow_depth; > + > + /** > + * @completion_cnt: Number of bits cleared passed to the > + * wakeup function. > + */ > + atomic_t completion_cnt; > + > + /** > + * @wakeup_cnt: Number of thread wake ups issued. > + */ > + atomic_t wakeup_cnt; > }; > > /** > diff --git a/lib/sbitmap.c b/lib/sbitmap.c > index 7280ae8ca88c..eca462cba398 100644 > --- a/lib/sbitmap.c > +++ b/lib/sbitmap.c > @@ -434,6 +434,8 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth, > sbq->wake_batch = sbq_calc_wake_batch(sbq, depth); > atomic_set(&sbq->wake_index, 0); > atomic_set(&sbq->ws_active, 0); > + atomic_set(&sbq->completion_cnt, 0); > + atomic_set(&sbq->wakeup_cnt, 0); > > sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node); > if (!sbq->ws) { > @@ -441,40 +443,21 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth, > return -ENOMEM; > } > > - for (i = 0; i < SBQ_WAIT_QUEUES; i++) { > + for (i = 0; i < SBQ_WAIT_QUEUES; i++) > init_waitqueue_head(&sbq->ws[i].wait); > - atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch); > - } > > return 0; > } > EXPORT_SYMBOL_GPL(sbitmap_queue_init_node); > > -static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, > - unsigned int wake_batch) > -{ > - int i; > - > - if (sbq->wake_batch != wake_batch) { > - WRITE_ONCE(sbq->wake_batch, wake_batch); > - /* > - * Pairs with the memory barrier in sbitmap_queue_wake_up() > - * to ensure that the batch size is updated before the wait > - * counts. 
> - */ > - smp_mb(); > - for (i = 0; i < SBQ_WAIT_QUEUES; i++) > - atomic_set(&sbq->ws[i].wait_cnt, 1); > - } > -} > - > static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, > unsigned int depth) > { > unsigned int wake_batch; > > wake_batch = sbq_calc_wake_batch(sbq, depth); > - __sbitmap_queue_update_wake_batch(sbq, wake_batch); > + if (sbq->wake_batch != wake_batch) > + WRITE_ONCE(sbq->wake_batch, wake_batch); > } > > void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, > @@ -488,7 +471,8 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, > > wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, > min_batch, SBQ_WAKE_BATCH); > - __sbitmap_queue_update_wake_batch(sbq, wake_batch); > + > + WRITE_ONCE(sbq->wake_batch, wake_batch); > } > EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch); > > @@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) > for (i = 0; i < SBQ_WAIT_QUEUES; i++) { > struct sbq_wait_state *ws = &sbq->ws[wake_index]; > > - if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { > + if (waitqueue_active(&ws->wait)) { > if (wake_index != atomic_read(&sbq->wake_index)) > atomic_set(&sbq->wake_index, wake_index); > return ws; > @@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) > return NULL; > } > > -static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) > { > - struct sbq_wait_state *ws; > - unsigned int wake_batch; > - int wait_cnt, cur, sub; > - bool ret; > + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); > + struct sbq_wait_state *ws = NULL; > + unsigned int wakeups; > > - if (*nr <= 0) > - return false; > + if (!atomic_read(&sbq->ws_active)) > + return; > > - ws = sbq_wake_ptr(sbq); > - if (!ws) > - return false; > + atomic_add(nr, &sbq->completion_cnt); > + wakeups = atomic_read(&sbq->wakeup_cnt); > > - cur = atomic_read(&ws->wait_cnt); > do { > - /* > - * For concurrent callers of this, callers should call this > - * function again to wakeup a new batch on a different 'ws'. > - */ > - if (cur == 0) > - return true; > - sub = min(*nr, cur); > - wait_cnt = cur - sub; > - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); > - > - /* > - * If we decremented queue without waiters, retry to avoid lost > - * wakeups. > - */ > - if (wait_cnt > 0) > - return !waitqueue_active(&ws->wait); > + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) > + return; Should it be considered that completion_cnt overflow and becomes negtive? Thanks, Kuai > > - *nr -= sub; > - > - /* > - * When wait_cnt == 0, we have to be particularly careful as we are > - * responsible to reset wait_cnt regardless whether we've actually > - * woken up anybody. But in case we didn't wakeup anybody, we still > - * need to retry. > - */ > - ret = !waitqueue_active(&ws->wait); > - wake_batch = READ_ONCE(sbq->wake_batch); > + if (!ws) { > + ws = sbq_wake_ptr(sbq); > + if (!ws) > + return; > + } > + } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt, > + &wakeups, wakeups + wake_batch)); > > - /* > - * Wake up first in case that concurrent callers decrease wait_cnt > - * while waitqueue is empty. > - */ > wake_up_nr(&ws->wait, wake_batch); > - > - /* > - * Pairs with the memory barrier in sbitmap_queue_resize() to > - * ensure that we see the batch size update before the wait > - * count is reset. 
> - * > - * Also pairs with the implicit barrier between decrementing wait_cnt > - * and checking for waitqueue_active() to make sure waitqueue_active() > - * sees result of the wakeup if atomic_dec_return() has seen the result > - * of atomic_set(). > - */ > - smp_mb__before_atomic(); > - > - /* > - * Increase wake_index before updating wait_cnt, otherwise concurrent > - * callers can see valid wait_cnt in old waitqueue, which can cause > - * invalid wakeup on the old waitqueue. > - */ > - sbq_index_atomic_inc(&sbq->wake_index); > - atomic_set(&ws->wait_cnt, wake_batch); > - > - return ret || *nr; > -} > - > -void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) > -{ > - while (__sbq_wake_up(sbq, &nr)) > - ; > } > EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up); > > @@ -792,9 +724,7 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m) > seq_puts(m, "ws={\n"); > for (i = 0; i < SBQ_WAIT_QUEUES; i++) { > struct sbq_wait_state *ws = &sbq->ws[i]; > - > - seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n", > - atomic_read(&ws->wait_cnt), > + seq_printf(m, "\t{.wait=%s},\n", > waitqueue_active(&ws->wait) ? "active" : "inactive"); > } > seq_puts(m, "}\n"); >
Hi! On Thu 10-11-22 17:42:49, Yu Kuai wrote: > 在 2022/11/06 7:10, Gabriel Krisman Bertazi 写道: > > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) > > { > > - struct sbq_wait_state *ws; > > - unsigned int wake_batch; > > - int wait_cnt, cur, sub; > > - bool ret; > > + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); > > + struct sbq_wait_state *ws = NULL; > > + unsigned int wakeups; > > - if (*nr <= 0) > > - return false; > > + if (!atomic_read(&sbq->ws_active)) > > + return; > > - ws = sbq_wake_ptr(sbq); > > - if (!ws) > > - return false; > > + atomic_add(nr, &sbq->completion_cnt); > > + wakeups = atomic_read(&sbq->wakeup_cnt); > > - cur = atomic_read(&ws->wait_cnt); > > do { > > - /* > > - * For concurrent callers of this, callers should call this > > - * function again to wakeup a new batch on a different 'ws'. > > - */ > > - if (cur == 0) > > - return true; > > - sub = min(*nr, cur); > > - wait_cnt = cur - sub; > > - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); > > - > > - /* > > - * If we decremented queue without waiters, retry to avoid lost > > - * wakeups. > > - */ > > - if (wait_cnt > 0) > > - return !waitqueue_active(&ws->wait); > > + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) > > + return; > > Should it be considered that completion_cnt overflow and becomes > negtive? Yes, the counters can (and will) certainly overflow but since we only care about (completion_cnt - wakeups), we should be fine - this number is always sane (and relatively small) and in the kernel we do compile with signed overflows being well defined. Honza
Hi, 在 2022/11/10 19:16, Jan Kara 写道: > Hi! > > On Thu 10-11-22 17:42:49, Yu Kuai wrote: >> 在 2022/11/06 7:10, Gabriel Krisman Bertazi 写道: >>> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) >>> { >>> - struct sbq_wait_state *ws; >>> - unsigned int wake_batch; >>> - int wait_cnt, cur, sub; >>> - bool ret; >>> + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); >>> + struct sbq_wait_state *ws = NULL; >>> + unsigned int wakeups; >>> - if (*nr <= 0) >>> - return false; >>> + if (!atomic_read(&sbq->ws_active)) >>> + return; >>> - ws = sbq_wake_ptr(sbq); >>> - if (!ws) >>> - return false; >>> + atomic_add(nr, &sbq->completion_cnt); >>> + wakeups = atomic_read(&sbq->wakeup_cnt); >>> - cur = atomic_read(&ws->wait_cnt); >>> do { >>> - /* >>> - * For concurrent callers of this, callers should call this >>> - * function again to wakeup a new batch on a different 'ws'. >>> - */ >>> - if (cur == 0) >>> - return true; >>> - sub = min(*nr, cur); >>> - wait_cnt = cur - sub; >>> - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); >>> - >>> - /* >>> - * If we decremented queue without waiters, retry to avoid lost >>> - * wakeups. >>> - */ >>> - if (wait_cnt > 0) >>> - return !waitqueue_active(&ws->wait); >>> + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) >>> + return; >> >> Should it be considered that completion_cnt overflow and becomes >> negtive? > > Yes, the counters can (and will) certainly overflow but since we only care > about (completion_cnt - wakeups), we should be fine - this number is always > sane (and relatively small) and in the kernel we do compile with signed > overflows being well defined. I'm worried about this: for example, the extreme scenaro that there is only one tag, currently there are only one infight rq and one thread is waiting for tag. When the infight rq complete, if 'completion_cnt' overflow to negative, then 'atomic_read(&sbq->completion_cnt) - wakeups < wake_batch' will be passed unexpected, then will the thread never be woken up if there are no new io issued ? Thanks, Kuai > > Honza >
Hi! On Thu 10-11-22 21:18:19, Yu Kuai wrote: > 在 2022/11/10 19:16, Jan Kara 写道: > > Hi! > > > > On Thu 10-11-22 17:42:49, Yu Kuai wrote: > > > 在 2022/11/06 7:10, Gabriel Krisman Bertazi 写道: > > > > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) > > > > { > > > > - struct sbq_wait_state *ws; > > > > - unsigned int wake_batch; > > > > - int wait_cnt, cur, sub; > > > > - bool ret; > > > > + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); > > > > + struct sbq_wait_state *ws = NULL; > > > > + unsigned int wakeups; > > > > - if (*nr <= 0) > > > > - return false; > > > > + if (!atomic_read(&sbq->ws_active)) > > > > + return; > > > > - ws = sbq_wake_ptr(sbq); > > > > - if (!ws) > > > > - return false; > > > > + atomic_add(nr, &sbq->completion_cnt); > > > > + wakeups = atomic_read(&sbq->wakeup_cnt); > > > > - cur = atomic_read(&ws->wait_cnt); > > > > do { > > > > - /* > > > > - * For concurrent callers of this, callers should call this > > > > - * function again to wakeup a new batch on a different 'ws'. > > > > - */ > > > > - if (cur == 0) > > > > - return true; > > > > - sub = min(*nr, cur); > > > > - wait_cnt = cur - sub; > > > > - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); > > > > - > > > > - /* > > > > - * If we decremented queue without waiters, retry to avoid lost > > > > - * wakeups. > > > > - */ > > > > - if (wait_cnt > 0) > > > > - return !waitqueue_active(&ws->wait); > > > > + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) > > > > + return; > > > > > > Should it be considered that completion_cnt overflow and becomes > > > negtive? > > > > Yes, the counters can (and will) certainly overflow but since we only care > > about (completion_cnt - wakeups), we should be fine - this number is always > > sane (and relatively small) and in the kernel we do compile with signed > > overflows being well defined. > > I'm worried about this: for example, the extreme scenaro that there > is only one tag, currently there are only one infight rq and one thread > is waiting for tag. When the infight rq complete, if 'completion_cnt' > overflow to negative, then 'atomic_read(&sbq->completion_cnt) - wakeups > < wake_batch' will be passed unexpected, then will the thread never be > woken up if there are no new io issued ? Well but my point is that 'wakeups' is staying close to completion_cnt. So if completion_cnt wraps to INT_MIN, then 'wakeups' is close to INT_MAX and so completion_cnt - wakeups is going to wrap back and still result in a small number. That is simply how wrapping arithmetics works... Honza
Hi 在 2022/11/10 23:35, Jan Kara 写道: > Hi! > > On Thu 10-11-22 21:18:19, Yu Kuai wrote: >> 在 2022/11/10 19:16, Jan Kara 写道: >>> Hi! >>> >>> On Thu 10-11-22 17:42:49, Yu Kuai wrote: >>>> 在 2022/11/06 7:10, Gabriel Krisman Bertazi 写道: >>>>> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) >>>>> { >>>>> - struct sbq_wait_state *ws; >>>>> - unsigned int wake_batch; >>>>> - int wait_cnt, cur, sub; >>>>> - bool ret; >>>>> + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); >>>>> + struct sbq_wait_state *ws = NULL; >>>>> + unsigned int wakeups; >>>>> - if (*nr <= 0) >>>>> - return false; >>>>> + if (!atomic_read(&sbq->ws_active)) >>>>> + return; >>>>> - ws = sbq_wake_ptr(sbq); >>>>> - if (!ws) >>>>> - return false; >>>>> + atomic_add(nr, &sbq->completion_cnt); >>>>> + wakeups = atomic_read(&sbq->wakeup_cnt); >>>>> - cur = atomic_read(&ws->wait_cnt); >>>>> do { >>>>> - /* >>>>> - * For concurrent callers of this, callers should call this >>>>> - * function again to wakeup a new batch on a different 'ws'. >>>>> - */ >>>>> - if (cur == 0) >>>>> - return true; >>>>> - sub = min(*nr, cur); >>>>> - wait_cnt = cur - sub; >>>>> - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); >>>>> - >>>>> - /* >>>>> - * If we decremented queue without waiters, retry to avoid lost >>>>> - * wakeups. >>>>> - */ >>>>> - if (wait_cnt > 0) >>>>> - return !waitqueue_active(&ws->wait); >>>>> + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) >>>>> + return; >>>> >>>> Should it be considered that completion_cnt overflow and becomes >>>> negtive? >>> >>> Yes, the counters can (and will) certainly overflow but since we only care >>> about (completion_cnt - wakeups), we should be fine - this number is always >>> sane (and relatively small) and in the kernel we do compile with signed >>> overflows being well defined. >> >> I'm worried about this: for example, the extreme scenaro that there >> is only one tag, currently there are only one infight rq and one thread >> is waiting for tag. When the infight rq complete, if 'completion_cnt' >> overflow to negative, then 'atomic_read(&sbq->completion_cnt) - wakeups >> < wake_batch' will be passed unexpected, then will the thread never be >> woken up if there are no new io issued ? > > Well but my point is that 'wakeups' is staying close to completion_cnt. So > if completion_cnt wraps to INT_MIN, then 'wakeups' is close to INT_MAX and > so completion_cnt - wakeups is going to wrap back and still result in a > small number. That is simply how wrapping arithmetics works... Yes, you're right, I'm being foolish here. 😆 Thanks, Kuai > > Honza >
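A small userspace illustration of the wrap-around argument above: only the
difference between the two counters is ever used, and that difference stays
small and correct across overflow. This uses plain unsigned arithmetic; the
kernel code mixes int and unsigned int and, as Jan notes, is built with signed
overflow well defined.

	#include <stdio.h>
	#include <limits.h>

	int main(void)
	{
	        /* completion_cnt has just wrapped past INT_MAX, wakeups has not. */
	        unsigned int completion_cnt = (unsigned int)INT_MAX + 3;
	        unsigned int wakeups = INT_MAX - 4;

	        /* Unsigned subtraction wraps back to the small, meaningful
	         * difference: this prints 7. */
	        printf("%u\n", completion_cnt - wakeups);
	        return 0;
	}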
On Sat, 5 Nov 2022 19:10:55 -0400, Gabriel Krisman Bertazi wrote: > sbitmap suffers from code complexity, as demonstrated by recent fixes, > and eventual lost wake ups on nested I/O completion. The later happens, > from what I understand, due to the non-atomic nature of the updates to > wait_cnt, which needs to be subtracted and eventually reset when equal > to zero. This two step process can eventually miss an update when a > nested completion happens to interrupt the CPU in between the wait_cnt > updates. This is very hard to fix, as shown by the recent changes to > this code. > > [...] Applied, thanks! [1/1] sbitmap: Use single per-bitmap counting to wake up queued tags commit: 4f8126bb2308066b877859e4b5923ffb54143630 Best regards,
Gabriel, when looking through this patch, I've noticed we can loose wakeups after your latest simplifications. See below for details: On Sat 05-11-22 19:10:55, Gabriel Krisman Bertazi wrote: > @@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) > for (i = 0; i < SBQ_WAIT_QUEUES; i++) { > struct sbq_wait_state *ws = &sbq->ws[wake_index]; > > - if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { > + if (waitqueue_active(&ws->wait)) { > if (wake_index != atomic_read(&sbq->wake_index)) > atomic_set(&sbq->wake_index, wake_index); > return ws; Neither sbq_wake_ptr() nor sbitmap_queue_wake_up() now increment the wake_index after performing the wakeup. Thus we would effectively keep waking tasks from a single waitqueue until it becomes empty and only then go for the next waitqueue. This creates unnecessary unfairness in task wakeups and even possible starvation issues. So I think we need to advance wake_index somewhere. Perhaps here before returning waitqueue. > @@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) > return NULL; > } > > -static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) > { > + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); > + struct sbq_wait_state *ws = NULL; > + unsigned int wakeups; > > + if (!atomic_read(&sbq->ws_active)) > + return; > > + atomic_add(nr, &sbq->completion_cnt); > + wakeups = atomic_read(&sbq->wakeup_cnt); > > do { > + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) > + return; > > + if (!ws) { > + ws = sbq_wake_ptr(sbq); > + if (!ws) > + return; > + } > + } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt, > + &wakeups, wakeups + wake_batch)); > > wake_up_nr(&ws->wait, wake_batch); Now this may be also problematic - when we were checking the number of woken waiters in the older version of the patch (for others: internal version of the patch) this was fine but now it may happen that the 'ws' we have selected has no waiters anymore. And in that case we need to find another waitqueue because otherwise we'd be loosing too many wakeups and we could deadlock. So I think this rather needs to be something like: do { if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) return; } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt, &wakeups, wakeups + wake_batch)); do { ws = sbq_wake_ptr(sbq); if (!ws) return; } while (!wake_up_nr(&ws->wait, wake_batch)); with our original version of wake_up_nr() returning number of woken waiters. What do you think? Honza
Jan Kara <jack@suse.cz> writes:

> Now this may be also problematic - when we were checking the number of woken
> waiters in the older version of the patch (for others: internal version of
> the patch) this was fine but now it may happen that the 'ws' we have
> selected has no waiters anymore. And in that case we need to find another
> waitqueue because otherwise we'd be loosing too many wakeups and we could
> deadlock. So I think this rather needs to be something like:
>
> 	do {
> 		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> 			return;
> 	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
> 				     &wakeups, wakeups + wake_batch));
>
> 	do {
> 		ws = sbq_wake_ptr(sbq);
> 		if (!ws)
> 			return;

Jan,

Does this really solve it? There is no guarantee there will be another
waiter in the queues when we check here. So, once again we could fail to
wake up anyone and return in this if leg. If that is the case, don't we
end up overshooting wakeups and end up again with fewer completions than
required to wake up an incoming io?
On Mon 14-11-22 22:52:32, Gabriel Krisman Bertazi wrote: > Jan Kara <jack@suse.cz> writes: > > > Now this may be also problematic - when we were checking the number of woken > > waiters in the older version of the patch (for others: internal version of > > the patch) this was fine but now it may happen that the 'ws' we have > > selected has no waiters anymore. And in that case we need to find another > > waitqueue because otherwise we'd be loosing too many wakeups and we could > > deadlock. So I think this rather needs to be something like: > > > > do { > > if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) > > return; > > } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt, > > &wakeups, wakeups + wake_batch)); > > > > do { > > ws = sbq_wake_ptr(sbq); > > if (!ws) > > return; > > Does this really solve it? There is no guarantee there will be another > waiter in the queues when we check here. So, once again we could not > wake up anyone and return it this if leg. If that is the case, don't we > end up overshooting wakeups and end up again with less completions than > required to wake up an incoming io? Well, if we don't find any waiter in any of the wait queues, it sure does not matter we just discard wakeups? And sure, these checks are racy as the waiters can be constantly added but the invariant is: if some waiter is added after the atomic_try_cmpxchg(), then all tags are used so as many completions as there are tags are coming. So that is enough to wake the new waiter (due to batch size). So all what matters is: If there's any waiter in the waitqueue by the time atomic_try_cmpxchg() is called, we'll consider it in the wakeup loop. And that is fulfilled by the code AFAICT. Honza
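Putting Jan's two suggestions together, the tail of sbitmap_queue_wake_up()
would look roughly as below. This is a sketch of the proposal in this
sub-thread, not the code that was merged; it assumes a modified wake_up_nr()
that reports whether it woke anyone (the stock helper doesn't) and an
sbq_wake_ptr() that advances sbq->wake_index before returning a waitqueue.

	atomic_add(nr, &sbq->completion_cnt);
	wakeups = atomic_read(&sbq->wakeup_cnt);

	/* First claim a batch of wake-ups... */
	do {
	        if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
	                return;
	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
	                             &wakeups, wakeups + wake_batch));

	/* ...then keep looking for a waitqueue that still has waiters, so
	 * the claimed batch is not silently dropped. */
	do {
	        ws = sbq_wake_ptr(sbq);
	        if (!ws)
	                return;
	} while (!wake_up_nr(&ws->wait, wake_batch));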
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h index 4d2d5205ab58..d662cf136021 100644 --- a/include/linux/sbitmap.h +++ b/include/linux/sbitmap.h @@ -86,11 +86,6 @@ struct sbitmap { * struct sbq_wait_state - Wait queue in a &struct sbitmap_queue. */ struct sbq_wait_state { - /** - * @wait_cnt: Number of frees remaining before we wake up. - */ - atomic_t wait_cnt; - /** * @wait: Wait queue. */ @@ -138,6 +133,17 @@ struct sbitmap_queue { * sbitmap_queue_get_shallow() */ unsigned int min_shallow_depth; + + /** + * @completion_cnt: Number of bits cleared passed to the + * wakeup function. + */ + atomic_t completion_cnt; + + /** + * @wakeup_cnt: Number of thread wake ups issued. + */ + atomic_t wakeup_cnt; }; /** diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 7280ae8ca88c..eca462cba398 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -434,6 +434,8 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth, sbq->wake_batch = sbq_calc_wake_batch(sbq, depth); atomic_set(&sbq->wake_index, 0); atomic_set(&sbq->ws_active, 0); + atomic_set(&sbq->completion_cnt, 0); + atomic_set(&sbq->wakeup_cnt, 0); sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node); if (!sbq->ws) { @@ -441,40 +443,21 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth, return -ENOMEM; } - for (i = 0; i < SBQ_WAIT_QUEUES; i++) { + for (i = 0; i < SBQ_WAIT_QUEUES; i++) init_waitqueue_head(&sbq->ws[i].wait); - atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch); - } return 0; } EXPORT_SYMBOL_GPL(sbitmap_queue_init_node); -static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, - unsigned int wake_batch) -{ - int i; - - if (sbq->wake_batch != wake_batch) { - WRITE_ONCE(sbq->wake_batch, wake_batch); - /* - * Pairs with the memory barrier in sbitmap_queue_wake_up() - * to ensure that the batch size is updated before the wait - * counts. 
- */ - smp_mb(); - for (i = 0; i < SBQ_WAIT_QUEUES; i++) - atomic_set(&sbq->ws[i].wait_cnt, 1); - } -} - static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, unsigned int depth) { unsigned int wake_batch; wake_batch = sbq_calc_wake_batch(sbq, depth); - __sbitmap_queue_update_wake_batch(sbq, wake_batch); + if (sbq->wake_batch != wake_batch) + WRITE_ONCE(sbq->wake_batch, wake_batch); } void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, @@ -488,7 +471,8 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, min_batch, SBQ_WAKE_BATCH); - __sbitmap_queue_update_wake_batch(sbq, wake_batch); + + WRITE_ONCE(sbq->wake_batch, wake_batch); } EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch); @@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) for (i = 0; i < SBQ_WAIT_QUEUES; i++) { struct sbq_wait_state *ws = &sbq->ws[wake_index]; - if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { + if (waitqueue_active(&ws->wait)) { if (wake_index != atomic_read(&sbq->wake_index)) atomic_set(&sbq->wake_index, wake_index); return ws; @@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) return NULL; } -static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) { - struct sbq_wait_state *ws; - unsigned int wake_batch; - int wait_cnt, cur, sub; - bool ret; + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); + struct sbq_wait_state *ws = NULL; + unsigned int wakeups; - if (*nr <= 0) - return false; + if (!atomic_read(&sbq->ws_active)) + return; - ws = sbq_wake_ptr(sbq); - if (!ws) - return false; + atomic_add(nr, &sbq->completion_cnt); + wakeups = atomic_read(&sbq->wakeup_cnt); - cur = atomic_read(&ws->wait_cnt); do { - /* - * For concurrent callers of this, callers should call this - * function again to wakeup a new batch on a different 'ws'. - */ - if (cur == 0) - return true; - sub = min(*nr, cur); - wait_cnt = cur - sub; - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); - - /* - * If we decremented queue without waiters, retry to avoid lost - * wakeups. - */ - if (wait_cnt > 0) - return !waitqueue_active(&ws->wait); + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) + return; - *nr -= sub; - - /* - * When wait_cnt == 0, we have to be particularly careful as we are - * responsible to reset wait_cnt regardless whether we've actually - * woken up anybody. But in case we didn't wakeup anybody, we still - * need to retry. - */ - ret = !waitqueue_active(&ws->wait); - wake_batch = READ_ONCE(sbq->wake_batch); + if (!ws) { + ws = sbq_wake_ptr(sbq); + if (!ws) + return; + } + } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt, + &wakeups, wakeups + wake_batch)); - /* - * Wake up first in case that concurrent callers decrease wait_cnt - * while waitqueue is empty. - */ wake_up_nr(&ws->wait, wake_batch); - - /* - * Pairs with the memory barrier in sbitmap_queue_resize() to - * ensure that we see the batch size update before the wait - * count is reset. - * - * Also pairs with the implicit barrier between decrementing wait_cnt - * and checking for waitqueue_active() to make sure waitqueue_active() - * sees result of the wakeup if atomic_dec_return() has seen the result - * of atomic_set(). 
- */ - smp_mb__before_atomic(); - - /* - * Increase wake_index before updating wait_cnt, otherwise concurrent - * callers can see valid wait_cnt in old waitqueue, which can cause - * invalid wakeup on the old waitqueue. - */ - sbq_index_atomic_inc(&sbq->wake_index); - atomic_set(&ws->wait_cnt, wake_batch); - - return ret || *nr; -} - -void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) -{ - while (__sbq_wake_up(sbq, &nr)) - ; } EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up); @@ -792,9 +724,7 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m) seq_puts(m, "ws={\n"); for (i = 0; i < SBQ_WAIT_QUEUES; i++) { struct sbq_wait_state *ws = &sbq->ws[i]; - - seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n", - atomic_read(&ws->wait_cnt), + seq_printf(m, "\t{.wait=%s},\n", waitqueue_active(&ws->wait) ? "active" : "inactive"); } seq_puts(m, "}\n");