sbitmap: Use single per-bitmap counting to wake up queued tags

Message ID 20221105231055.25953-1-krisman@suse.de
State New
Series sbitmap: Use single per-bitmap counting to wake up queued tags

Commit Message

Gabriel Krisman Bertazi Nov. 5, 2022, 11:10 p.m. UTC
  sbitmap suffers from code complexity, as demonstrated by recent fixes,
and occasional lost wakeups on nested I/O completion.  The latter happens,
from what I understand, due to the non-atomic nature of the updates to
wait_cnt, which needs to be subtracted and eventually reset when equal
to zero.  This two-step process can occasionally miss an update when a
nested completion happens to interrupt the CPU in between the wait_cnt
updates.  This is very hard to fix, as shown by the recent changes to
this code.

The code complexity arises mostly from the corner cases to avoid missed
wakes in this scenario.  In addition, the handling of wake_batch
recalculation plus the synchronization with sbitmap_queue_wake_up is
non-trivial.

This patchset implements the idea originally proposed by Jan [1], which
removes the need for the two-step updates of wait_cnt.  This is done by
tracking the number of completions and wakeups in always-increasing,
per-bitmap counters.  Instead of having to reset the wait_cnt when it
reaches zero, we simply keep counting, and attempt to wake up N threads
in a single wait queue whenever there is enough space for a batch.
Waking up fewer than wake_batch threads shouldn't be a problem, because
we haven't changed the conditions for wake up, and the existing batch
calculation guarantees at least enough remaining completions to wake up
a batch for each queue at any time.
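
The counting scheme itself is small enough to summarize in a few lines.
Below is a simplified userspace model using C11 atomics; it is only a
sketch of the invariant, not the kernel code (that is in the patch
below).  The names completion_cnt, wakeup_cnt and wake_batch mirror the
patch, while struct wake_counters and the helper batch_to_wake() are
hypothetical (and use unsigned counters, whereas the kernel uses
atomic_t; the comparison in the kernel is done in unsigned arithmetic
anyway):

	#include <stdatomic.h>

	struct wake_counters {
		atomic_uint completion_cnt;	/* bits freed so far; only grows */
		atomic_uint wakeup_cnt;		/* wakeups issued so far; only grows */
		unsigned int wake_batch;
	};

	/*
	 * Called with the number of bits just freed.  Returns how many
	 * waiters should be woken now: either 0 or a full batch.  There is
	 * no reset step; both counters keep growing and only their
	 * difference matters.
	 */
	static unsigned int batch_to_wake(struct wake_counters *wc, unsigned int nr)
	{
		unsigned int batch = wc->wake_batch;
		unsigned int wakeups;

		atomic_fetch_add(&wc->completion_cnt, nr);
		wakeups = atomic_load(&wc->wakeup_cnt);

		do {
			/* unsigned subtraction: well defined across wraparound */
			if (atomic_load(&wc->completion_cnt) - wakeups < batch)
				return 0;
		} while (!atomic_compare_exchange_weak(&wc->wakeup_cnt, &wakeups,
						       wakeups + batch));

		return batch;	/* the kernel then does wake_up_nr(&ws->wait, batch) */
	}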

Performance-wise, one should expect very similar performance to the
original algorithm for the case where there is no queueing.  In both the
old algorithm and this implementation, the first thing done is to check
ws_active and bail out if there is no queueing to be managed.  In the
new code, we took care to avoid accounting completions and wakeups when
there is no queueing, so that the cost of the atomic operations is not
paid unnecessarily and does not skew the numbers.
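
Concretely, in the patched sbitmap_queue_wake_up() (excerpted from the
patch below), the uncontended path is a single read of ws_active, and
the counters are only touched once we know there are waiters:

	if (!atomic_read(&sbq->ws_active))
		return;

	atomic_add(nr, &sbq->completion_cnt);
	wakeups = atomic_read(&sbq->wakeup_cnt);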

For more interesting cases, where there is queueing, we need to take
into account the cross-communication of the atomic operations.  I've
been benchmarking by running parallel fio jobs against a single hctx
nullb in different hardware queue depth scenarios, and verifying both
IOPS and queueing.

Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
varying only the hardware queue length per test.

queue size 2                 4                 8                 16                 32                 64
6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)

The following is a similar experiment, run against a nullb with a single
bitmap shared by 20 hctx spread across 2 NUMA nodes.  This has 40
parallel fio jobs operating on the same device.

queue size 2                 4                 8                 16                 32                 64
6.1-rc2    1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)   12227.9K (37.7K)   13286.6K (92.9K)
patched    1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)

It has also survived blktests and a 12h stress run against nullb.  I also
ran the code against NVMe and a SCSI SSD, and I didn't observe any
performance regression in those.  If there are other tests you think I
should run, please let me know and I will follow up with results.

[1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/

Cc: Hugh Dickins <hughd@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 include/linux/sbitmap.h |  16 ++++--
 lib/sbitmap.c           | 122 +++++++++-------------------------------
 2 files changed, 37 insertions(+), 101 deletions(-)
  

Comments

Chaitanya Kulkarni Nov. 8, 2022, 11:28 p.m. UTC | #1
> For more interesting cases, where there is queueing, we need to take
> into account the cross-communication of the atomic operations.  I've
> been benchmarking by running parallel fio jobs against a single hctx
> nullb in different hardware queue depth scenarios, and verifying both
> IOPS and queueing.
> 
> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
> varying only the hardware queue length per test.
> 
> queue size 2                 4                 8                 16                 32                 64
> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)

> 

So if I understand correctly,
QD 2, 4, 8 shows a clear performance benefit from this patch, whereas
QD 16, 32, 64 shows a drop in performance. Is that correct?

If my observation is correct, then applications with a high QD will
observe a drop in performance?

Also, please share a table with block size/IOPS/BW/CPU (system/user)
/LAT/SLAT with % increase/decrease and document the raw numbers at the
end of the cover letter for completeness, along with the fio job, so
others can repeat the experiment...

> The following is a similar experiment, ran against a nullb with a single
> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
> parallel fio jobs operating on the same device
> 
> queue size 2 	             4                 8              	16             	    32		       64
> 6.1-rc2	   1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K) 	6178.2K (124.6K)    12227.9K (37.7K)   13286.6K (92.9K)
> patched	   1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K) 	6151.4K  (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
> 

same here ...

> It has also survived blktests and a 12h-stress run against nullb. I also
> ran the code against nvme and a scsi SSD, and I didn't observe
> performance regression in those. If there are other tests you think I
> should run, please let me know and I will follow up with results.
> 
-ck
  
Gabriel Krisman Bertazi Nov. 9, 2022, 3:03 a.m. UTC | #2
Chaitanya Kulkarni <chaitanyak@nvidia.com> writes:

>> For more interesting cases, where there is queueing, we need to take
>> into account the cross-communication of the atomic operations.  I've
>> been benchmarking by running parallel fio jobs against a single hctx
>> nullb in different hardware queue depth scenarios, and verifying both
>> IOPS and queueing.
>> 
>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>> varying only the hardware queue length per test.
>> 
>> queue size 2                 4                 8                 16                 32                 64
>> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
>> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
>
>> 

Hi Chaitanya,

Thanks for the feedback.

> So if I understand correctly
> QD 2,4,8 shows clear performance benefit from this patch whereas
> QD 16, 32, 64 shows drop in performance it that correct ?
>
> If my observation is correct then applications with high QD will
> observe drop in the performance ?

To be honest, I'm not sure.  Given the overlap of the standard deviation
(in parentheses) with the mean, I'm not sure the observed drop is
statistically significant. In my prior analysis, I thought it wasn't.

I don't see where a significant difference would come from, to be honest,
because the higher the QD, the more likely it is to go through the
non-contended path, where sbq->ws_active == 0.  This hot path is
identical to the existing implementation.

> Also, please share a table with block size/IOPS/BW/CPU (system/user)
> /LAT/SLAT with % increase/decrease and document the raw numbers at the
> end of the cover-letter for completeness along with fio job to others
> can repeat the experiment...

This was issued against the nullb and the IO size is fixed, matching the
device's block size (512b), which is why I am not tracking BW, only
IOPS.  I'm not sure the BW is still relevant in this scenario.

I'll definitely follow up with CPU time and latencies, and share the
fio job.  I'll also take another look on the significance of the
measured values for high QD.

Thank you,
  
Chaitanya Kulkarni Nov. 9, 2022, 3:35 a.m. UTC | #3
On 11/8/22 19:03, Gabriel Krisman Bertazi wrote:
> Chaitanya Kulkarni <chaitanyak@nvidia.com> writes:
> 
>>> For more interesting cases, where there is queueing, we need to take
>>> into account the cross-communication of the atomic operations.  I've
>>> been benchmarking by running parallel fio jobs against a single hctx
>>> nullb in different hardware queue depth scenarios, and verifying both
>>> IOPS and queueing.
>>>
>>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>>> varying only the hardware queue length per test.
>>>
>>> queue size 2                 4                 8                 16                 32                 64
>>> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
>>> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
>>
>>>
> 
> Hi Chaitanya,
> 
> Thanks for the feedback.
> 
>> So if I understand correctly
>> QD 2,4,8 shows clear performance benefit from this patch whereas
>> QD 16, 32, 64 shows drop in performance it that correct ?
>>
>> If my observation is correct then applications with high QD will
>> observe drop in the performance ?
> 
> To be honest, I'm not sure.  Given the overlap of the standard variation
> (in parenthesis) with the mean, I'm not sure the observed drop is
> statistically significant. In my prior analysis, I thought it wasn't.
> 
> I don't see where a significant difference would come from, to be honest,
> because the higher the QD, the more likely it is  to go through the
> not-contended path, where sbq->ws_active == 0.  This hot path is
> identical to the existing implementation.
> 

The numbers are taken on null_blk, and the drop I see here
may end up being different on real H/W, and I cannot
comment on that since we don't have that data ...

Did you repeat the experiment with the real H/W like NVMe SSD ?

>> Also, please share a table with block size/IOPS/BW/CPU (system/user)
>> /LAT/SLAT with % increase/decrease and document the raw numbers at the
>> end of the cover-letter for completeness along with fio job to others
>> can repeat the experiment...
> 
> This was issued against the nullb and the IO size is fixed, matching the
> device's block size (512b), which is why I am not tracking BW, only
> IOPS.  I'm not sure the BW is still relevant in this scenario.
> 
> I'll definitely follow up with CPU time and latencies, and share the
> fio job.  I'll also take another look on the significance of the
> measured values for high QD.
> 

Yes, please. If CPU usage is way higher, then we need to know that the
above numbers come at the cost of higher CPU; in that case an IOPS per
core / BW per core matrix can be very useful.

-ck
  
Jens Axboe Nov. 9, 2022, 10:06 p.m. UTC | #4
On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
> sbitmap suffers from code complexity, as demonstrated by recent fixes,
> and eventual lost wake ups on nested I/O completion.  The later happens,
> from what I understand, due to the non-atomic nature of the updates to
> wait_cnt, which needs to be subtracted and eventually reset when equal
> to zero.  This two step process can eventually miss an update when a
> nested completion happens to interrupt the CPU in between the wait_cnt
> updates.  This is very hard to fix, as shown by the recent changes to
> this code.
> 
> The code complexity arises mostly from the corner cases to avoid missed
> wakes in this scenario.  In addition, the handling of wake_batch
> recalculation plus the synchronization with sbq_queue_wake_up is
> non-trivial.
> 
> This patchset implements the idea originally proposed by Jan [1], which
> removes the need for the two-step updates of wait_cnt.  This is done by
> tracking the number of completions and wakeups in always increasing,
> per-bitmap counters.  Instead of having to reset the wait_cnt when it
> reaches zero, we simply keep counting, and attempt to wake up N threads
> in a single wait queue whenever there is enough space for a batch.
> Waking up less than batch_wake shouldn't be a problem, because we
> haven't changed the conditions for wake up, and the existing batch
> calculation guarantees at least enough remaining completions to wake up
> a batch for each queue at any time.
> 
> Performance-wise, one should expect very similar performance to the
> original algorithm for the case where there is no queueing.  In both the
> old algorithm and this implementation, the first thing is to check
> ws_active, which bails out if there is no queueing to be managed. In the
> new code, we took care to avoid accounting completions and wakeups when
> there is no queueing, to not pay the cost of atomic operations
> unnecessarily, since it doesn't skew the numbers.
> 
> For more interesting cases, where there is queueing, we need to take
> into account the cross-communication of the atomic operations.  I've
> been benchmarking by running parallel fio jobs against a single hctx
> nullb in different hardware queue depth scenarios, and verifying both
> IOPS and queueing.
> 
> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
> varying only the hardware queue length per test.
> 
> queue size 2                 4                 8                 16                 32                 64
> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
> 
> The following is a similar experiment, ran against a nullb with a single
> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
> parallel fio jobs operating on the same device
> 
> queue size 2 	             4                 8              	16             	    32		       64
> 6.1-rc2	   1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K) 	6178.2K (124.6K)    12227.9K (37.7K)   13286.6K (92.9K)
> patched	   1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K) 	6151.4K  (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)

What's the queue depth of these devices? That's the interesting question
here, as it'll tell us if any of these are actually hitting the slower
path where you made changes. I suspect you are for the second set of
numbers, but not for the first one?

Anything that isn't hitting the wait path for tags isn't a very useful
test, as I would not expect any changes there.
  
Gabriel Krisman Bertazi Nov. 9, 2022, 10:48 p.m. UTC | #5
Jens Axboe <axboe@kernel.dk> writes:

> On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
>> Performance-wise, one should expect very similar performance to the
>> original algorithm for the case where there is no queueing.  In both the
>> old algorithm and this implementation, the first thing is to check
>> ws_active, which bails out if there is no queueing to be managed. In the
>> new code, we took care to avoid accounting completions and wakeups when
>> there is no queueing, to not pay the cost of atomic operations
>> unnecessarily, since it doesn't skew the numbers.
>> 
>> For more interesting cases, where there is queueing, we need to take
>> into account the cross-communication of the atomic operations.  I've
>> been benchmarking by running parallel fio jobs against a single hctx
>> nullb in different hardware queue depth scenarios, and verifying both
>> IOPS and queueing.
>> 
>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>> varying only the hardware queue length per test.
>> 
>> queue size 2                 4                 8                 16                 32                 64
>> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
>> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
>> 
>> The following is a similar experiment, ran against a nullb with a single
>> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
>> parallel fio jobs operating on the same device
>> 
>> queue size 2 	             4                 8              	16             	    32		       64
>> 6.1-rc2	   1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K) 	6178.2K (124.6K)    12227.9K (37.7K)   13286.6K (92.9K)
>> patched	   1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K) 	6151.4K  (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
>
> What's the queue depth of these devices? That's the interesting question
> here, as it'll tell us if any of these are actually hitting the slower
> path where you made changes. 
>

Hi Jens,

The hardware queue depth is a parameter being varied in this experiment.
Each column of the tables has a different queue depth.  Its value is the
first line (queue size) of both tables.  For instance, looking at the
first table, for a device with hardware queue depth=2, 6.1-rc2 gave
1681K IOPS and the patched version gave 1721.8K IOPS.

As mentioned, I monitored the size of the sbitmap wqs during the
benchmark execution to confirm it was indeed hitting the slow path and
queueing.  Indeed, I observed less queueing on higher QDs (16,32) and
even less for QD=64.  For QD<=8, there was extensive queueing present
throughout the execution.

I should provide the queue size over time alongside the latency numbers.
I have to rerun the benchmarks anyway to collect the information
Chaitanya requested.

> I suspect you are for the second set of numbers, but not for the first
> one?

No, both tables show some level of queueing. The shared bitmap in
table 2 surely has way more intensive queueing, though.

> Anything that isn't hitting the wait path for tags isn't a very useful
> test, as I would not expect any changes there.

Even when there is little to no queueing (QD=64 in this data), we still
enter sbitmap_queue_wake_up and bail out on the first check,
!ws_active. This is why I think it is important to include QD=64
here. It is less interesting data, as I mentioned, but it shows no
regressions of the fastpath.

Thanks,
  
Jens Axboe Nov. 10, 2022, 3:25 a.m. UTC | #6
On 11/9/22 3:48 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
>>> Performance-wise, one should expect very similar performance to the
>>> original algorithm for the case where there is no queueing.  In both the
>>> old algorithm and this implementation, the first thing is to check
>>> ws_active, which bails out if there is no queueing to be managed. In the
>>> new code, we took care to avoid accounting completions and wakeups when
>>> there is no queueing, to not pay the cost of atomic operations
>>> unnecessarily, since it doesn't skew the numbers.
>>>
>>> For more interesting cases, where there is queueing, we need to take
>>> into account the cross-communication of the atomic operations.  I've
>>> been benchmarking by running parallel fio jobs against a single hctx
>>> nullb in different hardware queue depth scenarios, and verifying both
>>> IOPS and queueing.
>>>
>>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>>> varying only the hardware queue length per test.
>>>
>>> queue size 2                 4                 8                 16                 32                 64
>>> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
>>> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
>>>
>>> The following is a similar experiment, ran against a nullb with a single
>>> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
>>> parallel fio jobs operating on the same device
>>>
>>> queue size 2 	             4                 8              	16             	    32		       64
>>> 6.1-rc2	   1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K) 	6178.2K (124.6K)    12227.9K (37.7K)   13286.6K (92.9K)
>>> patched	   1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K) 	6151.4K  (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
>>
>> What's the queue depth of these devices? That's the interesting question
>> here, as it'll tell us if any of these are actually hitting the slower
>> path where you made changes. 
>>
> 
> Hi Jens,
> 
> The hardware queue depth is a parameter being varied in this experiment.
> Each column of the tables has a different queue depth.  Its value is the
> first line (queue size) of both tables.  For instance, looking at the
> first table, for a device with hardware queue depth=2, 6.1-rc2 gave
> 1681K IOPS and the patched version gave 1721.8K IOPS.
> 
> As mentioned, I monitored the size of the sbitmap wqs during the
> benchmark execution to confirm it was indeed hitting the slow path and
> queueing.  Indeed, I observed less queueing on higher QDs (16,32) and
> even less for QD=64.  For QD<=8, there was extensive queueing present
> throughout the execution.

Gotcha, this makes a lot of sense. I just misunderstood the queue size
numbers to be queue depth on the IO generation side.

Any idea what is reducing performance at the higher end of queue size
spectrum? Not that I think it's _really_ that interesting, just curious.

> I should provide the queue size over time alongside the latency numbers.
> I have to rerun the benchmarks already to collect the information
> Chaitanya requested.

Thanks.

>> I suspect you are for the second set of numbers, but not for the first
>> one?
> 
> No. both tables show some level of queueing. The shared bitmap in
> table 2 surely has way more intensive queueing, though.

Yep, got it now.

>> Anything that isn't hitting the wait path for tags isn't a very useful
>> test, as I would not expect any changes there.
> 
> Even when there is less to no queueing (QD=64 in this data), we still
> enter sbitmap_queue_wake_up and bail out on the first line
> !wait_active. This is why I think it is important to include QD=64
> here. it is less interesting data, as I mentioned, but it shows no
> regressions of the faspath.

Yes, the checking for whether we need to hit the slow path should be
fast. I will try and run some numbers here too for the pure fast path,
no sleeping at all. For most fast storage this is where we'll be for the
majority of the time, queue depth on NVMe is generally pretty deep.
FWIW, I did run my usual quick peak testing and didn't see any
regressions there. I'll do a quick per-core benchmark tomorrow, that's
more sensitive to cycle issues.

But I like the change in general, and I think your numbers look sane.
Would be nice to turn sbitmap queueing into something that's actually
sane.
  
Yu Kuai Nov. 10, 2022, 9:42 a.m. UTC | #7
Hi,

On 2022/11/06 7:10, Gabriel Krisman Bertazi wrote:
> sbitmap suffers from code complexity, as demonstrated by recent fixes,
> and eventual lost wake ups on nested I/O completion.  The later happens,
> from what I understand, due to the non-atomic nature of the updates to
> wait_cnt, which needs to be subtracted and eventually reset when equal
> to zero.  This two step process can eventually miss an update when a
> nested completion happens to interrupt the CPU in between the wait_cnt
> updates.  This is very hard to fix, as shown by the recent changes to
> this code.
> 
> The code complexity arises mostly from the corner cases to avoid missed
> wakes in this scenario.  In addition, the handling of wake_batch
> recalculation plus the synchronization with sbq_queue_wake_up is
> non-trivial.
> 
> This patchset implements the idea originally proposed by Jan [1], which
> removes the need for the two-step updates of wait_cnt.  This is done by
> tracking the number of completions and wakeups in always increasing,
> per-bitmap counters.  Instead of having to reset the wait_cnt when it
> reaches zero, we simply keep counting, and attempt to wake up N threads
> in a single wait queue whenever there is enough space for a batch.
> Waking up less than batch_wake shouldn't be a problem, because we
> haven't changed the conditions for wake up, and the existing batch
> calculation guarantees at least enough remaining completions to wake up
> a batch for each queue at any time.
> 
> Performance-wise, one should expect very similar performance to the
> original algorithm for the case where there is no queueing.  In both the
> old algorithm and this implementation, the first thing is to check
> ws_active, which bails out if there is no queueing to be managed. In the
> new code, we took care to avoid accounting completions and wakeups when
> there is no queueing, to not pay the cost of atomic operations
> unnecessarily, since it doesn't skew the numbers.
> 
> For more interesting cases, where there is queueing, we need to take
> into account the cross-communication of the atomic operations.  I've
> been benchmarking by running parallel fio jobs against a single hctx
> nullb in different hardware queue depth scenarios, and verifying both
> IOPS and queueing.
> 
> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
> varying only the hardware queue length per test.
> 
> queue size 2                 4                 8                 16                 32                 64
> 6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
> patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
> 
> The following is a similar experiment, ran against a nullb with a single
> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
> parallel fio jobs operating on the same device
> 
> queue size 2 	             4                 8              	16             	    32		       64
> 6.1-rc2	   1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K) 	6178.2K (124.6K)    12227.9K (37.7K)   13286.6K (92.9K)
> patched	   1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K) 	6151.4K  (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
> 
> It has also survived blktests and a 12h-stress run against nullb. I also
> ran the code against nvme and a scsi SSD, and I didn't observe
> performance regression in those. If there are other tests you think I
> should run, please let me know and I will follow up with results.
> 
> [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Liu Song <liusong@linux.alibaba.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> ---
>   include/linux/sbitmap.h |  16 ++++--
>   lib/sbitmap.c           | 122 +++++++++-------------------------------
>   2 files changed, 37 insertions(+), 101 deletions(-)
> 
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 4d2d5205ab58..d662cf136021 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -86,11 +86,6 @@ struct sbitmap {
>    * struct sbq_wait_state - Wait queue in a &struct sbitmap_queue.
>    */
>   struct sbq_wait_state {
> -	/**
> -	 * @wait_cnt: Number of frees remaining before we wake up.
> -	 */
> -	atomic_t wait_cnt;
> -
>   	/**
>   	 * @wait: Wait queue.
>   	 */
> @@ -138,6 +133,17 @@ struct sbitmap_queue {
>   	 * sbitmap_queue_get_shallow()
>   	 */
>   	unsigned int min_shallow_depth;
> +
> +	/**
> +	 * @completion_cnt: Number of bits cleared passed to the
> +	 * wakeup function.
> +	 */
> +	atomic_t completion_cnt;
> +
> +	/**
> +	 * @wakeup_cnt: Number of thread wake ups issued.
> +	 */
> +	atomic_t wakeup_cnt;
>   };
>   
>   /**
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 7280ae8ca88c..eca462cba398 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -434,6 +434,8 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
>   	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
>   	atomic_set(&sbq->wake_index, 0);
>   	atomic_set(&sbq->ws_active, 0);
> +	atomic_set(&sbq->completion_cnt, 0);
> +	atomic_set(&sbq->wakeup_cnt, 0);
>   
>   	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
>   	if (!sbq->ws) {
> @@ -441,40 +443,21 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
>   		return -ENOMEM;
>   	}
>   
> -	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
> +	for (i = 0; i < SBQ_WAIT_QUEUES; i++)
>   		init_waitqueue_head(&sbq->ws[i].wait);
> -		atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch);
> -	}
>   
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
>   
> -static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
> -					    unsigned int wake_batch)
> -{
> -	int i;
> -
> -	if (sbq->wake_batch != wake_batch) {
> -		WRITE_ONCE(sbq->wake_batch, wake_batch);
> -		/*
> -		 * Pairs with the memory barrier in sbitmap_queue_wake_up()
> -		 * to ensure that the batch size is updated before the wait
> -		 * counts.
> -		 */
> -		smp_mb();
> -		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
> -			atomic_set(&sbq->ws[i].wait_cnt, 1);
> -	}
> -}
> -
>   static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
>   					    unsigned int depth)
>   {
>   	unsigned int wake_batch;
>   
>   	wake_batch = sbq_calc_wake_batch(sbq, depth);
> -	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
> +	if (sbq->wake_batch != wake_batch)
> +		WRITE_ONCE(sbq->wake_batch, wake_batch);
>   }
>   
>   void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
> @@ -488,7 +471,8 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
>   
>   	wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES,
>   			min_batch, SBQ_WAKE_BATCH);
> -	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
> +
> +	WRITE_ONCE(sbq->wake_batch, wake_batch);
>   }
>   EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
>   
> @@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>   	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>   		struct sbq_wait_state *ws = &sbq->ws[wake_index];
>   
> -		if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
> +		if (waitqueue_active(&ws->wait)) {
>   			if (wake_index != atomic_read(&sbq->wake_index))
>   				atomic_set(&sbq->wake_index, wake_index);
>   			return ws;
> @@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>   	return NULL;
>   }
>   
> -static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>   {
> -	struct sbq_wait_state *ws;
> -	unsigned int wake_batch;
> -	int wait_cnt, cur, sub;
> -	bool ret;
> +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
> +	struct sbq_wait_state *ws = NULL;
> +	unsigned int wakeups;
>   
> -	if (*nr <= 0)
> -		return false;
> +	if (!atomic_read(&sbq->ws_active))
> +		return;
>   
> -	ws = sbq_wake_ptr(sbq);
> -	if (!ws)
> -		return false;
> +	atomic_add(nr, &sbq->completion_cnt);
> +	wakeups = atomic_read(&sbq->wakeup_cnt);
>   
> -	cur = atomic_read(&ws->wait_cnt);
>   	do {
> -		/*
> -		 * For concurrent callers of this, callers should call this
> -		 * function again to wakeup a new batch on a different 'ws'.
> -		 */
> -		if (cur == 0)
> -			return true;
> -		sub = min(*nr, cur);
> -		wait_cnt = cur - sub;
> -	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
> -
> -	/*
> -	 * If we decremented queue without waiters, retry to avoid lost
> -	 * wakeups.
> -	 */
> -	if (wait_cnt > 0)
> -		return !waitqueue_active(&ws->wait);
> +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> +			return;

Should it be considered that completion_cnt may overflow and become
negative?

Thanks,
Kuai
>   
> -	*nr -= sub;
> -
> -	/*
> -	 * When wait_cnt == 0, we have to be particularly careful as we are
> -	 * responsible to reset wait_cnt regardless whether we've actually
> -	 * woken up anybody. But in case we didn't wakeup anybody, we still
> -	 * need to retry.
> -	 */
> -	ret = !waitqueue_active(&ws->wait);
> -	wake_batch = READ_ONCE(sbq->wake_batch);
> +		if (!ws) {
> +			ws = sbq_wake_ptr(sbq);
> +			if (!ws)
> +				return;
> +		}
> +	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
> +				     &wakeups, wakeups + wake_batch));
>   
> -	/*
> -	 * Wake up first in case that concurrent callers decrease wait_cnt
> -	 * while waitqueue is empty.
> -	 */
>   	wake_up_nr(&ws->wait, wake_batch);
> -
> -	/*
> -	 * Pairs with the memory barrier in sbitmap_queue_resize() to
> -	 * ensure that we see the batch size update before the wait
> -	 * count is reset.
> -	 *
> -	 * Also pairs with the implicit barrier between decrementing wait_cnt
> -	 * and checking for waitqueue_active() to make sure waitqueue_active()
> -	 * sees result of the wakeup if atomic_dec_return() has seen the result
> -	 * of atomic_set().
> -	 */
> -	smp_mb__before_atomic();
> -
> -	/*
> -	 * Increase wake_index before updating wait_cnt, otherwise concurrent
> -	 * callers can see valid wait_cnt in old waitqueue, which can cause
> -	 * invalid wakeup on the old waitqueue.
> -	 */
> -	sbq_index_atomic_inc(&sbq->wake_index);
> -	atomic_set(&ws->wait_cnt, wake_batch);
> -
> -	return ret || *nr;
> -}
> -
> -void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
> -{
> -	while (__sbq_wake_up(sbq, &nr))
> -		;
>   }
>   EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
>   
> @@ -792,9 +724,7 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
>   	seq_puts(m, "ws={\n");
>   	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>   		struct sbq_wait_state *ws = &sbq->ws[i];
> -
> -		seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
> -			   atomic_read(&ws->wait_cnt),
> +		seq_printf(m, "\t{.wait=%s},\n",
>   			   waitqueue_active(&ws->wait) ? "active" : "inactive");
>   	}
>   	seq_puts(m, "}\n");
>
  
Jan Kara Nov. 10, 2022, 11:16 a.m. UTC | #8
Hi!

On Thu 10-11-22 17:42:49, Yu Kuai wrote:
> On 2022/11/06 7:10, Gabriel Krisman Bertazi wrote:
> > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
> >   {
> > -	struct sbq_wait_state *ws;
> > -	unsigned int wake_batch;
> > -	int wait_cnt, cur, sub;
> > -	bool ret;
> > +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
> > +	struct sbq_wait_state *ws = NULL;
> > +	unsigned int wakeups;
> > -	if (*nr <= 0)
> > -		return false;
> > +	if (!atomic_read(&sbq->ws_active))
> > +		return;
> > -	ws = sbq_wake_ptr(sbq);
> > -	if (!ws)
> > -		return false;
> > +	atomic_add(nr, &sbq->completion_cnt);
> > +	wakeups = atomic_read(&sbq->wakeup_cnt);
> > -	cur = atomic_read(&ws->wait_cnt);
> >   	do {
> > -		/*
> > -		 * For concurrent callers of this, callers should call this
> > -		 * function again to wakeup a new batch on a different 'ws'.
> > -		 */
> > -		if (cur == 0)
> > -			return true;
> > -		sub = min(*nr, cur);
> > -		wait_cnt = cur - sub;
> > -	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
> > -
> > -	/*
> > -	 * If we decremented queue without waiters, retry to avoid lost
> > -	 * wakeups.
> > -	 */
> > -	if (wait_cnt > 0)
> > -		return !waitqueue_active(&ws->wait);
> > +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> > +			return;
> 
> Should it be considered that completion_cnt overflow and becomes
> negtive?

Yes, the counters can (and will) certainly overflow but since we only care
about (completion_cnt - wakeups), we should be fine - this number is always
sane (and relatively small) and in the kernel we do compile with signed
overflows being well defined.

								Honza
  
Yu Kuai Nov. 10, 2022, 1:18 p.m. UTC | #9
Hi,

On 2022/11/10 19:16, Jan Kara wrote:
> Hi!
> 
> On Thu 10-11-22 17:42:49, Yu Kuai wrote:
>> On 2022/11/06 7:10, Gabriel Krisman Bertazi wrote:
>>> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>>>    {
>>> -	struct sbq_wait_state *ws;
>>> -	unsigned int wake_batch;
>>> -	int wait_cnt, cur, sub;
>>> -	bool ret;
>>> +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
>>> +	struct sbq_wait_state *ws = NULL;
>>> +	unsigned int wakeups;
>>> -	if (*nr <= 0)
>>> -		return false;
>>> +	if (!atomic_read(&sbq->ws_active))
>>> +		return;
>>> -	ws = sbq_wake_ptr(sbq);
>>> -	if (!ws)
>>> -		return false;
>>> +	atomic_add(nr, &sbq->completion_cnt);
>>> +	wakeups = atomic_read(&sbq->wakeup_cnt);
>>> -	cur = atomic_read(&ws->wait_cnt);
>>>    	do {
>>> -		/*
>>> -		 * For concurrent callers of this, callers should call this
>>> -		 * function again to wakeup a new batch on a different 'ws'.
>>> -		 */
>>> -		if (cur == 0)
>>> -			return true;
>>> -		sub = min(*nr, cur);
>>> -		wait_cnt = cur - sub;
>>> -	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
>>> -
>>> -	/*
>>> -	 * If we decremented queue without waiters, retry to avoid lost
>>> -	 * wakeups.
>>> -	 */
>>> -	if (wait_cnt > 0)
>>> -		return !waitqueue_active(&ws->wait);
>>> +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
>>> +			return;
>>
>> Should it be considered that completion_cnt overflow and becomes
>> negtive?
> 
> Yes, the counters can (and will) certainly overflow but since we only care
> about (completion_cnt - wakeups), we should be fine - this number is always
> sane (and relatively small) and in the kernel we do compile with signed
> overflows being well defined.

I'm worried about this: for example, in the extreme scenario where there
is only one tag, currently there is only one inflight rq and one thread
is waiting for a tag. When the inflight rq completes, if 'completion_cnt'
overflows to negative, then 'atomic_read(&sbq->completion_cnt) - wakeups
< wake_batch' will pass unexpectedly; will the thread then never be
woken up if there is no new I/O issued?

Thanks,
Kuai
> 
> 								Honza
>
  
Jan Kara Nov. 10, 2022, 3:35 p.m. UTC | #10
Hi!

On Thu 10-11-22 21:18:19, Yu Kuai wrote:
> On 2022/11/10 19:16, Jan Kara wrote:
> > Hi!
> > 
> > On Thu 10-11-22 17:42:49, Yu Kuai wrote:
> > > On 2022/11/06 7:10, Gabriel Krisman Bertazi wrote:
> > > > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
> > > >    {
> > > > -	struct sbq_wait_state *ws;
> > > > -	unsigned int wake_batch;
> > > > -	int wait_cnt, cur, sub;
> > > > -	bool ret;
> > > > +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
> > > > +	struct sbq_wait_state *ws = NULL;
> > > > +	unsigned int wakeups;
> > > > -	if (*nr <= 0)
> > > > -		return false;
> > > > +	if (!atomic_read(&sbq->ws_active))
> > > > +		return;
> > > > -	ws = sbq_wake_ptr(sbq);
> > > > -	if (!ws)
> > > > -		return false;
> > > > +	atomic_add(nr, &sbq->completion_cnt);
> > > > +	wakeups = atomic_read(&sbq->wakeup_cnt);
> > > > -	cur = atomic_read(&ws->wait_cnt);
> > > >    	do {
> > > > -		/*
> > > > -		 * For concurrent callers of this, callers should call this
> > > > -		 * function again to wakeup a new batch on a different 'ws'.
> > > > -		 */
> > > > -		if (cur == 0)
> > > > -			return true;
> > > > -		sub = min(*nr, cur);
> > > > -		wait_cnt = cur - sub;
> > > > -	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
> > > > -
> > > > -	/*
> > > > -	 * If we decremented queue without waiters, retry to avoid lost
> > > > -	 * wakeups.
> > > > -	 */
> > > > -	if (wait_cnt > 0)
> > > > -		return !waitqueue_active(&ws->wait);
> > > > +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> > > > +			return;
> > > 
> > > Should it be considered that completion_cnt overflow and becomes
> > > negtive?
> > 
> > Yes, the counters can (and will) certainly overflow but since we only care
> > about (completion_cnt - wakeups), we should be fine - this number is always
> > sane (and relatively small) and in the kernel we do compile with signed
> > overflows being well defined.
> 
> I'm worried about this: for example, the extreme scenaro that there
> is only one tag, currently there are only one infight rq and one thread
> is waiting for tag. When the infight rq complete, if 'completion_cnt'
> overflow to negative, then 'atomic_read(&sbq->completion_cnt) - wakeups
> < wake_batch' will be passed unexpected, then will the thread never be
> woken up if there are no new io issued ?

Well, but my point is that 'wakeups' stays close to completion_cnt. So
if completion_cnt wraps to INT_MIN, then 'wakeups' is close to INT_MAX and
so completion_cnt - wakeups is going to wrap back and still result in a
small number. That is simply how wrapping arithmetic works...
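
To make the wrapping argument concrete, here is a small self-contained
C demo (illustrative only, assuming a 32-bit int as used by the kernel's
atomic_t; it is not part of the patch):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		/* completion_cnt has just wrapped past INT_MAX ... */
		int completion_cnt = INT_MIN;			  /* INT_MAX + 1, wrapped */
		unsigned int wakeups = (unsigned int)INT_MAX - 7; /* lagging by 8 */

		/*
		 * In sbitmap_queue_wake_up(), 'wakeups' is unsigned int, so the
		 * int operand is converted to unsigned and the subtraction is
		 * done modulo 2^32, recovering the true (small) distance
		 * between the two counters.
		 */
		unsigned int delta = (unsigned int)completion_cnt - wakeups;

		printf("delta = %u\n", delta);	/* prints 8 */
		return 0;
	}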

								Honza
  
Yu Kuai Nov. 11, 2022, 12:59 a.m. UTC | #11
Hi

On 2022/11/10 23:35, Jan Kara wrote:
> Hi!
> 
> On Thu 10-11-22 21:18:19, Yu Kuai wrote:
>> On 2022/11/10 19:16, Jan Kara wrote:
>>> Hi!
>>>
>>> On Thu 10-11-22 17:42:49, Yu Kuai wrote:
>>>> On 2022/11/06 7:10, Gabriel Krisman Bertazi wrote:
>>>>> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>>>>>     {
>>>>> -	struct sbq_wait_state *ws;
>>>>> -	unsigned int wake_batch;
>>>>> -	int wait_cnt, cur, sub;
>>>>> -	bool ret;
>>>>> +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
>>>>> +	struct sbq_wait_state *ws = NULL;
>>>>> +	unsigned int wakeups;
>>>>> -	if (*nr <= 0)
>>>>> -		return false;
>>>>> +	if (!atomic_read(&sbq->ws_active))
>>>>> +		return;
>>>>> -	ws = sbq_wake_ptr(sbq);
>>>>> -	if (!ws)
>>>>> -		return false;
>>>>> +	atomic_add(nr, &sbq->completion_cnt);
>>>>> +	wakeups = atomic_read(&sbq->wakeup_cnt);
>>>>> -	cur = atomic_read(&ws->wait_cnt);
>>>>>     	do {
>>>>> -		/*
>>>>> -		 * For concurrent callers of this, callers should call this
>>>>> -		 * function again to wakeup a new batch on a different 'ws'.
>>>>> -		 */
>>>>> -		if (cur == 0)
>>>>> -			return true;
>>>>> -		sub = min(*nr, cur);
>>>>> -		wait_cnt = cur - sub;
>>>>> -	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
>>>>> -
>>>>> -	/*
>>>>> -	 * If we decremented queue without waiters, retry to avoid lost
>>>>> -	 * wakeups.
>>>>> -	 */
>>>>> -	if (wait_cnt > 0)
>>>>> -		return !waitqueue_active(&ws->wait);
>>>>> +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
>>>>> +			return;
>>>>
>>>> Should it be considered that completion_cnt overflow and becomes
>>>> negtive?
>>>
>>> Yes, the counters can (and will) certainly overflow but since we only care
>>> about (completion_cnt - wakeups), we should be fine - this number is always
>>> sane (and relatively small) and in the kernel we do compile with signed
>>> overflows being well defined.
>>
>> I'm worried about this: for example, the extreme scenaro that there
>> is only one tag, currently there are only one infight rq and one thread
>> is waiting for tag. When the infight rq complete, if 'completion_cnt'
>> overflow to negative, then 'atomic_read(&sbq->completion_cnt) - wakeups
>> < wake_batch' will be passed unexpected, then will the thread never be
>> woken up if there are no new io issued ?
> 
> Well but my point is that 'wakeups' is staying close to completion_cnt. So
> if completion_cnt wraps to INT_MIN, then 'wakeups' is close to INT_MAX and
> so completion_cnt - wakeups is going to wrap back and still result in a
> small number. That is simply how wrapping arithmetics works...

Yes, you're right, I'm being foolish here. 😆

Thanks,
Kuai
> 
> 								Honza
>
  
Jens Axboe Nov. 11, 2022, 3:38 p.m. UTC | #12
On Sat, 5 Nov 2022 19:10:55 -0400, Gabriel Krisman Bertazi wrote:
> sbitmap suffers from code complexity, as demonstrated by recent fixes,
> and eventual lost wake ups on nested I/O completion.  The later happens,
> from what I understand, due to the non-atomic nature of the updates to
> wait_cnt, which needs to be subtracted and eventually reset when equal
> to zero.  This two step process can eventually miss an update when a
> nested completion happens to interrupt the CPU in between the wait_cnt
> updates.  This is very hard to fix, as shown by the recent changes to
> this code.
> 
> [...]

Applied, thanks!

[1/1] sbitmap: Use single per-bitmap counting to wake up queued tags
      commit: 4f8126bb2308066b877859e4b5923ffb54143630

Best regards,
  
Jan Kara Nov. 14, 2022, 1:23 p.m. UTC | #13
Gabriel, when looking through this patch, I've noticed we can lose wakeups
after your latest simplifications. See below for details:

On Sat 05-11-22 19:10:55, Gabriel Krisman Bertazi wrote:
> @@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>  	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>  		struct sbq_wait_state *ws = &sbq->ws[wake_index];
>  
> -		if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
> +		if (waitqueue_active(&ws->wait)) {
>  			if (wake_index != atomic_read(&sbq->wake_index))
>  				atomic_set(&sbq->wake_index, wake_index);
>  			return ws;

Neither sbq_wake_ptr() nor sbitmap_queue_wake_up() now increments the
wake_index after performing the wakeup. Thus we would effectively keep
waking tasks from a single waitqueue until it becomes empty, and only then
go for the next waitqueue. This creates unnecessary unfairness in task
wakeups and even possible starvation issues. So I think we need to advance
wake_index somewhere. Perhaps here, before returning the waitqueue.
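
For illustration, one way of doing that (a rough sketch against the
post-patch sbq_wake_ptr() quoted above, not necessarily the fix that was
eventually applied) would be to bump the shared index before returning
the selected waitqueue:

	static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
	{
		int i, wake_index;

		if (!atomic_read(&sbq->ws_active))
			return NULL;

		wake_index = atomic_read(&sbq->wake_index);
		for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
			struct sbq_wait_state *ws = &sbq->ws[wake_index];

			if (waitqueue_active(&ws->wait)) {
				if (wake_index != atomic_read(&sbq->wake_index))
					atomic_set(&sbq->wake_index, wake_index);
				/*
				 * Advance wake_index so the next wakeup starts
				 * from the following waitqueue instead of
				 * draining this one first.
				 */
				sbq_index_atomic_inc(&sbq->wake_index);
				return ws;
			}

			wake_index = sbq_index_inc(wake_index);
		}

		return NULL;
	}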

> @@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>  	return NULL;
>  }
>  
> -static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  {
> +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
> +	struct sbq_wait_state *ws = NULL;
> +	unsigned int wakeups;
>  
> +	if (!atomic_read(&sbq->ws_active))
> +		return;
>  
> +	atomic_add(nr, &sbq->completion_cnt);
> +	wakeups = atomic_read(&sbq->wakeup_cnt);
>  
>  	do {
> +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> +			return;
>  
> +		if (!ws) {
> +			ws = sbq_wake_ptr(sbq);
> +			if (!ws)
> +				return;
> +		}
> +	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
> +				     &wakeups, wakeups + wake_batch));
>  
>  	wake_up_nr(&ws->wait, wake_batch);

Now this may also be problematic - when we were checking the number of woken
waiters in the older version of the patch (for others: internal version of
the patch) this was fine, but now it may happen that the 'ws' we have
selected has no waiters anymore. And in that case we need to find another
waitqueue because otherwise we'd be losing too many wakeups and we could
deadlock. So I think this rather needs to be something like:

	do {
		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
			return;
	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
				     &wakeups, wakeups + wake_batch));

	do {
		ws = sbq_wake_ptr(sbq);
		if (!ws)
			return;
	} while (!wake_up_nr(&ws->wait, wake_batch));

with our original version of wake_up_nr() returning number of woken
waiters. What do you think?

								Honza
  
Gabriel Krisman Bertazi Nov. 15, 2022, 3:52 a.m. UTC | #14
Jan Kara <jack@suse.cz> writes:

> Now this may be also problematic - when we were checking the number of woken
> waiters in the older version of the patch (for others: internal version of
> the patch) this was fine but now it may happen that the 'ws' we have
> selected has no waiters anymore. And in that case we need to find another
> waitqueue because otherwise we'd be loosing too many wakeups and we could
> deadlock. So I think this rather needs to be something like:
>
> 	do {
> 		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> 			return;
> 	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
> 				     &wakeups, wakeups + wake_batch));
>
> 	do {
> 		ws = sbq_wake_ptr(sbq);
> 		if (!ws)
> 			return;

Jan,

Does this really solve it? There is no guarantee there will be another
waiter in the queues when we check here.  So, once again, we could fail to
wake up anyone and return in this if leg.  If that is the case, don't we
end up overshooting wakeups and end up again with fewer completions than
required to wake up an incoming I/O?
  
Jan Kara Nov. 15, 2022, 10:24 a.m. UTC | #15
On Mon 14-11-22 22:52:32, Gabriel Krisman Bertazi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > Now this may be also problematic - when we were checking the number of woken
> > waiters in the older version of the patch (for others: internal version of
> > the patch) this was fine but now it may happen that the 'ws' we have
> > selected has no waiters anymore. And in that case we need to find another
> > waitqueue because otherwise we'd be loosing too many wakeups and we could
> > deadlock. So I think this rather needs to be something like:
> >
> > 	do {
> > 		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> > 			return;
> > 	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
> > 				     &wakeups, wakeups + wake_batch));
> >
> > 	do {
> > 		ws = sbq_wake_ptr(sbq);
> > 		if (!ws)
> > 			return;
> 
> Does this really solve it? There is no guarantee there will be another
> waiter in the queues when we check here.  So, once again we could not
> wake up anyone and return it this if leg.  If that is the case, don't we
> end up overshooting wakeups and end up again with less completions than
> required to wake up an incoming io?

Well, if we don't find any waiter in any of the wait queues, it surely does
not matter that we just discard the wakeups, does it? And sure, these checks
are racy, as waiters can constantly be added, but the invariant is: if some
waiter is added after the atomic_try_cmpxchg(), then all tags are used, so as
many completions as there are tags are coming. That is enough to wake the
new waiter (due to the batch size). So all that matters is: if there's any
waiter in the waitqueue by the time atomic_try_cmpxchg() is called, we'll
consider it in the wakeup loop. And that is fulfilled by the code AFAICT.

								Honza
  

Patch

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 4d2d5205ab58..d662cf136021 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -86,11 +86,6 @@  struct sbitmap {
  * struct sbq_wait_state - Wait queue in a &struct sbitmap_queue.
  */
 struct sbq_wait_state {
-	/**
-	 * @wait_cnt: Number of frees remaining before we wake up.
-	 */
-	atomic_t wait_cnt;
-
 	/**
 	 * @wait: Wait queue.
 	 */
@@ -138,6 +133,17 @@  struct sbitmap_queue {
 	 * sbitmap_queue_get_shallow()
 	 */
 	unsigned int min_shallow_depth;
+
+	/**
+	 * @completion_cnt: Number of bits cleared passed to the
+	 * wakeup function.
+	 */
+	atomic_t completion_cnt;
+
+	/**
+	 * @wakeup_cnt: Number of thread wake ups issued.
+	 */
+	atomic_t wakeup_cnt;
 };
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 7280ae8ca88c..eca462cba398 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -434,6 +434,8 @@  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
 	atomic_set(&sbq->wake_index, 0);
 	atomic_set(&sbq->ws_active, 0);
+	atomic_set(&sbq->completion_cnt, 0);
+	atomic_set(&sbq->wakeup_cnt, 0);
 
 	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
 	if (!sbq->ws) {
@@ -441,40 +443,21 @@  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
+	for (i = 0; i < SBQ_WAIT_QUEUES; i++)
 		init_waitqueue_head(&sbq->ws[i].wait);
-		atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch);
-	}
 
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
 
-static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
-					    unsigned int wake_batch)
-{
-	int i;
-
-	if (sbq->wake_batch != wake_batch) {
-		WRITE_ONCE(sbq->wake_batch, wake_batch);
-		/*
-		 * Pairs with the memory barrier in sbitmap_queue_wake_up()
-		 * to ensure that the batch size is updated before the wait
-		 * counts.
-		 */
-		smp_mb();
-		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
-			atomic_set(&sbq->ws[i].wait_cnt, 1);
-	}
-}
-
 static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
 					    unsigned int depth)
 {
 	unsigned int wake_batch;
 
 	wake_batch = sbq_calc_wake_batch(sbq, depth);
-	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
+	if (sbq->wake_batch != wake_batch)
+		WRITE_ONCE(sbq->wake_batch, wake_batch);
 }
 
 void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
@@ -488,7 +471,8 @@  void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
 
 	wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES,
 			min_batch, SBQ_WAKE_BATCH);
-	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
+
+	WRITE_ONCE(sbq->wake_batch, wake_batch);
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
 
@@ -587,7 +571,7 @@  static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[wake_index];
 
-		if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
+		if (waitqueue_active(&ws->wait)) {
 			if (wake_index != atomic_read(&sbq->wake_index))
 				atomic_set(&sbq->wake_index, wake_index);
 			return ws;
@@ -599,83 +583,31 @@  static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 	return NULL;
 }
 
-static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
+void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
 {
-	struct sbq_wait_state *ws;
-	unsigned int wake_batch;
-	int wait_cnt, cur, sub;
-	bool ret;
+	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
+	struct sbq_wait_state *ws = NULL;
+	unsigned int wakeups;
 
-	if (*nr <= 0)
-		return false;
+	if (!atomic_read(&sbq->ws_active))
+		return;
 
-	ws = sbq_wake_ptr(sbq);
-	if (!ws)
-		return false;
+	atomic_add(nr, &sbq->completion_cnt);
+	wakeups = atomic_read(&sbq->wakeup_cnt);
 
-	cur = atomic_read(&ws->wait_cnt);
 	do {
-		/*
-		 * For concurrent callers of this, callers should call this
-		 * function again to wakeup a new batch on a different 'ws'.
-		 */
-		if (cur == 0)
-			return true;
-		sub = min(*nr, cur);
-		wait_cnt = cur - sub;
-	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
-
-	/*
-	 * If we decremented queue without waiters, retry to avoid lost
-	 * wakeups.
-	 */
-	if (wait_cnt > 0)
-		return !waitqueue_active(&ws->wait);
+		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
+			return;
 
-	*nr -= sub;
-
-	/*
-	 * When wait_cnt == 0, we have to be particularly careful as we are
-	 * responsible to reset wait_cnt regardless whether we've actually
-	 * woken up anybody. But in case we didn't wakeup anybody, we still
-	 * need to retry.
-	 */
-	ret = !waitqueue_active(&ws->wait);
-	wake_batch = READ_ONCE(sbq->wake_batch);
+		if (!ws) {
+			ws = sbq_wake_ptr(sbq);
+			if (!ws)
+				return;
+		}
+	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
+				     &wakeups, wakeups + wake_batch));
 
-	/*
-	 * Wake up first in case that concurrent callers decrease wait_cnt
-	 * while waitqueue is empty.
-	 */
 	wake_up_nr(&ws->wait, wake_batch);
-
-	/*
-	 * Pairs with the memory barrier in sbitmap_queue_resize() to
-	 * ensure that we see the batch size update before the wait
-	 * count is reset.
-	 *
-	 * Also pairs with the implicit barrier between decrementing wait_cnt
-	 * and checking for waitqueue_active() to make sure waitqueue_active()
-	 * sees result of the wakeup if atomic_dec_return() has seen the result
-	 * of atomic_set().
-	 */
-	smp_mb__before_atomic();
-
-	/*
-	 * Increase wake_index before updating wait_cnt, otherwise concurrent
-	 * callers can see valid wait_cnt in old waitqueue, which can cause
-	 * invalid wakeup on the old waitqueue.
-	 */
-	sbq_index_atomic_inc(&sbq->wake_index);
-	atomic_set(&ws->wait_cnt, wake_batch);
-
-	return ret || *nr;
-}
-
-void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
-{
-	while (__sbq_wake_up(sbq, &nr))
-		;
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
 
@@ -792,9 +724,7 @@  void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
 	seq_puts(m, "ws={\n");
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[i];
-
-		seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
-			   atomic_read(&ws->wait_cnt),
+		seq_printf(m, "\t{.wait=%s},\n",
 			   waitqueue_active(&ws->wait) ? "active" : "inactive");
 	}
 	seq_puts(m, "}\n");