[RFC] scsi: core: remove unsed 'restarts' from scsi_device

Message ID 20221118113052.1324140-1-yukuai1@huaweicloud.com
State New
Series [RFC] scsi: core: remove unsed 'restarts' from scsi_device

Commit Message

Yu Kuai Nov. 18, 2022, 11:30 a.m. UTC
  From: Yu Kuai <yukuai3@huawei.com>

During code review, I found that 'restarts' is not useful anymore after
the following commits:

1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
is a reason to kick")
2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
is the last request")
3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
and completion")

Now, if getting budget ever fails, the block layer will make sure to
trigger a new queue run for the hctx. Hence there is no need to run the
queue from the SCSI layer in this case.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/scsi/scsi_lib.c    | 35 -----------------------------------
 include/scsi/scsi_device.h |  1 -
 2 files changed, 36 deletions(-)
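
For illustration, the block-layer guarantee the three commits above
establish can be sketched in self-contained user-space C. This is a
paraphrase rather than kernel code: the types and helper names below are
stand-ins, and only the control flow at the tail of
blk_mq_dispatch_rq_list() is the point, namely that a failed budget grab
now always ends with the hctx being scheduled to run again.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in types and helpers; names are illustrative, not the kernel's. */
enum prep { PREP_OK, PREP_NO_TAG, PREP_NO_BUDGET };
struct hw_ctx { bool restart_set; };

static void run_hw_queue(struct hw_ctx *h)       { (void)h; printf("re-run now\n"); }
static void delay_run_hw_queue(struct hw_ctx *h) { (void)h; printf("re-run after a delay\n"); }

/*
 * Tail of the dispatch path when not all requests could be issued.
 * After commit ab3cee3762e5, "no budget" is treated like "out of
 * resource": both branches end in a queue re-run, so a failed budget
 * grab can no longer strand requests on the hctx.
 */
static void dispatch_tail(struct hw_ctx *h, enum prep p, bool out_of_resource)
{
	bool no_budget = (p == PREP_NO_BUDGET);

	if (!h->restart_set)
		run_hw_queue(h);            /* immediate re-run */
	else if (out_of_resource || no_budget)
		delay_run_hw_queue(h);      /* delayed re-run */
}

int main(void)
{
	struct hw_ctx h = { .restart_set = true };

	dispatch_tail(&h, PREP_NO_BUDGET, false); /* "re-run after a delay" */
	return 0;
}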
  

Comments

Bart Van Assche Nov. 19, 2022, 1:11 a.m. UTC | #1
On 11/18/22 03:30, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>

Regarding the subject: unsed -> unused?
  
Yu Kuai Nov. 19, 2022, 5:57 a.m. UTC | #2
Hi,

On 2022/11/19 9:11, Bart Van Assche wrote:
> On 11/18/22 03:30, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
> 
> Regarding the subject: unsed -> unused?

Yes, sorry for the spelling mistake.
  
Yu Kuai Nov. 26, 2022, 8:54 a.m. UTC | #3
Hi,

On 2022/11/18 19:30, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> During code review, I found that 'restarts' is not useful anymore after
> the following commits:
> 
> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> is a reason to kick")
> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> is the last request")
> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> and completion")
> 
> Now, if getting budget ever fails, the block layer will make sure to
> trigger a new queue run for the hctx. Hence there is no need to run the
> queue from the SCSI layer in this case.
> 

Does anyone have suggestions about this patch?

More info on why I tried to remove this:

While testing megaraid with 4 NVMe disks and the none elevator, the
default queue_depth is 128. When I test with fio at 128 jobs and 1
iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
the latter case.

Thanks,
Kuai
  
Ming Lei Nov. 27, 2022, 9:45 a.m. UTC | #4
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2022/11/18 19:30, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> > 
> > During code review, I found that 'restarts' is not useful anymore after
> > the following commits:
> > 
> > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> > is a reason to kick")
> > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> > is the last request")
> > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> > and completion")
> > 
> > Now, if getting budget ever fails, the block layer will make sure to
> > trigger a new queue run for the hctx. Hence there is no need to run the
> > queue from the SCSI layer in this case.
> > 
> 
> Does anyone have suggestions about this patch?
> 
> More info on why I tried to remove this:
> 
> While testing megaraid with 4 NVMe disks and the none elevator, the
> default queue_depth is 128. When I test with fio at 128 jobs and 1
> iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
> drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
> the latter case.

I will look at this patch next week.

Can you investigate a bit the reason why the perf boost comes from this patch?

Thanks,
Ming
  
Jason Yan Nov. 28, 2022, 2:26 a.m. UTC | #5
On 2022/11/26 16:54, Yu Kuai wrote:
> Hi,
> 
> On 2022/11/18 19:30, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> During code review, I found that 'restarts' is not useful anymore after
>> the following commits:
>>
>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>> is a reason to kick")
>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>> is the last request")
>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>> and completion")
>>
>> Now, if getting budget ever fails, the block layer will make sure to
>> trigger a new queue run for the hctx. Hence there is no need to run the
>> queue from the SCSI layer in this case.
>>
> 
> Does anyone have suggestions about this patch?
> 
> More info on why I tried to remove this:
> 
> While testing megaraid with 4 NVMe disks and the none elevator, the
> default queue_depth is 128. When I test with fio at 128 jobs and 1
> iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
> drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
> the latter case.
> 

Hi Yu Kuai,

This information should be included in the commit message.

Thanks,
Jason
  
Yu Kuai Nov. 28, 2022, 2:49 a.m. UTC | #6
Hi, Ming

On 2022/11/27 17:45, Ming Lei wrote:
> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2022/11/18 19:30, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> During code review, I found that 'restarts' is not useful anymore after
>>> the following commits:
>>>
>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>> is a reason to kick")
>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>> is the last request")
>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>> and completion")
>>>
>>> Now, if getting budget ever fails, the block layer will make sure to
>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>> queue from the SCSI layer in this case.
>>>
>>
>> Does anyone have suggestions about this patch?
>>
>> More info on why I tried to remove this:
>>
>> While testing megaraid with 4 NVMe disks and the none elevator, the
>> default queue_depth is 128. When I test with fio at 128 jobs and 1
>> iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
>> drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
>> the latter case.
> 
> I will look at this patch next week.
> 
> Can you investigate a bit the reason why the perf boost comes from this patch?

test cmd:

without this patch, perf shows that CPU time is wasted in
mod_delayed_work_on:

-    3.44%  swapper          [kernel.vmlinux]  [k] mod_delayed_work_on
    - mod_delayed_work_on
       - 3.44% kblockd_mod_delayed_work_on
            __blk_mq_delay_run_hw_queue
                  scsi_run_queue_async
                  scsi_end_request
                  scsi_io_completion
                  scsi_finish_command
                + scsi_complete
+    1.05%  ksoftirqd/30     [kernel.vmlinux]  [k] mod_delayed_work_on
+    0.54%  fio              [kernel.vmlinux]  [k] mod_delayed_work_on
      0.36%  ksoftirqd/22     [kernel.vmlinux]  [k] mod_delayed_work_on
      0.35%  ksoftirqd/1      [kernel.vmlinux]  [k] mod_delayed_work_on

I also attached the flamegraph.

Thanks,
Kuai
> 
> Thanks,
> Ming
  
Ming Lei Nov. 28, 2022, 3:27 a.m. UTC | #7
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2022/11/18 19:30, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> > 
> > During code review, I found that 'restarts' is not useful anymore after
> > the following commits:
> > 
> > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> > is a reason to kick")
> > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> > is the last request")
> > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> > and completion")
> > 
> > Now, if getting budget ever fails, the block layer will make sure to
> > trigger a new queue run for the hctx. Hence there is no need to run the
> > queue from the SCSI layer in this case.
> > 

But scsi_run_queue_async() needs to run all hw queues because the budget
is actually LUN/request-queue wide.

> 
> Does anyone have suggestions about this patch?
> 
> More info on why I tried to remove this:
> 
> While testing megaraid with 4 NVMe disks and the none elevator, the
> default queue_depth is 128. When I test with fio at 128 jobs and 1
> iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
> drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
> the latter case.

What are .can_queue and nr_hw_queues in your setting?



thanks,
Ming
  
Yu Kuai Nov. 28, 2022, 3:35 a.m. UTC | #8
On 2022/11/28 11:27, Ming Lei wrote:
> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2022/11/18 19:30, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> During code review, I found that 'restarts' is not useful anymore after
>>> the following commits:
>>>
>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>> is a reason to kick")
>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>> is the last request")
>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>> and completion")
>>>
>>> Now, if getting budget ever fails, the block layer will make sure to
>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>> queue from the SCSI layer in this case.
>>>
> 
> But scsi_run_queue_async() needs to run all hw queues because the budget
> is actually LUN/request-queue wide.

Why does the hw queue need to run if getting budget never failed in
this hw queue?

> 
>>
>> Does anyone have suggestions about this patch?
>>
>> More info on why I tried to remove this:
>>
>> While testing megaraid with 4 NVMe disks and the none elevator, the
>> default queue_depth is 128. When I test with fio at 128 jobs and 1
>> iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
>> drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
>> the latter case.
> 
> What are .can_queue and nr_hw_queues in your setting?
test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 
-rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB 
-runtime=60s -bs=4k -iodepth=2 -rw=randwrite

test environment:
arm64 Kunpeng-920, 128 cpu
megaraid with 4 NVMEs, 128 hctx and queue_depth is 128

> 
> thanks,
> Ming
  
Yu Kuai Nov. 28, 2022, 3:38 a.m. UTC | #9
On 2022/11/28 11:35, Yu Kuai wrote:

> 
> Why does the hw queue need to run if getting budget never failed in
> this hw queue?

And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
IO schedulers with multiple HW queues", scsi_run_queue_async() can only
guarantee to run the hw queue for the current cpu, not all the hw queues.
  
Ming Lei Nov. 28, 2022, 4:12 a.m. UTC | #10
On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:
> 
> 
> On 2022/11/28 11:27, Ming Lei wrote:
> > On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > On 2022/11/18 19:30, Yu Kuai wrote:
> > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > 
> > > > During code review, I found that 'restarts' is not useful anymore after
> > > > the following commits:
> > > > 
> > > > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> > > > is a reason to kick")
> > > > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> > > > is the last request")
> > > > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> > > > and completion")
> > > > 
> > > > Now, if getting budget ever fails, the block layer will make sure to
> > > > trigger a new queue run for the hctx. Hence there is no need to run the
> > > > queue from the SCSI layer in this case.
> > > > 
> > 
> > But scsi_run_queue_async() needs to run all hw queues because the budget
> > is actually LUN/request-queue wide.
> 
> Why does the hw queue need to run if getting budget never failed in
> this hw queue?

Because all hw queues share the queue-wide budget, and once budget
is available, all hw queues are re-run; a hw queue won't actually be
scheduled if there is nothing to run, see
blk_mq_run_hw_queue().
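
In sketch form (a simplified paraphrase with stand-in names, not the
verbatim source), the guard in blk_mq_run_hw_queue() amounts to:

#include <stdbool.h>

/* Stand-ins; an idle hctx is cheap to "re-run" because it is only
 * handed to kblockd when it actually has pending work. */
struct hw_ctx { bool quiesced; int nr_pending; };

static void schedule_hctx_run(struct hw_ctx *h, bool async)
{
	(void)h; (void)async;	/* would queue the kblockd work item */
}

static void run_hw_queue_sketch(struct hw_ctx *h, bool async)
{
	bool need_run = !h->quiesced && h->nr_pending > 0;

	if (need_run)
		schedule_hctx_run(h, async);
}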

> 
> > 
> > > 
> > > Does anyone have suggestions about this patch?
> > > 
> > > More info on why I tried to remove this:
> > > 
> > > While testing megaraid with 4 NVMe disks and the none elevator, the
> > > default queue_depth is 128. When I test with fio at 128 jobs and 1
> > > iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
> > > drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
> > > the latter case.
> > 
> > What are .can_queue and nr_hw_queues in your setting?
> test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
> -rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
> -runtime=60s -bs=4k -iodepth=2 -rw=randwrite
> 
> test environment:
> arm64 Kunpeng-920, 128 cpu
> megaraid with 4 NVMEs, 128 hctx and queue_depth is 128

From your setting, megaraid should set ->host_tagset, that is to say
there are only 128 tags for all 4 NVMEs (the 128 hw queues share those
128 tags too).

That looks like one really bad setting.

BTW, why do you drive the NVMe devices via megaraid instead of the nvme
driver?

> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
> guarantee to run the hw queue for the current cpu, not all the hw queues.

That isn't true: each hctx is still run in the case of the none & kyber
schedulers.

thanks,
Ming
  
Yu Kuai Nov. 28, 2022, 6:08 a.m. UTC | #11
Hi,

On 2022/11/28 12:12, Ming Lei wrote:
> On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:
>>
>>
>> On 2022/11/28 11:27, Ming Lei wrote:
>>> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2022/11/18 19:30, Yu Kuai wrote:
>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>
>>>>> During code review, I found that 'restarts' is not useful anymore after
>>>>> the following commits:
>>>>>
>>>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>>>> is a reason to kick")
>>>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>>>> is the last request")
>>>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>>>> and completion")
>>>>>
>>>>> Now, if getting budget ever fails, the block layer will make sure to
>>>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>>>> queue from the SCSI layer in this case.
>>>>>
>>>
>>> But scsi_run_queue_async() needs to run all hw queues because the budget
>>> is actually LUN/request-queue wide.
>>
>> Why does the hw queue need to run if getting budget never failed in
>> this hw queue?
> 
> Because all hw queues share the queue-wide budget, and once budget
> is available, all hw queues are re-run; a hw queue won't actually be
> scheduled if there is nothing to run, see
> blk_mq_run_hw_queue().

I still don't get why all hw queues should be re-run; is this just
for performance or to fix a bug? And I'm not sure this behavior is good
for performance.

> 
>>
>>>
>>>>
>>>> Does anyone have suggestions about this patch?
>>>>
>>>> More info on why I tried to remove this:
>>>>
>>>> While testing megaraid with 4 NVMe disks and the none elevator, the
>>>> default queue_depth is 128. When I test with fio at 128 jobs and 1
>>>> iodepth, bw is about 4Gib/s; however, with 128 jobs and 2 iodepth, bw
>>>> drops to about 0.8Gib/s. With this patch applied, bw stays at 4Gib/s in
>>>> the latter case.
>>>
>>> What are .can_queue and nr_hw_queues in your setting?
>> test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
>> -rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
>> -runtime=60s -bs=4k -iodepth=2 -rw=randwrite
>>
>> test environment:
>> arm64 Kunpeng-920, 128 cpu
>> megaraid with 4 NVMEs, 128 hctx and queue_depth is 128
> 
> From your setting, megaraid should set ->host_tagset, that is to say
> there are only 128 tags for all 4 NVMEs (the 128 hw queues share those
> 128 tags too).
> 
> That looks like one really bad setting.

It's right that this is bad, but the point here is to trigger budget
allocation failures frequently. If I set queue_depth to 2048 and use 128
numjobs with 32 iodepth, performance is still bad.
> 
> BTW, why do you drive the NVMe devices via megaraid instead of the nvme
> driver?
> 
>> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
>> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
>> guarantee to run the hw queue for the current cpu, not all the hw queues.
> 
> That isn't true: each hctx is still run in the case of the none & kyber
> schedulers.

Yes, but the current default elevator for devices with shared hctx tags
is deadline.
> 
> thanks,
> Ming
  
Yu Kuai Nov. 29, 2022, 2:15 a.m. UTC | #12
Hi,

> Hi,
> 
> On 2022/11/28 12:12, Ming Lei wrote:

>>
>> BTW, why do you drive the NVMe devices via megaraid instead of the
>> nvme driver?
>>
>>> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
>>> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
>>> guarantee to run the hw queue for the current cpu, not all the hw queues.
>>
>> That isn't true: each hctx is still run in the case of the none &
>> kyber schedulers.

And I really suspect that the reason Jan's patch can improve performance
is that it avoids many queue runs from scsi_run_queue_async().

> 
> Yes, but the current default elevator for devices with shared hctx tags
> is deadline.
>>
>> thanks,
>> Ming
  

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 56f641ba1261..f6325a0f80fb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -507,24 +507,6 @@  static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list)) {
 		kblockd_schedule_work(&sdev->requeue_work);
-	} else {
-		/*
-		 * smp_mb() present in sbitmap_queue_clear() or implied in
-		 * .end_io is for ordering writing .device_busy in
-		 * scsi_device_unbusy() and reading sdev->restarts.
-		 */
-		int old = atomic_read(&sdev->restarts);
-
-		/*
-		 * ->restarts has to be kept as non-zero if new budget
-		 *  contention occurs.
-		 *
-		 *  No need to run queue when either another re-run
-		 *  queue wins in updating ->restarts or a new budget
-		 *  contention occurs.
-		 */
-		if (old && atomic_cmpxchg(&sdev->restarts, old, 0) == old)
-			blk_mq_run_hw_queues(sdev->request_queue, true);
 	}
 }
 
@@ -1666,23 +1648,6 @@  static int scsi_mq_get_budget(struct request_queue *q)
 	if (token >= 0)
 		return token;
 
-	atomic_inc(&sdev->restarts);
-
-	/*
-	 * Orders atomic_inc(&sdev->restarts) and atomic_read(&sdev->device_busy).
-	 * .restarts must be incremented before .device_busy is read because the
-	 * code in scsi_run_queue_async() depends on the order of these operations.
-	 */
-	smp_mb__after_atomic();
-
-	/*
-	 * If all in-flight requests originated from this LUN are completed
-	 * before reading .device_busy, sdev->device_busy will be observed as
-	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
-	 * soon. Otherwise, completion of one of these requests will observe
-	 * the .restarts flag, and the request queue will be run for handling
-	 * this request, see scsi_end_request().
-	 */
 	if (unlikely(scsi_device_busy(sdev) == 0 &&
 				!scsi_device_blocked(sdev)))
 		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 24bdbf7999ab..66345de80897 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -115,7 +115,6 @@  struct scsi_device {
 	struct sbitmap budget_map;
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
-	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */
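
For reference, the handshake this patch deletes can be modeled with C11
atomics in user space. The sketch below is illustrative only, not kernel
code, and the names are stand-ins; it shows the lost-wakeup avoidance the
removed comments describe: either the failing get-budget path observes
device_busy == 0 and schedules a re-run itself, or a later completion
observes restarts != 0, wins the cmpxchg, and re-runs the queue.

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative user-space model of the removed sdev->restarts handshake. */
static atomic_int device_busy;  /* in-flight commands on the LUN  */
static atomic_int restarts;     /* "someone failed to get budget" */

static void run_hw_queues(const char *who)
{
	printf("queue re-run triggered by %s\n", who);
}

/* Failure tail of scsi_mq_get_budget(): no budget token was available. */
static void get_budget_failed(void)
{
	atomic_fetch_add(&restarts, 1);
	/* Full barrier (smp_mb__after_atomic() in the kernel): the restarts
	 * increment must be visible before device_busy is read. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load(&device_busy) == 0)
		run_hw_queues("budget path");  /* no completion left to do it */
}

/* Completion path, as in scsi_run_queue_async(): a command just finished. */
static void command_completed(void)
{
	atomic_fetch_sub(&device_busy, 1);  /* frees one unit of budget */
	/* The kernel relies on the barrier in sbitmap_queue_clear() here. */
	atomic_thread_fence(memory_order_seq_cst);

	int old = atomic_load(&restarts);
	/* Only one completion wins the cmpxchg, so a burst of budget
	 * contention causes a single re-run, not one per completion. */
	if (old && atomic_compare_exchange_strong(&restarts, &old, 0))
		run_hw_queues("completion path");
}

int main(void)
{
	atomic_store(&device_busy, 1);
	get_budget_failed();   /* contends while one command is in flight */
	command_completed();   /* observes restarts != 0 and re-runs */
	return 0;
}

The patch's argument is that the block layer now provides an equivalent
re-run guarantee on its own, which is what makes this handshake
redundant.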