Message ID: cover.1678474375.git.asml.silence@gmail.com
Headers:
From: Pavel Begunkov <asml.silence@gmail.com>
To: io-uring@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>, asml.silence@gmail.com, linux-kernel@vger.kernel.org
Subject: [RFC 0/2] optimise local-tw task resheduling
Date: Fri, 10 Mar 2023 19:04:14 +0000
Message-Id: <cover.1678474375.git.asml.silence@gmail.com>
Series: optimise local-tw task resheduling
Message
Pavel Begunkov
March 10, 2023, 7:04 p.m. UTC
io_uring extensively uses task_work, but when a task is waiting for multiple CQEs it causes lots of rescheduling. This series is an attempt to optimise that and to serve as a base for future improvements.

For some zc network tests that eventually wait for a portion of the buffers, I've got a 10x decrease in the number of context switches, which reduced CPU consumption by more than half (17% -> 8%). It also helps storage cases: running fio/t/io_uring against a low-performance drive, it got a 2x decrease in the number of context switches for QD8 and ~4x for QD32.

Not for inclusion yet; I want to add an optimisation for when waiting for 1 CQE.

Pavel Begunkov (2):
  io_uring: add tw add flags
  io_uring: reduce sheduling due to tw

 include/linux/io_uring_types.h |  2 +-
 io_uring/io_uring.c            | 48 ++++++++++++++++++++--------------
 io_uring/io_uring.h            | 10 +++++--
 io_uring/notif.h               |  2 +-
 io_uring/rw.c                  |  2 +-
 5 files changed, 40 insertions(+), 24 deletions(-)
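For context, the pattern the series targets is a single-issuer ring with deferred ("local") task_work, where the task submits a batch of requests and then sleeps until many CQEs are available at once. A minimal liburing sketch of that pattern follows; it is illustrative only, and the queue depth and nop requests are arbitrary stand-ins for real reads or sends.

```c
/* Illustrative sketch of the waiting pattern the series optimises:
 * a DEFER_TASKRUN ring whose task waits for a whole batch of CQEs. */
#include <liburing.h>
#include <stdio.h>

#define QD 32

int main(void)
{
	struct io_uring_params p = { 0 };
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;
	int i, ret;

	/* Local ("DEFER_TASKRUN") task_work is what the series tweaks;
	 * it requires IORING_SETUP_SINGLE_ISSUER. */
	p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
	ret = io_uring_queue_init_params(QD, &ring, &p);
	if (ret < 0)
		return 1;

	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_nop(sqe);	/* stand-in for real reads/sends */
	}

	/* Submit and sleep until all QD completions are available; the
	 * series avoids waking the task for each intermediate tw item. */
	ret = io_uring_submit_and_wait(&ring, QD);
	if (ret < 0)
		goto out;

	io_uring_for_each_cqe(&ring, head, cqe)
		seen++;
	io_uring_cq_advance(&ring, seen);
	printf("reaped %u CQEs\n", seen);
out:
	io_uring_queue_exit(&ring);
	return 0;
}
```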
Comments
On 3/10/23 12:04?PM, Pavel Begunkov wrote: > io_uring extensively uses task_work, but when a task is waiting > for multiple CQEs it causes lots of rescheduling. This series > is an attempt to optimise it and be a base for future improvements. > > For some zc network tests eventually waiting for a portion of > buffers I've got 10x descrease in the number of context switches, > which reduced the CPU consumption more than twice (17% -> 8%). > It also helps storage cases, while running fio/t/io_uring against > a low performant drive it got 2x descrease of the number of context > switches for QD8 and ~4 times for QD32. > > Not for inclusion yet, I want to add an optimisation for when > waiting for 1 CQE. Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for that, and I see context rates of around 8.1-8.3M/sec with the current kernel. Applied the two patches, but didn't see much of a change? Performance is about the same, and cx rate ditto. Confused... As you probably know, this test waits for 32 ios at the time. Didn't take a closer look just yet, but I grok the concept. One immediate thing I'd want to change is the FACILE part of it. Let's call it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT? I can see this mostly being used for filling a CQE, so it could also be named something like that. But could also be used for light work in the same vein, so might not be a good idea to base the naming on that.
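The context-switch rates quoted above come from system-wide tooling. As a hedged aside, one self-contained way to observe the same quantity from inside a benchmark process is getrusage(); the sketch below is illustrative and not the measurement used in this thread.

```c
/* Sketch: count voluntary context switches across a benchmark phase.
 * The thread's numbers come from system-wide tooling; this is just a
 * minimal in-process way to observe the same effect. */
#include <sys/resource.h>
#include <stdio.h>

static long voluntary_ctxt_switches(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_nvcsw;	/* voluntary context switches so far */
}

/* Wrap any benchmark loop (e.g. the submit-and-wait sketch above). */
void report_ctxt_switch_delta(void (*benchmark)(void))
{
	long before = voluntary_ctxt_switches();

	benchmark();
	printf("voluntary context switches: %ld\n",
	       voluntary_ctxt_switches() - before);
}
```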
On 3/11/23 17:24, Jens Axboe wrote: > On 3/10/23 12:04?PM, Pavel Begunkov wrote: >> io_uring extensively uses task_work, but when a task is waiting >> for multiple CQEs it causes lots of rescheduling. This series >> is an attempt to optimise it and be a base for future improvements. >> >> For some zc network tests eventually waiting for a portion of >> buffers I've got 10x descrease in the number of context switches, >> which reduced the CPU consumption more than twice (17% -> 8%). >> It also helps storage cases, while running fio/t/io_uring against >> a low performant drive it got 2x descrease of the number of context >> switches for QD8 and ~4 times for QD32. >> >> Not for inclusion yet, I want to add an optimisation for when >> waiting for 1 CQE. > > Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for > that, and I see context rates of around 8.1-8.3M/sec with the current > kernel. > > Applied the two patches, but didn't see much of a change? Performance is > about the same, and cx rate ditto. Confused... As you probably know, > this test waits for 32 ios at the time. If I'd to guess it already has perfect batching, for which case the patch does nothing. Maybe it's due to SSD coalescing + small ro I/O + consistency and small latencies of Optanes, or might be on the scheduling and the kernel side to be slow to react. I was looking at trace_io_uring_local_work_run() while testing, It's always should be @loop=QD (i.e. 32) for the patch, but the guess is it's also 32 with that setup but without patches. > Didn't take a closer look just yet, but I grok the concept. One > immediate thing I'd want to change is the FACILE part of it. Let's call > it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT? I don't really care, will change, but let me also ask why? They're more or less synonyms, though facile is much less popular. Is that your reasoning? > I can see this mostly being used for filling a CQE, so it could also be > named something like that. But could also be used for light work in the > same vein, so might not be a good idea to base the naming on that.
On 3/11/23 20:45, Pavel Begunkov wrote: > On 3/11/23 17:24, Jens Axboe wrote: >> On 3/10/23 12:04?PM, Pavel Begunkov wrote: >>> io_uring extensively uses task_work, but when a task is waiting >>> for multiple CQEs it causes lots of rescheduling. This series >>> is an attempt to optimise it and be a base for future improvements. >>> >>> For some zc network tests eventually waiting for a portion of >>> buffers I've got 10x descrease in the number of context switches, >>> which reduced the CPU consumption more than twice (17% -> 8%). >>> It also helps storage cases, while running fio/t/io_uring against >>> a low performant drive it got 2x descrease of the number of context >>> switches for QD8 and ~4 times for QD32. >>> >>> Not for inclusion yet, I want to add an optimisation for when >>> waiting for 1 CQE. >> >> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for >> that, and I see context rates of around 8.1-8.3M/sec with the current >> kernel. >> >> Applied the two patches, but didn't see much of a change? Performance is >> about the same, and cx rate ditto. Confused... As you probably know, >> this test waits for 32 ios at the time. > > If I'd to guess it already has perfect batching, for which case > the patch does nothing. Maybe it's due to SSD coalescing + > small ro I/O + consistency and small latencies of Optanes, > or might be on the scheduling and the kernel side to be slow > to react. And if that's that, I have to note that it's quite a sterile case, the last time I asked the usual batching we're currently getting for networking cases is 1-2.
On 3/11/23 1:45?PM, Pavel Begunkov wrote: > On 3/11/23 17:24, Jens Axboe wrote: >> On 3/10/23 12:04?PM, Pavel Begunkov wrote: >>> io_uring extensively uses task_work, but when a task is waiting >>> for multiple CQEs it causes lots of rescheduling. This series >>> is an attempt to optimise it and be a base for future improvements. >>> >>> For some zc network tests eventually waiting for a portion of >>> buffers I've got 10x descrease in the number of context switches, >>> which reduced the CPU consumption more than twice (17% -> 8%). >>> It also helps storage cases, while running fio/t/io_uring against >>> a low performant drive it got 2x descrease of the number of context >>> switches for QD8 and ~4 times for QD32. >>> >>> Not for inclusion yet, I want to add an optimisation for when >>> waiting for 1 CQE. >> >> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for >> that, and I see context rates of around 8.1-8.3M/sec with the current >> kernel. >> >> Applied the two patches, but didn't see much of a change? Performance is >> about the same, and cx rate ditto. Confused... As you probably know, >> this test waits for 32 ios at the time. > > If I'd to guess it already has perfect batching, for which case > the patch does nothing. Maybe it's due to SSD coalescing + > small ro I/O + consistency and small latencies of Optanes, > or might be on the scheduling and the kernel side to be slow > to react. > > I was looking at trace_io_uring_local_work_run() while testing, > It's always should be @loop=QD (i.e. 32) for the patch, but > the guess is it's also 32 with that setup but without patches. It very well could be that it's just loaded enough that we get perfect batching anyway. I'd need to reuse some of your tracing to know for sure. >> Didn't take a closer look just yet, but I grok the concept. One >> immediate thing I'd want to change is the FACILE part of it. Let's call >> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT? > > I don't really care, will change, but let me also ask why? > They're more or less synonyms, though facile is much less > popular. Is that your reasoning? Yep, it's not very common and the name should be self-explanatory immediately for most people.
On 3/11/23 1:53?PM, Pavel Begunkov wrote: > On 3/11/23 20:45, Pavel Begunkov wrote: >> On 3/11/23 17:24, Jens Axboe wrote: >>> On 3/10/23 12:04?PM, Pavel Begunkov wrote: >>>> io_uring extensively uses task_work, but when a task is waiting >>>> for multiple CQEs it causes lots of rescheduling. This series >>>> is an attempt to optimise it and be a base for future improvements. >>>> >>>> For some zc network tests eventually waiting for a portion of >>>> buffers I've got 10x descrease in the number of context switches, >>>> which reduced the CPU consumption more than twice (17% -> 8%). >>>> It also helps storage cases, while running fio/t/io_uring against >>>> a low performant drive it got 2x descrease of the number of context >>>> switches for QD8 and ~4 times for QD32. >>>> >>>> Not for inclusion yet, I want to add an optimisation for when >>>> waiting for 1 CQE. >>> >>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for >>> that, and I see context rates of around 8.1-8.3M/sec with the current >>> kernel. >>> >>> Applied the two patches, but didn't see much of a change? Performance is >>> about the same, and cx rate ditto. Confused... As you probably know, >>> this test waits for 32 ios at the time. >> >> If I'd to guess it already has perfect batching, for which case >> the patch does nothing. Maybe it's due to SSD coalescing + >> small ro I/O + consistency and small latencies of Optanes, >> or might be on the scheduling and the kernel side to be slow >> to react. > > And if that's that, I have to note that it's quite a sterile > case, the last time I asked the usual batching we're currently > getting for networking cases is 1-2. I can definitely see this being very useful for the more non-deterministic cases where "completions" come in more sporadically. But for the networking case, if this is eg receives, you'd trigger the wakeup anyway to do the actual receive? And then the cqe posting doesn't trigger another wakeup.
On 3/12/23 15:30, Jens Axboe wrote:
> On 3/11/23 1:45 PM, Pavel Begunkov wrote:
>> On 3/11/23 17:24, Jens Axboe wrote:
>>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>>> io_uring extensively uses task_work, but when a task is waiting for multiple CQEs it causes lots of rescheduling. This series is an attempt to optimise it and be a base for future improvements.
>>>>
>>>> For some zc network tests eventually waiting for a portion of buffers I've got 10x descrease in the number of context switches, which reduced the CPU consumption more than twice (17% -> 8%). It also helps storage cases, while running fio/t/io_uring against a low performant drive it got 2x descrease of the number of context switches for QD8 and ~4 times for QD32.
>>>>
>>>> Not for inclusion yet, I want to add an optimisation for when waiting for 1 CQE.
>>>
>>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for that, and I see context rates of around 8.1-8.3M/sec with the current kernel.
>>>
>>> Applied the two patches, but didn't see much of a change? Performance is about the same, and cx rate ditto. Confused... As you probably know, this test waits for 32 ios at the time.
>>
>> If I'd to guess it already has perfect batching, for which case the patch does nothing. Maybe it's due to SSD coalescing + small ro I/O + consistency and small latencies of Optanes, or might be on the scheduling and the kernel side to be slow to react.
>>
>> I was looking at trace_io_uring_local_work_run() while testing, It's always should be @loop=QD (i.e. 32) for the patch, but the guess is it's also 32 with that setup but without patches.
>
> It very well could be that it's just loaded enough that we get perfect batching anyway. I'd need to reuse some of your tracing to know for sure.

I used existing trace points. If you see a pattern

trace_io_uring_local_work_run()
trace_io_uring_cqring_wait(@count=32)
trace_io_uring_local_work_run()
trace_io_uring_cqring_wait(@count=32)
...

that would mean perfect batching. Even more so if @loops=1.

>>> Didn't take a closer look just yet, but I grok the concept. One immediate thing I'd want to change is the FACILE part of it. Let's call it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
>>
>> I don't really care, will change, but let me also ask why? They're more or less synonyms, though facile is much less popular. Is that your reasoning?
>
> Yep, it's not very common and the name should be self-explanatory immediately for most people.

That's exactly the problem. Someone will think that it's like normal tw but "better" and blindly apply it. Same happened before with priority tw lists.
On 3/12/23 15:31, Jens Axboe wrote: > On 3/11/23 1:53?PM, Pavel Begunkov wrote: >> On 3/11/23 20:45, Pavel Begunkov wrote: >>> On 3/11/23 17:24, Jens Axboe wrote: >>>> On 3/10/23 12:04?PM, Pavel Begunkov wrote: >>>>> io_uring extensively uses task_work, but when a task is waiting >>>>> for multiple CQEs it causes lots of rescheduling. This series >>>>> is an attempt to optimise it and be a base for future improvements. >>>>> >>>>> For some zc network tests eventually waiting for a portion of >>>>> buffers I've got 10x descrease in the number of context switches, >>>>> which reduced the CPU consumption more than twice (17% -> 8%). >>>>> It also helps storage cases, while running fio/t/io_uring against >>>>> a low performant drive it got 2x descrease of the number of context >>>>> switches for QD8 and ~4 times for QD32. >>>>> >>>>> Not for inclusion yet, I want to add an optimisation for when >>>>> waiting for 1 CQE. >>>> >>>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for >>>> that, and I see context rates of around 8.1-8.3M/sec with the current >>>> kernel. >>>> >>>> Applied the two patches, but didn't see much of a change? Performance is >>>> about the same, and cx rate ditto. Confused... As you probably know, >>>> this test waits for 32 ios at the time. >>> >>> If I'd to guess it already has perfect batching, for which case >>> the patch does nothing. Maybe it's due to SSD coalescing + >>> small ro I/O + consistency and small latencies of Optanes, >>> or might be on the scheduling and the kernel side to be slow >>> to react. >> >> And if that's that, I have to note that it's quite a sterile >> case, the last time I asked the usual batching we're currently >> getting for networking cases is 1-2. > > I can definitely see this being very useful for the more > non-deterministic cases where "completions" come in more sporadically. > But for the networking case, if this is eg receives, you'd trigger the > wakeup anyway to do the actual receive? And then the cqe posting doesn't > trigger another wakeup. True, In my case zc send notifications were the culprit. It's not in the series, it might be better to not wake eagerly recv poll tw, it'll give time to accumulate more data. I'm a bit afraid of exhausting recv queues this way, so I don't think it's applicable by default.
On 3/12/23 9:45?PM, Pavel Begunkov wrote: >>>> Didn't take a closer look just yet, but I grok the concept. One >>>> immediate thing I'd want to change is the FACILE part of it. Let's call >>>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT? >>> >>> I don't really care, will change, but let me also ask why? >>> They're more or less synonyms, though facile is much less >>> popular. Is that your reasoning? >> >> Yep, it's not very common and the name should be self-explanatory >> immediately for most people. > > That's exactly the problem. Someone will think that it's > like normal tw but "better" and blindly apply it. Same happened > before with priority tw lists. But the way to fix that is not through obscure naming, it's through better and more frequent review. Naming is hard, but naming should be basically self-explanatory in terms of why it differs from not setting that flag. LIGHTWEIGHT and friends isn't great either, maybe it should just be explicit in that this task_work just posts a CQE and hence it's pointless to wake the task to run it unless it'll then meet the criteria of having that task exit its wait loop as it now has enough CQEs available. IO_UF_TWQ_CQE_POST or something like that. Then if it at some point gets modified to also encompass different types of task_work that should not cause wakes, then it can change again. Just tossing suggestions out there...
On 3/13/23 14:16, Jens Axboe wrote:
> On 3/12/23 9:45 PM, Pavel Begunkov wrote:
>>>>> Didn't take a closer look just yet, but I grok the concept. One immediate thing I'd want to change is the FACILE part of it. Let's call it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
>>>>
>>>> I don't really care, will change, but let me also ask why? They're more or less synonyms, though facile is much less popular. Is that your reasoning?
>>>
>>> Yep, it's not very common and the name should be self-explanatory immediately for most people.
>>
>> That's exactly the problem. Someone will think that it's like normal tw but "better" and blindly apply it. Same happened before with priority tw lists.
>
> But the way to fix that is not through obscure naming, it's through better and more frequent review. Naming is hard, but naming should be basically self-explanatory in terms of why it differs from not setting that flag. LIGHTWEIGHT and friends isn't great either, maybe it should just be explicit in that this task_work just posts a CQE and hence it's pointless to wake the task to run it unless it'll then meet the criteria of having that task exit its wait loop as it now has enough CQEs available. IO_UF_TWQ_CQE_POST or something like that. Then if it at some

There are 2 expectations (will add a comment):
1) it posts no more than 1 CQE, 0 is fine
2) it's not urgent, including that it doesn't lock out scarce [system wide] resources. DMA mappings come to mind as an example. IIRC it's a problem even now with nvme passthrough and DEFER_TASKRUN.

> point gets modified to also encompass different types of task_work that should not cause wakes, then it can change again. Just tossing suggestions out there...

I honestly don't see how LIGHTWEIGHT is better. I think a proper name would be _LAZY_WAKE or maybe _DEFERRED_WAKE. It doesn't tell much about why you would want it, but at least sets expectations of what it does. Only needs a comment that multishot is not supported.
On 3/13/23 11:50?AM, Pavel Begunkov wrote: > On 3/13/23 14:16, Jens Axboe wrote: >> On 3/12/23 9:45?PM, Pavel Begunkov wrote: >>>>>> Didn't take a closer look just yet, but I grok the concept. One >>>>>> immediate thing I'd want to change is the FACILE part of it. Let's call >>>>>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT? >>>>> >>>>> I don't really care, will change, but let me also ask why? >>>>> They're more or less synonyms, though facile is much less >>>>> popular. Is that your reasoning? >>> >>>> Yep, it's not very common and the name should be self-explanatory >>>> immediately for most people. >>> >>> That's exactly the problem. Someone will think that it's >>> like normal tw but "better" and blindly apply it. Same happened >>> before with priority tw lists. >> >> But the way to fix that is not through obscure naming, it's through >> better and more frequent review. Naming is hard, but naming should be >> basically self-explanatory in terms of why it differs from not setting >> that flag. LIGHTWEIGHT and friends isn't great either, maybe it should >> just be explicit in that this task_work just posts a CQE and hence it's >> pointless to wake the task to run it unless it'll then meet the criteria >> of having that task exit its wait loop as it now has enough CQEs >> available. IO_UF_TWQ_CQE_POST or something like that. Then if it at some > > There are 2 expectations (will add a comment) > 1) it's posts no more that 1 CQE, 0 is fine > > 2) it's not urgent, including that it doesn't lock out scarce > [system wide] resources. DMA mappings come to mind as an example. > > IIRC is a problem even now with nvme passthrough and DEFER_TASKRUN DMA mappings aren't really scarce, only on weird/crappy setups with a very limited IOMMU space where and IOMMU is being used. So not a huge deal I think. >> point gets modified to also encompass different types of task_work that >> should not cause wakes, then it can change again. Just tossing >> suggestions out there... > > I honestly don't see how LIGHTWEIGHT is better. I think a proper > name would be _LAZY_WAKE or maybe _DEFERRED_WAKE. It doesn't tell > much about why you would want it, but at least sets expectations > what it does. Only needs a comment that multishot is not supported. Agree, and this is what I said too, LIGHTWEIGHT isn't a great word either. DEFERRED_WAKE seems like a good candidate, and it'd be great to also include a code comment there on what it does. That'll help future contributors.
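To make the semantics settled on above concrete, here is a heavily simplified sketch of the deferred-wake idea as described in this exchange. It is illustrative pseudo-kernel code rather than the actual patches; waiter_ctx, nr_wait, queue_tw and wake_waiter are placeholder names.

```c
/* Heavily simplified sketch of the deferred-wake scheme discussed above.
 * Not the actual patches: the types and helpers are placeholders for the
 * io_uring context and the scheduler wakeup. */
#include <stdatomic.h>
#include <stdbool.h>

struct waiter_ctx {
	atomic_int nr_pending_tw; /* deferred-wake work items queued so far */
	int        nr_wait;       /* CQEs the waiting task asked for */
};

/* Placeholder for waking the task sleeping in its CQ wait loop. */
void wake_waiter(struct waiter_ctx *ctx);

void queue_tw(struct waiter_ctx *ctx, bool deferred_wake)
{
	/* ...the work item itself would be linked into the tw list here... */

	if (!deferred_wake) {
		/* urgent or multi-CQE work: wake the task immediately */
		wake_waiter(ctx);
		return;
	}

	/*
	 * Each deferred item is expected to post at most one CQE, so the
	 * task is woken only once enough of them have accumulated to let
	 * it exit its wait loop, instead of once per completion.
	 */
	if (atomic_fetch_add(&ctx->nr_pending_tw, 1) + 1 >= ctx->nr_wait)
		wake_waiter(ctx);
}
```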
Hi Pavel On Fri, Mar 10, 2023 at 07:04:14PM +0000, Pavel Begunkov wrote: > io_uring extensively uses task_work, but when a task is waiting > for multiple CQEs it causes lots of rescheduling. This series > is an attempt to optimise it and be a base for future improvements. > > For some zc network tests eventually waiting for a portion of > buffers I've got 10x descrease in the number of context switches, > which reduced the CPU consumption more than twice (17% -> 8%). > It also helps storage cases, while running fio/t/io_uring against > a low performant drive it got 2x descrease of the number of context > switches for QD8 and ~4 times for QD32. ublk uses io_uring_cmd_complete_in_task()(io_req_task_work_add()) heavily. So I tried this patchset, looks not see obvious change on both IOPS and context switches when running 't/io_uring /dev/ublkb0', and it is one null ublk target(ublk add -t null -z -u 1 -q 2), IOPS is ~2.8M. But ublk applies batch schedule similar with io_uring before calling io_uring_cmd_complete_in_task(). thanks, Ming
On 3/15/23 02:35, Ming Lei wrote:
> Hi Pavel
>
> On Fri, Mar 10, 2023 at 07:04:14PM +0000, Pavel Begunkov wrote:
>> io_uring extensively uses task_work, but when a task is waiting for multiple CQEs it causes lots of rescheduling. This series is an attempt to optimise it and be a base for future improvements.
>>
>> For some zc network tests eventually waiting for a portion of buffers I've got 10x descrease in the number of context switches, which reduced the CPU consumption more than twice (17% -> 8%). It also helps storage cases, while running fio/t/io_uring against a low performant drive it got 2x descrease of the number of context switches for QD8 and ~4 times for QD32.
>
> ublk uses io_uring_cmd_complete_in_task()(io_req_task_work_add()) heavily. So I tried this patchset, looks not see obvious change on both IOPS and context switches when running 't/io_uring /dev/ublkb0', and it is one null ublk target(ublk add -t null -z -u 1 -q 2), IOPS is ~2.8M.

Hi Ming,

It's enabled for rw requests and send-zc notifications, but io_uring_cmd_complete_in_task() is not covered. I'll be enabling it for more cases, including pass through.

> But ublk applies batch schedule similar with io_uring before calling io_uring_cmd_complete_in_task().

The feature doesn't tolerate tw that produce multiple CQEs, so it can't be applied to this batching and the task would get stuck waiting.

btw, from a quick look it appeared that ublk batching is there to keep requests together but not to improve batching. And if so, I think we can get rid of it, rely on io_uring batching, and let ublk gather its requests from the tw list, which sounds cleaner. I'll elaborate on that later.
On Wed, Mar 15, 2023 at 04:53:09PM +0000, Pavel Begunkov wrote: > On 3/15/23 02:35, Ming Lei wrote: > > Hi Pavel > > > > On Fri, Mar 10, 2023 at 07:04:14PM +0000, Pavel Begunkov wrote: > > > io_uring extensively uses task_work, but when a task is waiting > > > for multiple CQEs it causes lots of rescheduling. This series > > > is an attempt to optimise it and be a base for future improvements. > > > > > > For some zc network tests eventually waiting for a portion of > > > buffers I've got 10x descrease in the number of context switches, > > > which reduced the CPU consumption more than twice (17% -> 8%). > > > It also helps storage cases, while running fio/t/io_uring against > > > a low performant drive it got 2x descrease of the number of context > > > switches for QD8 and ~4 times for QD32. > > > > ublk uses io_uring_cmd_complete_in_task()(io_req_task_work_add()) > > heavily. So I tried this patchset, looks not see obvious change > > on both IOPS and context switches when running 't/io_uring /dev/ublkb0', > > and it is one null ublk target(ublk add -t null -z -u 1 -q 2), IOPS > > is ~2.8M. > > Hi Ming, > > It's enabled for rw requests and send-zc notifications, but > io_uring_cmd_complete_in_task() is not covered. I'll be enabling > it for more cases, including pass through. > > > But ublk applies batch schedule similar with io_uring before calling > > io_uring_cmd_complete_in_task(). > > The feature doesn't tolerate tw that produce multiple CQEs, so > it can't be applied to this batching and the task would stuck > waiting. > > btw, from a quick look it appeared that ublk batching is there > to keep requests together but not to improve batching. And if so, > I think we can get rid of it, rely on io_uring batching and > let ublk to gather its requests from tw list, which sounds > cleaner. I'll elaborate on that later Yeah, the ublk batching can be removed since __io_req_task_work_add already does it, and it is kept just for micro optimization of calling less io_uring_cmd_complete_in_task(), but I think we can get bigger improvement with your tw optimization. Thanks, Ming
On 3/11/23 17:24, Jens Axboe wrote:
> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>> io_uring extensively uses task_work, but when a task is waiting for multiple CQEs it causes lots of rescheduling. This series is an attempt to optimise it and be a base for future improvements.
>>
>> For some zc network tests eventually waiting for a portion of buffers I've got 10x descrease in the number of context switches, which reduced the CPU consumption more than twice (17% -> 8%). It also helps storage cases, while running fio/t/io_uring against a low performant drive it got 2x descrease of the number of context switches for QD8 and ~4 times for QD32.
>>
>> Not for inclusion yet, I want to add an optimisation for when waiting for 1 CQE.
>
> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for that, and I see context rates of around 8.1-8.3M/sec with the current kernel.

Tried it out. No difference with bs=512: QD4 completes before it gets to schedule() in io_cqring_wait(). With QD32, the local tw run __io_run_local_work() spins 2 loops, and QD8 is somewhere in the middle with rare extra sched.

For bs=4096 QD=8 I see a lot of:

io_cqring_wait() @min_events=8
schedule()
__io_run_local_work() nr=4
schedule()
__io_run_local_work() nr=4

And if we benchmark without and with the patch there is a nice CPU util reduction.

CPU  %usr  %nice  %sys   %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
0    1.18  0.00   19.24  0.00     0.00  0.00   0.00    0.00    0.00    79.57
0    1.63  0.00   29.38  0.00     0.00  0.00   0.00    0.00    0.00    68.98