Message ID | 20240228192355.290114-2-axboe@kernel.dk |
---|---|
State | New |
Series | Split iowait into two states |
Commit Message
Jens Axboe
Feb. 28, 2024, 7:16 p.m. UTC
In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
rq lock, and hence don't need atomics to modify the current per-rq
iowait count. In the 4th case, where we are scheduling in on a different
CPU than the task was previously on, we do not hold the previous rq lock,
and hence still need to use an atomic to increment the iowait count.
Rename the existing nr_iowait to nr_iowait_remote, and use that for the
4th case. The other three cases can simply inc/dec in a non-atomic
fashion under the held rq lock.
The per-rq iowait now becomes the difference between the two, the local
count minus the remote count.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
kernel/sched/core.c | 15 ++++++++++-----
kernel/sched/cputime.c | 3 +--
kernel/sched/sched.h | 8 +++++++-
3 files changed, 18 insertions(+), 8 deletions(-)
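For orientation before the review thread below, here is a condensed sketch of the accounting scheme the commit message describes. The helper names are invented for illustration and do not appear in the patch; the actual changes are in the diff at the bottom of this page.

/* Illustration only; these helpers do not exist in the patch. */

/* __schedule(): task goes to sleep in iowait, rq lock held */
static inline void iowait_start_locked(struct rq *rq)
{
	rq->nr_iowait++;
}

/* ttwu_do_activate(): wakeup on the same runqueue, rq lock held */
static inline void iowait_end_locked(struct rq *rq)
{
	rq->nr_iowait--;
}

/* try_to_wake_up(): wakeup migrates to another CPU, the previous rq lock
 * is not held, so the decrement is recorded in the remote atomic instead */
static inline void iowait_end_remote(struct rq *rq)
{
	atomic_inc(&rq->nr_iowait_remote);
}

/* nr_iowait_cpu(): local count minus the remotely accounted decrements */
static inline unsigned int iowait_count(struct rq *rq)
{
	return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
}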
Comments
On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the

We modify something and hold locks? It's documented that changelogs
should not impersonate code. It simply does not make any sense.

> rq lock, and hence don't need atomics to modify the current per-rq
> iowait count. In the 4th case, where we are scheduling in on a different
> CPU than the task was previously on, we do not hold the previous rq lock,
> and hence still need to use an atomic to increment the iowait count.
>
> Rename the existing nr_iowait to nr_iowait_remote, and use that for the
> 4th case. The other three cases can simply inc/dec in a non-atomic
> fashion under the held rq lock.
>
> The per-rq iowait now becomes the difference between the two, the local
> count minus the remote count.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Other than that:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
On 2/29/24 9:53 AM, Thomas Gleixner wrote:
> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
>
> We modify something and hold locks? It's documented that changelogs
> should not impersonate code. It simply does not make any sense.

Agree it doesn't read that well... It's meant to say that we already
hold the rq lock in 3 of the 4 spots, hence using atomic_inc/dec is
pointless for those cases.

> Other than that:
>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Thanks for the review!
On Thu, Feb 29 2024 at 10:19, Jens Axboe wrote:
> On 2/29/24 9:53 AM, Thomas Gleixner wrote:
>> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>>> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
>>
>> We modify something and hold locks? It's documented that changelogs
>> should not impersonate code. It simply does not make any sense.
>
> Agree it doesn't read that well... It's meant to say that we already
> hold the rq lock in 3 of the 4 spots, hence using atomic_inc/dec is
> pointless for those cases.

That and the 'we'. Write it neutral.

  The accounting of rq::nr_iowait is using an atomic_t but 3 out of 4
  places hold runqueue lock already. ....

So but I just noticed that there is actually an issue with this:

>  unsigned int nr_iowait_cpu(int cpu)
>  {
> -	return atomic_read(&cpu_rq(cpu)->nr_iowait);
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);

The access to rq->nr_iowait is not protected by the runqueue lock and
therefore a data race when @cpu is not the current CPU.

This needs to be properly annotated and explained why it does not
matter.

So s/Reviewed-by/Un-Reviewed-by/

Though thinking about it some more. Is this split a real benefit over
always using the atomic? Do you have numbers to show?

Thanks,

        tglx
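For readers unfamiliar with the annotation being asked for: in kernel code an intentional lockless read is normally wrapped in READ_ONCE() (with WRITE_ONCE() on the writer side) so that KCSAN treats the race as deliberate and the compiler cannot tear the access. A sketch of what that could look like here, purely illustrative and not what was merged:

unsigned int nr_iowait_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/*
	 * Lockless read of a counter that is only modified under the rq
	 * lock; READ_ONCE() documents the intentional data race and
	 * prevents load tearing.
	 */
	return READ_ONCE(rq->nr_iowait) - atomic_read(&rq->nr_iowait_remote);
}

This only covers the tearing/annotation concern; the consistency problem between the two counters, raised in the next message, is a separate issue.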
On 2/29/24 10:42 AM, Thomas Gleixner wrote:
> On Thu, Feb 29 2024 at 10:19, Jens Axboe wrote:
>> On 2/29/24 9:53 AM, Thomas Gleixner wrote:
>>> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>>>> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
>>>
>>> We modify something and hold locks? It's documented that changelogs
>>> should not impersonate code. It simply does not make any sense.
>>
>> Agree it doesn't read that well... It's meant to say that we already
>> hold the rq lock in 3 of the 4 spots, hence using atomic_inc/dec is
>> pointless for those cases.
>
> That and the 'we'. Write it neutral.
>
>   The accounting of rq::nr_iowait is using an atomic_t but 3 out of 4
>   places hold runqueue lock already. ....

Will do

> So but I just noticed that there is actually an issue with this:
>
>>  unsigned int nr_iowait_cpu(int cpu)
>>  {
>> -	return atomic_read(&cpu_rq(cpu)->nr_iowait);
>> +	struct rq *rq = cpu_rq(cpu);
>> +
>> +	return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>
> The access to rq->nr_iowait is not protected by the runqueue lock and
> therefore a data race when @cpu is not the current CPU.
>
> This needs to be properly annotated and explained why it does not
> matter.

But that was always racy before as well, if someone else is inc/dec'ing
->nr_iowait while it's being read, you could get either the before or
after value. This doesn't really change that. I could've sworn I
mentioned that in the commit message, but I did not.

> So s/Reviewed-by/Un-Reviewed-by/
>
> Though thinking about it some more. Is this split a real benefit over
> always using the atomic? Do you have numbers to show?

It was more on Peter's complaint that now we're trading a single atomic
for two, hence I got to thinking about nr_iowait in general. I don't
have numbers showing it matters, as mentioned in another email the most
costly part about this seems to be fetching task->in_iowait and not the
actual atomic.
On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>> So but I just noticed that there is actually an issue with this:
>>
>>>  unsigned int nr_iowait_cpu(int cpu)
>>>  {
>>> -	return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>> +	struct rq *rq = cpu_rq(cpu);
>>> +
>>> +	return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>
>> The access to rq->nr_iowait is not protected by the runqueue lock and
>> therefore a data race when @cpu is not the current CPU.
>>
>> This needs to be properly annotated and explained why it does not
>> matter.
>
> But that was always racy before as well, if someone else is inc/dec'ing
> ->nr_iowait while it's being read, you could get either the before or
> after value. This doesn't really change that. I could've sworn I
> mentioned that in the commit message, but I did not.

There are actually two issues here:

 1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
    is consistent in itself.

    Non-atomic inc/dec is not guaranteeing that the concurrent read is a
    consistent value as the compiler is free to do store/load
    tearing. Unlikely but not guaranteed to never happen.

    KCSAN will complain about it sooner than later and then someone has
    to go and do the analysis and the annotation. I rather let you do
    the reasoning now than chasing you down later :)

 2) What's worse is that the result can be completely bogus:

    i.e.

    CPU0                              CPU1              CPU2
    a = rq(CPU1)->nr_iowait;    // 0
                                      rq->nr_iowait++;
                                                        rq(CPU1)->nr_iowait_remote++;
    b = rq(CPU1)->nr_iowait_remote;   // 1

    r = a - b;                  // -1
    return (unsigned int) r;    // UINT_MAX

    The consumers of this interface might be upset. :)

    While with a single atomic_t it's guaranteed that the result is
    always greater or equal zero.

>> So s/Reviewed-by/Un-Reviewed-by/
>>
>> Though thinking about it some more. Is this split a real benefit over
>> always using the atomic? Do you have numbers to show?
>
> It was more on Peter's complaint that now we're trading a single atomic
> for two, hence I got to thinking about nr_iowait in general. I don't
> have numbers showing it matters, as mentioned in another email the most
> costly part about this seems to be fetching task->in_iowait and not the
> actual atomic.

On the write side (except for the remote case) the cache line is already
dirty on the current CPU and I doubt that the atomic will be
noticable. If there is concurrent remote access to the runqueue then the
cache line is bouncing no matter what.

On the read side there is always an atomic operation required, so it's
not really different.

I assume Peter's complaint was about the extra nr_iowait_acct part. I
think that's solvable without the extra atomic_t member and with a
single atomic_add()/sub(). atomic_t is 32bit wide, so what about
splitting the thing and adding/subtracting both in one go?

While sketching this I noticed that prepare/finish can be written w/o
any conditionals.
int io_schedule_prepare(void)
{
	int flags = current->in_iowait + current->in_iowait_acct << 16;

	current->in_iowait = 1;
	current->in_iowait_acct = 1;
	blk_flush_plug(current->plug, true);
	return flags;
}

void io_schedule_finish(int old_wait_flags)
{
	current->in_iowait = flags & 0x01;
	current->in_iowait_acct = flags >> 16;
}

Now __schedule():

	if (prev->in_iowait) {
		int x = 1 + current->in_iowait_acct << 16;

		atomic_add(x, rq->nr_iowait);
		delayacct_blkio_start();
	}

and ttwu_do_activate():

	if (p->in_iowait) {
		int x = 1 + current->in_iowait_acct << 16;

		delayacct_blkio_end(p);
		atomic_sub(x, task_rq(p)->nr_iowait);
	}

and try_to_wake_up():

	delayacct_blkio_end(p);

	int x = 1 + current->in_iowait_acct << 16;

	atomic_add(x, task_rq(p)->nr_iowait);

nr_iowait_acct_cpu() becomes:

	return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;

and nr_iowait_cpu():

	return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);

Obviously written with proper inline wrappers and defines, but you get
the idea.

Hmm?

Thanks,

        tglx
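The sketch above packs both counters into one atomic_t, with the accounted count in the upper 16 bits. Below is a small self-contained userspace demo of the same encoding, written with C11 atomics so it can be compiled and run as-is; note the parentheses around the shift, since '+' binds tighter than '<<' in C and the quick sketch above omits them. The names are invented for the demo.

/* Build with: cc -std=c11 demo.c && ./a.out */
#include <stdatomic.h>
#include <stdio.h>

#define IOWAIT_ACCT_SHIFT	16
#define IOWAIT_MASK		((1u << IOWAIT_ACCT_SHIFT) - 1)

static atomic_uint nr_iowait;	/* low 16 bits: iowait, high 16 bits: iowait_acct */

static void iowait_enter(unsigned int acct)
{
	/* one RMW updates both counts; parentheses around the shift matter */
	atomic_fetch_add(&nr_iowait, 1u + (acct << IOWAIT_ACCT_SHIFT));
}

static void iowait_leave(unsigned int acct)
{
	atomic_fetch_sub(&nr_iowait, 1u + (acct << IOWAIT_ACCT_SHIFT));
}

static unsigned int nr_iowait_cpu(void)
{
	return atomic_load(&nr_iowait) & IOWAIT_MASK;
}

static unsigned int nr_iowait_acct_cpu(void)
{
	return atomic_load(&nr_iowait) >> IOWAIT_ACCT_SHIFT;
}

int main(void)
{
	iowait_enter(1);	/* in iowait, accounted */
	iowait_enter(0);	/* in iowait, not accounted */
	printf("iowait=%u acct=%u\n", nr_iowait_cpu(), nr_iowait_acct_cpu());
	/* prints: iowait=2 acct=1 */
	iowait_leave(1);
	iowait_leave(0);
	return 0;
}

The kernel version would of course use atomic_t, atomic_add()/atomic_sub() and proper inline wrappers, as the sketch itself notes.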
On 2/29/24 12:52 PM, Thomas Gleixner wrote:
> On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
>> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>>> So but I just noticed that there is actually an issue with this:
>>>
>>>>  unsigned int nr_iowait_cpu(int cpu)
>>>>  {
>>>> -	return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>>> +	struct rq *rq = cpu_rq(cpu);
>>>> +
>>>> +	return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>>
>>> The access to rq->nr_iowait is not protected by the runqueue lock and
>>> therefore a data race when @cpu is not the current CPU.
>>>
>>> This needs to be properly annotated and explained why it does not
>>> matter.
>>
>> But that was always racy before as well, if someone else is inc/dec'ing
>> ->nr_iowait while it's being read, you could get either the before or
>> after value. This doesn't really change that. I could've sworn I
>> mentioned that in the commit message, but I did not.
>
> There are actually two issues here:
>
>  1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
>     is consistent in itself.
>
>     Non-atomic inc/dec is not guaranteeing that the concurrent read is a
>     consistent value as the compiler is free to do store/load
>     tearing. Unlikely but not guaranteed to never happen.
>
>     KCSAN will complain about it sooner than later and then someone has
>     to go and do the analysis and the annotation. I rather let you do
>     the reasoning now than chasing you down later :)

Fair enough.

>  2) What's worse is that the result can be completely bogus:
>
>     i.e.
>
>     CPU0                              CPU1              CPU2
>     a = rq(CPU1)->nr_iowait;    // 0
>                                       rq->nr_iowait++;
>                                                         rq(CPU1)->nr_iowait_remote++;
>     b = rq(CPU1)->nr_iowait_remote;   // 1
>
>     r = a - b;                  // -1
>     return (unsigned int) r;    // UINT_MAX
>
>     The consumers of this interface might be upset. :)
>
>     While with a single atomic_t it's guaranteed that the result is
>     always greater or equal zero.

Yeah OK, this is a real problem...

>>> So s/Reviewed-by/Un-Reviewed-by/
>>>
>>> Though thinking about it some more. Is this split a real benefit over
>>> always using the atomic? Do you have numbers to show?
>>
>> It was more on Peter's complaint that now we're trading a single atomic
>> for two, hence I got to thinking about nr_iowait in general. I don't
>> have numbers showing it matters, as mentioned in another email the most
>> costly part about this seems to be fetching task->in_iowait and not the
>> actual atomic.
>
> On the write side (except for the remote case) the cache line is already
> dirty on the current CPU and I doubt that the atomic will be
> noticable. If there is concurrent remote access to the runqueue then the
> cache line is bouncing no matter what.

That was my exact thinking too, same cacheline and back-to-back atomics
don't really matter vs a single atomic on it.

> On the read side there is always an atomic operation required, so it's
> not really different.
>
> I assume Peter's complaint was about the extra nr_iowait_acct part. I
> think that's solvable without the extra atomic_t member and with a
> single atomic_add()/sub(). atomic_t is 32bit wide, so what about
> splitting the thing and adding/subtracting both in one go?
>
> While sketching this I noticed that prepare/finish can be written w/o
> any conditionals.
>
> int io_schedule_prepare(void)
> {
> 	int flags = current->in_iowait + current->in_iowait_acct << 16;
>
> 	current->in_iowait = 1;
> 	current->in_iowait_acct = 1;
> 	blk_flush_plug(current->plug, true);
> 	return flags;
> }
>
> void io_schedule_finish(int old_wait_flags)
> {
> 	current->in_iowait = flags & 0x01;
> 	current->in_iowait_acct = flags >> 16;
> }
>
> Now __schedule():
>
> 	if (prev->in_iowait) {
> 		int x = 1 + current->in_iowait_acct << 16;
>
> 		atomic_add(x, rq->nr_iowait);
> 		delayacct_blkio_start();
> 	}
>
> and ttwu_do_activate():
>
> 	if (p->in_iowait) {
> 		int x = 1 + current->in_iowait_acct << 16;
>
> 		delayacct_blkio_end(p);
> 		atomic_sub(x, task_rq(p)->nr_iowait);
> 	}
>
> and try_to_wake_up():
>
> 	delayacct_blkio_end(p);
>
> 	int x = 1 + current->in_iowait_acct << 16;
>
> 	atomic_add(x, task_rq(p)->nr_iowait);
>
> nr_iowait_acct_cpu() becomes:
>
> 	return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;
>
> and nr_iowait_cpu():
>
> 	return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
>
> Obviously written with proper inline wrappers and defines, but you get
> the idea.

I'll play with this a bit, but do we want to switch to an atomic_long_t
for this? 2^16 in iowait seems extreme, but it definitely seems possible
to overflow it.
On Thu, Feb 29 2024 at 15:30, Jens Axboe wrote:
> On 2/29/24 12:52 PM, Thomas Gleixner wrote:
>> 	return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
>>
>> Obviously written with proper inline wrappers and defines, but you get
>> the idea.
>
> I'll play with this a bit, but do we want to switch to an atomic_long_t
> for this? 2^16 in iowait seems extreme, but it definitely seems possible
> to overflow it.

Indeed. 32bit has PID_MAX_LIMIT == 0x8000 which obviously fits into 16
bits, while 64bit lifts that limit and relies on memory exhaustion to
limit the number of concurrent threads on the machine, but that
obviously can exceed 16bits.

Whether more than 2^16 sit in iowait concurrently on a single CPU that's
a different question and probably more academic. :)

Though as this will touch all nr_iowait places anyway changing it to
atomic_long_t in a preparatory patch first makes a lot of sense.

Thanks,

        tglx
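If the counter is widened as discussed, the same encoding simply moves to a 32/32 split on 64-bit kernels. A brief variant of the demo above using a 64-bit C11 atomic, again purely illustrative with invented names:

#include <stdatomic.h>
#include <stdint.h>

#define IOWAIT_ACCT_SHIFT	32
#define IOWAIT_MASK		((UINT64_C(1) << IOWAIT_ACCT_SHIFT) - 1)

static _Atomic uint64_t nr_iowait;

static inline void iowait_add(unsigned int acct)
{
	/* one RMW still covers both the plain and the accounted count */
	atomic_fetch_add(&nr_iowait, 1 + ((uint64_t)acct << IOWAIT_ACCT_SHIFT));
}

static inline uint64_t nr_iowait_cpu(void)
{
	return atomic_load(&nr_iowait) & IOWAIT_MASK;
}

static inline uint64_t nr_iowait_acct_cpu(void)
{
	return atomic_load(&nr_iowait) >> IOWAIT_ACCT_SHIFT;
}

On 32-bit kernels atomic_long_t stays 32 bits wide, which is fine since, as noted above, PID_MAX_LIMIT fits in 16 bits there.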
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..48d15529a777 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3789,7 +3789,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 #endif
 	if (p->in_iowait) {
 		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
+		task_rq(p)->nr_iowait--;
 	}
 
 	activate_task(rq, p, en_flags);
@@ -4354,8 +4354,10 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
 		if (task_cpu(p) != cpu) {
 			if (p->in_iowait) {
+				struct rq *__rq = task_rq(p);
+
 				delayacct_blkio_end(p);
-				atomic_dec(&task_rq(p)->nr_iowait);
+				atomic_inc(&__rq->nr_iowait_remote);
 			}
 
 			wake_flags |= WF_MIGRATED;
@@ -5463,7 +5465,9 @@ unsigned long long nr_context_switches(void)
 
 unsigned int nr_iowait_cpu(int cpu)
 {
-	return atomic_read(&cpu_rq(cpu)->nr_iowait);
+	struct rq *rq = cpu_rq(cpu);
+
+	return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
 }
 
 /*
@@ -6681,7 +6685,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
+				rq->nr_iowait++;
 				delayacct_blkio_start();
 			}
 		}
@@ -10029,7 +10033,8 @@ void __init sched_init(void)
 #endif
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
-		atomic_set(&rq->nr_iowait, 0);
+		rq->nr_iowait = 0;
+		atomic_set(&rq->nr_iowait_remote, 0);
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..0ed81c2d3c3b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -222,9 +222,8 @@ void account_steal_time(u64 cputime)
 void account_idle_time(u64 cputime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
-	struct rq *rq = this_rq();
 
-	if (atomic_read(&rq->nr_iowait) > 0)
+	if (nr_iowait_cpu(smp_processor_id()) > 0)
 		cpustat[CPUTIME_IOWAIT] += cputime;
 	else
 		cpustat[CPUTIME_IDLE] += cputime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..91fa5b4d45ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1049,7 +1049,13 @@ struct rq {
 	u64			clock_idle_copy;
 #endif
 
-	atomic_t		nr_iowait;
+	/*
+	 * Total per-cpu iowait is the difference of the two below. One is
+	 * modified under the rq lock (nr_iowait), and if we don't have the rq
+	 * lock, then nr_iowait_remote is used.
+	 */
+	unsigned int		nr_iowait;
+	atomic_t		nr_iowait_remote;
 
 #ifdef CONFIG_SCHED_DEBUG
 	u64			last_seen_need_resched_ns;