Message ID | 20230517152654.7193-2-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:6358:3046:b0:115:7a1d:dabb with SMTP id p6csp1103602rwl; Wed, 17 May 2023 08:31:28 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6XQqEhI8UEzckzt0ufMThiegs/OI24dqVJNCidWDXuVujh83CY1FpIdQlumAfrjuQtizlL X-Received: by 2002:a05:6a00:c83:b0:646:b3f6:945c with SMTP id a3-20020a056a000c8300b00646b3f6945cmr194619pfv.4.1684337487905; Wed, 17 May 2023 08:31:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684337487; cv=none; d=google.com; s=arc-20160816; b=wkXe2140sMdkYQ6TgJaFdt626H0DzYnP11C5CCOAXiHFGdkNKVO5OP0S04iMfG6nLS ymgeU5LXFzIw+UGhokdpoYh9OrH+8PlJMynprUG3UyzmZ7Uw8Yaqd1R4GFw5GBsr3KMH gtwRLuLXztd2EcsSC0khn7hVovXHa60/u1pOeEfwRMiKWsdZ/uDrrtv7eTweEtIIQEyg wNGQrVi3blErYBGUfTapmmWB4aRVkJ3UtcXWra6N2T55VHK/h8arLHyz0b/iKuAmyq7r UB+c1sVHlesfRewvmKKvbYdVlfVSG9ZB3HmwTBIwZPymE8wGgiVnHpepuPnx+xDtSs6/ f97Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=I+n62NOzeqoDJaIS0fVUVqcuLGX1IUfTknQBNq/CXhc=; b=kaKakaJRDaSg+1e65qCfvKfSyEdCxos05IDblQr7K2WRSbLkEnQnP480wKLou8WRCx FJGbiyrlyM9ZWA/U0QOYiP8elUqpQc1496NdtUiLaIVOcYzHH3oLcMWbOS/b3E9+5dkK oCGnn3QMC2bYrdAY+PUCuB5OZL99pX3PTErQSJmkbnAjcDoP612zfUKGfvJvu30jDUok zNbaB3Mf5HsPzObeIs5u0Wzkza+wmmcp/YGf19aC7hS224SE9gYJXD4UilsQo1qlXdYd a+/zEqGxSwul6yVXeScu3FoTo3Cf6cxnQWrDFUcCb9xXXvzGGErisJzsTw4qu5Mp2u0Q fNJg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@efficios.com header.s=smtpout1 header.b=EmcH9M9u; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=efficios.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f67-20020a623846000000b0064105588e53si12070556pfa.359.2023.05.17.08.31.15; Wed, 17 May 2023 08:31:27 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@efficios.com header.s=smtpout1 header.b=EmcH9M9u; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=efficios.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231860AbjEQP1z (ORCPT <rfc822;abdi.embedded@gmail.com> + 99 others); Wed, 17 May 2023 11:27:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56202 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232361AbjEQP1V (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 17 May 2023 11:27:21 -0400 Received: from smtpout.efficios.com (smtpout.efficios.com [167.114.26.122]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3F41C9ECB; Wed, 17 May 2023 08:27:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=efficios.com; s=smtpout1; t=1684337224; bh=1RursNp9qkMTEe/+yMwWnmeNTzScNBsuw1A9Btx9VgI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=EmcH9M9uGLTxz1YI6yx1ArarwBBRUSw2TPdgE8iVHAMD9UT7tr/3iiFq244PhEw8o +Gr9a+zVV/aRi+thfHp+rrk4OlidHlM/r6zMlInIZUBrTJkh2K2qsJKpPBI10cGrRk P7Z6U6DAmDnJdtq4tRBfv3dzZfMswU1QVRshqBB8E7eUCXAyUhgl54p8BGWpKcscaS 6kN2/UQguakV2xf384R2W3zsMUOP7vaXC9hSn41w7FJSNzFUYdm/EqCiY77zMfNheZ INdQlrRmyq1mF60w+pTO0boVcG1e8vbRqIZmnMc8jJrlAxfrM670yaz0JfbKGa+75/ 1+C6Qp62LE5IQ== Received: from localhost.localdomain (192-222-143-198.qc.cable.ebox.net [192.222.143.198]) by smtpout.efficios.com (Postfix) with ESMTPSA id 4QLxmR53HXz131r; Wed, 17 May 2023 11:27:03 -0400 (EDT) From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> To: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de>, "Paul E . McKenney" <paulmck@kernel.org>, Boqun Feng <boqun.feng@gmail.com>, "H . Peter Anvin" <hpa@zytor.com>, Paul Turner <pjt@google.com>, linux-api@vger.kernel.org, Christian Brauner <brauner@kernel.org>, Florian Weimer <fw@deneb.enyo.de>, David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov <posk@posk.io>, Alexander Mikhalitsyn <alexander@mihalicyn.com>, Chris Kennelly <ckennelly@google.com>, Ingo Molnar <mingo@redhat.com>, Darren Hart <dvhart@infradead.org>, Davidlohr Bueso <dave@stgolabs.net>, =?utf-8?q?Andr=C3=A9_Almeida?= <andrealmeid@igalia.com>, libc-alpha@sourceware.org, Steven Rostedt <rostedt@goodmis.org>, Jonathan Corbet <corbet@lwn.net>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Florian Weimer <fweimer@redhat.com> Subject: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Date: Wed, 17 May 2023 11:26:51 -0400 Message-Id: <20230517152654.7193-2-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230517152654.7193-1-mathieu.desnoyers@efficios.com> References: <20230517152654.7193-1-mathieu.desnoyers@efficios.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1766155866138840401?= X-GMAIL-MSGID: =?utf-8?q?1766155866138840401?= |
Series | Extend rseq with sched_state field | |
Commit Message
Mathieu Desnoyers
May 17, 2023, 3:26 p.m. UTC
Expose the "on-cpu" state for each thread through struct rseq to allow
adaptative mutexes to decide more accurately between busy-waiting and
calling sys_futex() to release the CPU, based on the on-cpu state of the
mutex owner.
It is only provided as an optimization hint, because there is no
guarantee that the page containing this field is in the page cache, and
therefore the scheduler may very well fail to clear the on-cpu state on
preemption. This is expected to be rare though, and is resolved as soon
as the task returns to user-space.
The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org
---
include/linux/sched.h | 12 ++++++++++++
include/uapi/linux/rseq.h | 17 +++++++++++++++++
kernel/rseq.c | 14 ++++++++++++++
3 files changed, 43 insertions(+)
Comments
On Wed, 17 May 2023, Mathieu Desnoyers wrote: >Expose the "on-cpu" state for each thread through struct rseq to allow >adaptative mutexes to decide more accurately between busy-waiting and >calling sys_futex() to release the CPU, based on the on-cpu state of the >mutex owner. Oh yeah moving the spin stuff out of the kernel is much nicer. >It is only provided as an optimization hint, because there is no >guarantee that the page containing this field is in the page cache, and >therefore the scheduler may very well fail to clear the on-cpu state on >preemption. This is expected to be rare though, and is resolved as soon >as the task returns to user-space. > >The goal is to improve use-cases where the duration of the critical >sections for a given lock follows a multi-modal distribution, preventing >statistical guesses from doing a good job at choosing between busy-wait >and futex wait behavior. > >Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >Cc: Jonathan Corbet <corbet@lwn.net> >Cc: Steven Rostedt (Google) <rostedt@goodmis.org> >Cc: Carlos O'Donell <carlos@redhat.com> >Cc: Florian Weimer <fweimer@redhat.com> >Cc: libc-alpha@sourceware.org >--- > include/linux/sched.h | 12 ++++++++++++ > include/uapi/linux/rseq.h | 17 +++++++++++++++++ > kernel/rseq.c | 14 ++++++++++++++ > 3 files changed, 43 insertions(+) Ie: previous efforts kernel/futex.c | 675 ++++++++++++++++++++++++++++++++++++++------ kernel/futex.c | 572 ++++++++++++++++++++++++++++++++++++------------- Thanks, Davidlohr
On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote: > Expose the "on-cpu" state for each thread through struct rseq to allow > adaptative mutexes to decide more accurately between busy-waiting and > calling sys_futex() to release the CPU, based on the on-cpu state of the > mutex owner. > > It is only provided as an optimization hint, because there is no > guarantee that the page containing this field is in the page cache, and > therefore the scheduler may very well fail to clear the on-cpu state on > preemption. This is expected to be rare though, and is resolved as soon > as the task returns to user-space. > > The goal is to improve use-cases where the duration of the critical > sections for a given lock follows a multi-modal distribution, preventing > statistical guesses from doing a good job at choosing between busy-wait > and futex wait behavior. > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > Cc: Carlos O'Donell <carlos@redhat.com> > Cc: Florian Weimer <fweimer@redhat.com> > Cc: libc-alpha@sourceware.org > --- > include/linux/sched.h | 12 ++++++++++++ > include/uapi/linux/rseq.h | 17 +++++++++++++++++ > kernel/rseq.c | 14 ++++++++++++++ > 3 files changed, 43 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index eed5d65b8d1f..c7e9248134c1 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > rseq_handle_notify_resume(ksig, regs); > } > > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > + > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (t->rseq) > + __rseq_set_sched_state(t, state); > +} > + > /* rseq_preempt() requires preemption to be disabled. */ > static inline void rseq_preempt(struct task_struct *t) > { > __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > rseq_set_notify_resume(t); > + rseq_set_sched_state(t, 0); > } > > /* rseq_migrate() requires preemption to be disabled. */ > @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > struct pt_regs *regs) > { > } > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > +} > static inline void rseq_preempt(struct task_struct *t) > { > } > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > index c233aae5eac9..c6d8537e23ca 100644 > --- a/include/uapi/linux/rseq.h > +++ b/include/uapi/linux/rseq.h > @@ -37,6 +37,13 @@ enum rseq_cs_flags { > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > }; > > +enum rseq_sched_state { > + /* > + * Task is currently running on a CPU if bit is set. > + */ > + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), > +}; > + > /* > * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always > * contained within a single cache-line. It is usually declared as > @@ -148,6 +155,16 @@ struct rseq { > */ > __u32 mm_cid; > > + /* > + * Restartable sequences sched_state field. Updated by the kernel. Read > + * by user-space with single-copy atomicity semantics. This fields can > + * be read by any userspace thread. Aligned on 32-bit. Contains a Maybe this is a premature optimization, but since most of the time the bit would be read by another thread, does it make sense putting the "sched_state" into a different cache line to avoid false sharing? Also We could have a "sched_state_local" and "sched_state_remote" for different usages (local reads vs remote reads). Regards, Boqun > + * bitmask of enum rseq_sched_state. This field is provided as a hint > + * by the scheduler, and requires that the page holding struct rseq is > + * faulted-in for the state update to be performed by the scheduler. > + */ > + __u32 sched_state; > + > /* > * Flexible array member at end of structure, after last feature field. > */ > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 9de6e35fe679..b2eb3bbaa9ef 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > u32 cpu_id = raw_smp_processor_id(); > u32 node_id = cpu_to_node(cpu_id); > u32 mm_cid = task_mm_cid(t); > + u32 sched_state = RSEQ_SCHED_STATE_ON_CPU; > > WARN_ON_ONCE((int) mm_cid < 0); > if (!user_write_access_begin(rseq, t->rseq_len)) > @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end); > unsafe_put_user(node_id, &rseq->node_id, efault_end); > unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end); > + unsafe_put_user(sched_state, &rseq->sched_state, efault_end); > /* > * Additional feature fields added after ORIG_RSEQ_SIZE > * need to be conditionally updated only if > @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > force_sigsegv(sig); > } > > +/* > + * Attempt to update rseq scheduler state. > + */ > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (unlikely(t->flags & PF_EXITING)) > + return; > + pagefault_disable(); > + (void) put_user(state, &t->rseq->sched_state); > + pagefault_enable(); > +} > + > #ifdef CONFIG_DEBUG_RSEQ > > /* > -- > 2.25.1 >
On 2023-05-18 17:49, Boqun Feng wrote: > On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote: [...] >> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h >> index c233aae5eac9..c6d8537e23ca 100644 >> --- a/include/uapi/linux/rseq.h >> +++ b/include/uapi/linux/rseq.h >> @@ -37,6 +37,13 @@ enum rseq_cs_flags { >> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), >> }; >> >> +enum rseq_sched_state { >> + /* >> + * Task is currently running on a CPU if bit is set. >> + */ >> + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), >> +}; [...] >> >> + /* >> + * Restartable sequences sched_state field. Updated by the kernel. Read >> + * by user-space with single-copy atomicity semantics. This fields can >> + * be read by any userspace thread. Aligned on 32-bit. Contains a > > Maybe this is a premature optimization, but since most of the time the > bit would be read by another thread, does it make sense putting the > "sched_state" into a different cache line to avoid false sharing? I'm puzzled by your optimization proposal, so I'll say it outright: I'm probably missing something. I agree that false-sharing would be an issue if various threads would contend for updating any field within this cache line. But the only thread responsible for updating this cache line's fields is the current thread, either from userspace (stores to rseq_abi->rseq_cs) or from the kernel (usually on return to userspace, except for this new ON_CPU bit clear on context switch). The other threads busy-waiting on the content of this sched_state field will only load from it, never store. And they will only busy-wait on it as long as the current task runs. When that task gets preempted, other threads will notice the flag change and use sys_futex instead. So the very worse I can think of in terms of pattern causing cache coherency traffic due to false-sharing is if the lock owner happens to have lots of rseq critical sections as well, causing it to repeatedly store to the rseq_abi->rseq_cs field, which is in the same cache line. But even then I'm wondering if this really matters, because each of those stores to rseq_cs would only slow down loads from other threads which will need to retry busy-wait anyway if the on-cpu flag is still set. So, what am I missing ? Is this heavy use of rseq critical sections while the lock is held the scenario you are concerned about ? Note that the heavy cache-line bouncing in my test-case happens on the lock structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi() pointer on success). There are probably better ways to implement that part, it is currently just a simple prototype showcasing the approach. Thanks, Mathieu
On Fri, May 19, 2023 at 10:15:11AM -0400, Mathieu Desnoyers wrote: > On 2023-05-18 17:49, Boqun Feng wrote: > > On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote: > [...] > > > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > > > index c233aae5eac9..c6d8537e23ca 100644 > > > --- a/include/uapi/linux/rseq.h > > > +++ b/include/uapi/linux/rseq.h > > > @@ -37,6 +37,13 @@ enum rseq_cs_flags { > > > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > > > }; > > > +enum rseq_sched_state { > > > + /* > > > + * Task is currently running on a CPU if bit is set. > > > + */ > > > + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), > > > +}; > [...] > > > + /* > > > + * Restartable sequences sched_state field. Updated by the kernel. Read > > > + * by user-space with single-copy atomicity semantics. This fields can > > > + * be read by any userspace thread. Aligned on 32-bit. Contains a > > > > Maybe this is a premature optimization, but since most of the time the > > bit would be read by another thread, does it make sense putting the > > "sched_state" into a different cache line to avoid false sharing? > > I'm puzzled by your optimization proposal, so I'll say it outright: I'm > probably missing something. > Maybe it's me who is missing something ;-) > I agree that false-sharing would be an issue if various threads would > contend for updating any field within this cache line. > > But the only thread responsible for updating this cache line's fields is the > current thread, either from userspace (stores to rseq_abi->rseq_cs) or from > the kernel (usually on return to userspace, except for this new ON_CPU bit > clear on context switch). > > The other threads busy-waiting on the content of this sched_state field will > only load from it, never store. And they will only busy-wait on it as long But their loads can change the cache line state from "Exclusive" to "Shared" (using MESI terms), right? And that could delay the stores of the current thread. > as the current task runs. When that task gets preempted, other threads will > notice the flag change and use sys_futex instead. > > So the very worse I can think of in terms of pattern causing cache coherency > traffic due to false-sharing is if the lock owner happens to have lots of > rseq critical sections as well, causing it to repeatedly store to the > rseq_abi->rseq_cs field, which is in the same cache line. > > But even then I'm wondering if this really matters, because each of those > stores to rseq_cs would only slow down loads from other threads which will > need to retry busy-wait anyway if the on-cpu flag is still set. > > So, what am I missing ? Is this heavy use of rseq critical sections while > the lock is held the scenario you are concerned about ? > The case in my mind is the opposite direction: the loads from other threads delay the stores to rseq_cs on the current thread, which I assume are usually a fast path. For example: CPU 1 CPU 2 lock(foo); // holding a lock rseq_start(): <CPU 1 own the cache line exclusively> lock(foo): <fail to get foo> <check whether the lock owner is on CPU> <cache line becames shared> ->rseq_cs = .. // Need to invalidate the cache line on other CPU But as you mentioned, there is only one updater here (the current thread), so maybe it doesn't matter... but since it's a userspace ABI, so I cannot help thinking "what if there is another bit that has a different usage pattern introduced in the future", so.. > Note that the heavy cache-line bouncing in my test-case happens on the lock > structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi() > pointer on success). There are probably better ways to implement that part, > it is currently just a simple prototype showcasing the approach. > Yeah.. that's a little strange, I guess you can just read the lock owner's rseq_abi, for example: rseq_lock_slowpath() { struct rseq_abi *other_rseq = lock->owner; if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) { ... } } ? Regards, Boqun > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha <libc-alpha@sourceware.org> wrote: > > Expose the "on-cpu" state for each thread through struct rseq to allow > adaptative mutexes to decide more accurately between busy-waiting and > calling sys_futex() to release the CPU, based on the on-cpu state of the > mutex owner. > > It is only provided as an optimization hint, because there is no > guarantee that the page containing this field is in the page cache, and > therefore the scheduler may very well fail to clear the on-cpu state on > preemption. This is expected to be rare though, and is resolved as soon > as the task returns to user-space. > > The goal is to improve use-cases where the duration of the critical > sections for a given lock follows a multi-modal distribution, preventing > statistical guesses from doing a good job at choosing between busy-wait > and futex wait behavior. > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > Cc: Carlos O'Donell <carlos@redhat.com> > Cc: Florian Weimer <fweimer@redhat.com> > Cc: libc-alpha@sourceware.org > --- > include/linux/sched.h | 12 ++++++++++++ > include/uapi/linux/rseq.h | 17 +++++++++++++++++ > kernel/rseq.c | 14 ++++++++++++++ > 3 files changed, 43 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index eed5d65b8d1f..c7e9248134c1 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > rseq_handle_notify_resume(ksig, regs); > } > > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > + > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (t->rseq) > + __rseq_set_sched_state(t, state); > +} > + > /* rseq_preempt() requires preemption to be disabled. */ > static inline void rseq_preempt(struct task_struct *t) > { > __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > rseq_set_notify_resume(t); > + rseq_set_sched_state(t, 0); Should rseq_migrate also be made to update the cpu_id of the new core? I imagine the usage of this will be something along the lines of: if(!on_cpu(mutex->owner_rseq_struct) && cpu(mutex->owner_rseq_struct) == this_threads_cpu) // goto futex So I would think updating on migrate would be useful as well. > } > > /* rseq_migrate() requires preemption to be disabled. */ > @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > struct pt_regs *regs) > { > } > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > +} > static inline void rseq_preempt(struct task_struct *t) > { > } > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > index c233aae5eac9..c6d8537e23ca 100644 > --- a/include/uapi/linux/rseq.h > +++ b/include/uapi/linux/rseq.h > @@ -37,6 +37,13 @@ enum rseq_cs_flags { > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > }; > > +enum rseq_sched_state { > + /* > + * Task is currently running on a CPU if bit is set. > + */ > + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), > +}; > + > /* > * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always > * contained within a single cache-line. It is usually declared as > @@ -148,6 +155,16 @@ struct rseq { > */ > __u32 mm_cid; > > + /* > + * Restartable sequences sched_state field. Updated by the kernel. Read > + * by user-space with single-copy atomicity semantics. This fields can > + * be read by any userspace thread. Aligned on 32-bit. Contains a > + * bitmask of enum rseq_sched_state. This field is provided as a hint > + * by the scheduler, and requires that the page holding struct rseq is > + * faulted-in for the state update to be performed by the scheduler. > + */ > + __u32 sched_state; > + > /* > * Flexible array member at end of structure, after last feature field. > */ > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 9de6e35fe679..b2eb3bbaa9ef 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > u32 cpu_id = raw_smp_processor_id(); > u32 node_id = cpu_to_node(cpu_id); > u32 mm_cid = task_mm_cid(t); > + u32 sched_state = RSEQ_SCHED_STATE_ON_CPU; > > WARN_ON_ONCE((int) mm_cid < 0); > if (!user_write_access_begin(rseq, t->rseq_len)) > @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end); > unsafe_put_user(node_id, &rseq->node_id, efault_end); > unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end); > + unsafe_put_user(sched_state, &rseq->sched_state, efault_end); > /* > * Additional feature fields added after ORIG_RSEQ_SIZE > * need to be conditionally updated only if > @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > force_sigsegv(sig); > } > > +/* > + * Attempt to update rseq scheduler state. > + */ > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (unlikely(t->flags & PF_EXITING)) > + return; > + pagefault_disable(); > + (void) put_user(state, &t->rseq->sched_state); > + pagefault_enable(); > +} > + > #ifdef CONFIG_DEBUG_RSEQ > > /* > -- > 2.25.1 >
On 2023-05-19 16:51, Noah Goldstein wrote: > On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha > <libc-alpha@sourceware.org> wrote: >> >> Expose the "on-cpu" state for each thread through struct rseq to allow >> adaptative mutexes to decide more accurately between busy-waiting and >> calling sys_futex() to release the CPU, based on the on-cpu state of the >> mutex owner. >> >> It is only provided as an optimization hint, because there is no >> guarantee that the page containing this field is in the page cache, and >> therefore the scheduler may very well fail to clear the on-cpu state on >> preemption. This is expected to be rare though, and is resolved as soon >> as the task returns to user-space. >> >> The goal is to improve use-cases where the duration of the critical >> sections for a given lock follows a multi-modal distribution, preventing >> statistical guesses from doing a good job at choosing between busy-wait >> and futex wait behavior. >> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >> Cc: Jonathan Corbet <corbet@lwn.net> >> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> >> Cc: Carlos O'Donell <carlos@redhat.com> >> Cc: Florian Weimer <fweimer@redhat.com> >> Cc: libc-alpha@sourceware.org >> --- >> include/linux/sched.h | 12 ++++++++++++ >> include/uapi/linux/rseq.h | 17 +++++++++++++++++ >> kernel/rseq.c | 14 ++++++++++++++ >> 3 files changed, 43 insertions(+) >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index eed5d65b8d1f..c7e9248134c1 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, >> rseq_handle_notify_resume(ksig, regs); >> } >> >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); >> + >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) >> +{ >> + if (t->rseq) >> + __rseq_set_sched_state(t, state); >> +} >> + >> /* rseq_preempt() requires preemption to be disabled. */ >> static inline void rseq_preempt(struct task_struct *t) >> { >> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); >> rseq_set_notify_resume(t); >> + rseq_set_sched_state(t, 0); > > Should rseq_migrate also be made to update the cpu_id of the new core? > I imagine the usage of this will be something along the lines of: > > if(!on_cpu(mutex->owner_rseq_struct) && > cpu(mutex->owner_rseq_struct) == this_threads_cpu) > // goto futex > > So I would think updating on migrate would be useful as well. I don't think we want to act differently based on the cpu on which the owner is queued. If the mutex owner is not on-cpu, and queued on the same cpu as the current thread, we indeed want to call sys_futex WAIT. If the mutex owner is not on-cpu, but queued on a different cpu than the current thread, we *still* want to call sys_futex WAIT, because busy-waiting for a thread which is queued but not currently running is wasteful. Or am I missing something ? Thanks, Mathieu
On 2023-05-19 13:18, Boqun Feng wrote: [...] > > The case in my mind is the opposite direction: the loads from other > threads delay the stores to rseq_cs on the current thread, which I > assume are usually a fast path. For example: Yes, OK, you are correct. And I just validated on my end that busy-waiting repeatedly loading from a cache line does slow down the concurrent stores to other variables on that cache line significantly (at least on my Intel(R) Core(TM) i7-8650U). Small reproducer provided at the end of this email. Results: compudj@thinkos:~/test$ time ./test-cacheline -d thread id : 140242706274048, pid 16940 thread id : 140242697881344, pid 16940 real 0m4.145s user 0m8.289s sys 0m0.000s compudj@thinkos:~/test$ time ./test-cacheline -s thread id : 139741482387200, pid 16950 thread id : 139741473994496, pid 16950 real 0m4.573s user 0m9.147s sys 0m0.000s > > CPU 1 CPU 2 > > lock(foo); // holding a lock > rseq_start(): > <CPU 1 own the cache line exclusively> > lock(foo): > <fail to get foo> > <check whether the lock owner is on CPU> > <cache line becames shared> > ->rseq_cs = .. // Need to invalidate the cache line on other CPU > > But as you mentioned, there is only one updater here (the current > thread), so maybe it doesn't matter... but since it's a userspace ABI, > so I cannot help thinking "what if there is another bit that has a > different usage pattern introduced in the future", so.. Yes, however we have to be careful about how we introduce this considering that the rseq feature extensions are "append only" to the structure feature size exported by the kernel to userspace through getauxval(3). So if we decide that we create a big hole right in the middle of the rseq_abi for cacheline alignment, that's a possibility, but we'd really be wasting an entire cacheline for a single bit. Another possibility would be to add a level of indirection: we could have a field in struct rseq which is either a pointer or offset from the thread_pointer() to the on-cpu bit, which would sit in a different cache line. It would be up to glibc to allocate space for it, possibly at the end of the rseq_abi field. > >> Note that the heavy cache-line bouncing in my test-case happens on the lock >> structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi() >> pointer on success). There are probably better ways to implement that part, >> it is currently just a simple prototype showcasing the approach. >> > > Yeah.. that's a little strange, I guess you can just read the lock > owner's rseq_abi, for example: > > rseq_lock_slowpath() { > struct rseq_abi *other_rseq = lock->owner; > > if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) { > ... > } > } Yes, I don't think the load of the owner pointer needs to be part of the cmpxchg per se. It could be done from a load on the slow-path. This way we would not require that the owner id and the lock state be the same content, and this would allow much more freedom for the fast-path semantic. Thanks, Mathieu > > ? > > Regards, > Boqun > >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> Reproducer: /* * cacheline testing (exclusive vs shared store speed) * * build with gcc -O2 -pthread -o test-cacheline test-cacheline.c * * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> * License: MIT */ #include <stdio.h> #include <pthread.h> #include <stdlib.h> #include <sys/types.h> #include <unistd.h> #include <rseq/rseq.h> #define NR_THREADS 2 struct test { int a; int b; } __attribute__((aligned(256))); enum testcase { TEST_SAME_CACHELINE, TEST_OTHER_CACHELINE, }; static enum testcase testcase; static int test_stop, test_go; static struct test test, test2; static void *testthread(void *arg) { long nr = (long)arg; printf("thread id : %lu, pid %lu\n", pthread_self(), getpid()); __atomic_add_fetch(&test_go, 1, __ATOMIC_RELAXED); while (RSEQ_READ_ONCE(test_go) < NR_THREADS) rseq_barrier(); if (nr == 0) { switch (testcase) { case TEST_SAME_CACHELINE: while (!RSEQ_READ_ONCE(test_stop)) (void) RSEQ_READ_ONCE(test.a); break; case TEST_OTHER_CACHELINE: while (!RSEQ_READ_ONCE(test_stop)) (void) RSEQ_READ_ONCE(test2.a); break; } } else if (nr == 1) { unsigned long long i; for (i = 0; i < 16000000000UL; i++) RSEQ_WRITE_ONCE(test.b, i); RSEQ_WRITE_ONCE(test_stop, 1); } return ((void*)0); } static void show_usage(char **argv) { fprintf(stderr, "Usage: %s <OPTIONS>\n", argv[0]); fprintf(stderr, "OPTIONS:\n"); fprintf(stderr, " [-s] Same cacheline\n"); fprintf(stderr, " [-d] Different cacheline\n"); } static int parse_args(int argc, char **argv) { if (argc != 2 || argv[1][0] != '-') { show_usage(argv); return -1; } switch (argv[1][1]) { case 's': testcase = TEST_SAME_CACHELINE; break; case 'd': testcase = TEST_OTHER_CACHELINE; break; default: show_usage(argv); return -1; } return 0; } int main(int argc, char **argv) { pthread_t testid[NR_THREADS]; void *tret; int i, err; if (parse_args(argc, argv)) exit(1); for (i = 0; i < NR_THREADS; i++) { err = pthread_create(&testid[i], NULL, testthread, (void *)(long)i); if (err != 0) exit(1); } for (i = 0; i < NR_THREADS; i++) { err = pthread_join(testid[i], &tret); if (err != 0) exit(1); } return 0; }
On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2023-05-19 16:51, Noah Goldstein wrote: > > On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha > > <libc-alpha@sourceware.org> wrote: > >> > >> Expose the "on-cpu" state for each thread through struct rseq to allow > >> adaptative mutexes to decide more accurately between busy-waiting and > >> calling sys_futex() to release the CPU, based on the on-cpu state of the > >> mutex owner. > >> > >> It is only provided as an optimization hint, because there is no > >> guarantee that the page containing this field is in the page cache, and > >> therefore the scheduler may very well fail to clear the on-cpu state on > >> preemption. This is expected to be rare though, and is resolved as soon > >> as the task returns to user-space. > >> > >> The goal is to improve use-cases where the duration of the critical > >> sections for a given lock follows a multi-modal distribution, preventing > >> statistical guesses from doing a good job at choosing between busy-wait > >> and futex wait behavior. > >> > >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > >> Cc: Jonathan Corbet <corbet@lwn.net> > >> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > >> Cc: Carlos O'Donell <carlos@redhat.com> > >> Cc: Florian Weimer <fweimer@redhat.com> > >> Cc: libc-alpha@sourceware.org > >> --- > >> include/linux/sched.h | 12 ++++++++++++ > >> include/uapi/linux/rseq.h | 17 +++++++++++++++++ > >> kernel/rseq.c | 14 ++++++++++++++ > >> 3 files changed, 43 insertions(+) > >> > >> diff --git a/include/linux/sched.h b/include/linux/sched.h > >> index eed5d65b8d1f..c7e9248134c1 100644 > >> --- a/include/linux/sched.h > >> +++ b/include/linux/sched.h > >> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > >> rseq_handle_notify_resume(ksig, regs); > >> } > >> > >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > >> + > >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > >> +{ > >> + if (t->rseq) > >> + __rseq_set_sched_state(t, state); > >> +} > >> + > >> /* rseq_preempt() requires preemption to be disabled. */ > >> static inline void rseq_preempt(struct task_struct *t) > >> { > >> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > >> rseq_set_notify_resume(t); > >> + rseq_set_sched_state(t, 0); > > > > Should rseq_migrate also be made to update the cpu_id of the new core? > > I imagine the usage of this will be something along the lines of: > > > > if(!on_cpu(mutex->owner_rseq_struct) && > > cpu(mutex->owner_rseq_struct) == this_threads_cpu) > > // goto futex > > > > So I would think updating on migrate would be useful as well. > > I don't think we want to act differently based on the cpu on which the > owner is queued. > > If the mutex owner is not on-cpu, and queued on the same cpu as the > current thread, we indeed want to call sys_futex WAIT. > > If the mutex owner is not on-cpu, but queued on a different cpu than the > current thread, we *still* want to call sys_futex WAIT, because > busy-waiting for a thread which is queued but not currently running is > wasteful. > I think this is less clear. In some cases sure but not always. Going to the futex has more latency that userland waits, and if the system is not busy (other than the one process) most likely less latency that yield. Also going to the futex requires a syscall on unlock. For example if the critical section is expected to be very small, it would be easy to imagine the lock be better implemented with: while(is_locked) if (owner->on_cpu || owner->cpu != my_cpu) exponential backoff else yield Its not that "just go to futex" doesn't ever make sense, but I don't think its fair to say that *always* the case. Looking at the kernel code, it doesn't seem to be a particularly high cost to keep the CPU field updated during migration so seems like a why not kind of question. > Or am I missing something ? > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On 2023-05-23 12:32, Noah Goldstein wrote: > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2023-05-19 16:51, Noah Goldstein wrote: >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha >>> <libc-alpha@sourceware.org> wrote: >>>> >>>> Expose the "on-cpu" state for each thread through struct rseq to allow >>>> adaptative mutexes to decide more accurately between busy-waiting and >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the >>>> mutex owner. >>>> >>>> It is only provided as an optimization hint, because there is no >>>> guarantee that the page containing this field is in the page cache, and >>>> therefore the scheduler may very well fail to clear the on-cpu state on >>>> preemption. This is expected to be rare though, and is resolved as soon >>>> as the task returns to user-space. >>>> >>>> The goal is to improve use-cases where the duration of the critical >>>> sections for a given lock follows a multi-modal distribution, preventing >>>> statistical guesses from doing a good job at choosing between busy-wait >>>> and futex wait behavior. >>>> >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> >>>> Cc: Carlos O'Donell <carlos@redhat.com> >>>> Cc: Florian Weimer <fweimer@redhat.com> >>>> Cc: libc-alpha@sourceware.org >>>> --- >>>> include/linux/sched.h | 12 ++++++++++++ >>>> include/uapi/linux/rseq.h | 17 +++++++++++++++++ >>>> kernel/rseq.c | 14 ++++++++++++++ >>>> 3 files changed, 43 insertions(+) >>>> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h >>>> index eed5d65b8d1f..c7e9248134c1 100644 >>>> --- a/include/linux/sched.h >>>> +++ b/include/linux/sched.h >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, >>>> rseq_handle_notify_resume(ksig, regs); >>>> } >>>> >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); >>>> + >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) >>>> +{ >>>> + if (t->rseq) >>>> + __rseq_set_sched_state(t, state); >>>> +} >>>> + >>>> /* rseq_preempt() requires preemption to be disabled. */ >>>> static inline void rseq_preempt(struct task_struct *t) >>>> { >>>> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); >>>> rseq_set_notify_resume(t); >>>> + rseq_set_sched_state(t, 0); >>> >>> Should rseq_migrate also be made to update the cpu_id of the new core? >>> I imagine the usage of this will be something along the lines of: >>> >>> if(!on_cpu(mutex->owner_rseq_struct) && >>> cpu(mutex->owner_rseq_struct) == this_threads_cpu) >>> // goto futex >>> >>> So I would think updating on migrate would be useful as well. >> >> I don't think we want to act differently based on the cpu on which the >> owner is queued. >> >> If the mutex owner is not on-cpu, and queued on the same cpu as the >> current thread, we indeed want to call sys_futex WAIT. >> >> If the mutex owner is not on-cpu, but queued on a different cpu than the >> current thread, we *still* want to call sys_futex WAIT, because >> busy-waiting for a thread which is queued but not currently running is >> wasteful. >> > I think this is less clear. In some cases sure but not always. Going > to the futex > has more latency that userland waits, and if the system is not busy (other than > the one process) most likely less latency that yield. Also going to the futex > requires a syscall on unlock. > > For example if the critical section is expected to be very small, it > would be easy > to imagine the lock be better implemented with: > while(is_locked) > if (owner->on_cpu || owner->cpu != my_cpu) > exponential backoff > else > yield > > Its not that "just go to futex" doesn't ever make sense, but I don't > think its fair > to say that *always* the case. > > Looking at the kernel code, it doesn't seem to be a particularly high cost to > keep the CPU field updated during migration so seems like a why not > kind of question. We already have the owner rseq_abi cpu_id field populated on every return-to-userspace. I wonder if it's really relevant that migration populates an updated value in this field immediately ? It's another case where this would be provided as a hint updated only if the struct rseq is in the page cache, because AFAIU the scheduler migration path cannot take a page fault. Also, if a thread bounces around many runqueues before being scheduled again, we would be adding those useless stores to the rseq_abi structure at each migration between runqueues. Given this would add some complexity to the scheduler migration code, I would want to see metrics/benchmarks showing that it indeed improves real-world use-cases before adding this to the rseq ABI. It's not only a question of added lines of code as of today, but also a question of added userspace ABI guarantees which can prevent future scheduler optimizations. I'm *very* careful about keeping those to a strict minimum, which I hope Peter Zijlstra appreciates. Thanks, Mathieu >> Or am I missing something ? >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
On Tue, May 23, 2023 at 12:30 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2023-05-23 12:32, Noah Goldstein wrote: > > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2023-05-19 16:51, Noah Goldstein wrote: > >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha > >>> <libc-alpha@sourceware.org> wrote: > >>>> > >>>> Expose the "on-cpu" state for each thread through struct rseq to allow > >>>> adaptative mutexes to decide more accurately between busy-waiting and > >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the > >>>> mutex owner. > >>>> > >>>> It is only provided as an optimization hint, because there is no > >>>> guarantee that the page containing this field is in the page cache, and > >>>> therefore the scheduler may very well fail to clear the on-cpu state on > >>>> preemption. This is expected to be rare though, and is resolved as soon > >>>> as the task returns to user-space. > >>>> > >>>> The goal is to improve use-cases where the duration of the critical > >>>> sections for a given lock follows a multi-modal distribution, preventing > >>>> statistical guesses from doing a good job at choosing between busy-wait > >>>> and futex wait behavior. > >>>> > >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > >>>> Cc: Jonathan Corbet <corbet@lwn.net> > >>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > >>>> Cc: Carlos O'Donell <carlos@redhat.com> > >>>> Cc: Florian Weimer <fweimer@redhat.com> > >>>> Cc: libc-alpha@sourceware.org > >>>> --- > >>>> include/linux/sched.h | 12 ++++++++++++ > >>>> include/uapi/linux/rseq.h | 17 +++++++++++++++++ > >>>> kernel/rseq.c | 14 ++++++++++++++ > >>>> 3 files changed, 43 insertions(+) > >>>> > >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h > >>>> index eed5d65b8d1f..c7e9248134c1 100644 > >>>> --- a/include/linux/sched.h > >>>> +++ b/include/linux/sched.h > >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > >>>> rseq_handle_notify_resume(ksig, regs); > >>>> } > >>>> > >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > >>>> + > >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > >>>> +{ > >>>> + if (t->rseq) > >>>> + __rseq_set_sched_state(t, state); > >>>> +} > >>>> + > >>>> /* rseq_preempt() requires preemption to be disabled. */ > >>>> static inline void rseq_preempt(struct task_struct *t) > >>>> { > >>>> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > >>>> rseq_set_notify_resume(t); > >>>> + rseq_set_sched_state(t, 0); > >>> > >>> Should rseq_migrate also be made to update the cpu_id of the new core? > >>> I imagine the usage of this will be something along the lines of: > >>> > >>> if(!on_cpu(mutex->owner_rseq_struct) && > >>> cpu(mutex->owner_rseq_struct) == this_threads_cpu) > >>> // goto futex > >>> > >>> So I would think updating on migrate would be useful as well. > >> > >> I don't think we want to act differently based on the cpu on which the > >> owner is queued. > >> > >> If the mutex owner is not on-cpu, and queued on the same cpu as the > >> current thread, we indeed want to call sys_futex WAIT. > >> > >> If the mutex owner is not on-cpu, but queued on a different cpu than the > >> current thread, we *still* want to call sys_futex WAIT, because > >> busy-waiting for a thread which is queued but not currently running is > >> wasteful. > >> > > I think this is less clear. In some cases sure but not always. Going > > to the futex > > has more latency that userland waits, and if the system is not busy (other than > > the one process) most likely less latency that yield. Also going to the futex > > requires a syscall on unlock. > > > > For example if the critical section is expected to be very small, it > > would be easy > > to imagine the lock be better implemented with: > > while(is_locked) > > if (owner->on_cpu || owner->cpu != my_cpu) > > exponential backoff > > else > > yield > > > > Its not that "just go to futex" doesn't ever make sense, but I don't > > think its fair > > to say that *always* the case. > > > > Looking at the kernel code, it doesn't seem to be a particularly high cost to > > keep the CPU field updated during migration so seems like a why not > > kind of question. > > We already have the owner rseq_abi cpu_id field populated on every > return-to-userspace. I wonder if it's really relevant that migration > populates an updated value in this field immediately ? It's another case > where this would be provided as a hint updated only if the struct rseq > is in the page cache, because AFAIU the scheduler migration path cannot > take a page fault. > Ah, thats a good point. And probably as probability the page is in the cache goes down a fair bit as the task is idle / bounced around for longer. > Also, if a thread bounces around many runqueues before being scheduled > again, we would be adding those useless stores to the rseq_abi structure > at each migration between runqueues. > > Given this would add some complexity to the scheduler migration code, I > would want to see metrics/benchmarks showing that it indeed improves > real-world use-cases before adding this to the rseq ABI. > > It's not only a question of added lines of code as of today, but also a > question of added userspace ABI guarantees which can prevent future > scheduler optimizations. I'm *very* careful about keeping those to a > strict minimum, which I hope Peter Zijlstra appreciates. Well, this entire thing is moreso a hint than a guarantee. Even on_cpu is only updated if the page happens to be in the pagecache so I don't see how you could ever be *having* to do anything. But fair enough, thought I'd throw the idea out there, but enough valid concerns seem to make it not such a good idea. > > Thanks, > > Mathieu > > > >> Or am I missing something ? > >> > >> Thanks, > >> > >> Mathieu > >> > >> -- > >> Mathieu Desnoyers > >> EfficiOS Inc. > >> https://www.efficios.com > >> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
diff --git a/include/linux/sched.h b/include/linux/sched.h index eed5d65b8d1f..c7e9248134c1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, rseq_handle_notify_resume(ksig, regs); } +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); + +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) +{ + if (t->rseq) + __rseq_set_sched_state(t, state); +} + /* rseq_preempt() requires preemption to be disabled. */ static inline void rseq_preempt(struct task_struct *t) { __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); rseq_set_notify_resume(t); + rseq_set_sched_state(t, 0); } /* rseq_migrate() requires preemption to be disabled. */ @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { } +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) +{ +} static inline void rseq_preempt(struct task_struct *t) { } diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h index c233aae5eac9..c6d8537e23ca 100644 --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -37,6 +37,13 @@ enum rseq_cs_flags { (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), }; +enum rseq_sched_state { + /* + * Task is currently running on a CPU if bit is set. + */ + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), +}; + /* * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always * contained within a single cache-line. It is usually declared as @@ -148,6 +155,16 @@ struct rseq { */ __u32 mm_cid; + /* + * Restartable sequences sched_state field. Updated by the kernel. Read + * by user-space with single-copy atomicity semantics. This fields can + * be read by any userspace thread. Aligned on 32-bit. Contains a + * bitmask of enum rseq_sched_state. This field is provided as a hint + * by the scheduler, and requires that the page holding struct rseq is + * faulted-in for the state update to be performed by the scheduler. + */ + __u32 sched_state; + /* * Flexible array member at end of structure, after last feature field. */ diff --git a/kernel/rseq.c b/kernel/rseq.c index 9de6e35fe679..b2eb3bbaa9ef 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) u32 cpu_id = raw_smp_processor_id(); u32 node_id = cpu_to_node(cpu_id); u32 mm_cid = task_mm_cid(t); + u32 sched_state = RSEQ_SCHED_STATE_ON_CPU; WARN_ON_ONCE((int) mm_cid < 0); if (!user_write_access_begin(rseq, t->rseq_len)) @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end); unsafe_put_user(node_id, &rseq->node_id, efault_end); unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end); + unsafe_put_user(sched_state, &rseq->sched_state, efault_end); /* * Additional feature fields added after ORIG_RSEQ_SIZE * need to be conditionally updated only if @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) force_sigsegv(sig); } +/* + * Attempt to update rseq scheduler state. + */ +void __rseq_set_sched_state(struct task_struct *t, unsigned int state) +{ + if (unlikely(t->flags & PF_EXITING)) + return; + pagefault_disable(); + (void) put_user(state, &t->rseq->sched_state); + pagefault_enable(); +} + #ifdef CONFIG_DEBUG_RSEQ /*