[RFC,1/4] rseq: Add sched_state field to struct rseq

Message ID 20230517152654.7193-2-mathieu.desnoyers@efficios.com
State New
Headers
Series Extend rseq with sched_state field |

Commit Message

Mathieu Desnoyers May 17, 2023, 3:26 p.m. UTC
  Expose the "on-cpu" state for each thread through struct rseq to allow
adaptative mutexes to decide more accurately between busy-waiting and
calling sys_futex() to release the CPU, based on the on-cpu state of the
mutex owner.

It is only provided as an optimization hint, because there is no
guarantee that the page containing this field is in the page cache, and
therefore the scheduler may very well fail to clear the on-cpu state on
preemption. This is expected to be rare though, and is resolved as soon
as the task returns to user-space.

The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org
---
 include/linux/sched.h     | 12 ++++++++++++
 include/uapi/linux/rseq.h | 17 +++++++++++++++++
 kernel/rseq.c             | 14 ++++++++++++++
 3 files changed, 43 insertions(+)
  

Comments

Davidlohr Bueso May 17, 2023, 4:03 p.m. UTC | #1
On Wed, 17 May 2023, Mathieu Desnoyers wrote:

>Expose the "on-cpu" state for each thread through struct rseq to allow
>adaptative mutexes to decide more accurately between busy-waiting and
>calling sys_futex() to release the CPU, based on the on-cpu state of the
>mutex owner.

Oh yeah moving the spin stuff out of the kernel is much nicer.

>It is only provided as an optimization hint, because there is no
>guarantee that the page containing this field is in the page cache, and
>therefore the scheduler may very well fail to clear the on-cpu state on
>preemption. This is expected to be rare though, and is resolved as soon
>as the task returns to user-space.
>
>The goal is to improve use-cases where the duration of the critical
>sections for a given lock follows a multi-modal distribution, preventing
>statistical guesses from doing a good job at choosing between busy-wait
>and futex wait behavior.
>
>Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>Cc: Jonathan Corbet <corbet@lwn.net>
>Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>Cc: Carlos O'Donell <carlos@redhat.com>
>Cc: Florian Weimer <fweimer@redhat.com>
>Cc: libc-alpha@sourceware.org
>---
> include/linux/sched.h     | 12 ++++++++++++
> include/uapi/linux/rseq.h | 17 +++++++++++++++++
> kernel/rseq.c             | 14 ++++++++++++++
> 3 files changed, 43 insertions(+)

Ie: previous efforts

  kernel/futex.c             |  675 ++++++++++++++++++++++++++++++++++++++------
  kernel/futex.c             |  572 ++++++++++++++++++++++++++++++++++++-------------

Thanks,
Davidlohr
  
Boqun Feng May 18, 2023, 9:49 p.m. UTC | #2
On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
> Expose the "on-cpu" state for each thread through struct rseq to allow
> adaptative mutexes to decide more accurately between busy-waiting and
> calling sys_futex() to release the CPU, based on the on-cpu state of the
> mutex owner.
> 
> It is only provided as an optimization hint, because there is no
> guarantee that the page containing this field is in the page cache, and
> therefore the scheduler may very well fail to clear the on-cpu state on
> preemption. This is expected to be rare though, and is resolved as soon
> as the task returns to user-space.
> 
> The goal is to improve use-cases where the duration of the critical
> sections for a given lock follows a multi-modal distribution, preventing
> statistical guesses from doing a good job at choosing between busy-wait
> and futex wait behavior.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Carlos O'Donell <carlos@redhat.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: libc-alpha@sourceware.org
> ---
>  include/linux/sched.h     | 12 ++++++++++++
>  include/uapi/linux/rseq.h | 17 +++++++++++++++++
>  kernel/rseq.c             | 14 ++++++++++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eed5d65b8d1f..c7e9248134c1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>  	rseq_handle_notify_resume(ksig, regs);
>  }
>  
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> +
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (t->rseq)
> +		__rseq_set_sched_state(t, state);
> +}
> +
>  /* rseq_preempt() requires preemption to be disabled. */
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>  	rseq_set_notify_resume(t);
> +	rseq_set_sched_state(t, 0);
>  }
>  
>  /* rseq_migrate() requires preemption to be disabled. */
> @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>  				       struct pt_regs *regs)
>  {
>  }
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +}
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  }
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index c233aae5eac9..c6d8537e23ca 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>  		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>  };
>  
> +enum rseq_sched_state {
> +	/*
> +	 * Task is currently running on a CPU if bit is set.
> +	 */
> +	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
> +};
> +
>  /*
>   * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>   * contained within a single cache-line. It is usually declared as
> @@ -148,6 +155,16 @@ struct rseq {
>  	 */
>  	__u32 mm_cid;
>  
> +	/*
> +	 * Restartable sequences sched_state field. Updated by the kernel. Read
> +	 * by user-space with single-copy atomicity semantics. This fields can
> +	 * be read by any userspace thread. Aligned on 32-bit. Contains a

Maybe this is a premature optimization, but since most of the time the
bit would be read by another thread, does it make sense putting the
"sched_state" into a different cache line to avoid false sharing?

Also We could have a "sched_state_local" and "sched_state_remote" for
different usages (local reads vs remote reads).

Regards,
Boqun

> +	 * bitmask of enum rseq_sched_state. This field is provided as a hint
> +	 * by the scheduler, and requires that the page holding struct rseq is
> +	 * faulted-in for the state update to be performed by the scheduler.
> +	 */
> +	__u32 sched_state;
> +
>  	/*
>  	 * Flexible array member at end of structure, after last feature field.
>  	 */
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe679..b2eb3bbaa9ef 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>  	u32 cpu_id = raw_smp_processor_id();
>  	u32 node_id = cpu_to_node(cpu_id);
>  	u32 mm_cid = task_mm_cid(t);
> +	u32 sched_state = RSEQ_SCHED_STATE_ON_CPU;
>  
>  	WARN_ON_ONCE((int) mm_cid < 0);
>  	if (!user_write_access_begin(rseq, t->rseq_len))
> @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>  	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
>  	unsafe_put_user(node_id, &rseq->node_id, efault_end);
>  	unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
> +	unsafe_put_user(sched_state, &rseq->sched_state, efault_end);
>  	/*
>  	 * Additional feature fields added after ORIG_RSEQ_SIZE
>  	 * need to be conditionally updated only if
> @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  	force_sigsegv(sig);
>  }
>  
> +/*
> + * Attempt to update rseq scheduler state.
> + */
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (unlikely(t->flags & PF_EXITING))
> +		return;
> +	pagefault_disable();
> +	(void) put_user(state, &t->rseq->sched_state);
> +	pagefault_enable();
> +}
> +
>  #ifdef CONFIG_DEBUG_RSEQ
>  
>  /*
> -- 
> 2.25.1
>
  
Mathieu Desnoyers May 19, 2023, 2:15 p.m. UTC | #3
On 2023-05-18 17:49, Boqun Feng wrote:
> On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
[...]
>> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
>> index c233aae5eac9..c6d8537e23ca 100644
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>>   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>>   };
>>   
>> +enum rseq_sched_state {
>> +	/*
>> +	 * Task is currently running on a CPU if bit is set.
>> +	 */
>> +	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
>> +};
[...]
>>   
>> +	/*
>> +	 * Restartable sequences sched_state field. Updated by the kernel. Read
>> +	 * by user-space with single-copy atomicity semantics. This fields can
>> +	 * be read by any userspace thread. Aligned on 32-bit. Contains a
> 
> Maybe this is a premature optimization, but since most of the time the
> bit would be read by another thread, does it make sense putting the
> "sched_state" into a different cache line to avoid false sharing?

I'm puzzled by your optimization proposal, so I'll say it outright: I'm 
probably missing something.

I agree that false-sharing would be an issue if various threads would 
contend for updating any field within this cache line.

But the only thread responsible for updating this cache line's fields is 
the current thread, either from userspace (stores to rseq_abi->rseq_cs) 
or from the kernel (usually on return to userspace, except for this new 
ON_CPU bit clear on context switch).

The other threads busy-waiting on the content of this sched_state field 
will only load from it, never store. And they will only busy-wait on it 
as long as the current task runs. When that task gets preempted, other 
threads will notice the flag change and use sys_futex instead.

So the very worse I can think of in terms of pattern causing cache 
coherency traffic due to false-sharing is if the lock owner happens to 
have lots of rseq critical sections as well, causing it to repeatedly 
store to the rseq_abi->rseq_cs field, which is in the same cache line.

But even then I'm wondering if this really matters, because each of 
those stores to rseq_cs would only slow down loads from other threads 
which will need to retry busy-wait anyway if the on-cpu flag is still set.

So, what am I missing ? Is this heavy use of rseq critical sections 
while the lock is held the scenario you are concerned about ?

Note that the heavy cache-line bouncing in my test-case happens on the 
lock structure (cmpxchg expecting NULL, setting the current thread 
rseq_get_abi() pointer on success). There are probably better ways to 
implement that part, it is currently just a simple prototype showcasing 
the approach.

Thanks,

Mathieu
  
Boqun Feng May 19, 2023, 5:18 p.m. UTC | #4
On Fri, May 19, 2023 at 10:15:11AM -0400, Mathieu Desnoyers wrote:
> On 2023-05-18 17:49, Boqun Feng wrote:
> > On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
> [...]
> > > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> > > index c233aae5eac9..c6d8537e23ca 100644
> > > --- a/include/uapi/linux/rseq.h
> > > +++ b/include/uapi/linux/rseq.h
> > > @@ -37,6 +37,13 @@ enum rseq_cs_flags {
> > >   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
> > >   };
> > > +enum rseq_sched_state {
> > > +	/*
> > > +	 * Task is currently running on a CPU if bit is set.
> > > +	 */
> > > +	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
> > > +};
> [...]
> > > +	/*
> > > +	 * Restartable sequences sched_state field. Updated by the kernel. Read
> > > +	 * by user-space with single-copy atomicity semantics. This fields can
> > > +	 * be read by any userspace thread. Aligned on 32-bit. Contains a
> > 
> > Maybe this is a premature optimization, but since most of the time the
> > bit would be read by another thread, does it make sense putting the
> > "sched_state" into a different cache line to avoid false sharing?
> 
> I'm puzzled by your optimization proposal, so I'll say it outright: I'm
> probably missing something.
> 

Maybe it's me who is missing something ;-)

> I agree that false-sharing would be an issue if various threads would
> contend for updating any field within this cache line.
> 
> But the only thread responsible for updating this cache line's fields is the
> current thread, either from userspace (stores to rseq_abi->rseq_cs) or from
> the kernel (usually on return to userspace, except for this new ON_CPU bit
> clear on context switch).
> 
> The other threads busy-waiting on the content of this sched_state field will
> only load from it, never store. And they will only busy-wait on it as long

But their loads can change the cache line state from "Exclusive" to
"Shared" (using MESI terms), right? And that could delay the stores of
the current thread.

> as the current task runs. When that task gets preempted, other threads will
> notice the flag change and use sys_futex instead.
> 
> So the very worse I can think of in terms of pattern causing cache coherency
> traffic due to false-sharing is if the lock owner happens to have lots of
> rseq critical sections as well, causing it to repeatedly store to the
> rseq_abi->rseq_cs field, which is in the same cache line.
> 
> But even then I'm wondering if this really matters, because each of those
> stores to rseq_cs would only slow down loads from other threads which will
> need to retry busy-wait anyway if the on-cpu flag is still set.
> 
> So, what am I missing ? Is this heavy use of rseq critical sections while
> the lock is held the scenario you are concerned about ?
> 

The case in my mind is the opposite direction: the loads from other
threads delay the stores to rseq_cs on the current thread, which I
assume are usually a fast path. For example:

	CPU 1				CPU 2

	lock(foo); // holding a lock
	rseq_start():
	  <CPU 1 own the cache line exclusively>
	  				lock(foo):
					  <fail to get foo>
					  <check whether the lock owner is on CPU>
					  <cache line becames shared>
	  ->rseq_cs = .. // Need to invalidate the cache line on other CPU

But as you mentioned, there is only one updater here (the current
thread), so maybe it doesn't matter... but since it's a userspace ABI,
so I cannot help thinking "what if there is another bit that has a
different usage pattern introduced in the future", so..

> Note that the heavy cache-line bouncing in my test-case happens on the lock
> structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi()
> pointer on success). There are probably better ways to implement that part,
> it is currently just a simple prototype showcasing the approach.
> 

Yeah.. that's a little strange, I guess you can just read the lock
owner's rseq_abi, for example:

	rseq_lock_slowpath() {
		struct rseq_abi *other_rseq = lock->owner;

		if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) {
			...
		}
	}

?

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
  
Noah Goldstein May 19, 2023, 8:51 p.m. UTC | #5
On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Expose the "on-cpu" state for each thread through struct rseq to allow
> adaptative mutexes to decide more accurately between busy-waiting and
> calling sys_futex() to release the CPU, based on the on-cpu state of the
> mutex owner.
>
> It is only provided as an optimization hint, because there is no
> guarantee that the page containing this field is in the page cache, and
> therefore the scheduler may very well fail to clear the on-cpu state on
> preemption. This is expected to be rare though, and is resolved as soon
> as the task returns to user-space.
>
> The goal is to improve use-cases where the duration of the critical
> sections for a given lock follows a multi-modal distribution, preventing
> statistical guesses from doing a good job at choosing between busy-wait
> and futex wait behavior.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Carlos O'Donell <carlos@redhat.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: libc-alpha@sourceware.org
> ---
>  include/linux/sched.h     | 12 ++++++++++++
>  include/uapi/linux/rseq.h | 17 +++++++++++++++++
>  kernel/rseq.c             | 14 ++++++++++++++
>  3 files changed, 43 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eed5d65b8d1f..c7e9248134c1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>         rseq_handle_notify_resume(ksig, regs);
>  }
>
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> +
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +       if (t->rseq)
> +               __rseq_set_sched_state(t, state);
> +}
> +
>  /* rseq_preempt() requires preemption to be disabled. */
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>         __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>         rseq_set_notify_resume(t);
> +       rseq_set_sched_state(t, 0);

Should rseq_migrate also be made to update the cpu_id of the new core?
I imagine the usage of this will be something along the lines of:

if(!on_cpu(mutex->owner_rseq_struct) &&
   cpu(mutex->owner_rseq_struct) == this_threads_cpu)
   // goto futex

So I would think updating on migrate would be useful as well.


>  }
>
>  /* rseq_migrate() requires preemption to be disabled. */
> @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>                                        struct pt_regs *regs)
>  {
>  }
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +}
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  }
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index c233aae5eac9..c6d8537e23ca 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>                 (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>  };
>
> +enum rseq_sched_state {
> +       /*
> +        * Task is currently running on a CPU if bit is set.
> +        */
> +       RSEQ_SCHED_STATE_ON_CPU         = (1U << 0),
> +};
> +
>  /*
>   * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>   * contained within a single cache-line. It is usually declared as
> @@ -148,6 +155,16 @@ struct rseq {
>          */
>         __u32 mm_cid;
>
> +       /*
> +        * Restartable sequences sched_state field. Updated by the kernel. Read
> +        * by user-space with single-copy atomicity semantics. This fields can
> +        * be read by any userspace thread. Aligned on 32-bit. Contains a
> +        * bitmask of enum rseq_sched_state. This field is provided as a hint
> +        * by the scheduler, and requires that the page holding struct rseq is
> +        * faulted-in for the state update to be performed by the scheduler.
> +        */
> +       __u32 sched_state;
> +
>         /*
>          * Flexible array member at end of structure, after last feature field.
>          */
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe679..b2eb3bbaa9ef 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>         u32 cpu_id = raw_smp_processor_id();
>         u32 node_id = cpu_to_node(cpu_id);
>         u32 mm_cid = task_mm_cid(t);
> +       u32 sched_state = RSEQ_SCHED_STATE_ON_CPU;
>
>         WARN_ON_ONCE((int) mm_cid < 0);
>         if (!user_write_access_begin(rseq, t->rseq_len))
> @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>         unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
>         unsafe_put_user(node_id, &rseq->node_id, efault_end);
>         unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
> +       unsafe_put_user(sched_state, &rseq->sched_state, efault_end);
>         /*
>          * Additional feature fields added after ORIG_RSEQ_SIZE
>          * need to be conditionally updated only if
> @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>         force_sigsegv(sig);
>  }
>
> +/*
> + * Attempt to update rseq scheduler state.
> + */
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +       if (unlikely(t->flags & PF_EXITING))
> +               return;
> +       pagefault_disable();
> +       (void) put_user(state, &t->rseq->sched_state);
> +       pagefault_enable();
> +}
> +
>  #ifdef CONFIG_DEBUG_RSEQ
>
>  /*
> --
> 2.25.1
>
  
Mathieu Desnoyers May 23, 2023, 12:49 p.m. UTC | #6
On 2023-05-19 16:51, Noah Goldstein wrote:
> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> Expose the "on-cpu" state for each thread through struct rseq to allow
>> adaptative mutexes to decide more accurately between busy-waiting and
>> calling sys_futex() to release the CPU, based on the on-cpu state of the
>> mutex owner.
>>
>> It is only provided as an optimization hint, because there is no
>> guarantee that the page containing this field is in the page cache, and
>> therefore the scheduler may very well fail to clear the on-cpu state on
>> preemption. This is expected to be rare though, and is resolved as soon
>> as the task returns to user-space.
>>
>> The goal is to improve use-cases where the duration of the critical
>> sections for a given lock follows a multi-modal distribution, preventing
>> statistical guesses from doing a good job at choosing between busy-wait
>> and futex wait behavior.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>> Cc: Carlos O'Donell <carlos@redhat.com>
>> Cc: Florian Weimer <fweimer@redhat.com>
>> Cc: libc-alpha@sourceware.org
>> ---
>>   include/linux/sched.h     | 12 ++++++++++++
>>   include/uapi/linux/rseq.h | 17 +++++++++++++++++
>>   kernel/rseq.c             | 14 ++++++++++++++
>>   3 files changed, 43 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index eed5d65b8d1f..c7e9248134c1 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>>          rseq_handle_notify_resume(ksig, regs);
>>   }
>>
>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
>> +
>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
>> +{
>> +       if (t->rseq)
>> +               __rseq_set_sched_state(t, state);
>> +}
>> +
>>   /* rseq_preempt() requires preemption to be disabled. */
>>   static inline void rseq_preempt(struct task_struct *t)
>>   {
>>          __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>>          rseq_set_notify_resume(t);
>> +       rseq_set_sched_state(t, 0);
> 
> Should rseq_migrate also be made to update the cpu_id of the new core?
> I imagine the usage of this will be something along the lines of:
> 
> if(!on_cpu(mutex->owner_rseq_struct) &&
>     cpu(mutex->owner_rseq_struct) == this_threads_cpu)
>     // goto futex
> 
> So I would think updating on migrate would be useful as well.

I don't think we want to act differently based on the cpu on which the 
owner is queued.

If the mutex owner is not on-cpu, and queued on the same cpu as the 
current thread, we indeed want to call sys_futex WAIT.

If the mutex owner is not on-cpu, but queued on a different cpu than the 
current thread, we *still* want to call sys_futex WAIT, because 
busy-waiting for a thread which is queued but not currently running is 
wasteful.

Or am I missing something ?

Thanks,

Mathieu
  
Mathieu Desnoyers May 23, 2023, 2:10 p.m. UTC | #7
On 2023-05-19 13:18, Boqun Feng wrote:
[...]
> 
> The case in my mind is the opposite direction: the loads from other
> threads delay the stores to rseq_cs on the current thread, which I
> assume are usually a fast path. For example:

Yes, OK, you are correct. And I just validated on my end that busy-waiting
repeatedly loading from a cache line does slow down the concurrent stores
to other variables on that cache line significantly (at least on my
Intel(R) Core(TM) i7-8650U). Small reproducer provided at the end of
this email. Results:

compudj@thinkos:~/test$ time ./test-cacheline -d
thread id : 140242706274048, pid 16940
thread id : 140242697881344, pid 16940

real	0m4.145s
user	0m8.289s
sys	0m0.000s

compudj@thinkos:~/test$ time ./test-cacheline -s
thread id : 139741482387200, pid 16950
thread id : 139741473994496, pid 16950

real	0m4.573s
user	0m9.147s
sys	0m0.000s


> 
> 	CPU 1				CPU 2
> 
> 	lock(foo); // holding a lock
> 	rseq_start():
> 	  <CPU 1 own the cache line exclusively>
> 	  				lock(foo):
> 					  <fail to get foo>
> 					  <check whether the lock owner is on CPU>
> 					  <cache line becames shared>
> 	  ->rseq_cs = .. // Need to invalidate the cache line on other CPU
> 
> But as you mentioned, there is only one updater here (the current
> thread), so maybe it doesn't matter... but since it's a userspace ABI,
> so I cannot help thinking "what if there is another bit that has a
> different usage pattern introduced in the future", so..

Yes, however we have to be careful about how we introduce this considering
that the rseq feature extensions are "append only" to the structure feature
size exported by the kernel to userspace through getauxval(3).

So if we decide that we create a big hole right in the middle of the rseq_abi
for cacheline alignment, that's a possibility, but we'd really be wasting an
entire cacheline for a single bit.

Another possibility would be to add a level of indirection: we could have a field
in struct rseq which is either a pointer or offset from the thread_pointer() to
the on-cpu bit, which would sit in a different cache line. It would be up to
glibc to allocate space for it, possibly at the end of the rseq_abi field.

> 
>> Note that the heavy cache-line bouncing in my test-case happens on the lock
>> structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi()
>> pointer on success). There are probably better ways to implement that part,
>> it is currently just a simple prototype showcasing the approach.
>>
> 
> Yeah.. that's a little strange, I guess you can just read the lock
> owner's rseq_abi, for example:
> 
> 	rseq_lock_slowpath() {
> 		struct rseq_abi *other_rseq = lock->owner;
> 
> 		if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) {
> 			...
> 		}
> 	}

Yes, I don't think the load of the owner pointer needs to be part of the
cmpxchg per se. It could be done from a load on the slow-path.

This way we would not require that the owner id and the lock state be the
same content, and this would allow much more freedom for the fast-path
semantic.

Thanks,

Mathieu

> 
> ?
> 
> Regards,
> Boqun
> 
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

Reproducer:

/*
  * cacheline testing (exclusive vs shared store speed)
  *
  * build with gcc -O2 -pthread -o test-cacheline test-cacheline.c
  *
  * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  * License: MIT
  */

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <rseq/rseq.h>

#define NR_THREADS 2

struct test {
	int a;
	int b;
} __attribute__((aligned(256)));

enum testcase {
	TEST_SAME_CACHELINE,
	TEST_OTHER_CACHELINE,
};

static enum testcase testcase;
static int test_stop, test_go;
static struct test test, test2;

static
void *testthread(void *arg)
{
	long nr = (long)arg;

         printf("thread id : %lu, pid %lu\n", pthread_self(), getpid());

	__atomic_add_fetch(&test_go, 1, __ATOMIC_RELAXED);
	while (RSEQ_READ_ONCE(test_go) < NR_THREADS)
		rseq_barrier();
	if (nr == 0) {
		switch (testcase) {
		case TEST_SAME_CACHELINE:
			while (!RSEQ_READ_ONCE(test_stop))
				(void) RSEQ_READ_ONCE(test.a);
			break;
		case TEST_OTHER_CACHELINE:
			while (!RSEQ_READ_ONCE(test_stop))
				(void) RSEQ_READ_ONCE(test2.a);
			break;
		}
	} else if (nr == 1) {
		unsigned long long i;

		for (i = 0; i < 16000000000UL; i++)
			RSEQ_WRITE_ONCE(test.b, i);
		RSEQ_WRITE_ONCE(test_stop, 1);
	}
         return ((void*)0);
}

static
void show_usage(char **argv)
{
	fprintf(stderr, "Usage: %s <OPTIONS>\n", argv[0]);
	fprintf(stderr, "OPTIONS:\n");
	fprintf(stderr, "	[-s] Same cacheline\n");
	fprintf(stderr, "	[-d] Different cacheline\n");
}

static
int parse_args(int argc, char **argv)
{
	if (argc != 2 || argv[1][0] != '-') {
		show_usage(argv);
		return -1;
	}
	switch (argv[1][1]) {
	case 's':
		testcase = TEST_SAME_CACHELINE;
		break;
	case 'd':
		testcase = TEST_OTHER_CACHELINE;
		break;
	default:
		show_usage(argv);
		return -1;
	}
	return 0;
}

int main(int argc, char **argv)
{
         pthread_t testid[NR_THREADS];
         void *tret;
         int i, err;

	if (parse_args(argc, argv))
		exit(1);

         for (i = 0; i < NR_THREADS; i++) {
                 err = pthread_create(&testid[i], NULL, testthread,
                         (void *)(long)i);
                 if (err != 0)
                         exit(1);
         }

         for (i = 0; i < NR_THREADS; i++) {
                 err = pthread_join(testid[i], &tret);
                 if (err != 0)
                         exit(1);
         }

         return 0;
}
  
Noah Goldstein May 23, 2023, 4:32 p.m. UTC | #8
On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2023-05-19 16:51, Noah Goldstein wrote:
> > On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> Expose the "on-cpu" state for each thread through struct rseq to allow
> >> adaptative mutexes to decide more accurately between busy-waiting and
> >> calling sys_futex() to release the CPU, based on the on-cpu state of the
> >> mutex owner.
> >>
> >> It is only provided as an optimization hint, because there is no
> >> guarantee that the page containing this field is in the page cache, and
> >> therefore the scheduler may very well fail to clear the on-cpu state on
> >> preemption. This is expected to be rare though, and is resolved as soon
> >> as the task returns to user-space.
> >>
> >> The goal is to improve use-cases where the duration of the critical
> >> sections for a given lock follows a multi-modal distribution, preventing
> >> statistical guesses from doing a good job at choosing between busy-wait
> >> and futex wait behavior.
> >>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Jonathan Corbet <corbet@lwn.net>
> >> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> >> Cc: Carlos O'Donell <carlos@redhat.com>
> >> Cc: Florian Weimer <fweimer@redhat.com>
> >> Cc: libc-alpha@sourceware.org
> >> ---
> >>   include/linux/sched.h     | 12 ++++++++++++
> >>   include/uapi/linux/rseq.h | 17 +++++++++++++++++
> >>   kernel/rseq.c             | 14 ++++++++++++++
> >>   3 files changed, 43 insertions(+)
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index eed5d65b8d1f..c7e9248134c1 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
> >>          rseq_handle_notify_resume(ksig, regs);
> >>   }
> >>
> >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> >> +
> >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> >> +{
> >> +       if (t->rseq)
> >> +               __rseq_set_sched_state(t, state);
> >> +}
> >> +
> >>   /* rseq_preempt() requires preemption to be disabled. */
> >>   static inline void rseq_preempt(struct task_struct *t)
> >>   {
> >>          __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> >>          rseq_set_notify_resume(t);
> >> +       rseq_set_sched_state(t, 0);
> >
> > Should rseq_migrate also be made to update the cpu_id of the new core?
> > I imagine the usage of this will be something along the lines of:
> >
> > if(!on_cpu(mutex->owner_rseq_struct) &&
> >     cpu(mutex->owner_rseq_struct) == this_threads_cpu)
> >     // goto futex
> >
> > So I would think updating on migrate would be useful as well.
>
> I don't think we want to act differently based on the cpu on which the
> owner is queued.
>
> If the mutex owner is not on-cpu, and queued on the same cpu as the
> current thread, we indeed want to call sys_futex WAIT.
>
> If the mutex owner is not on-cpu, but queued on a different cpu than the
> current thread, we *still* want to call sys_futex WAIT, because
> busy-waiting for a thread which is queued but not currently running is
> wasteful.
>
I think this is less clear. In some cases sure but not always. Going
to the futex
has more latency that userland waits, and if the system is not busy (other than
the one process) most likely less latency that yield. Also going to the futex
requires a syscall on unlock.

For example if the critical section is expected to be very small, it
would be easy
to imagine the lock be better implemented with:
while(is_locked)
  if (owner->on_cpu || owner->cpu != my_cpu)
    exponential backoff
  else
    yield

Its not that "just go to futex" doesn't ever make sense, but I don't
think its fair
to say that *always* the case.

Looking at the kernel code, it doesn't seem to be a particularly high cost to
keep the CPU field updated during migration so seems like a why not
kind of question.
> Or am I missing something ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
  
Mathieu Desnoyers May 23, 2023, 5:30 p.m. UTC | #9
On 2023-05-23 12:32, Noah Goldstein wrote:
> On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2023-05-19 16:51, Noah Goldstein wrote:
>>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptative mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
>>>> mutex owner.
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache, and
>>>> therefore the scheduler may very well fail to clear the on-cpu state on
>>>> preemption. This is expected to be rare though, and is resolved as soon
>>>> as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution, preventing
>>>> statistical guesses from doing a good job at choosing between busy-wait
>>>> and futex wait behavior.
>>>>
>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>>>> Cc: Carlos O'Donell <carlos@redhat.com>
>>>> Cc: Florian Weimer <fweimer@redhat.com>
>>>> Cc: libc-alpha@sourceware.org
>>>> ---
>>>>    include/linux/sched.h     | 12 ++++++++++++
>>>>    include/uapi/linux/rseq.h | 17 +++++++++++++++++
>>>>    kernel/rseq.c             | 14 ++++++++++++++
>>>>    3 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index eed5d65b8d1f..c7e9248134c1 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>>>>           rseq_handle_notify_resume(ksig, regs);
>>>>    }
>>>>
>>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
>>>> +
>>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
>>>> +{
>>>> +       if (t->rseq)
>>>> +               __rseq_set_sched_state(t, state);
>>>> +}
>>>> +
>>>>    /* rseq_preempt() requires preemption to be disabled. */
>>>>    static inline void rseq_preempt(struct task_struct *t)
>>>>    {
>>>>           __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>>>>           rseq_set_notify_resume(t);
>>>> +       rseq_set_sched_state(t, 0);
>>>
>>> Should rseq_migrate also be made to update the cpu_id of the new core?
>>> I imagine the usage of this will be something along the lines of:
>>>
>>> if(!on_cpu(mutex->owner_rseq_struct) &&
>>>      cpu(mutex->owner_rseq_struct) == this_threads_cpu)
>>>      // goto futex
>>>
>>> So I would think updating on migrate would be useful as well.
>>
>> I don't think we want to act differently based on the cpu on which the
>> owner is queued.
>>
>> If the mutex owner is not on-cpu, and queued on the same cpu as the
>> current thread, we indeed want to call sys_futex WAIT.
>>
>> If the mutex owner is not on-cpu, but queued on a different cpu than the
>> current thread, we *still* want to call sys_futex WAIT, because
>> busy-waiting for a thread which is queued but not currently running is
>> wasteful.
>>
> I think this is less clear. In some cases sure but not always. Going
> to the futex
> has more latency that userland waits, and if the system is not busy (other than
> the one process) most likely less latency that yield. Also going to the futex
> requires a syscall on unlock.
> 
> For example if the critical section is expected to be very small, it
> would be easy
> to imagine the lock be better implemented with:
> while(is_locked)
>    if (owner->on_cpu || owner->cpu != my_cpu)
>      exponential backoff
>    else
>      yield
> 
> Its not that "just go to futex" doesn't ever make sense, but I don't
> think its fair
> to say that *always* the case.
> 
> Looking at the kernel code, it doesn't seem to be a particularly high cost to
> keep the CPU field updated during migration so seems like a why not
> kind of question.

We already have the owner rseq_abi cpu_id field populated on every 
return-to-userspace. I wonder if it's really relevant that migration 
populates an updated value in this field immediately ? It's another case 
where this would be provided as a hint updated only if the struct rseq 
is in the page cache, because AFAIU the scheduler migration path cannot 
take a page fault.

Also, if a thread bounces around many runqueues before being scheduled 
again, we would be adding those useless stores to the rseq_abi structure 
at each migration between runqueues.

Given this would add some complexity to the scheduler migration code, I 
would want to see metrics/benchmarks showing that it indeed improves 
real-world use-cases before adding this to the rseq ABI.

It's not only a question of added lines of code as of today, but also a 
question of added userspace ABI guarantees which can prevent future 
scheduler optimizations. I'm *very* careful about keeping those to a 
strict minimum, which I hope Peter Zijlstra appreciates.

Thanks,

Mathieu


>> Or am I missing something ?
>>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
  
Noah Goldstein May 23, 2023, 8:10 p.m. UTC | #10
On Tue, May 23, 2023 at 12:30 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2023-05-23 12:32, Noah Goldstein wrote:
> > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> On 2023-05-19 16:51, Noah Goldstein wrote:
> >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> >>> <libc-alpha@sourceware.org> wrote:
> >>>>
> >>>> Expose the "on-cpu" state for each thread through struct rseq to allow
> >>>> adaptative mutexes to decide more accurately between busy-waiting and
> >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
> >>>> mutex owner.
> >>>>
> >>>> It is only provided as an optimization hint, because there is no
> >>>> guarantee that the page containing this field is in the page cache, and
> >>>> therefore the scheduler may very well fail to clear the on-cpu state on
> >>>> preemption. This is expected to be rare though, and is resolved as soon
> >>>> as the task returns to user-space.
> >>>>
> >>>> The goal is to improve use-cases where the duration of the critical
> >>>> sections for a given lock follows a multi-modal distribution, preventing
> >>>> statistical guesses from doing a good job at choosing between busy-wait
> >>>> and futex wait behavior.
> >>>>
> >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>>> Cc: Jonathan Corbet <corbet@lwn.net>
> >>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> >>>> Cc: Carlos O'Donell <carlos@redhat.com>
> >>>> Cc: Florian Weimer <fweimer@redhat.com>
> >>>> Cc: libc-alpha@sourceware.org
> >>>> ---
> >>>>    include/linux/sched.h     | 12 ++++++++++++
> >>>>    include/uapi/linux/rseq.h | 17 +++++++++++++++++
> >>>>    kernel/rseq.c             | 14 ++++++++++++++
> >>>>    3 files changed, 43 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>>> index eed5d65b8d1f..c7e9248134c1 100644
> >>>> --- a/include/linux/sched.h
> >>>> +++ b/include/linux/sched.h
> >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
> >>>>           rseq_handle_notify_resume(ksig, regs);
> >>>>    }
> >>>>
> >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> >>>> +
> >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> >>>> +{
> >>>> +       if (t->rseq)
> >>>> +               __rseq_set_sched_state(t, state);
> >>>> +}
> >>>> +
> >>>>    /* rseq_preempt() requires preemption to be disabled. */
> >>>>    static inline void rseq_preempt(struct task_struct *t)
> >>>>    {
> >>>>           __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> >>>>           rseq_set_notify_resume(t);
> >>>> +       rseq_set_sched_state(t, 0);
> >>>
> >>> Should rseq_migrate also be made to update the cpu_id of the new core?
> >>> I imagine the usage of this will be something along the lines of:
> >>>
> >>> if(!on_cpu(mutex->owner_rseq_struct) &&
> >>>      cpu(mutex->owner_rseq_struct) == this_threads_cpu)
> >>>      // goto futex
> >>>
> >>> So I would think updating on migrate would be useful as well.
> >>
> >> I don't think we want to act differently based on the cpu on which the
> >> owner is queued.
> >>
> >> If the mutex owner is not on-cpu, and queued on the same cpu as the
> >> current thread, we indeed want to call sys_futex WAIT.
> >>
> >> If the mutex owner is not on-cpu, but queued on a different cpu than the
> >> current thread, we *still* want to call sys_futex WAIT, because
> >> busy-waiting for a thread which is queued but not currently running is
> >> wasteful.
> >>
> > I think this is less clear. In some cases sure but not always. Going
> > to the futex
> > has more latency that userland waits, and if the system is not busy (other than
> > the one process) most likely less latency that yield. Also going to the futex
> > requires a syscall on unlock.
> >
> > For example if the critical section is expected to be very small, it
> > would be easy
> > to imagine the lock be better implemented with:
> > while(is_locked)
> >    if (owner->on_cpu || owner->cpu != my_cpu)
> >      exponential backoff
> >    else
> >      yield
> >
> > Its not that "just go to futex" doesn't ever make sense, but I don't
> > think its fair
> > to say that *always* the case.
> >
> > Looking at the kernel code, it doesn't seem to be a particularly high cost to
> > keep the CPU field updated during migration so seems like a why not
> > kind of question.
>
> We already have the owner rseq_abi cpu_id field populated on every
> return-to-userspace. I wonder if it's really relevant that migration
> populates an updated value in this field immediately ? It's another case
> where this would be provided as a hint updated only if the struct rseq
> is in the page cache, because AFAIU the scheduler migration path cannot
> take a page fault.
>

Ah, thats a good point. And probably as probability the page is in the cache
goes down a fair bit as the task is idle / bounced around for longer.

> Also, if a thread bounces around many runqueues before being scheduled
> again, we would be adding those useless stores to the rseq_abi structure
> at each migration between runqueues.
>
> Given this would add some complexity to the scheduler migration code, I
> would want to see metrics/benchmarks showing that it indeed improves
> real-world use-cases before adding this to the rseq ABI.
>
> It's not only a question of added lines of code as of today, but also a
> question of added userspace ABI guarantees which can prevent future
> scheduler optimizations. I'm *very* careful about keeping those to a
> strict minimum, which I hope Peter Zijlstra appreciates.

Well, this entire thing is moreso a hint than a guarantee. Even on_cpu
is only updated if the page happens to be in the pagecache so I don't
see how you could ever be *having* to do anything.

But fair enough, thought I'd throw the idea out there, but enough valid
concerns seem to make it not such a good idea.
>
> Thanks,
>
> Mathieu
>
>
> >> Or am I missing something ?
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> https://www.efficios.com
> >>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..c7e9248134c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2351,11 +2351,20 @@  static inline void rseq_signal_deliver(struct ksignal *ksig,
 	rseq_handle_notify_resume(ksig, regs);
 }
 
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
+
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (t->rseq)
+		__rseq_set_sched_state(t, state);
+}
+
 /* rseq_preempt() requires preemption to be disabled. */
 static inline void rseq_preempt(struct task_struct *t)
 {
 	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
 	rseq_set_notify_resume(t);
+	rseq_set_sched_state(t, 0);
 }
 
 /* rseq_migrate() requires preemption to be disabled. */
@@ -2405,6 +2414,9 @@  static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
 }
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+}
 static inline void rseq_preempt(struct task_struct *t)
 {
 }
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..c6d8537e23ca 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,13 @@  enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_sched_state {
+	/*
+	 * Task is currently running on a CPU if bit is set.
+	 */
+	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
+};
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -148,6 +155,16 @@  struct rseq {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * Restartable sequences sched_state field. Updated by the kernel. Read
+	 * by user-space with single-copy atomicity semantics. This fields can
+	 * be read by any userspace thread. Aligned on 32-bit. Contains a
+	 * bitmask of enum rseq_sched_state. This field is provided as a hint
+	 * by the scheduler, and requires that the page holding struct rseq is
+	 * faulted-in for the state update to be performed by the scheduler.
+	 */
+	__u32 sched_state;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..b2eb3bbaa9ef 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -91,6 +91,7 @@  static int rseq_update_cpu_node_id(struct task_struct *t)
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 mm_cid = task_mm_cid(t);
+	u32 sched_state = RSEQ_SCHED_STATE_ON_CPU;
 
 	WARN_ON_ONCE((int) mm_cid < 0);
 	if (!user_write_access_begin(rseq, t->rseq_len))
@@ -99,6 +100,7 @@  static int rseq_update_cpu_node_id(struct task_struct *t)
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
+	unsafe_put_user(sched_state, &rseq->sched_state, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -339,6 +341,18 @@  void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+/*
+ * Attempt to update rseq scheduler state.
+ */
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	pagefault_disable();
+	(void) put_user(state, &t->rseq->sched_state);
+	pagefault_enable();
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*