[3/6] sched: add a few helpers to wake up tasks on the current cpu
Commit Message
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
up tasks on the current CPU.
These two helpers are useful when the task needs to make a synchronous context
switch to another task. In this context, synchronous means it wakes up the
target task and falls asleep right after that.
One example of such workloads is seccomp user notifies. This mechanism allows
the supervisor process handles system calls on behalf of a target process.
While the supervisor is handling an intercepted system call, the target process
will be blocked in the kernel, waiting for a response to come back.
On-CPU context switches are much faster than regular ones.
Signed-off-by: Andrei Vagin <avagin@google.com>
---
include/linux/completion.h | 1 +
include/linux/swait.h | 2 +-
include/linux/wait.h | 3 +++
kernel/sched/completion.c | 26 ++++++++++++++++++--------
kernel/sched/core.c | 2 +-
kernel/sched/swait.c | 8 ++++----
kernel/sched/wait.c | 5 +++++
7 files changed, 33 insertions(+), 14 deletions(-)
Comments
On Wed, Apr 26, 2023 at 7:43 AM Bernd Schubert <bschubert@ddn.com> wrote:
>
> > Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
> > up tasks on the current CPU.
>
> > These two helpers are useful when the task needs to make a synchronous context
> > switch to another task. In this context, synchronous means it wakes up the
> > target task and falls asleep right after that.
>
> > One example of such workloads is seccomp user notifies. This mechanism allows
> > the supervisor process handles system calls on behalf of a target process.
> > While the supervisor is handling an intercepted system call, the target process
> > will be blocked in the kernel, waiting for a response to come back.
>
> > On-CPU context switches are much faster than regular ones.
>
> > Signed-off-by: Andrei Vagin <avagin@google.com>
>
> Avoiding cpu switches is very desirable for fuse, I'm working on fuse over uring
> with per core queues. In my current branch and running a single threaded bonnie++
> I get about 9000 creates/s when I bind the process to a core, about 7000 creates/s
> when I set SCHED_IDLE for the ring threads and back to 9000 with SCHED_IDLE and
> disabling cpu migration in fs/fuse/dev.c request_wait_answer() before going into
> the waitq and enabling it back after waking up.
>
> I had reported this a few weeks back
> https://lore.kernel.org/lkml/d0ed1dbd-1b7e-bf98-65c0-7f61dd1a3228@ddn.com/
> and had been pointed to your and Prateeks patch series. I'm now going
> through these series. Interesting part is that a few weeks I didn't need
> SCHED_IDLE, just disabling/enabling migration before/after waking up was
> enough.
>
> [...]
>
> > EXPORT_SYMBOL(swake_up_one);
> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 133b74730738..47803a0b8d5d 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> > }
> > EXPORT_SYMBOL(__wake_up);
>
> > +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
> > +{
> > + __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
> > +}
>
> I'm about to test this instead of migrate_disable/migrate_enable, but the symbol needs
> to be exported - any objection to do that right from the beginning in your patch?
I think EXPORT_SYMBOL_GPL should not trigger any objections and it
covers your case, doesn't it?
Thanks,
Andrei
On 4/26/23 20:52, Andrei Vagin wrote:
> On Wed, Apr 26, 2023 at 7:43 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>>> Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
>>> up tasks on the current CPU.
>>
>>> These two helpers are useful when the task needs to make a synchronous context
>>> switch to another task. In this context, synchronous means it wakes up the
>>> target task and falls asleep right after that.
>>
>>> One example of such workloads is seccomp user notifies. This mechanism allows
>>> the supervisor process handles system calls on behalf of a target process.
>>> While the supervisor is handling an intercepted system call, the target process
>>> will be blocked in the kernel, waiting for a response to come back.
>>
>>> On-CPU context switches are much faster than regular ones.
>>
>>> Signed-off-by: Andrei Vagin <avagin@google.com>
>>
>> Avoiding cpu switches is very desirable for fuse, I'm working on fuse over uring
>> with per core queues. In my current branch and running a single threaded bonnie++
>> I get about 9000 creates/s when I bind the process to a core, about 7000 creates/s
>> when I set SCHED_IDLE for the ring threads and back to 9000 with SCHED_IDLE and
>> disabling cpu migration in fs/fuse/dev.c request_wait_answer() before going into
>> the waitq and enabling it back after waking up.
>>
>> I had reported this a few weeks back
>> https://lore.kernel.org/lkml/d0ed1dbd-1b7e-bf98-65c0-7f61dd1a3228@ddn.com/
>> and had been pointed to your and Prateeks patch series. I'm now going
>> through these series. Interesting part is that a few weeks I didn't need
>> SCHED_IDLE, just disabling/enabling migration before/after waking up was
>> enough.
>>
>> [...]
>>
>>> EXPORT_SYMBOL(swake_up_one);
>>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>>> index 133b74730738..47803a0b8d5d 100644
>>> --- a/kernel/sched/wait.c
>>> +++ b/kernel/sched/wait.c
>>> @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
>>> }
>>> EXPORT_SYMBOL(__wake_up);
>>
>>> +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
>>> +{
>>> + __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
>>> +}
>>
>> I'm about to test this instead of migrate_disable/migrate_enable, but the symbol needs
>> to be exported - any objection to do that right from the beginning in your patch?
>
> I think EXPORT_SYMBOL_GPL should not trigger any objections and it
> covers your case, doesn't it?
Ah yes, sure, _GPL is fine. I have applied 2/6 and 3/6 in my branch and then have
wait.h
#define wake_up_interruptible_sync(x) __wake_up_sync((x), TASK_INTERRUPTIBLE)
and and using that in fuse_request_end() - works fine and no migration on wake up.
Though, I still need SCHED_IDLE for the uring thread to avoid a later migration,
will open a separate thread for that.
Thanks,
Bernd
On 4/26/23 21:35, Bernd Schubert wrote:
> On 4/26/23 20:52, Andrei Vagin wrote:
>> On Wed, Apr 26, 2023 at 7:43 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>
>>>> Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to
>>>> wake
>>>> up tasks on the current CPU.
>>>
>>>> These two helpers are useful when the task needs to make a
>>>> synchronous context
>>>> switch to another task. In this context, synchronous means it wakes
>>>> up the
>>>> target task and falls asleep right after that.
>>>
>>>> One example of such workloads is seccomp user notifies. This
>>>> mechanism allows
>>>> the supervisor process handles system calls on behalf of a target
>>>> process.
>>>> While the supervisor is handling an intercepted system call, the
>>>> target process
>>>> will be blocked in the kernel, waiting for a response to come back.
>>>
>>>> On-CPU context switches are much faster than regular ones.
>>>
>>>> Signed-off-by: Andrei Vagin <avagin@google.com>
>>>
>>> Avoiding cpu switches is very desirable for fuse, I'm working on fuse
>>> over uring
>>> with per core queues. In my current branch and running a single
>>> threaded bonnie++
>>> I get about 9000 creates/s when I bind the process to a core, about
>>> 7000 creates/s
>>> when I set SCHED_IDLE for the ring threads and back to 9000 with
>>> SCHED_IDLE and
>>> disabling cpu migration in fs/fuse/dev.c request_wait_answer() before
>>> going into
>>> the waitq and enabling it back after waking up.
>>>
>>> I had reported this a few weeks back
>>> https://lore.kernel.org/lkml/d0ed1dbd-1b7e-bf98-65c0-7f61dd1a3228@ddn.com/
>>> and had been pointed to your and Prateeks patch series. I'm now going
>>> through these series. Interesting part is that a few weeks I didn't need
>>> SCHED_IDLE, just disabling/enabling migration before/after waking up was
>>> enough.
>>>
>>> [...]
>>>
>>>> EXPORT_SYMBOL(swake_up_one);
>>>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>>>> index 133b74730738..47803a0b8d5d 100644
>>>> --- a/kernel/sched/wait.c
>>>> +++ b/kernel/sched/wait.c
>>>> @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head,
>>>> unsigned int mode,
>>>> }
>>>> EXPORT_SYMBOL(__wake_up);
>>>
>>>> +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head,
>>>> unsigned int mode, void *key)
>>>> +{
>>>> + __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
>>>> +}
>>>
>>> I'm about to test this instead of migrate_disable/migrate_enable, but
>>> the symbol needs
>>> to be exported - any objection to do that right from the beginning in
>>> your patch?
>>
>> I think EXPORT_SYMBOL_GPL should not trigger any objections and it
>> covers your case, doesn't it?
>
> Ah yes, sure, _GPL is fine. I have applied 2/6 and 3/6 in my branch and
> then have
>
> wait.h
> #define wake_up_interruptible_sync(x) __wake_up_sync((x),
> TASK_INTERRUPTIBLE)
Sorry, no, that was the part that is actually not working, I'm actually
using __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL) in
fuse_request_end().
>
> and and using that in fuse_request_end() - works fine and no migration
> on wake up.
> Though, I still need SCHED_IDLE for the uring thread to avoid a later
> migration,
> will open a separate thread for that.
>
>
> Thanks,
> Bernd
>
>
@@ -116,6 +116,7 @@ extern bool try_wait_for_completion(struct completion *x);
extern bool completion_done(struct completion *x);
extern void complete(struct completion *);
+extern void complete_on_current_cpu(struct completion *x);
extern void complete_all(struct completion *);
#endif
@@ -146,7 +146,7 @@ static inline bool swq_has_sleeper(struct swait_queue_head *wq)
extern void swake_up_one(struct swait_queue_head *q);
extern void swake_up_all(struct swait_queue_head *q);
-extern void swake_up_locked(struct swait_queue_head *q);
+extern void swake_up_locked(struct swait_queue_head *q, int wake_flags);
extern void prepare_to_swait_exclusive(struct swait_queue_head *q, struct swait_queue *wait, int state);
extern long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state);
@@ -210,6 +210,7 @@ __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq
}
int __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
+void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key);
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
unsigned int mode, void *key, wait_queue_entry_t *bookmark);
@@ -237,6 +238,8 @@ void __wake_up_pollfree(struct wait_queue_head *wq_head);
#define key_to_poll(m) ((__force __poll_t)(uintptr_t)(void *)(m))
#define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
+#define wake_up_poll_on_current_cpu(x, m) \
+ __wake_up_on_current_cpu(x, TASK_NORMAL, poll_to_key(m))
#define wake_up_locked_poll(x, m) \
__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
#define wake_up_interruptible_poll(x, m) \
@@ -13,6 +13,23 @@
* Waiting for completion is a typically sync point, but not an exclusion point.
*/
+static void complete_with_flags(struct completion *x, int wake_flags)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
+
+ if (x->done != UINT_MAX)
+ x->done++;
+ swake_up_locked(&x->wait, wake_flags);
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+
+void complete_on_current_cpu(struct completion *x)
+{
+ return complete_with_flags(x, WF_CURRENT_CPU);
+}
+
/**
* complete: - signals a single thread waiting on this completion
* @x: holds the state of this particular completion
@@ -27,14 +44,7 @@
*/
void complete(struct completion *x)
{
- unsigned long flags;
-
- raw_spin_lock_irqsave(&x->wait.lock, flags);
-
- if (x->done != UINT_MAX)
- x->done++;
- swake_up_locked(&x->wait);
- raw_spin_unlock_irqrestore(&x->wait.lock, flags);
+ complete_with_flags(x, 0);
}
EXPORT_SYMBOL(complete);
@@ -6941,7 +6941,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
void *key)
{
- WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
return try_to_wake_up(curr->private, mode, wake_flags);
}
EXPORT_SYMBOL(default_wake_function);
@@ -18,7 +18,7 @@ EXPORT_SYMBOL(__init_swait_queue_head);
* If for some reason it would return 0, that means the previously waiting
* task is already running, so it will observe condition true (or has already).
*/
-void swake_up_locked(struct swait_queue_head *q)
+void swake_up_locked(struct swait_queue_head *q, int wake_flags)
{
struct swait_queue *curr;
@@ -26,7 +26,7 @@ void swake_up_locked(struct swait_queue_head *q)
return;
curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
- wake_up_process(curr->task);
+ try_to_wake_up(curr->task, TASK_NORMAL, wake_flags);
list_del_init(&curr->task_list);
}
EXPORT_SYMBOL(swake_up_locked);
@@ -41,7 +41,7 @@ EXPORT_SYMBOL(swake_up_locked);
void swake_up_all_locked(struct swait_queue_head *q)
{
while (!list_empty(&q->task_list))
- swake_up_locked(q);
+ swake_up_locked(q, 0);
}
void swake_up_one(struct swait_queue_head *q)
@@ -49,7 +49,7 @@ void swake_up_one(struct swait_queue_head *q)
unsigned long flags;
raw_spin_lock_irqsave(&q->lock, flags);
- swake_up_locked(q);
+ swake_up_locked(q, 0);
raw_spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(swake_up_one);
@@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
}
EXPORT_SYMBOL(__wake_up);
+void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
+{
+ __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
+}
+
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/