[0/6,v5,RESEND] seccomp: add the synchronous mode for seccomp_unotify

Message ID 20230308073201.3102738-1-avagin@google.com
Headers
Series seccomp: add the synchronous mode for seccomp_unotify |

Message

Andrei Vagin March 8, 2023, 7:31 a.m. UTC
  seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.

In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls back to the
target process. In this context, "synchronous" means that only one
process is running and another one is waiting.

The new WF_CURRENT_CPU flag advises the scheduler to move the wakee to
the current CPU. For such synchronous workflows, it makes context
switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

v2: clean up the first patch and add the test.
v3: update commit messages and a few fixes suggested by Kees Cook.
v4: update the third patch to avoid code duplications (suggested by
    Peter Zijlstra)
    Add the benchmark to the perf bench set.
v5: Update the author email. No code changes.

Kees is ready to take this patch set, but wants to get Acks from the
sched folks.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Will Drewry <wad@chromium.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>

Andrei Vagin (4):
  seccomp: don't use semaphore and wait_queue together
  sched: add a few helpers to wake up tasks on the current cpu
  seccomp: add the synchronous mode for seccomp_unotify
  selftest/seccomp: add a new test for the sync mode of
    seccomp_user_notify

Peter Oskolkov (1):
  sched: add WF_CURRENT_CPU and externise ttwu

 include/linux/completion.h                    |   1 +
 include/linux/swait.h                         |   2 +-
 include/linux/wait.h                          |   3 +
 include/uapi/linux/seccomp.h                  |   4 +
 kernel/sched/completion.c                     |  26 ++-
 kernel/sched/core.c                           |   5 +-
 kernel/sched/fair.c                           |   4 +
 kernel/sched/sched.h                          |  13 +-
 kernel/sched/swait.c                          |   8 +-
 kernel/sched/wait.c                           |   5 +
 kernel/seccomp.c                              |  72 +++++++-
 tools/arch/x86/include/uapi/asm/unistd_32.h   |   3 +
 tools/arch/x86/include/uapi/asm/unistd_64.h   |   3 +
 tools/perf/bench/Build                        |   1 +
 tools/perf/bench/bench.h                      |   1 +
 tools/perf/bench/sched-seccomp-notify.c       | 167 ++++++++++++++++++
 tools/perf/builtin-bench.c                    |   1 +
 tools/testing/selftests/seccomp/seccomp_bpf.c |  55 ++++++
 18 files changed, 346 insertions(+), 28 deletions(-)
 create mode 100644 tools/perf/bench/sched-seccomp-notify.c
  

Comments

Andrei Vagin March 21, 2023, 6:19 p.m. UTC | #1
On Tue, Mar 7, 2023 at 11:32 PM Andrei Vagin <avagin@google.com> wrote:
>
> seccomp_unotify allows more privileged processes do actions on behalf
> of less privileged processes.
>
> In many cases, the workflow is fully synchronous. It means a target
> process triggers a system call and passes controls to a supervisor
> process that handles the system call and returns controls back to the
> target process. In this context, "synchronous" means that only one
> process is running and another one is waiting.
>
> The new WF_CURRENT_CPU flag advises the scheduler to move the wakee to
> the current CPU. For such synchronous workflows, it makes context
> switches a few times faster.
>
> Right now, each interaction takes 12µs. With this patch, it takes about
> 3µs.
>
> v2: clean up the first patch and add the test.
> v3: update commit messages and a few fixes suggested by Kees Cook.
> v4: update the third patch to avoid code duplications (suggested by
>     Peter Zijlstra)
>     Add the benchmark to the perf bench set.
> v5: Update the author email. No code changes.
>
> Kees is ready to take this patch set, but wants to get Acks from the
> sched folks.

Peter, could you review the second and third patches of this series?

Thanks,
Andrei
  
Peter Zijlstra March 27, 2023, 10:27 a.m. UTC | #2
On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:

> Kees is ready to take this patch set, but wants to get Acks from the
> sched folks.
> 

> Andrei Vagin (4):
>   seccomp: don't use semaphore and wait_queue together
>   sched: add a few helpers to wake up tasks on the current cpu
>   seccomp: add the synchronous mode for seccomp_unotify
>   selftest/seccomp: add a new test for the sync mode of
>     seccomp_user_notify
> 
> Peter Oskolkov (1):
>   sched: add WF_CURRENT_CPU and externise ttwu

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  
Andrei Vagin April 3, 2023, 6:32 p.m. UTC | #3
On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
>
> > Kees is ready to take this patch set, but wants to get Acks from the
> > sched folks.
> >
>
> > Andrei Vagin (4):
> >   seccomp: don't use semaphore and wait_queue together
> >   sched: add a few helpers to wake up tasks on the current cpu
> >   seccomp: add the synchronous mode for seccomp_unotify
> >   selftest/seccomp: add a new test for the sync mode of
> >     seccomp_user_notify
> >
> > Peter Oskolkov (1):
> >   sched: add WF_CURRENT_CPU and externise ttwu
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Kees,

Could you look at this patch set? You wrote to one of the previous
versions that you are ready to take it if sched maintainers approve it.
Here is no major changes from that moment. The sched-related part has
been cleaned up according with Peter's comments, and I moved the perf
test to the perf tool.

Thanks,
Andrei
  
Kees Cook April 6, 2023, 3:19 a.m. UTC | #4
On April 3, 2023 11:32:00 AM PDT, Andrei Vagin <avagin@gmail.com> wrote:
>On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
>>
>> > Kees is ready to take this patch set, but wants to get Acks from the
>> > sched folks.
>> >
>>
>> > Andrei Vagin (4):
>> >   seccomp: don't use semaphore and wait_queue together
>> >   sched: add a few helpers to wake up tasks on the current cpu
>> >   seccomp: add the synchronous mode for seccomp_unotify
>> >   selftest/seccomp: add a new test for the sync mode of
>> >     seccomp_user_notify
>> >
>> > Peter Oskolkov (1):
>> >   sched: add WF_CURRENT_CPU and externise ttwu
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
>Kees,
>
>Could you look at this patch set? You wrote to one of the previous
>versions that you are ready to take it if sched maintainers approve it.
>Here is no major changes from that moment. The sched-related part has
>been cleaned up according with Peter's comments, and I moved the perf
>test to the perf tool.

Hi!

Yes, thanks for keeping this going! I'm catching up after some vacation, but this is on my TODO list. :)

-Kees
  
Bernd Schubert April 26, 2023, 2:43 p.m. UTC | #5
> Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
> up tasks on the current CPU.

> These two helpers are useful when the task needs to make a synchronous context
> switch to another task. In this context, synchronous means it wakes up the
> target task and falls asleep right after that.

> One example of such workloads is seccomp user notifies. This mechanism allows
> the  supervisor process handles system calls on behalf of a target process.
> While the supervisor is handling an intercepted system call, the target process
> will be blocked in the kernel, waiting for a response to come back.

> On-CPU context switches are much faster than regular ones.

> Signed-off-by: Andrei Vagin <avagin@google.com>

Avoiding cpu switches is very desirable for fuse, I'm working on fuse over uring
with per core queues. In my current branch and running a single threaded bonnie++
I get about 9000 creates/s when I bind the process to a core, about 7000 creates/s
when I set SCHED_IDLE for the ring threads and back to 9000 with SCHED_IDLE and
disabling cpu migration in fs/fuse/dev.c request_wait_answer() before going into
the waitq and enabling it back after waking up. 

I had reported this a few weeks back 
https://lore.kernel.org/lkml/d0ed1dbd-1b7e-bf98-65c0-7f61dd1a3228@ddn.com/
and had been pointed to your and Prateeks patch series. I'm now going
through these series. Interesting part is that a few weeks I didn't need
SCHED_IDLE, just disabling/enabling migration before/after waking up was
enough.

[...]

> EXPORT_SYMBOL(swake_up_one);
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 133b74730738..47803a0b8d5d 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
>  }
>  EXPORT_SYMBOL(__wake_up);
 
> +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
> +{
> +	__wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
> +}

I'm about to test this instead of migrate_disable/migrate_enable, but the symbol needs
to be exported - any objection to do that right from the beginning in your patch? 


Thanks,
Bernd
  
Andrei Vagin June 28, 2023, 6:44 p.m. UTC | #6
On Wed, Apr 5, 2023 at 8:19 PM Kees Cook <kees@kernel.org> wrote:
>
> On April 3, 2023 11:32:00 AM PDT, Andrei Vagin <avagin@gmail.com> wrote:
> >On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
> >>
> >> > Kees is ready to take this patch set, but wants to get Acks from the
> >> > sched folks.
> >> >
> >>
> >> > Andrei Vagin (4):
> >> >   seccomp: don't use semaphore and wait_queue together
> >> >   sched: add a few helpers to wake up tasks on the current cpu
> >> >   seccomp: add the synchronous mode for seccomp_unotify
> >> >   selftest/seccomp: add a new test for the sync mode of
> >> >     seccomp_user_notify
> >> >
> >> > Peter Oskolkov (1):
> >> >   sched: add WF_CURRENT_CPU and externise ttwu
> >>
> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >
> >Kees,
> >
> >Could you look at this patch set? You wrote to one of the previous
> >versions that you are ready to take it if sched maintainers approve it.
> >Here is no major changes from that moment. The sched-related part has
> >been cleaned up according with Peter's comments, and I moved the perf
> >test to the perf tool.
>
> Hi!
>
> Yes, thanks for keeping this going! I'm catching up after some vacation, but this is on my TODO list. :)

Hi Kees. Do you have any updates on this series?

>
> -Kees
>
>
> --
> Kees Cook
  
Kees Cook June 28, 2023, 11:32 p.m. UTC | #7
On Wed, Jun 28, 2023 at 11:44:02AM -0700, Andrei Vagin wrote:
> On Wed, Apr 5, 2023 at 8:19 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On April 3, 2023 11:32:00 AM PDT, Andrei Vagin <avagin@gmail.com> wrote:
> > >On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >>
> > >> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
> > >>
> > >> > Kees is ready to take this patch set, but wants to get Acks from the
> > >> > sched folks.
> > >> >
> > >>
> > >> > Andrei Vagin (4):
> > >> >   seccomp: don't use semaphore and wait_queue together
> > >> >   sched: add a few helpers to wake up tasks on the current cpu
> > >> >   seccomp: add the synchronous mode for seccomp_unotify
> > >> >   selftest/seccomp: add a new test for the sync mode of
> > >> >     seccomp_user_notify
> > >> >
> > >> > Peter Oskolkov (1):
> > >> >   sched: add WF_CURRENT_CPU and externise ttwu
> > >>
> > >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > >
> > >Kees,
> > >
> > >Could you look at this patch set? You wrote to one of the previous
> > >versions that you are ready to take it if sched maintainers approve it.
> > >Here is no major changes from that moment. The sched-related part has
> > >been cleaned up according with Peter's comments, and I moved the perf
> > >test to the perf tool.
> >
> > Hi!
> >
> > Yes, thanks for keeping this going! I'm catching up after some vacation, but this is on my TODO list. :)
> 
> Hi Kees. Do you have any updates on this series?

Apologies for the delay!

I've added this to the seccomp tree -- it should show up in -next soon.

-Kees