Message ID | 20230110213010.2683185-3-avagin@google.com |
---|---|
State | New |
Series | seccomp: add the synchronous mode for seccomp_unotify |
Commit Message
Andrei Vagin
Jan. 10, 2023, 9:30 p.m. UTC
From: Peter Oskolkov <posk@google.com>

Add a WF_CURRENT_CPU wake flag that advises the scheduler to move the
wakee to the current CPU. This is useful for fast on-CPU context
switching use cases.

In addition, make ttwu external rather than static so that the flag can
be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 13 ++++++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)
Comments
On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> From: Peter Oskolkov <posk@google.com>
>
> Add WF_CURRENT_CPU wake flag that advices the scheduler to
> move the wakee to the current CPU. This is useful for fast on-CPU
> context switching use cases.
>
> In addition, make ttwu external rather than static so that
> the flag could be passed to it from outside of sched/core.c.
>
> Signed-off-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
>
> @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> 	if (wake_flags & WF_TTWU) {
> 		record_wakee(p);
>
> +		if ((wake_flags & WF_CURRENT_CPU) &&
> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
> +			return cpu;

I agree that cross-CPU wake-ups bring pain to fast context-switching use
cases, especially on high core count systems. We suffered from this issue
as well, and we presented it previously too. The difference is that we
used a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it: if the
waker and the wakee are both short-duration tasks, the waker wakes up the
wakee on the current CPU. That way, not only seccomp but also other
components/workloads can benefit without having to set the
WF_CURRENT_CPU flag.

Link [1]:
https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/

thanks,
Chenyu
On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > [...]
> I agree that cross-CPU wake up brings pain to fast context switching
> use cases, especially on high core count system. We suffered from this
> issue as well, so previously we presented this issue as well. The difference
> is that we used some dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it.
> That is, if the waker/wakee are both short duration tasks, let the waker wakes up
> the wakee on current CPU. So not only seccomp but also other components/workloads
> could benefit from this without having to set the WF_CURRENT_CPU flag.
>
> Link [1]:
> https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/

Thanks for the link. I like the idea, but this change has no impact on
the seccomp notify case. I used the benchmark from the fifth patch. It
is a ping-pong benchmark in which one process triggers system calls, and
another process handles them. It measures the number of system calls
that can be processed within a specified time slice.

$ cd tools/testing/selftests/seccomp/
$ make
$ ./seccomp_bpf 2>&1 | grep user_notification_sync
# RUN global.user_notification_sync ...
# seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
# seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
# OK global.user_notification_sync
ok 51 global.user_notification_sync

The results are the same with and without your change. I expected that
your change would improve the basic case so that it reaches the sync one.

I did some experiments and found that we can achieve the desirable
outcome if we move the "short-task" checks prior to considering waking
up on prev_cpu. For example, with this patch:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f89e44e237d..af20b58e3972 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
 static int
 wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
+	/* The only running task is a short duration one. */
+	if (cpu_rq(this_cpu)->nr_running == 1 &&
+	    is_short_task(cpu_curr(this_cpu)))
+		return this_cpu;
+
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
 	 * context. Only allow the move if cache is shared. Otherwise an
@@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (available_idle_cpu(prev_cpu))
 		return prev_cpu;

-	/* The only running task is a short duration one. */
-	if (cpu_rq(this_cpu)->nr_running == 1 &&
-	    is_short_task(cpu_curr(this_cpu)))
-		return this_cpu;
-
 	return nr_cpumask_bits;
 }

@@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;

+	if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
+	    is_short_task(cpu_curr(target)) && is_short_task(p))
+		return target;
+
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */

the basic test case shows almost the same results as the sync one:

$ ./seccomp_bpf 2>&1 | grep user_notification_sync
# RUN global.user_notification_sync ...
# seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
# seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
# OK global.user_notification_sync
ok 51 global.user_notification_sync

If you want to do any experiments, you can find my tree here:
[1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task

Thanks,
Andrei
On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@intel.com> wrote:
> > [...]
> Thanks for the link. I like the idea, but this change has no impact on the
> seccom notify case. I used the benchmark from the fifth patch. It is
> a ping-pong benchmark in which one process triggers system calls, and
> another process handles them. It measures the number of system calls
> that can be processed within a specified time slice.
>
Thanks for this information.

> $ cd tools/testing/selftests/seccomp/
> $ make
> $ ./seccomp_bpf 2>&1| grep user_notification_sync
> # RUN global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
> # OK global.user_notification_sync
> ok 51 global.user_notification_sync
>
> The results are the same with and without your change. I expected that
> your change improves the basic case so that it reaches the sync one.
>
The reason the patch did not bring a benefit might be that it aims to
wake the task on the current CPU only when there are no idle cores. The
seccomp notify test launches 2 processes, which makes it hard to
saturate the system, if I understand correctly? I built a kernel based
on your repo and launched the above test; it seems that the average
load is quite low on my system. Is this expected? Besides, is 100 ms
long enough to test (USER_NOTIF_BENCH_TIMEOUT)?

> I did some experiments and found that we can achieve the desirable
> outcome if we move the "short-task" checks prior to considering waking
> up on prev_cpu.
>
> For example, with this patch:
>
> @@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
>  static int
>  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  {
> +	/* The only running task is a short duration one. */
> +	if (cpu_rq(this_cpu)->nr_running == 1 &&
> +	    is_short_task(cpu_curr(this_cpu)))
> +		return this_cpu;
> +
> [...]

In v1 we put the above code before checking prev_cpu, but it was
reported to bring a regression on some systems[2]. The result suggested
that we should try to pick an idle CPU first, whether it is the current
CPU or the previous CPU; if that fails, we can lower the bar and pick
the current CPU.

[2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@amd.com/

> the basic test case shows almost the same results as the sync one:
>
> $ ./seccomp_bpf 2>&1| grep user_notification_sync
> # RUN global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
> # OK global.user_notification_sync
> ok 51 global.user_notification_sync
>
> If you want to do any experiments, you can find my tree here:
> [1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
>
I'm happy to test this further. One thing I'm curious about is where
the benefit of waking the seccomp task on the current CPU comes from.
Is it due to a hot L1 cache?

thanks,
Chenyu
On Sat, Jan 14, 2023 at 8:00 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> > [...]
> The reason why the patch did not bring benefit might be that, that patch
> aims to wake task on current CPU only there is no idle cores. The seccomp
> notify would launch 2 processes which makes it hard to saturate the system
> if I understand correctly?

Yes, you understand it right. Our workloads do not always scale on all
CPUs, and it is critical to work with maximum performance even when
there are some idle cores.

I feel I need to say a few words about our use case. I am working on
gVisor, a sandbox solution. In a few words, we intercept guest system
calls and handle them in our user-mode kernel. All this means that we
are limited by the user application running inside a sandbox and by how
well it scales on multiple CPUs. The current benchmark emulates a
uni-threaded process running in a sandbox.

> I built a kernel based on your repo, and launched
> above test, it seems that the average load is quit low on my system. Is this
> expected? Besides, is 100 ms long enough to test(USER_NOTIF_BENCH_TIMEOUT)?

Accuracy within the 20% range is good enough for this test. But if we
want a real benchmark, we need to implement it in a separate binary, or
we can add it to the perf benchmark suite.

> > I did some experiments and found that we can achieve the desirable
> > outcome if we move the "short-task" checks prior to considering waking
> > up on prev_cpu.
> > [...]
> In v1 we put above code before checking the prev_cpu, but is was reported
> to bring regression on some systems[2]. The result suggested that, we should
> try to pick idle CPU first, no matter it is current CPU, or previous CPU,
> if it failed, we can lower the bar and pick the current CPU.
>
> [2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@amd.com/

Maybe the criteria for a short task should be lower for idle CPUs. It
should be close to the cost of waking up an idle CPU.

> I'm happy to further test on this. One thing I'm curious about is, where does the
> benefit of waking up seccom task on current CPU come from? Is it due to hot L1 cache?

I don't think it is due to CPU caches. It should be due to the overhead
of queuing a task to a non-current runqueue, sending an IPI, and waking
from the idle loop.

Thanks,
Andrei
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25b582b6ee5f..6478e819eb99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4063,8 +4063,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c36aa54ae071..d6f76bead3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);

+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..34b4c54b2a2a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2088,12 +2088,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }

 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */

-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU  0x40 /* Prefer to move the wakee to the current CPU. */

 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3245,6 +3246,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);