Message ID | 20230308100414.37114-1-jiahao.os@bytedance.com |
---|---|
State | New |
Headers |
From: Hao Jia <jiahao.os@bytedance.com>
To: mingo@redhat.com, peterz@infradead.org, mingo@kernel.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, mgorman@techsingularity.net
Cc: linux-kernel@vger.kernel.org, Hao Jia <jiahao.os@bytedance.com>
Subject: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable
Date: Wed, 8 Mar 2023 18:04:13 +0800
Message-Id: <20230308100414.37114-1-jiahao.os@bytedance.com>
|
Series |
sched/core: Minor optimize pick_next_task() when core-sched enable
|
Commit Message
Hao Jia
March 8, 2023, 10:04 a.m. UTC
If core-sched is enabled, sometimes we will traverse each CPU on the core
to find the highest priority task 'max' on the entire core, and then try
to find a runnable task that matches @max.
But in the following case, the runnable task we choose is not the best.
core max: task2 (cookie 0)

     rq0                       rq1
     task0(cookie non-zero)    task2(cookie 0)
     task1(cookie 0)           task3(cookie 0)
     ...

     pick-task: idle           pick-task: task2
CPU0 and CPU1 are two CPUs on the same core; task0 and task2 are the
highest priority tasks on rq0 and rq1 respectively, and task2 is @max
on the entire core.
Because 'max' has a zero cookie, instead of continuing to search rq0
for a runnable task whose cookie matches @max's, we choose idle for
rq0 directly.
It would obviously be better to choose task1 to run on rq0, which
would increase CPU utilization.
Therefore, we also queue tasks with zero cookies in the core_tree, and
record the number of non-zero-cookie tasks on each rq to detect the
state of sched-core.
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
---
kernel/sched/core.c | 29 +++++++++++++++--------------
kernel/sched/core_sched.c | 9 ++++-----
kernel/sched/sched.h | 1 +
3 files changed, 20 insertions(+), 19 deletions(-)
Comments
kindly ping...

On 2023/3/8 Hao Jia wrote:
> If core-sched is enabled, sometimes we will traverse each CPU on the core
> to find the highest priority task 'max' on the entire core, and then try
> and find a runnable task that matches @max.
> But in the following case, we choose the runnable task is not the best.
>
> core max: task2 (cookie 0)
>
>      rq0                       rq1
>      task0(cookie non-zero)    task2(cookie 0)
>      task1(cookie 0)           task3(cookie 0)
>      ...
>
>      pick-task: idle           pick-task: task2
>
> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> highest priority tasks on rq0 and rq1 respectively, task2 is @max
> on the entire core.
>
> In the case that 'max' has a zero cookie, instead of continuing to
> search for a runnable task on rq0 that matches @max's cookie, we
> choose idle for rq0 directly.
> At this time, it is obviously better to choose task1 to run for rq0,
> which will increase the CPU utilization.
> Therefore, we queue tasks with zero cookies in core_tree, and record
> the number of non-zero cookie tasks of each rq to detect the status
> of the sched-core.
>
> Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
> ---
>  kernel/sched/core.c       | 29 +++++++++++++++--------------
>  kernel/sched/core_sched.c |  9 ++++-----
>  kernel/sched/sched.h      |  1 +
>  3 files changed, 20 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af017e038b48..765cd14c52e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  {
>  	rq->core->core_task_seq++;
>
> -	if (!p->core_cookie)
> -		return;
> +	if (p->core_cookie)
> +		rq->cookied_count++;
>
>  	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
>  }
> @@ -246,11 +246,16 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	rq->core->core_task_seq++;
>
> +	if (p->core_cookie)
> +		rq->cookied_count--;
> +
>  	if (sched_core_enqueued(p)) {
>  		rb_erase(&p->core_node, &rq->core_tree);
>  		RB_CLEAR_NODE(&p->core_node);
>  	}
>
> +	if (!sched_core_enabled(rq))
> +		return;
>  	/*
>  	 * Migrating the last task off the cpu, with the cpu in forced idle
>  	 * state. Reschedule to create an accounting edge for forced idle,
> @@ -370,7 +375,7 @@ static void sched_core_assert_empty(void)
>  	int cpu;
>
>  	for_each_possible_cpu(cpu)
> -		WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
> +		WARN_ON_ONCE(cpu_rq(cpu)->cookied_count);
>  }
>
>  static void __sched_core_enable(void)
> @@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  	uclamp_rq_inc(rq, p);
>  	p->sched_class->enqueue_task(rq, p, flags);
>
> -	if (sched_core_enabled(rq))
> -		sched_core_enqueue(rq, p);
> +	sched_core_enqueue(rq, p);
>  }
>
>  static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> -	if (sched_core_enabled(rq))
> -		sched_core_dequeue(rq, p, flags);
> +	sched_core_dequeue(rq, p, flags);
>
>  	if (!(flags & DEQUEUE_NOCLOCK))
>  		update_rq_clock(rq);
> @@ -6126,13 +6129,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  		rq_i = cpu_rq(i);
>  		p = rq_i->core_pick;
>
> -		if (!cookie_equals(p, cookie)) {
> -			p = NULL;
> -			if (cookie)
> -				p = sched_core_find(rq_i, cookie);
> -			if (!p)
> -				p = idle_sched_class.pick_task(rq_i);
> -		}
> +		if (!cookie_equals(p, cookie))
> +			p = sched_core_find(rq_i, cookie);
>
>  		rq_i->core_pick = p;
>
> @@ -6333,6 +6331,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
>  	sched_core_lock(cpu, &flags);
>
>  	WARN_ON_ONCE(rq->core != rq);
> +	WARN_ON_ONCE(rq->cookied_count);
>
>  	/* if we're the first, we'll be our own leader */
>  	if (cpumask_weight(smt_mask) == 1)
> @@ -6425,6 +6424,7 @@ static inline void sched_core_cpu_dying(unsigned int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>
> +	WARN_ON_ONCE(rq->cookied_count);
>  	if (rq->core != rq)
>  		rq->core = rq;
>  }
> @@ -9917,6 +9917,7 @@ void __init sched_init(void)
>  		rq->core = rq;
>  		rq->core_pick = NULL;
>  		rq->core_enabled = 0;
> +		rq->cookied_count = 0;
>  		rq->core_tree = RB_ROOT;
>  		rq->core_forceidle_count = 0;
>  		rq->core_forceidle_occupation = 0;
> diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
> index a57fd8f27498..70f424abcc2b 100644
> --- a/kernel/sched/core_sched.c
> +++ b/kernel/sched/core_sched.c
> @@ -56,6 +56,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
>  	unsigned long old_cookie;
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	bool enqueued;
>
>  	rq = task_rq_lock(p, &rf);
>
> @@ -67,16 +68,14 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
>  	 */
>  	SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
>
> -	if (sched_core_enqueued(p))
> +	enqueued = task_on_rq_queued(p);
> +	if (enqueued)
>  		sched_core_dequeue(rq, p, DEQUEUE_SAVE);
>
>  	old_cookie = p->core_cookie;
>  	p->core_cookie = cookie;
>
> -	/*
> -	 * Consider the cases: !prev_cookie and !cookie.
> -	 */
> -	if (cookie && task_on_rq_queued(p))
> +	if (enqueued)
>  		sched_core_enqueue(rq, p);
>
>  	/*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e8df6d31c1e..f5a0ee7fccae 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1148,6 +1148,7 @@ struct rq {
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
> +	unsigned int		cookied_count;
>  	unsigned int		core_forceidle_count;
>  	unsigned int		core_forceidle_seq;
>  	unsigned int		core_forceidle_occupation;
On Wed, Mar 08, 2023 at 06:04:13PM +0800, Hao Jia wrote:
> core max: task2 (cookie 0)
>
>      rq0                       rq1
>      task0(cookie non-zero)    task2(cookie 0)
>      task1(cookie 0)           task3(cookie 0)
>      ...
>
>      pick-task: idle           pick-task: task2
>
> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> highest priority tasks on rq0 and rq1 respectively, task2 is @max
> on the entire core.

I'm assuming this all starts by rq0 doing a pick and getting task0.
Because any other selection would go down the whole !need_sync route.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af017e038b48..765cd14c52e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  {
>  	rq->core->core_task_seq++;
>
> -	if (!p->core_cookie)
> -		return;
> +	if (p->core_cookie)
> +		rq->cookied_count++;
>
>  	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
>  }
> @@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  	uclamp_rq_inc(rq, p);
>  	p->sched_class->enqueue_task(rq, p, flags);
>
> -	if (sched_core_enabled(rq))
> -		sched_core_enqueue(rq, p);
> +	sched_core_enqueue(rq, p);
>  }
>
>  static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> -	if (sched_core_enabled(rq))
> -		sched_core_dequeue(rq, p, flags);
> +	sched_core_dequeue(rq, p, flags);
>
>  	if (!(flags & DEQUEUE_NOCLOCK))
>  		update_rq_clock(rq);

Yeah, this is an absolute no-no, it makes the overhead of the second rb
tree unconditional.
On Mon, Mar 20, 2023 at 4:55 AM Hao Jia <jiahao.os@bytedance.com> wrote:
>
> kindly ping...
>
> On 2023/3/8 Hao Jia wrote:
> > If core-sched is enabled, sometimes we will traverse each CPU on the core
> > to find the highest priority task 'max' on the entire core, and then try
> > and find a runnable task that matches @max.
> > But in the following case, we choose the runnable task is not the best.
> >
> > core max: task2 (cookie 0)
> >
> >      rq0                       rq1
> >      task0(cookie non-zero)    task2(cookie 0)
> >      task1(cookie 0)           task3(cookie 0)
> >      ...
> >
> >      pick-task: idle           pick-task: task2
> >
> > CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> > highest priority tasks on rq0 and rq1 respectively, task2 is @max
> > on the entire core.
> >
> > In the case that 'max' has a zero cookie, instead of continuing to
> > search for a runnable task on rq0 that matches @max's cookie, we
> > choose idle for rq0 directly.
> > At this time, it is obviously better to choose task1 to run for rq0,
> > which will increase the CPU utilization.
> > Therefore, we queue tasks with zero cookies in core_tree, and record
> > the number of non-zero cookie tasks of each rq to detect the status
> > of the sched-core.

I do remember this as a known issue (more of a known but unimplemented
optimization) which happens when you have a high priority non-cookie
task which is in front of several low priority ones on the same
thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.

I can try to take a look at it this week to make sense of your patch.
The code in upstream has changed quite a bit since we did this work on
the older kernels, so allow some time to page it all in.

Meanwhile, could you please provide some more details of your
usecase/workload and how this patch improves it? Also out of
curiosity, is bytedance using core scheduling for security or
something else?
Thanks,

- Joel
On 2023/3/22 Joel Fernandes wrote:
> On Mon, Mar 20, 2023 at 4:55 AM Hao Jia <jiahao.os@bytedance.com> wrote:
>>
>> kindly ping...
>>
>> On 2023/3/8 Hao Jia wrote:
>>> If core-sched is enabled, sometimes we will traverse each CPU on the core
>>> to find the highest priority task 'max' on the entire core, and then try
>>> and find a runnable task that matches @max.
>>> [...]
>
> I do remember this as a known issue (more of a known but unimplemented
> optimization) which happens when you have a high priority non-cookie
> task which is in front of several low priority ones on the same
> thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.

Thank you for sharing the information, I will check the previous emails.

> I can try to take a look at it this week to make sense of your patch.
> The code in upstream has changed quite a bit since we did this work on
> the older kernels, so allow some time to page it all in.

My patch tries to make non-cookie tasks also queue into the core_tree
when CONFIG_SCHED_CORE is defined. This allows us to quickly find the
highest priority non-cookie task on this rq through sched_core_find().
But as Peter said, it makes the overhead of the second rb tree
unconditional.

Finding the highest priority non-cookie task is complex and
time-consuming, especially with CONFIG_FAIR_GROUP_SCHED. I am now
trying to let the force-idle CPU look for the highest priority
non-cookie task through a mechanism similar to queue_balance_callback.
I'm not sure if this is possible, do you have a better suggestion?

> Meanwhile, could you please provide some more details of your
> usecase/workload and how this patch improves it? Also out of
> curiosity, is bytedance using core scheduling for security or
> something else?

At ByteDance, we try to use core-sched to deploy high-priority services
(online) and low-priority services (offline) on the same physical
machine. core-sched can ensure that offline (cookie non-zero) tasks
will not affect online's L1 and L2 caches while online is running. In
our scenario, core-sched is very useful for online services.
Thanks,
Hao
Merging two threads.

On Tue, Mar 21, 2023 at 5:40 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> > highest priority tasks on rq0 and rq1 respectively, task2 is @max
> > on the entire core.
>
> I'm assuming this all starts by rq0 doing a pick and getting task0.
> Because any other selection would go down the whole !need_sync route.

I think this could also happen when rq1 starts the pick due to task2 wakeup
while task0 was running in rq0. In this case, core->core_cookie would be set
and we take the need_sync path I guess.

> > In the case that 'max' has a zero cookie, instead of continuing to
> > search for a runnable task on rq0 that matches @max's cookie, we
> > choose idle for rq0 directly.
> > At this time, it is obviously better to choose task1 to run for rq0,
> > which will increase the CPU utilization.
> > Therefore, we queue tasks with zero cookies in core_tree, and record
> > the number of non-zero cookie tasks of each rq to detect the status
> > of the sched-core.
>
> I do remember this as a known issue (more of a known but unimplemented
> optimization) which happens when you have a high priority non-cookie
> task which is in front of several low priority ones on the same
> thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.

Yes, I remember this as one of the 2 issues we noticed, but could not get to
fixing it. Here we have non-cookied tasks considered special as a side effect
of implementation (non-cookied tasks are not in the core rbtree), and hence
we force idle if max is non-cookied and the highest prio task on the sibling
is cookied.

The other issue was - we don't update the core rbtree when vruntime changes,
and this can cause starvation of cookied tasks if there is more than one task
with the same cookie on an rq.

> > static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > -	if (sched_core_enabled(rq))
> > -		sched_core_dequeue(rq, p, flags);
> > +	sched_core_dequeue(rq, p, flags);
> >
> > 	if (!(flags & DEQUEUE_NOCLOCK))
> > 		update_rq_clock(rq);
>
> Yeah, this is an absolute no-no, it makes the overhead of the second rb
> tree unconditional.

I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
coresched is enabled, just like what we do for cookied tasks? This is still
an overhead where we have two trees storing all the runnable tasks but in
different order. We would also need to populate the core rbtree from the cfs
rbtree on coresched enable and empty the tree on coresched disable.

Thanks,
Vineeth
On 2023/3/23 Vineeth Pillai wrote:
> Merging two threads.
>
> On Tue, Mar 21, 2023 at 5:40 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
>>> highest priority tasks on rq0 and rq1 respectively, task2 is @max
>>> on the entire core.
>
>> I'm assuming this all starts by rq0 doing a pick and getting task0.
>> Because any other selection would go down the whole !need_sync route.
>
> I think this could also happen when rq1 starts the pick due to task2 wakeup
> while task0 was running in rq0. In this case, core->core_cookie would be set
> and we take the need_sync path I guess.
>
>> I do remember this as a known issue (more of a known but unimplemented
>> optimization) which happens when you have a high priority non-cookie
>> task which is in front of several low priority ones on the same
>> thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.
>
> Yes, I remember this as one of the 2 issues we noticed, but could not get to
> fixing it. Here we have non-cookied tasks considered special as a side effect
> of implementation (non-cookied tasks not in core rbtree) and hence we force
> idle if max is non-cookied and the highest prio task on the sibling is
> cookied.
>
> The other issue was - we don't update core rbtree when vruntime changes and
> this can cause starvation of cookied task if there are more than one task
> with the same cookie on an rq.

If I understand correctly, when a cookied task is enqueued, the
difference delta1 between its vruntime and min_vruntime is very large.
Another task with the same cookie is very actively dequeuing and
enqueuing, and the difference delta2 between its vruntime and
min_vruntime is always smaller than delta1?
I'm not sure if this is the case?

>>> static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>>> {
>>> -	if (sched_core_enabled(rq))
>>> -		sched_core_dequeue(rq, p, flags);
>>> +	sched_core_dequeue(rq, p, flags);
>>>
>>> 	if (!(flags & DEQUEUE_NOCLOCK))
>>> 		update_rq_clock(rq);
>
>> Yeah, this is an absolute no-no, it makes the overhead of the second rb
>> tree unconditional.
>
> I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
> coresched is enabled, just like what we do for cookied tasks? This is still
> an overhead where we have two trees storing all the runnable tasks but in
> different order. We would also need to populate core rbtree from cfs rbtree
> on coresched enable and empty the tree on coresched disable.

I'm not sure if the other way is reasonable. I'm trying to provide a
function for each scheduling class to find the highest priority
non-cookie task.

For example, in fair_sched_class we can use rq->cfs_tasks to traverse
the search. But this search may take a long time, so maybe we need to
limit the number of searches.

Thanks,
Hao
On Thu, Mar 23, 2023 at 3:03 AM Hao Jia <jiahao.os@bytedance.com> wrote: > > The other issue was - we don't update core rbtree when vruntime changes and > > this can cause starvation of cookied task if there are more than one task with > > the same cookie on an rq. > > > > If I understand correctly, when a cookied task is enqueued, the > difference delta1 between its vruntime and min_vruntime is very large. > > Another task with the same cookie is very actively dequeuing and > enqueuing, and the difference delta2 between its vruntime and > min_vruntime is always smaller than delta1? > I'm not sure if this is the case? This case I was mentioning is about tasks that are continuously running and hence always in the runqueue. sched_core_enqueue/dequeue is not called and hence their position in the core rbtree is static while cfs rbtree positions change as vruntime progresses. BTW, this is a separate issue than the one you are targeting with this fix. I just thought of mentioning it here as well.. > >> Yeah, this is an absolute no-no, it makes the overhead of the second rb > >> tree unconditional. > > > > I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when > > coresched is enabled, just like what we do for cookied tasks? This is still an > > overhead where we have two trees storing all the runnable tasks but in > > different order. We would also need to populate core rbtree from cfs rbtree > > on coresched enable and empty the tree on coresched disable. > > > > I'm not sure if the other way is reasonable, I'm trying to provide a > function for each scheduling class to find a highest priority non-cookie > task. > > For example fair_sched_class, we can use rq->cfs_tasks to traverse the > search. But this search may take a long time, maybe we need to limit the > number of searches. Yes, it can be time consuming based on the number of cgroups and tasks that are runnable. You could probably take some performance numbers to see how worse it is. 
We could also have some optimization like marking a runqueue having
non-cookied tasks and then do the search only if it is marked. I haven't
thought much about it, but search could be optimized hopefully.

Thanks,
Vineeth
On 2023/3/24 Vineeth Pillai wrote:
> On Thu, Mar 23, 2023 at 3:03 AM Hao Jia <jiahao.os@bytedance.com> wrote:
>
>>> The other issue was - we don't update core rbtree when vruntime changes and
>>> this can cause starvation of cookied task if there are more than one task with
>>> the same cookie on an rq.
>>>
>>
>> If I understand correctly, when a cookied task is enqueued, the
>> difference delta1 between its vruntime and min_vruntime is very large.
>>
>> Another task with the same cookie is very actively dequeuing and
>> enqueuing, and the difference delta2 between its vruntime and
>> min_vruntime is always smaller than delta1?
>> I'm not sure if this is the case?
>
> This case I was mentioning is about tasks that are continuously running
> and hence always in the runqueue. sched_core_enqueue/dequeue is
> not called and hence their position in the core rbtree is static while cfs
> rbtree positions change as vruntime progresses.
>

Thanks for the detailed explanation.

> BTW, this is a separate issue than the one you are targeting with this
> fix. I just thought of mentioning it here as well..
>
>>>> Yeah, this is an absolute no-no, it makes the overhead of the second rb
>>>> tree unconditional.
>>>
>>> I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
>>> coresched is enabled, just like what we do for cookied tasks? This is still an
>>> overhead where we have two trees storing all the runnable tasks but in
>>> different order. We would also need to populate core rbtree from cfs rbtree
>>> on coresched enable and empty the tree on coresched disable.
>>>
>>
>> I'm not sure if the other way is reasonable, I'm trying to provide a
>> function for each scheduling class to find a highest priority non-cookie
>> task.
>>
>> For example fair_sched_class, we can use rq->cfs_tasks to traverse the
>> search. But this search may take a long time, maybe we need to limit the
>> number of searches.
>
> Yes, it can be time consuming based on the number of cgroups and tasks
> that are runnable. You could probably take some performance numbers to
> see how worse it is.

I agree, this can be very bad if there are a lot of tasks on rq.
But using cfs rbtree to find the highest priority non-cookie task will
become very complicated when CONFIG_FAIR_GROUP_SCHED is enabled.

Thanks,
Hao

>
> We could also have some optimization like marking a runqueue having
> non-cookied tasks and then do the search only if it is marked. I haven't
> thought much about it, but search could be optimized hopefully.
>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af017e038b48..765cd14c52e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
-		return;
+	if (p->core_cookie)
+		rq->cookied_count++;
 
 	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
 }
@@ -246,11 +246,16 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
 {
 	rq->core->core_task_seq++;
 
+	if (p->core_cookie)
+		rq->cookied_count--;
+
 	if (sched_core_enqueued(p)) {
 		rb_erase(&p->core_node, &rq->core_tree);
 		RB_CLEAR_NODE(&p->core_node);
 	}
 
+	if (!sched_core_enabled(rq))
+		return;
 	/*
 	 * Migrating the last task off the cpu, with the cpu in forced idle
 	 * state. Reschedule to create an accounting edge for forced idle,
@@ -370,7 +375,7 @@ static void sched_core_assert_empty(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
+		WARN_ON_ONCE(cpu_rq(cpu)->cookied_count);
 }
 
 static void __sched_core_enable(void)
@@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 
-	if (sched_core_enabled(rq))
-		sched_core_enqueue(rq, p);
+	sched_core_enqueue(rq, p);
 }
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (sched_core_enabled(rq))
-		sched_core_dequeue(rq, p, flags);
+	sched_core_dequeue(rq, p, flags);
 
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
@@ -6126,13 +6129,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		rq_i = cpu_rq(i);
 		p = rq_i->core_pick;
 
-		if (!cookie_equals(p, cookie)) {
-			p = NULL;
-			if (cookie)
-				p = sched_core_find(rq_i, cookie);
-			if (!p)
-				p = idle_sched_class.pick_task(rq_i);
-		}
+		if (!cookie_equals(p, cookie))
+			p = sched_core_find(rq_i, cookie);
 
 		rq_i->core_pick = p;
 
@@ -6333,6 +6331,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
 	sched_core_lock(cpu, &flags);
 
 	WARN_ON_ONCE(rq->core != rq);
+	WARN_ON_ONCE(rq->cookied_count);
 
 	/* if we're the first, we'll be our own leader */
 	if (cpumask_weight(smt_mask) == 1)
@@ -6425,6 +6424,7 @@ static inline void sched_core_cpu_dying(unsigned int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	WARN_ON_ONCE(rq->cookied_count);
 	if (rq->core != rq)
 		rq->core = rq;
 }
@@ -9917,6 +9917,7 @@ void __init sched_init(void)
 		rq->core = rq;
 		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->cookied_count = 0;
 		rq->core_tree = RB_ROOT;
 		rq->core_forceidle_count = 0;
 		rq->core_forceidle_occupation = 0;
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index a57fd8f27498..70f424abcc2b 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -56,6 +56,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
 	unsigned long old_cookie;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool enqueued;
 
 	rq = task_rq_lock(p, &rf);
 
@@ -67,16 +68,14 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
 	 */
 	SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
 
-	if (sched_core_enqueued(p))
+	enqueued = task_on_rq_queued(p);
+	if (enqueued)
 		sched_core_dequeue(rq, p, DEQUEUE_SAVE);
 
 	old_cookie = p->core_cookie;
 	p->core_cookie = cookie;
 
-	/*
-	 * Consider the cases: !prev_cookie and !cookie.
-	 */
-	if (cookie && task_on_rq_queued(p))
+	if (enqueued)
 		sched_core_enqueue(rq, p);
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..f5a0ee7fccae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1148,6 +1148,7 @@ struct rq {
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		cookied_count;
 	unsigned int		core_forceidle_count;
 	unsigned int		core_forceidle_seq;
 	unsigned int		core_forceidle_occupation;