Message ID | 20230313065759.39698-1-yangyicong@huawei.com |
---|---|
State | New |
Headers |
From: Yicong Yang <yangyicong@huawei.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, linux-kernel@vger.kernel.org
Cc: dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, linuxarm@huawei.com, prime.zeng@huawei.com, wangjie125@huawei.com, yangyicong@hisilicon.com
Subject: [PATCH] sched/fair: Don't balance migration disabled tasks
Date: Mon, 13 Mar 2023 14:57:59 +0800
Message-ID: <20230313065759.39698-1-yangyicong@huawei.com>
Series | sched/fair: Don't balance migration disabled tasks |
Commit Message
Yicong Yang
March 13, 2023, 6:57 a.m. UTC
From: Yicong Yang <yangyicong@hisilicon.com>

On load balance we didn't check whether the candidate task is migration
disabled or not; this may hit the WARN_ON in set_task_cpu(), since
migration disabled tasks are expected to run on their current CPU.
We've run into this case several times on our server:

------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
 set_task_cpu+0x188/0x240
 load_balance+0x5d0/0xc60
 rebalance_domains+0x26c/0x380
 _nohz_idle_balance.isra.0+0x1e0/0x370
 run_rebalance_domains+0x6c/0x80
 __do_softirq+0x128/0x3d8
 ____do_softirq+0x18/0x24
 call_on_irq_stack+0x2c/0x38
 do_softirq_own_stack+0x24/0x3c
 __irq_exit_rcu+0xcc/0xf4
 irq_exit_rcu+0x18/0x24
 el1_interrupt+0x4c/0xe4
 el1h_64_irq_handler+0x18/0x2c
 el1h_64_irq+0x74/0x78
 arch_cpu_idle+0x18/0x4c
 default_idle_call+0x58/0x194
 do_idle+0x244/0x2b0
 cpu_startup_entry+0x30/0x3c
 secondary_start_kernel+0x14c/0x190
 __secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)
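The proposed check keys off is_migration_disabled(). For readers unfamiliar with that helper, it is roughly the following trivial accessor (a simplified sketch of the include/linux/sched.h definition, not a verbatim copy):

	static inline bool is_migration_disabled(struct task_struct *p)
	{
	#ifdef CONFIG_SMP
		return p->migration_disabled;	/* per-task flag set by migrate_disable() */
	#else
		return false;
	#endif
	}

Note that it only reads a per-task flag; it says nothing about whether p->cpus_ptr has already been narrowed to the current CPU, which is what the discussion below turns on.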
Comments
On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> On load balance we didn't check whether the candidate task is migration
> disabled or not; this may hit the WARN_ON in set_task_cpu(), since
> migration disabled tasks are expected to run on their current CPU.
> We've run into this case several times on our server:
>
> [...WARN_ON splat snipped, see the commit message above...]
>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a1b1f855b96..8fe767362d22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (kthread_is_per_cpu(p))
>  		return 0;
>
> +	/* Migration disabled tasks need to be kept on their running CPU. */
> +	if (is_migration_disabled(p))
> +		return 0;
> +
>  	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>  		int cpu;
>
> --
> 2.24.0
>
Looks reasonable to me. Would it be possible to also update the comment at the
beginning of can_migrate_task() that starts with "We do not migrate tasks that are:"?

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu
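If a v2 had taken up this suggestion, the change would amount to extending the summary comment at the top of can_migrate_task(). A possible wording (the first four items paraphrase the existing comment; the fifth is only the suggested addition, not something merged):

	/*
	 * We do not migrate tasks that are:
	 * 1) throttled_lb_pair, or
	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
	 * 3) running (obviously), or
	 * 4) are cache-hot on their current CPU, or
	 * 5) are migration disabled, so must stay on their current CPU.
	 */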
On 2023/3/14 11:08, Chen Yu wrote:
> On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
>> [...original patch and splat snipped...]
>>
> Looks reasonable to me. Would it be possible to also update the comment at the
> beginning of can_migrate_task() that starts with "We do not migrate tasks that are:"?
>
Thanks for the suggestion! It seems only the uncommented conditions are summarized
in that paragraph; the others are mentioned next to their branches, like the
kthread_is_per_cpu(p) case. I can add it in v2 if you think it'll be useful.

> Reviewed-by: Chen Yu <yu.c.chen@intel.com>

Thanks,
Yicong

>
> thanks,
> Chenyu
>
> .
>
On 13/03/23 14:57, Yicong Yang wrote:
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a1b1f855b96..8fe767362d22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (kthread_is_per_cpu(p))
>  		return 0;
>
> +	/* Migration disabled tasks need to be kept on their running CPU. */
> +	if (is_migration_disabled(p))
> +		return 0;
> +
>  	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>  		int cpu;

That cpumask check should cover migration_disabled tasks, unless they haven't gone
through migrate_disable_switch() yet (p->migration_disabled == 1, but the cpus_ptr
hasn't been touched yet).

Now, if that's the case, the task has to be src_rq's current (since it hasn't
switched out), which means can_migrate_task() should exit via:

	if (task_on_cpu(env->src_rq, p)) {
		schedstat_inc(p->stats.nr_failed_migrations_running);
		return 0;
	}

and thus not try to detach_task(). With that in mind, I don't get how your splat
can happen, nor how the change can help (a remote task p could execute
migrate_disable() concurrently with can_migrate_task(p)).

I'm a bit confused here, detach_tasks() happens entirely with the src_rq rq_lock
held, so there shouldn't be any surprises.

Can you share any extra context? E.g. the exact HEAD of your tree, maybe the
migrate_disable task in question if you have that info.

>
> --
> 2.24.0
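For reference, the migrate_disable_switch() step referred to above is what narrows p->cpus_ptr once a migration-disabled task is switched out. A simplified sketch of its logic as of the v6.1 era (paraphrased, not a verbatim copy of kernel/sched/core.c; later kernels reworked the affinity helpers):

	static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
	{
		if (likely(!p->migration_disabled))
			return;

		/* Already pinned: cpus_ptr no longer points at the full mask. */
		if (p->cpus_ptr != &p->cpus_mask)
			return;

		/* Restrict the effective affinity to the CPU the task runs on. */
		__do_set_cpus_allowed(p, cpumask_of(rq->cpu), SCA_MIGRATE_DISABLE);
	}

Until this runs, a task with p->migration_disabled == 1 still exposes its full cpus_mask through cpus_ptr, which is why the task_on_cpu() check is pointed at as the path that should catch it.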
On 2023-03-15 at 17:55:13 +0800, Yicong Yang wrote:
> On 2023/3/14 11:08, Chen Yu wrote:
>> On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
>>> [...original patch and splat snipped...]
>>>
>> Looks reasonable to me. Would it be possible to also update the comment at the
>> beginning of can_migrate_task() that starts with "We do not migrate tasks that are:"?
>>
> Thanks for the suggestion! It seems only the uncommented conditions are summarized
> in that paragraph; the others are mentioned next to their branches, like the
> kthread_is_per_cpu(p) case. I can add it in v2 if you think it'll be useful.
>
It seems that I overlooked migrate_disable(): it can only be set on the current
task, not on an arbitrary task. As Valentin described in his reply, I'm also
wondering what kind of race condition could trigger this. Are you referring to
something like this:

cpu1                                     cpu2
load_balance
  rq_lock(cpu2);
  detach_task(cpu2, p)
    can_migrate_task(p) returns true
                                         migrate_disable(current=p)
    set_task_cpu(p, cpu1);
      WARN(p can not migrate)

But can_migrate_task(p) should return 0, because p is always the current task on
cpu2 as long as the rq_lock is held by cpu1.

Thanks,
Chenyu

>> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
>
> Thanks,
> Yicong
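The point that migrate_disable() only acts on the calling task can be seen from its shape. A simplified sketch of the v6.1-era implementation (paraphrased from kernel/sched/core.c, not a verbatim copy):

	void migrate_disable(void)
	{
		struct task_struct *p = current;	/* only ever the running task */

		if (p->migration_disabled) {
			p->migration_disabled++;	/* nested disable, just count it */
			return;
		}

		preempt_disable();
		this_rq()->nr_pinned++;
		p->migration_disabled = 1;
		preempt_enable();
	}

Since only current can flip its own migration_disabled flag, any race with the load balancer has to involve the task still running on the remote CPU, which is exactly the case the task_on_cpu() check in can_migrate_task() is meant to reject.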
Hi Valentin,

On 2023/3/15 23:34, Valentin Schneider wrote:
> On 13/03/23 14:57, Yicong Yang wrote:
>> [...quoted patch snipped...]
>
> That cpumask check should cover migration_disabled tasks, unless they haven't gone
> through migrate_disable_switch() yet (p->migration_disabled == 1, but the cpus_ptr
> hasn't been touched yet).
>
> Now, if that's the case, the task has to be src_rq's current (since it hasn't
> switched out), which means can_migrate_task() should exit via:
>
> 	if (task_on_cpu(env->src_rq, p)) {
> 		schedstat_inc(p->stats.nr_failed_migrations_running);
> 		return 0;
> 	}
>
> and thus not try to detach_task(). With that in mind, I don't get how your splat
> can happen, nor how the change can help (a remote task p could execute
> migrate_disable() concurrently with can_migrate_task(p)).
>
I see: for migration disabled tasks, if !p->on_cpu the migration is avoided by the
cpus_ptr check, and if p->on_cpu the migration is avoided by the task_on_cpu()
check. So this patch won't help.

> I'm a bit confused here, detach_tasks() happens entirely with the src_rq rq_lock
> held, so there shouldn't be any surprises.
>
Since it's an arm64 machine, could the WARN_ON_ONCE() test be a false positive?
I mean, the update of p->migration_disabled might not be observed by the balancing
CPU, triggering this warning incorrectly.

> Can you share any extra context? E.g. the exact HEAD of your tree, maybe the
> migrate_disable task in question if you have that info.
>
We met this on our internal version based on 6.1-rc4; the scheduler is not patched.
We also saw this on previous versions like 6.0. This patch has been applied since
6.2 so we haven't seen it recently, but as you point out this patch doesn't solve
the problem.

The tasks in question are some systemd services (udevd, etc.) and I assume the
migration disable is caused by seccomp:

seccomp()
  ...
  bpf_prog_run_pin_on_cpu()
    migrate_disable()
    ...
    migrate_enable()

Thanks,
Yicong
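The bpf_prog_run_pin_on_cpu() call mentioned above is one of the common in-kernel users of migrate_disable(); roughly, it just brackets the BPF program run. A simplified sketch of the include/linux/filter.h helper (paraphrased, not a verbatim copy):

	static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
						  const void *ctx)
	{
		u32 ret;

		migrate_disable();		/* pin the task to this CPU... */
		ret = bpf_prog_run(prog, ctx);	/* ...while the filter runs */
		migrate_enable();

		return ret;
	}

So any task that goes through a seccomp filter spends short windows with migration disabled, which is why ordinary systemd services show up in the splat.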
Hi Chenyu,

On 2023/3/16 14:43, Chen Yu wrote:
> On 2023-03-15 at 17:55:13 +0800, Yicong Yang wrote:
>> [...earlier discussion snipped...]
>>
> It seems that I overlooked migrate_disable(): it can only be set on the current
> task, not on an arbitrary task. As Valentin described in his reply, I'm also
> wondering what kind of race condition could trigger this. Are you referring to
> something like this:
>
> cpu1                                     cpu2
> load_balance
>   rq_lock(cpu2);
>   detach_task(cpu2, p)
>     can_migrate_task(p) returns true
>                                          migrate_disable(current=p)
>     set_task_cpu(p, cpu1);
>       WARN(p can not migrate)
>
> But can_migrate_task(p) should return 0, because p is always the current task on
> cpu2 as long as the rq_lock is held by cpu1.
>
Yes, you're right, the current checks should avoid the issue. As I replied to
Valentin, there may be other reasons and this needs further investigation.

Thanks,
Yicong
On 16/03/23 17:13, Yicong Yang wrote:
> Hi Valentin,
>
> On 2023/3/15 23:34, Valentin Schneider wrote:
>> [...analysis of the cpus_ptr and task_on_cpu() checks snipped...]
>>
> I see: for migration disabled tasks, if !p->on_cpu the migration is avoided by the
> cpus_ptr check, and if p->on_cpu the migration is avoided by the task_on_cpu()
> check. So this patch won't help.
>

Right.

>> I'm a bit confused here, detach_tasks() happens entirely with the src_rq rq_lock
>> held, so there shouldn't be any surprises.
>>
> Since it's an arm64 machine, could the WARN_ON_ONCE() test be a false positive?
> I mean, the update of p->migration_disabled might not be observed by the balancing
> CPU, triggering this warning incorrectly.
>

Since the balance CPU holds the src_rq's rq_lock for the entire duration of
detach_tasks(), the actual value of p->migration_disabled shouldn't matter.

Either p has been scheduled out while being migration_disabled, in which case its
->cpus_ptr has been updated, or p is still running, in which case
can_migrate_task() should return false (p->on_cpu). But from your report, we're
somehow not seeing one of these.

>> Can you share any extra context? E.g. the exact HEAD of your tree, maybe the
>> migrate_disable task in question if you have that info.
>>
> We met this on our internal version based on 6.1-rc4; the scheduler is not patched.
> We also saw this on previous versions like 6.0. This patch has been applied since
> 6.2 so we haven't seen it recently, but as you point out this patch doesn't solve
> the problem.

Could you try to reproduce this on an unpatched upstream kernel?
Hi Valentin,

On 2023/3/16 20:12, Valentin Schneider wrote:
> On 16/03/23 17:13, Yicong Yang wrote:
>> [...earlier discussion snipped...]
>
> Could you try to reproduce this on an unpatched upstream kernel?
>
Sorry for the late reply. Yes, it can be reproduced on the upstream kernel (tested
below on 6.4-rc1). Since it only happens occasionally with the normal setup, I wrote
a test kthread that disables and enables migration periodically:

static int workload_func(void *data)
{
	cpumask_var_t cpumask;
	int i;

	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
		return -ENOMEM;

	for (i = 0; i < 8; i++)
		cpumask_set_cpu(i, cpumask);

	set_cpus_allowed_ptr(current, cpumask);
	free_cpumask_var(cpumask);

	while (!kthread_should_stop()) {
		migrate_disable();
		mdelay(1000);
		cond_resched();
		migrate_enable();
		mdelay(1000);
	}

	return -1;
}

Launching this and binding another workload to the CPU it's currently running on,
like `taskset -c $workload_cpu stress-ng -c 1`, will trigger the issue. In fact,
the problem is not in the migration disable mechanism, which works well, but in
the balancing policy applied after all the tasks on the source CPU are found to
be pinned.
With the debug print below added:

@@ -8527,6 +8527,20 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;

+	if (is_migration_disabled(p)) {
+		if (!p->on_cpu && cpumask_test_cpu(env->dst_cpu, p->cpus_ptr))
+			pr_err("dst_cpu %d on_cpu %d cpus_ptr %*pbl cpus_mask %*pbl",
+			       env->dst_cpu, p->on_cpu, cpumask_pr_args(p->cpus_ptr),
+			       cpumask_pr_args(&p->cpus_mask));
+	}
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;

I got the following output:

[  686.135619] dst_cpu 1 on_cpu 0 cpus_ptr 1 cpus_mask 0-7
[  686.148809] ------------[ cut here ]------------
[  686.169505] WARNING: CPU: 64 PID: 0 at kernel/sched/core.c:3210 set_task_cpu+0x190/0x250
[  686.186537] Modules linked in: kthread_workload(O) bluetooth rfkill xt_CHECKSUM iptable_mangle xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables iptable_filter ib_isert iscsi_target_mod ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi crct10dif_ce hns_roce_hw_v2 arm_spe_pmu sbsa_gwdt sm4_generic sm4 xts ecb hisi_hpre hisi_uncore_l3c_pmu hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_trng_v2 rng_core hisi_uncore_pmu spi_dw_mmio hisi_zip hisi_sec2 hisi_qm uacce hclge hns3 hnae3 hisi_sas_v3_hw hisi_sas_main libsas [last unloaded: kthread_workload(O)]
[  686.293937] CPU: 64 PID: 0 Comm: swapper/64 Tainted: G O 6.4.0-rc1-migration-race-debug+ #24
[  686.314616] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B211.01 11/10/2021
[  686.333285] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  686.347930] pc : set_task_cpu+0x190/0x250
[...]

It shows that we're going to balance the task to its current CPU (CPU 1) rather
than the balancer CPU (CPU 64). That's because, when the task on the src_cpu is
pinned, we go looking for a new dst_cpu, and the new_dst_cpu happens to be the
task's current CPU.

So the right way to solve this may be to avoid selecting the src_cpu as the
new_dst_cpu, and the patch below solves the issue for me:

@@ -8550,7 +8564,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)

 		/* Prevent to re-select dst_cpu via env's CPUs: */
 		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
-			if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
+			if (cpumask_test_cpu(cpu, p->cpus_ptr) && cpu != env->src_cpu) {
 				env->flags |= LBF_DST_PINNED;
 				env->new_dst_cpu = cpu;
 				break;

Thanks,
Yicong
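The path that consumes env->new_dst_cpu lives in load_balance(): when detach_tasks() fails with LBF_DST_PINNED set, the balancer retries against the alternative destination chosen in can_migrate_task(). A simplified sketch of that retry logic (paraphrased from kernel/sched/fair.c, some resets elided, not a verbatim excerpt), which shows why a new_dst_cpu equal to the src_cpu ends up calling set_task_cpu() on a migration-disabled task:

	if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {
		/* Don't re-select the dst_cpu we already tried. */
		__cpumask_clear_cpu(env.dst_cpu, env.cpus);

		/* Retry the whole detach against the alternative destination. */
		env.dst_rq   = cpu_rq(env.new_dst_cpu);
		env.dst_cpu  = env.new_dst_cpu;
		env.flags   &= ~LBF_DST_PINNED;
		env.loop     = 0;

		goto more_balance;
	}

If new_dst_cpu happens to be the task's own CPU, the retried detach_tasks() finds the migration-disabled task acceptable (dst_cpu is in its narrowed cpus_ptr and it is not on_cpu), and detach_task() -> set_task_cpu() then trips the WARN, matching the debug output above.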
On 16/05/23 19:10, Yicong Yang wrote:
> Hi Valentin,
>
> Sorry for the late reply. Yes, it can be reproduced on the upstream kernel (tested
> below on 6.4-rc1). Since it only happens occasionally with the normal setup, I wrote
> a test kthread that disables and enables migration periodically:
>
> [...reproducer, debug print and splat snipped...]
>
> It shows that we're going to balance the task to its current CPU (CPU 1) rather
> than the balancer CPU (CPU 64). That's because, when the task on the src_cpu is
> pinned, we go looking for a new dst_cpu, and the new_dst_cpu happens to be the
> task's current CPU.
>

Nicely found! Thanks for having spent time on this. I haven't been able to
retrigger the issue using your reproducer, but I think you have indeed found the
root cause of it.

> So the right way to solve this may be to avoid selecting the src_cpu as the
> new_dst_cpu, and the patch below solves the issue for me:
>
> @@ -8550,7 +8564,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>
>  		/* Prevent to re-select dst_cpu via env's CPUs: */
>  		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
> -			if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +			if (cpumask_test_cpu(cpu, p->cpus_ptr) && cpu != env->src_cpu) {
>  				env->flags |= LBF_DST_PINNED;
>  				env->new_dst_cpu = cpu;
>  				break;
>

Other than having some better in-place cpumask helpers, I don't think we can make
this look better. Could you send this change as a proper patch, please?

> Thanks,
> Yicong
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f855b96..8fe767362d22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;

+	/* Migration disabled tasks need to be kept on their running CPU. */
+	if (is_migration_disabled(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
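For completeness, the assertion that fires in the splats above is a debug check near the top of set_task_cpu(); a trimmed sketch (paraphrased from kernel/sched/core.c, the rest of the function elided):

	void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
	{
	#ifdef CONFIG_SCHED_DEBUG
		/*
		 * Migration-disabled tasks must never be handed a new CPU;
		 * this is the WARN_ON the load balancer tripped over.
		 */
		WARN_ON_ONCE(is_migration_disabled(p));
	#endif
		/* ...actual CPU switch elided... */
	}

The debug checks run before set_task_cpu() compares new_cpu against the task's current CPU, so the warning can fire even when the "new" destination is the CPU the task is already on, which is why selecting the src_cpu as new_dst_cpu is enough to trigger it.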