Message ID | 20230206130449.41360-1-wander@redhat.com |
---|---|
State | New |
Series | [v4] kernel/fork: beware of __put_task_struct calling context |
Commit Message
Wander Lairson Costa
Feb. 6, 2023, 1:04 p.m. UTC
Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
locks. Therefore, it can't be called from a non-preemptible context.
One practical example is a splat inside inactive_task_timer(), which is
called in interrupt context:
CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
Call Trace:
dump_stack_lvl+0x57/0x7d
mark_lock_irq.cold+0x33/0xba
? stack_trace_save+0x4b/0x70
? save_trace+0x55/0x150
mark_lock+0x1e7/0x400
mark_usage+0x11d/0x140
__lock_acquire+0x30d/0x930
lock_acquire.part.0+0x9c/0x210
? refill_obj_stock+0x3d/0x3a0
? rcu_read_lock_sched_held+0x3f/0x70
? trace_lock_acquire+0x38/0x140
? lock_acquire+0x30/0x80
? refill_obj_stock+0x3d/0x3a0
rt_spin_lock+0x27/0xe0
? refill_obj_stock+0x3d/0x3a0
refill_obj_stock+0x3d/0x3a0
? inactive_task_timer+0x1ad/0x340
kmem_cache_free+0x357/0x560
inactive_task_timer+0x1ad/0x340
? switched_from_dl+0x2d0/0x2d0
__run_hrtimer+0x8a/0x1a0
__hrtimer_run_queues+0x91/0x130
hrtimer_interrupt+0x10f/0x220
__sysvec_apic_timer_interrupt+0x7b/0xd0
sysvec_apic_timer_interrupt+0x4f/0xd0
? asm_sysvec_apic_timer_interrupt+0xa/0x20
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0033:0x7fff196bf6f5
Instead of calling __put_task_struct() directly, we defer it using
call_rcu(). A more natural approach would be to use a workqueue, but
since we can't allocate dynamic memory from atomic context under
PREEMPT_RT, the code would become more complex: we would need to embed a
work_struct instance in task_struct and initialize it whenever a new
task_struct is allocated.
Changelog
=========
v1:
* Initial implementation fixing the splat.
v2:
* Isolate the logic in its own function.
* Fix two more cases caught in review.
v3:
* Change __put_task_struct() to handle the issue internally.
v4:
* Explain why call_rcu() is safe to call from interrupt context.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Cc: Paul McKenney <paulmck@kernel.org>
---
kernel/fork.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
Comments
On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> locks. Therefore, it can't be called from a non-preemptible context.
>
> One practical example is a splat inside inactive_task_timer(), which is
> called in interrupt context:

Do you have more?
inactive_task_timer() is marked as HRTIMER_MODE_REL_HARD, which means
it runs in hardirq context. The author of commit
850377a875a48 ("sched/deadline: Ensure inactive_timer runs in hardirq context")
should have been aware of that.
We have one workaround for that put_task() in sched-switch. I wasn't
aware of this shortcoming. So either we have more problems or potential
problems, or this is the only finding so far.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9f7fe3541897..532dd2ceb6a3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
…
> +void __put_task_struct(struct task_struct *tsk)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))

Is it safe to use the rcu member in any case? If so, why not use it
unconditionally?

> +		/*
> +		 * under PREEMPT_RT, we can't call put_task_struct
> +		 * in atomic context because it will indirectly
> +		 * acquire sleeping locks.
> +		 *
> +		 * call_rcu() will schedule delayed_put_task_struct_rcu()
> +		 * to be called in process context.
> +		 */
> +		call_rcu(&tsk->rcu, delayed_put_task_struct_rcu);
> +	else
> +		___put_task_struct(tsk);
> +}
> EXPORT_SYMBOL_GPL(__put_task_struct);
>
> void __init __weak arch_task_cache_init(void) { }

Sebastian
On 02/06, Sebastian Andrzej Siewior wrote:
>
> On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
>
> > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> …
> > +void __put_task_struct(struct task_struct *tsk)
> > +{
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
>
> Is it safe to use the rcu member in any case?

I think it is safe, but it deserves a comment. I guess Wander
misunderstood me when I asked him to do this...

__put_task_struct() is called when refcount_dec_and_test(&t->usage)
succeeds.

This means that it can't "conflict" with put_task_struct_rcu_user(),
which abuses ->rcu the same way; rcu_users has a reference, so
task->usage can't be zero after the rcu_users 1 -> 0 transition.

> If so why not use it
> unconditionally?

Performance?

And... I still don't like the name delayed_put_task_struct_rcu(); to me
___put_task_struct_rcu() looks a bit less confusing, note that we
already have delayed_put_task_struct(). But this is minor.

Oleg.
On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> On 02/06, Sebastian Andrzej Siewior wrote:
> >
> > On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> >
> > > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> > …
> > > +void __put_task_struct(struct task_struct *tsk)
> > > +{
> > > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> >
> > Is it safe to use the rcu member in any case?
>
> I think it is safe but it deserves a comment. I guess Wander
> misunderstood me when I asked him to do this...
>
> __put_task_struct() is called when refcount_dec_and_test(&t->usage)
> succeeds.
>
> This means that it can't "conflict" with put_task_struct_rcu_user(),
> which abuses ->rcu the same way; rcu_users has a reference, so
> task->usage can't be zero after the rcu_users 1 -> 0 transition.

Sounds good.

> > If so why not use it
> > unconditionally?
>
> performance ?

All the free() part is moved from the caller into RCU.

> And... I still don't like the name delayed_put_task_struct_rcu(); to me
> ___put_task_struct_rcu() looks a bit less confusing, note that we
> already have delayed_put_task_struct(). But this is minor.

So if we do it unconditionally then we could get rid of
put_task_struct_rcu_user(). Otherwise we could use
put_task_struct_rcu_user() in that timer callback, because it will lead
to lockdep warnings once printk is fixed.

> Oleg.

Sebastian
On 02/06, Sebastian Andrzej Siewior wrote:
>
> On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> > > > If so why not use it
> > > > unconditionally?
> > >
> > > performance ?
> >
> > All the free() part is moved from the caller into rcu.

Sorry, I don't understand.

> > And... I still don't like the name delayed_put_task_struct_rcu(); to me
> > ___put_task_struct_rcu() looks a bit less confusing, note that we
> > already have delayed_put_task_struct(). But this is minor.
>
> So if we do it unconditionally then we could get rid of
> put_task_struct_rcu_user().

Yes. But the whole purpose of rcu_users is that we want to avoid the
unconditional rcu grace period before free_task()?

Just in case... please note that delayed_put_task_struct() delays
refcount_sub(t->usage), not free_task(). Why do we need this? Consider:

	rcu_read_lock();

	task = find-task-in-rcu-protected-list;

	// Safe, task->usage can't be zero
	get_task_struct(task);

	rcu_read_unlock();

> Otherwise we could use put_task_struct_rcu_user() in that timer
> callback because it will lead to lockdep warnings once printk is fixed.

IIUC there are more in-atomic callers of put_task_struct(). But perhaps
I misunderstood you...

Oleg.
On Mon, Feb 6, 2023 at 11:57 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be called from a non-preemptible context.
> >
> > One practical example is a splat inside inactive_task_timer(), which
> > is called in interrupt context:
>
> Do you have more?
> inactive_task_timer() is marked as HRTIMER_MODE_REL_HARD, which means
> it runs in hardirq context. The author of commit
> 850377a875a48 ("sched/deadline: Ensure inactive_timer runs in hardirq context")
> should have been aware of that.
> We have one workaround for that put_task() in sched-switch. I wasn't
> aware of this shortcoming. So either we have more problems or potential
> problems, or this is the only finding so far.

Valentin spotted two other potential issues; I fixed them in v2 [1].
Also, there is a discussion there that led to this implementation.

> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9f7fe3541897..532dd2ceb6a3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> …
> > +void __put_task_struct(struct task_struct *tsk)
> > +{
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
>
> Is it safe to use the rcu member in any case? If so, why not use it
> unconditionally?

I am unsure what the consequences of moving every call to RCU would be,
so I thought it would be better to play safe and do it only when
necessary.

> > +		/*
> > +		 * under PREEMPT_RT, we can't call put_task_struct
> > +		 * in atomic context because it will indirectly
> > +		 * acquire sleeping locks.
> > +		 *
> > +		 * call_rcu() will schedule delayed_put_task_struct_rcu()
> > +		 * to be called in process context.
> > +		 */
> > +		call_rcu(&tsk->rcu, delayed_put_task_struct_rcu);
> > +	else
> > +		___put_task_struct(tsk);
> > +}
> > EXPORT_SYMBOL_GPL(__put_task_struct);
> >
> > void __init __weak arch_task_cache_init(void) { }
>
> Sebastian

[1] https://lore.kernel.org/all/20230120150246.20797-1-wander@redhat.com/
On Mon, Feb 6, 2023 at 12:27 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 02/06, Sebastian Andrzej Siewior wrote:
> >
> > On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> >
> > > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> > …
> > > +void __put_task_struct(struct task_struct *tsk)
> > > +{
> > > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> >
> > Is it safe to use the rcu member in any case?
>
> I think it is safe but it deserves a comment. I guess Wander
> misunderstood me when I asked him to do this...

Oops, sorry. I will include this description in the next version.

> __put_task_struct() is called when refcount_dec_and_test(&t->usage)
> succeeds.
>
> This means that it can't "conflict" with put_task_struct_rcu_user(),
> which abuses ->rcu the same way; rcu_users has a reference, so
> task->usage can't be zero after the rcu_users 1 -> 0 transition.
>
> > If so why not use it
> > unconditionally?
>
> performance ?
>
> And... I still don't like the name delayed_put_task_struct_rcu(); to me
> ___put_task_struct_rcu() looks a bit less confusing, note that we
> already have delayed_put_task_struct(). But this is minor.

Ok, I will change it.

> Oleg.
On Mon, Feb 6, 2023 at 1:04 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> > On 02/06, Sebastian Andrzej Siewior wrote:
> > >
> > > On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> > >
> > > > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> > > …
> > > > +void __put_task_struct(struct task_struct *tsk)
> > > > +{
> > > > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> > >
> > > Is it safe to use the rcu member in any case?
> >
> > I think it is safe but it deserves a comment. I guess Wander
> > misunderstood me when I asked him to do this...
> >
> > __put_task_struct() is called when refcount_dec_and_test(&t->usage)
> > succeeds.
> >
> > This means that it can't "conflict" with put_task_struct_rcu_user(),
> > which abuses ->rcu the same way; rcu_users has a reference, so
> > task->usage can't be zero after the rcu_users 1 -> 0 transition.
>
> Sounds good.
>
> > > If so why not use it
> > > unconditionally?
> >
> > performance ?
>
> All the free() part is moved from the caller into rcu.
>
> > And... I still don't like the name delayed_put_task_struct_rcu(); to me
> > ___put_task_struct_rcu() looks a bit less confusing, note that we
> > already have delayed_put_task_struct(). But this is minor.
>
> So if we do it unconditionally then we could get rid of
> put_task_struct_rcu_user().
> Otherwise we could use put_task_struct_rcu_user() in that timer
> callback because it will lead to lockdep warnings once printk is fixed.

put_task_struct_rcu_user() calls delayed_put_task_struct(), which does
more than just call __put_task_struct(). I tried this approach at the
beginning, but I got another splat (unfortunately, I don't remember
where).
On Mon, 6 Feb 2023 10:04:47 -0300 Wander Lairson Costa <wander@redhat.com> wrote:

> Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> locks. Therefore, it can't be called from a non-preemptible context.

Well that's regrettable. Especially if non-preempt kernels don't do
this.

Why does PREEMPT_RT do this, and can it be fixed?

If it cannot be fixed then we should have a might_sleep() in
__put_task_struct() for all kernel configurations, along with an
apologetic comment explaining why.
On Mon, Feb 6, 2023 at 10:09 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 6 Feb 2023 10:04:47 -0300 Wander Lairson Costa <wander@redhat.com> wrote:
>
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be called from a non-preemptible context.
>
> Well that's regrettable. Especially if non-preempt kernels don't do
> this.
>
> Why does PREEMPT_RT do this and can it be fixed?

Under PREEMPT_RT, spin_lock becomes a wrapper around rtmutex, which is a
sleeping lock. This is necessary because of the deterministic scheduling
requirements of the RT kernel. Most of the places that run in atomic
context in the stock kernel become process context in the RT kernel, so
the spin_lock -> rtmutex change is safe. However, there are always
exceptions. In this particular case, __put_task_struct() calls
kmem_cache_free(), which calls refill_obj_stock(). refill_obj_stock()
acquires a local_lock, which is implemented using a spin_lock.

> If it cannot be fixed then we should have a might_sleep() in
> __put_task_struct() for all kernel configurations, along with an
> apologetic comment explaining why.

This patch is supposed to be the fix.
On 2023-02-06 17:27:58 [+0100], Oleg Nesterov wrote:
> On 02/06, Sebastian Andrzej Siewior wrote:
> >
> > On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> > > > > If so why not use it
> > > > > unconditionally?
> > > >
> > > > performance ?
> > >
> > > All the free() part is moved from the caller into rcu.
>
> sorry, I don't understand,

That callback does mostly free() and it is batched with other free()
invocations. It is also moved away from the caller, which _might_ be a
benefit.

> > > And... I still don't like the name delayed_put_task_struct_rcu(); to me
> > > ___put_task_struct_rcu() looks a bit less confusing, note that we
> > > already have delayed_put_task_struct(). But this is minor.
> >
> > So if we do it unconditionally then we could get rid of
> > put_task_struct_rcu_user().
>
> Yes. But the whole purpose of rcu_users is that we want to avoid the
> unconditional rcu grace period before free_task()?

Oh, this is usage vs rcu_users. Okay, I mixed that up.

> Just in case... please note that delayed_put_task_struct() delays
> refcount_sub(t->usage), not free_task().

Just noticed ;)

> Why do we need this? Consider
>
> 	rcu_read_lock();
>
> 	task = find-task-in-rcu-protected-list;
>
> 	// Safe, task->usage can't be zero
> 	get_task_struct(task);
>
> 	rcu_read_unlock();
>
> > Otherwise we could use put_task_struct_rcu_user() in that timer
> > callback because it will lead to lockdep warnings once printk is fixed.
>
> IIUC there are more in-atomic callers of put_task_struct(). But perhaps
> I misunderstood you...

That is true. So you are saying that we don't want to use RCU for
put_task_struct() unconditionally?

> Oleg.

Sebastian
On 2023-02-06 17:09:27 [-0800], Andrew Morton wrote:
> On Mon, 6 Feb 2023 10:04:47 -0300 Wander Lairson Costa <wander@redhat.com> wrote:
>
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be called from a non-preemptible context.
>
> Well that's regrettable. Especially if non-preempt kernels don't do
> this.

Non-preemptible context on PREEMPT_RT. Interrupt handlers and timers
don't count as non-preemptible, because interrupt handlers are threaded
and hrtimers are invoked in softirq context (which is preemptible on
PREEMPT_RT). This case is different because the hrtimer in question was
marked HRTIMER_MODE_REL_HARD; it is therefore invoked in hardirq
context, as requested, with all the problems that follow.

> Why does PREEMPT_RT do this and can it be fixed?

PREEMPT_RT tries to move as much as it can out of hardirq context into
preemptible context. A spinlock_t is preemptible on PREEMPT_RT, while it
is not in other preemption models. The scheduler needs to use
raw_spinlock_t in order to be able to schedule a task from hardirq
context without a deadlock. For memory allocation, only sleeping locks
(spinlock_t) are used, since there is no memory allocation/deallocation
in hardirq context on PREEMPT_RT. These two need to be separated.

> If it cannot be fixed then we should have a might_sleep() in
> __put_task_struct() for all kernel configurations, along with an
> apologetic comment explaining why.

__put_task_struct() should not be invoked in atomic context on
PREEMPT_RT. It is fine, however, in a regular hrtimer. Adding
might_sleep() would trigger a lot of false positives on preemptible and
RT kernels. A might_lock() on a spinlock_t should do the trick from the
LOCKDEP perspective if CONFIG_PROVE_RAW_LOCK_NESTING is enabled. In that
case it should be visible due to the rq lock or due to the hrtimer.

Sebastian
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..532dd2ceb6a3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -840,7 +840,7 @@ static inline void put_signal_struct(struct signal_struct *sig)
 		free_signal_struct(sig);
 }
 
-void __put_task_struct(struct task_struct *tsk)
+static void ___put_task_struct(struct task_struct *tsk)
 {
 	WARN_ON(!tsk->exit_state);
 	WARN_ON(refcount_read(&tsk->usage));
@@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
 	sched_core_free(tsk);
 	free_task(tsk);
 }
+
+static void delayed_put_task_struct_rcu(struct rcu_head *rhp)
+{
+	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+	___put_task_struct(task);
+}
+
+void __put_task_struct(struct task_struct *tsk)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
+		/*
+		 * under PREEMPT_RT, we can't call put_task_struct
+		 * in atomic context because it will indirectly
+		 * acquire sleeping locks.
+		 *
+		 * call_rcu() will schedule delayed_put_task_struct_rcu()
+		 * to be called in process context.
+		 */
+		call_rcu(&tsk->rcu, delayed_put_task_struct_rcu);
+	else
+		___put_task_struct(tsk);
+}
 EXPORT_SYMBOL_GPL(__put_task_struct);
 
 void __init __weak arch_task_cache_init(void) { }