Message ID | 20231213101745.4526-1-quic_aiquny@quicinc.com |
---|---|
State | New |
Series | kernel: Introduce a write lock/unlock wrapper for tasklist_lock |
Commit Message
Aiqun Yu (Maria)
Dec. 13, 2023, 10:17 a.m. UTC
tasklist_lock is a rwlock, and there are multiple scenarios where its
read lock is held long enough that writers have to wait for it.
In freeze_processes/thaw_processes, walking the tasks to freeze/thaw them
can hold the read lock of tasklist_lock for about 200ms or more on
commercial devices. Meanwhile write_lock_irq spins with preemption and
local irqs disabled until tasklist_lock can be acquired, which leads to
poor responsiveness of the whole system.
Take an example:
1. cpu0 is holding the read lock of tasklist_lock to thaw_processes.
2. cpu1 is waiting for the write lock of tasklist_lock to exec a new
thread, with preemption and local irqs disabled.
3. cpu2 is waiting for the write lock of tasklist_lock to do_exit, with
preemption and local irqs disabled.
4. cpu3 is waiting for the write lock of tasklist_lock to do_exit, with
preemption and local irqs disabled.
So introduce a write lock/unlock wrapper specifically for tasklist_lock.
The current tasklist_lock writers all use write_lock_irq to take
tasklist_lock and write_unlock_irq to release it, which means no writer
needs to wait on tasklist_lock with irqs already disabled. So the write
lock/unlock wrapper here only follows the current design of directly
using local_irq_disable and local_irq_enable, and does not take writer
callers that already have irqs disabled into account.
Use write_trylock in a loop, and re-enable irqs between attempts so the
cpu can respond to interrupts while the lock cannot be taken.
Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
---
fs/exec.c | 10 +++++-----
include/linux/sched/task.h | 29 +++++++++++++++++++++++++++++
kernel/exit.c | 16 ++++++++--------
kernel/fork.c | 6 +++---
kernel/ptrace.c | 12 ++++++------
kernel/sys.c | 8 ++++----
security/keys/keyctl.c | 4 ++--
7 files changed, 57 insertions(+), 28 deletions(-)
base-commit: 88035e5694a86a7167d490bb95e9df97a9bb162b
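
For quick reference, the core of what the patch adds to include/linux/sched/task.h (quoted from the diff at the end of this page) is the busy-wait wrapper below: it spins with write_trylock() and keeps interrupts enabled between attempts, so the cpu can still service interrupts while tasklist_lock cannot be taken.

/* Quoted from the patch below: only disable irqs once the trylock
 * has a chance to succeed; re-enable them between failed attempts. */
static inline void write_lock_tasklist_lock(void)
{
	while (1) {
		local_irq_disable();
		if (write_trylock(&tasklist_lock))
			break;
		local_irq_enable();
		cpu_relax();
	}
}

static inline void write_unlock_tasklist_lock(void)
{
	write_unlock_irq(&tasklist_lock);
}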
Comments
On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote:
> +static inline void write_lock_tasklist_lock(void)
> +{
> +	while (1) {
> +		local_irq_disable();
> +		if (write_trylock(&tasklist_lock))
> +			break;
> +		local_irq_enable();
> +		cpu_relax();

This is a bad implementation though.  You don't set the _QW_WAITING flag
so readers don't know that there's a pending writer.  Also, I've seen
cpu_relax() pessimise CPU behaviour; putting it into a low-power mode
that takes a while to wake up from.

I think the right way to fix this is to pass a boolean flag to
queued_write_lock_slowpath() to let it know whether it can re-enable
interrupts while checking whether _QW_WAITING is set.
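
A sketch of the shape that suggestion takes, assuming queued_write_lock_slowpath() grows a boolean parameter; a complete, compile-tested version from Matthew appears further down this thread, and the elided parts are unchanged from kernel/locking/qrwlock.c.

void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq)
{
	int cnts;
	...
	/* Set the waiting flag to notify readers that a writer is pending */
	atomic_or(_QW_WAITING, &lock->cnts);

	/* When no more readers or writers, set the locked flag */
	do {
		if (irq)
			local_irq_enable();	/* take interrupts while waiting */
		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
		if (irq)
			local_irq_disable();	/* irqs off again before the cmpxchg */
	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
	...
}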
Matthew Wilcox <willy@infradead.org> writes: > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote: >> +static inline void write_lock_tasklist_lock(void) >> +{ >> + while (1) { >> + local_irq_disable(); >> + if (write_trylock(&tasklist_lock)) >> + break; >> + local_irq_enable(); >> + cpu_relax(); > > This is a bad implementation though. You don't set the _QW_WAITING flag > so readers don't know that there's a pending writer. Also, I've seen > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode > that takes a while to wake up from. > > I think the right way to fix this is to pass a boolean flag to > queued_write_lock_slowpath() to let it know whether it can re-enable > interrupts while checking whether _QW_WAITING is set. Yes. It seems to make sense to distinguish between write_lock_irq and write_lock_irqsave and fix this for all of write_lock_irq. Either that or someone can put in the work to start making the tasklist_lock go away. Eric
On 12/14/2023 2:27 AM, Eric W. Biederman wrote: > Matthew Wilcox <willy@infradead.org> writes: > >> On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote: >>> +static inline void write_lock_tasklist_lock(void) >>> +{ >>> + while (1) { >>> + local_irq_disable(); >>> + if (write_trylock(&tasklist_lock)) >>> + break; >>> + local_irq_enable(); >>> + cpu_relax(); >> >> This is a bad implementation though. You don't set the _QW_WAITING flag Any better ideas and suggestions are welcomed. :) >> so readers don't know that there's a pending writer. Also, I've see >> cpu_relax() pessimise CPU behaviour; putting it into a low-power mode >> that takes a while to wake up from. >> >> I think the right way to fix this is to pass a boolean flag to >> queued_write_lock_slowpath() to let it know whether it can re-enable >> interrupts while checking whether _QW_WAITING is set. > > Yes. It seems to make sense to distinguish between write_lock_irq and > write_lock_irqsave and fix this for all of write_lock_irq. > Let me think about this. It seems a possible because there is a special behavior from reader side when in interrupt it will directly get the lock regardless of the pending writer. > Either that or someone can put in the work to start making the > tasklist_lock go away. > > Eric >
On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <ebiederm@xmission.com> > Matthew Wilcox <willy@infradead.org> writes: > > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote: > >> +static inline void write_lock_tasklist_lock(void) > >> +{ > >> + while (1) { > >> + local_irq_disable(); > >> + if (write_trylock(&tasklist_lock)) > >> + break; > >> + local_irq_enable(); > >> + cpu_relax(); > > > > This is a bad implementation though. You don't set the _QW_WAITING flag > > so readers don't know that there's a pending writer. Also, I've seen > > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode > > that takes a while to wake up from. > > > > I think the right way to fix this is to pass a boolean flag to > > queued_write_lock_slowpath() to let it know whether it can re-enable > > interrupts while checking whether _QW_WAITING is set. lock(&lock->wait_lock) enable irq int lock(&lock->wait_lock) You are adding chance for recursive locking. > > Yes. It seems to make sense to distinguish between write_lock_irq and > write_lock_irqsave and fix this for all of write_lock_irq. > > Either that or someone can put in the work to start making the > tasklist_lock go away. > > Eric
On Tue, Dec 26, 2023 at 06:46:52PM +0800, Hillf Danton wrote: > On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <ebiederm@xmission.com> > > Matthew Wilcox <willy@infradead.org> writes: > > > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote: > > >> +static inline void write_lock_tasklist_lock(void) > > >> +{ > > >> + while (1) { > > >> + local_irq_disable(); > > >> + if (write_trylock(&tasklist_lock)) > > >> + break; > > >> + local_irq_enable(); > > >> + cpu_relax(); > > > > > > This is a bad implementation though. You don't set the _QW_WAITING flag > > > so readers don't know that there's a pending writer. Also, I've seen > > > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode > > > that takes a while to wake up from. > > > > > > I think the right way to fix this is to pass a boolean flag to > > > queued_write_lock_slowpath() to let it know whether it can re-enable > > > interrupts while checking whether _QW_WAITING is set. > > lock(&lock->wait_lock) > enable irq > int > lock(&lock->wait_lock) > > You are adding chance for recursive locking. Did you bother to read queued_read_lock_slowpath() before writing this email? if (unlikely(in_interrupt())) { /* * Readers in interrupt context will get the lock immediately * if the writer is just waiting (not holding the lock yet), * so spin with ACQUIRE semantics until the lock is available * without waiting in the queue. */ atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); return;
On 12/26/2023 6:46 PM, Hillf Danton wrote: > On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <ebiederm@xmission.com> >> Matthew Wilcox <willy@infradead.org> writes: >>> On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote: >>>> +static inline void write_lock_tasklist_lock(void) >>>> +{ >>>> + while (1) { >>>> + local_irq_disable(); >>>> + if (write_trylock(&tasklist_lock)) >>>> + break; >>>> + local_irq_enable(); >>>> + cpu_relax(); >>> >>> This is a bad implementation though. You don't set the _QW_WAITING flag >>> so readers don't know that there's a pending writer. Also, I've seen >>> cpu_relax() pessimise CPU behaviour; putting it into a low-power mode >>> that takes a while to wake up from. >>> >>> I think the right way to fix this is to pass a boolean flag to >>> queued_write_lock_slowpath() to let it know whether it can re-enable >>> interrupts while checking whether _QW_WAITING is set. > > lock(&lock->wait_lock) > enable irq > int > lock(&lock->wait_lock) > > You are adding chance for recursive locking. Thx for the comments for discuss of the deadlock possibility. While I think deadlock can be differentiate with below 2 scenarios: 1. queued_write_lock_slowpath being triggered in interrupt context. tasklist_lock don't have write_lock_irq(save) in interrupt context. while for common rw lock, maybe write_lock_irq(save) usage in interrupt context is a possible. so may introduce a state when lock->wait_lock is released and left the _QW_WAITING flag. Welcome others to suggest on designs and comments. 2.queued_read_lock_slowpath can be triggered in interrupt context. And it already have the handle to avoid possible deadlock. In the queued_read_lock_slowpath, there is check whether current context is in interrupt or not, and get the lock directly of only write lock waiting. Pls reference[1]: /* * Readers come here when they cannot get the lock without waiting */ if (unlikely(in_interrupt())) { /* * Readers in interrupt context will get the lock immediately * if the writer is just waiting (not holding the lock yet), * so spin with ACQUIRE semantics until the lock is available * without waiting in the queue. */ atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); return; } [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/qrwlock.c >> >> Yes. It seems to make sense to distinguish between write_lock_irq and >> write_lock_irqsave and fix this for all of write_lock_irq. >> >> Either that or someone can put in the work to start making the >> tasklist_lock go away. >> >> Eric
On Wed, Dec 27, 2023 at 09:41:29AM +0800, Aiqun Yu (Maria) wrote: > On 12/26/2023 6:46 PM, Hillf Danton wrote: > > On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <ebiederm@xmission.com> > > > Matthew Wilcox <willy@infradead.org> writes: > > > > On Wed, Dec 13, 2023 at 06:17:45PM +0800, Maria Yu wrote: > > > > > +static inline void write_lock_tasklist_lock(void) > > > > > +{ > > > > > + while (1) { > > > > > + local_irq_disable(); > > > > > + if (write_trylock(&tasklist_lock)) > > > > > + break; > > > > > + local_irq_enable(); > > > > > + cpu_relax(); > > > > > > > > This is a bad implementation though. You don't set the _QW_WAITING flag > > > > so readers don't know that there's a pending writer. Also, I've seen > > > > cpu_relax() pessimise CPU behaviour; putting it into a low-power mode > > > > that takes a while to wake up from. > > > > > > > > I think the right way to fix this is to pass a boolean flag to > > > > queued_write_lock_slowpath() to let it know whether it can re-enable > > > > interrupts while checking whether _QW_WAITING is set. > > > > lock(&lock->wait_lock) > > enable irq > > int > > lock(&lock->wait_lock) > > > > You are adding chance for recursive locking. > > Thx for the comments for discuss of the deadlock possibility. While I think > deadlock can be differentiate with below 2 scenarios: > 1. queued_write_lock_slowpath being triggered in interrupt context. > tasklist_lock don't have write_lock_irq(save) in interrupt context. > while for common rw lock, maybe write_lock_irq(save) usage in interrupt > context is a possible. > so may introduce a state when lock->wait_lock is released and left the > _QW_WAITING flag. > Welcome others to suggest on designs and comments. Hm? I am confused. You're talking about the scenario where: - CPU B holds the lock for read - CPU A attempts to get the lock for write in user context, fails, sets the _QW_WAITING flag - CPU A re-enables interrupts - CPU A executes an interrupt handler which calls queued_write_lock() - If CPU B has dropped the read lock in the meantime, atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED) succeeds - CPU A calls queued_write_unlock() which stores 0 to the lock and we _lose_ the _QW_WAITING flag for the userspace waiter. How do we end up with CPU A leaving the _QW_WAITING flag set?
On Tue, 26 Dec 2023 20:49:38 +0000 Matthew Wilcox <willy@infradead.org> > On Tue, Dec 26, 2023 at 06:46:52PM +0800, Hillf Danton wrote: > > On Wed, 13 Dec 2023 12:27:05 -0600 Eric W. Biederman <ebiederm@xmission.com> > > > Matthew Wilcox <willy@infradead.org> writes: > > > > I think the right way to fix this is to pass a boolean flag to > > > > queued_write_lock_slowpath() to let it know whether it can re-enable > > > > interrupts while checking whether _QW_WAITING is set. > > > > lock(&lock->wait_lock) > > enable irq > > int > > lock(&lock->wait_lock) > > > > You are adding chance for recursive locking. > > Did you bother to read queued_read_lock_slowpath() before writing this email? Nope but it matters nothing in this case. > > if (unlikely(in_interrupt())) { > /* > * Readers in interrupt context will get the lock immediately > * if the writer is just waiting (not holding the lock yet), > * so spin with ACQUIRE semantics until the lock is available > * without waiting in the queue. > */ > atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); > return; This is the lock acquirer for read in irq context, and it rolls out the red carpet for write acquirer in irq, right Willy? Feel free to ignore the following leg works. /* Set the waiting flag to notify readers that a writer is pending */ atomic_or(_QW_WAITING, &lock->cnts); enable irq; /* When no more readers or writers, set the locked flag */ do { cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)); int atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); deadlock disable irq; Though the case below is safe, it looks not pretty but clumsy. /* Set the waiting flag to notify readers that a writer is pending */ atomic_or(_QW_WAITING, &lock->cnts); /* When no more readers or writers, set the locked flag */ do { enable irq; cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); disable irq; } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
On Wed, Dec 27, 2023 at 07:07:27PM +0800, Hillf Danton wrote: > Feel free to ignore the following leg works. > > /* Set the waiting flag to notify readers that a writer is pending */ > atomic_or(_QW_WAITING, &lock->cnts); > > enable irq; > > /* When no more readers or writers, set the locked flag */ > do { > cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); > } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)); > > int > atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED)); > deadlock > disable irq; That would be a buggy implementation, and would not be what I was thinking. > Though the case below is safe, it looks not pretty but clumsy. > > /* Set the waiting flag to notify readers that a writer is pending */ > atomic_or(_QW_WAITING, &lock->cnts); > > /* When no more readers or writers, set the locked flag */ > do { > enable irq; > > cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); > > disable irq; > > } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)); Why do you think it looks clumsy? It's more or less what I was thinking. -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) ... do { + if (irq) + local_irq_enable(); cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); + if (irq) + local_irq_disable(); } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: > Matthew Wilcox <willy@infradead.org> writes: > > I think the right way to fix this is to pass a boolean flag to > > queued_write_lock_slowpath() to let it know whether it can re-enable > > interrupts while checking whether _QW_WAITING is set. > > Yes. It seems to make sense to distinguish between write_lock_irq and > write_lock_irqsave and fix this for all of write_lock_irq. I wasn't planning on doing anything here, but Hillf kind of pushed me into it. I think it needs to be something like this. Compile tested only. If it ends up getting used, Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h index 75b8f4601b28..1152e080c719 100644 --- a/include/asm-generic/qrwlock.h +++ b/include/asm-generic/qrwlock.h @@ -33,8 +33,8 @@ /* * External function declarations */ -extern void queued_read_lock_slowpath(struct qrwlock *lock); -extern void queued_write_lock_slowpath(struct qrwlock *lock); +void queued_read_lock_slowpath(struct qrwlock *lock); +void queued_write_lock_slowpath(struct qrwlock *lock, bool irq); /** * queued_read_trylock - try to acquire read lock of a queued rwlock @@ -98,7 +98,21 @@ static inline void queued_write_lock(struct qrwlock *lock) if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) return; - queued_write_lock_slowpath(lock); + queued_write_lock_slowpath(lock, false); +} + +/** + * queued_write_lock_irq - acquire write lock of a queued rwlock + * @lock : Pointer to queued rwlock structure + */ +static inline void queued_write_lock_irq(struct qrwlock *lock) +{ + int cnts = 0; + /* Optimize for the unfair lock case where the fair flag is 0. */ + if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) + return; + + queued_write_lock_slowpath(lock, true); } /** @@ -138,6 +152,7 @@ static inline int queued_rwlock_is_contended(struct qrwlock *lock) */ #define arch_read_lock(l) queued_read_lock(l) #define arch_write_lock(l) queued_write_lock(l) +#define arch_write_lock_irq(l) queued_write_lock_irq(l) #define arch_read_trylock(l) queued_read_trylock(l) #define arch_write_trylock(l) queued_write_trylock(l) #define arch_read_unlock(l) queued_read_unlock(l) diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h index c0ef596f340b..897010b6ba0a 100644 --- a/include/linux/rwlock.h +++ b/include/linux/rwlock.h @@ -33,6 +33,7 @@ do { \ extern int do_raw_read_trylock(rwlock_t *lock); extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock); extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock); + extern void do_raw_write_lock_irq(rwlock_t *lock) __acquires(lock); extern int do_raw_write_trylock(rwlock_t *lock); extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock); #else @@ -40,6 +41,7 @@ do { \ # define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock) # define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) # define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0) +# define do_raw_write_lock_irq(rwlock) do {__acquire(lock); arch_write_lock_irq(&(rwlock)->raw_lock); } while (0) # define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock) # define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) #endif diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h index 
dceb0a59b692..6257976dfb72 100644 --- a/include/linux/rwlock_api_smp.h +++ b/include/linux/rwlock_api_smp.h @@ -193,7 +193,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock) local_irq_disable(); preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); - LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock); + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock_irq); } static inline void __raw_write_lock_bh(rwlock_t *lock) diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c index d2ef312a8611..6c644a71b01d 100644 --- a/kernel/locking/qrwlock.c +++ b/kernel/locking/qrwlock.c @@ -61,9 +61,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath); /** * queued_write_lock_slowpath - acquire write lock of a queued rwlock - * @lock : Pointer to queued rwlock structure + * @lock: Pointer to queued rwlock structure + * @irq: True if we can enable interrupts while spinning */ -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) { int cnts; @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) /* When no more readers or writers, set the locked flag */ do { + if (irq) + local_irq_enable(); cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); + if (irq) + local_irq_disable(); } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)); unlock: arch_spin_unlock(&lock->wait_lock); diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c index 87b03d2e41db..bf94551d7435 100644 --- a/kernel/locking/spinlock_debug.c +++ b/kernel/locking/spinlock_debug.c @@ -212,6 +212,13 @@ void do_raw_write_lock(rwlock_t *lock) debug_write_lock_after(lock); } +void do_raw_write_lock_irq(rwlock_t *lock) +{ + debug_write_lock_before(lock); + arch_write_lock_irq(&lock->raw_lock); + debug_write_lock_after(lock); +} + int do_raw_write_trylock(rwlock_t *lock) { int ret = arch_write_trylock(&lock->raw_lock);
Hi Matthew, kernel test robot noticed the following build errors: [auto build test ERROR on tip/locking/core] [also build test ERROR on arnd-asm-generic/master brauner-vfs/vfs.all vfs-idmapping/for-next linus/master v6.7-rc7 next-20231222] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox/Re-PATCH-kernel-Introduce-a-write-lock-unlock-wrapper-for-tasklist_lock/20231229-062352 base: tip/locking/core patch link: https://lore.kernel.org/r/ZY30k7OCtxrdR9oP%40casper.infradead.org patch subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock config: i386-randconfig-011-20231229 (https://download.01.org/0day-ci/archive/20231229/202312291936.G87eGfCo-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231229/202312291936.G87eGfCo-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202312291936.G87eGfCo-lkp@intel.com/ All errors (new ones prefixed by >>): kernel/locking/spinlock_debug.c: In function 'do_raw_write_lock_irq': >> kernel/locking/spinlock_debug.c:217:9: error: implicit declaration of function 'arch_write_lock_irq'; did you mean '_raw_write_lock_irq'? [-Werror=implicit-function-declaration] 217 | arch_write_lock_irq(&lock->raw_lock); | ^~~~~~~~~~~~~~~~~~~ | _raw_write_lock_irq cc1: some warnings being treated as errors vim +217 kernel/locking/spinlock_debug.c 213 214 void do_raw_write_lock_irq(rwlock_t *lock) 215 { 216 debug_write_lock_before(lock); > 217 arch_write_lock_irq(&lock->raw_lock); 218 debug_write_lock_after(lock); 219 } 220
On 12/29/2023 6:20 AM, Matthew Wilcox wrote: > On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: >> Matthew Wilcox <willy@infradead.org> writes: >>> I think the right way to fix this is to pass a boolean flag to >>> queued_write_lock_slowpath() to let it know whether it can re-enable >>> interrupts while checking whether _QW_WAITING is set. >> >> Yes. It seems to make sense to distinguish between write_lock_irq and >> write_lock_irqsave and fix this for all of write_lock_irq. > > I wasn't planning on doing anything here, but Hillf kind of pushed me into > it. I think it needs to be something like this. Compile tested only. > If it ends up getting used, Happy new year! Thx Metthew for chiming into this. I think more thoughts will gain more perfect designs. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h > index 75b8f4601b28..1152e080c719 100644 > --- a/include/asm-generic/qrwlock.h > +++ b/include/asm-generic/qrwlock.h > @@ -33,8 +33,8 @@ > /* > * External function declarations > */ > -extern void queued_read_lock_slowpath(struct qrwlock *lock); > -extern void queued_write_lock_slowpath(struct qrwlock *lock); > +void queued_read_lock_slowpath(struct qrwlock *lock); > +void queued_write_lock_slowpath(struct qrwlock *lock, bool irq); > > /** > * queued_read_trylock - try to acquire read lock of a queued rwlock > @@ -98,7 +98,21 @@ static inline void queued_write_lock(struct qrwlock *lock) > if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) > return; > > - queued_write_lock_slowpath(lock); > + queued_write_lock_slowpath(lock, false); > +} > + > +/** > + * queued_write_lock_irq - acquire write lock of a queued rwlock > + * @lock : Pointer to queued rwlock structure > + */ > +static inline void queued_write_lock_irq(struct qrwlock *lock) > +{ > + int cnts = 0; > + /* Optimize for the unfair lock case where the fair flag is 0. 
*/ > + if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) > + return; > + > + queued_write_lock_slowpath(lock, true); > } > > /** > @@ -138,6 +152,7 @@ static inline int queued_rwlock_is_contended(struct qrwlock *lock) > */ > #define arch_read_lock(l) queued_read_lock(l) > #define arch_write_lock(l) queued_write_lock(l) > +#define arch_write_lock_irq(l) queued_write_lock_irq(l) > #define arch_read_trylock(l) queued_read_trylock(l) > #define arch_write_trylock(l) queued_write_trylock(l) > #define arch_read_unlock(l) queued_read_unlock(l) > diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h > index c0ef596f340b..897010b6ba0a 100644 > --- a/include/linux/rwlock.h > +++ b/include/linux/rwlock.h > @@ -33,6 +33,7 @@ do { \ > extern int do_raw_read_trylock(rwlock_t *lock); > extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock); > extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock); > + extern void do_raw_write_lock_irq(rwlock_t *lock) __acquires(lock); > extern int do_raw_write_trylock(rwlock_t *lock); > extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock); > #else > @@ -40,6 +41,7 @@ do { \ > # define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock) > # define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) > # define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0) > +# define do_raw_write_lock_irq(rwlock) do {__acquire(lock); arch_write_lock_irq(&(rwlock)->raw_lock); } while (0) > # define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock) > # define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0) > #endif > diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h > index dceb0a59b692..6257976dfb72 100644 > --- a/include/linux/rwlock_api_smp.h > +++ b/include/linux/rwlock_api_smp.h > @@ -193,7 +193,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock) > local_irq_disable(); > preempt_disable(); > rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); > - LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock); > + LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock_irq); > } > > static inline void __raw_write_lock_bh(rwlock_t *lock) > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c > index d2ef312a8611..6c644a71b01d 100644 > --- a/kernel/locking/qrwlock.c > +++ b/kernel/locking/qrwlock.c > @@ -61,9 +61,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath); > > /** > * queued_write_lock_slowpath - acquire write lock of a queued rwlock > - * @lock : Pointer to queued rwlock structure > + * @lock: Pointer to queued rwlock structure > + * @irq: True if we can enable interrupts while spinning > */ > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) > { > int cnts; > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > Also a new state showed up after the current design: 1. locked flag with _QW_WAITING, while irq enabled. 2. And this state will be only in interrupt context. 3. lock->wait_lock is hold by the write waiter. So per my understanding, a different behavior also needed to be done in queued_write_lock_slowpath: when (unlikely(in_interrupt())) , get the lock directly. So needed to be done in release path. This is to address Hillf's concern on possibility of deadlock. 
Add Hillf here to merge thread. I am going to have a tested patch V2 accordingly. Feel free to let me know your thoughts prior on that. > /* When no more readers or writers, set the locked flag */ > do { > + if (irq) > + local_irq_enable(); I think write_lock_irqsave also needs to be take account. So loal_irq_save(flags) should be take into account here. > cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING); > + if (irq) > + local_irq_disable(); ditto. > } while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED)); > unlock: > arch_spin_unlock(&lock->wait_lock); > diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c > index 87b03d2e41db..bf94551d7435 100644 > --- a/kernel/locking/spinlock_debug.c > +++ b/kernel/locking/spinlock_debug.c > @@ -212,6 +212,13 @@ void do_raw_write_lock(rwlock_t *lock) > debug_write_lock_after(lock); > } > > +void do_raw_write_lock_irq(rwlock_t *lock) > +{ > + debug_write_lock_before(lock); > + arch_write_lock_irq(&lock->raw_lock); > + debug_write_lock_after(lock); > +} > + > int do_raw_write_trylock(rwlock_t *lock) > { > int ret = arch_write_trylock(&lock->raw_lock);
On Tue, Jan 02, 2024 at 10:19:47AM +0800, Aiqun Yu (Maria) wrote: > On 12/29/2023 6:20 AM, Matthew Wilcox wrote: > > On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: > > > Matthew Wilcox <willy@infradead.org> writes: > > > > I think the right way to fix this is to pass a boolean flag to > > > > queued_write_lock_slowpath() to let it know whether it can re-enable > > > > interrupts while checking whether _QW_WAITING is set. > > > > > > Yes. It seems to make sense to distinguish between write_lock_irq and > > > write_lock_irqsave and fix this for all of write_lock_irq. > > > > I wasn't planning on doing anything here, but Hillf kind of pushed me into > > it. I think it needs to be something like this. Compile tested only. > > If it ends up getting used, > Happy new year! Thank you! I know your new year is a few weeks away still ;-) > > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) > > { > > int cnts; > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > Also a new state showed up after the current design: > 1. locked flag with _QW_WAITING, while irq enabled. > 2. And this state will be only in interrupt context. > 3. lock->wait_lock is hold by the write waiter. > So per my understanding, a different behavior also needed to be done in > queued_write_lock_slowpath: > when (unlikely(in_interrupt())) , get the lock directly. I don't think so. Remember that write_lock_irq() can only be called in process context, and when interrupts are enabled. > So needed to be done in release path. This is to address Hillf's concern on > possibility of deadlock. Hillf's concern is invalid. > > /* When no more readers or writers, set the locked flag */ > > do { > > + if (irq) > > + local_irq_enable(); > I think write_lock_irqsave also needs to be take account. So > loal_irq_save(flags) should be take into account here. If we did want to support the same kind of spinning with interrupts enabled for write_lock_irqsave(), we'd want to pass the flags in and do local_irq_restore(), but I don't know how we'd support write_lock_irq() if we did that -- can we rely on passing in 0 for flags meaning "reenable" on all architectures? And ~0 meaning "don't reenable" on all architectures? That all seems complicated, so I didn't do that.
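
A purely hypothetical sketch of the write_lock_irqsave() variant being weighed here, not code posted in this thread: the flags saved by the caller would have to be threaded down into the slowpath so that spinning restores the caller's irq state rather than unconditionally enabling irqs. (The wait_lock problem discussed above is left aside in this sketch.)

/* Hypothetical: restore the caller's saved irq state while waiting for
 * the readers to drain, and disable irqs again before the cmpxchg. */
static void queued_write_lock_slowpath_irqsave(struct qrwlock *lock,
					       unsigned long flags)
{
	int cnts;
	...
	do {
		local_irq_restore(flags);	/* caller's irq state, may enable */
		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
		local_irq_disable();		/* irqs off again before the cmpxchg */
	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
	...
}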
On 1/2/2024 5:14 PM, Matthew Wilcox wrote: > On Tue, Jan 02, 2024 at 10:19:47AM +0800, Aiqun Yu (Maria) wrote: >> On 12/29/2023 6:20 AM, Matthew Wilcox wrote: >>> On Wed, Dec 13, 2023 at 12:27:05PM -0600, Eric W. Biederman wrote: >>>> Matthew Wilcox <willy@infradead.org> writes: >>>>> I think the right way to fix this is to pass a boolean flag to >>>>> queued_write_lock_slowpath() to let it know whether it can re-enable >>>>> interrupts while checking whether _QW_WAITING is set. >>>> >>>> Yes. It seems to make sense to distinguish between write_lock_irq and >>>> write_lock_irqsave and fix this for all of write_lock_irq. >>> >>> I wasn't planning on doing anything here, but Hillf kind of pushed me into >>> it. I think it needs to be something like this. Compile tested only. >>> If it ends up getting used, >> Happy new year! > > Thank you! I know your new year is a few weeks away still ;-) Yeah, Chinese new year will come about 5 weeks later. :) > >>> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >>> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) >>> { >>> int cnts; >>> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >> Also a new state showed up after the current design: >> 1. locked flag with _QW_WAITING, while irq enabled. >> 2. And this state will be only in interrupt context. >> 3. lock->wait_lock is hold by the write waiter. >> So per my understanding, a different behavior also needed to be done in >> queued_write_lock_slowpath: >> when (unlikely(in_interrupt())) , get the lock directly. > > I don't think so. Remember that write_lock_irq() can only be called in > process context, and when interrupts are enabled. In current kernel drivers, I can see same lock called with write_lock_irq and write_lock_irqsave in different drivers. And this is the scenario I am talking about: 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context) 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt. * this is the new state with _QW_WAITING set, lock->wait_lock locked, interrupt enabled. * 3. cpu0 in-interrupt context and want to do write_lock_irqsave. 4. cpu0 tried to acquire lock->wait_lock again. I was thinking to support both write_lock_irq and write_lock_irqsave with interrupt enabled together in queued_write_lock_slowpath. That's why I am suggesting in write_lock_irqsave when (in_interrupt()), instead spin for the lock->wait_lock, spin to get the lock->cnts directly. > >> So needed to be done in release path. This is to address Hillf's concern on >> possibility of deadlock. > > Hillf's concern is invalid. > >>> /* When no more readers or writers, set the locked flag */ >>> do { >>> + if (irq) >>> + local_irq_enable(); >> I think write_lock_irqsave also needs to be take account. So >> loal_irq_save(flags) should be take into account here. > > If we did want to support the same kind of spinning with interrupts > enabled for write_lock_irqsave(), we'd want to pass the flags in > and do local_irq_restore(), but I don't know how we'd support > write_lock_irq() if we did that -- can we rely on passing in 0 for flags > meaning "reenable" on all architectures? And ~0 meaning "don't > reenable" on all architectures? What about for all write_lock_irq, pass the real flags from local_irq_save(flags) into the queued_write_lock_slowpath? Arch specific valid flags won't be !0 limited then. > > That all seems complicated, so I didn't do that. This is complicated. Also need test verify to ensure. 
A more careful design is better. (Fixed the previous wrong email address.) ^-^
Hello, kernel test robot noticed "WARNING:inconsistent_lock_state" on: commit: 30ebdbe58c5be457f329cb81487df2a9eae886b4 ("Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock") url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox/Re-PATCH-kernel-Introduce-a-write-lock-unlock-wrapper-for-tasklist_lock/20231229-062352 base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a51749ab34d9e5dec548fe38ede7e01e8bb26454 patch link: https://lore.kernel.org/all/ZY30k7OCtxrdR9oP@casper.infradead.org/ patch subject: Re: [PATCH] kernel: Introduce a write lock/unlock wrapper for tasklist_lock in testcase: boot compiler: gcc-12 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +-----------------------------------------------------+------------+------------+ | | a51749ab34 | 30ebdbe58c | +-----------------------------------------------------+------------+------------+ | WARNING:inconsistent_lock_state | 0 | 10 | | inconsistent{IN-HARDIRQ-R}->{HARDIRQ-ON-W}usage | 0 | 10 | +-----------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202401031032.b7d5324-oliver.sang@intel.com [ 75.968288][ T141] WARNING: inconsistent lock state [ 75.968550][ T141] 6.7.0-rc1-00006-g30ebdbe58c5b #1 Tainted: G W N [ 75.968946][ T141] -------------------------------- [ 75.969208][ T141] inconsistent {IN-HARDIRQ-R} -> {HARDIRQ-ON-W} usage. [ 75.969556][ T141] systemd-udevd/141 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 75.969889][ T141] ffff888113a9d958 (&ep->lock){+-.-}-{2:2}, at: ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31)) [ 75.970329][ T141] {IN-HARDIRQ-R} state was registered at: [ 75.970620][ T141] __lock_acquire (kernel/locking/lockdep.c:5090) [ 75.970873][ T141] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) [ 75.971113][ T141] _raw_read_lock_irqsave (include/linux/rwlock_api_smp.h:161 kernel/locking/spinlock.c:236) [ 75.971387][ T141] ep_poll_callback (include/net/busy_poll.h:37 fs/eventpoll.c:434 fs/eventpoll.c:1178) [ 75.971638][ T141] __wake_up_common (kernel/sched/wait.c:90) [ 75.971894][ T141] __wake_up (include/linux/spinlock.h:406 kernel/sched/wait.c:108 kernel/sched/wait.c:127) [ 75.972110][ T141] irq_work_single (kernel/irq_work.c:222) [ 75.972363][ T141] irq_work_run_list (kernel/irq_work.c:251 (discriminator 3)) [ 75.972619][ T141] update_process_times (kernel/time/timer.c:2074) [ 75.972895][ T141] tick_nohz_highres_handler (kernel/time/tick-sched.c:257 kernel/time/tick-sched.c:1516) [ 75.973188][ T141] __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752) [ 75.973460][ T141] hrtimer_interrupt (kernel/time/hrtimer.c:1817) [ 75.973719][ T141] __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1083) [ 75.974031][ T141] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14)) [ 75.974324][ T141] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645) [ 75.974636][ T141] kasan_check_range (mm/kasan/generic.c:186) [ 75.974888][ T141] 
trace_preempt_off (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/cpumask.h:504 include/linux/cpumask.h:1104 include/trace/events/preemptirq.h:51 kernel/trace/trace_preemptirq.c:109) [ 75.975144][ T141] _raw_spin_lock (include/linux/spinlock_api_smp.h:133 kernel/locking/spinlock.c:154) [ 75.975383][ T141] __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:1765) [ 75.975683][ T141] change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:1849) [ 75.975971][ T141] set_memory_ro (arch/x86/mm/pat/set_memory.c:2077) [ 75.976206][ T141] module_enable_ro (kernel/module/strict_rwx.c:19 kernel/module/strict_rwx.c:47) [ 75.976460][ T141] do_init_module (kernel/module/main.c:2572) [ 75.976715][ T141] load_module (kernel/module/main.c:2981) [ 75.976959][ T141] init_module_from_file (kernel/module/main.c:3148) [ 75.977233][ T141] idempotent_init_module (kernel/module/main.c:3165) [ 75.977514][ T141] __ia32_sys_finit_module (include/linux/file.h:45 kernel/module/main.c:3187 kernel/module/main.c:3169 kernel/module/main.c:3169) [ 75.977796][ T141] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230) [ 75.978060][ T141] do_fast_syscall_32 (arch/x86/entry/common.c:255) [ 75.978315][ T141] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121) [ 75.978644][ T141] irq event stamp: 226426 [ 75.978866][ T141] hardirqs last enabled at (226425): syscall_enter_from_user_mode_prepare (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/entry/common.c:122) [ 75.979436][ T141] hardirqs last disabled at (226426): _raw_write_lock_irq (include/linux/rwlock_api_smp.h:193 kernel/locking/spinlock.c:326) [ 75.979932][ T141] softirqs last enabled at (225118): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) [ 75.980407][ T141] softirqs last disabled at (225113): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644) [ 75.980892][ T141] [ 75.980892][ T141] other info that might help us debug this: [ 75.981299][ T141] Possible unsafe locking scenario: [ 75.981299][ T141] [ 75.981676][ T141] CPU0 [ 75.981848][ T141] ---- [ 75.982019][ T141] lock(&ep->lock); [ 75.982224][ T141] <Interrupt> [ 75.982405][ T141] lock(&ep->lock); [ 75.982617][ T141] [ 75.982617][ T141] *** DEADLOCK *** [ 75.982617][ T141] [ 75.983028][ T141] 2 locks held by systemd-udevd/141: [ 75.983299][ T141] #0: ffff888113a9d868 (&ep->mtx){+.+.}-{3:3}, at: ep_send_events (fs/eventpoll.c:1695) [ 75.983758][ T141] #1: ffff888113a9d958 (&ep->lock){+-.-}-{2:2}, at: ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31)) [ 75.984215][ T141] [ 75.984215][ T141] stack backtrace: [ 75.984517][ T141] CPU: 1 PID: 141 Comm: systemd-udevd Tainted: G W N 6.7.0-rc1-00006-g30ebdbe58c5b #1 f53d658e8913bcc30100423f807a4e1a31eca25f [ 75.985251][ T141] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 75.985777][ T141] Call Trace: [ 75.985950][ T141] <TASK> [ 75.986105][ T141] dump_stack_lvl (lib/dump_stack.c:107) [ 75.986344][ T141] mark_lock_irq (kernel/locking/lockdep.c:4216) [ 75.986591][ T141] ? print_usage_bug (kernel/locking/lockdep.c:4206) [ 75.986847][ T141] ? stack_trace_snprint (kernel/stacktrace.c:114) [ 75.987115][ T141] ? 
save_trace (kernel/locking/lockdep.c:586) [ 75.987350][ T141] mark_lock (kernel/locking/lockdep.c:4677) [ 75.987576][ T141] ? mark_lock_irq (kernel/locking/lockdep.c:4638) [ 75.987836][ T141] mark_held_locks (kernel/locking/lockdep.c:4273) [ 75.988077][ T141] lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4291 kernel/locking/lockdep.c:4358) [ 75.988377][ T141] trace_hardirqs_on (kernel/trace/trace_preemptirq.c:62) [ 75.988631][ T141] queued_write_lock_slowpath (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/locking/qrwlock.c:87) [ 75.988926][ T141] ? queued_read_lock_slowpath (kernel/locking/qrwlock.c:68) [ 75.989226][ T141] ? lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) [ 75.989473][ T141] ? lock_sync (kernel/locking/lockdep.c:5721) [ 75.989706][ T141] do_raw_write_lock_irq (include/asm-generic/qrwlock.h:115 kernel/locking/spinlock_debug.c:217) [ 75.989980][ T141] ? do_raw_write_lock (kernel/locking/spinlock_debug.c:215) [ 75.990245][ T141] ? _raw_write_lock_irq (include/linux/rwlock_api_smp.h:195 kernel/locking/spinlock.c:326) [ 75.990512][ T141] ep_start_scan (include/linux/list.h:373 (discriminator 31) include/linux/list.h:571 (discriminator 31) fs/eventpoll.c:628 (discriminator 31)) [ 75.990749][ T141] ep_send_events (fs/eventpoll.c:1701) [ 75.990995][ T141] ? _copy_from_iter_nocache (lib/iov_iter.c:181) [ 75.991296][ T141] ? __ia32_sys_epoll_create (fs/eventpoll.c:1678) [ 75.991579][ T141] ? mark_lock (arch/x86/include/asm/bitops.h:227 (discriminator 3) arch/x86/include/asm/bitops.h:239 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4655 (discriminator 3)) [ 75.991813][ T141] ep_poll (fs/eventpoll.c:1865) [ 75.992030][ T141] ? ep_send_events (fs/eventpoll.c:1827) [ 75.992290][ T141] do_epoll_wait (fs/eventpoll.c:2318) [ 75.992532][ T141] __ia32_sys_epoll_wait (fs/eventpoll.c:2325) [ 75.992810][ T141] ? clockevents_program_event (kernel/time/clockevents.c:336 (discriminator 3)) [ 75.993112][ T141] ? do_epoll_wait (fs/eventpoll.c:2325) [ 75.993366][ T141] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230) [ 75.993627][ T141] do_fast_syscall_32 (arch/x86/entry/common.c:255) [ 75.993879][ T141] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121) [ 75.994204][ T141] RIP: 0023:0xf7f55579 [ 75.994417][ T141] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00 All code ======== 0: b8 01 10 06 03 mov $0x3061001,%eax 5: 74 b4 je 0xffffffffffffffbb 7: 01 10 add %edx,(%rax) 9: 07 (bad) a: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi e: 10 08 adc %cl,(%rax) 10: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi ... 
20: 00 51 52 add %dl,0x52(%rcx) 23: 55 push %rbp 24:* 89 e5 mov %esp,%ebp <-- trapping instruction 26: 0f 34 sysenter 28: cd 80 int $0x80 2a: 5d pop %rbp 2b: 5a pop %rdx 2c: 59 pop %rcx 2d: c3 ret 2e: 90 nop 2f: 90 nop 30: 90 nop 31: 90 nop 32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi 39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi Code starting with the faulting instruction =========================================== 0: 5d pop %rbp 1: 5a pop %rdx 2: 59 pop %rcx 3: c3 ret 4: 90 nop 5: 90 nop 6: 90 nop 7: 90 nop 8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240103/202401031032.b7d5324-oliver.sang@intel.com
On Wed, Jan 03, 2024 at 10:58:33AM +0800, Aiqun Yu (Maria) wrote: > On 1/2/2024 5:14 PM, Matthew Wilcox wrote: > > > > -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > > > > +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) > > > > { > > > > int cnts; > > > > @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) > > > Also a new state showed up after the current design: > > > 1. locked flag with _QW_WAITING, while irq enabled. > > > 2. And this state will be only in interrupt context. > > > 3. lock->wait_lock is hold by the write waiter. > > > So per my understanding, a different behavior also needed to be done in > > > queued_write_lock_slowpath: > > > when (unlikely(in_interrupt())) , get the lock directly. > > > > I don't think so. Remember that write_lock_irq() can only be called in > > process context, and when interrupts are enabled. > In current kernel drivers, I can see same lock called with write_lock_irq > and write_lock_irqsave in different drivers. > > And this is the scenario I am talking about: > 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context) > 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt. Oh, I missed that it was holding the wait_lock. Yes, we also need to release the wait_lock before spinning with interrupts disabled. > I was thinking to support both write_lock_irq and write_lock_irqsave with > interrupt enabled together in queued_write_lock_slowpath. > > That's why I am suggesting in write_lock_irqsave when (in_interrupt()), > instead spin for the lock->wait_lock, spin to get the lock->cnts directly. Mmm, but the interrupt could come in on a different CPU and that would lead to it stealing the wait_lock from the CPU which is merely waiting for the readers to go away.
On 1/4/2024 2:18 AM, Matthew Wilcox wrote: > On Wed, Jan 03, 2024 at 10:58:33AM +0800, Aiqun Yu (Maria) wrote: >> On 1/2/2024 5:14 PM, Matthew Wilcox wrote: >>>>> -void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >>>>> +void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock, bool irq) >>>>> { >>>>> int cnts; >>>>> @@ -82,7 +83,11 @@ void __lockfunc queued_write_lock_slowpath(struct qrwlock *lock) >>>> Also a new state showed up after the current design: >>>> 1. locked flag with _QW_WAITING, while irq enabled. >>>> 2. And this state will be only in interrupt context. >>>> 3. lock->wait_lock is hold by the write waiter. >>>> So per my understanding, a different behavior also needed to be done in >>>> queued_write_lock_slowpath: >>>> when (unlikely(in_interrupt())) , get the lock directly. >>> >>> I don't think so. Remember that write_lock_irq() can only be called in >>> process context, and when interrupts are enabled. >> In current kernel drivers, I can see same lock called with write_lock_irq >> and write_lock_irqsave in different drivers. >> >> And this is the scenario I am talking about: >> 1. cpu0 have task run and called write_lock_irq.(Not in interrupt context) >> 2. cpu0 hold the lock->wait_lock and re-enabled the interrupt. > > Oh, I missed that it was holding the wait_lock. Yes, we also need to > release the wait_lock before spinning with interrupts disabled. > >> I was thinking to support both write_lock_irq and write_lock_irqsave with >> interrupt enabled together in queued_write_lock_slowpath. >> >> That's why I am suggesting in write_lock_irqsave when (in_interrupt()), >> instead spin for the lock->wait_lock, spin to get the lock->cnts directly. > > Mmm, but the interrupt could come in on a different CPU and that would > lead to it stealing the wait_lock from the CPU which is merely waiting > for the readers to go away. That's right. The fairness(or queue mechanism) wouldn't be ensured (only in interrupt context) if we have the special design when (in_interrupt()) spin to get the lock->cnts directly. When in interrupt context, the later write_lock_irqsave may get the lock earlier than the write_lock_irq() which is not in interrupt context. This is a side effect of the design, while similar unfairness design in read lock as well. I think it is reasonable to have in_interrupt() waiters get lock earlier from the whole system's performance of view. >
diff --git a/fs/exec.c b/fs/exec.c
index 4aa19b24f281..030eef6852eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1086,7 +1086,7 @@ static int de_thread(struct task_struct *tsk)
 
 	for (;;) {
 		cgroup_threadgroup_change_begin(tsk);
-		write_lock_irq(&tasklist_lock);
+		write_lock_tasklist_lock();
 		/*
 		 * Do this under tasklist_lock to ensure that
 		 * exit_notify() can't miss ->group_exec_task
@@ -1095,7 +1095,7 @@ static int de_thread(struct task_struct *tsk)
 		if (likely(leader->exit_state))
 			break;
 		__set_current_state(TASK_KILLABLE);
-		write_unlock_irq(&tasklist_lock);
+		write_unlock_tasklist_lock();
 		cgroup_threadgroup_change_end(tsk);
 		schedule();
 		if (__fatal_signal_pending(tsk))
@@ -1150,7 +1150,7 @@ static int de_thread(struct task_struct *tsk)
 		 */
 		if (unlikely(leader->ptrace))
 			__wake_up_parent(leader, leader->parent);
-		write_unlock_irq(&tasklist_lock);
+		write_unlock_tasklist_lock();
 		cgroup_threadgroup_change_end(tsk);
 
 		release_task(leader);
@@ -1198,13 +1198,13 @@ static int unshare_sighand(struct task_struct *me)
 
 		refcount_set(&newsighand->count, 1);
 
-		write_lock_irq(&tasklist_lock);
+		write_lock_tasklist_lock();
 		spin_lock(&oldsighand->siglock);
 		memcpy(newsighand->action, oldsighand->action,
 		       sizeof(newsighand->action));
 		rcu_assign_pointer(me->sighand, newsighand);
 		spin_unlock(&oldsighand->siglock);
-		write_unlock_irq(&tasklist_lock);
+		write_unlock_tasklist_lock();
 
 		__cleanup_sighand(oldsighand);
 	}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a23af225c898..6f69d9a3c868 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -50,6 +50,35 @@ struct kernel_clone_args {
  * a separate lock).
  */
 extern rwlock_t tasklist_lock;
+
+/*
+ * tasklist_lock is a special lock: it can take a long time for all of
+ * its readers to finish, and the plain write_lock_irq() API disables
+ * local interrupts up front, which leaves the current CPU unable to
+ * respond to interrupts for the whole time it is waiting for the lock.
+ *
+ * The current tasklist_lock writers all use write_lock_irq() to take
+ * tasklist_lock and write_unlock_irq() to release it, which means no
+ * writer ever needs to wait on tasklist_lock with interrupts already
+ * disabled. So the lock/unlock wrappers below simply keep the current
+ * design of calling local_irq_disable() and local_irq_enable()
+ * directly.
+ */
+static inline void write_lock_tasklist_lock(void)
+{
+	while (1) {
+		local_irq_disable();
+		if (write_trylock(&tasklist_lock))
+			break;
+		local_irq_enable();
+		cpu_relax();
+	}
+}
+static inline void write_unlock_tasklist_lock(void)
+{
+	write_unlock_irq(&tasklist_lock);
+}
+
 extern spinlock_t mmlist_lock;
 
 extern union thread_union init_thread_union;
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..18b00f477079 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -251,7 +251,7 @@ void release_task(struct task_struct *p)
 
 	cgroup_release(p);
 
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 	ptrace_release_task(p);
 	thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);
@@ -275,7 +275,7 @@ void release_task(struct task_struct *p)
 			leader->exit_state = EXIT_DEAD;
 	}
 
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 	seccomp_filter_release(p);
 	proc_flush_pid(thread_pid);
 	put_pid(thread_pid);
@@ -598,7 +598,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
 		return reaper;
 	}
 
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 
 	list_for_each_entry_safe(p, n, dead, ptrace_entry) {
 		list_del_init(&p->ptrace_entry);
@@ -606,7 +606,7 @@ static struct task_struct *find_child_reaper(struct task_struct *father,
 	}
 
 	zap_pid_ns_processes(pid_ns);
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 
 	return father;
 }
@@ -730,7 +730,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	struct task_struct *p, *n;
 	LIST_HEAD(dead);
 
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 	forget_original_parent(tsk, &dead);
 
 	if (group_dead)
@@ -758,7 +758,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
 		wake_up_process(tsk->signal->group_exec_task);
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 
 	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
 		list_del_init(&p->ptrace_entry);
@@ -1172,7 +1172,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		wo->wo_stat = status;
 
 	if (state == EXIT_TRACE) {
-		write_lock_irq(&tasklist_lock);
+		write_lock_tasklist_lock();
 		/* We dropped tasklist, ptracer could die and untrace */
 		ptrace_unlink(p);
 
@@ -1181,7 +1181,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		if (do_notify_parent(p, p->exit_signal))
 			state = EXIT_DEAD;
 		p->exit_state = state;
-		write_unlock_irq(&tasklist_lock);
+		write_unlock_tasklist_lock();
 	}
 	if (state == EXIT_DEAD)
 		release_task(p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..06c4b4ab9102 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2623,7 +2623,7 @@ __latent_entropy struct task_struct *copy_process(
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
 	 */
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
@@ -2714,7 +2714,7 @@ __latent_entropy struct task_struct *copy_process(
 	hlist_del_init(&delayed.node);
 	spin_unlock(&current->sighand->siglock);
 	syscall_tracepoint_update(p);
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 
 	if (pidfile)
 		fd_install(pidfd, pidfile);
@@ -2735,7 +2735,7 @@ __latent_entropy struct task_struct *copy_process(
 bad_fork_cancel_cgroup:
 	sched_core_free(p);
 	spin_unlock(&current->sighand->siglock);
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 	cgroup_cancel_fork(p, args);
 bad_fork_put_pidfd:
 	if (clone_flags & CLONE_PIDFD) {
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index d8b5e13a2229..a8d7e2d06f3e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -435,7 +435,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (retval)
 		goto unlock_creds;
 
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 	retval = -EPERM;
 	if (unlikely(task->exit_state))
 		goto unlock_tasklist;
@@ -479,7 +479,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 
 	retval = 0;
 unlock_tasklist:
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 unlock_creds:
 	mutex_unlock(&task->signal->cred_guard_mutex);
 out:
@@ -508,7 +508,7 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 	/* Are we already being traced? */
 	if (!current->ptrace) {
 		ret = security_ptrace_traceme(current->parent);
@@ -522,7 +522,7 @@ static int ptrace_traceme(void)
 			ptrace_link(current, current->real_parent);
 		}
 	}
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 
 	return ret;
 }
@@ -588,7 +588,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
 	/* Architecture-specific hardware disable .. */
 	ptrace_disable(child);
 
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 	/*
 	 * We rely on ptrace_freeze_traced(). It can't be killed and
 	 * untraced by another thread, it can't be a zombie.
@@ -600,7 +600,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
 	 */
 	child->exit_code = data;
 	__ptrace_detach(current, child);
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 
 	proc_ptrace_connector(child, PTRACE_DETACH);
 
diff --git a/kernel/sys.c b/kernel/sys.c
index e219fcfa112d..0b1647d3ed32 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1088,7 +1088,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
 	/* From this point forward we keep holding onto the tasklist lock
 	 * so that our parent does not change from under us. -DaveM
 	 */
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 
 	err = -ESRCH;
 	p = find_task_by_vpid(pid);
@@ -1136,7 +1136,7 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
 	err = 0;
 out:
 	/* All paths lead to here, thus we are safe. -DaveM */
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 	rcu_read_unlock();
 	return err;
 }
@@ -1229,7 +1229,7 @@ int ksys_setsid(void)
 	pid_t session = pid_vnr(sid);
 	int err = -EPERM;
 
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 	/* Fail if I am already a session leader */
 	if (group_leader->signal->leader)
 		goto out;
@@ -1247,7 +1247,7 @@ int ksys_setsid(void)
 
 	err = session;
 out:
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 	if (err > 0) {
 		proc_sid_connector(group_leader);
 		sched_autogroup_create_attach(group_leader);
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 19be69fa4d05..dd8aed20486a 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1652,7 +1652,7 @@ long keyctl_session_to_parent(void)
 	me = current;
 
 	rcu_read_lock();
-	write_lock_irq(&tasklist_lock);
+	write_lock_tasklist_lock();
 
 	ret = -EPERM;
 	oldwork = NULL;
@@ -1702,7 +1702,7 @@ long keyctl_session_to_parent(void)
 	if (!ret)
 		newwork = NULL;
 unlock:
-	write_unlock_irq(&tasklist_lock);
+	write_unlock_tasklist_lock();
 	rcu_read_unlock();
 	if (oldwork)
 		put_cred(container_of(oldwork, struct cred, rcu));
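A possible variant of the wrapper above, sketched here only as an illustration and not part of the posted patch: callers that today use write_lock_irqsave() could get the same "keep interrupts enabled between attempts" behaviour with the write_trylock_irqsave() helper from include/linux/rwlock.h, assuming that helper is available in the tree in question. The function name below is made up for the example.

/* Hypothetical irqsave flavour of the wrapper; not part of the patch. */
static inline unsigned long write_lock_tasklist_lock_irqsave(void)
{
	unsigned long flags;

	/*
	 * write_trylock_irqsave() saves and disables local interrupts only
	 * around the trylock attempt and restores them on failure, so the
	 * CPU stays responsive to interrupts while the lock is contended
	 * (provided interrupts were enabled on entry).
	 */
	while (!write_trylock_irqsave(&tasklist_lock, flags))
		cpu_relax();

	return flags;
}

The matching unlock would then be write_unlock_irqrestore(&tasklist_lock, flags) rather than the unconditional write_unlock_irq() used by the wrapper in the patch.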