[next,2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
| Message ID | fbb1f9ed42b2460c93eeac43a92157c8@AcuMS.aculab.com |
|---|---|
| State | New |
| Series | locking/osq_lock: Optimisations to osq_lock code |
Commit Message
David Laight
Dec. 29, 2023, 10:11 p.m. UTC
osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.
node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).
Only initialise node->cpu once; after that use its value instead
of smp_processor_id() - which is probably a real function call.
Should reduce cache-line bouncing a little.
Signed-off-by: David Laight <david.laight@aculab.com>
---
Re-send without the 'RE:' on the subject line.
kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
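Read together with the diff at the bottom of this page, the resulting osq_lock() entry looks roughly like this (a reconstruction from the hunks below, not a verbatim excerpt; the slow path is elided):

```c
bool osq_lock(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
	struct optimistic_spin_node *prev, *next;
	int old;

	/* node->cpu survives across calls: initialise it only once. */
	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
		node->cpu = encode_cpu(smp_processor_id());

	old = atomic_xchg(&lock->tail, node->cpu);
	if (old == OSQ_UNLOCKED_VAL)
		return true;	/* fast path: the local node was never written */

	/* Contended: only now dirty the node's cache line. */
	prev = decode_cpu(old);
	node->prev = prev;
	node->locked = 0;

	/* ... unqueue/spin slow path continues unchanged ... */
}
```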
Comments
On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
>
> node->locked is set non-zero by another cpu to force a wakeup.
> This can only happen after the 'prev->next = node' assignment,
> so locked can be set to zero just before that (along with the assignment
> to node->prev).
>
> Only initialise node->cpu once; after that use its value instead
> of smp_processor_id() - which is probably a real function call.
>
> Should reduce cache-line bouncing a little.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>
> Re-send without the 'RE:' on the subject line.
>  kernel/locking/osq_lock.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d414eef4bec6..55f5db896c02 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>  	      struct optimistic_spin_node *prev)
>  {
>  	struct optimistic_spin_node *next = NULL;
> -	int curr = encode_cpu(smp_processor_id());
> +	int curr = node->cpu;
>  	int old;
>
>  	/*
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  {
>  	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>  	struct optimistic_spin_node *prev, *next;
> -	int curr = encode_cpu(smp_processor_id());
>  	int old;
>
> -	node->locked = 0;
> -	node->next = NULL;

I have some concern about not clearing node->next at the beginning. I
know that next is supposed to be cleared at the end. I do have some
worry that there may exist a race condition that leaves next not cleared
before it is used again. So you either have to prove that such a race
does not exist, or keep initializing it to NULL at the beginning as it
is today.

Cheers,
Longman

> -	node->cpu = curr;
> +	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> +		node->cpu = encode_cpu(smp_processor_id());
>
>  	/*
>  	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * the node fields we just initialised) semantics when updating
>  	 * the lock tail.
>  	 */
> -	old = atomic_xchg(&lock->tail, curr);
> +	old = atomic_xchg(&lock->tail, node->cpu);
>  	if (old == OSQ_UNLOCKED_VAL)
>  		return true;
>
>  	prev = decode_cpu(old);
>  	node->prev = prev;
> +	node->locked = 0;
>
>  	/*
>  	 * osq_lock() unqueue
> @@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  void osq_unlock(struct optimistic_spin_queue *lock)
>  {
>  	struct optimistic_spin_node *node, *next;
> -	int curr = encode_cpu(smp_processor_id());
> +	int curr = raw_cpu_read(osq_node.cpu);
>
>  	/*
>  	 * Fast path for the uncontended case.
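The race Waiman worries about would be straightforward to instrument. As a purely illustrative aside (this check is hypothetical and not part of the patch), the invariant could be asserted at the top of osq_lock():

```c
bool osq_lock(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);

	/*
	 * Hypothetical debug aid: the patch relies on node->next being
	 * NULL here - both the unqueue exit and osq_unlock() must observe
	 * a non-NULL node->next and clear it before the node is reused.
	 */
	WARN_ON_ONCE(READ_ONCE(node->next) != NULL);

	/* ... remainder of osq_lock() as in the patch ... */
}
```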
From: Waiman Long
> Sent: 30 December 2023 03:20
>
> On 12/29/23 17:11, David Laight wrote:
> > osq_lock() starts by setting node->next to NULL and node->locked to 0.
> > Careful analysis shows that node->next is always NULL on entry.
> >
> > node->locked is set non-zero by another cpu to force a wakeup.
> > This can only happen after the 'prev->next = node' assignment,
> > so locked can be set to zero just before that (along with the assignment
> > to node->prev).
> >
> > Only initialise node->cpu once; after that use its value instead
> > of smp_processor_id() - which is probably a real function call.
> >
> > Should reduce cache-line bouncing a little.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >
> > Re-send without the 'RE:' on the subject line.
> >  kernel/locking/osq_lock.c | 13 ++++++-------
> >  1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index d414eef4bec6..55f5db896c02 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> >  	      struct optimistic_spin_node *prev)
> >  {
> >  	struct optimistic_spin_node *next = NULL;
> > -	int curr = encode_cpu(smp_processor_id());
> > +	int curr = node->cpu;
> >  	int old;
> >
> >  	/*
> > @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> >  	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >  	struct optimistic_spin_node *prev, *next;
> > -	int curr = encode_cpu(smp_processor_id());
> >  	int old;
> >
> > -	node->locked = 0;
> > -	node->next = NULL;
>
> I have some concern about not clearing node->next at the beginning. I
> know that next is supposed to be cleared at the end. I do have some
> worry that there may exist a race condition that leaves next not cleared
> before it is used again. So you either have to prove that such a race
> does not exist, or keep initializing it to NULL at the beginning as it
> is today.

I've stared at the code some more :-)

There are two places where node->next is written non-NULL, both in
osq_lock().
The first is at the top of the slow path, where 'prev->next = node'
(this should be overwriting the NULL set (or not) on entry).
The second is at the bottom of osq_lock(), 'prev->next = next' (Step C),
where the NULL written in Step A is overwritten with the correct 'next'
node.

After either of those the 'node' cpu must later either take the unqueue
exit from osq_lock() or call osq_unlock().
Both require it to wait for node->next to be non-NULL and then NULL it.
(If code calls osq_lock() twice all bets are off!)

So even if (somehow) node->next were non-NULL on entry, it would just be
overwritten by an osq_lock() call from another cpu.
The only problem would be if osq_unlock() were called before the queueing
cpu set 'prev->next = node'. That in itself is unlikely - but would
happen if node->next were always set.

I don't completely understand the 'acquire'/'release' semantics (they
didn't exist when I started doing SMP kernel code in the late 1980s).
But it looks odd that osq_unlock()'s fast path uses _release but the very
similar code in osq_wait_next() uses _acquire.

Indeed, apart from some (assumed) optimisations, I think osq_unlock()
could just be:
	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
	if (next)
		next->locked = 1;

I don't think the order of the tests for lock->tail and node->next
matters in osq_wait_next().
If they were swapped the 'Second most likely case' code from osq_unlock()
could be removed.
(The 'uncontended case' doesn't need to load the address of 'node'.)

	David
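Spelled out, David's suggested osq_unlock() would be just the following (a sketch of his three lines above, with '0' standing for 'no prev node'; as Boqun notes below, it would still need RELEASE ordering to be correct):

```c
/* Sketch of the suggestion above - not the code as merged. */
void osq_unlock(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *next;

	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
	if (next)
		next->locked = 1;
}
```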
On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
[...]
> I don't completely understand the 'acquire'/'release' semantics (they
> didn't exist when I started doing SMP kernel code in the late 1980s).
> But it looks odd that osq_unlock()'s fast path uses _release but the very
> similar code in osq_wait_next() uses _acquire.

The _release in osq_unlock() is needed since unlocks need to be RELEASE
so that lock+unlock can form a critical section (i.e. no memory accesses
can escape). When osq_wait_next() is used in non-unlock cases, the
RELEASE is not required. As for the case where osq_wait_next() is used
in osq_unlock(), there is an xchg() preceding it, which provides a full
barrier, so things are fine.

/me wonders whether we can relax the _acquire in osq_wait_next() into
a _relaxed.

> Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> could just be:
> 	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> 	if (next)
> 		next->locked = 1;

If so we need to provide some sort of RELEASE semantics for the
osq_unlock() in all the cases.

Regards,
Boqun

> I don't think the order of the tests for lock->tail and node->next
> matters in osq_wait_next().
> If they were swapped the 'Second most likely case' code from osq_unlock()
> could be removed.
> (The 'uncontended case' doesn't need to load the address of 'node'.)
>
> David
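For readers following the ordering argument, the osq_unlock() under discussion looked roughly like this at the time (paraphrased from kernel/locking/osq_lock.c; check the tree for the exact code). The RELEASE sits on the fast-path cmpxchg, and the 'second most likely case' is the full-barrier xchg() on node->next that Boqun refers to:

```c
void osq_unlock(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *node, *next;
	int curr = encode_cpu(smp_processor_id());

	/* Fast path for the uncontended case: RELEASE the tail. */
	if (likely(atomic_cmpxchg_release(&lock->tail, curr,
					  OSQ_UNLOCKED_VAL) == curr))
		return;

	/*
	 * Second most likely case: a successor has already linked itself.
	 * xchg() is a full barrier, so the unlock keeps RELEASE ordering.
	 */
	node = this_cpu_ptr(&osq_node);
	next = xchg(&node->next, NULL);
	if (next) {
		WRITE_ONCE(next->locked, 1);
		return;
	}

	/* Otherwise wait for the successor; this path uses _acquire. */
	next = osq_wait_next(lock, node, NULL);
	if (next)
		WRITE_ONCE(next->locked, 1);
}
```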
From: Boqun Feng
> Sent: 02 January 2024 18:54
>
> On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
> [...]
> > But it looks odd that osq_unlock()'s fast path uses _release but the very
> > similar code in osq_wait_next() uses _acquire.
>
> The _release in osq_unlock() is needed since unlocks need to be RELEASE
> so that lock+unlock can form a critical section (i.e. no memory accesses
> can escape). When osq_wait_next() is used in non-unlock cases, the
> RELEASE is not required. As for the case where osq_wait_next() is used
> in osq_unlock(), there is an xchg() preceding it, which provides a full
> barrier, so things are fine.

I know there have been issues with ACQUIRE/RELEASE/FULL xchg in this
code, but is a FULL xchg always needed on node->next?

> /me wonders whether we can relax the _acquire in osq_wait_next() into
> a _relaxed.

I wouldn't have worried about relaxed v release.

> > Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> > could just be:
> > 	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> > 	if (next)
> > 		next->locked = 1;
>
> If so we need to provide some sort of RELEASE semantics for the
> osq_unlock() in all the cases.

I wonder how often the unqueue code happens, and especially for the
last cpu in the list?
I'd only expect need_resched() to return true after spinning for a
while - in which case perhaps it is more likely that there are a lot
of cpus in the queue and the cpu being removed won't be last.
So osq_wait_next() exits on 'xchg(&node->next, NULL) != NULL', which
is a full barrier.

On a slightly different note, I've also wondered whether 'osq_node'
actually needs to be cache-line aligned?
You definitely don't want it spanning two cache lines, but I'd expect
that per-cpu data is mostly accessed by its own cpu.
So you really aren't going to get false sharing with some other per-cpu
data, since the cpu is busy in this code.
So __aligned(16) would do?

	David
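For context, the per-cpu node is declared along these lines (paraphrased from include/linux/osq_lock.h and kernel/locking/osq_lock.c), followed by a hypothetical spelling of the change David is musing about. One caveat: to keep a 24-byte node (two pointers plus two ints on 64-bit) from straddling a cache line, the alignment has to be at least the struct size rounded up to a power of two, so __aligned(32) rather than __aligned(16) would be the safe choice:

```c
/* Paraphrased from mainline - check the tree for the exact declarations. */
struct optimistic_spin_node {
	struct optimistic_spin_node *next, *prev;
	int locked;	/* 1 if lock acquired */
	int cpu;	/* encoded CPU # + 1 value */
};

/* Today: each cpu's node is padded out to a full cache line. */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

/*
 * Hypothetical alternative per David's musing (not a posted patch):
 * keep the node within one cache line without burning a whole line.
 */
static DEFINE_PER_CPU(struct optimistic_spin_node, osq_node) __aligned(32);
```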
```diff
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 	      struct optimistic_spin_node *prev)
 {
 	struct optimistic_spin_node *next = NULL;
-	int curr = encode_cpu(smp_processor_id());
+	int curr = node->cpu;
 	int old;
 
 	/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 {
 	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
 	struct optimistic_spin_node *prev, *next;
-	int curr = encode_cpu(smp_processor_id());
 	int old;
 
-	node->locked = 0;
-	node->next = NULL;
-	node->cpu = curr;
+	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+		node->cpu = encode_cpu(smp_processor_id());
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * the node fields we just initialised) semantics when updating
 	 * the lock tail.
 	 */
-	old = atomic_xchg(&lock->tail, curr);
+	old = atomic_xchg(&lock->tail, node->cpu);
 	if (old == OSQ_UNLOCKED_VAL)
 		return true;
 
 	prev = decode_cpu(old);
 	node->prev = prev;
+	node->locked = 0;
 
 	/*
 	 * osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
 	struct optimistic_spin_node *node, *next;
-	int curr = encode_cpu(smp_processor_id());
+	int curr = raw_cpu_read(osq_node.cpu);
 
 	/*
 	 * Fast path for the uncontended case.
```