Message ID | bddb6b00434d4492abca4725c10f8d5a@AcuMS.aculab.com |
---|---|
State | New |
Headers |
From: David Laight <David.Laight@ACULAB.COM>
To: linux-kernel@vger.kernel.org, peterz@infradead.org, longman@redhat.com
Cc: mingo@redhat.com, will@kernel.org, boqun.feng@gmail.com, Linus Torvalds <torvalds@linux-foundation.org>, xinhui.pan@linux.vnet.ibm.com, virtualization@lists.linux-foundation.org, Zeng Heng <zengheng4@huawei.com>
Subject: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Date: Fri, 29 Dec 2023 20:57:13 +0000
Message-ID: <bddb6b00434d4492abca4725c10f8d5a@AcuMS.aculab.com>
In-Reply-To: <73a4b31c9c874081baabad9e5f2e5204@AcuMS.aculab.com>
References: <73a4b31c9c874081baabad9e5f2e5204@AcuMS.aculab.com> |
Series | locking/osq_lock: Optimisations to osq_lock code |
Commit Message
David Laight
Dec. 29, 2023, 8:57 p.m. UTC
this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' addressing mode
(%gs-relative on x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(); it is initialised on the first call.
Signed-off-by: David Laight <david.laight@aculab.com>
---
kernel/locking/osq_lock.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
Comments
On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>  kernel/locking/osq_lock.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
>   */
>
>  struct optimistic_spin_node {
> -	struct optimistic_spin_node *next, *prev;
> +	struct optimistic_spin_node *self, *next, *prev;
>  	int locked; /* 1 if lock acquired */
>  	int cpu; /* encoded CPU # + 1 value */
>  };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

  92		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
  0x0000000000000029 <+25>:	mov    %rcx,%rdx
  0x000000000000002c <+28>:	add    %gs:0x0(%rip),%rdx	# 0x34 <osq_lock+36>

Which looks pretty optimized to me. Maybe an older compiler may generate
more complex code. However, I do have some doubt as to the benefit of
this patch at the expense of making the code a bit more complex.

Cheers,
Longman
* Waiman Long <longman@redhat.com> wrote:

> On 12/29/23 15:57, David Laight wrote:
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >  kernel/locking/osq_lock.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 9bb3a077ba92..b60b0add0161 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -13,7 +13,7 @@
> >   */
> >
> >  struct optimistic_spin_node {
> > -	struct optimistic_spin_node *next, *prev;
> > +	struct optimistic_spin_node *self, *next, *prev;
> >  	int locked; /* 1 if lock acquired */
> >  	int cpu; /* encoded CPU # + 1 value */
> >  };
> > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> >
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> My gcc 11 compiler produces the following x86-64 code:
>
>   92		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>   0x0000000000000029 <+25>:	mov    %rcx,%rdx
>   0x000000000000002c <+28>:	add    %gs:0x0(%rip),%rdx	# 0x34 <osq_lock+36>
>
> Which looks pretty optimized to me. Maybe an older compiler may generate
> more complex code. However, I do have some doubt as to the benefit of
> this patch at the expense of making the code a bit more complex.

GCC-11 is plenty of a look-back window in terms of compiler efficiency:
latest enterprise distros use GCC-11 or newer, while recent desktop
distros use GCC-13. Anything older won't matter, because no major
distribution is going to use new kernels with old compilers.

Thanks,

	Ingo
From: Ingo Molnar
> Sent: 30 December 2023 11:09
>
> * Waiman Long <longman@redhat.com> wrote:
>
> > On 12/29/23 15:57, David Laight wrote:
> > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > > the latter can use an 'offset from register' (%gs for x86-64).
> > >
> > > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > > read with raw_cpu_read(), initialise on first call.
> > >
> > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > ---
> > >  kernel/locking/osq_lock.c | 14 +++++++++-----
> > >  1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > index 9bb3a077ba92..b60b0add0161 100644
> > > --- a/kernel/locking/osq_lock.c
> > > +++ b/kernel/locking/osq_lock.c
> > > @@ -13,7 +13,7 @@
> > >   */
> > >
> > >  struct optimistic_spin_node {
> > > -	struct optimistic_spin_node *next, *prev;
> > > +	struct optimistic_spin_node *self, *next, *prev;
> > >  	int locked; /* 1 if lock acquired */
> > >  	int cpu; /* encoded CPU # + 1 value */
> > >  };
> > > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > >
> > >  bool osq_lock(struct optimistic_spin_queue *lock)
> > >  {
> > > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >
> > My gcc 11 compiler produces the following x86-64 code:
> >
> >   92		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >   0x0000000000000029 <+25>:	mov    %rcx,%rdx
> >   0x000000000000002c <+28>:	add    %gs:0x0(%rip),%rdx	# 0x34 <osq_lock+36>
> >
> > Which looks pretty optimized to me. Maybe an older compiler may generate
> > more complex code. However, I do have some doubt as to the benefit of
> > this patch at the expense of making the code a bit more complex.

My changed code is one instruction shorter!

	18:	65 48 8b 15 00 00 00 	mov    %gs:0x0(%rip),%rdx	# 20 <osq_lock+0x20>
	1f:	00
			1c: R_X86_64_PC32	.data..percpu..shared_aligned-0x4

However it might have one less cache line miss.

> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
> latest enterprise distros use GCC-11 or newer, while recent desktop
> distros use GCC-13. Anything older won't matter, because no major
> distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.

I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
	raw_cpu_read(per_cpu_data_base) + offset_to(xxx)

So the old code generates (after the prologue):

	10:	49 89 fd             	mov    %rdi,%r13
	13:	49 c7 c4 00 00 00 00 	mov    $0x0,%r12
			16: R_X86_64_32S	.data..percpu..shared_aligned
	1a:	e8 00 00 00 00       	callq  1f <osq_lock+0x1f>
			1b: R_X86_64_PC32	debug_smp_processor_id-0x4
	1f:	89 c0                	mov    %eax,%eax
	21:	48 8b 1c c5 00 00 00 	mov    0x0(,%rax,8),%rbx
	28:	00
			25: R_X86_64_32S	__per_cpu_offset
	29:	e8 00 00 00 00       	callq  2e <osq_lock+0x2e>
			2a: R_X86_64_PC32	debug_smp_processor_id-0x4
	2e:	4c 01 e3             	add    %r12,%rbx
	31:	83 c0 01             	add    $0x1,%eax
	34:	c7 43 10 00 00 00 00 	movl   $0x0,0x10(%rbx)
	3b:	48 c7 03 00 00 00 00 	movq   $0x0,(%rbx)
	42:	89 43 14             	mov    %eax,0x14(%rbx)
	45:	41 87 45 00          	xchg   %eax,0x0(%r13)

I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* David Laight <David.Laight@ACULAB.COM> wrote:

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>  	struct optimistic_spin_node *prev, *next;
>  	int old;
>
> -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -		node->cpu = encode_cpu(smp_processor_id());
> +	if (unlikely(!node)) {
> +		int cpu = encode_cpu(smp_processor_id());
> +		node = decode_cpu(cpu);
> +		node->self = node;
> +		node->cpu = cpu;

This whole initialization sequence is suboptimal and needs to be
cleaned up first: the node->cpu field is constant once initialized, so
it should be initialized from appropriate init methods, not runtime in
osq_lock(), right?

Eliminating that initialization branch is a useful micro-optimization
in itself for the hot path.

Thanks,

	Ingo
On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

>  	struct optimistic_spin_node *prev, *next;
>  	int old;
>
> -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -		node->cpu = encode_cpu(smp_processor_id());
> +	if (unlikely(!node)) {
> +		int cpu = encode_cpu(smp_processor_id());
> +		node = decode_cpu(cpu);
> +		node->self = node;
> +		node->cpu = cpu;
> +	}

The proper fix here is to not do that silly

	node = this_cpu_ptr(&osq_node);
	..
	node->next = NULL;

dance at all, but to simply do

	this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.

Yes, you'll eventually end up doing that

	node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

	WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

	prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

	#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

	subl	$1, %eax	#, cpu_nr
	...
	cltq
	addq	__per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

	addq	__per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

	unsigned int cpu_nr = encoded_cpu_val;

	return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

	Linus
On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.

At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:

	-#define per_cpu_offset(x) (__per_cpu_offset[x])
	+#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

	subl	$1, %eax	#, cpu_nr
	addq	__per_cpu_offset(,%rax,8), %rcx

instead of just doing

	addq	__per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.

I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.

	Linus
From: Ingo Molnar
> Sent: 30 December 2023 20:38
>
> * David Laight <David.Laight@ACULAB.COM> wrote:
>
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >  	struct optimistic_spin_node *prev, *next;
> >  	int old;
> >
> > -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > -		node->cpu = encode_cpu(smp_processor_id());
> > +	if (unlikely(!node)) {
> > +		int cpu = encode_cpu(smp_processor_id());
> > +		node = decode_cpu(cpu);
> > +		node->self = node;
> > +		node->cpu = cpu;
>
> This whole initialization sequence is suboptimal and needs to be
> cleaned up first: the node->cpu field is constant once initialized, so
> it should be initialized from appropriate init methods, not runtime in
> osq_lock(), right?

I thought that as well, but there would need to be a list of 'init'
functions for the per-cpu data. I didn't spot one.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
>> Sent: 30 December 2023 11:09
>>
>> * Waiman Long <longman@redhat.com> wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>>
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight <david.laight@aculab.com>
>>>> ---
>>>>  kernel/locking/osq_lock.c | 14 +++++++++-----
>>>>  1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>>   */
>>>>
>>>>  struct optimistic_spin_node {
>>>> -	struct optimistic_spin_node *next, *prev;
>>>> +	struct optimistic_spin_node *self, *next, *prev;
>>>>  	int locked; /* 1 if lock acquired */
>>>>  	int cpu; /* encoded CPU # + 1 value */
>>>>  };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>>
>>>>  bool osq_lock(struct optimistic_spin_queue *lock)
>>>>  {
>>>> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>>
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>>   92		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>   0x0000000000000029 <+25>:	mov    %rcx,%rdx
>>>   0x000000000000002c <+28>:	add    %gs:0x0(%rip),%rdx	# 0x34 <osq_lock+36>
>>>
>>> Which looks pretty optimized to me. Maybe an older compiler may generate
>>> more complex code. However, I do have some doubt as to the benefit of
>>> this patch at the expense of making the code a bit more complex.
>
> My changed code is one instruction shorter!
>
> 	18:	65 48 8b 15 00 00 00 	mov    %gs:0x0(%rip),%rdx	# 20 <osq_lock+0x20>
> 	1f:	00
> 			1c: R_X86_64_PC32	.data..percpu..shared_aligned-0x4
>
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
>> latest enterprise distros use GCC-11 or newer, while recent desktop
>> distros use GCC-13. Anything older won't matter, because no major
>> distribution is going to use new kernels with old compilers.
>
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
> 	raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
>
> 	10:	49 89 fd             	mov    %rdi,%r13
> 	13:	49 c7 c4 00 00 00 00 	mov    $0x0,%r12
> 			16: R_X86_64_32S	.data..percpu..shared_aligned
> 	1a:	e8 00 00 00 00       	callq  1f <osq_lock+0x1f>
> 			1b: R_X86_64_PC32	debug_smp_processor_id-0x4
> 	1f:	89 c0                	mov    %eax,%eax
> 	21:	48 8b 1c c5 00 00 00 	mov    0x0(,%rax,8),%rbx
> 	28:	00
> 			25: R_X86_64_32S	__per_cpu_offset
> 	29:	e8 00 00 00 00       	callq  2e <osq_lock+0x2e>
> 			2a: R_X86_64_PC32	debug_smp_processor_id-0x4
> 	2e:	4c 01 e3             	add    %r12,%rbx
> 	31:	83 c0 01             	add    $0x1,%eax
> 	34:	c7 43 10 00 00 00 00 	movl   $0x0,0x10(%rbx)
> 	3b:	48 c7 03 00 00 00 00 	movq   $0x0,(%rbx)
> 	42:	89 43 14             	mov    %eax,0x14(%rbx)
> 	45:	41 87 45 00          	xchg   %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function rather
> than an offset from %gs.

I have looked up the definition of this_cpu_ptr() and got the following
results:

this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
 * Compared to the generic __my_cpu_offset version, the following
 * saves one instruction and avoids clobbering a temp register.
 */
#define arch_raw_cpu_ptr(ptr)				\
({							\
	unsigned long tcp_ptr__;			\
	asm ("add " __percpu_arg(1) ", %0"		\
	     : "=r" (tcp_ptr__)				\
	     : "m" (this_cpu_off), "0" (ptr));		\
	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
})

The presence of debug_smp_processor_id in your compiled code is likely
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.

#ifdef CONFIG_DEBUG_PREEMPT
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config and so I only get 2
instructions for this_cpu_ptr(). We are not going to optimize the code
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Cheers,
Longman
From: Waiman Long
> Sent: 31 December 2023 03:04
....
> The presence of debug_smp_processor_id in your compiled code is likely
> due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
>
> #ifdef CONFIG_DEBUG_PREEMPT
>   extern unsigned int debug_smp_processor_id(void);
> # define smp_processor_id() debug_smp_processor_id()
> #else
> # define smp_processor_id() __smp_processor_id()
> #endif
>
> I don't have that config entry in my kernel config and so I only get 2
> instructions for this_cpu_ptr(). We are not going to optimize the code
> specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Yes, I discovered that late last night.
I've no idea why it is set. It might even be inherited from the
ubuntu 18.04 (I think) .config I started with (mostly removing extra
modules to reduce compile time and the size of uramdisk).

I'll look at the patches again.
Saving node->prev->cpu as node->prev_cpu will definitely save the
cache-line bounce.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
From: Linus Torvalds
> Sent: 30 December 2023 20:41
>
> On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
> >
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
>
> No, this is horrible.
>
> The problem isn't the "this_cpu_ptr()", it's the rest of the code.
>
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> No. Both of these are crap.
>
> >  	struct optimistic_spin_node *prev, *next;
> >  	int old;
> >
> > -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > -		node->cpu = encode_cpu(smp_processor_id());
> > +	if (unlikely(!node)) {
> > +		int cpu = encode_cpu(smp_processor_id());
> > +		node = decode_cpu(cpu);
> > +		node->self = node;
> > +		node->cpu = cpu;
> > +	}
>
> The proper fix here is to not do that silly
>
> 	node = this_cpu_ptr(&osq_node);
> 	..
> 	node->next = NULL;
>
> dance at all, but to simply do
>
> 	this_cpu_write(osq_node.next, NULL);
>
> in the first place. That makes the whole thing just a single store off
> the segment descriptor.

Is the equivalent true (ie offset from a fixed register) for all SMP archs?
Or do some have to do something rather more complicated?

> Yes, you'll eventually end up doing that
>
> 	node = this_cpu_ptr(&osq_node);

For some reason I've had CONFIG_DEBUG_PREEMPT set.
I don't remember setting it, and can't imagine why I might have.
Best guess is it was inherited from the ubuntu .config I started with.
In any case it changes smp_processor_id() into a real function in
order to check that preemption is disabled.
I'd guess something like
	BUG_ON(!raw_cpu_read(preempt_disable_count))
would be faster and more obvious!

> thing because it then wants to use that raw pointer to do
>
> 	WRITE_ONCE(prev->next, node);
>
> but that's a separate issue and still does not make it worth it to
> create a pointless self-pointer.

I could claim that loading it is one instruction shorter and that if
the cache line containing 'node' is needed and 'pcpu_hot' is
(unexpectedly) not cached it saves a cache line load.
But I probably won't!

> Btw, if you *really* want to solve that separate issue, then make the
> optimistic_spin_node struct not contain the pointers at all, but the
> CPU numbers, and then turn those numbers into the pointers the exact
> same way it does for the "lock->tail" thing, ie doing that whole
>
> 	prev = decode_cpu(old);
>
> dance. That *may* then result in avoiding turning them into pointers
> at all in some cases.

Don't think so.
Pretty much all the uses need to dereference the next/prev pointers.

> Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
> be -1 instead, and add something like
>
> 	#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

I did think about that (but not for these patches).
But it is a lot more dangerous because it absolutely requires the
structure be correctly initialised, not just be all zero.
That might show up some very strange bugs.

> and that would then avoid the +1 / -1 games in encoding/decoding the
> CPU numbers. It causes silly code generated like this:
>
> 	subl	$1, %eax	#, cpu_nr
> 	...
> 	cltq
> 	addq	__per_cpu_offset(,%rax,8), %rcx
>
> which seems honestly stupid. The cltq is there for sign-extension,
> which is because all these things are "int", and the "subl" will
> zero-extend to 64-bit, not sign-extend.

Changing all the variables to 'unsigned int' will remove the sign
extension and, after the decrement, gcc will know the high bits are
zero so shouldn't need to zero extend.

> At that point, I think gcc might be able to just generate
>
> 	addq	__per_cpu_offset-8(,%rax,8), %rcx

That would need the decrement to be 64bit.
A quick test failed to make that work. Probably (as you mentioned in
the next email) because gcc doesn't know that the high bits from
atomic_xchg() are zero.

> but honestly, I think it would be nicer to just have decode_cpu() do
>
> 	unsigned int cpu_nr = encoded_cpu_val;
>
> 	return per_cpu_ptr(&osq_node, cpu_nr);
>
> and not have the -1/+1 at all.

Unless you can persuade gcc that the high bits from atomic_xchg() are
zero that will require a zero extend.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
From: Linus Torvalds
> Sent: 30 December 2023 20:59
>
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
>
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
>
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required the
'node' cache line be present.

> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
>
> 	-#define per_cpu_offset(x) (__per_cpu_offset[x])
> 	+#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
>
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
	__per_cpu_offset[(x) + 0u]

> And gcc still generates
>
> 	subl	$1, %eax	#, cpu_nr
> 	addq	__per_cpu_offset(,%rax,8), %rcx
>
> instead of just doing
>
> 	addq	__per_cpu_offset-8(,%rax,8), %rcx
>
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location is read were cpu_nr to be zero.
I've tried that - but it still failed.

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
>
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit. But that won't work for the signed exchange -
and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
  */
 
 struct optimistic_spin_node {
-	struct optimistic_spin_node *next, *prev;
+	struct optimistic_spin_node *self, *next, *prev;
 	int locked; /* 1 if lock acquired */
 	int cpu; /* encoded CPU # + 1 value */
 };
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
 	struct optimistic_spin_node *prev, *next;
 	int old;
 
-	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
-		node->cpu = encode_cpu(smp_processor_id());
+	if (unlikely(!node)) {
+		int cpu = encode_cpu(smp_processor_id());
+		node = decode_cpu(cpu);
+		node->self = node;
+		node->cpu = cpu;
+	}
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	/*
 	 * Second most likely case.
 	 */
-	node = this_cpu_ptr(&osq_node);
+	node = raw_cpu_read(osq_node.self);
 	next = xchg(&node->next, NULL);
 	if (next) {
 		WRITE_ONCE(next->locked, 1);