[0/3] net, refcount: Address dst_entry reference count scalability issues

Message ID 20230228132118.978145284@linutronix.de
Series: net, refcount: Address dst_entry reference count scalability issues

Message

Thomas Gleixner Feb. 28, 2023, 2:33 p.m. UTC
  Hi!

Wangyang and Arjan reported a bottleneck in the networking code related to
struct dst_entry::__refcnt. Performance tanks massively when concurrency on
a dst_entry increases.

This happens when there are a large number of connections to or from the
same IP address. The memtier benchmark, when run on the same host as
memcached, amplifies this massively. But even over real network connections
this issue can be observed at an obviously smaller scale (due to the
network bandwidth limitations in my setup, i.e. 1Gb).

There are two factors which make this reference count a scalability issue:

   1) False sharing

      dst_entry::__refcnt is located at offset 64 of dst_entry, which puts
      it into a separate cacheline vs. the read-mostly members located at
      the beginning of the struct.

      That prevents false sharing vs. the struct members in the first 64
      bytes of the structure, but there is also

      	    dst_entry::lwtstate

      which is located after the reference count and in the same cache
      line. This member is read after a reference count has been acquired.

      The other problem is struct rtable, which embeds a struct dst_entry
      at offset 0. struct dst_entry has a size of 112 bytes, which means
      that the struct members of rtable which follow the dst member share
      the same cache line as dst_entry::__refcnt. Especially

      	  rtable::rt_genid

      is also read by the contexts which have a reference count acquired
      already.

      When dst_entry::__refcnt is incremented or decremented via an atomic
      operation, these read accesses stall and contribute to the performance
      problem.

   2) atomic_inc_not_zero()

      A reference on dst_entry::__refcnt is acquired via
      atomic_inc_not_zero() and released via atomic_dec_return().

      atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
      which exposes O(N^2) behaviour under contention with N concurrent
      operations.

      Lightweight instrumentation exposed an average of 8!! retry loops per
      atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
      running concurrently on 112 CPUs.

      There is nothing which can be done to make atomic_inc_not_zero() more
      scalable.
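
      For illustration, the usual pattern behind atomic_inc_not_zero()
      looks roughly like the sketch below. This is a simplified sketch,
      not the exact kernel implementation, but it shows why contention
      hurts: every lost cmpxchg forces a reread and a retry.

         static inline bool inc_not_zero_sketch(atomic_t *v)
         {
                 int old = atomic_read(v);

                 do {
                         if (!old)
                                 return false;  /* count already hit zero */
                         /*
                          * atomic_try_cmpxchg() updates 'old' on failure.
                          * Every concurrent modification which wins
                          * invalidates the snapshot of all other CPUs,
                          * which then retry, so with N contenders the
                          * total work grows as O(N^2).
                          */
                 } while (!atomic_try_cmpxchg(v, &old, old + 1));

                 return true;
         }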

The following series addresses these issues:

    1) Reorder and pad struct dst_entry to prevent the false sharing.

    2) Implement and use a reference count implementation which avoids the
       atomic_inc_not_zero() problem.

       It is slightly less performant in the case of the final 1 -> 0
       transition, but the destruction of these objects is a low-frequency
       event. get()/put() pairs are in the hotpath and that's
       what this implementation optimizes for.

       The algorithm of this reference count is only suitable for RCU
       managed objects. Therefore it cannot replace the refcount_t
       algorithm, which is also based on atomic_inc_not_zero(), due to a
       subtle race condition related to the 1 -> 0 transition and the final
       verdict to mark the reference count dead. See details in patch 2/3.

       It might be just my lack of imagination which declares this to be
       impossible and I'd be happy to be proven wrong.

       As a bonus the new rcuref implementation provides underflow/overflow
       detection and mitigation while being performance wise on par with
       open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
       the non-contended case.
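
       To make the difference concrete, the get() fast path of the new
       rcuref scheme boils down to the sketch below. This is a conceptual
       sketch only, not the actual lib/rcuref.c code from patch 2/3,
       which additionally handles the dead/saturation zones and the
       1 -> 0 race mentioned above.

          /*
           * Conceptual sketch. The counter is biased so that a dead or
           * dying object yields a negative value. A get() is then one
           * unconditional atomic add; only if the result is negative
           * does the caller fall back to a slow path. There is no
           * cmpxchg retry loop, so the cost under contention stays
           * linear in the number of operations. This is only safe
           * because RCU keeps the object memory around while readers
           * can still reach it.
           */
          static inline bool rcuref_get_sketch(atomic_t *ref)
          {
                  if (likely(!atomic_add_negative(1, ref)))
                          return true;
                  /* dead or saturated: slow path omitted here */
                  return false;
          }

       The complexity lives on the put() side, where the final 1 -> 0
       transition has to mark the count dead without racing against a
       concurrent get(). That is what restricts this scheme to RCU
       managed objects.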

The combination of these two changes results in performance gains in micro
benchmarks and also localhost and networked memtier benchmarks talking to
memcached. It's hard to quantify the benchmark results as they depend
heavily on the micro-architecture and the number of concurrent operations.

The overall gain of both changes for localhost memtier ranges from 1.2X to
3.2X, and is in the +2% to +5% range for networked operations on a 1Gb connection.

A micro benchmark which enforces maximized concurrency shows a gain between
1.2X and 4.7X!!!

Obviously this is focussed on a particular problem and therefore needs to
be discussed in detail. It also requires wider testing outside of the cases
which this is focussed on.

Though the false sharing issue is obvious and should be addressed
independent of the more focussed reference count changes.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git

I want to say thanks to Wangyang who analyzed the issue and provided
the initial fix for the false sharing problem. Further thanks go to
Arjan, Peter, Marc, Will and Borislav for valuable input and providing
test results on machines which I do not have access to.

Thoughts?

Thanks,

	tglx
---
 include/linux/rcuref.h          |   89 +++++++++++
 include/linux/types.h           |    6 
 include/net/dst.h               |   21 ++
 include/net/sock.h              |    2 
 lib/Makefile                    |    2 
 lib/rcuref.c                    |  311 ++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_nf_core.c         |    2 
 net/core/dst.c                  |   26 ---
 net/core/rtnetlink.c            |    2 
 net/ipv6/route.c                |    6 
 net/netfilter/ipvs/ip_vs_xmit.c |    4 
 11 files changed, 436 insertions(+), 35 deletions(-)
  

Comments

Eric Dumazet Feb. 28, 2023, 3:07 p.m. UTC | #1
On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Hi!
>
> Wangyang and Arjan reported a bottleneck in the networking code related to
> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> a dst_entry increases.

We have per-cpu or per-tcp-socket dst though.

Input path is RCU and does not touch dst refcnt.

In real workloads (200Gbit NIC and above), we do not observe
contention on a dst refcnt.

So it would be nice to know in which cases you noticed these issues;
maybe there is something wrong in some layer.

>
> This happens when there are a large amount of connections to or from the
> same IP address. The memtier benchmark when run on the same host as
> memcached amplifies this massively. But even over real network connections
> this issue can be observed at an obviously smaller scale (due to the
> network bandwith limitations in my setup, i.e. 1Gb).
>
> There are two factors which make this reference count a scalability issue:
>
>    1) False sharing
>
>       dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
>       it into a seperate cacheline vs. the read mostly members located at
>       the beginning of the struct.
>
>       That prevents false sharing vs. the struct members in the first 64
>       bytes of the structure, but there is also
>
>             dst_entry::lwtstate
>
>       which is located after the reference count and in the same cache
>       line. This member is read after a reference count has been acquired.
>
>       The other problem is struct rtable, which embeds a struct dst_entry
>       at offset 0. struct dst_entry has a size of 112 bytes, which means
>       that the struct members of rtable which follow the dst member share
>       the same cache line as dst_entry::__refcnt. Especially
>
>           rtable::rt_genid
>
>       is also read by the contexts which have a reference count acquired
>       already.
>
>       When dst_entry:__refcnt is incremented or decremented via an atomic
>       operation these read accesses stall and contribute to the performance
>       problem.

In our kernel profiles, we never saw dst->refcnt changes being a
serious problem.

>
>    2) atomic_inc_not_zero()
>
>       A reference on dst_entry:__refcnt is acquired via
>       atomic_inc_not_zero() and released via atomic_dec_return().
>
>       atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
>       which exposes O(N^2) behaviour under contention with N concurrent
>       operations.
>
>       Lightweight instrumentation exposed an average of 8!! retry loops per
>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>       running concurrently on 112 CPUs.

User space benchmark <> kernel space.
And we tend not to use 112 CPUs for kernel stack processing.

Again, concurrent dst->refcnt changes are quite unlikely.

>
>       There is nothing which can be done to make atomic_inc_not_zero() more
>       scalable.
>
> The following series addresses these issues:
>
>     1) Reorder and pad struct dst_entry to prevent the false sharing.
>
>     2) Implement and use a reference count implementation which avoids the
>        atomic_inc_not_zero() problem.
>
>        It is slightly less performant in the case of the final 1 -> 0
>        transition, but the deconstruction of these objects is a low
>        frequency event. get()/put() pairs are in the hotpath and that's
>        what this implementation optimizes for.
>
>        The algorithm of this reference count is only suitable for RCU
>        managed objects. Therefore it cannot replace the refcount_t
>        algorithm, which is also based on atomic_inc_not_zero(), due to a
>        subtle race condition related to the 1 -> 0 transition and the final
>        verdict to mark the reference count dead. See details in patch 2/3.
>
>        It might be just my lack of imagination which declares this to be
>        impossible and I'd be happy to be proven wrong.
>
>        As a bonus the new rcuref implementation provides underflow/overflow
>        detection and mitigation while being performance wise on par with
>        open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
>        the non-contended case.
>
> The combination of these two changes results in performance gains in micro
> benchmarks and also localhost and networked memtier benchmarks talking to
> memcached. It's hard to quantify the benchmark results as they depend
> heavily on the micro-architecture and the number of concurrent operations.
>
> The overall gain of both changes for localhost memtier ranges from 1.2X to
> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>
> A micro benchmark which enforces maximized concurrency shows a gain between
> 1.2X and 4.7X!!!

Can you elaborate on what networking benchmark you have used,
and what actual gains you got?

In which path did access to dst->lwtstate prove to be a problem?

>
> Obviously this is focussed on a particular problem and therefore needs to
> be discussed in detail. It also requires wider testing outside of the cases
> which this is focussed on.
>
> Though the false sharing issue is obvious and should be addressed
> independent of the more focussed reference count changes.
>
> The series is also available from git:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> I want to say thanks to Wangyang who analyzed the issue and provided
> the initial fix for the false sharing problem. Further thanks go to
> Arjan Peter, Marc, Will and Borislav for valuable input and providing
> test results on machines which I do not have access to.
>
> Thoughts?

Initial feeling is that we need more details on the real workload.

Is it some degenerate case with one connected UDP socket used by
multiple threads?

To me, this looks like someone wanted to push a new piece of infra
(include/linux/rcuref.h)
and decided that dst->refcnt would be a perfect place.
Not the other way around (noticing there is an issue and asking the
networking folks about it before designing a solution).

Thanks



>
> Thanks,
>
>         tglx
> ---
>  include/linux/rcuref.h          |   89 +++++++++++
>  include/linux/types.h           |    6
>  include/net/dst.h               |   21 ++
>  include/net/sock.h              |    2
>  lib/Makefile                    |    2
>  lib/rcuref.c                    |  311 ++++++++++++++++++++++++++++++++++++++++
>  net/bridge/br_nf_core.c         |    2
>  net/core/dst.c                  |   26 ---
>  net/core/rtnetlink.c            |    2
>  net/ipv6/route.c                |    6
>  net/netfilter/ipvs/ip_vs_xmit.c |    4
>  11 files changed, 436 insertions(+), 35 deletions(-)
  
Thomas Gleixner Feb. 28, 2023, 4:38 p.m. UTC | #2
Eric!

On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Hi!
>>
>> Wangyang and Arjan reported a bottleneck in the networking code related to
>> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
>> a dst_entry increases.
>
> We have per-cpu or per-tcp-socket dst though.
>
> Input path is RCU and does not touch dst refcnt.
>
> In real workloads (200Gbit NIC and above), we do not observe
> contention on a dst refcnt.
>
> So it would be nice knowing in which case you noticed some issues,
> maybe there is something wrong in some layer.

Two lines further down I explained which benchmark was used, no?

>> This happens when there are a large amount of connections to or from the
>> same IP address. The memtier benchmark when run on the same host as
>> memcached amplifies this massively. But even over real network connections
>> this issue can be observed at an obviously smaller scale (due to the
>> network bandwith limitations in my setup, i.e. 1Gb).
>>       atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
>>       which exposes O(N^2) behaviour under contention with N concurrent
>>       operations.
>>
>>       Lightweight instrumentation exposed an average of 8!! retry loops per
>>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>>       running concurrently on 112 CPUs.
>
> User space benchmark <> kernel space.

I know that. The point was to illustrate the non-scalability.

> And we tend not using 112 cpus for kernel stack processing.
>
> Again, concurrent dst->refcnt changes are quite unlikely.

So unlikely that they stand out in that particular benchmark.

>> The overall gain of both changes for localhost memtier ranges from 1.2X to
>> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>>
>> A micro benchmark which enforces maximized concurrency shows a gain between
>> 1.2X and 4.7X!!!
>
> Can you elaborate on what networking benchmark you have used,
> and what actual gains you got ?

I'm happy to repeat here that it was memtier/memcached as I explained
more than once in the cover letter.

> In which path access to dst->lwtstate proved to be a problem ?

ip_finish_output2()
   if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

> To me, this looks like someone wanted to push a new piece of infra
> (include/linux/rcuref.h)
> and decided that dst->refcnt would be a perfect place.
>
> Not the other way (noticing there is an issue, enquire networking
> folks about it, before designing a solution)

We looked at this because the reference count operations stood out in
perf top and we analyzed it down to the false sharing _and_ the
non-scalability of atomic_inc_not_zero().

That's what made me think about a different implementation, and yes,
the case at hand, which happens to be in the network code, allowed me to
verify that it actually works and scales better.

I'm terribly sorry that I failed to first ask the network people for
permission to think. Won't happen again.

Thanks,

        tglx
  
Eric Dumazet Feb. 28, 2023, 4:59 p.m. UTC | #3
On Tue, Feb 28, 2023 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Eric!
>
> On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> > On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Hi!
> >>
> >> Wangyang and Arjan reported a bottleneck in the networking code related to
> >> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> >> a dst_entry increases.
> >
> > We have per-cpu or per-tcp-socket dst though.
> >
> > Input path is RCU and does not touch dst refcnt.
> >
> > In real workloads (200Gbit NIC and above), we do not observe
> > contention on a dst refcnt.
> >
> > So it would be nice knowing in which case you noticed some issues,
> > maybe there is something wrong in some layer.
>
> Two lines further down I explained which benchmark was used, no?
>
> >> This happens when there are a large amount of connections to or from the
> >> same IP address. The memtier benchmark when run on the same host as
> >> memcached amplifies this massively. But even over real network connections
> >> this issue can be observed at an obviously smaller scale (due to the
> >> network bandwith limitations in my setup, i.e. 1Gb).
> >>       atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
> >>       which exposes O(N^2) behaviour under contention with N concurrent
> >>       operations.
> >>
> >>       Lightweight instrumentation exposed an average of 8!! retry loops per
> >>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> >>       running concurrently on 112 CPUs.
> >
> > User space benchmark <> kernel space.
>
> I know that. The point was to illustrate the non-scalability.
>
> > And we tend not using 112 cpus for kernel stack processing.
> >
> > Again, concurrent dst->refcnt changes are quite unlikely.
>
> So unlikely that they stand out in that particular benchmark.
>
> >> The overall gain of both changes for localhost memtier ranges from 1.2X to
> >> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
> >>
> >> A micro benchmark which enforces maximized concurrency shows a gain between
> >> 1.2X and 4.7X!!!
> >
> > Can you elaborate on what networking benchmark you have used,
> > and what actual gains you got ?
>
> I'm happy to repeat here that it was memtier/memcached as I explained
> more than once in the cover letter.
>
> > In which path access to dst->lwtstate proved to be a problem ?
>
> ip_finish_output2()
>    if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

This change alone should be easy to measure, please do this?

Oftentimes, moving a field looks sane, but the cache line access is
simply done later.
For example when refcnt is changed :)

Making dsts one cache line bigger has a performance impact.

>
> > To me, this looks like someone wanted to push a new piece of infra
> > (include/linux/rcuref.h)
> > and decided that dst->refcnt would be a perfect place.
> >
> > Not the other way (noticing there is an issue, enquire networking
> > folks about it, before designing a solution)
>
> We looked at this because the reference count operations stood out in
> perf top and we analyzed it down to the false sharing _and_ the
> non-scalability of atomic_inc_not_zero().
>

Please share your recipe and perf results.

We must have been very lucky to not see this at Google.

tcp_rr workloads show dst_mtu() costs (60k GCU in the Google fleet),
which outweigh the dst refcnt costs you are mentioning here.
  
Thomas Gleixner March 1, 2023, 1 a.m. UTC | #4
Eric!

On Tue, Feb 28 2023 at 17:59, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >>       Lightweight instrumentation exposed an average of 8!! retry loops per
>> >>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>> >>       running concurrently on 112 CPUs.
>> >
>> > User space benchmark <> kernel space.
>>
>> I know that. The point was to illustrate the non-scalability.
>> >
>> > And we tend not using 112 cpus for kernel stack processing.

Just to be more clear. Scalability of atomic_inc_not_zero() tanks
already with a small number of concurrent operations. As explained in
the cover letter and in the changelogs. The worst case:

  It exposes O(N^2) behaviour under contention with N concurrent
  operations.

The actual impact depends on the micro-architecture. But O(N^2) is bad
by definition, right? It tanks significantly well before 112 CPUs.

>> > In which path access to dst->lwtstate proved to be a problem ?
>>
>> ip_finish_output2()
>>    if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read
>
> This change alone should be easy to measure, please do this ?

Moving lwtstate out also moves rtable::rt_genid out into the next
cacheline. So it's not that easy to measure the impact of moving
lwtstate alone.
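
For reference, the embedding looks roughly like this (member list
abbreviated, not the full definitions):

   struct rtable {
           struct dst_entry  dst;       /* offset 0, 112 bytes            */
           int               rt_genid;  /* same cache line as __refcnt    */
           unsigned int      rt_flags;  /* and lwtstate in current layout */
           /* ... */
   };

So whatever reshuffles the tail of dst_entry also changes which rtable
members end up sharing that cache line.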

We measured the impact of changing struct dst_entry separately as you
can see from the changelog of the actual patch [1/3]:

  "The resulting improvement depends on the micro-architecture and the number
   of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
   benchmark."

It was very intentional _not_ to provide a single benchmark output to
boast about the improvement simply because the outcome differs
massively depending on the micro-architecture.

Here are 3 different contemporary machines to compare the impact of the
padding/reorder of dst_entry and the refcount.

memcached/memtier 1:100
                                A                       B                       C

atomic_t baseline		10702			 6156			 11993
atomic_t + padding		12505   = 1.16x		13994  = 2.27x		 24997  = 2.08x 
rcuref_t + padding		22888   = 2.13x		22080  = 3.58x		 25048  = 2.08x

atomic_t baseline		10702			 6156			 11993
rcuref_t			11843   = 1.10x		 9590  = 1.55x		 15434	= 1.28x  

atomic_t + padding baseline	12505			13994			 24997
rcuref_t + padding		22888   = 1.83x		22080  = 1.57x		 25048  = 1.01x

See?

I surely could have picked the most prominent of each and made a big
fuss about it, but if you read the changelogs carefully, I provided
range numbers which make this clear.

Here are perf top numbers from machine A with a cutoff at 1.00%:

  Baseline:

     14.24% [kernel] [k] __dev_queue_xmit
      7.12% [kernel] [k] ipv4_dst_check
      5.13% [kernel] [k] ip_finish_output2
      4.95% [kernel] [k] dst_release
      4.27% [kernel] [k] ip_rcv_finish_core.constprop.0

     35.71% SUBTOTAL

      3.86% [kernel] [k] ip_skb_dst_mtu
      3.85% [kernel] [k] ipv4_mtu
      3.58% [kernel] [k] raw_local_deliver
      2.44% [kernel] [k] tcp_v4_rcv
      1.45% [kernel] [k] tcp_ack
      1.32% [kernel] [k] skb_release_data
      1.23% [kernel] [k] _raw_spin_lock_irqsave

     53.44% TOTAL
         
  Padding:    
     
     24.21% [kernel] [k] __dev_queue_xmit
      7.75% [kernel] [k] dst_release

     31.96% SUBTOTAL

      1.96% [kernel] [k] tcp_v4_rcv
      1.88% [kernel] [k] tcp_current_mss
      1.87% [kernel] [k] __sk_dst_check
      1.84% [kernel] [k] tcp_ack_update_rtt
      1.83% [kernel] [k] _raw_spin_lock_irqsave
      1.83% [kernel] [k] skb_release_data
      1.83% [kernel] [k] ip_rcv_finish_core.constprop.0
      1.70% [kernel] [k] __tcp_ack_snd_check
      1.67% [kernel] [k] tcp_v4_do_rcv
      1.39% [kernel] [k] tcp_ack
      1.20% [kernel] [k] __fget_light.part.0
      1.00% [kernel] [k] _raw_spin_lock_bh

     51.96% TOTAL
         
  Padding + rcuref:
     
      9.23% [kernel] [k] __dev_queue_xmit
      8.81% [kernel] [k] rcuref_put
      2.22% [kernel] [k] tcp_ack
      1.86% [kernel] [k] ip_rcv_finish_core.constprop.0

     18.04% SUBTOTAL

      1.86% [kernel] [k] tcp_v4_rcv
      1.72% [kernel] [k] tcp_current_mss
      1.69% [kernel] [k] skb_release_data
      1.58% [kernel] [k] tcp_ack_update_rtt
      1.56% [kernel] [k] __sk_dst_check
      1.50% [kernel] [k] _raw_spin_lock_irqsave
      1.40% [kernel] [k] __tcp_ack_snd_check
      1.40% [kernel] [k] tcp_v4_do_rcv
      1.39% [kernel] [k] __fget_light.part.0
      1.23% [kernel] [k] tcp_sendmsg_locked
      1.19% [kernel] [k] _raw_spin_lock_bh
      1.17% [kernel] [k] tcp_recvmsg_locked
      1.15% [kernel] [k] ipv4_mtu
      1.14% [kernel] [k] __inet_lookup_established
      1.10% [kernel] [k] ip_skb_dst_mtu
      1.02% [kernel] [k] tcp_rcv_established
      1.02% [kernel] [k] tcp_poll

     45.24% TOTAL

All the outstanding numbers above each SUBTOTAL are related to the
refcnt issues.

As you can see from the above numbers of the three example machines, the
relevant perf output would be totally different, but still correlated to
both the false sharing and the refcount performance.

> Oftentimes, moving a field looks sane, but the cache line access is
> simply done later.
> For example when refcnt is changed :)

Sorry, I can't decode what you are trying to tell me here.

> Making dsts one cache line bigger has a performance impact.

Sure. I'm not claiming that this is completely free. Whether it really
matters is a different story and that's why we are debating this, right?

>> We looked at this because the reference count operations stood out in
>> perf top and we analyzed it down to the false sharing _and_ the
>> non-scalability of atomic_inc_not_zero().
>>
>
> Please share your recipe and perf results.

Sorry for not being explicit enough about this, but I was under the
impression that explicitly mentioning memcached and memtier would be
enough of a hint for people familiar with this matter.

Run memcached with -t $N and memtier_benchmark with -t $M and
--ratio=1:100 on the same machine. Localhost connections obviously
amplify the problem.

Start with the defaults for $N and $M and increase them. Depending on
your machine this will tank at some point. But even in reasonably small
$N, $M scenarios the refcount operations and the resulting false sharing
fallout become visible in perf top. At some point it becomes the
dominating issue while the machine still has capacity...

> We must have been very lucky to not see this at Google.

There _is_ a world outside of Google? :)

Seriously. The point is that even if you @google cannot observe this as
a major issue and it just gives your use case a minimal 0.X gain, it
still contributes to the overall performance, no?

I surely have better things to do than pushing the next newfangled
infrastructure just because.

Thanks,

        tglx
  
Jakub Kicinski March 1, 2023, 3:17 a.m. UTC | #5
FWIW looks good to me, especially the refcount part.
We do see 10% jitter in microbenchmarks due to random cache
effects, so forgive the questioning. But again, the refcount seems
like an obvious win to my noob eyes.

While I have you, it would be remiss of me not to mention my ksoftirq
change which makes a large difference in production workloads:
https://lore.kernel.org/all/20221222221244.1290833-3-kuba@kernel.org/
Is Peter's "rework" of softirq going in for 6.3?

On Wed, 01 Mar 2023 02:00:20 +0100 Thomas Gleixner wrote:
> >> We looked at this because the reference count operations stood out in
> >> perf top and we analyzed it down to the false sharing _and_ the
> >> non-scalability of atomic_inc_not_zero().
> >
> > Please share your recipe and perf results.  
> 
> Sorry for being not explicit enough about this, but I was under the
> impression that explicitely mentioning memcached and memtier would be
> enough of a hint for people famiiar with this matter.

I think the disconnect may be that we are indeed familiar with
the workloads, but these exact workloads don't hit the issue
in production (I don't work at Google but at a similarly large corp).
My initial reaction was also to see if I could find the issue in prod.
Not to question the results, but in the hope that I can indeed find a
repro and make this series an easy sell.

> Run memcached with -t $N and memtier_benchmark with -t $M and
> --ratio=1:100 on the same machine. localhost connections obviously
> amplify the problem,
> 
> Start with the defaults for $N and $M and increase them. Depending on
> your machine this will tank at some point. But even in reasonably small
> $N, $M scenarios the refcount operations and the resulting false sharing
> fallout becomes visible in perf top. At some point it becomes the
> dominating issue while the machine still has capacity...
> 
> > We must have been very lucky to not see this at Google.  
> 
> There _is_ a world outside of Google? :)
> 
> Seriously. The point is that even if you @google cannot obverse this as
> a major issue and it just gives your usecase a minimal 0.X gain, it
> still is contributing to the overall performance, no?
  
Thomas Gleixner March 1, 2023, 10:40 a.m. UTC | #6
On Tue, Feb 28 2023 at 19:17, Jakub Kicinski wrote:
> FWIW looks good to me, especially the refcount part.
> We do see 10% of jitter in microbenchmarks due to random cache
> effects, so forgive the questioning.

Yes, those things are hard to measure. 

> But again, the refcount seems like an obvious win to my noob eyes.
>
> While I have you it would be remiss of me not to mention my ksoftirq
> change which makes a large difference in production workloads:
> https://lore.kernel.org/all/20221222221244.1290833-3-kuba@kernel.org/

Let me find this again.

> Is Peter's "rework" of softirq going in for 6.3?

Not that I'm aware of. That had a few loose ends and got lost in space
AFAICT.

Thanks,

        tglx