[v2,00/11] fold per-CPU vmstats remotely

Message ID 20230209150150.380060673@redhat.com
Series: fold per-CPU vmstats remotely

Message

Marcelo Tosatti Feb. 9, 2023, 3:01 p.m. UTC
  This patch series addresses the following two problems:

    1. A customer provided evidence indicating that the idle
       tick was stopped, yet the CPU-specific vmstat counters
       remained populated.

       One can therefore only assume that quiet_vmstat() was
       not invoked on return to the idle loop. This divergence
       may erroneously prevent a reclaim attempt by kswapd: if
       the number of zone-specific free pages is below its
       per-CPU drift value, zone_page_state_snapshot() is used
       to compute a more accurate view of that statistic (a
       sketch of the snapshot logic follows this list). Any
       task blocked on the NUMA-node-specific pfmemalloc_wait
       queue will then be unable to make significant progress
       via direct reclaim unless it is killed after being
       woken up by kswapd (see throttle_direct_reclaim()).

    2. With a SCHED_FIFO task busy-looping on a given CPU, and
       that CPU's kworker at SCHED_OTHER priority, queueing
       work to sync the per-CPU vmstats means that the work
       either never executes, or that stalld (the stall
       daemon) boosts the kworker's priority, causing a
       latency violation.
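
For reference, here is a simplified sketch of the snapshot
logic mentioned in item 1, loosely following
zone_page_state_snapshot() in include/linux/vmstat.h (field
names as in recent kernels; simplified, not the exact
upstream code):

	static inline unsigned long
	zone_page_state_snapshot(struct zone *zone, enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
	#ifdef CONFIG_SMP
		int cpu;

		/* Fold in the per-CPU deltas that have not been synced yet. */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->per_cpu_zonestats,
					 cpu)->vm_stat_diff[item];

		/* The deltas can be negative; clamp the result. */
		if (x < 0)
			x = 0;
	#endif
		return x;
	}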

Both problems are addressed by having vmstat_shepherd flush
the per-CPU counters into the global counters from remote
CPUs.

This is done using cmpxchg to manipulate the counters, both
CPU-locally (via the account functions) and remotely (via
cpu_vm_stats_fold).
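
To make the scheme concrete, a hedged sketch of the
remote-fold step for a single counter (the real
cpu_vm_stats_fold() iterates over all zones, CPUs and items;
names and structure are simplified here, and 1-byte cmpxchg
support is assumed, which is what the arch percpu.h changes
in this series help provide):

	static void fold_one_diff(struct zone *zone, int cpu,
				  enum zone_stat_item item)
	{
		struct per_cpu_zonestat *pzstats =
			per_cpu_ptr(zone->per_cpu_zonestats, cpu);
		s8 v = READ_ONCE(pzstats->vm_stat_diff[item]);

		/* Atomically claim the delta; try_cmpxchg() refreshes
		 * 'v' with the current value when it fails. */
		while (!try_cmpxchg(&pzstats->vm_stat_diff[item], &v, 0))
			;

		if (v)
			atomic_long_add(v, &zone->vm_stat[item]);
	}

Because the local account functions also update vm_stat_diff
with cmpxchg, the remote fold and the local increments
serialize on the same location instead of relying on CPU-local
exclusivity.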

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.

v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops, as illustrated below
  (Uros Bizjak / Matthew Wilcox)
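
For clarity, the try_cmpxchg() conversion follows the usual
pattern; a generic illustration (not a hunk from the series),
where 'p' points at a counter and 'delta' is the amount to
add:

	/* Before: open-coded cmpxchg loop with manual reload. */
	static inline void add_delta_cmpxchg(long *p, long delta)
	{
		long old = READ_ONCE(*p);

		for (;;) {
			long cur = cmpxchg(p, old, old + delta);

			if (cur == old)
				break;
			old = cur;	/* lost the race, retry */
		}
	}

	/* After: try_cmpxchg() updates 'old' on failure, so the
	 * manual reload disappears. */
	static inline void add_delta_try_cmpxchg(long *p, long delta)
	{
		long old = READ_ONCE(*p);

		do {
		} while (!try_cmpxchg(p, &old, old + delta));
	}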


 arch/arm64/include/asm/percpu.h     |   16 ++-
 arch/loongarch/include/asm/percpu.h |   23 ++++
 arch/s390/include/asm/percpu.h      |    5 +
 arch/x86/include/asm/percpu.h       |   39 ++++----
 include/asm-generic/percpu.h        |   17 +++
 include/linux/mmzone.h              |    3 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/vmstat.c                         |  424 ++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------
 9 files changed, 308 insertions(+), 223 deletions(-)
  

Comments

Marcelo Tosatti Feb. 23, 2023, 2:54 p.m. UTC | #1
On Thu, Feb 09, 2023 at 12:01:50PM -0300, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
> 
> [...]

Friendly ping, any other concerns with this series?

If not, Acked-by or Reviewed-by tags would be welcome.
  
Marcelo Tosatti Feb. 27, 2023, 7:41 p.m. UTC | #2
On Fri, Feb 24, 2023 at 10:34:10AM +0800, Hillf Danton wrote:
> On Thu, Feb 09, 2023 at 12:01:50PM -0300, Marcelo Tosatti wrote:
> > This patch series addresses the following two problems:
> > 
> > [...]
> > 
> > Both problems are addressed by having vmstat_shepherd
> > flush the per-CPU counters into the global counters from
> > remote CPUs.
> > 
> > This is done using cmpxchg to manipulate the counters,
> > both CPU-locally (via the account functions) and remotely
> > (via cpu_vm_stats_fold).
> 
> Frankly another case of bandaid[1]?
> 
> [1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/

Only if you disable the per-CPU vmstat counters for isolated
CPUs entirely (then maintaining those data structures on
isolated CPUs would not be necessary).

That, however, would be terrible for performance.