[v2,10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely

Message ID 20230209153204.901518530@redhat.com
State New
Series fold per-CPU vmstats remotely

Commit Message

Marcelo Tosatti Feb. 9, 2023, 3:02 p.m. UTC
Now that the counters are modified via cmpxchg both CPU-locally
(via the account functions) and remotely (via cpu_vm_stats_fold),
it's possible to switch vmstat_shepherd to perform the per-CPU
vmstats folding remotely.
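
As an illustration of why cmpxchg-based updates make remote folding safe, here is a minimal userspace sketch using C11 atomics. It is not the kernel code: NR_CPUS, percpu_diff, global_count, local_add(), remote_fold() and read_global() are hypothetical names, and the use of atomic_exchange() for the fold is an assumption made for brevity rather than the method used by cpu_vm_stats_fold.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 4	/* hypothetical CPU count for this sketch */

static _Atomic long global_count;		/* stands in for the global counter */
static _Atomic int percpu_diff[NR_CPUS];	/* stands in for the per-CPU deltas */

/* Local update: retry the compare-and-exchange until the delta lands. */
static void local_add(int cpu, int delta)
{
	int old;

	assert(cpu >= 0 && cpu < NR_CPUS);
	old = atomic_load(&percpu_diff[cpu]);
	while (!atomic_compare_exchange_weak(&percpu_diff[cpu], &old, old + delta))
		;	/* the failed exchange refreshed 'old'; try again */
}

/* Remote fold: take the whole pending delta atomically, then publish it. */
static void remote_fold(int cpu)
{
	int diff;

	assert(cpu >= 0 && cpu < NR_CPUS);
	diff = atomic_exchange(&percpu_diff[cpu], 0);
	if (diff)
		atomic_fetch_add(&global_count, diff);
}

/* Read the folded global value (pending deltas are not included). */
static long read_global(void)
{
	return atomic_load(&global_count);
}
```

Because both sides touch the per-CPU delta only through atomic read-modify-write operations, a fold running on another CPU cannot lose an update that races with it: the local increment either lands before the exchange and is folded, or lands after it and stays pending for the next pass.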

This fixes the following two problems:

 1. A customer provided some evidence which indicates that
    the idle tick was stopped, yet CPU-specific vmstat
    counters still remained populated.

    Thus one can only assume quiet_vmstat() was not
    invoked on return to the idle loop. If I understand
    correctly, I suspect this divergence might erroneously
    prevent a reclaim attempt by kswapd. If the number of
    zone-specific free pages is below the per-CPU drift
    value, then zone_page_state_snapshot() is used to
    compute a more accurate view of the aforementioned
    statistic. Thus any task blocked on the NUMA node
    specific pfmemalloc_wait queue will be unable to make
    significant progress via direct reclaim unless it is
    killed after being woken up by kswapd
    (see throttle_direct_reclaim()).

 2. With a SCHED_FIFO task that busy loops on a given CPU,
    and the kworker for that CPU at SCHED_OTHER priority,
    queuing work to sync per-CPU vmstats will either cause
    that work to never execute, or cause stalld (i.e. the
    stall daemon) to boost the kworker's priority, which
    causes a latency violation.
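
To make the divergence in problem 1 concrete, the following userspace sketch contrasts a cheap read of the global counter with a snapshot read that also sums the pending per-CPU deltas, which is the idea behind zone_page_state_snapshot(). The struct and function names (zone_counter, counter_read, counter_snapshot) and NR_CPUS are hypothetical, and the sketch deliberately ignores concurrency.

```c
#include <assert.h>

#define NR_CPUS 4	/* hypothetical CPU count for this sketch */

struct zone_counter {
	long global;		/* last folded value, possibly stale */
	int diff[NR_CPUS];	/* per-CPU deltas not yet folded */
};

/* Cheap read: only the global value, which lags behind pending deltas. */
static long counter_read(const struct zone_counter *zc)
{
	assert(zc);
	return zc->global;
}

/* Snapshot read: the global value plus every pending per-CPU delta. */
static long counter_snapshot(const struct zone_counter *zc)
{
	long x;

	assert(zc);
	x = zc->global;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		x += zc->diff[cpu];

	return x < 0 ? 0 : x;	/* clamp a transiently negative sum */
}
```

If a CPU stops its tick while its delta is still non-zero, the cheap read keeps returning the stale global value until something folds that delta, which is the situation the remote fold is meant to avoid.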

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  

Comments

Peter Xu March 2, 2023, 9:01 p.m. UTC | #1
On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> +static void vmstat_shepherd(struct work_struct *w)
> +{
> +	int cpu;
> +
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		cpu_vm_stats_fold(cpu);

Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
to replace the queuing.  Would it be cleaner to move the ifdef into
vmstat_shepherd, then, and keep the common logic?

> +		cond_resched();
> +	}
> +	cpus_read_unlock();
> +
> +	schedule_delayed_work(&shepherd,
> +		round_jiffies_relative(sysctl_stat_interval));
> +}
> +#else
>  static void vmstat_shepherd(struct work_struct *w)
>  {
>  	int cpu;
> @@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
>  	schedule_delayed_work(&shepherd,
>  		round_jiffies_relative(sysctl_stat_interval));
>  }
> +#endif
>  
>  static void __init start_shepherd_timer(void)
>  {
  
Marcelo Tosatti March 2, 2023, 9:16 p.m. UTC | #2
On Thu, Mar 02, 2023 at 04:01:07PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> > +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> > +static void vmstat_shepherd(struct work_struct *w)
> > +{
> > +	int cpu;
> > +
> > +	cpus_read_lock();
> > +	for_each_online_cpu(cpu) {
> > +		cpu_vm_stats_fold(cpu);
> 
> Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
> to replace the queuing.  Would it be cleaner to move the ifdef into
> vmstat_shepherd, then, and keep the common logic?

https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/

Could have

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
/* Fold the remote CPU's counters directly. */
void cpu_flush_vm_stats(int cpu)
{
	cpu_vm_stats_fold(cpu);
}
#else
/* Queue the per-CPU work item so the target CPU folds its own counters. */
void cpu_flush_vm_stats(int cpu)
{
	struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

	if (!delayed_work_pending(dw) && need_update(cpu))
		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
}
#endif

static void vmstat_shepherd(struct work_struct *w)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		cpu_flush_vm_stats(cpu);
		cond_resched();
	}
	cpus_read_unlock();

	schedule_delayed_work(&shepherd,
		round_jiffies_relative(sysctl_stat_interval));
}

This looks really awkward to me. But then, we don't want
schedule_delayed_work if !CONFIG_HAVE_CMPXCHG_LOCAL.
The common part would be the cpus_read_lock and for_each_online_cpu
loop.

So it seems the current separation is quite readable
(unless you have a suggestion).

> > +		cond_resched();
> > +	}
> > +	cpus_read_unlock();
> > +
> > +	schedule_delayed_work(&shepherd,
> > +		round_jiffies_relative(sysctl_stat_interval));
> > +}
> > +#else
> >  static void vmstat_shepherd(struct work_struct *w)
> >  {
> >  	int cpu;
> > @@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
> >  	schedule_delayed_work(&shepherd,
> >  		round_jiffies_relative(sysctl_stat_interval));
> >  }
> > +#endif
> >  
> >  static void __init start_shepherd_timer(void)
> >  {
  
Peter Xu March 2, 2023, 9:30 p.m. UTC | #3
On Thu, Mar 02, 2023 at 06:16:42PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 04:01:07PM -0500, Peter Xu wrote:
> > On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> > > +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > > +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> > > +static void vmstat_shepherd(struct work_struct *w)
> > > +{
> > > +	int cpu;
> > > +
> > > +	cpus_read_lock();
> > > +	for_each_online_cpu(cpu) {
> > > +		cpu_vm_stats_fold(cpu);
> > 
> > Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
> > to replace the queuing.  Would it be cleaner to move the ifdef into
> > vmstat_shepherd, then, and keep the common logic?
> 
> https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/

:-)

[...]

> So it seems the current separation is quite readable
> (unless you have a suggestion).

No, feel free to ignore any of my nitpicks when you don't think they're
warranted. :) Keeping it is fine with me.
  

Patch

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2007,6 +2007,23 @@  static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2026,6 +2043,7 @@  static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {