[v13,2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies

Message ID 20230105125248.813825852@redhat.com
State New
Series Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped

Commit Message

Marcelo Tosatti Jan. 5, 2023, 12:52 p.m. UTC
  From: Aaron Tomlin <atomlin@atomlin.com>

This patch now uses the previously introduced CPU-specific variable,
vmstat_dirty, to indicate whether a vmstat differential (or imbalance) is
present for a given CPU, so that vmstat processing can be initiated at the
appropriate time. The hope is that this approach is "cheaper" than
need_update(). The idea is based on Marcelo's patch [1].

[1]: https://lore.kernel.org/lkml/20220204173554.763888172@fedora.localdomain/

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/vmstat.c |   48 ++++++++++++++----------------------------------
 1 file changed, 14 insertions(+), 34 deletions(-)
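
For context, the vmstat_dirty flag and its helpers were introduced by the
previous patch in this series; based on the names used here, they presumably
look roughly like the sketch below (an assumption, not the actual 1/6 patch):

/* Assumed sketch of the helpers introduced by the earlier patch. */
static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);

static inline void vmstat_mark_dirty(void)
{
	this_cpu_write(vmstat_dirty, true);
}

static inline void vmstat_clear_dirty(void)
{
	this_cpu_write(vmstat_dirty, false);
}

static inline bool is_vmstat_dirty(void)
{
	return this_cpu_read(vmstat_dirty);
}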
  

Comments

Christoph Lameter Jan. 10, 2023, 12:06 p.m. UTC | #1
On Thu, 5 Jan 2023, Marcelo Tosatti wrote:

> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -381,6 +381,7 @@ void __mod_zone_page_state(struct zone *
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	vmstat_mark_dirty();

__vmstat_mark_dirty()? See earlier patch

>
>  	preempt_enable_nested();
>  }
> @@ -417,6 +418,7 @@ void __mod_node_page_state(struct pglist
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	vmstat_mark_dirty();

Ditto.

>
>  	preempt_enable_nested();
>  }
> @@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
>  	s8 __percpu *p = pcp->vm_stat_diff + item;
>  	long o, n, t, z;
>
> +	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
> +	preempt_disable();

If you are disabling preemption then why do we still need cmpxchg?
Same again below.
  
Frederic Weisbecker Jan. 10, 2023, 12:18 p.m. UTC | #2
On Tue, Jan 10, 2023 at 01:06:37PM +0100, Christoph Lameter wrote:
> On Thu, 5 Jan 2023, Marcelo Tosatti wrote:
> > @@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
> >  	s8 __percpu *p = pcp->vm_stat_diff + item;
> >  	long o, n, t, z;
> >
> > +	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
> > +	preempt_disable();
> 
> If you are disabling preemption then why do we still need cmpxchg?
> Same again below.

Note I'm absolutely clueless with vmstat. But I was wondering about it as well
while reviewing Marcelo's series, so git blame pointed me to:

7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
interrupt disable / enable")

And this seems to mention that this can race with IRQs as well, hence the local
cmpxchg operation.
  
Christoph Lameter Jan. 10, 2023, 1:39 p.m. UTC | #3
On Tue, 10 Jan 2023, Frederic Weisbecker wrote:

> Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> while reviewing Marcelo's series, so git blame pointed me to:
>
> 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> interrupt disable / enable")
>
> And this seem to mention that this can race with IRQs as well, hence the local
> cmpxchg operation.

The race with irq could be an issue but I thought we avoided that and were
content with disabling preemption.

But this issue illustrates the central problem of the patchset: it makes
the lightweight counters not so lightweight anymore. The basic primitives
add a lot of weight. And the per cpu atomic update operations require
the modification of multiple values. The operation cannot be "atomic" in
that sense anymore, and we need some other form of synchronization that can
span multiple instructions.
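
To illustrate (a hypothetical fragment, not code from the patch): with plain
this_cpu operations and no further protection, the counter update and the
dirty-flag write are two independent per-CPU accesses and may land on
different CPUs:

	s8 __percpu *p = zone->per_cpu_zonestats->vm_stat_diff + item;

	this_cpu_add(*p, delta);		/* may run on CPU A */
	/* preemption and migration can happen here */
	this_cpu_write(vmstat_dirty, true);	/* may then run on CPU B, so
						 * CPU A's delta is never marked */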
  
Marcelo Tosatti Jan. 10, 2023, 8:09 p.m. UTC | #4
On Tue, Jan 10, 2023 at 02:39:08PM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Frederic Weisbecker wrote:
> 
> > Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> > while reviewing Marcelo's series, so git blame pointed me to:
> >
> > 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> > interrupt disable / enable")
> >
> > And this seem to mention that this can race with IRQs as well, hence the local
> > cmpxchg operation.
> 
> The race with irq could be an issue but I thought we avoided that and were
> content with disabling preemption.
> 
> But this issue illustrates the central problem of the patchset: It makes
> the lightweight counters not so lightweight anymore. 

https://lkml.iu.edu/hypermail/linux/kernel/0903.2/00569.html

With added

static void do_test_preempt(void)
{
        unsigned long flags;
        unsigned int i;
        cycles_t time1, time2, time;
        u32 rem;

        local_irq_save(flags);
        preempt_disable();
        time1 = get_cycles();
        for (i = 0; i < NR_LOOPS; i++) {
                preempt_disable();
                preempt_enable();
        }
        time2 = get_cycles();
        local_irq_restore(flags);
        preempt_enable();
        time = time2 - time1;

        printk(KERN_ALERT "test results: time for disabling/enabling preemption\n");
        printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
        printk(KERN_ALERT "total time: %llu\n", time);
        time = div_u64_rem(time, NR_LOOPS, &rem);
        printk(KERN_ALERT "-> enabling/disabling preemption takes %llu cycles\n",
time);
        printk(KERN_ALERT "test end\n");
}


model name	: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

[  423.676079] test init
[  423.676249] test results: time for baseline
[  423.676405] number of loops: 200000
[  423.676676] total time: 104274
[  423.676910] -> baseline takes 0 cycles
[  423.677051] test end
[  423.678150] test results: time for locked cmpxchg
[  423.678353] number of loops: 200000
[  423.678498] total time: 2473839
[  423.678630] -> locked cmpxchg takes 12 cycles
[  423.678810] test end
[  423.679204] test results: time for non locked cmpxchg
[  423.679394] number of loops: 200000
[  423.679527] total time: 740298
[  423.679644] -> non locked cmpxchg takes 3 cycles
[  423.679817] test end
[  423.680755] test results: time for locked add return
[  423.680951] number of loops: 200000
[  423.681089] total time: 2118185
[  423.681229] -> locked add return takes 10 cycles
[  423.681411] test end
[  423.681846] test results: time for enabling interrupts (STI)
[  423.682063] number of loops: 200000
[  423.682209] total time: 861591
[  423.682335] -> enabling interrupts (STI) takes 4 cycles
[  423.682532] test end
[  423.683606] test results: time for disabling interrupts (CLI)
[  423.683852] number of loops: 200000
[  423.684006] total time: 2440756
[  423.684141] -> disabling interrupts (CLI) takes 12 cycles
[  423.684588] test end
[  423.686626] test results: time for disabling/enabling interrupts (STI/CLI)
[  423.686879] number of loops: 200000
[  423.687015] total time: 4802297
[  423.687139] -> enabling/disabling interrupts (STI/CLI) takes 24 cycles
[  423.687389] test end
[  423.688025] test results: time for disabling/enabling preemption
[  423.688258] number of loops: 200000
[  423.688396] total time: 1341001
[  423.688526] -> enabling/disabling preemption takes 6 cycles
[  423.689276] test end

> The basic primitives add a  lot of weight. 

Can't see any alternative given the necessity to avoid interruption
by the work to sync per-CPU vmstats to global vmstats.

> And the pre cpu atomic updates operations require the modification
> of multiple values. The operation 
> cannot be "atomic" in that sense anymore and we need some other form of
> synchronization that can
> span multiple instructions.

    So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
    count on preremption being disabled we still have some minor issues.
    The fetching of the counter thresholds is racy.
    A threshold from another cpu may be applied if we happen to be
    rescheduled on another cpu.  However, the following vmstat operation
    will then bring the counter again under the threshold limit.

Those small issues are gone, OTOH.
  
Christoph Lameter Jan. 11, 2023, 8:42 a.m. UTC | #5
On Tue, 10 Jan 2023, Marcelo Tosatti wrote:

> > The basic primitives add a  lot of weight.
>
> Can't see any alternative given the necessity to avoid interruption
> by the work to sync per-CPU vmstats to global vmstats.

this_cpu operations are designed to operate on a *single* value (a counter) and can
be run on an arbitrary cpu. There is no preemption or interrupt
disable required since the counters of all cpus will be added up at the
end.

You want *two* values (the counter and the dirty flag) to be modified
together and want to use the counters/flag to identify the cpu where
these events occurred. this_cpu_xxx operations are not suitable for that
purpose. You would need a way to ensure that both operations occur on the
same cpu.

> > > And the pre cpu atomic updates operations require the modification
> > of multiple values. The operation
> > cannot be "atomic" in that sense anymore and we need some other form of
> > synchronization that can
> > span multiple instructions.
>
>     So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
>     count on preremption being disabled we still have some minor issues.
>     The fetching of the counter thresholds is racy.
>     A threshold from another cpu may be applied if we happen to be
>     rescheduled on another cpu.  However, the following vmstat operation
>     will then bring the counter again under the threshold limit.
>
> Those small issues are gone, OTOH.

Well you could use this_cpu_cmpxchg128 to update a 64-bit counter and a
flag at the same time. Otherwise you will have to switch off preemption or
interrupts when incrementing the counters and updating the dirty flag.

Thus you do not really need the this_cpu operations anymore. It would be
best to use a preempt_disable section and use C operators -- ++ for the
counter and regular assignment for the flag.
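
A rough sketch of that suggestion (assumed, not from the series; the threshold
and overflow handling of the real functions is omitted):

	/* Pin the CPU, then use plain C operators on the local copies. */
	preempt_disable();
	this_cpu_ptr(zone->per_cpu_zonestats)->vm_stat_diff[item] += delta;
	__this_cpu_write(vmstat_dirty, true);	/* regular assignment for the flag */
	preempt_enable();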
  
Marcelo Tosatti Jan. 11, 2023, 5:07 p.m. UTC | #6
On Wed, Jan 11, 2023 at 09:42:18AM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Marcelo Tosatti wrote:
> 
> > > The basic primitives add a  lot of weight.
> >
> > Can't see any alternative given the necessity to avoid interruption
> > by the work to sync per-CPU vmstats to global vmstats.
> 
> this_cpu operations are designed to operate on a *single* value (a counter) and can
> be run on an arbitrary cpu, There is no preemption or interrupt
> disable required since the counters of all cpus will be added up at the
> end.
> 
> You want *two* values (the counter and the dirty flag) to be modified
> together and want to use the counters/flag to identify the cpu where
> these events occurred. this_cpu_xxx operations are not suitable for that
> purpose. You would need a way to ensure that both operations occur on the
> same cpu.

Which is either preempt_disable (CONFIG_HAVE_CMPXCHG_LOCAL case), or 
local_irq_disable (!CONFIG_HAVE_CMPXCHG_LOCAL case).
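
Schematically (a sketch of the split being discussed, not the actual diff;
declarations omitted, mirroring how mod_zone_page_state() already falls back
to an IRQ-disabled path when CONFIG_HAVE_CMPXCHG_LOCAL is unset):

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
	/* this_cpu_cmpxchg() copes with IRQs on its own; only pin the CPU. */
	preempt_disable();
	/* ... this_cpu_cmpxchg() loop on the counter delta ... */
	vmstat_mark_dirty();
	preempt_enable();
#else
	/* No local cmpxchg: interrupts must be masked around the update. */
	local_irq_save(flags);
	/* ... plain read/modify/write of the counter delta ... */
	vmstat_mark_dirty();
	local_irq_restore(flags);
#endif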

> > > > And the pre cpu atomic updates operations require the modification
> > > of multiple values. The operation
> > > cannot be "atomic" in that sense anymore and we need some other form of
> > > synchronization that can
> > > span multiple instructions.
> >
> >     So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
> >     count on preremption being disabled we still have some minor issues.
> >     The fetching of the counter thresholds is racy.
> >     A threshold from another cpu may be applied if we happen to be
> >     rescheduled on another cpu.  However, the following vmstat operation
> >     will then bring the counter again under the threshold limit.
> >
> > Those small issues are gone, OTOH.
> 
> Well you could use this_cpu_cmpxchg128 to update a 64 bit counter and a
> flag at the same time. 

But then you transform the "per-CPU vmstat is dirty" bit (bool) into a
number of flags that must be scanned (when returning to userspace).

That increases the overhead of a fast path (return to userspace).

> Otherwise you will have to switch off preemption or
> interrupts when incrementing the counters and updating the dirty flag.
> 
> Thus you do not really need the this_cpu operations anymore. It would
> best to use a preempt_disable section and uuse C operators -- ++ for the
> counter and do regular assignment for the flag.

OK, can replace this_cpu operations with this_cpu_ptr + standard C operators 
(and in fact can do that for interrupt disabled functions as well, that
is CONFIG_HAVE_CMPXCHG_LOCAL not defined).

Is that it?
  
Christoph Lameter Jan. 16, 2023, 9:51 a.m. UTC | #7
On Wed, 11 Jan 2023, Marcelo Tosatti wrote:

> OK, can replace this_cpu operations with this_cpu_ptr + standard C operators
> (and in fact can do that for interrupt disabled functions as well, that
> is CONFIG_HAVE_CMPXCHG_LOCAL not defined).
>
> Is that it?

No, that was hypothetical.

I do not know how to get out of this dilemma. We surely want to keep fast
vmstat operations working.

The fundamental issue that causes the vmstat discrepancies is likely that
the fast this_cpu ops can increment the counter on any random cpu and that
this is the reason you get vmstat discrepancies.

Give up the assumption that an increment of a this_cpu counter on a
specific cpu means that something occurred on that specific cpu. Maybe
that will get you on a path to resolve the issues you are seeing.
  
Marcelo Tosatti Jan. 16, 2023, 4:11 p.m. UTC | #8
On Mon, Jan 16, 2023 at 10:51:40AM +0100, Christoph Lameter wrote:
> On Wed, 11 Jan 2023, Marcelo Tosatti wrote:
> 
> > OK, can replace this_cpu operations with this_cpu_ptr + standard C operators
> > (and in fact can do that for interrupt disabled functions as well, that
> > is CONFIG_HAVE_CMPXCHG_LOCAL not defined).
> >
> > Is that it?
> 
> No that was hyperthetical.
> 
> I do not know how to get out of this dilemma. We surely want to keep fast
> vmstat operations working.

Honestly, to me, there is no dilemma:

* There is a requirement from applications to be uninterrupted
by operating system activities. Examples include radio access
network software, software defined PLCs for industrial automation (1).

* There exist VM statistics counters (which count
the number of pages in different states, for example the number of
free pages, locked pages, pages under writeback, pagetable pages,
file pages, etc.).
To reduce the number of accesses to the global counters, each CPU maintains
its own delta relative to the global VM counters
(which can be cached in the local processor cache and is therefore fast).

The per-CPU deltas are synchronized to the global counters:
	1) If the per-CPU counters exceed a given threshold (sketched below).
	2) Periodically, with a low frequency compared to
	   CPU events (on the order of seconds).
	3) Upon an event that requires accurate counters.
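
For reference, the threshold case (1) is the existing logic in
__mod_zone_page_state(), roughly (abridged, declarations omitted):

	x = delta + __this_cpu_read(*p);
	t = __this_cpu_read(pcp->stat_threshold);
	if (unlikely(abs(x) > t)) {
		/* Over threshold: fold the delta into the global counters. */
		zone_page_state_add(x, zone, item);
		x = 0;
	}
	__this_cpu_write(*p, x);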

* The periodic synchronization interrupts a given CPU, in case
it contains a counter delta relative to the global counters.

To avoid this interruption, due to [1], the proposed patchset
synchronizes any pending per-CPU deltas to global counters,
for nohz_full= CPUs, when returning to userspace
(which is a very fast path).

Since return to userspace is a very fast path, detecting pending
per-CPU counter deltas by reading their contents is undesired.

Therefore a single bit is introduced to compact the following
information: does this CPU contain any delta relative to the
global counters that should be written back?

This bit is set when a per-CPU delta is increased.
This bit is cleared when the per-CPU deltas are written back to 
the global counters.

Since for the following two operations:

	modify per-CPU delta (for current CPU) of counter X by Y
	set bit (for current CPU) indicating the per-CPU delta exists

"current CPU" can change, it is necessary to disable CPU preemption
when executing the pair of operations.

vmstat operations still achieve their main goal, which is to
keep accesses local to the CPU when incrementing the counters
(for most counter modifications).
The preempt count touched by the preempt_disable/enable pair is
itself a per-CPU variable.

Now you are objecting to this patchset because:

It increases the number of cycles to execute the function to modify 
the counters by 6. Can you mention any benchmark where this 
increase is significant?

By searching for mod_zone_page_state/mod_node_page_state one can see
the following: the codepaths that call them touch multiple pages
and other data structures, so the preempt_disable/preempt_enable
pair should be a very small contribution (in terms of percentage) to any
meaningful benchmark.


> The fundamental issue that causes the vmstat discrepancies is likely that
> the fast this_cpu ops can increment the counter on any random cpu and that
> this is the reason you get vmstat discrepancies.

Yes. 

> Give up the assumption that an increment of a this_cpu counter on a
> specific cpu means that something occurred on that specific cpu. Maybe
> that will get you on a path to resolve the issues you are seeing.

But it can't. To be able to condense the information "does a delta exist
on this CPU" from a number of cacheline reads to a single cacheline
read, one bit can be used. And the write to that bit and to the counters
is not atomic.

Alternatives:
	1) Disable periodic synchronization for nohz_full CPUs.
	2) Processor instructions which can modify more than
	   one address in memory.
	3) Synchronize the per-CPU stats remotely (which
	   increases per-CPU and per-node accesses).
  
Christoph Lameter Jan. 17, 2023, 12:52 p.m. UTC | #9
On Mon, 16 Jan 2023, Marcelo Tosatti wrote:

> Honestly, to me, there is no dilemma:
>
> * There is a requirement from applications to be uninterrupted
> by operating system activities. Examples include radio access
> network software, software defined PLCs for industrial automation (1).
>
> * There exists vm-statistics counters (which count
> the number of pages on different states, for example, number of
> free pages, locked pages, pages under writeback, pagetable pages,
> file pages, etc).
> To reduce number of accesses to the global counters, each CPU maintains
> its own delta relative to the global VM counters
> (which can be cached in the local processor cache, therefore fast).

The counters only count accurately as a global sum. A counter may be
specific to a zone, in which case it counts uses of that zone from
all processors.

> Now you are objecting to this patchset because:
>
> It increases the number of cycles to execute the function to modify
> the counters by 6. Can you mention any benchmark where this
> increase is significant?

I am objecting because of a fundamental misunderstanding of how these
counters work and because the patchset is incorrect in the way it handles
these counters. Come up with a correct approach and then we can consider
regressions and/or improvements in performance.

> Alternatives:
> 	1) Disable periodic synchronization for nohz_full CPUs.
> 	2) Processor instructions which can modify more than
> 	   one address in memory.
> 	3) Synchronize the per-CPU stats remotely (which
> 	   increases per-CPU and per-node accesses).

Or remove the assumptions that may exist in current code that a delta on a
specific cpu counter means that something occurred on that cpu?

If there is a delta then that *does not* mean that there is something to
do on that processor. The delta could be folded by another processor into
the global counter if that processor is idle or not entering the Kernel
and stays that way throughout the operation.

So I guess that would be #3. The function cpu_vm_stats_fold() already does
this for offline cpus. Can something similar be made to work for idle cpus
or those continually running in user space?
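
A very rough sketch of that idea (hypothetical, loosely modeled on
cpu_vm_stats_fold(); node counters are ignored, and it assumes the s8 diffs
could be drained with an atomic xchg(), which would also require the local
update path to become atomic -- neither is true of the current code):

static void fold_remote_zone_stats(int cpu)
{
	struct zone *zone;
	int i;

	for_each_populated_zone(zone) {
		struct per_cpu_zonestat *pzstats =
			per_cpu_ptr(zone->per_cpu_zonestats, cpu);

		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
			/* Drain the remote CPU's delta and fold it globally. */
			int v = xchg(&pzstats->vm_stat_diff[i], 0);

			if (v)
				zone_page_state_add(v, zone, i);
		}
	}
}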
  

Patch

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -381,6 +381,7 @@  void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -417,6 +418,7 @@  void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -577,6 +579,9 @@  static inline void mod_zone_state(struct
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
+	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
+	preempt_disable();
+
 	do {
 		z = 0;  /* overflow to zone counters */
 
@@ -606,6 +611,8 @@  static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	vmstat_mark_dirty();
+	preempt_enable();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -645,6 +652,8 @@  static inline void mod_node_state(struct
 		delta >>= PAGE_SHIFT;
 	}
 
+	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
+	preempt_disable();
 	do {
 		z = 0;  /* overflow to node counters */
 
@@ -674,6 +683,8 @@  static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	vmstat_mark_dirty();
+	preempt_enable();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -828,6 +839,14 @@  static int refresh_cpu_vm_stats(bool do_
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Clear vmstat_dirty before clearing the percpu vmstats.
+	 * If interrupts are enabled, it is possible that an interrupt
+	 * or another task modifies a percpu vmstat, which will
+	 * set vmstat_dirty to true.
+	 */
+	vmstat_clear_dirty();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
 #ifdef CONFIG_NUMA
@@ -1957,35 +1976,6 @@  static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
@@ -1995,10 +1985,7 @@  void quiet_vmstat(void)
 	if (system_state != SYSTEM_RUNNING)
 		return;
 
-	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
-		return;
-
-	if (!need_update(smp_processor_id()))
+	if (!is_vmstat_dirty())
 		return;
 
 	/*
@@ -2029,7 +2016,7 @@  static void vmstat_shepherd(struct work_
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
-		if (!delayed_work_pending(dw) && need_update(cpu))
+		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
 		cond_resched();