mm: convert mm's rss stats into percpu_counter

Message ID 20221024052841.3291983-1-shakeelb@google.com
State New
Series mm: convert mm's rss stats into percpu_counter

Commit Message

Shakeel Butt Oct. 24, 2022, 5:28 a.m. UTC
  Currently mm_struct maintains rss_stats which are updated on the page
fault and unmapping codepaths. For the page fault codepath the updates
are cached per thread, with a batch of TASK_RSS_EVENTS_THRESH which is
64. The reason for caching is performance for multithreaded
applications; otherwise the rss_stats updates may become a hotspot for
such applications.

However this optimization comes at the cost of an error margin in the
rss stats. The rss_stats for applications with a large number of
threads can be very skewed. At worst the error margin is
(nr_threads * 64), and we have a lot of applications with 100s of
threads, so the error margin can be very high. Internally we had to
reduce TASK_RSS_EVENTS_THRESH to 32.

Recently we started seeing unbounded errors in rss_stats for specific
applications which use TCP rx0cp (receive zerocopy). It seems like the
vm_insert_pages() codepath does not sync rss_stats at all.

This patch converts the rss_stats into percpu_counter, which changes
the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
Moreover, this conversion enables us to get accurate stats for
situations where accuracy is more important than the cpu cost, though
this patch does not make such tradeoffs.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/mm.h             | 26 ++++--------
 include/linux/mm_types.h       |  7 +---
 include/linux/mm_types_task.h  | 13 ------
 include/linux/percpu_counter.h |  1 -
 include/linux/sched.h          |  3 --
 include/trace/events/kmem.h    |  8 ++--
 kernel/fork.c                  | 16 +++++++-
 mm/memory.c                    | 73 +++++-----------------------------
 8 files changed, 40 insertions(+), 107 deletions(-)
  

Comments

Andrew Morton Oct. 24, 2022, 10:30 p.m. UTC | #1
On Mon, 24 Oct 2022 05:28:41 +0000 Shakeel Butt <shakeelb@google.com> wrote:

> Currently mm_struct maintains rss_stats which are updated on page fault
> and the unmapping codepaths. For page fault codepath the updates are
> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> The reason for caching is performance for multithreaded applications
> otherwise the rss_stats updates may become hotspot for such
> applications.
> 
> However this optimization comes with the cost of error margin in the rss
> stats. The rss_stats for applications with large number of threads can
> be very skewed. At worst the error margin is (nr_threads * 64) and we
> have a lot of applications with 100s of threads, so the error margin can
> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> 
> Recently we started seeing the unbounded errors for rss_stats for
> specific applications which use TCP rx0cp. It seems like
> vm_insert_pages() codepath does not sync rss_stats at all.
> 
> This patch converts the rss_stats into percpu_counter to convert the
> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).

Confused.  The max error should be O(nr_cpus)?

> However this conversion enable us to get the accurate stats for
> situations where accuracy is more important than the cpu cost. Though
> this patch does not make such tradeoffs.

Curiosity.  Can you expand on the final sentence here?

>  8 files changed, 40 insertions(+), 107 deletions(-)

There's that, too.
  
Shakeel Butt Oct. 24, 2022, 11:14 p.m. UTC | #2
On Mon, Oct 24, 2022 at 03:30:22PM -0700, Andrew Morton wrote:
> On Mon, 24 Oct 2022 05:28:41 +0000 Shakeel Butt <shakeelb@google.com> wrote:
> 
> > Currently mm_struct maintains rss_stats which are updated on page fault
> > and the unmapping codepaths. For page fault codepath the updates are
> > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> > The reason for caching is performance for multithreaded applications
> > otherwise the rss_stats updates may become hotspot for such
> > applications.
> > 
> > However this optimization comes with the cost of error margin in the rss
> > stats. The rss_stats for applications with large number of threads can
> > be very skewed. At worst the error margin is (nr_threads * 64) and we
> > have a lot of applications with 100s of threads, so the error margin can
> > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> > 
> > Recently we started seeing the unbounded errors for rss_stats for
> > specific applications which use TCP rx0cp. It seems like
> > vm_insert_pages() codepath does not sync rss_stats at all.
> > 
> > This patch converts the rss_stats into percpu_counter to convert the
> > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> 
> Confused.  The max error should be O(nr_cpus)?
> 

So, percpu_counter code sets the percpu batch in the following way:

static int compute_batch_value(unsigned int cpu)
{
        int nr = num_online_cpus();

        percpu_counter_batch = max(32, nr*2);
        return 0;
}

This means each cpu can cache up to (nr_cpus*2) updates. In practice
the number of cpus does not change and is usually much smaller than the
number of threads of large applications, so the error margin is lower.
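
For a concrete feel of the bound, here is a back-of-the-envelope sketch
(my own illustration, not kernel code): each cpu can hold up to
(batch - 1) unflushed updates, so the worst-case drift of one counter
is roughly nr_cpus * batch:

#include <stdio.h>

/*
 * Rough illustration only: worst-case drift of a single percpu_counter,
 * assuming batch = max(32, nr_cpus * 2) as in compute_batch_value()
 * above, with up to (batch - 1) unflushed updates cached per cpu.
 */
static long max_drift(int nr_cpus)
{
	int batch = nr_cpus * 2 > 32 ? nr_cpus * 2 : 32;

	return (long)nr_cpus * (batch - 1);
}

int main(void)
{
	printf("  8 cpus -> drift <= %ld pages\n", max_drift(8));	/*   248 */
	printf(" 64 cpus -> drift <= %ld pages\n", max_drift(64));	/*  8128 */
	printf("192 cpus -> drift <= %ld pages\n", max_drift(192));	/* 73536 */
	return 0;
}

The old per-thread scheme's bound of (nr_threads * 64) instead grows
with the number of threads, while the bound above is fixed by the
machine size.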

> > However this conversion enable us to get the accurate stats for
> > situations where accuracy is more important than the cpu cost. Though
> > this patch does not make such tradeoffs.
> 
> Curiousity.  Can you expand on the final sentence here?
> 

Basically we can just use percpu_counter_add_local() for the updates and
percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read())
for the readers. At the moment the readers are the procfs interface, the
oom_killer and memory reclaim, which I think are not performance
critical and should be ok with a slow read. However I think we can make
that change in a separate patch.
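
To make that concrete, here is a minimal sketch of what such a follow-up
could look like (not part of this patch and untested; it assumes we keep
the mm->rss_stat[] percpu_counter array introduced here):

/*
 * Sketch only: trade read-side cost for accuracy.  Writers never spill
 * into the shared fbc->count (PERCPU_COUNTER_LOCAL_BATCH), readers fold
 * all per-cpu slots for an exact value.
 */
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	percpu_counter_add_local(&mm->rss_stat[member], value);

	mm_trace_rss_stat(mm, member);
}

static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	/* exact but slow: walks the online cpus under fbc->lock */
	return (unsigned long)percpu_counter_sum_positive(&mm->rss_stat[member]);
}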

thanks,
Shakeel
  
Marek Szyprowski Nov. 2, 2022, 9:09 p.m. UTC | #3
Hi

On 24.10.2022 07:28, Shakeel Butt wrote:
> Currently mm_struct maintains rss_stats which are updated on page fault
> and the unmapping codepaths. For page fault codepath the updates are
> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> The reason for caching is performance for multithreaded applications
> otherwise the rss_stats updates may become hotspot for such
> applications.
>
> However this optimization comes with the cost of error margin in the rss
> stats. The rss_stats for applications with large number of threads can
> be very skewed. At worst the error margin is (nr_threads * 64) and we
> have a lot of applications with 100s of threads, so the error margin can
> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
>
> Recently we started seeing the unbounded errors for rss_stats for
> specific applications which use TCP rx0cp. It seems like
> vm_insert_pages() codepath does not sync rss_stats at all.
>
> This patch converts the rss_stats into percpu_counter to convert the
> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> However this conversion enable us to get the accurate stats for
> situations where accuracy is more important than the cpu cost. Though
> this patch does not make such tradeoffs.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

This patch landed recently in linux-next as commit d59f19a7a068 ("mm: 
convert mm's rss stats into percpu_counter"). Unfortunately it causes a 
regression on my test systems. I've noticed that it triggers a 'BUG: Bad 
rss-counter state' warning from time to time for random processes. This 
is somehow related to CPU hot-plug and/or system suspend/resume. The 
easiest way to reproduce this issue (although not always) on my test 
systems (ARM or ARM64 based) is to run the following commands:

root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1
BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2
BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15
BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2
BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15

Let me know if I can help debugging this somehow or testing a fix.

> ---
>   include/linux/mm.h             | 26 ++++--------
>   include/linux/mm_types.h       |  7 +---
>   include/linux/mm_types_task.h  | 13 ------
>   include/linux/percpu_counter.h |  1 -
>   include/linux/sched.h          |  3 --
>   include/trace/events/kmem.h    |  8 ++--
>   kernel/fork.c                  | 16 +++++++-
>   mm/memory.c                    | 73 +++++-----------------------------
>   8 files changed, 40 insertions(+), 107 deletions(-)
>
> ...

Best regards
  
Shakeel Butt Nov. 3, 2022, 5:14 p.m. UTC | #4
On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote:
> Hi
> 
> On 24.10.2022 07:28, Shakeel Butt wrote:
> > Currently mm_struct maintains rss_stats which are updated on page fault
> > and the unmapping codepaths. For page fault codepath the updates are
> > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> > The reason for caching is performance for multithreaded applications
> > otherwise the rss_stats updates may become hotspot for such
> > applications.
> >
> > However this optimization comes with the cost of error margin in the rss
> > stats. The rss_stats for applications with large number of threads can
> > be very skewed. At worst the error margin is (nr_threads * 64) and we
> > have a lot of applications with 100s of threads, so the error margin can
> > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> >
> > Recently we started seeing the unbounded errors for rss_stats for
> > specific applications which use TCP rx0cp. It seems like
> > vm_insert_pages() codepath does not sync rss_stats at all.
> >
> > This patch converts the rss_stats into percpu_counter to convert the
> > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> > However this conversion enable us to get the accurate stats for
> > situations where accuracy is more important than the cpu cost. Though
> > this patch does not make such tradeoffs.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> 
> This patch landed recently in linux-next as commit d59f19a7a068 ("mm: 
> convert mm's rss stats into percpu_counter"). Unfortunately it causes a 
> regression on my test systems. I've noticed that it triggers a 'BUG: Bad 
> rss-counter state' warning from time to time for random processes. This 
> is somehow related to CPU hot-plug and/or system suspend/resume. The 
> easiest way to reproduce this issue (although not always) on my test 
> systems (ARM or ARM64 based) is to run the following commands:
> 
> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 
>  >$i/online;
> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1
> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2
> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15
> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2
> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15
> 
> Let me know if I can help debugging this somehow or testing a fix.
> 

Hi Marek,

Thanks for the report. It seems like there is a race between
for_each_online_cpu() in __percpu_counter_sum() and
percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
percpu_counter users, but check_mm() is not happy with it. Can you
please try the following patch:


From: Shakeel Butt <shakeelb@google.com>
Date: Thu, 3 Nov 2022 06:05:13 +0000
Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum
 interface

percpu_counter_sum can race with cpu offlining. Add a new interface
which does not race with it and use that for check_mm().
---
 include/linux/percpu_counter.h | 11 +++++++++++
 kernel/fork.c                  |  2 +-
 lib/percpu_counter.c           | 24 ++++++++++++++++++------
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index bde6c4c1f405..3070c1043acf 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
+s64 __percpu_counter_sum_all(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
 void percpu_counter_sync(struct percpu_counter *fbc);
 
@@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 	return __percpu_counter_sum(fbc);
 }
 
+static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum_all(fbc);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -193,6 +199,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 	return percpu_counter_read(fbc);
 }
 
+static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
+{
+	return percpu_counter_read(fbc);
+}
+
 static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
 {
 	return true;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9c32f593ef11..7d6f510cf397 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = percpu_counter_sum_all(&mm->rss_stat[i]);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ed610b75dc32..f26a1a5df399 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -117,11 +117,8 @@ void percpu_counter_sync(struct percpu_counter *fbc)
 }
 EXPORT_SYMBOL(percpu_counter_sync);
 
-/*
- * Add up all the per-cpu counts, return the result.  This is a more accurate
- * but much slower version of percpu_counter_read_positive()
- */
-s64 __percpu_counter_sum(struct percpu_counter *fbc)
+static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
+			      const struct cpumask *cpu_mask)
 {
 	s64 ret;
 	int cpu;
@@ -129,15 +126,30 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 
 	raw_spin_lock_irqsave(&fbc->lock, flags);
 	ret = fbc->count;
-	for_each_online_cpu(cpu) {
+	for_each_cpu(cpu, cpu_mask) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		ret += *pcount;
 	}
 	raw_spin_unlock_irqrestore(&fbc->lock, flags);
 	return ret;
 }
+
+/*
+ * Add up all the per-cpu counts, return the result.  This is a more accurate
+ * but much slower version of percpu_counter_read_positive()
+ */
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum_mask(fbc, cpu_online_mask);
+}
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+s64 __percpu_counter_sum_all(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum_mask(fbc, cpu_possible_mask);
+}
+EXPORT_SYMBOL(__percpu_counter_sum_all);
+
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
 			  struct lock_class_key *key)
 {
  
Marek Szyprowski Nov. 3, 2022, 11:02 p.m. UTC | #5
Hi,

On 03.11.2022 18:14, Shakeel Butt wrote:
> On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote:
>> On 24.10.2022 07:28, Shakeel Butt wrote:
>>> Currently mm_struct maintains rss_stats which are updated on page fault
>>> and the unmapping codepaths. For page fault codepath the updates are
>>> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
>>> The reason for caching is performance for multithreaded applications
>>> otherwise the rss_stats updates may become hotspot for such
>>> applications.
>>>
>>> However this optimization comes with the cost of error margin in the rss
>>> stats. The rss_stats for applications with large number of threads can
>>> be very skewed. At worst the error margin is (nr_threads * 64) and we
>>> have a lot of applications with 100s of threads, so the error margin can
>>> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
>>>
>>> Recently we started seeing the unbounded errors for rss_stats for
>>> specific applications which use TCP rx0cp. It seems like
>>> vm_insert_pages() codepath does not sync rss_stats at all.
>>>
>>> This patch converts the rss_stats into percpu_counter to convert the
>>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
>>> However this conversion enable us to get the accurate stats for
>>> situations where accuracy is more important than the cpu cost. Though
>>> this patch does not make such tradeoffs.
>>>
>>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
>> This patch landed recently in linux-next as commit d59f19a7a068 ("mm:
>> convert mm's rss stats into percpu_counter"). Unfortunately it causes a
>> regression on my test systems. I've noticed that it triggers a 'BUG: Bad
>> rss-counter state' warning from time to time for random processes. This
>> is somehow related to CPU hot-plug and/or system suspend/resume. The
>> easiest way to reproduce this issue (although not always) on my test
>> systems (ARM or ARM64 based) is to run the following commands:
>>
>> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0
>>   >$i/online;
>> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1
>> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2
>> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15
>> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2
>> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15
>>
>> Let me know if I can help debugging this somehow or testing a fix.
>>
> Hi Marek,
>
> Thanks for the report. It seems like there is a race between
> for_each_online_cpu() in __percpu_counter_sum() and
> percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
> percpu_counter users but for check_mm() is not happy with this race. Can
> you please try the following patch:
>
>
> From: Shakeel Butt <shakeelb@google.com>
> Date: Thu, 3 Nov 2022 06:05:13 +0000
> Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum
>   interface
>
> percpu_counter_sum can race with cpu offlining. Add a new interface
> which does not race with it and use that for check_mm().
> ---
>   include/linux/percpu_counter.h | 11 +++++++++++
>   kernel/fork.c                  |  2 +-
>   lib/percpu_counter.c           | 24 ++++++++++++++++++------
>   3 files changed, 30 insertions(+), 7 deletions(-)


Yes, this seems to fix the issue I've reported. Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>


> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index bde6c4c1f405..3070c1043acf 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>   void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
>   			      s32 batch);
>   s64 __percpu_counter_sum(struct percpu_counter *fbc);
> +s64 __percpu_counter_sum_all(struct percpu_counter *fbc);
>   int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
>   void percpu_counter_sync(struct percpu_counter *fbc);
>   
> @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>   	return __percpu_counter_sum(fbc);
>   }
>   
> +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_all(fbc);
> +}
> +
>   static inline s64 percpu_counter_read(struct percpu_counter *fbc)
>   {
>   	return fbc->count;
> @@ -193,6 +199,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>   	return percpu_counter_read(fbc);
>   }
>   
> +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return percpu_counter_read(fbc);
> +}
> +
>   static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
>   {
>   	return true;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9c32f593ef11..7d6f510cf397 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm)
>   			 "Please make sure 'struct resident_page_types[]' is updated as well");
>   
>   	for (i = 0; i < NR_MM_COUNTERS; i++) {
> -		long x = percpu_counter_sum(&mm->rss_stat[i]);
> +		long x = percpu_counter_sum_all(&mm->rss_stat[i]);
>   
>   		if (unlikely(x))
>   			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index ed610b75dc32..f26a1a5df399 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -117,11 +117,8 @@ void percpu_counter_sync(struct percpu_counter *fbc)
>   }
>   EXPORT_SYMBOL(percpu_counter_sync);
>   
> -/*
> - * Add up all the per-cpu counts, return the result.  This is a more accurate
> - * but much slower version of percpu_counter_read_positive()
> - */
> -s64 __percpu_counter_sum(struct percpu_counter *fbc)
> +static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
> +			      const struct cpumask *cpu_mask)
>   {
>   	s64 ret;
>   	int cpu;
> @@ -129,15 +126,30 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
>   
>   	raw_spin_lock_irqsave(&fbc->lock, flags);
>   	ret = fbc->count;
> -	for_each_online_cpu(cpu) {
> +	for_each_cpu(cpu, cpu_mask) {
>   		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
>   		ret += *pcount;
>   	}
>   	raw_spin_unlock_irqrestore(&fbc->lock, flags);
>   	return ret;
>   }
> +
> +/*
> + * Add up all the per-cpu counts, return the result.  This is a more accurate
> + * but much slower version of percpu_counter_read_positive()
> + */
> +s64 __percpu_counter_sum(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_mask(fbc, cpu_online_mask);
> +}
>   EXPORT_SYMBOL(__percpu_counter_sum);
>   
> +s64 __percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_mask(fbc, cpu_possible_mask);
> +}
> +EXPORT_SYMBOL(__percpu_counter_sum_all);
> +
>   int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
>   			  struct lock_class_key *key)
>   {

Best regards
  
Shakeel Butt Nov. 4, 2022, 12:18 a.m. UTC | #6
On Thu, Nov 3, 2022 at 4:02 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Hi,
>
> On 03.11.2022 18:14, Shakeel Butt wrote:
> > On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote:
> >> On 24.10.2022 07:28, Shakeel Butt wrote:
> >>> Currently mm_struct maintains rss_stats which are updated on page fault
> >>> and the unmapping codepaths. For page fault codepath the updates are
> >>> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> >>> The reason for caching is performance for multithreaded applications
> >>> otherwise the rss_stats updates may become hotspot for such
> >>> applications.
> >>>
> >>> However this optimization comes with the cost of error margin in the rss
> >>> stats. The rss_stats for applications with large number of threads can
> >>> be very skewed. At worst the error margin is (nr_threads * 64) and we
> >>> have a lot of applications with 100s of threads, so the error margin can
> >>> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> >>>
> >>> Recently we started seeing the unbounded errors for rss_stats for
> >>> specific applications which use TCP rx0cp. It seems like
> >>> vm_insert_pages() codepath does not sync rss_stats at all.
> >>>
> >>> This patch converts the rss_stats into percpu_counter to convert the
> >>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> >>> However this conversion enable us to get the accurate stats for
> >>> situations where accuracy is more important than the cpu cost. Though
> >>> this patch does not make such tradeoffs.
> >>>
> >>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> >> This patch landed recently in linux-next as commit d59f19a7a068 ("mm:
> >> convert mm's rss stats into percpu_counter"). Unfortunately it causes a
> >> regression on my test systems. I've noticed that it triggers a 'BUG: Bad
> >> rss-counter state' warning from time to time for random processes. This
> >> is somehow related to CPU hot-plug and/or system suspend/resume. The
> >> easiest way to reproduce this issue (although not always) on my test
> >> systems (ARM or ARM64 based) is to run the following commands:
> >>
> >> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0
> >>   >$i/online;
> >> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1
> >> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2
> >> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15
> >> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2
> >> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15
> >>
> >> Let me know if I can help debugging this somehow or testing a fix.
> >>
> > Hi Marek,
> >
> > Thanks for the report. It seems like there is a race between
> > for_each_online_cpu() in __percpu_counter_sum() and
> > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
> > percpu_counter users but for check_mm() is not happy with this race. Can
> > you please try the following patch:
> >
> >
> > From: Shakeel Butt <shakeelb@google.com>
> > Date: Thu, 3 Nov 2022 06:05:13 +0000
> > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum
> >   interface
> >
> > percpu_counter_sum can race with cpu offlining. Add a new interface
> > which does not race with it and use that for check_mm().
> > ---
> >   include/linux/percpu_counter.h | 11 +++++++++++
> >   kernel/fork.c                  |  2 +-
> >   lib/percpu_counter.c           | 24 ++++++++++++++++++------
> >   3 files changed, 30 insertions(+), 7 deletions(-)
>
>
> Yes, this seems to fix the issue I've reported. Feel free to add:
>
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
>
>

Thanks a lot Marek. I will send out a formal patch later with your
reported-by and tested-by tags.
  
Andrew Morton Nov. 4, 2022, 11:05 p.m. UTC | #7
On Thu, 3 Nov 2022 17:14:07 +0000 Shakeel Butt <shakeelb@google.com> wrote:

> 
> ...
>
> Thanks for the report. It seems like there is a race between
> for_each_online_cpu() in __percpu_counter_sum() and
> percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
> percpu_counter users but for check_mm() is not happy with this race. Can
> you please try the following patch:

percpu-counters supposedly avoid such races via the hotplug notifier.
So can you please fully describe the race and let's see if it can be
fixed at the percpu_counter level?

> 
> From: Shakeel Butt <shakeelb@google.com>
> Date: Thu, 3 Nov 2022 06:05:13 +0000
> Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum
>  interface
> 
> percpu_counter_sum can race with cpu offlining. Add a new interface
> which does not race with it and use that for check_mm().

I'll grab this version for now, as others might be seeing this issue.


> ---
>  include/linux/percpu_counter.h | 11 +++++++++++
>  kernel/fork.c                  |  2 +-
>  lib/percpu_counter.c           | 24 ++++++++++++++++++------
>  3 files changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index bde6c4c1f405..3070c1043acf 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
>  			      s32 batch);
>  s64 __percpu_counter_sum(struct percpu_counter *fbc);
> +s64 __percpu_counter_sum_all(struct percpu_counter *fbc);
>  int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
>  void percpu_counter_sync(struct percpu_counter *fbc);
>  
> @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>  	return __percpu_counter_sum(fbc);
>  }
>  
> +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_all(fbc);
> +}

We haven't been good about documenting these interfaces.  Can we please
start now? ;)

>
> ...
>
> +
> +/*
> + * Add up all the per-cpu counts, return the result.  This is a more accurate
> + * but much slower version of percpu_counter_read_positive()
> + */
> +s64 __percpu_counter_sum(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_mask(fbc, cpu_online_mask);
> +}
>  EXPORT_SYMBOL(__percpu_counter_sum);
>  
> +s64 __percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_mask(fbc, cpu_possible_mask);
> +}
> +EXPORT_SYMBOL(__percpu_counter_sum_all);

Probably here is a good place to document it.

Is there any point in having the
percpu_counter_sum_all()->__percpu_counter_sum_all() inlined wrapper? 
Why not name this percpu_counter_sum_all() directly?

>  int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
>  			  struct lock_class_key *key)
>  {
> -- 
> 2.38.1.431.g37b22c650d-goog
  
Shakeel Butt Nov. 4, 2022, 11:15 p.m. UTC | #8
On Fri, Nov 4, 2022 at 4:05 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 3 Nov 2022 17:14:07 +0000 Shakeel Butt <shakeelb@google.com> wrote:
>
> >
> > ...
> >
> > Thanks for the report. It seems like there is a race between
> > for_each_online_cpu() in __percpu_counter_sum() and
> > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
> > percpu_counter users but for check_mm() is not happy with this race. Can
> > you please try the following patch:
>
> percpu-counters supposedly avoid such races via the hotplup notifier.
> So can you please fully describe the race and let's see if it can be
> fixed at the percpu_counter level?
>

Yes, I am writing a more detailed commit message explaining the race
and why it is not really an issue for current users.

> >
> > From: Shakeel Butt <shakeelb@google.com>
> > Date: Thu, 3 Nov 2022 06:05:13 +0000
> > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum
> >  interface
> >
> > percpu_counter_sum can race with cpu offlining. Add a new interface
> > which does not race with it and use that for check_mm().
>
> I'll grab this version for now, as others might be seeing this issue.
>

Thanks.

>
> > ---
> >  include/linux/percpu_counter.h | 11 +++++++++++
> >  kernel/fork.c                  |  2 +-
> >  lib/percpu_counter.c           | 24 ++++++++++++++++++------
> >  3 files changed, 30 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> > index bde6c4c1f405..3070c1043acf 100644
> > --- a/include/linux/percpu_counter.h
> > +++ b/include/linux/percpu_counter.h
> > @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
> >  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
> >                             s32 batch);
> >  s64 __percpu_counter_sum(struct percpu_counter *fbc);
> > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc);
> >  int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
> >  void percpu_counter_sync(struct percpu_counter *fbc);
> >
> > @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
> >       return __percpu_counter_sum(fbc);
> >  }
> >
> > +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
> > +{
> > +     return __percpu_counter_sum_all(fbc);
> > +}
>
> We haven't been good about documenting these interfaces.  Can we please
> start now? ;)
>

Yup will do.

> >
> > ...
> >
> > +
> > +/*
> > + * Add up all the per-cpu counts, return the result.  This is a more accurate
> > + * but much slower version of percpu_counter_read_positive()
> > + */
> > +s64 __percpu_counter_sum(struct percpu_counter *fbc)
> > +{
> > +     return __percpu_counter_sum_mask(fbc, cpu_online_mask);
> > +}
> >  EXPORT_SYMBOL(__percpu_counter_sum);
> >
> > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc)
> > +{
> > +     return __percpu_counter_sum_mask(fbc, cpu_possible_mask);
> > +}
> > +EXPORT_SYMBOL(__percpu_counter_sum_all);
>
> Probably here is a good place to document it.
>
> Is there any point in having the
> percpu_counter_sum_all()->__percpu_counter_sum_all() inlined wrapper?
> Why not name this percpu_counter_sum_all() directly?
>

Ack.

thanks,
Shakeel
  
Jan Kara June 8, 2023, 11:14 a.m. UTC | #9
On Mon 24-10-22 05:28:41, Shakeel Butt wrote:
> Currently mm_struct maintains rss_stats which are updated on page fault
> and the unmapping codepaths. For page fault codepath the updates are
> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> The reason for caching is performance for multithreaded applications
> otherwise the rss_stats updates may become hotspot for such
> applications.
> 
> However this optimization comes with the cost of error margin in the rss
> stats. The rss_stats for applications with large number of threads can
> be very skewed. At worst the error margin is (nr_threads * 64) and we
> have a lot of applications with 100s of threads, so the error margin can
> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> 
> Recently we started seeing the unbounded errors for rss_stats for
> specific applications which use TCP rx0cp. It seems like
> vm_insert_pages() codepath does not sync rss_stats at all.
> 
> This patch converts the rss_stats into percpu_counter to convert the
> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> However this conversion enable us to get the accurate stats for
> situations where accuracy is more important than the cpu cost. Though
> this patch does not make such tradeoffs.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Somewhat late to the game, but our performance testing grid has noticed
that this commit causes a performance regression on shell-heavy
workloads. For example, running 'make test' in the git sources on our
test machine with 192 CPUs takes about 4% longer, and system time is
increased by about 9%:

                       before (9cd6ffa6025)  after (f1a7941243c1)
Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*

Essentially this workload spawns a lot of short-lived tasks in sequence,
and the task startup + teardown cost is what this patch increases. To
demonstrate this more clearly, I've written a trivial (and somewhat
stupid) benchmark, shell_bench.sh:

for (( i = 0; i < 20000; i++ )); do
	/bin/true
done

And when run like:

numactl -C 1 ./shell_bench.sh

(I've forced physical CPU binding to avoid the task migrating over the
machine and cpu frequency scaling interfering, which would make the
numbers much more noisy), I get the following elapsed times:

         9cd6ffa6025    f1a7941243c1
Avg      6.807429       7.631571
Stddev   0.021797       0.016483

So some 12% regression in elapsed time. Just to be sure I've verified that
per-cpu allocator patch [1] does not improve these numbers in any
significant way.

Where do we go from here? I think in principle the problem could be
fixed by being clever: when the task has only a single thread, we don't
bother with allocating the pcpu counters (and summing them at the end)
and just account directly in mm_struct. When the second thread is
spawned, we bite the bullet, allocate the pcpu counters and switch to
the more scalable accounting. The short-lived tasks in shell workloads
and similar don't spawn any threads, so this should fix the regression.
But this is obviously easier said than done...
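
A hypothetical sketch of that idea (illustration only; rss_is_percpu,
rss_stat_atomic and mm_upgrade_rss_counters() are made-up names, and the
ordering against concurrent updaters and readers is hand-waved away):

/*
 * Single-threaded mms account into a plain atomic array; the first
 * CLONE_VM fork upgrades to percpu counters seeded with the current
 * values.  Error handling and teardown are omitted.
 */
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	if (!READ_ONCE(mm->rss_is_percpu)) {
		atomic_long_add(value, &mm->rss_stat_atomic[member]);
		return;
	}
	percpu_counter_add(&mm->rss_stat[member], value);
}

/* called from copy_process() when a second user of the mm appears */
static int mm_upgrade_rss_counters(struct mm_struct *mm)
{
	int i;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		s64 cur = atomic_long_read(&mm->rss_stat_atomic[i]);

		if (percpu_counter_init(&mm->rss_stat[i], cur,
					GFP_KERNEL_ACCOUNT))
			return -ENOMEM;
	}
	smp_wmb();	/* counters must be visible before the flag flips */
	WRITE_ONCE(mm->rss_is_percpu, true);
	return 0;
}

In a real version the two arrays would presumably share storage, and the
flag flip would have to be synchronized against concurrent updaters,
which is exactly the hairy part.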

								Honza

[1] https://lore.kernel.org/all/20230606125404.95256-1-yu.ma@intel.com/

> ---
>  include/linux/mm.h             | 26 ++++--------
>  include/linux/mm_types.h       |  7 +---
>  include/linux/mm_types_task.h  | 13 ------
>  include/linux/percpu_counter.h |  1 -
>  include/linux/sched.h          |  3 --
>  include/trace/events/kmem.h    |  8 ++--
>  kernel/fork.c                  | 16 +++++++-
>  mm/memory.c                    | 73 +++++-----------------------------
>  8 files changed, 40 insertions(+), 107 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9dec25c7d631..a8a9c3a20534 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2000,40 +2000,30 @@ static inline bool get_user_page_fast_only(unsigned long addr,
>   */
>  static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>  {
> -	long val = atomic_long_read(&mm->rss_stat.count[member]);
> -
> -#ifdef SPLIT_RSS_COUNTING
> -	/*
> -	 * counter is updated in asynchronous manner and may go to minus.
> -	 * But it's never be expected number for users.
> -	 */
> -	if (val < 0)
> -		val = 0;
> -#endif
> -	return (unsigned long)val;
> +	return percpu_counter_read_positive(&mm->rss_stat[member]);
>  }
>  
> -void mm_trace_rss_stat(struct mm_struct *mm, int member, long count);
> +void mm_trace_rss_stat(struct mm_struct *mm, int member);
>  
>  static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>  {
> -	long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);
> +	percpu_counter_add(&mm->rss_stat[member], value);
>  
> -	mm_trace_rss_stat(mm, member, count);
> +	mm_trace_rss_stat(mm, member);
>  }
>  
>  static inline void inc_mm_counter(struct mm_struct *mm, int member)
>  {
> -	long count = atomic_long_inc_return(&mm->rss_stat.count[member]);
> +	percpu_counter_inc(&mm->rss_stat[member]);
>  
> -	mm_trace_rss_stat(mm, member, count);
> +	mm_trace_rss_stat(mm, member);
>  }
>  
>  static inline void dec_mm_counter(struct mm_struct *mm, int member)
>  {
> -	long count = atomic_long_dec_return(&mm->rss_stat.count[member]);
> +	percpu_counter_dec(&mm->rss_stat[member]);
>  
> -	mm_trace_rss_stat(mm, member, count);
> +	mm_trace_rss_stat(mm, member);
>  }
>  
>  /* Optimized variant when page is already known not to be PageAnon */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index a82f06ab18a1..834022721bc6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -18,6 +18,7 @@
>  #include <linux/page-flags-layout.h>
>  #include <linux/workqueue.h>
>  #include <linux/seqlock.h>
> +#include <linux/percpu_counter.h>
>  
>  #include <asm/mmu.h>
>  
> @@ -626,11 +627,7 @@ struct mm_struct {
>  
>  		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> -		/*
> -		 * Special counters, in some configurations protected by the
> -		 * page_table_lock, in other configurations by being atomic.
> -		 */
> -		struct mm_rss_stat rss_stat;
> +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
>  
>  		struct linux_binfmt *binfmt;
>  
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index 0bb4b6da9993..5414b5c6a103 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -36,19 +36,6 @@ enum {
>  	NR_MM_COUNTERS
>  };
>  
> -#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
> -#define SPLIT_RSS_COUNTING
> -/* per-thread cached information, */
> -struct task_rss_stat {
> -	int events;	/* for synchronization threshold */
> -	int count[NR_MM_COUNTERS];
> -};
> -#endif /* USE_SPLIT_PTE_PTLOCKS */
> -
> -struct mm_rss_stat {
> -	atomic_long_t count[NR_MM_COUNTERS];
> -};
> -
>  struct page_frag {
>  	struct page *page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 8ed5fba6d156..bde6c4c1f405 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -13,7 +13,6 @@
>  #include <linux/threads.h>
>  #include <linux/percpu.h>
>  #include <linux/types.h>
> -#include <linux/gfp.h>
>  
>  /* percpu_counter batch for local add or sub */
>  #define PERCPU_COUNTER_LOCAL_BATCH	INT_MAX
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffb6eb55cd13..079d299fa465 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -870,9 +870,6 @@ struct task_struct {
>  	struct mm_struct		*mm;
>  	struct mm_struct		*active_mm;
>  
> -#ifdef SPLIT_RSS_COUNTING
> -	struct task_rss_stat		rss_stat;
> -#endif
>  	int				exit_state;
>  	int				exit_code;
>  	int				exit_signal;
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 243073cfc29d..58688768ef0f 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -346,10 +346,9 @@ TRACE_MM_PAGES
>  TRACE_EVENT(rss_stat,
>  
>  	TP_PROTO(struct mm_struct *mm,
> -		int member,
> -		long count),
> +		int member),
>  
> -	TP_ARGS(mm, member, count),
> +	TP_ARGS(mm, member),
>  
>  	TP_STRUCT__entry(
>  		__field(unsigned int, mm_id)
> @@ -362,7 +361,8 @@ TRACE_EVENT(rss_stat,
>  		__entry->mm_id = mm_ptr_to_hash(mm);
>  		__entry->curr = !!(current->mm == mm);
>  		__entry->member = member;
> -		__entry->size = (count << PAGE_SHIFT);
> +		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
> +							    << PAGE_SHIFT);
>  	),
>  
>  	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cfb09ca1b1bc..f56ad06240e1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm)
>  			 "Please make sure 'struct resident_page_types[]' is updated as well");
>  
>  	for (i = 0; i < NR_MM_COUNTERS; i++) {
> -		long x = atomic_long_read(&mm->rss_stat.count[i]);
> +		long x = percpu_counter_sum(&mm->rss_stat[i]);
>  
>  		if (unlikely(x))
>  			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
> @@ -782,6 +782,8 @@ static void check_mm(struct mm_struct *mm)
>   */
>  void __mmdrop(struct mm_struct *mm)
>  {
> +	int i;
> +
>  	BUG_ON(mm == &init_mm);
>  	WARN_ON_ONCE(mm == current->mm);
>  	WARN_ON_ONCE(mm == current->active_mm);
> @@ -791,6 +793,9 @@ void __mmdrop(struct mm_struct *mm)
>  	check_mm(mm);
>  	put_user_ns(mm->user_ns);
>  	mm_pasid_drop(mm);
> +
> +	for (i = 0; i < NR_MM_COUNTERS; i++)
> +		percpu_counter_destroy(&mm->rss_stat[i]);
>  	free_mm(mm);
>  }
>  EXPORT_SYMBOL_GPL(__mmdrop);
> @@ -1110,6 +1115,8 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	struct user_namespace *user_ns)
>  {
> +	int i;
> +
>  	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
>  	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
>  	atomic_set(&mm->mm_users, 1);
> @@ -1151,10 +1158,17 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	if (init_new_context(p, mm))
>  		goto fail_nocontext;
>  
> +	for (i = 0; i < NR_MM_COUNTERS; i++)
> +		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
> +			goto fail_pcpu;
> +
>  	mm->user_ns = get_user_ns(user_ns);
>  	lru_gen_init_mm(mm);
>  	return mm;
>  
> +fail_pcpu:
> +	while (i > 0)
> +		percpu_counter_destroy(&mm->rss_stat[--i]);
>  fail_nocontext:
>  	mm_free_pgd(mm);
>  fail_nopgd:
> diff --git a/mm/memory.c b/mm/memory.c
> index 8e72f703ed99..fea8d737e8c0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -162,58 +162,11 @@ static int __init init_zero_pfn(void)
>  }
>  early_initcall(init_zero_pfn);
>  
> -void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
> +void mm_trace_rss_stat(struct mm_struct *mm, int member)
>  {
> -	trace_rss_stat(mm, member, count);
> +	trace_rss_stat(mm, member);
>  }
>  
> -#if defined(SPLIT_RSS_COUNTING)
> -
> -void sync_mm_rss(struct mm_struct *mm)
> -{
> -	int i;
> -
> -	for (i = 0; i < NR_MM_COUNTERS; i++) {
> -		if (current->rss_stat.count[i]) {
> -			add_mm_counter(mm, i, current->rss_stat.count[i]);
> -			current->rss_stat.count[i] = 0;
> -		}
> -	}
> -	current->rss_stat.events = 0;
> -}
> -
> -static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> -{
> -	struct task_struct *task = current;
> -
> -	if (likely(task->mm == mm))
> -		task->rss_stat.count[member] += val;
> -	else
> -		add_mm_counter(mm, member, val);
> -}
> -#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
> -#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
> -
> -/* sync counter once per 64 page faults */
> -#define TASK_RSS_EVENTS_THRESH	(64)
> -static void check_sync_rss_stat(struct task_struct *task)
> -{
> -	if (unlikely(task != current))
> -		return;
> -	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
> -		sync_mm_rss(task->mm);
> -}
> -#else /* SPLIT_RSS_COUNTING */
> -
> -#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
> -#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
> -
> -static void check_sync_rss_stat(struct task_struct *task)
> -{
> -}
> -
> -#endif /* SPLIT_RSS_COUNTING */
> -
>  /*
>   * Note: this doesn't free the actual pages themselves. That
>   * has been handled earlier when unmapping all the memory regions.
> @@ -1860,7 +1813,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>  		return -EBUSY;
>  	/* Ok, finally just insert the thing.. */
>  	get_page(page);
> -	inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> +	inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>  	page_add_file_rmap(page, vma, false);
>  	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
>  	return 0;
> @@ -3156,12 +3109,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>  		if (old_page) {
>  			if (!PageAnon(old_page)) {
> -				dec_mm_counter_fast(mm,
> -						mm_counter_file(old_page));
> -				inc_mm_counter_fast(mm, MM_ANONPAGES);
> +				dec_mm_counter(mm, mm_counter_file(old_page));
> +				inc_mm_counter(mm, MM_ANONPAGES);
>  			}
>  		} else {
> -			inc_mm_counter_fast(mm, MM_ANONPAGES);
> +			inc_mm_counter(mm, MM_ANONPAGES);
>  		}
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> @@ -3968,8 +3920,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>  		folio_free_swap(folio);
>  
> -	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> -	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> +	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> +	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  
>  	/*
> @@ -4148,7 +4100,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>  
> -	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> +	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, vmf->address);
>  	lru_cache_add_inactive_or_unevictable(page, vma);
>  setpte:
> @@ -4338,11 +4290,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>  		entry = pte_mkuffd_wp(pte_wrprotect(entry));
>  	/* copy-on-write page */
>  	if (write && !(vma->vm_flags & VM_SHARED)) {
> -		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> +		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>  		page_add_new_anon_rmap(page, vma, addr);
>  		lru_cache_add_inactive_or_unevictable(page, vma);
>  	} else {
> -		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> +		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>  		page_add_file_rmap(page, vma, false);
>  	}
>  	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> @@ -5194,9 +5146,6 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	count_vm_event(PGFAULT);
>  	count_memcg_event_mm(vma->vm_mm, PGFAULT);
>  
> -	/* do counter updates before entering really critical section. */
> -	check_sync_rss_stat(current);
> -
>  	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>  					    flags & FAULT_FLAG_INSTRUCTION,
>  					    flags & FAULT_FLAG_REMOTE))
> -- 
> 2.38.0.135.g90850a2211-goog
>
  
Yu Zhao June 8, 2023, 4:33 p.m. UTC | #10
On Thu, Jun 8, 2023 at 5:14 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 24-10-22 05:28:41, Shakeel Butt wrote:
> > Currently mm_struct maintains rss_stats which are updated on page fault
> > and the unmapping codepaths. For page fault codepath the updates are
> > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
> > The reason for caching is performance for multithreaded applications
> > otherwise the rss_stats updates may become hotspot for such
> > applications.
> >
> > However this optimization comes with the cost of error margin in the rss
> > stats. The rss_stats for applications with large number of threads can
> > be very skewed. At worst the error margin is (nr_threads * 64) and we
> > have a lot of applications with 100s of threads, so the error margin can
> > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> >
> > Recently we started seeing the unbounded errors for rss_stats for
> > specific applications which use TCP rx0cp. It seems like
> > vm_insert_pages() codepath does not sync rss_stats at all.
> >
> > This patch converts the rss_stats into percpu_counter to convert the
> > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> > However this conversion enable us to get the accurate stats for
> > situations where accuracy is more important than the cpu cost. Though
> > this patch does not make such tradeoffs.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
>
> Somewhat late to the game but our performance testing grid has noticed this
> commit causes a performance regression on shell-heavy workloads. For
> example running 'make test' in git sources on our test machine with 192
> CPUs takes about 4% longer, system time is increased by about 9%:
>
>                        before (9cd6ffa6025)  after (f1a7941243c1)
> Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
> Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
> Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
> Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*
>
> Essentially this workload spawns in sequence a lot of short-lived tasks and
> the task startup + teardown cost is what this patch increases. To
> demonstrate this more clearly, I've written trivial (and somewhat stupid)
> benchmark shell_bench.sh:
>
> for (( i = 0; i < 20000; i++ )); do
>         /bin/true
> done
>
> And when run like:
>
> numactl -C 1 ./shell_bench.sh
>
> (I've forced physical CPU binding to avoid task migrating over the machine
> and cpu frequency scaling interfering which makes the numbers much more
> noisy) I get the following elapsed times:
>
>          9cd6ffa6025    f1a7941243c1
> Avg      6.807429       7.631571
> Stddev   0.021797       0.016483
>
> So some 12% regression in elapsed time. Just to be sure I've verified that
> per-cpu allocator patch [1] does not improve these numbers in any
> significant way.
>
> Where do we go from here? I think in principle the problem could be fixed
> by being clever and when the task has only a single thread, we don't bother
> with allocating pcpu counter (and summing it at the end) and just account
> directly in mm_struct. When the second thread is spawned, we bite the
> bullet, allocate pcpu counter and start with more scalable accounting.
> These shortlived tasks in shell workloads or similar don't spawn any
> threads so this should fix the regression. But this is obviously easier
> said than done...
>
>                                                                 Honza
>
> [1] https://lore.kernel.org/all/20230606125404.95256-1-yu.ma@intel.com/

Another regression reported earlier:
https://lore.kernel.org/linux-mm/202301301057.e55dad5b-oliver.sang@intel.com/
  
Shakeel Butt June 8, 2023, 5:37 p.m. UTC | #11
On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
[...]
> 
> Somewhat late to the game but our performance testing grid has noticed this
> commit causes a performance regression on shell-heavy workloads. For
> example running 'make test' in git sources on our test machine with 192
> CPUs takes about 4% longer, system time is increased by about 9%:
> 
>                        before (9cd6ffa6025)  after (f1a7941243c1)
> Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
> Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
> Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
> Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*
> 
> Essentially this workload spawns in sequence a lot of short-lived tasks and
> the task startup + teardown cost is what this patch increases. To
> demonstrate this more clearly, I've written trivial (and somewhat stupid)
> benchmark shell_bench.sh:
> 
> for (( i = 0; i < 20000; i++ )); do
> 	/bin/true
> done
> 
> And when run like:
> 
> numactl -C 1 ./shell_bench.sh
> 
> (I've forced physical CPU binding to avoid task migrating over the machine
> and cpu frequency scaling interfering which makes the numbers much more
> noisy) I get the following elapsed times:
> 
>          9cd6ffa6025    f1a7941243c1
> Avg      6.807429       7.631571
> Stddev   0.021797       0.016483
> 
> So some 12% regression in elapsed time. Just to be sure I've verified that
> per-cpu allocator patch [1] does not improve these numbers in any
> significant way.
> 
> Where do we go from here? I think in principle the problem could be fixed
> by being clever and when the task has only a single thread, we don't bother
> with allocating pcpu counter (and summing it at the end) and just account
> directly in mm_struct. When the second thread is spawned, we bite the
> bullet, allocate pcpu counter and start with more scalable accounting.
> These shortlived tasks in shell workloads or similar don't spawn any
> threads so this should fix the regression. But this is obviously easier
> said than done...
> 

Thanks Jan for the report. I wanted to improve the percpu allocation to
eliminate this regression, as it was reported by the intel test bot as
well. However your suggestion seems targeted and reasonable too. At the
moment I am travelling, so I am not sure when I will get to this. Do you
want to take a stab at it, or do you want me to do it? Also, how urgent
and sensitive is this regression for you?

thanks,
Shakeel
  
Jan Kara June 8, 2023, 6:07 p.m. UTC | #12
On Thu 08-06-23 17:37:00, Shakeel Butt wrote:
> On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
> [...]
> > 
> > Somewhat late to the game but our performance testing grid has noticed this
> > commit causes a performance regression on shell-heavy workloads. For
> > example running 'make test' in git sources on our test machine with 192
> > CPUs takes about 4% longer, system time is increased by about 9%:
> > 
> >                        before (9cd6ffa6025)  after (f1a7941243c1)
> > Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
> > Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
> > Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
> > Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*
> > 
> > Essentially this workload spawns in sequence a lot of short-lived tasks and
> > the task startup + teardown cost is what this patch increases. To
> > demonstrate this more clearly, I've written trivial (and somewhat stupid)
> > benchmark shell_bench.sh:
> > 
> > for (( i = 0; i < 20000; i++ )); do
> > 	/bin/true
> > done
> > 
> > And when run like:
> > 
> > numactl -C 1 ./shell_bench.sh
> > 
> > (I've forced physical CPU binding to avoid task migrating over the machine
> > and cpu frequency scaling interfering which makes the numbers much more
> > noisy) I get the following elapsed times:
> > 
> >          9cd6ffa6025    f1a7941243c1
> > Avg      6.807429       7.631571
> > Stddev   0.021797       0.016483
> > 
> > So some 12% regression in elapsed time. Just to be sure I've verified that
> > per-cpu allocator patch [1] does not improve these numbers in any
> > significant way.
> > 
> > Where do we go from here? I think in principle the problem could be fixed
> > by being clever and when the task has only a single thread, we don't bother
> > with allocating pcpu counter (and summing it at the end) and just account
> > directly in mm_struct. When the second thread is spawned, we bite the
> > bullet, allocate pcpu counter and start with more scalable accounting.
> > These shortlived tasks in shell workloads or similar don't spawn any
> > threads so this should fix the regression. But this is obviously easier
> > said than done...
> > 
> 
> Thanks, Jan, for the report. I wanted to improve the percpu allocation to
> eliminate this regression, since it was reported by the Intel test bot as
> well. However, your suggestion seems targeted and reasonable too. At the
> moment I am travelling, so I am not sure when I will get to this. Do you
> want to take a stab at it, or do you want me to do it? Also, how urgent and
> sensitive is this regression for you?

It is not really urgent, but eventually we'd like to get this fixed (like
within a couple of months). I currently have other stuff in progress, so if
you could get to it, that would be nice; otherwise I should be able to look
into this in a week or two.

								Honza
  
Dennis Zhou June 8, 2023, 7:10 p.m. UTC | #13
Hi Shakeel and Jan,

On Thu, Jun 08, 2023 at 05:37:00PM +0000, Shakeel Butt wrote:
> On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
> [...]
> > 
> > Somewhat late to the game but our performance testing grid has noticed this
> > commit causes a performance regression on shell-heavy workloads. For
> > example running 'make test' in git sources on our test machine with 192
> > CPUs takes about 4% longer, system time is increased by about 9%:
> > 
> >                        before (9cd6ffa6025)  after (f1a7941243c1)
> > Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
> > Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
> > Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
> > Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*
> > 
> > Essentially this workload spawns in sequence a lot of short-lived tasks and
> > the task startup + teardown cost is what this patch increases. To
> > demonstrate this more clearly, I've written trivial (and somewhat stupid)
> > benchmark shell_bench.sh:
> > 
> > for (( i = 0; i < 20000; i++ )); do
> > 	/bin/true
> > done
> > 
> > And when run like:
> > 
> > numactl -C 1 ./shell_bench.sh
> > 
> > (I've forced physical CPU binding to avoid task migrating over the machine
> > and cpu frequency scaling interfering which makes the numbers much more
> > noisy) I get the following elapsed times:
> > 
> >          9cd6ffa6025    f1a7941243c1
> > Avg      6.807429       7.631571
> > Stddev   0.021797       0.016483
> > 
> > So some 12% regression in elapsed time. Just to be sure I've verified that
> > per-cpu allocator patch [1] does not improve these numbers in any
> > significant way.
> > 
> > Where do we go from here? I think in principle the problem could be fixed
> > by being clever and when the task has only a single thread, we don't bother
> > with allocating pcpu counter (and summing it at the end) and just account
> > directly in mm_struct. When the second thread is spawned, we bite the
> > bullet, allocate pcpu counter and start with more scalable accounting.
> > These shortlived tasks in shell workloads or similar don't spawn any
> > threads so this should fix the regression. But this is obviously easier
> > said than done...
> > 
> 
> Thanks, Jan, for the report. I wanted to improve the percpu allocation to
> eliminate this regression, since it was reported by the Intel test bot as
> well. However, your suggestion seems targeted and reasonable too. At the
> moment I am travelling, so I am not sure when I will get to this. Do you
> want to take a stab at it, or do you want me to do it? Also, how urgent and
> sensitive is this regression for you?
> 
> thanks,
> Shakeel
> 
> 

I _think_ I could probably spin you a percpu_alloc_bulk() series for
percpu_counters in a couple of days. Let me try to find some time, unless
you had something different in mind.

Thanks,
Dennis
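
For what it's worth, here is a purely hypothetical sketch of how such a bulk
API could be consumed on the mm side. percpu_counter_init_bulk() and
percpu_counter_destroy_bulk() are made-up names standing in for whatever the
series would actually provide; only the existing per-counter
percpu_counter_init()/percpu_counter_destroy() loops they would replace are
real:

/*
 * Hypothetical API: set up nr counters backed by a single percpu
 * allocation (and a single round of bookkeeping) instead of nr
 * separate ones.
 */
int percpu_counter_init_bulk(struct percpu_counter *fbc, s64 amount,
			     gfp_t gfp, u32 nr);
void percpu_counter_destroy_bulk(struct percpu_counter *fbc, u32 nr);

/* mm_init() would then need a single call per mm ... */
	if (percpu_counter_init_bulk(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
				     NR_MM_COUNTERS))
		goto fail_pcpu;

/* ... and __mmdrop() (and the mm_init() error path) a single teardown. */
	percpu_counter_destroy_bulk(mm->rss_stat, NR_MM_COUNTERS);

Most of the per-fork cost added by the conversion is presumably the
NR_MM_COUNTERS separate percpu allocations plus the associated bookkeeping in
percpu_counter_init(), so collapsing them into one allocation is where the
win would come from.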
  
Shakeel Butt June 8, 2023, 7:36 p.m. UTC | #14
On Fri, Jun 9, 2023 at 12:10 AM Dennis Zhou <dennisszhou@gmail.com> wrote:
>
> Hi Shakeel and Jan,
>
> On Thu, Jun 08, 2023 at 05:37:00PM +0000, Shakeel Butt wrote:
> > On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
> > [...]
> > >
> > > Somewhat late to the game but our performance testing grid has noticed this
> > > commit causes a performance regression on shell-heavy workloads. For
> > > example running 'make test' in git sources on our test machine with 192
> > > CPUs takes about 4% longer, system time is increased by about 9%:
> > >
> > >                        before (9cd6ffa6025)  after (f1a7941243c1)
> > > Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
> > > Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
> > > Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
> > > Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*
> > >
> > > Essentially this workload spawns in sequence a lot of short-lived tasks and
> > > the task startup + teardown cost is what this patch increases. To
> > > demonstrate this more clearly, I've written trivial (and somewhat stupid)
> > > benchmark shell_bench.sh:
> > >
> > > for (( i = 0; i < 20000; i++ )); do
> > >     /bin/true
> > > done
> > >
> > > And when run like:
> > >
> > > numactl -C 1 ./shell_bench.sh
> > >
> > > (I've forced physical CPU binding to avoid task migrating over the machine
> > > and cpu frequency scaling interfering which makes the numbers much more
> > > noisy) I get the following elapsed times:
> > >
> > >          9cd6ffa6025    f1a7941243c1
> > > Avg      6.807429       7.631571
> > > Stddev   0.021797       0.016483
> > >
> > > So some 12% regression in elapsed time. Just to be sure I've verified that
> > > per-cpu allocator patch [1] does not improve these numbers in any
> > > significant way.
> > >
> > > Where do we go from here? I think in principle the problem could be fixed
> > > by being clever and when the task has only a single thread, we don't bother
> > > with allocating pcpu counter (and summing it at the end) and just account
> > > directly in mm_struct. When the second thread is spawned, we bite the
> > > bullet, allocate pcpu counter and start with more scalable accounting.
> > > These shortlived tasks in shell workloads or similar don't spawn any
> > > threads so this should fix the regression. But this is obviously easier
> > > said than done...
> > >
> >
> > Thanks, Jan, for the report. I wanted to improve the percpu allocation to
> > eliminate this regression, since it was reported by the Intel test bot as
> > well. However, your suggestion seems targeted and reasonable too. At the
> > moment I am travelling, so I am not sure when I will get to this. Do you
> > want to take a stab at it, or do you want me to do it? Also, how urgent and
> > sensitive is this regression for you?
> >
> > thanks,
> > Shakeel
> >
> >
>
> I _think_ I could probably spin you a percpu_alloc_bulk() series in a
> couple days for percpu_counters. Let me try and find some time, unless
> you had something different in mind.
>

That would be awesome and thanks a lot.
  
Linux regression tracking (Thorsten Leemhuis) June 14, 2023, 8:37 a.m. UTC | #15
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See the link in the footer if these mails annoy you.]

On 08.06.23 13:14, Jan Kara wrote:
> On Mon 24-10-22 05:28:41, Shakeel Butt wrote:
>> Currently mm_struct maintains rss_stats which are updated on page fault
>> and the unmapping codepaths. For page fault codepath the updates are
>> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
>> The reason for caching is performance for multithreaded applications
>> otherwise the rss_stats updates may become hotspot for such
>> applications.
>>
>> However this optimization comes with the cost of error margin in the rss
>> stats. The rss_stats for applications with large number of threads can
>> be very skewed. At worst the error margin is (nr_threads * 64) and we
>> have a lot of applications with 100s of threads, so the error margin can
>> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
>>
>> Recently we started seeing the unbounded errors for rss_stats for
>> specific applications which use TCP rx0cp. It seems like
>> vm_insert_pages() codepath does not sync rss_stats at all.
>>
>> This patch converts the rss_stats into percpu_counter to convert the
>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
>> However this conversion enable us to get the accurate stats for
>> situations where accuracy is more important than the cpu cost. Though
>> this patch does not make such tradeoffs.
>>
>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> 
> Somewhat late to the game but our performance testing grid has noticed this
> commit causes a performance regression on shell-heavy workloads. For
> example running 'make test' in git sources on our test machine with 192
> CPUs takes about 4% longer, system time is increased by about 9%:

Thanks for the report.

I noticed this is nothing urgent. Nevertheless, to be sure the issue
doesn't fall through the cracks unnoticed, I'm adding it to regzbot, the
Linux kernel regression tracking bot:

#regzbot ^introduced f1a7941243c
#regzbot title mm: performance regression on shell-heavy workloads
#regzbot backburner: not urgent according to reporter
#regzbot ignore-activity

This isn't a regression? Is this issue or a fix for it already being
discussed somewhere else? Was it fixed already? Do you want to clarify when
the regression started to happen? Or point out that I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained on the page linked in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
  

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9dec25c7d631..a8a9c3a20534 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2000,40 +2000,30 @@  static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	long val = atomic_long_read(&mm->rss_stat.count[member]);
-
-#ifdef SPLIT_RSS_COUNTING
-	/*
-	 * counter is updated in asynchronous manner and may go to minus.
-	 * But it's never be expected number for users.
-	 */
-	if (val < 0)
-		val = 0;
-#endif
-	return (unsigned long)val;
+	return percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
-void mm_trace_rss_stat(struct mm_struct *mm, int member, long count);
+void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);
+	percpu_counter_add(&mm->rss_stat[member], value);
 
-	mm_trace_rss_stat(mm, member, count);
+	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	long count = atomic_long_inc_return(&mm->rss_stat.count[member]);
+	percpu_counter_inc(&mm->rss_stat[member]);
 
-	mm_trace_rss_stat(mm, member, count);
+	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	long count = atomic_long_dec_return(&mm->rss_stat.count[member]);
+	percpu_counter_dec(&mm->rss_stat[member]);
 
-	mm_trace_rss_stat(mm, member, count);
+	mm_trace_rss_stat(mm, member);
 }
 
 /* Optimized variant when page is already known not to be PageAnon */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a82f06ab18a1..834022721bc6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,6 +18,7 @@ 
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/percpu_counter.h>
 
 #include <asm/mmu.h>
 
@@ -626,11 +627,7 @@  struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		/*
-		 * Special counters, in some configurations protected by the
-		 * page_table_lock, in other configurations by being atomic.
-		 */
-		struct mm_rss_stat rss_stat;
+		struct percpu_counter rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index 0bb4b6da9993..5414b5c6a103 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -36,19 +36,6 @@  enum {
 	NR_MM_COUNTERS
 };
 
-#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
-#define SPLIT_RSS_COUNTING
-/* per-thread cached information, */
-struct task_rss_stat {
-	int events;	/* for synchronization threshold */
-	int count[NR_MM_COUNTERS];
-};
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
-struct mm_rss_stat {
-	atomic_long_t count[NR_MM_COUNTERS];
-};
-
 struct page_frag {
 	struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 8ed5fba6d156..bde6c4c1f405 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -13,7 +13,6 @@ 
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
-#include <linux/gfp.h>
 
 /* percpu_counter batch for local add or sub */
 #define PERCPU_COUNTER_LOCAL_BATCH	INT_MAX
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..079d299fa465 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,9 +870,6 @@  struct task_struct {
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
 
-#ifdef SPLIT_RSS_COUNTING
-	struct task_rss_stat		rss_stat;
-#endif
 	int				exit_state;
 	int				exit_code;
 	int				exit_signal;
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 243073cfc29d..58688768ef0f 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -346,10 +346,9 @@  TRACE_MM_PAGES
 TRACE_EVENT(rss_stat,
 
 	TP_PROTO(struct mm_struct *mm,
-		int member,
-		long count),
+		int member),
 
-	TP_ARGS(mm, member, count),
+	TP_ARGS(mm, member),
 
 	TP_STRUCT__entry(
 		__field(unsigned int, mm_id)
@@ -362,7 +361,8 @@  TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (count << PAGE_SHIFT);
+		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+							    << PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index cfb09ca1b1bc..f56ad06240e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -756,7 +756,7 @@  static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = atomic_long_read(&mm->rss_stat.count[i]);
+		long x = percpu_counter_sum(&mm->rss_stat[i]);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -782,6 +782,8 @@  static void check_mm(struct mm_struct *mm)
  */
 void __mmdrop(struct mm_struct *mm)
 {
+	int i;
+
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 	WARN_ON_ONCE(mm == current->active_mm);
@@ -791,6 +793,9 @@  void __mmdrop(struct mm_struct *mm)
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
+
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		percpu_counter_destroy(&mm->rss_stat[i]);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1110,6 +1115,8 @@  static void mm_init_uprobes_state(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
+	int i;
+
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	atomic_set(&mm->mm_users, 1);
@@ -1151,10 +1158,17 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
+			goto fail_pcpu;
+
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
+fail_pcpu:
+	while (i > 0)
+		percpu_counter_destroy(&mm->rss_stat[--i]);
 fail_nocontext:
 	mm_free_pgd(mm);
 fail_nopgd:
diff --git a/mm/memory.c b/mm/memory.c
index 8e72f703ed99..fea8d737e8c0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -162,58 +162,11 @@  static int __init init_zero_pfn(void)
 }
 early_initcall(init_zero_pfn);
 
-void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
+void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
-	trace_rss_stat(mm, member, count);
+	trace_rss_stat(mm, member);
 }
 
-#if defined(SPLIT_RSS_COUNTING)
-
-void sync_mm_rss(struct mm_struct *mm)
-{
-	int i;
-
-	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		if (current->rss_stat.count[i]) {
-			add_mm_counter(mm, i, current->rss_stat.count[i]);
-			current->rss_stat.count[i] = 0;
-		}
-	}
-	current->rss_stat.events = 0;
-}
-
-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
-{
-	struct task_struct *task = current;
-
-	if (likely(task->mm == mm))
-		task->rss_stat.count[member] += val;
-	else
-		add_mm_counter(mm, member, val);
-}
-#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
-#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
-
-/* sync counter once per 64 page faults */
-#define TASK_RSS_EVENTS_THRESH	(64)
-static void check_sync_rss_stat(struct task_struct *task)
-{
-	if (unlikely(task != current))
-		return;
-	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
-		sync_mm_rss(task->mm);
-}
-#else /* SPLIT_RSS_COUNTING */
-
-#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
-#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
-
-static void check_sync_rss_stat(struct task_struct *task)
-{
-}
-
-#endif /* SPLIT_RSS_COUNTING */
-
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
@@ -1860,7 +1813,7 @@  static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 		return -EBUSY;
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
-	inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+	inc_mm_counter(vma->vm_mm, mm_counter_file(page));
 	page_add_file_rmap(page, vma, false);
 	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
 	return 0;
@@ -3156,12 +3109,11 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
-				dec_mm_counter_fast(mm,
-						mm_counter_file(old_page));
-				inc_mm_counter_fast(mm, MM_ANONPAGES);
+				dec_mm_counter(mm, mm_counter_file(old_page));
+				inc_mm_counter(mm, MM_ANONPAGES);
 			}
 		} else {
-			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			inc_mm_counter(mm, MM_ANONPAGES);
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
@@ -3968,8 +3920,8 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
+	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
@@ -4148,7 +4100,7 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address);
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
@@ -4338,11 +4290,11 @@  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
-		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, vma, false);
 	}
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
@@ -5194,9 +5146,6 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	count_vm_event(PGFAULT);
 	count_memcg_event_mm(vma->vm_mm, PGFAULT);
 
-	/* do counter updates before entering really critical section. */
-	check_sync_rss_stat(current);
-
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))