[3/3] mm: add vmstat statistics for madvise_[cold|pageout]

Message ID 20230117231632.2734737-3-minchan@kernel.org
State New
Series [1/3] mm: return the number of pages successfully paged out

Commit Message

Minchan Kim Jan. 17, 2023, 11:16 p.m. UTC
madvise LRU manipulation APIs need to scan address ranges to find
present pages in the page table and provide advice hints for them.

Like the pg[scan/steal] counts in vmstat, madvise_pg[scanned/hinted]
shows the proactive reclaim efficiency, so this patch adds those
two statistics to vmstat.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/vm_event_item.h |  2 ++
 mm/madvise.c                  | 19 +++++++++++++++----
 mm/vmstat.c                   |  2 ++
 3 files changed, 19 insertions(+), 4 deletions(-)
  

Comments

Michal Hocko Jan. 18, 2023, 9:11 a.m. UTC | #1
On Tue 17-01-23 15:16:32, Minchan Kim wrote:
> madvise LRU manipulation APIs need to scan address ranges to find
> present pages in the page table and provide advice hints for them.
> 
> Like the pg[scan/steal] counts in vmstat, madvise_pg[scanned/hinted]
> shows the proactive reclaim efficiency, so this patch adds those
> two statistics to vmstat.

Please describe the usecase for those new counters.
  
Minchan Kim Jan. 18, 2023, 5:15 p.m. UTC | #2
On Wed, Jan 18, 2023 at 10:11:46AM +0100, Michal Hocko wrote:
> Please describe the usecase for those new counters.

I wanted to know the proactive reclaim efficiency using MADV_COLD/MADV_PAGEOUT.
Userspace has several policies for when and which vmas need to be hinted by the
call, and they are evolving. I needed to know how effectively their policies
work (i.e., nr_hinted/nr_scanned) since the vma ranges are huge.
  
Michal Hocko Jan. 18, 2023, 5:27 p.m. UTC | #3
On Wed 18-01-23 09:15:34, Minchan Kim wrote:
> I wanted to know the proactive reclaim efficiency using MADV_COLD/MADV_PAGEOUT.
> Userspace has several policies for when and which vmas need to be hinted by the
> call, and they are evolving. I needed to know how effectively their policies
> work (i.e., nr_hinted/nr_scanned) since the vma ranges are huge.

I can see how that can be interesting information, but is there
anything actionable about it beyond debugging purposes? In other words,
isn't this something that could be done by tracing instead?

Also how are you going to identify specific madvise calls when they can
interleave arbitrarily?
  
Minchan Kim Jan. 18, 2023, 5:55 p.m. UTC | #4
On Wed, Jan 18, 2023 at 06:27:02PM +0100, Michal Hocko wrote:
> I can see how that can be interesting information, but is there
> anything actionable about it beyond debugging purposes? In other words,
> isn't this something that could be done by tracing instead?

Those are statistics for telemetry. We are collecting various vmstat
fields (e.g., pgsteal/pgscan) from real field devices, and thought those
two stats would be a good fit along with the other reclaim statistics in
vmstat, since they let us know how much a proactive madvise policy could
make the system healthier (e.g., less kswapd scanning, fewer allocstalls,
and so on).

> 
> Also how are you going to identify specific madvise calls when they can
> interleave arbitrarily?

I guess you are talking about how we could separate MADV_PAGEOUT and
MADV_COLD in vmstat. That's a valid question. I thought that, for a start,
we would just add an umbrella stat like this, and if we want to break it
down, we would need to introduce sysfs entries like slab has.
  
Michal Hocko Jan. 18, 2023, 9:13 p.m. UTC | #5
On Wed 18-01-23 09:55:38, Minchan Kim wrote:
> Those are statistics for telemetry. We are collecting various vmstat
> fields (e.g., pgsteal/pgscan) from real field devices, and thought those
> two stats would be a good fit along with the other reclaim statistics in
> vmstat, since they let us know how much a proactive madvise policy could
> make the system healthier (e.g., less kswapd scanning, fewer allocstalls,
> and so on).
> 
> > 
> > Also how are you going to identify specific madvise calls when they can
> > interleave arbitrarily?
> 
> I guess you are talking about how we could separate MADV_PAGEOUT and
> MADV_COLD in vmstat. That's a valid question. I thought that, for a start,
> we would just add an umbrella stat like this, and if we want to break it
> down, we would need to introduce sysfs entries like slab has.

No, not really. MADV_COLD is about aging. There is no actual reclaim
going on, so pgscan/steal metrics do not make any sense. I am asking
about potentially different concurrent MADV_PAGEOUT calls happening. From
what you've said earlier (how effectively the policy works) I have understood
you want to find out how effective a specific MADV_PAGEOUT is. But there
may be different callers of this applied to all sorts of different memory
mappings, and therefore the efficiency might be really different. As
there is no clear way to tell one from the other, I am really questioning
whether this global stat is actually useful.
  
Minchan Kim Jan. 18, 2023, 9:47 p.m. UTC | #6
On Wed, Jan 18, 2023 at 10:13:38PM +0100, Michal Hocko wrote:
> No, not really. MADV_COLD is about aging. There is no actual reclaim
> going on, so pgscan/steal metrics do not make any sense. I am asking
> about potentially different concurrent MADV_PAGEOUT calls happening. From
> what you've said earlier (how effectively the policy works) I have understood
> you want to find out how effective a specific MADV_PAGEOUT is. But there

No, it's not a specific MADV_PAGEOUT but a system-global policy.
Android has used the ActivityManagerService to control proactive
memory compaction of apps, since it can control the life of apps.
You can think of it as a userspace kswapd.

> may be different callers of this applied to all sorts of different memory
> mappings, and therefore the efficiency might be really different. As
> there is no clear way to tell one from the other, I am really questioning
> whether this global stat is actually useful.

The purpose is a global stat.
  

Patch

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3518dba1e02f..8b9fb2e151eb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -49,6 +49,8 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_FILE,
 		PGSTEAL_ANON,
 		PGSTEAL_FILE,
+		MADVISE_PGSCANNED,
+		MADVISE_PGHINTED,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
diff --git a/mm/madvise.c b/mm/madvise.c
index a4a03054ab6b..0e58545ff6e9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -334,6 +334,8 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	spinlock_t *ptl;
 	struct page *page = NULL;
 	LIST_HEAD(page_list);
+	unsigned int nr_scanned = 0;
+	unsigned int nr_hinted = 0;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -343,6 +345,7 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		pmd_t orig_pmd;
 		unsigned long next = pmd_addr_end(addr, end);
 
+		nr_scanned += HPAGE_PMD_NR;
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
 		if (!ptl)
@@ -396,11 +399,15 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 					list_add(&page->lru, &page_list);
 			}
 		} else
-			deactivate_page(page);
+			if (deactivate_page(page))
+				nr_hinted += HPAGE_PMD_NR;
 huge_unlock:
 		spin_unlock(ptl);
 		if (pageout)
-			paging_out(&page_list);
+			nr_hinted += paging_out(&page_list);
+
+		count_vm_events(MADVISE_PGSCANNED, nr_scanned);
+		count_vm_events(MADVISE_PGHINTED, nr_hinted);
 		return 0;
 	}
 
@@ -414,6 +421,7 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
+		nr_scanned++;
 
 		if (pte_none(ptent))
 			continue;
@@ -485,14 +493,17 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 					list_add(&page->lru, &page_list);
 			}
 		} else
-			deactivate_page(page);
+			if (deactivate_page(page))
+				nr_hinted++;
 	}
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	if (pageout)
-		paging_out(&page_list);
+		nr_hinted += paging_out(&page_list);
 	cond_resched();
+	count_vm_events(MADVISE_PGSCANNED, nr_scanned);
+	count_vm_events(MADVISE_PGHINTED, nr_hinted);
 
 	return 0;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2371d745e00..0139feade854 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1280,6 +1280,8 @@  const char * const vmstat_text[] = {
 	"pgscan_file",
 	"pgsteal_anon",
 	"pgsteal_file",
+	"madvise_pgscanned",
+	"madvise_pghinted",
 
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",