[4/5] mm: workingset: move the stats flush into workingset_test_recent()

Message ID 20230921081057.3440885-5-yosryahmed@google.com
State New
Series mm: memcg: subtree stats flushing and thresholds

Commit Message

Yosry Ahmed Sept. 21, 2023, 8:10 a.m. UTC
The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg. In preparation for more scoped
flushing and for passing the eviction memcg to the flush call, move the
call to workingset_test_recent(), where we have a pointer to the
eviction memcg.

The flush call is sleepable and cannot be made inside an RCU read
section. Hence, minimize the RCU read section by also moving the flush
into workingset_test_recent(). Furthermore, instead of holding the RCU
read lock throughout workingset_test_recent(), only hold it briefly to
get a ref on the eviction memcg. This allows us to make the flush call
after we get the eviction memcg.

As for workingset_refault(), nothing else there appears to be protected
by RCU. The memcg of the faulted folio (which is not necessarily the
same as the eviction memcg) is protected by the folio lock, which is
held at all callsites. Add a VM_BUG_ON_FOLIO() to make sure this
doesn't change from under us.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/workingset.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)
  

Comments

Yosry Ahmed Sept. 21, 2023, 9 p.m. UTC | #1
On Thu, Sep 21, 2023 at 4:06 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Yosry,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
> patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
> config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/config)
> compiler: powerpc-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309211829.Efuqg8NE-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>    mm/workingset.c: In function 'workingset_test_recent':
> >> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
>      461 |         css_get(&eviction_memcg->css);
>          |                                ^~
>

Ah yes, I cannot dereference the memcg pointer here. I think we want
mem_cgroup_get_from_id() (a getter version of mem_cgroup_from_id()) or
mem_cgroup_get() (a counterpart to mem_cgroup_put()), with stubs for
!CONFIG_MEMCG. I will make this change in the next version, but I'll
wait for feedback on this version first.

>
> vim +461 mm/workingset.c
>
>    405
>    406  /**
>    407   * workingset_test_recent - tests if the shadow entry is for a folio that was
>    408   * recently evicted. Also fills in @workingset with the value unpacked from
>    409   * shadow.
>    410   * @shadow: the shadow entry to be tested.
>    411   * @file: whether the corresponding folio is from the file lru.
>    412   * @workingset: where the workingset value unpacked from shadow should
>    413   * be stored.
>    414   *
>    415   * Return: true if the shadow is for a recently evicted folio; false otherwise.
>    416   */
>    417  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
>    418  {
>    419          struct mem_cgroup *eviction_memcg;
>    420          struct lruvec *eviction_lruvec;
>    421          unsigned long refault_distance;
>    422          unsigned long workingset_size;
>    423          unsigned long refault;
>    424          int memcgid;
>    425          struct pglist_data *pgdat;
>    426          unsigned long eviction;
>    427
>    428          rcu_read_lock();
>    429
>    430          if (lru_gen_enabled()) {
>    431                  bool recent = lru_gen_test_recent(shadow, file,
>    432                                                    &eviction_lruvec, &eviction,
>    433                                                    workingset);
>    434                  rcu_read_unlock();
>    435                  return recent;
>    436          }
>    437
>    438          unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
>    439          eviction <<= bucket_order;
>    440
>    441          /*
>    442           * Look up the memcg associated with the stored ID. It might
>    443           * have been deleted since the folio's eviction.
>    444           *
>    445           * Note that in rare events the ID could have been recycled
>    446           * for a new cgroup that refaults a shared folio. This is
>    447           * impossible to tell from the available data. However, this
>    448           * should be a rare and limited disturbance, and activations
>    449           * are always speculative anyway. Ultimately, it's the aging
>    450           * algorithm's job to shake out the minimum access frequency
>    451           * for the active cache.
>    452           *
>    453           * XXX: On !CONFIG_MEMCG, this will always return NULL; it
>    454           * would be better if the root_mem_cgroup existed in all
>    455           * configurations instead.
>    456           */
>    457          eviction_memcg = mem_cgroup_from_id(memcgid);
>    458          if (!mem_cgroup_disabled() && !eviction_memcg)
>    459                  return false;
>    460
>  > 461          css_get(&eviction_memcg->css);
>    462          rcu_read_unlock();
>    463
>    464          /* Flush stats (and potentially sleep) outside the RCU read section */
>    465          mem_cgroup_flush_stats_ratelimited();
>    466
>    467          eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
>    468          refault = atomic_long_read(&eviction_lruvec->nonresident_age);
>    469
>    470          /*
>    471           * Calculate the refault distance
>    472           *
>    473           * The unsigned subtraction here gives an accurate distance
>    474           * across nonresident_age overflows in most cases. There is a
>    475           * special case: usually, shadow entries have a short lifetime
>    476           * and are either refaulted or reclaimed along with the inode
>    477           * before they get too old.  But it is not impossible for the
>    478           * nonresident_age to lap a shadow entry in the field, which
>    479           * can then result in a false small refault distance, leading
>    480           * to a false activation should this old entry actually
>    481           * refault again.  However, earlier kernels used to deactivate
>    482           * unconditionally with *every* reclaim invocation for the
>    483           * longest time, so the occasional inappropriate activation
>    484           * leading to pressure on the active list is not a problem.
>    485           */
>    486          refault_distance = (refault - eviction) & EVICTION_MASK;
>    487
>    488          /*
>    489           * Compare the distance to the existing workingset size. We
>    490           * don't activate pages that couldn't stay resident even if
>    491           * all the memory was available to the workingset. Whether
>    492           * workingset competition needs to consider anon or not depends
>    493           * on having free swap space.
>    494           */
>    495          workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
>    496          if (!file) {
>    497                  workingset_size += lruvec_page_state(eviction_lruvec,
>    498                                                       NR_INACTIVE_FILE);
>    499          }
>    500          if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
>    501                  workingset_size += lruvec_page_state(eviction_lruvec,
>    502                                                       NR_ACTIVE_ANON);
>    503                  if (file) {
>    504                          workingset_size += lruvec_page_state(eviction_lruvec,
>    505                                                       NR_INACTIVE_ANON);
>    506                  }
>    507          }
>    508
>    509          mem_cgroup_put(eviction_memcg);
>    510          return refault_distance <= workingset_size;
>    511  }
>    512
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
  

Patch

diff --git a/mm/workingset.c b/mm/workingset.c
index b192e44a0e7c..79b338996088 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -425,8 +425,15 @@  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	struct pglist_data *pgdat;
 	unsigned long eviction;
 
-	if (lru_gen_enabled())
-		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
+	rcu_read_lock();
+
+	if (lru_gen_enabled()) {
+		bool recent = lru_gen_test_recent(shadow, file,
+						  &eviction_lruvec, &eviction,
+						  workingset);
+		rcu_read_unlock();
+		return recent;
+	}
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 	eviction <<= bucket_order;
@@ -451,6 +458,12 @@  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
 
+	css_get(&eviction_memcg->css);
+	rcu_read_unlock();
+
+	/* Flush stats (and potentially sleep) outside the RCU read section */
+	mem_cgroup_flush_stats_ratelimited();
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
@@ -493,6 +506,7 @@  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		}
 	}
 
+	mem_cgroup_put(eviction_memcg);
 	return refault_distance <= workingset_size;
 }
 
@@ -519,19 +533,16 @@  void workingset_refault(struct folio *folio, void *shadow)
 		return;
 	}
 
-	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
-
-	rcu_read_lock();
-
 	/*
 	 * The activation decision for this folio is made at the level
 	 * where the eviction occurred, as that is where the LRU order
 	 * during folio reclaim is being determined.
 	 *
 	 * However, the cgroup that will own the folio is the one that
-	 * is actually experiencing the refault event.
+	 * is actually experiencing the refault event. Make sure the folio is
+	 * locked to guarantee folio_memcg() stability throughout.
 	 */
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	nr = folio_nr_pages(folio);
 	memcg = folio_memcg(folio);
 	pgdat = folio_pgdat(folio);
@@ -540,7 +551,7 @@  void workingset_refault(struct folio *folio, void *shadow)
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset))
-		goto out;
+		return;
 
 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
@@ -556,8 +567,6 @@  void workingset_refault(struct folio *folio, void *shadow)
 		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
-out:
-	rcu_read_unlock();
 }
 
 /**