[23/24] swap: fix multiple swap leak when after cgroup migrate

Message ID 20231119194740.94101-24-ryncsn@gmail.com
State New
Headers
Series Swapin path refactor for optimization and bugfix |

Commit Message

Kairui Song Nov. 19, 2023, 7:47 p.m. UTC
  From: Kairui Song <kasong@tencent.com>

When a process which previously swapped some memory was moved to
another cgroup, and the cgroup it previous in is dead, then swapped in
pages will be leaked into rootcg. Previous commits fixed the bug for
no readahead path, this commit fix the same issue for readahead path.

This can be easily reproduced by:
- Setup a SSD or HDD swap.
- Create memory cgroup A, B and C.
- Spawn process P1 in cgroup A and make it swap out some pages.
- Move process P1 to memory cgroup B.
- Destroy cgroup A.
- Do a swapoff in cgroup C
- Swapped in pages is accounted into cgroup C.

This patch will fix it make the swapped in pages accounted in cgroup B.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       |  2 +-
 mm/swap_state.c | 19 ++++++++++---------
 mm/zswap.c      |  2 +-
 3 files changed, 12 insertions(+), 11 deletions(-)
  

Comments

Huang, Ying Nov. 20, 2023, 7:35 a.m. UTC | #1
Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> When a process which previously swapped some memory was moved to
> another cgroup, and the cgroup it previous in is dead, then swapped in
> pages will be leaked into rootcg. Previous commits fixed the bug for
> no readahead path, this commit fix the same issue for readahead path.
>
> This can be easily reproduced by:
> - Setup a SSD or HDD swap.
> - Create memory cgroup A, B and C.
> - Spawn process P1 in cgroup A and make it swap out some pages.
> - Move process P1 to memory cgroup B.
> - Destroy cgroup A.
> - Do a swapoff in cgroup C
> - Swapped in pages is accounted into cgroup C.
>
> This patch will fix it make the swapped in pages accounted in cgroup B.

Accroding to "Memory Ownership" section of
Documentation/admin-guide/cgroup-v2.rst,

"
A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released.  Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.
"

Because we don't move the charge when we move a task from one cgroup to
another.  It's controversial which cgroup should be charged to.
According to the above document, it's acceptable to charge to the cgroup
C (cgroup where swapoff happens).

--
Best Regards,
Huang, Ying
  
Kairui Song Nov. 20, 2023, 11:17 a.m. UTC | #2
Huang, Ying <ying.huang@intel.com> 于2023年11月20日周一 15:37写道:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > When a process which previously swapped some memory was moved to
> > another cgroup, and the cgroup it previous in is dead, then swapped in
> > pages will be leaked into rootcg. Previous commits fixed the bug for
> > no readahead path, this commit fix the same issue for readahead path.
> >
> > This can be easily reproduced by:
> > - Setup a SSD or HDD swap.
> > - Create memory cgroup A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C
> > - Swapped in pages is accounted into cgroup C.
> >
> > This patch will fix it make the swapped in pages accounted in cgroup B.
>
> Accroding to "Memory Ownership" section of
> Documentation/admin-guide/cgroup-v2.rst,
>
> "
> A memory area is charged to the cgroup which instantiated it and stays
> charged to the cgroup until the area is released.  Migrating a process
> to a different cgroup doesn't move the memory usages that it
> instantiated while in the previous cgroup to the new cgroup.
> "
>
> Because we don't move the charge when we move a task from one cgroup to
> another.  It's controversial which cgroup should be charged to.
> According to the above document, it's acceptable to charge to the cgroup
> C (cgroup where swapoff happens).

Hi Ying, thank you very much for the info!

It is controversial indeed, just the original behavior is kind of
counter-intuitive.

Image if there are cgroup P1, and its child cgroup C1 C2. If a process
swapped out some memory in C1 then moved to C2, and C1 is dead.
On swapoff the charge will be moved out of P1...

And swapoff often happen on some unlimited cgroup or some cgroup for
management agent.

If P1 have a memory limit, it can breech the limit easily, we will see
a process that never leave P1 having a much higher RSS that P1/C1/C2's
limit.
And if there is a limit for the management agent cgroup, the agent
will be OOM instead of OOM in P1.

Simply moving a process between the child cgroup of the same parent
cgroup won't cause such issue, thing get weird when swapoff is
involved.

Or maybe we should try to be compatible, and introduce a sysctl or
cmdline for this?
  
Chris Li Nov. 22, 2023, 5:34 a.m. UTC | #3
Hi Kairui,

On Mon, Nov 20, 2023 at 3:17 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Huang, Ying <ying.huang@intel.com> 于2023年11月20日周一 15:37写道:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > When a process which previously swapped some memory was moved to
> > > another cgroup, and the cgroup it previous in is dead, then swapped in
> > > pages will be leaked into rootcg. Previous commits fixed the bug for
> > > no readahead path, this commit fix the same issue for readahead path.
> > >
> > > This can be easily reproduced by:
> > > - Setup a SSD or HDD swap.
> > > - Create memory cgroup A, B and C.
> > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > - Move process P1 to memory cgroup B.
> > > - Destroy cgroup A.
> > > - Do a swapoff in cgroup C
> > > - Swapped in pages is accounted into cgroup C.
> > >
> > > This patch will fix it make the swapped in pages accounted in cgroup B.
> >
> > Accroding to "Memory Ownership" section of
> > Documentation/admin-guide/cgroup-v2.rst,
> >
> > "

> > A memory area is charged to the cgroup which instantiated it and stays
> > charged to the cgroup until the area is released.  Migrating a process
> > to a different cgroup doesn't move the memory usages that it
> > instantiated while in the previous cgroup to the new cgroup.
> > "
> >
> > Because we don't move the charge when we move a task from one cgroup to
> > another.  It's controversial which cgroup should be charged to.
> > According to the above document, it's acceptable to charge to the cgroup
> > C (cgroup where swapoff happens).
>
> Hi Ying, thank you very much for the info!
>
> It is controversial indeed, just the original behavior is kind of
> counter-intuitive.
>
> Image if there are cgroup P1, and its child cgroup C1 C2. If a process
> swapped out some memory in C1 then moved to C2, and C1 is dead.
> On swapoff the charge will be moved out of P1...
>
> And swapoff often happen on some unlimited cgroup or some cgroup for
> management agent.
>
> If P1 have a memory limit, it can breech the limit easily, we will see
> a process that never leave P1 having a much higher RSS that P1/C1/C2's
> limit.
> And if there is a limit for the management agent cgroup, the agent
> will be OOM instead of OOM in P1.

I think I will reply to another similar email.

If you want OOM in P1, you can have an admin program. fork and execute
a new process, add the new process into P1, then swap off from that
new process.

>
> Simply moving a process between the child cgroup of the same parent
> cgroup won't cause such issue, thing get weird when swapoff is
> involved.
>
> Or maybe we should try to be compatible, and introduce a sysctl or
> cmdline for this?

If the above suggestion works, then you don't need to change swap off?

If you still want to change the charging model. I like to see the
bigger picture, what rules it follows and how it works in other
situations.

Chris
  

Patch

diff --git a/mm/swap.h b/mm/swap.h
index 795a25df87da..4374bf11ca41 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -55,7 +55,7 @@  struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				   struct swap_iocb **plug);
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *new_page_allocated);
+				     struct mm_struct *mm, bool *new_page_allocated);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf, enum swap_cache_result *result);
 struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b377e55cb850..362a6f674b36 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -416,7 +416,7 @@  struct folio *filemap_get_incore_folio(struct address_space *mapping,
 
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *new_page_allocated)
+				     struct mm_struct *mm, bool *new_page_allocated)
 {
 	struct swap_info_struct *si;
 	struct folio *folio;
@@ -462,7 +462,7 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 						mpol, ilx, numa_node_id());
 		if (!folio)
                         goto fail_put_swap;
-		if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
+		if (mem_cgroup_swapin_charge_folio(folio, mm, gfp_mask, entry))
 			goto fail_put_folio;
 
 		/*
@@ -540,7 +540,7 @@  struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
-					&page_allocated);
+				       vma->vm_mm, &page_allocated);
 	mpol_cond_put(mpol);
 
 	if (page_allocated)
@@ -628,7 +628,8 @@  static unsigned long swapin_nr_pages(unsigned long offset)
  * are fairly likely to have been swapped out from the same node.
  */
 static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-					   struct mempolicy *mpol, pgoff_t ilx)
+					   struct mempolicy *mpol, pgoff_t ilx,
+					   struct mm_struct *mm)
 {
 	struct page *page;
 	unsigned long entry_offset = swp_offset(entry);
@@ -657,7 +658,7 @@  static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		/* Ok, do the async read-ahead now */
 		page = __read_swap_cache_async(
 				swp_entry(swp_type(entry), offset),
-				gfp_mask, mpol, ilx, &page_allocated);
+				gfp_mask, mpol, ilx, mm, &page_allocated);
 		if (!page)
 			continue;
 		if (page_allocated) {
@@ -675,7 +676,7 @@  static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
-					&page_allocated);
+				       mm, &page_allocated);
 	if (unlikely(page_allocated))
 		swap_readpage(page, false, NULL);
 	return page;
@@ -830,7 +831,7 @@  static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		pte_unmap(pte);
 		pte = NULL;
 		page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
-						&page_allocated);
+					       vmf->vma->vm_mm, &page_allocated);
 		if (!page)
 			continue;
 		if (page_allocated) {
@@ -850,7 +851,7 @@  static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	page = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
-					&page_allocated);
+				       vmf->vma->vm_mm, &page_allocated);
 	if (unlikely(page_allocated))
 		swap_readpage(page, false, NULL);
 	return page;
@@ -980,7 +981,7 @@  struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
 			workingset_refault(page_folio(page), shadow);
 		cache_result = SWAP_CACHE_BYPASS;
 	} else {
-		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx, mm);
 		cache_result = SWAP_CACHE_MISS;
 	}
 done:
diff --git a/mm/zswap.c b/mm/zswap.c
index 030cc137138f..e2712ff169b1 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1081,7 +1081,7 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* try to allocate swap cache page */
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-				NO_INTERLEAVE_INDEX, &page_was_allocated);
+				NO_INTERLEAVE_INDEX, NULL, &page_was_allocated);
 	if (!page) {
 		ret = -ENOMEM;
 		goto fail;