mm:vmscan: shrink skip folios in the exiting task

Message ID 20240124124308.461-1-justinjiang@vivo.com
State New
Series mm:vmscan: shrink skip folios in the exiting task

Commit Message

Zhiguo Jiang Jan. 24, 2024, 12:43 p.m. UTC
  If the folio being shrunk belongs to an exiting task, the folio should
be freed in the task-exit flow rather than reclaimed in the shrink
flow, because the former takes less time.

If a folio belonging to an exiting task is reclaimed in the shrink
flow, such as an anon folio, the anon folio must first be written out
to the swap partition in the shrink flow, and then the corresponding
swap folio must be released in the task-exit flow. Releasing a swap
folio takes more time than directly releasing an anon folio.

On a low-memory system with many background applications, the
time-consuming problem caused by shrinking the exiting task's folios
is more severe.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---
 include/linux/mm.h |  1 +
 mm/rmap.c          | 12 ++++++++++++
 mm/vmscan.c        |  4 ++++
 3 files changed, 17 insertions(+)
 mode change 100644 => 100755 include/linux/mm.h
 mode change 100644 => 100755 mm/rmap.c
  

Comments

Matthew Wilcox Jan. 24, 2024, 1:20 p.m. UTC | #1
On Wed, Jan 24, 2024 at 08:43:07PM +0800, Zhiguo Jiang wrote:
> If the shrinking folio is belong to the exiting task, this folio should
> be freed in the task exit flow rather than being reclaimed in the shrink
> flow, because the former takes less time.
> 
> If the folio which is belong to the exiting task is reclaimed in the
> shrink flow, such as the anon folio, the anon folio needs to be first
> written to the swap partition by swap-in in shrink flow, and then the
> corresponding swap folio needs to be released in the task exiting flow.
> As is well known, releasing a swap folio will task more time than 
> releasing directly an anon folio.
> 
> In the scenarios of the low memory system and mutil backed-applications,
> the time-consuming problem caused by shrinking the exiting task's folios
> will be more severe.

What testing have you done of this patch?  How often does it happen?
Are there particular workloads that benefit from this?  (I'm not sure
what "mutil backed-applications" are?)

And I do mean specifically of this patch, because to my eyes it
shouldn't even compile.
  
Zhiguo Jiang Jan. 25, 2024, 1:34 a.m. UTC | #2
On 2024/1/24 21:20, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 08:43:07PM +0800, Zhiguo Jiang wrote:
>> If the shrinking folio is belong to the exiting task, this folio should
>> be freed in the task exit flow rather than being reclaimed in the shrink
>> flow, because the former takes less time.
>>
>> If the folio which is belong to the exiting task is reclaimed in the
>> shrink flow, such as the anon folio, the anon folio needs to be first
>> written to the swap partition by swap-in in shrink flow, and then the
>> corresponding swap folio needs to be released in the task exiting flow.
>> As is well known, releasing a swap folio will task more time than
>> releasing directly an anon folio.
>>
>> In the scenarios of the low memory system and mutil backed-applications,
>> the time-consuming problem caused by shrinking the exiting task's folios
>> will be more severe.
> What testing have you done of this patch?  How often does it happen?
> Are there particular workloads that benefit from this?  (I'm not sure
> what "mutil backed-applications" are?)
1 Yes, this patch has been tested.

2 When task exit and shrink_inactive_list occur at the same time, the
   folios which shrink_inactive_list reclaims may be the exiting task's
   folios in the lruvecs. When the system is low on memory this is more
   likely to occur, because more background applications will be killed.
   shrink_inactive_list reclaims the exiting task's folios from the
   lruvecs and turns the exiting task's anon folios into swap memory,
   which increases the load of the exiting task.

3 This patch can alleviate the load of the task-exit path: it lets the
   exiting task free its anon folios directly, instead of releasing the
   swap memory produced by swapping those anon folios out in
   shrink_inactive_list.

4 "mutil backed-applications" means that there are a large number of
   background applications in the system.

Thanks
>
> And I do mean specifically of this patch, because to my eyes it
> shouldn't even compile.
It has been tested.
  
Matthew Wilcox Jan. 28, 2024, 6:35 a.m. UTC | #3
On Thu, Jan 25, 2024 at 09:34:53AM +0800, zhiguojiang wrote:
> > > In the scenarios of the low memory system and mutil backed-applications,
> > > the time-consuming problem caused by shrinking the exiting task's folios
> > > will be more severe.
> > What testing have you done of this patch?  How often does it happen?
> > Are there particular workloads that benefit from this?  (I'm not sure
> > what "mutil backed-applications" are?)
> 1 Yes, this patch has been tested.
> 
> 2 When the exiting tasks and shrink_inactive_list occur at the same time,
>    the folios which shrink_inactive_list reclaims may be the exiting tasks's
> folios
>    in lruvecs. And when system is low memory, it more likely to occur,
> because
>    more backend applidatuions will be killed.
>    The shrink_inactive_list reclaims the exiting tasks's folios in lruvecs
> and
>    transforms the exiting tasks's anon folios into swap memory, which leads
>    to the increasing load of the current exiting tasks.

Ah, we're talking about an OOM scenario.  OK, that makes some sense.
You should have mentioned that in the patch description.

> 3 This patch can alleviate the load of the tasks exiting process. This patch
>    can make that the exiting tasks release its anon folios faster instead of
>    releasing its swap memory from its anon folios swap-in in
> shrink_inactive_list.
> 
> 4 "mutil backed-applications" means that there are a large number of
>     the backend applications in the system.
> 
> Thanks
> > 
> > And I do mean specifically of this patch, because to my eyes it
> > shouldn't even compile.
> Has been tested.

That's odd.  I thought GCC used to complain about

	long x = 0x100000000;

but it does the right thing.  Except on 32-bit where it'll say
"warning: integer constant out of range".

In any case, the number you chose is not going to work on 32-bit systems
or on ARM or x86.  It conflicts with protection keys on x86 and MTE on
ARM.

We can do it without any new magic numbers.  I'm not sure this is a
great approach, but this should work:

+++ b/mm/rmap.c
@@ -840,6 +840,12 @@ static bool folio_referenced_one(struct folio *folio,
        int referenced = 0;
        unsigned long start = address, ptes = 0;

+       /* Skip this folio if it's mapped by an exiting task */
+       if (test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags)) {
+               pra->referenced = -1;
+               return false;
+       }
+
        while (page_vma_mapped_walk(&pvmw)) {
                address = pvmw.address;
  
Zhiguo Jiang Jan. 29, 2024, 2:21 a.m. UTC | #4
On 2024/1/28 14:35, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 09:34:53AM +0800, zhiguojiang wrote:
>>>> In the scenarios of the low memory system and mutil backed-applications,
>>>> the time-consuming problem caused by shrinking the exiting task's folios
>>>> will be more severe.
>>> What testing have you done of this patch?  How often does it happen?
>>> Are there particular workloads that benefit from this?  (I'm not sure
>>> what "mutil backed-applications" are?)
>> 1 Yes, this patch has been tested.
>>
>> 2 When the exiting tasks and shrink_inactive_list occur at the same time,
>>     the folios which shrink_inactive_list reclaims may be the exiting tasks's
>> folios
>>     in lruvecs. And when system is low memory, it more likely to occur,
>> because
>>     more backend applidatuions will be killed.
>>     The shrink_inactive_list reclaims the exiting tasks's folios in lruvecs
>> and
>>     transforms the exiting tasks's anon folios into swap memory, which leads
>>     to the increasing load of the current exiting tasks.
> Ah, we're talking about an OOM scenario.  OK, that makes some sense.
> You should have mentioned that in the patch description.
Hi,

1 OK, I will update a more comprehensive description in the next version.

2 I think this issue can occur not only in the OOM scenario, but also in
   the normal task-exit scenario. So:

1) if (unlikely(!atomic_read(&vma->vm_mm->mm_users))) covers the
   scenario where the task exits normally.

2) if (test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags)) covers the
    OOM reaper scenario.

3 MMF_OOM_SKIP only covers the OOM scenario, not the normal task-exit
   scenario: by the time MMF_OOM_SKIP is set on the normal exit path,
   the task's memory folios have already been released, whereas
   shrink_inactive_list needs to recognize the exiting task's lru folios
   before the task releases them, as in exit_mmap():

     tlb_gather_mmu_fullmm(&tlb, mm);
     /* update_hiwater_rss(mm) here? but nobody should be looking */
     /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
     unmap_vmas(&tlb, &mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
     mmap_read_unlock(mm);

     /*
      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
      * because the memory has been already freed.
      */
     set_bit(MMF_OOM_SKIP, &mm->flags);
>> 3 This patch can alleviate the load of the tasks exiting process. This patch
>>     can make that the exiting tasks release its anon folios faster instead of
>>     releasing its swap memory from its anon folios swap-in in
>> shrink_inactive_list.
>>
>> 4 "mutil backed-applications" means that there are a large number of
>>      the backend applications in the system.
>>
>> Thanks
>>> And I do mean specifically of this patch, because to my eyes it
>>> shouldn't even compile.
>> Has been tested.
> That's odd.  I thought GCC used to complain about
>
> 	long x = 0x100000000;
>
> but it does the right thing.  Except on 32-bit where it'll say
> "warning: integer constant out of range".
>
> In any case, the number you chose is not going to work on 32-bit systems
> or on ARM or x86.  It conflicts with protection keys on x86 and MTE on
> ARM.
You're right, I overlooked the situation with the 32-bit system.
> We can do it without any new magic numbers.  I'm not sure this is a
> great approach, but this should work:
>
> +++ b/mm/rmap.c
> @@ -840,6 +840,12 @@ static bool folio_referenced_one(struct folio *folio,
>          int referenced = 0;
>          unsigned long start = address, ptes = 0;
>
> +       /* Skip this folio if it's mapped by an exiting task */
> +       if (test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags)) {
> +               pra->referenced = -1;
> +               return false;
> +       }
> +
>          while (page_vma_mapped_walk(&pvmw)) {
>                  address = pvmw.address;
>
I agree with your point of view. I think this is currently the best
solution, but I think it also needs to be combined with:

if (unlikely(!atomic_read(&vma->vm_mm->mm_users)))

Please help review it again.

Best Regards.
  

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 42652bc8ceca..67e84b7d1928
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -312,6 +312,7 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_EXITING	0x100000000  /* The task which this vma belongs to is exiting */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
 #define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1cf2bffa48ed..deb419fd095b
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -943,6 +943,18 @@  static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
 	if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
 		return true;
 
+	/*
+	 * Skip counting references of folios mapped by an exiting task;
+	 * these folios will be freed in the task exit flow rather than
+	 * being reclaimed in the shrink flow, which takes less time.
+	 */
+	if (likely(vma->vm_mm) &&
+		unlikely(!atomic_read(&vma->vm_mm->mm_users) ||
+		test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags))) {
+		pra->vm_flags |= VM_EXITING;
+		return true;
+	}
+
 	return false;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b3ed3c9c3e3d..40ea4128044c 100755
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -887,6 +887,10 @@  static enum folio_references folio_check_references(struct folio *folio,
 		return FOLIOREF_KEEP;
 	}
 
+	/* The folio belongs to the exiting task: rotate */
+	if (vm_flags & VM_EXITING)
+		return FOLIOREF_KEEP;
+
 	/* Reclaim if clean, defer dirty folios to writeback */
 	if (referenced_folio && folio_is_file_lru(folio))
 		return FOLIOREF_RECLAIM_CLEAN;