[1/2] mm: mmap: no need to call khugepaged_enter_vma() for stack

Message ID 20231221065943.2803551-1-shy828301@gmail.com
State New
Headers
Series [1/2] mm: mmap: no need to call khugepaged_enter_vma() for stack |

Commit Message

Yang Shi Dec. 21, 2023, 6:59 a.m. UTC
  From: Yang Shi <yang@os.amperecomputing.com>

We avoid allocating THP for temporary stack, even tnough
khugepaged_enter_vma() is called for stack VMAs, it actualy returns
false.  So no need to call it in the first place at all.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 mm/mmap.c | 2 --
 1 file changed, 2 deletions(-)
  

Comments

Yin Fengwei Jan. 10, 2024, 1:35 a.m. UTC | #1
On 2023/12/21 14:59, Yang Shi wrote:
> From: Yang Shi <yang@os.amperecomputing.com>
> 
> We avoid allocating THP for temporary stack, even tnough
> khugepaged_enter_vma() is called for stack VMAs, it actualy returns
> false.  So no need to call it in the first place at all.
> 
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

> ---
>   mm/mmap.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b78e83d351d2..2ff79b1d1564 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2046,7 +2046,6 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>   		}
>   	}
>   	anon_vma_unlock_write(vma->anon_vma);
> -	khugepaged_enter_vma(vma, vma->vm_flags);
>   	mas_destroy(&mas);
>   	validate_mm(mm);
>   	return error;
> @@ -2140,7 +2139,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
>   		}
>   	}
>   	anon_vma_unlock_write(vma->anon_vma);
> -	khugepaged_enter_vma(vma, vma->vm_flags);
>   	mas_destroy(&mas);
>   	validate_mm(mm);
>   	return error;
  
Yin Fengwei Jan. 10, 2024, 1:36 a.m. UTC | #2
On 2023/12/21 14:59, Yang Shi wrote:
> From: Yang Shi <yang@os.amperecomputing.com>
> 
> The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") incured regression for stress-ng pthread benchmark [1].
> It is because THP get allocated to pthread's stack area much more possible
> than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
> 
> The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> THP for such stack area.
> 
> With this change the stack area looks like:
> 
> fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
> Size:               8192 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:                  12 kB
> Pss:                  12 kB
> Pss_Dirty:            12 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        12 kB
> Referenced:           12 kB
> Anonymous:            12 kB
> KSM:                   0 kB
> LazyFree:              0 kB
> AnonHugePages:         0 kB
> ShmemPmdMapped:        0 kB
> FilePmdMapped:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> Locked:                0 kB
> THPeligible:           0
> VmFlags: rd wr mr mw me ac nh
> 
> The "nh" flag is set.
> 
> [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Tested-by: Oliver Sang <oliver.sang@intel.com>
> Cc: Yin Fengwei <fengwei.yin@intel.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Huang, Ying <ying.huang@intel.com>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>

Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

> ---
>   include/linux/mman.h | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 40d94411d492..dc7048824be8 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>   	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>   	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>   	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
> +	       _calc_vm_trans(flags, MAP_STACK,	     VM_NOHUGEPAGE) |
>   	       arch_calc_vm_flag_bits(flags);
>   }
>
  
Huang, Ying Jan. 15, 2024, 5:50 a.m. UTC | #3
Yang Shi <shy828301@gmail.com> writes:

> From: Yang Shi <yang@os.amperecomputing.com>
>
> We avoid allocating THP for temporary stack, even tnough
                                                    ~~~~~~
                                                    though?

--
Best Regards,
Huang, Ying

> khugepaged_enter_vma() is called for stack VMAs, it actualy returns
> false.  So no need to call it in the first place at all.
>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  mm/mmap.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b78e83d351d2..2ff79b1d1564 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2046,7 +2046,6 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  		}
>  	}
>  	anon_vma_unlock_write(vma->anon_vma);
> -	khugepaged_enter_vma(vma, vma->vm_flags);
>  	mas_destroy(&mas);
>  	validate_mm(mm);
>  	return error;
> @@ -2140,7 +2139,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
>  		}
>  	}
>  	anon_vma_unlock_write(vma->anon_vma);
> -	khugepaged_enter_vma(vma, vma->vm_flags);
>  	mas_destroy(&mas);
>  	validate_mm(mm);
>  	return error;
  
Yang Shi Jan. 16, 2024, 9:39 p.m. UTC | #4
On Sun, Jan 14, 2024 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > From: Yang Shi <yang@os.amperecomputing.com>
> >
> > We avoid allocating THP for temporary stack, even tnough
>                                                     ~~~~~~
>                                                     though?

Yeah, it is a typo. Thanks for noticing this.

>
> --
> Best Regards,
> Huang, Ying
>
> > khugepaged_enter_vma() is called for stack VMAs, it actualy returns
> > false.  So no need to call it in the first place at all.
> >
> > Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> > ---
> >  mm/mmap.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b78e83d351d2..2ff79b1d1564 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2046,7 +2046,6 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >               }
> >       }
> >       anon_vma_unlock_write(vma->anon_vma);
> > -     khugepaged_enter_vma(vma, vma->vm_flags);
> >       mas_destroy(&mas);
> >       validate_mm(mm);
> >       return error;
> > @@ -2140,7 +2139,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >               }
> >       }
> >       anon_vma_unlock_write(vma->anon_vma);
> > -     khugepaged_enter_vma(vma, vma->vm_flags);
> >       mas_destroy(&mas);
> >       validate_mm(mm);
> >       return error;
  
Florian Weimer Jan. 31, 2024, 7:53 a.m. UTC | #5
* Yang Shi:

> From: Yang Shi <yang@os.amperecomputing.com>
>
> The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") incured regression for stress-ng pthread benchmark [1].
> It is because THP get allocated to pthread's stack area much more possible
> than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
>
> The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> THP for such stack area.

Doesn't this introduce a regression in the other direction, where
workloads expect to use a hugepage TLB entry for the stack?

It's seems an odd approach to fixing the stress-ng regression.  Isn't it
very much coding to the benchmark?

Thanks,
Florian
  
Florian Weimer Feb. 1, 2024, 3:34 p.m. UTC | #6
* Yang Shi:

> On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Yang Shi:
>>
>> > From: Yang Shi <yang@os.amperecomputing.com>
>> >
>> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
>> > boundaries") incured regression for stress-ng pthread benchmark [1].
>> > It is because THP get allocated to pthread's stack area much more possible
>> > than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
>> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
>> >
>> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on
>> > Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
>> > THP for such stack area.
>>
>> Doesn't this introduce a regression in the other direction, where
>> workloads expect to use a hugepage TLB entry for the stack?
>
> Maybe, it is theoretically possible. But AFAICT, the real life
> workloads performance usually gets hurt if THP is used for stack.
> Willy has an example:
>
> https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t
>
> And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas
> have been applied before, this patch just extends this to MAP_STACK.

If it's *always* beneficial then we should help it along in glibc as
well.  We've started to offer a tunable in response to this observation
(also paper over in OpenJDK):

  Make thread stacks not use huge pages
  <https://bugs.openjdk.org/browse/JDK-8303215>

But this is specifically about RSS usage, and not directly about
reducing TLB misses etc.

Thanks,
Florian
  

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index b78e83d351d2..2ff79b1d1564 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2046,7 +2046,6 @@  static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		}
 	}
 	anon_vma_unlock_write(vma->anon_vma);
-	khugepaged_enter_vma(vma, vma->vm_flags);
 	mas_destroy(&mas);
 	validate_mm(mm);
 	return error;
@@ -2140,7 +2139,6 @@  int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 		}
 	}
 	anon_vma_unlock_write(vma->anon_vma);
-	khugepaged_enter_vma(vma, vma->vm_flags);
 	mas_destroy(&mas);
 	validate_mm(mm);
 	return error;