Message ID: <20240109091511.8299-1-jianfeng.w.wang@oracle.com>
State: New
From: Jianfeng Wang <jianfeng.w.wang@oracle.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jianfeng Wang <jianfeng.w.wang@oracle.com>
Subject: [PATCH] mm, oom: Add lru_add_drain() in __oom_reap_task_mm()
Date: Tue, 9 Jan 2024 01:15:11 -0800
Series: mm, oom: Add lru_add_drain() in __oom_reap_task_mm()
Commit Message
Jianfeng Wang
Jan. 9, 2024, 9:15 a.m. UTC
The oom_reaper tries to reclaim additional memory owned by the OOM
victim. In __oom_reap_task_mm(), it uses mmu_gather for batched page
freeing. After oom_reaper was added, the mmu_gather feature introduced
CONFIG_MMU_GATHER_NO_GATHER (in commit 952a31c9e6fa ("asm-generic/tlb:
Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y")), an option to skip
batched page freeing. If it is set, tlb_batch_pages_flush(), which is
responsible for calling lru_add_drain(), is skipped during
tlb_finish_mmu(). Without that call, pages could still be held by
per-CPU fbatches rather than being freed.

This fix adds lru_add_drain() prior to the mmu_gather setup. This makes
the code consistent with other cases where mmu_gather is used for
freeing pages.
Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
---
mm/oom_kill.c | 1 +
1 file changed, 1 insertion(+)
Comments
On Tue 09-01-24 01:15:11, Jianfeng Wang wrote:
[...]
> This fix adds lru_add_drain() prior to mmu_gather. This makes the code
> consistent with other cases where mmu_gather is used for freeing pages.

Does this fix any actual problem, or is it purely a code-consistency change? I am asking because it doesn't make much sense to me, TBH. LRU cache draining is usually important when we want to ensure that cached pages are put on the LRU to be dealt with, because otherwise the MM code wouldn't be able to deal with them. The OOM reaper doesn't necessarily run on the same CPU as the OOM victim, so draining on the local CPU doesn't necessarily do anything for the victim's pages.

While this patch is not harmful, I really do not see much point in adding the local draining here. Could you clarify, please?
On 1/10/24 12:46 AM, Michal Hocko wrote:
[...]
> While this patch is not harmful I really do not see much point in adding
> the local draining here. Could you clarify please?

It targets the case described in the patch's commit message: the OOM killer thinks that it 'reclaims' pages while the pages are still held, with a reference count, by per-CPU fbatches.

I admit that the pages may sit on a different core (or cores). Given that doing remote calls to all CPUs with lru_add_drain_all() is expensive, this line of code can be helpful if it happens to give back a few pages to the system right away without that overhead, especially when OOM is involved. Plus, it also makes the code consistent with other places that use the mmu_gather feature to free pages in batches.

--JW
On Wed 10-01-24 11:02:03, Jianfeng Wang wrote:
[...]
> Given that doing remote calls to all CPUs with lru_add_drain_all() is
> expensive, this line of code can be helpful if it happens to give back
> a few pages to the system right away without the overhead, especially
> when oom is involved. Plus, it also makes the code consistent with
> other places using mmu_gather feature to free pages in batch.

I would argue that consistency is the biggest problem of this patch. It tries to follow a pattern that is just not really correct. First, it operates on a random CPU from the OOM victim's perspective, and second, it doesn't really unblock any unmapping operation, which is the main purpose of the reaper. Sure, the reaper frees a lot of unmapped memory, but if there are a couple of pages that cannot be freed immediately because they are sitting in per-CPU LRU caches, that is not a deal breaker. As you have noted, those pages might be sitting in any CPU's cache.

So I do not really see that as a good justification. People will follow that pattern even more and spread lru_add_drain to other random places.

Unless you can show an actual runtime effect of this patch, I think it shouldn't be merged.
On 1/11/24 12:46 AM, Michal Hocko wrote:
[...]
> Unless you can show any actual runtime effect of this patch then I think
> it shouldn't be merged.

Thanks for raising your concern. I'd call it a trade-off rather than "not really correct". Look at unmap_region() / free_pages_and_swap_cache() written by Linus. These are in favor of this pattern, which indicates that the trade-off (i.e., draining the local CPU vs. draining all CPUs vs. no draining at all) has been made the same way in the past. I don't have a specific runtime effect to provide, except that it will free tens of kilobytes of pages immediately during OOM.
On Thu, 11 Jan 2024 10:54:45 -0800 Jianfeng Wang <jianfeng.w.wang@oracle.com> wrote:
[...]
> I don't have a specific runtime effect to provide, except that it will
> free 10s kB pages immediately during OOM.

I don't think it's necessary to run lru_add_drain() for each vma. Once we've done it once, it can be skipped for additional vmas.

That's pretty minor, because the second and successive calls will be cheap. But it becomes much more significant if we switch to lru_add_drain_all(), which sounds like what we should be doing here. Is it possible?
On 1/11/24 1:54 PM, Andrew Morton wrote:
[...]
> I don't think it's necessary to run lru_add_drain() for each vma. Once
> we've done it once, it can be skipped for additional vmas.

Agreed.

> That's pretty minor because the second and successive calls will be
> cheap. But it becomes much more significant if we switch to
> lru_add_drain_all(), which sounds like what we should be doing here.
> Is it possible?

What do you both think of adding lru_add_drain_all() prior to the for loop?
On Thu 11-01-24 16:08:57, Jianfeng Wang wrote:
[...]
> I'd call it a trade-off rather than "not really correct". Look at
> unmap_region() / free_pages_and_swap_cache() written by Linus.

You are missing an important point. Those two calls are quite different. The oom_reaper unmaps memory only after all the reclaim attempts have failed, and those attempts include draining all sorts of caches on the way, including the per-CPU LRU caches (look for lru_add_drain_all in the reclaim path).

> What do you both think of adding lru_add_drain_all() prior to the for loop?

lru_add_drain_all relies on workqueues, and we absolutely do not want to get the oom_reaper stuck just because the workqueues are jammed. So no, that would actually be actively harmful!

All that being said, I stand by my previous statement that this patch is not doing anything measurably useful. Prove me wrong; otherwise I am against merging a "just for consistency" patch. Really, we should go and re-evaluate the existing local LRU draining callers. I wouldn't be surprised if we removed some of them.
On Fri, Jan 12, 2024 at 09:49:08AM +0100, Michal Hocko wrote:
[...]
> lru_add_drain_all relies on WQs. And we absolutely do not want to get
> oom_reaper stuck just because all the WQ is jammed. So no, this is
> actually actively harmful!

I completely agree.

The oom_reap_task_mm function is also used for process_mrelease, which is a critical path for releasing memory on Android and is typically used under system pressure (not only memory pressure but also CPU pressure at the same time). The lru_add_drain_all function can take a long time to finish because Android is susceptible to priority inversion among processes. A better idea may be to enable remote draining with lru_add_drain_all, analogous to the recent PCP modifications.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e6071fde34a..e2fcf4f062ea 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -537,6 +537,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 		struct mmu_notifier_range range;
 		struct mmu_gather tlb;
 
+		lru_add_drain();
 		mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
 					mm, vma->vm_start,
 					vma->vm_end);