[v2,1/2] mm:vmscan: the dirty folio in folio_list skip unmap

Message ID 20231019131446.317-2-justinjiang@vivo.com
State New
Series mm: the dirty folio unmap redundantly

Commit Message

Zhiguo Jiang Oct. 19, 2023, 1:14 p.m. UTC
  In shrink_folio_list() the dirty file folios come from two sources:
1. The dirty folio comes from the incoming folio_list parameter,
   which is the inactive file lru.
2. The dirty folio comes from the PTE dirty bit transferred by
   try_to_unmap().

For the first source, if the dirty folio does not support pageout,
it can skip unmap in advance to reduce reclaim time.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---

Changelog:
v1->v2:
1. Keep the original judgment flow.
2. Add the interface of folio_check_pageout().
3. The dirty folio which does not support pageout in the inactive file
   lru skips unmap in advance.

 mm/vmscan.c | 103 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 66 insertions(+), 37 deletions(-)
  

Comments

David Hildenbrand Oct. 19, 2023, 2:15 p.m. UTC | #1
On 19.10.23 15:14, Zhiguo Jiang wrote:
> In the shrink_folio_list() the sources of the file dirty folio include
> two ways below:
> 1. The dirty folio is from the incoming parameter folio_list,
>     which is the inactive file lru.
> 2. The dirty folio is from the PTE dirty bit transferred by
>     the try_to_unmap().
> 
> For the first source of the dirty folio, if the dirty folio does not
> support pageout, the dirty folio can skip unmap in advance to reduce
> recyling time.
> 
> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> ---
> 
> Changelog:
> v1->v2:
> 1. Keep the original judgment flow.
> 2. Add the interface of folio_check_pageout().
> 3. The dirty folio which does not support pageout in inactive file lru
>     skip unmap in advance.
> 
>   mm/vmscan.c | 103 +++++++++++++++++++++++++++++++++-------------------
>   1 file changed, 66 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a68d01fcc307..e067269275a5 100755
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -925,6 +925,44 @@ static void folio_check_dirty_writeback(struct folio *folio,
>   		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>   }
>   
> +/* Check if a dirty folio can support pageout in the recycling process */
> +static bool folio_check_pageout(struct folio *folio,
> +						struct pglist_data *pgdat)
> +{
> +	int ret = true;
> +
> +	/*
> +	 * Anonymous folios are not handled by flushers and must be written
> +	 * from reclaim context. Do not stall reclaim based on them.
> +	 * MADV_FREE anonymous folios are put into inactive file list too.
> +	 * They could be mistakenly treated as file lru. So further anon
> +	 * test is needed.
> +	 */
> +	if (!folio_is_file_lru(folio) ||
> +		(folio_test_anon(folio) && !folio_test_swapbacked(folio)))
> +		goto out;
> +
> +	if (folio_test_dirty(folio) &&
> +		(!current_is_kswapd() ||
> +		 !folio_test_reclaim(folio) ||
> +		 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
> +		/*
> +		 * Immediately reclaim when written back.
> +		 * Similar in principle to folio_deactivate()
> +		 * except we already have the folio isolated
> +		 * and know it's dirty
> +		 */
> +		node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
> +			folio_nr_pages(folio));
> +		folio_set_reclaim(folio);
> +
> +		ret = false;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
>   static struct folio *alloc_demote_folio(struct folio *src,
>   		unsigned long private)
>   {
> @@ -1078,6 +1116,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   		if (dirty && !writeback)
>   			stat->nr_unqueued_dirty += nr_pages;
>   
> +		/* If the dirty folio does not support pageout,
> +		 * the dirty folio can skip this recycling.
> +		 */
> +		if (!folio_check_pageout(folio, pgdat))
> +			goto activate_locked;
> +
>   		/*
>   		 * Treat this folio as congested if folios are cycling
>   		 * through the LRU so quickly that the folios marked
> @@ -1261,43 +1305,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			enum ttu_flags flags = TTU_BATCH_FLUSH;
>   			bool was_swapbacked = folio_test_swapbacked(folio);
>   
> -			if (folio_test_dirty(folio)) {
> -				/*
> -				 * Only kswapd can writeback filesystem folios
> -				 * to avoid risk of stack overflow. But avoid
> -				 * injecting inefficient single-folio I/O into
> -				 * flusher writeback as much as possible: only
> -				 * write folios when we've encountered many
> -				 * dirty folios, and when we've already scanned
> -				 * the rest of the LRU for clean folios and see
> -				 * the same dirty folios again (with the reclaim
> -				 * flag set).
> -				 */
> -				if (folio_is_file_lru(folio) &&
> -					(!current_is_kswapd() ||
> -					 !folio_test_reclaim(folio) ||
> -					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
> -					/*
> -					 * Immediately reclaim when written back.
> -					 * Similar in principle to folio_deactivate()
> -					 * except we already have the folio isolated
> -					 * and know it's dirty
> -					 */
> -					node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
> -							nr_pages);
> -					folio_set_reclaim(folio);
> -
> -					goto activate_locked;
> -				}
> -
> -				if (references == FOLIOREF_RECLAIM_CLEAN)
> -					goto keep_locked;
> -				if (!may_enter_fs(folio, sc->gfp_mask))
> -					goto keep_locked;
> -				if (!sc->may_writepage)
> -					goto keep_locked;
> -			}
> -
>   			if (folio_test_pmd_mappable(folio))
>   				flags |= TTU_SPLIT_HUGE_PMD;
>   
> @@ -1323,6 +1330,28 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   
>   		mapping = folio_mapping(folio);
>   		if (folio_test_dirty(folio)) {
> +			/*
> +			 * Only kswapd can writeback filesystem folios
> +			 * to avoid risk of stack overflow. But avoid
> +			 * injecting inefficient single-folio I/O into
> +			 * flusher writeback as much as possible: only
> +			 * write folios when we've encountered many
> +			 * dirty folios, and when we've already scanned
> +			 * the rest of the LRU for clean folios and see
> +			 * the same dirty folios again (with the reclaim
> +			 * flag set).
> +			 */
> +			if (folio_is_file_lru(folio) &&
> +				!folio_check_pageout(folio, pgdat))
> +				goto activate_locked;
> +
> +			if (references == FOLIOREF_RECLAIM_CLEAN)
> +				goto keep_locked;
> +			if (!may_enter_fs(folio, sc->gfp_mask))
> +				goto keep_locked;
> +			if (!sc->may_writepage)
> +				goto keep_locked;
> +
>   			/*
>   			 * Folio is dirty. Flush the TLB if a writable entry
>   			 * potentially exists to avoid CPU writes after I/O

I'm confused. Did you apply this on top of v1 by accident?
  
Zhiguo Jiang Oct. 20, 2023, 3:59 a.m. UTC | #2
On 2023/10/19 22:15, David Hildenbrand wrote:
>
> On 19.10.23 15:14, Zhiguo Jiang wrote:
>> In the shrink_folio_list() the sources of the file dirty folio include
>> two ways below:
>> 1. The dirty folio is from the incoming parameter folio_list,
>>     which is the inactive file lru.
>> 2. The dirty folio is from the PTE dirty bit transferred by
>>     the try_to_unmap().
>>
>> For the first source of the dirty folio, if the dirty folio does not
>> support pageout, the dirty folio can skip unmap in advance to reduce
>> recyling time.
>>
>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>> ---
>>
>> Changelog:
>> v1->v2:
>> 1. Keep the original judgment flow.
>> 2. Add the interface of folio_check_pageout().
>> 3. The dirty folio which does not support pageout in inactive file lru
>>     skip unmap in advance.
>>
>> [... patch snipped, quoted in full above ...]
> I'm confused. Did you apply this on top of v1 by accident?
Hi,
According to my modified mm_vmscan_lru_shrink_inactive test tracelog,
of the 32 scanned inactive file pages, 20 were dirty, and those 20
dirty pages were not reclaimed, yet they still took 20us to go through
try_to_unmap().

I think an unreclaimable dirty folio in the inactive file lru can skip
try_to_unmap(). Please help to continue the review. Thanks.

kswapd0-99      (     99) [005] .....   687.793724: 
mm_vmscan_lru_shrink_inactive: [Justin] nid 0 scan=32 isolate=32 
reclamed=12 nr_dirty=20 nr_unqueued_dirty=20 nr_writeback=0 
nr_congested=0 nr_immediate=0 nr_activate[0]=0 nr_activate[1]=20 
nr_ref_keep=0 nr_unmap_fail=0 priority=2 
file=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC total=39 exe=0 reference_cost=5 
reference_exe=0 unmap_cost=21 unmap_exe=0 dirty_unmap_cost=20 
dirty_unmap_exe=0 pageout_cost=0 pageout_exe=0
>
> -- 
> Cheers,
>
> David / dhildenb
>
  
Zhiguo Jiang Oct. 20, 2023, 4:09 a.m. UTC | #3
On 2023/10/20 11:59, zhiguojiang wrote:
>
>
> On 2023/10/19 22:15, David Hildenbrand wrote:
>>
>> On 19.10.23 15:14, Zhiguo Jiang wrote:
>>> In the shrink_folio_list() the sources of the file dirty folio include
>>> two ways below:
>>> 1. The dirty folio is from the incoming parameter folio_list,
>>>     which is the inactive file lru.
>>> 2. The dirty folio is from the PTE dirty bit transferred by
>>>     the try_to_unmap().
>>>
>>> For the first source of the dirty folio, if the dirty folio does not
>>> support pageout, the dirty folio can skip unmap in advance to reduce
>>> recyling time.
>>>
>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>> ---
>>>
>>> Changelog:
>>> v1->v2:
>>> 1. Keep the original judgment flow.
>>> 2. Add the interface of folio_check_pageout().
>>> 3. The dirty folio which does not support pageout in inactive file lru
>>>     skip unmap in advance.
>>>
>>> [... patch snipped, quoted in full above ...]
>>
>> I'm confused. Did you apply this on top of v1 by accident?
> Hi,
> According to my modified mm_vmscan_lru_shrink_inactive test tracelog, 
> in the 32 scanned inactive file pages, 20 were dirty, and the 20 dirty 
> pages were not reclamed, but they took 20us to perform try_to_unmap.
>
> I think unreclaimed dirty folio in inactive file lru can skip to 
> perform try_to_unmap. Please help to continue review. Thanks.
>
> kswapd0-99      (     99) [005] .....   687.793724: 
> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 scan=32 isolate=32 
> reclamed=12 nr_dirty=20 nr_unqueued_dirty=20 nr_writeback=0 
> nr_congested=0 nr_immediate=0 nr_activate[0]=0 nr_activate[1]=20 
> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
> file=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC total=39 exe=0 reference_cost=5 
> reference_exe=0 unmap_cost=21 unmap_exe=0 dirty_unmap_cost=20 
> dirty_unmap_exe=0 pageout_cost=0 pageout_exe=0
>
To supplement: I think an unreclaimable dirty folio of the inactive file
lru in shrink_folio_list() can exit the recycling flow early and avoid
executing some time-consuming interfaces, such as
folio_check_references() and try_to_unmap().
>> -- 
>> Cheers,
>>
>> David / dhildenb
>>
>
  
Matthew Wilcox Oct. 20, 2023, 4:15 a.m. UTC | #4
On Fri, Oct 20, 2023 at 11:59:33AM +0800, zhiguojiang wrote:
> > > @@ -1261,43 +1305,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > >                       enum ttu_flags flags = TTU_BATCH_FLUSH;
> > >                       bool was_swapbacked = folio_test_swapbacked(folio);
> > > 
> > > -                     if (folio_test_dirty(folio)) {
> > > -                             /*
> > > -                              * Only kswapd can writeback filesystem folios
> > > -                              * to avoid risk of stack overflow. But avoid
> > > -                              * injecting inefficient single-folio I/O into
> > > -                              * flusher writeback as much as possible: only
> > > -                              * write folios when we've encountered many
> > > -                              * dirty folios, and when we've already scanned
> > > -                              * the rest of the LRU for clean folios and see
> > > -                              * the same dirty folios again (with the reclaim
> > > -                              * flag set).
> > > -                              */
> > > -                             if (folio_is_file_lru(folio) &&
> > > -                                     (!current_is_kswapd() ||
> > > -                                      !folio_test_reclaim(folio) ||
> > > -                                      !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
> > > -                                     /*
> > > -                                      * Immediately reclaim when written back.
> > > -                                      * Similar in principle to folio_deactivate()
> > > -                                      * except we already have the folio isolated
> > > -                                      * and know it's dirty
> > > -                                      */
> > > -                                     node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
> > > -                                                     nr_pages);
> > > -                                     folio_set_reclaim(folio);
> > > -
> > > -                                     goto activate_locked;
> > > -                             }
> > > -
> > > -                             if (references == FOLIOREF_RECLAIM_CLEAN)
> > > -                                     goto keep_locked;
> > > -                             if (!may_enter_fs(folio, sc->gfp_mask))
> > > -                                     goto keep_locked;
> > > -                             if (!sc->may_writepage)
> > > -                                     goto keep_locked;
> > > -                     }
> > > -
> > >                       if (folio_test_pmd_mappable(folio))
> > >                               flags |= TTU_SPLIT_HUGE_PMD;
> > > 
> > 
> > I'm confused. Did you apply this on top of v1 by accident?
> Hi,
> According to my modified mm_vmscan_lru_shrink_inactive test tracelog, in the

You're missing David's point.  You've generated this patch against ...
something ... that isn't upstream.  Probably against v1 of your
patch.  Please check your git tree.

> 32 scanned inactive file pages, 20 were dirty, and the 20 dirty pages were
> not reclamed, but they took 20us to perform try_to_unmap.
> 
> I think unreclaimed dirty folio in inactive file lru can skip to perform
> try_to_unmap. Please help to continue review. Thanks.
> 
> kswapd0-99      (     99) [005] .....   687.793724:
> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 scan=32 isolate=32 reclamed=12
> nr_dirty=20 nr_unqueued_dirty=20 nr_writeback=0 nr_congested=0
> nr_immediate=0 nr_activate[0]=0 nr_activate[1]=20 nr_ref_keep=0
> nr_unmap_fail=0 priority=2 file=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC total=39
> exe=0 reference_cost=5 reference_exe=0 unmap_cost=21 unmap_exe=0
> dirty_unmap_cost=20 dirty_unmap_exe=0 pageout_cost=0 pageout_exe=0

Are you seeing measurable changes for any workloads?  It certainly seems
like you should, but it would help if you chose a test from mmtests and
showed how performance changed on your system.
  
Zhiguo Jiang Oct. 20, 2023, 4:36 a.m. UTC | #5
On 2023/10/20 12:15, Matthew Wilcox wrote:
> On Fri, Oct 20, 2023 at 11:59:33AM +0800, zhiguojiang wrote:
>>>> [... patch hunk snipped, quoted in full above ...]
>>> I'm confused. Did you apply this on top of v1 by accident?
>> Hi,
>> According to my modified mm_vmscan_lru_shrink_inactive test tracelog, in the
> You're missing David's point.  You've generated this patch against ...
> something ... that isn't upstream.  Probably against v1 of your
> patch.  Please check your git tree.
Yes, [PATCH v2 1/2] is against my patch v1 index 2cc0cb41fb32..cf555cdfcefc.
>
>> 32 scanned inactive file pages, 20 were dirty, and the 20 dirty pages were
>> not reclamed, but they took 20us to perform try_to_unmap.
>>
>> I think unreclaimed dirty folio in inactive file lru can skip to perform
>> try_to_unmap. Please help to continue review. Thanks.
>>
>> kswapd0-99      (     99) [005] .....   687.793724:
>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 scan=32 isolate=32 reclamed=12
>> nr_dirty=20 nr_unqueued_dirty=20 nr_writeback=0 nr_congested=0
>> nr_immediate=0 nr_activate[0]=0 nr_activate[1]=20 nr_ref_keep=0
>> nr_unmap_fail=0 priority=2 file=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC total=39
>> exe=0 reference_cost=5 reference_exe=0 unmap_cost=21 unmap_exe=0
>> dirty_unmap_cost=20 dirty_unmap_exe=0 pageout_cost=0 pageout_exe=0
> Are you seeing measurable changes for any workloads?  It certainly seems
> like you should, but it would help if you chose a test from mmtests and
> showed how performance changed on your system.
  
Zhiguo Jiang Oct. 23, 2023, 8:07 a.m. UTC | #6
在 2023/10/20 12:15, Matthew Wilcox 写道:
> On Fri, Oct 20, 2023 at 11:59:33AM +0800, zhiguojiang wrote:
>>>> @@ -1261,43 +1305,6 @@ static unsigned int shrink_folio_list(struct
>>>> list_head *folio_list,
>>>>                        enum ttu_flags flags = TTU_BATCH_FLUSH;
>>>>                        bool was_swapbacked =
>>>> folio_test_swapbacked(folio);
>>>>
>>>> -                     if (folio_test_dirty(folio)) {
>>>> -                             /*
>>>> -                              * Only kswapd can writeback
>>>> filesystem folios
>>>> -                              * to avoid risk of stack overflow.
>>>> But avoid
>>>> -                              * injecting inefficient single-folio
>>>> I/O into
>>>> -                              * flusher writeback as much as
>>>> possible: only
>>>> -                              * write folios when we've encountered
>>>> many
>>>> -                              * dirty folios, and when we've
>>>> already scanned
>>>> -                              * the rest of the LRU for clean
>>>> folios and see
>>>> -                              * the same dirty folios again (with
>>>> the reclaim
>>>> -                              * flag set).
>>>> -                              */
>>>> -                             if (folio_is_file_lru(folio) &&
>>>> -                                     (!current_is_kswapd() ||
>>>> - !folio_test_reclaim(folio) ||
>>>> -                                      !test_bit(PGDAT_DIRTY,
>>>> &pgdat->flags))) {
>>>> -                                     /*
>>>> -                                      * Immediately reclaim when
>>>> written back.
>>>> -                                      * Similar in principle to
>>>> folio_deactivate()
>>>> -                                      * except we already have the
>>>> folio isolated
>>>> -                                      * and know it's dirty
>>>> -                                      */
>>>> -                                     node_stat_mod_folio(folio,
>>>> NR_VMSCAN_IMMEDIATE,
>>>> -                                                     nr_pages);
>>>> -                                     folio_set_reclaim(folio);
>>>> -
>>>> -                                     goto activate_locked;
>>>> -                             }
>>>> -
>>>> -                             if (references == FOLIOREF_RECLAIM_CLEAN)
>>>> -                                     goto keep_locked;
>>>> -                             if (!may_enter_fs(folio, sc->gfp_mask))
>>>> -                                     goto keep_locked;
>>>> -                             if (!sc->may_writepage)
>>>> -                                     goto keep_locked;
>>>> -                     }
>>>> -
>>>>                        if (folio_test_pmd_mappable(folio))
>>>>                                flags |= TTU_SPLIT_HUGE_PMD;
>>>>
>>> I'm confused. Did you apply this on top of v1 by accident?
>> Hi,
>> According to my modified mm_vmscan_lru_shrink_inactive test tracelog, in the
> You're missing David's point.  You've generated this patch against ...
> something ... that isn't upstream.  Probably against v1 of your
> patch.  Please check your git tree.
>
>> 32 scanned inactive file pages, 20 were dirty, and the 20 dirty pages were
>> not reclamed, but they took 20us to perform try_to_unmap.
>>
>> I think unreclaimed dirty folio in inactive file lru can skip to perform
>> try_to_unmap. Please help to continue review. Thanks.
>>
>> kswapd0-99      (     99) [005] .....   687.793724:
>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 scan=32 isolate=32 reclamed=12
>> nr_dirty=20 nr_unqueued_dirty=20 nr_writeback=0 nr_congested=0
>> nr_immediate=0 nr_activate[0]=0 nr_activate[1]=20 nr_ref_keep=0
>> nr_unmap_fail=0 priority=2 file=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC total=39
>> exe=0 reference_cost=5 reference_exe=0 unmap_cost=21 unmap_exe=0
>> dirty_unmap_cost=20 dirty_unmap_exe=0 pageout_cost=0 pageout_exe=0
> Are you seeing measurable changes for any workloads?  It certainly seems
> like you should, but it would help if you chose a test from mmtests and
> showed how performance changed on your system.
In one mmtest, the maximum times for an invalid recycling of a folio_list 
dirty folio that does not support pageout and has been activated in 
shrink_folio_list() are: cost=51us, exe=2365us.

Calculated with the formula dirty_cost / total_cost * 100%, the 
recycling efficiency for dirty folios can be improved by 53.13% and 82.95% 
in the two traces, respectively.

So this patch can optimize shrink efficiency and reduce the workload of 
kswapd to a certain extent.

kswapd0-96      (     96) [005] .....   387.218548: 
mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32 
nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0 
nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC 
total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365

kswapd0-96      (     96) [006] .....   412.822532: 
mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32 
nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0 
nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC 
total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
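
As a quick cross-check (a standalone sketch, not part of the patch), the two percentages follow directly from the dirty_cost and total_cost fields in the traces above:

```python
# Recompute "dirty_cost / total_cost * 100%" from the two
# mm_vmscan_lru_shrink_inactive samples quoted above.
samples = [(51, 96), (73, 88)]  # (dirty_cost, total_cost) per trace line
percentages = [dirty / total * 100 for dirty, total in samples]
print([f"{p:.2f}%" for p in percentages])  # ~53.1% and ~82.95%
```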
  
Matthew Wilcox Oct. 23, 2023, 12:21 p.m. UTC | #7
On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
> > Are you seeing measurable changes for any workloads?  It certainly seems
> > like you should, but it would help if you chose a test from mmtests and
> > showed how performance changed on your system.
> In one mmtest, the max times for a invalid recyling of a folio_list dirty
> folio that does not support pageout and has been activated in
> shrink_folio_list() are: cost=51us, exe=2365us.
> 
> Calculate according to this formula: dirty_cost / total_cost * 100%, the
> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
> 
> So this patch can optimize shrink efficiency and reduce the workload of
> kswapd to a certain extent.
> 
> kswapd0-96      (     96) [005] .....   387.218548:
> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
> 
> kswapd0-96      (     96) [006] .....   412.822532:
> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605

I appreciate that you can put probes in and determine the cost, but do
you see improvements for a real workload?  Like doing a kernel compile
-- does it speed up at all?
  
Zhiguo Jiang Oct. 23, 2023, 12:44 p.m. UTC | #8
在 2023/10/23 20:21, Matthew Wilcox 写道:
> On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
>>> Are you seeing measurable changes for any workloads?  It certainly seems
>>> like you should, but it would help if you chose a test from mmtests and
>>> showed how performance changed on your system.
>> In one mmtest, the max times for a invalid recyling of a folio_list dirty
>> folio that does not support pageout and has been activated in
>> shrink_folio_list() are: cost=51us, exe=2365us.
>>
>> Calculate according to this formula: dirty_cost / total_cost * 100%, the
>> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
>>
>> So this patch can optimize shrink efficiency and reduce the workload of
>> kswapd to a certain extent.
>>
>> kswapd0-96      (     96) [005] .....   387.218548:
>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
>> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
>>
>> kswapd0-96      (     96) [006] .....   412.822532:
>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
>> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
> I appreciate that you can put probes in and determine the cost, but do
> you see improvements for a real workload?  Like doing a kernel compile
> -- does it speed up at all?
Can you help share a method for testing thread workload, like kswapd?

Thanks.
  
Matthew Wilcox Oct. 23, 2023, 1:01 p.m. UTC | #9
On Mon, Oct 23, 2023 at 08:44:55PM +0800, zhiguojiang wrote:
> 在 2023/10/23 20:21, Matthew Wilcox 写道:
> > On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
> > > > Are you seeing measurable changes for any workloads?  It certainly seems
> > > > like you should, but it would help if you chose a test from mmtests and
> > > > showed how performance changed on your system.
> > > In one mmtest, the max times for a invalid recyling of a folio_list dirty
> > > folio that does not support pageout and has been activated in
> > > shrink_folio_list() are: cost=51us, exe=2365us.
> > > 
> > > Calculate according to this formula: dirty_cost / total_cost * 100%, the
> > > recyling efficiency of dirty folios can be improved 53.13%、82.95%.
> > > 
> > > So this patch can optimize shrink efficiency and reduce the workload of
> > > kswapd to a certain extent.
> > > 
> > > kswapd0-96      (     96) [005] .....   387.218548:
> > > mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
> > > nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
> > > nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> > > total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
> > > 
> > > kswapd0-96      (     96) [006] .....   412.822532:
> > > mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
> > > nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
> > > nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> > > total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
> > I appreciate that you can put probes in and determine the cost, but do
> > you see improvements for a real workload?  Like doing a kernel compile
> > -- does it speed up at all?
> Can you help share a method for testing thread workload, like kswapd?

Something dirt simple like 'time make -j8'.
  
Zhiguo Jiang Oct. 24, 2023, 2:04 a.m. UTC | #10
在 2023/10/23 21:01, Matthew Wilcox 写道:
> On Mon, Oct 23, 2023 at 08:44:55PM +0800, zhiguojiang wrote:
>> 在 2023/10/23 20:21, Matthew Wilcox 写道:
>>> On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
>>>>> Are you seeing measurable changes for any workloads?  It certainly seems
>>>>> like you should, but it would help if you chose a test from mmtests and
>>>>> showed how performance changed on your system.
>>>> In one mmtest, the max times for a invalid recyling of a folio_list dirty
>>>> folio that does not support pageout and has been activated in
>>>> shrink_folio_list() are: cost=51us, exe=2365us.
>>>>
>>>> Calculate according to this formula: dirty_cost / total_cost * 100%, the
>>>> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
>>>>
>>>> So this patch can optimize shrink efficiency and reduce the workload of
>>>> kswapd to a certain extent.
>>>>
>>>> kswapd0-96      (     96) [005] .....   387.218548:
>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>>>> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
>>>> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
>>>>
>>>> kswapd0-96      (     96) [006] .....   412.822532:
>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>>>> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
>>>> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
>>> I appreciate that you can put probes in and determine the cost, but do
>>> you see improvements for a real workload?  Like doing a kernel compile
>>> -- does it speed up at all?
>> Can you help share a method for testing thread workload, like kswapd?
> Something dirt simple like 'time make -j8'.
Two compilation runs were performed for each configuration; compared 
with the unmodified build, the patched builds show a slight reduction 
in compilation time, as follows:

Compilation command:
make distclean -j8
make ARCH=x86_64 x86_64_defconfig
time make -j8

1.Unmodified Compilation time:
real    2m40.276s
user    16m2.956s
sys     2m14.738s

real    2m40.136s
user    16m2.617s
sys     2m14.722s

2.[Patch v2 1/2] Modified Compilation time:
real    2m40.067s
user    16m3.164s
sys     2m14.211s

real    2m40.123s
user    16m2.439s
sys     2m14.508s

3 [Patch v2 1/2] + [Patch v2 2/2] Modified Compilation time:
real    2m40.367s
user    16m3.738s
sys     2m13.662s

real    2m40.014s
user    16m3.108s
sys     2m14.096s
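
For readability (a quick aside computed from the figures above, not new measurements), the mean "real" time per configuration works out as below; the deltas are on the order of 0.1s, i.e. within run-to-run noise for a ~160s build:

```python
# Mean wall-clock ("real") time per configuration, two runs each,
# taken from the numbers reported above.
runs = {
    "unmodified":      [2 * 60 + 40.276, 2 * 60 + 40.136],
    "patch 1/2":       [2 * 60 + 40.067, 2 * 60 + 40.123],
    "patch 1/2 + 2/2": [2 * 60 + 40.367, 2 * 60 + 40.014],
}
for name, times in runs.items():
    print(f"{name}: {sum(times) / len(times):.3f}s")
```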
  
Zhiguo Jiang Oct. 24, 2023, 2:08 a.m. UTC | #11
在 2023/10/23 21:01, Matthew Wilcox 写道:
> On Mon, Oct 23, 2023 at 08:44:55PM +0800, zhiguojiang wrote:
>> 在 2023/10/23 20:21, Matthew Wilcox 写道:
>>> On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
>>>>> Are you seeing measurable changes for any workloads?  It certainly seems
>>>>> like you should, but it would help if you chose a test from mmtests and
>>>>> showed how performance changed on your system.
>>>> In one mmtest, the max times for a invalid recyling of a folio_list dirty
>>>> folio that does not support pageout and has been activated in
>>>> shrink_folio_list() are: cost=51us, exe=2365us.
>>>>
>>>> Calculate according to this formula: dirty_cost / total_cost * 100%, the
>>>> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
>>>>
>>>> So this patch can optimize shrink efficiency and reduce the workload of
>>>> kswapd to a certain extent.
>>>>
>>>> kswapd0-96      (     96) [005] .....   387.218548:
>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>>>> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
>>>> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
>>>>
>>>> kswapd0-96      (     96) [006] .....   412.822532:
>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>>>> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
>>>> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
>>> I appreciate that you can put probes in and determine the cost, but do
>>> you see improvements for a real workload?  Like doing a kernel compile
>>> -- does it speed up at all?
>> Can you help share a method for testing thread workload, like kswapd?
> Something dirt simple like 'time make -j8'.
Two compilation runs were performed for each configuration; compared 
with the unmodified build, the patched builds show a slight reduction 
in compilation time, as follows:

Compilation command:
make distclean -j8
make ARCH=x86_64 x86_64_defconfig
time make -j8

1.Unmodified Compilation time:
real    2m40.276s
user    16m2.956s
sys     2m14.738s

real    2m40.136s
user    16m2.617s
sys     2m14.722s

2.[Patch v2 1/2] Modified Compilation time:
real    2m40.067s
user    16m3.164s
sys     2m14.211s

real    2m40.123s
user    16m2.439s
sys     2m14.508s

3.[Patch v2 1/2] + [Patch v2 2/2] Modified Compilation time:
real    2m40.367s
user    16m3.738s
sys     2m13.662s

real    2m40.014s
user    16m3.108s
sys     2m14.096s

Thanks
  
David Hildenbrand Oct. 24, 2023, 7:07 a.m. UTC | #12
On 24.10.23 04:04, zhiguojiang wrote:
> 
> 
> 在 2023/10/23 21:01, Matthew Wilcox 写道:
>> On Mon, Oct 23, 2023 at 08:44:55PM +0800, zhiguojiang wrote:
>>> 在 2023/10/23 20:21, Matthew Wilcox 写道:
>>>> On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
>>>>>> Are you seeing measurable changes for any workloads?  It certainly seems
>>>>>> like you should, but it would help if you chose a test from mmtests and
>>>>>> showed how performance changed on your system.
>>>>> In one mmtest, the max times for a invalid recyling of a folio_list dirty
>>>>> folio that does not support pageout and has been activated in
>>>>> shrink_folio_list() are: cost=51us, exe=2365us.
>>>>>
>>>>> Calculate according to this formula: dirty_cost / total_cost * 100%, the
>>>>> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
>>>>>
>>>>> So this patch can optimize shrink efficiency and reduce the workload of
>>>>> kswapd to a certain extent.
>>>>>
>>>>> kswapd0-96      (     96) [005] .....   387.218548:
>>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>>>>> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
>>>>> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>>> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
>>>>>
>>>>> kswapd0-96      (     96) [006] .....   412.822532:
>>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 nr_taken 32
>>>>> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
>>>>> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>>> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
>>>> I appreciate that you can put probes in and determine the cost, but do
>>>> you see improvements for a real workload?  Like doing a kernel compile
>>>> -- does it speed up at all?
>>> Can you help share a method for testing thread workload, like kswapd?
>> Something dirt simple like 'time make -j8'.
> Two compilations were conducted separately, and compared to the
> unmodified compilation,
> the compilation time for adding modified patches had a certain
> reduction, as follows:
> 
> Compilation command:
> make distclean -j8
> make ARCH=x86_64 x86_64_defconfig
> time make -j8
> 
> 1.Unmodified Compilation time:
> real    2m40.276s
> user    16m2.956s
> sys     2m14.738s
> 
> real    2m40.136s
> user    16m2.617s
> sys     2m14.722s
> 
> 2.[Patch v2 1/2] Modified Compilation time:
> real    2m40.067s
> user    16m3.164s
> sys     2m14.211s
> 
> real    2m40.123s
> user    16m2.439s
> sys     2m14.508s
> 
> 3 [Patch v2 1/2] + [Patch v2 2/2] Modified Compilation time:
> real    2m40.367s
> user    16m3.738s
> sys     2m13.662s
> 
> real    2m40.014s
> user    16m3.108s
> sys     2m14.096s
> 

To get expressive numbers two iterations are usually not sufficient. How 
much memory does you system have? Does vmscan even ever get active?
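
To get more samples than the two runs posted, something like the following could be used (an illustrative sketch, not from the thread; the command and iteration count are placeholders):

```python
# Run a build command several times and report the mean elapsed wall-clock
# seconds, so a ~0.1s delta can be judged against run-to-run noise.
import subprocess
import time

def run_bench(n, cmd):
    """Run cmd n times under a shell, return the mean elapsed seconds."""
    total = 0.0
    for _ in range(n):
        start = time.monotonic()
        subprocess.run(cmd, shell=True, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        total += time.monotonic() - start
    return total / n

# Usage (placeholder command): print(run_bench(10, "make -j8"))
```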
  
Zhiguo Jiang Oct. 24, 2023, 7:21 a.m. UTC | #13
在 2023/10/24 15:07, David Hildenbrand 写道:
> On 24.10.23 04:04, zhiguojiang wrote:
>>
>>
>> 在 2023/10/23 21:01, Matthew Wilcox 写道:
>>> On Mon, Oct 23, 2023 at 08:44:55PM +0800, zhiguojiang wrote:
>>>> 在 2023/10/23 20:21, Matthew Wilcox 写道:
>>>>> On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
>>>>>>> Are you seeing measurable changes for any workloads?  It 
>>>>>>> certainly seems
>>>>>>> like you should, but it would help if you chose a test from 
>>>>>>> mmtests and
>>>>>>> showed how performance changed on your system.
>>>>>> In one mmtest, the max times for a invalid recyling of a 
>>>>>> folio_list dirty
>>>>>> folio that does not support pageout and has been activated in
>>>>>> shrink_folio_list() are: cost=51us, exe=2365us.
>>>>>>
>>>>>> Calculate according to this formula: dirty_cost / total_cost * 
>>>>>> 100%, the
>>>>>> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
>>>>>>
>>>>>> So this patch can optimize shrink efficiency and reduce the 
>>>>>> workload of
>>>>>> kswapd to a certain extent.
>>>>>>
>>>>>> kswapd0-96      (     96) [005] .....   387.218548:
>>>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 
>>>>>> nr_taken 32
>>>>>> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
>>>>>> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>>>> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
>>>>>>
>>>>>> kswapd0-96      (     96) [006] .....   412.822532:
>>>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 
>>>>>> nr_taken 32
>>>>>> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
>>>>>> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>>>> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
>>>>> I appreciate that you can put probes in and determine the cost, 
>>>>> but do
>>>>> you see improvements for a real workload?  Like doing a kernel 
>>>>> compile
>>>>> -- does it speed up at all?
>>>> Can you help share a method for testing thread workload, like kswapd?
>>> Something dirt simple like 'time make -j8'.
>> Two compilations were conducted separately, and compared to the
>> unmodified compilation,
>> the compilation time for adding modified patches had a certain
>> reduction, as follows:
>>
>> Compilation command:
>> make distclean -j8
>> make ARCH=x86_64 x86_64_defconfig
>> time make -j8
>>
>> 1.Unmodified Compilation time:
>> real    2m40.276s
>> user    16m2.956s
>> sys     2m14.738s
>>
>> real    2m40.136s
>> user    16m2.617s
>> sys     2m14.722s
>>
>> 2.[Patch v2 1/2] Modified Compilation time:
>> real    2m40.067s
>> user    16m3.164s
>> sys     2m14.211s
>>
>> real    2m40.123s
>> user    16m2.439s
>> sys     2m14.508s
>>
>> 3 [Patch v2 1/2] + [Patch v2 2/2] Modified Compilation time:
>> real    2m40.367s
>> user    16m3.738s
>> sys     2m13.662s
>>
>> real    2m40.014s
>> user    16m3.108s
>> sys     2m14.096s
>>
>
> To get expressive numbers two iterations are usually not sufficient. 
> How much memory does you system have? Does vmscan even ever get active?
Test system memory: MemTotal: 8161608 kB. When multiple apps are 
opened, vmscan does get active. I captured a lot of tracelog data 
during testing; I only posted two sets of it.
  
Zhiguo Jiang Oct. 25, 2023, 3:37 p.m. UTC | #14
在 2023/10/24 15:21, zhiguojiang 写道:
>
>
> 在 2023/10/24 15:07, David Hildenbrand 写道:
>> On 24.10.23 04:04, zhiguojiang wrote:
>>>
>>>
>>> 在 2023/10/23 21:01, Matthew Wilcox 写道:
>>>> On Mon, Oct 23, 2023 at 08:44:55PM +0800, zhiguojiang wrote:
>>>>> 在 2023/10/23 20:21, Matthew Wilcox 写道:
>>>>>> On Mon, Oct 23, 2023 at 04:07:28PM +0800, zhiguojiang wrote:
>>>>>>>> Are you seeing measurable changes for any workloads?  It 
>>>>>>>> certainly seems
>>>>>>>> like you should, but it would help if you chose a test from 
>>>>>>>> mmtests and
>>>>>>>> showed how performance changed on your system.
>>>>>>> In one mmtest, the max times for a invalid recyling of a 
>>>>>>> folio_list dirty
>>>>>>> folio that does not support pageout and has been activated in
>>>>>>> shrink_folio_list() are: cost=51us, exe=2365us.
>>>>>>>
>>>>>>> Calculate according to this formula: dirty_cost / total_cost * 
>>>>>>> 100%, the
>>>>>>> recyling efficiency of dirty folios can be improved 53.13%、82.95%.
>>>>>>>
>>>>>>> So this patch can optimize shrink efficiency and reduce the 
>>>>>>> workload of
>>>>>>> kswapd to a certain extent.
>>>>>>>
>>>>>>> kswapd0-96      (     96) [005] .....   387.218548:
>>>>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 
>>>>>>> nr_taken 32
>>>>>>> nr_reclaimed 31 nr_dirty  1 nr_unqueued_dirty  1 nr_writeback 0
>>>>>>> nr_activate[1]  1 nr_ref_keep  0 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>>>>> total_cost 96 total_exe 2365 dirty_cost 51 total_exe 2365
>>>>>>>
>>>>>>> kswapd0-96      (     96) [006] .....   412.822532:
>>>>>>> mm_vmscan_lru_shrink_inactive: [Justin] nid 0 nr_scanned 32 
>>>>>>> nr_taken 32
>>>>>>> nr_reclaimed  0 nr_dirty 32 nr_unqueued_dirty 32 nr_writeback 0
>>>>>>> nr_activate[1] 19 nr_ref_keep 13 f RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>>>>>> total_cost 88 total_exe 605  dirty_cost 73 total_exe 605
>>>>>> I appreciate that you can put probes in and determine the cost, 
>>>>>> but do
>>>>>> you see improvements for a real workload?  Like doing a kernel 
>>>>>> compile
>>>>>> -- does it speed up at all?
>>>>> Can you help share a method for testing thread workload, like kswapd?
>>>> Something dirt simple like 'time make -j8'.
>>> Two compilations were conducted separately, and compared to the
>>> unmodified compilation,
>>> the compilation time for adding modified patches had a certain
>>> reduction, as follows:
>>>
>>> Compilation command:
>>> make distclean -j8
>>> make ARCH=x86_64 x86_64_defconfig
>>> time make -j8
>>>
>>> 1.Unmodified Compilation time:
>>> real    2m40.276s
>>> user    16m2.956s
>>> sys     2m14.738s
>>>
>>> real    2m40.136s
>>> user    16m2.617s
>>> sys     2m14.722s
>>>
>>> 2.[Patch v2 1/2] Modified Compilation time:
>>> real    2m40.067s
>>> user    16m3.164s
>>> sys     2m14.211s
>>>
>>> real    2m40.123s
>>> user    16m2.439s
>>> sys     2m14.508s
>>>
>>> 3 [Patch v2 1/2] + [Patch v2 2/2] Modified Compilation time:
>>> real    2m40.367s
>>> user    16m3.738s
>>> sys     2m13.662s
>>>
>>> real    2m40.014s
>>> user    16m3.108s
>>> sys     2m14.096s
>>>
>>
>> To get expressive numbers two iterations are usually not sufficient. 
>> How much memory does you system have? Does vmscan even ever get active?
> Test system memory:  MemTotal:    8161608 kB.  When multiple Apps were 
> opened, vmscan can get active. I can capture a lot of tracelog data 
> through testing, I only posted two sets of tracelog data.
Hi, please help to continue reviewing this patch and draw a conclusion on 
whether it can be merged. Thanks.
>
>
  

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a68d01fcc307..e067269275a5 100755
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -925,6 +925,44 @@  static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
+/* Check if a dirty folio can support pageout in the recycling process */
+static bool folio_check_pageout(struct folio *folio,
+						struct pglist_data *pgdat)
+{
+	bool ret = true;
+
+	/*
+	 * Anonymous folios are not handled by flushers and must be written
+	 * from reclaim context. Do not stall reclaim based on them.
+	 * MADV_FREE anonymous folios are put into inactive file list too.
+	 * They could be mistakenly treated as file lru. So further anon
+	 * test is needed.
+	 */
+	if (!folio_is_file_lru(folio) ||
+		(folio_test_anon(folio) && !folio_test_swapbacked(folio)))
+		goto out;
+
+	if (folio_test_dirty(folio) &&
+		(!current_is_kswapd() ||
+		 !folio_test_reclaim(folio) ||
+		 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
+		/*
+		 * Immediately reclaim when written back.
+		 * Similar in principle to folio_deactivate()
+		 * except we already have the folio isolated
+		 * and know it's dirty
+		 */
+		node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
+			folio_nr_pages(folio));
+		folio_set_reclaim(folio);
+
+		ret = false;
+	}
+
+out:
+	return ret;
+}
+
 static struct folio *alloc_demote_folio(struct folio *src,
 		unsigned long private)
 {
@@ -1078,6 +1116,12 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (dirty && !writeback)
 			stat->nr_unqueued_dirty += nr_pages;
 
+		/* If the dirty folio does not support pageout,
+		 * the dirty folio can skip this recycling.
+		 */
+		if (!folio_check_pageout(folio, pgdat))
+			goto activate_locked;
+
 		/*
 		 * Treat this folio as congested if folios are cycling
 		 * through the LRU so quickly that the folios marked
@@ -1261,43 +1305,6 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
 
-			if (folio_test_dirty(folio)) {
-				/*
-				 * Only kswapd can writeback filesystem folios
-				 * to avoid risk of stack overflow. But avoid
-				 * injecting inefficient single-folio I/O into
-				 * flusher writeback as much as possible: only
-				 * write folios when we've encountered many
-				 * dirty folios, and when we've already scanned
-				 * the rest of the LRU for clean folios and see
-				 * the same dirty folios again (with the reclaim
-				 * flag set).
-				 */
-				if (folio_is_file_lru(folio) &&
-					(!current_is_kswapd() ||
-					 !folio_test_reclaim(folio) ||
-					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
-					/*
-					 * Immediately reclaim when written back.
-					 * Similar in principle to folio_deactivate()
-					 * except we already have the folio isolated
-					 * and know it's dirty
-					 */
-					node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
-							nr_pages);
-					folio_set_reclaim(folio);
-
-					goto activate_locked;
-				}
-
-				if (references == FOLIOREF_RECLAIM_CLEAN)
-					goto keep_locked;
-				if (!may_enter_fs(folio, sc->gfp_mask))
-					goto keep_locked;
-				if (!sc->may_writepage)
-					goto keep_locked;
-			}
-
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
 
@@ -1323,6 +1330,28 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 		mapping = folio_mapping(folio);
 		if (folio_test_dirty(folio)) {
+			/*
+			 * Only kswapd can writeback filesystem folios
+			 * to avoid risk of stack overflow. But avoid
+			 * injecting inefficient single-folio I/O into
+			 * flusher writeback as much as possible: only
+			 * write folios when we've encountered many
+			 * dirty folios, and when we've already scanned
+			 * the rest of the LRU for clean folios and see
+			 * the same dirty folios again (with the reclaim
+			 * flag set).
+			 */
+			if (folio_is_file_lru(folio) &&
+				!folio_check_pageout(folio, pgdat))
+				goto activate_locked;
+
+			if (references == FOLIOREF_RECLAIM_CLEAN)
+				goto keep_locked;
+			if (!may_enter_fs(folio, sc->gfp_mask))
+				goto keep_locked;
+			if (!sc->may_writepage)
+				goto keep_locked;
+
 			/*
 			 * Folio is dirty. Flush the TLB if a writable entry
 			 * potentially exists to avoid CPU writes after I/O