diff mbox series

[v2] mm: migrate: Fix THP's mapcount on isolation

Message ID	20221124095523.31061-1-gshan@redhat.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Gavin Shan <gshan@redhat.com> To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org, william.kucharski@oracle.com, ziy@nvidia.com, kirill.shutemov@linux.intel.com, david@redhat.com, zhenyzha@redhat.com, apopple@nvidia.com, hughd@google.com, willy@infradead.org, shan.gavin@gmail.com Subject: [PATCH v2] mm: migrate: Fix THP's mapcount on isolation Date: Thu, 24 Nov 2022 17:55:23 +0800 Message-Id: <20221124095523.31061-1-gshan@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v2] mm: migrate: Fix THP's mapcount on isolation \| [v2] mm: migrate: Fix THP's mapcount on isolation

Commit Message

Gavin Shan Nov. 24, 2022, 9:55 a.m. UTC

  The issue is reported when removing memory through virtio_mem device.
The transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page
can't be migrated and the corresponding memory block can't be put
into offline state.

Fix it by replacing page_mapcount() with total_mapcount(). With this,
the transparent huge page can be isolated and migrated, and the memory
block can be put into offline state. Besides, The page's refcount is
increased a bit earlier to avoid the page is released when the check
is executed.

Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Cc: stable@vger.kernel.org   # v5.7+
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
v2: Corrected fix tag and increase page's refcount before the check
---
 mm/compaction.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

Comments

David Hildenbrand Nov. 24, 2022, 10:09 a.m. UTC | #1

On 24.11.22 10:55, Gavin Shan wrote:
> The issue is reported when removing memory through virtio_mem device.
> The transparent huge page, experienced copy-on-write fault, is wrongly
> regarded as pinned. The transparent huge page is escaped from being
> isolated in isolate_migratepages_block(). The transparent huge page
> can't be migrated and the corresponding memory block can't be put
> into offline state.
> 
> Fix it by replacing page_mapcount() with total_mapcount(). With this,
> the transparent huge page can be isolated and migrated, and the memory
> block can be put into offline state. Besides, The page's refcount is
> increased a bit earlier to avoid the page is released when the check
> is executed.

Did you look into handling pages that are in the swapcache case as well?

See is_refcount_suitable() in mm/khugepaged.c.

Should be easy to reproduce, let me know if you need inspiration.

> 
> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
> Cc: stable@vger.kernel.org   # v5.7+
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> v2: Corrected fix tag and increase page's refcount before the check
> ---
>   mm/compaction.c | 22 +++++++++++-----------
>   1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c51f7f545afe..1f6da31dd9a5 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			goto isolate_fail;
>   		}
>   
> +		/*
> +		 * Be careful not to clear PageLRU until after we're
> +		 * sure the page is not being freed elsewhere -- the
> +		 * page release code relies on it.
> +		 */
> +		if (unlikely(!get_page_unless_zero(page)))
> +			goto isolate_fail;
> +
>   		/*
>   		 * Migration will fail if an anonymous page is pinned in memory,
>   		 * so avoid taking lru_lock and isolating it unnecessarily in an
>   		 * admittedly racy check.
>   		 */
>   		mapping = page_mapping(page);
> -		if (!mapping && page_count(page) > page_mapcount(page))
> -			goto isolate_fail;
> +		if (!mapping && (page_count(page) - 1) > total_mapcount(page))
> +			goto isolate_fail_put;
>   
>   		/*
>   		 * Only allow to migrate anonymous pages in GFP_NOFS context
>   		 * because those do not depend on fs locks.
>   		 */
>   		if (!(cc->gfp_mask & __GFP_FS) && mapping)
> -			goto isolate_fail;
> -
> -		/*
> -		 * Be careful not to clear PageLRU until after we're
> -		 * sure the page is not being freed elsewhere -- the
> -		 * page release code relies on it.
> -		 */
> -		if (unlikely(!get_page_unless_zero(page)))
> -			goto isolate_fail;
> +			goto isolate_fail_put;
>   
>   		/* Only take pages on LRU: a check now makes later tests safe */
>   		if (!PageLRU(page))

Gavin Shan Nov. 24, 2022, 10:21 a.m. UTC | #2

On 11/24/22 6:09 PM, David Hildenbrand wrote:
> On 24.11.22 10:55, Gavin Shan wrote:
>> The issue is reported when removing memory through virtio_mem device.
>> The transparent huge page, experienced copy-on-write fault, is wrongly
>> regarded as pinned. The transparent huge page is escaped from being
>> isolated in isolate_migratepages_block(). The transparent huge page
>> can't be migrated and the corresponding memory block can't be put
>> into offline state.
>>
>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>> the transparent huge page can be isolated and migrated, and the memory
>> block can be put into offline state. Besides, The page's refcount is
>> increased a bit earlier to avoid the page is released when the check
>> is executed.
> 
> Did you look into handling pages that are in the swapcache case as well?
> 
> See is_refcount_suitable() in mm/khugepaged.c.
> 
> Should be easy to reproduce, let me know if you need inspiration.
> 

Nope, I didn't look into the case. Please elaborate the details so that
I can reproduce it firstly.

>>
>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
>> Cc: stable@vger.kernel.org   # v5.7+
>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>> v2: Corrected fix tag and increase page's refcount before the check
>> ---
>>   mm/compaction.c | 22 +++++++++++-----------
>>   1 file changed, 11 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index c51f7f545afe..1f6da31dd9a5 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>               goto isolate_fail;
>>           }
>> +        /*
>> +         * Be careful not to clear PageLRU until after we're
>> +         * sure the page is not being freed elsewhere -- the
>> +         * page release code relies on it.
>> +         */
>> +        if (unlikely(!get_page_unless_zero(page)))
>> +            goto isolate_fail;
>> +
>>           /*
>>            * Migration will fail if an anonymous page is pinned in memory,
>>            * so avoid taking lru_lock and isolating it unnecessarily in an
>>            * admittedly racy check.
>>            */
>>           mapping = page_mapping(page);
>> -        if (!mapping && page_count(page) > page_mapcount(page))
>> -            goto isolate_fail;
>> +        if (!mapping && (page_count(page) - 1) > total_mapcount(page))
>> +            goto isolate_fail_put;
>>           /*
>>            * Only allow to migrate anonymous pages in GFP_NOFS context
>>            * because those do not depend on fs locks.
>>            */
>>           if (!(cc->gfp_mask & __GFP_FS) && mapping)
>> -            goto isolate_fail;
>> -
>> -        /*
>> -         * Be careful not to clear PageLRU until after we're
>> -         * sure the page is not being freed elsewhere -- the
>> -         * page release code relies on it.
>> -         */
>> -        if (unlikely(!get_page_unless_zero(page)))
>> -            goto isolate_fail;
>> +            goto isolate_fail_put;
>>           /* Only take pages on LRU: a check now makes later tests safe */
>>           if (!PageLRU(page))
> 

Thanks,
Gavin

David Hildenbrand Nov. 24, 2022, 10:43 a.m. UTC | #3

On 24.11.22 11:21, Gavin Shan wrote:
> On 11/24/22 6:09 PM, David Hildenbrand wrote:
>> On 24.11.22 10:55, Gavin Shan wrote:
>>> The issue is reported when removing memory through virtio_mem device.
>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>> regarded as pinned. The transparent huge page is escaped from being
>>> isolated in isolate_migratepages_block(). The transparent huge page
>>> can't be migrated and the corresponding memory block can't be put
>>> into offline state.
>>>
>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>> the transparent huge page can be isolated and migrated, and the memory
>>> block can be put into offline state. Besides, The page's refcount is
>>> increased a bit earlier to avoid the page is released when the check
>>> is executed.
>>
>> Did you look into handling pages that are in the swapcache case as well?
>>
>> See is_refcount_suitable() in mm/khugepaged.c.
>>
>> Should be easy to reproduce, let me know if you need inspiration.
>>
> 
> Nope, I didn't look into the case. Please elaborate the details so that
> I can reproduce it firstly.


A simple reproducer would be (on a system with ordinary swap (not zram))

1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP

2) Enable THP for that region (MADV_HUGEPAGE)

3) Populate a THP (e.g., write access)

4) PTE-map the THP, for example, using MADV_FREE on the last subpage

5) Trigger swapout of the THP, for example, using MADV_PAGEOUT

6) Read-access to some subpages to fault them in from the swapcache


Now you'd have a THP, which

1) Is partially PTE-mapped into the page table
2) Is in the swapcache (each subpage should have one reference from the 
swapache)


Now we could test, if alloc_contig_range() will still succeed (e.g., 
using virtio-mem).

Zi Yan Nov. 24, 2022, 12:38 p.m. UTC | #4

On 24 Nov 2022, at 5:43, David Hildenbrand wrote:

> On 24.11.22 11:21, Gavin Shan wrote:
>> On 11/24/22 6:09 PM, David Hildenbrand wrote:
>>> On 24.11.22 10:55, Gavin Shan wrote:
>>>> The issue is reported when removing memory through virtio_mem device.
>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>> regarded as pinned. The transparent huge page is escaped from being
>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>> can't be migrated and the corresponding memory block can't be put
>>>> into offline state.
>>>>
>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>> the transparent huge page can be isolated and migrated, and the memory
>>>> block can be put into offline state. Besides, The page's refcount is
>>>> increased a bit earlier to avoid the page is released when the check
>>>> is executed.
>>>
>>> Did you look into handling pages that are in the swapcache case as well?
>>>
>>> See is_refcount_suitable() in mm/khugepaged.c.
>>>
>>> Should be easy to reproduce, let me know if you need inspiration.
>>>
>>
>> Nope, I didn't look into the case. Please elaborate the details so that
>> I can reproduce it firstly.
>
>
> A simple reproducer would be (on a system with ordinary swap (not zram))
>
> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP
>
> 2) Enable THP for that region (MADV_HUGEPAGE)
>
> 3) Populate a THP (e.g., write access)
>
> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage
>
> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT

Added the original THP swapout code author, Ying.

At this step, the THP will be split, right?

https://elixir.bootlin.com/linux/latest/source/mm/vmscan.c#L1786

Even if a THP has PMD mapping, IIRC, it is split in the add_to_swap()
then swapped out. But I cannot find that split code now.


>
> 6) Read-access to some subpages to fault them in from the swapcache
>
>
> Now you'd have a THP, which
>
> 1) Is partially PTE-mapped into the page table
> 2) Is in the swapcache (each subpage should have one reference from the swapache)
>
>
> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem).
>
> -- 
> Thanks,
>
> David / dhildenb

--
Best Regards,
Yan Zi

Gavin Shan Nov. 24, 2022, 12:55 p.m. UTC | #5

On 11/24/22 6:43 PM, David Hildenbrand wrote:
> On 24.11.22 11:21, Gavin Shan wrote:
>> On 11/24/22 6:09 PM, David Hildenbrand wrote:
>>> On 24.11.22 10:55, Gavin Shan wrote:
>>>> The issue is reported when removing memory through virtio_mem device.
>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>> regarded as pinned. The transparent huge page is escaped from being
>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>> can't be migrated and the corresponding memory block can't be put
>>>> into offline state.
>>>>
>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>> the transparent huge page can be isolated and migrated, and the memory
>>>> block can be put into offline state. Besides, The page's refcount is
>>>> increased a bit earlier to avoid the page is released when the check
>>>> is executed.
>>>
>>> Did you look into handling pages that are in the swapcache case as well?
>>>
>>> See is_refcount_suitable() in mm/khugepaged.c.
>>>
>>> Should be easy to reproduce, let me know if you need inspiration.
>>>
>>
>> Nope, I didn't look into the case. Please elaborate the details so that
>> I can reproduce it firstly.
> 
> 
> A simple reproducer would be (on a system with ordinary swap (not zram))
> 
> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP
> 
> 2) Enable THP for that region (MADV_HUGEPAGE)
> 
> 3) Populate a THP (e.g., write access)
> 
> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage
> 
> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT
> 
> 6) Read-access to some subpages to fault them in from the swapcache
> 
> 
> Now you'd have a THP, which
> 
> 1) Is partially PTE-mapped into the page table
> 2) Is in the swapcache (each subpage should have one reference from the swapache)
> 
> 
> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem).
> 

Thanks for the details. Step (4) and (5) can be actually combined. To swap part of
the THP (e.g. one sub-page) will force the THP to be split.

I followed your steps in the attached program, there is no issue to do memory hot-remove
through virtio-mem with or without this patch.

    # numactl -p 1 testsuite mm swap -k
    Any key to split THP
    Any key to swap sub-pages
    Any key to read the swapped sub-pages
        Page[000]: 0xffffffffffffffff
        Page[001]: 0xffffffffffffffff
          :
        Page[255]: 0xffffffffffffffff
    Any key to exit                                // hold here and the program doesn't exit

    (qemu) qom-set vm1 requested-size 0
    [  356.005396] virtio_mem virtio1: plugged size: 0x40000000
    [  356.005996] virtio_mem virtio1: requested size: 0x0
    [  356.350299] Fallback order for Node 0: 0 1
    [  356.350810] Fallback order for Node 1: 1 0
    [  356.351260] Built 2 zonelists, mobility grouping on.  Total pages: 491343
    [  356.351998] Policy zone: DMA

Thanks,
Gavin

David Hildenbrand Nov. 24, 2022, 1:20 p.m. UTC | #6

On 24.11.22 13:38, Zi Yan wrote:
> 
> On 24 Nov 2022, at 5:43, David Hildenbrand wrote:
> 
>> On 24.11.22 11:21, Gavin Shan wrote:
>>> On 11/24/22 6:09 PM, David Hildenbrand wrote:
>>>> On 24.11.22 10:55, Gavin Shan wrote:
>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>> can't be migrated and the corresponding memory block can't be put
>>>>> into offline state.
>>>>>
>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>> block can be put into offline state. Besides, The page's refcount is
>>>>> increased a bit earlier to avoid the page is released when the check
>>>>> is executed.
>>>>
>>>> Did you look into handling pages that are in the swapcache case as well?
>>>>
>>>> See is_refcount_suitable() in mm/khugepaged.c.
>>>>
>>>> Should be easy to reproduce, let me know if you need inspiration.
>>>>
>>>
>>> Nope, I didn't look into the case. Please elaborate the details so that
>>> I can reproduce it firstly.
>>
>>
>> A simple reproducer would be (on a system with ordinary swap (not zram))
>>
>> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP
>>
>> 2) Enable THP for that region (MADV_HUGEPAGE)
>>
>> 3) Populate a THP (e.g., write access)
>>
>> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage
>>
>> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT
> 
> Added the original THP swapout code author, Ying.
> 
> At this step, the THP will be split, right?
> 
> https://elixir.bootlin.com/linux/latest/source/mm/vmscan.c#L1786
> 
> Even if a THP has PMD mapping, IIRC, it is split in the add_to_swap()
> then swapped out. But I cannot find that split code now.

I recall there was some sequence to achieve it. Maybe it was
swapping out the PMD first and not triggering a PTE-mapping first.

mm/vmscan.c:shrink_folio_list()

if (folio_test_large(folio)) {
	/* cannot split folio, skip it */
	if (!can_split_folio(folio, NULL))
		goto activate_locked;
	/*
	 * Split folios without a PMD map right
	 * away. Chances are some or all of the
	 * tail pages can be freed without IO.
	 */
	if (!folio_entire_mapcount(folio) &&
	    split_folio_to_list(folio, folio_list))
		goto activate_locked;
	}
}

So the sequence might have to be

1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP

2) Enable THP for that region (MADV_HUGEPAGE)

3) Populate a THP (e.g., write access)

4) Trigger swapout of the THP, for example, using MADV_PAGEOUT

5) Access some subpage

As we don't have PMD swap entries, we will PTE-map the
THP during try_to_unmap() IIRC.



Independent of that, the check we have here also doesn't consider
ordinary order-0 pages that might be in the swapache.

David Hildenbrand Nov. 24, 2022, 1:22 p.m. UTC | #7

On 24.11.22 13:55, Gavin Shan wrote:
> On 11/24/22 6:43 PM, David Hildenbrand wrote:
>> On 24.11.22 11:21, Gavin Shan wrote:
>>> On 11/24/22 6:09 PM, David Hildenbrand wrote:
>>>> On 24.11.22 10:55, Gavin Shan wrote:
>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>> can't be migrated and the corresponding memory block can't be put
>>>>> into offline state.
>>>>>
>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>> block can be put into offline state. Besides, The page's refcount is
>>>>> increased a bit earlier to avoid the page is released when the check
>>>>> is executed.
>>>>
>>>> Did you look into handling pages that are in the swapcache case as well?
>>>>
>>>> See is_refcount_suitable() in mm/khugepaged.c.
>>>>
>>>> Should be easy to reproduce, let me know if you need inspiration.
>>>>
>>>
>>> Nope, I didn't look into the case. Please elaborate the details so that
>>> I can reproduce it firstly.
>>
>>
>> A simple reproducer would be (on a system with ordinary swap (not zram))
>>
>> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP
>>
>> 2) Enable THP for that region (MADV_HUGEPAGE)
>>
>> 3) Populate a THP (e.g., write access)
>>
>> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage
>>
>> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT
>>
>> 6) Read-access to some subpages to fault them in from the swapcache
>>
>>
>> Now you'd have a THP, which
>>
>> 1) Is partially PTE-mapped into the page table
>> 2) Is in the swapcache (each subpage should have one reference from the swapache)
>>
>>
>> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem).
>>
> 
> Thanks for the details. Step (4) and (5) can be actually combined. To swap part of
> the THP (e.g. one sub-page) will force the THP to be split.
> 
> I followed your steps in the attached program, there is no issue to do memory hot-remove
> through virtio-mem with or without this patch.

Interesting. But I don't really see how we could pass this check with a 
page that's in the swapcache, maybe I'm missing something else.

I'll try to see if I can reproduce it.

David Hildenbrand Nov. 24, 2022, 2:09 p.m. UTC | #8

On 24.11.22 14:22, David Hildenbrand wrote:
> On 24.11.22 13:55, Gavin Shan wrote:
>> On 11/24/22 6:43 PM, David Hildenbrand wrote:
>>> On 24.11.22 11:21, Gavin Shan wrote:
>>>> On 11/24/22 6:09 PM, David Hildenbrand wrote:
>>>>> On 24.11.22 10:55, Gavin Shan wrote:
>>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>>> can't be migrated and the corresponding memory block can't be put
>>>>>> into offline state.
>>>>>>
>>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>>> block can be put into offline state. Besides, The page's refcount is
>>>>>> increased a bit earlier to avoid the page is released when the check
>>>>>> is executed.
>>>>>
>>>>> Did you look into handling pages that are in the swapcache case as well?
>>>>>
>>>>> See is_refcount_suitable() in mm/khugepaged.c.
>>>>>
>>>>> Should be easy to reproduce, let me know if you need inspiration.
>>>>>
>>>>
>>>> Nope, I didn't look into the case. Please elaborate the details so that
>>>> I can reproduce it firstly.
>>>
>>>
>>> A simple reproducer would be (on a system with ordinary swap (not zram))
>>>
>>> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP
>>>
>>> 2) Enable THP for that region (MADV_HUGEPAGE)
>>>
>>> 3) Populate a THP (e.g., write access)
>>>
>>> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage
>>>
>>> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT
>>>
>>> 6) Read-access to some subpages to fault them in from the swapcache
>>>
>>>
>>> Now you'd have a THP, which
>>>
>>> 1) Is partially PTE-mapped into the page table
>>> 2) Is in the swapcache (each subpage should have one reference from the swapache)
>>>
>>>
>>> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem).
>>>
>>
>> Thanks for the details. Step (4) and (5) can be actually combined. To swap part of
>> the THP (e.g. one sub-page) will force the THP to be split.
>>
>> I followed your steps in the attached program, there is no issue to do memory hot-remove
>> through virtio-mem with or without this patch.
> 
> Interesting. But I don't really see how we could pass this check with a
> page that's in the swapcache, maybe I'm missing something else.
> 
> I'll try to see if I can reproduce it.
> 

After some unsuccessful attempts and many head-scratches, I realized 
that it's quite simple why we don't have to worry about swapcache pages 
here:

page_mapping() is != NULL for pages in the swapcache: folio_mapping() 
makes this rather obvious:

if (unlikely(folio_test_swapcache(folio))
	return swap_address_space(folio_swap_entry(folio));


I think the get_page_unless_zero() might also be a fix for the 
page_mapping() call, smells like something could blow up on concurrent 
page freeing. (what about concurrent removal from the swapcache? nobody 
knows :) )


Thanks Gavin!

Acked-by: David Hildenbrand <david@redhat.com>

Zhenyu Zhang Nov. 25, 2022, 7:40 a.m. UTC | #9

With the patch applied, I'm unable to hit memory hot-remove failure in
the environment where the issue was initially found.

Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>

On Thu, Nov 24, 2022 at 10:09 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.11.22 14:22, David Hildenbrand wrote:
> > On 24.11.22 13:55, Gavin Shan wrote:
> >> On 11/24/22 6:43 PM, David Hildenbrand wrote:
> >>> On 24.11.22 11:21, Gavin Shan wrote:
> >>>> On 11/24/22 6:09 PM, David Hildenbrand wrote:
> >>>>> On 24.11.22 10:55, Gavin Shan wrote:
> >>>>>> The issue is reported when removing memory through virtio_mem device.
> >>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
> >>>>>> regarded as pinned. The transparent huge page is escaped from being
> >>>>>> isolated in isolate_migratepages_block(). The transparent huge page
> >>>>>> can't be migrated and the corresponding memory block can't be put
> >>>>>> into offline state.
> >>>>>>
> >>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
> >>>>>> the transparent huge page can be isolated and migrated, and the memory
> >>>>>> block can be put into offline state. Besides, The page's refcount is
> >>>>>> increased a bit earlier to avoid the page is released when the check
> >>>>>> is executed.
> >>>>>
> >>>>> Did you look into handling pages that are in the swapcache case as well?
> >>>>>
> >>>>> See is_refcount_suitable() in mm/khugepaged.c.
> >>>>>
> >>>>> Should be easy to reproduce, let me know if you need inspiration.
> >>>>>
> >>>>
> >>>> Nope, I didn't look into the case. Please elaborate the details so that
> >>>> I can reproduce it firstly.
> >>>
> >>>
> >>> A simple reproducer would be (on a system with ordinary swap (not zram))
> >>>
> >>> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP
> >>>
> >>> 2) Enable THP for that region (MADV_HUGEPAGE)
> >>>
> >>> 3) Populate a THP (e.g., write access)
> >>>
> >>> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage
> >>>
> >>> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT
> >>>
> >>> 6) Read-access to some subpages to fault them in from the swapcache
> >>>
> >>>
> >>> Now you'd have a THP, which
> >>>
> >>> 1) Is partially PTE-mapped into the page table
> >>> 2) Is in the swapcache (each subpage should have one reference from the swapache)
> >>>
> >>>
> >>> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem).
> >>>
> >>
> >> Thanks for the details. Step (4) and (5) can be actually combined. To swap part of
> >> the THP (e.g. one sub-page) will force the THP to be split.
> >>
> >> I followed your steps in the attached program, there is no issue to do memory hot-remove
> >> through virtio-mem with or without this patch.
> >
> > Interesting. But I don't really see how we could pass this check with a
> > page that's in the swapcache, maybe I'm missing something else.
> >
> > I'll try to see if I can reproduce it.
> >
>
> After some unsuccessful attempts and many head-scratches, I realized
> that it's quite simple why we don't have to worry about swapcache pages
> here:
>
> page_mapping() is != NULL for pages in the swapcache: folio_mapping()
> makes this rather obvious:
>
> if (unlikely(folio_test_swapcache(folio))
>         return swap_address_space(folio_swap_entry(folio));
>
>
> I think the get_page_unless_zero() might also be a fix for the
> page_mapping() call, smells like something could blow up on concurrent
> page freeing. (what about concurrent removal from the swapcache? nobody
> knows :) )
>
>
> Thanks Gavin!
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
>
> --
> Thanks,
>
> David / dhildenb
>

diff mbox series

Patch

diff --git a/mm/compaction.c b/mm/compaction.c
index c51f7f545afe..1f6da31dd9a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -984,29 +984,29 @@  isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			goto isolate_fail;
 		}
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
 		/*
 		 * Migration will fail if an anonymous page is pinned in memory,
 		 * so avoid taking lru_lock and isolating it unnecessarily in an
 		 * admittedly racy check.
 		 */
 		mapping = page_mapping(page);
-		if (!mapping && page_count(page) > page_mapcount(page))
-			goto isolate_fail;
+		if (!mapping && (page_count(page) - 1) > total_mapcount(page))
+			goto isolate_fail_put;
 
 		/*
 		 * Only allow to migrate anonymous pages in GFP_NOFS context
 		 * because those do not depend on fs locks.
 		 */
 		if (!(cc->gfp_mask & __GFP_FS) && mapping)
-			goto isolate_fail;
-
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		if (unlikely(!get_page_unless_zero(page)))
-			goto isolate_fail;
+			goto isolate_fail_put;
 
 		/* Only take pages on LRU: a check now makes later tests safe */
 		if (!PageLRU(page))