mm: migrate: Fix THP's mapcount on isolation

Message ID 20221123005752.161003-1-gshan@redhat.com
State New
Series mm: migrate: Fix THP's mapcount on isolation

Commit Message

Gavin Shan Nov. 23, 2022, 12:57 a.m. UTC
  The issue is reported when removing memory through a virtio_mem device.
A transparent huge page that experienced a copy-on-write fault is wrongly
regarded as pinned, and so escapes isolation in
isolate_migratepages_block(). The transparent huge page then can't be
migrated and the corresponding memory block can't be put into the
offline state.

Fix it by replacing page_mapcount() with total_mapcount(). With this,
the transparent huge page can be isolated and migrated, and the memory
block can be put into the offline state.
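
Illustrative arithmetic (a sketch only, assuming an anon THP of N
subpages that ended up PTE-mapped by a single process after the CoW
fault, with no other references such as swapcache or GUP pins):

    page_count(head)     ~= N   /* each PTE-mapped subpage holds a reference */
    page_mapcount(head)  ==  0  /* counts only PMD mappings of the head page */
    total_mapcount(head) ==  N  /* PMD mappings plus subpage PTE mappings    */

    /* old check: N > 0  -> wrongly treated as pinned, not isolated */
    /* new check: N > N  -> false, so the THP can be isolated       */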

Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
Cc: stable@vger.kernel.org   # v5.8+
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Alistair Popple Nov. 23, 2022, 4:26 a.m. UTC | #1
Gavin Shan <gshan@redhat.com> writes:

> The issue is reported when removing memory through virtio_mem device.
> The transparent huge page, experienced copy-on-write fault, is wrongly
> regarded as pinned. The transparent huge page is escaped from being
> isolated in isolate_migratepages_block(). The transparent huge page
> can't be migrated and the corresponding memory block can't be put
> into offline state.
>
> Fix it by replacing page_mapcount() with total_mapcount(). With this,
> the transparent huge page can be isolated and migrated, and the memory
> block can be put into offline state.
>
> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> Cc: stable@vger.kernel.org   # v5.8+
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c51f7f545afe..c408b5e04c1d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -990,7 +990,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		 * admittedly racy check.
>  		 */
>  		mapping = page_mapping(page);
> -		if (!mapping && page_count(page) > page_mapcount(page))
> +		if (!mapping && page_count(page) > total_mapcount(page))

We have several versions of these checks for pinned pages open-coded
around the place. See for example migrate_vma_check_page() and
folio_expected_refs(). It looks like you could use a variant of
migrate_vma_check_page() which would also check non-anon pins, although
I don't know the compaction code well enough to know if that's useful.

Either way it would be nice if we had a common helper for these kind of
checks. Guess that would be harder to backport, and the change itself
looks ok. But why isn't the fixes tag 119d6d59dcc0 ("mm, compaction:
avoid isolating pinned pages")?
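
For illustration, a common helper along those lines might look roughly
like the sketch below (hypothetical code, not an existing kernel API;
the name and the swapcache accounting are assumptions based on the
helpers mentioned above):

    /* racy unless the caller holds a reference on the folio */
    static bool folio_maybe_extra_referenced(struct folio *folio)
    {
            long expected = folio_mapcount(folio);  /* PMD + PTE mappings */

            /* each subpage may hold an extra swapcache reference */
            if (folio_test_swapcache(folio))
                    expected += folio_nr_pages(folio);

            return folio_ref_count(folio) > expected;
    }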

>  			goto isolate_fail;
>  
>  		/*
  
Gavin Shan Nov. 23, 2022, 5:06 a.m. UTC | #2
Hi Alistair,

On 11/23/22 12:26 PM, Alistair Popple wrote:
> Gavin Shan <gshan@redhat.com> writes:
> 
>> The issue is reported when removing memory through virtio_mem device.
>> The transparent huge page, experienced copy-on-write fault, is wrongly
>> regarded as pinned. The transparent huge page is escaped from being
>> isolated in isolate_migratepages_block(). The transparent huge page
>> can't be migrated and the corresponding memory block can't be put
>> into offline state.
>>
>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>> the transparent huge page can be isolated and migrated, and the memory
>> block can be put into offline state.
>>
>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>> Cc: stable@vger.kernel.org   # v5.8+
>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   mm/compaction.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index c51f7f545afe..c408b5e04c1d 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -990,7 +990,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   		 * admittedly racy check.
>>   		 */
>>   		mapping = page_mapping(page);
>> -		if (!mapping && page_count(page) > page_mapcount(page))
>> +		if (!mapping && page_count(page) > total_mapcount(page))
> 
> We have several versions of these checks for pinned pages open-coded
> around the place. See for example migrate_vma_check_page() and
> folio_expected_refs(). It looks like you could use a variant of
> migrate_vma_check_page() which would also check non-anon pins, although
> I don't know the compaction code well enough to know if that's useful.
> 
> Either way it would be nice if we had a common helper for these kind of
> checks. Guess that would be harder to backport, and the change itself
> looks ok. But why isn't the fixes tag 119d6d59dcc0 ("mm, compaction:
> avoid isolating pinned pages")?
> 

Nice to see your comments. folio_expected_refs() only returns the
mapcount for file-mapped pages. migrate_vma_check_page() doesn't
cover THP and there is a 'FIXME' there. A unified helper would benefit
maintainability. I think the issue can be fixed in place, with a
follow-up patch introducing the unified helper, to keep backporting
simple.

Right, I was thinking of 119d6d59dcc0, which was merged back in the
early days, in v3.15. I doubt we even had THP support at that time.
3917c80280c9 changed the behavior of THP CoW handling to fall back to
splitting the huge PMD, leaving the THP PTE-mapped. Without
3917c80280c9, no such splitting is expected and the original condition
is correct for checking anon pins.

>>   			goto isolate_fail;
>>   
>>   		/*
> 

Thanks,
Gavin
  
Hugh Dickins Nov. 23, 2022, 5:14 a.m. UTC | #3
On Wed, 23 Nov 2022, Gavin Shan wrote:

> The issue is reported when removing memory through virtio_mem device.
> The transparent huge page, experienced copy-on-write fault, is wrongly
> regarded as pinned. The transparent huge page is escaped from being
> isolated in isolate_migratepages_block(). The transparent huge page
> can't be migrated and the corresponding memory block can't be put
> into offline state.
> 
> Fix it by replacing page_mapcount() with total_mapcount(). With this,
> the transparent huge page can be isolated and migrated, and the memory
> block can be put into offline state.
> 
> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> Cc: stable@vger.kernel.org   # v5.8+
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>

Interesting, good catch, looked right to me: except for the Fixes line
and mention of v5.8.  That CoW change may have added a case which easily
demonstrates the problem, but it would have been the wrong test on a THP
for long before then - but only in v5.7 were compound pages allowed
through at all to reach that test, so I think it should be

Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Cc: stable@vger.kernel.org   # v5.7+

Oh, no, stop: this is not so easy, even in the latest tree.

Because at the time of that "admittedly racy check", we have no hold
at all on the page in question: and if it's PageLRU or PageCompound
at one instant, it may be different the next instant.  Which leaves it
vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
path - needs research.  *Perhaps* there are no more BUG_ON()s in the
total_mapcount() path than in the existing page_mapcount() path.

I suspect that for this to be safe (before your patch and more so after),
it will be necessary to shift the "admittedly racy check" down after the
get_page_unless_zero() (and check the sequence of operations when a
compound page is initialized).

The races I'm talking about are much much rarer than the condition you
are trying to avoid, so it's frustrating; but such races are real,
and increasing stable's exposure to them is not so good.

Sorry, I'm going to run away now: just raising these concerns
without working on the solution.

Hugh

> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c51f7f545afe..c408b5e04c1d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -990,7 +990,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		 * admittedly racy check.
>  		 */
>  		mapping = page_mapping(page);
> -		if (!mapping && page_count(page) > page_mapcount(page))
> +		if (!mapping && page_count(page) > total_mapcount(page))
>  			goto isolate_fail;
>  
>  		/*
> -- 
> 2.23.0
  
David Hildenbrand Nov. 23, 2022, 8:56 a.m. UTC | #4
On 23.11.22 06:14, Hugh Dickins wrote:
> On Wed, 23 Nov 2022, Gavin Shan wrote:
> 
>> The issue is reported when removing memory through virtio_mem device.
>> The transparent huge page, experienced copy-on-write fault, is wrongly
>> regarded as pinned. The transparent huge page is escaped from being
>> isolated in isolate_migratepages_block(). The transparent huge page
>> can't be migrated and the corresponding memory block can't be put
>> into offline state.
>>
>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>> the transparent huge page can be isolated and migrated, and the memory
>> block can be put into offline state.
>>
>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>> Cc: stable@vger.kernel.org   # v5.8+
>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> 
> Interesting, good catch, looked right to me: except for the Fixes line
> and mention of v5.8.  That CoW change may have added a case which easily
> demonstrates the problem, but it would have been the wrong test on a THP
> for long before then - but only in v5.7 were compound pages allowed
> through at all to reach that test, so I think it should be
> 
> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
> Cc: stable@vger.kernel.org   # v5.7+
> 
> Oh, no, stop: this is not so easy, even in the latest tree.
> 
> Because at the time of that "admittedly racy check", we have no hold
> at all on the page in question: and if it's PageLRU or PageCompound
> at one instant, it may be different the next instant.  Which leaves it
> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
> total_mapcount() path than in the existing page_mapcount() path.
> 
> I suspect that for this to be safe (before your patch and more so after),
> it will be necessary to shift the "admittedly racy check" down after the
> get_page_unless_zero() (and check the sequence of operations when a
> compound page is initialized).

Grabbing a reference first sounds like the right approach to me.

> 
> The races I'm talking about are much much rarer than the condition you
> are trying to avoid, so it's frustrating; but such races are real,
> and increasing stable's exposure to them is not so good.

Such checks are always racy and the code has to be able to deal with 
false negatives/positives (we're not even holding the page lock); as you 
state, we just don't want to trigger undefined behavior/BUG.


I'm also curious how that migration code handles a THP that's in the 
swapcache. It had better handle such pages correctly, for example by 
removing them from the swapcache first; otherwise that could block 
migration.


For example, in mm/ksm.c:write_protect_page() we have

"page_mapcount(page) + 1 + swapped != page_count(page)"

page_mapcount() and "swapped==0/1" makes sense to me, because KSM only 
cares about order-0 pages, so no need for THP games.
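
Reading that expression as an expected-refcount check (a sketch of the
reasoning only, based on the quoted condition):

    /*
     * page_mapcount(page)   one reference per PTE mapping
     * + 1                   the reference the KSM code itself holds
     * + swapped             one reference from the swapcache, if any
     *
     * anything beyond that suggests an extra pin, so KSM backs off
     */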


But we do have an even better helper in place already: 
mm/huge_memory.c:can_split_folio()

Which cares about

a) Swapcache for THP: each subpage could be in the swapcache
b) Requires the caller to hold one reference to be safe

But I am a bit confused about the "extra_pins" for !anon. Where do the 
folio_nr_pages() references come from?

So *maybe* it makes sense to factor out can_split_folio() and call it 
something like: "folio_maybe_additionally_referenced"  [to clearly 
distinguish it from "folio_maybe_dma_pinned" that cares about actual 
page pinning (read/write page content)].

Such a function could return false positives/negatives due to races and 
the caller would have to hold one reference and be able to deal with the 
semantics.
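
For reference, that logic looks roughly like this (paraphrased from
mm/huge_memory.c:can_split_folio() around the time of this thread;
details may differ between trees):

    bool can_split_folio(struct folio *folio, int *pextra_pins)
    {
            int extra_pins;

            /* additional references from the page cache or swap cache */
            if (folio_test_anon(folio))
                    extra_pins = folio_test_swapcache(folio) ?
                                    folio_nr_pages(folio) : 0;
            else
                    extra_pins = folio_nr_pages(folio);
            if (pextra_pins)
                    *pextra_pins = extra_pins;

            /* the caller is assumed to hold one extra reference */
            return folio_mapcount(folio) ==
                    folio_ref_count(folio) - extra_pins - 1;
    }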
  
Matthew Wilcox Nov. 23, 2022, 4:07 p.m. UTC | #5
On Wed, Nov 23, 2022 at 09:56:38AM +0100, David Hildenbrand wrote:
> But we do have an even better helper in place already:
> mm/huge_memory.c:can_split_folio()
> 
> Which cares about
> 
> a) Swapcache for THP: each subpage could be in the swapcache
> b) Requires the caller to hold one reference to be safe
> 
> But I am a bit confused about the "extra_pins" for !anon. Where do the
> folio_nr_pages() references come from?

When we add a folio to the page cache, we increment its refcount by
folio_nr_pages() instead of by 1.  I suspect this is no longer needed
(if it was ever needed) and it could be changed.  See
__filemap_add_folio():

        long nr = 1;
        if (!huge) {
                ...
                nr = folio_nr_pages(folio);
                ...
        }
        folio_ref_add(folio, nr);

> So *maybe* it makes sense to factor out can_split_folio() and call it
> something like: "folio_maybe_additionally_referenced"  [to clearly
> distinguish it from "folio_maybe_dma_pinned" that cares about actual page
> pinning (read/write page content)].
> 
> Such a function could return false positives/negatives due to races and the
> caller would have to hold one reference and be able to deal with the
> semantics.

I don't like the 'pextra_pins' parameter to a generic function ...
  
Gavin Shan Nov. 24, 2022, 12:14 a.m. UTC | #6
On 11/23/22 4:56 PM, David Hildenbrand wrote:
> On 23.11.22 06:14, Hugh Dickins wrote:
>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>
>>> The issue is reported when removing memory through virtio_mem device.
>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>> regarded as pinned. The transparent huge page is escaped from being
>>> isolated in isolate_migratepages_block(). The transparent huge page
>>> can't be migrated and the corresponding memory block can't be put
>>> into offline state.
>>>
>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>> the transparent huge page can be isolated and migrated, and the memory
>>> block can be put into offline state.
>>>
>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>> Cc: stable@vger.kernel.org   # v5.8+
>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>
>> Interesting, good catch, looked right to me: except for the Fixes line
>> and mention of v5.8.  That CoW change may have added a case which easily
>> demonstrates the problem, but it would have been the wrong test on a THP
>> for long before then - but only in v5.7 were compound pages allowed
>> through at all to reach that test, so I think it should be
>>
>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
>> Cc: stable@vger.kernel.org   # v5.7+
>>

Right, commit 1da2f328fa64 looks more accurate in this particular
case, I will fix it up in next revision.

>> Oh, no, stop: this is not so easy, even in the latest tree.
>>
>> Because at the time of that "admittedly racy check", we have no hold
>> at all on the page in question: and if it's PageLRU or PageCompound
>> at one instant, it may be different the next instant.  Which leaves it
>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>> total_mapcount() path than in the existing page_mapcount() path.
>>
>> I suspect that for this to be safe (before your patch and more so after),
>> it will be necessary to shift the "admittedly racy check" down after the
>> get_page_unless_zero() (and check the sequence of operations when a
>> compound page is initialized).
> 
> Grabbing a reference first sounds like the right approach to me.
> 

Yeah, it sounds reasonable to me to grab a reference (page->__refcount)
up front. Looking at isolate_migratepages_block(), the page's refcount
is increased by get_page_unless_zero(), but that happens too late.
Increasing the page's refcount at the very start of the function would
conflict with the hugetlb and non-LRU page handling, so a refactoring
series would be needed before the refcount can be grabbed up front.

So I plan to post a follow-up series to refactor the code and grab
the page's refcount up front. That way, this fix can be merged as
soon as possible. David and Hugh, please let me know if that's a
reasonable plan? :)

   static int isolate_migratepages_block()
   {
       for (; low_pfn < end_pfn; low_pfn++) {
           :
           page = pfn_to_page(low_pfn);
           if (unlikely(!get_page_unless_zero(page)))                           // grab page's refcount in the first place
               goto isolate_fail_put;
           :
           if (PageHuge(page) && cc->alloc_contig) {
               ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);   // refcount is increased by this function
               :
           }
           :
           if (!PageLRU(page)) {
               if (unlikely(__PageMovable(page)) &&
                   !PageIsolated(page)) {
                   if (!isolate_movable_page(page, mode))                      // refcount is increased here too
                       goto isolate_success;
               }
           }
           :
           mapping = page_mapping(page);
           if (!mapping && page_count(page) > total_mapcount(page))
               goto isolate_fail;
           :
           if (unlikely(!get_page_unless_zero(page)))                         // too late to grab the refcount?
               goto isolate_fail;
           :
       }
   }


[...]

Thanks,
Gavin
  
Alistair Popple Nov. 24, 2022, 1:06 a.m. UTC | #7
David Hildenbrand <david@redhat.com> writes:

> On 23.11.22 06:14, Hugh Dickins wrote:
>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>> 
>>> The issue is reported when removing memory through virtio_mem device.
>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>> regarded as pinned. The transparent huge page is escaped from being
>>> isolated in isolate_migratepages_block(). The transparent huge page
>>> can't be migrated and the corresponding memory block can't be put
>>> into offline state.
>>>
>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>> the transparent huge page can be isolated and migrated, and the memory
>>> block can be put into offline state.
>>>
>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>> Cc: stable@vger.kernel.org   # v5.8+
>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> Interesting, good catch, looked right to me: except for the Fixes
>> line
>> and mention of v5.8.  That CoW change may have added a case which easily
>> demonstrates the problem, but it would have been the wrong test on a THP
>> for long before then - but only in v5.7 were compound pages allowed
>> through at all to reach that test, so I think it should be
>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for
>> CMA allocations")
>> Cc: stable@vger.kernel.org   # v5.7+
>> Oh, no, stop: this is not so easy, even in the latest tree.
>> Because at the time of that "admittedly racy check", we have no hold
>> at all on the page in question: and if it's PageLRU or PageCompound
>> at one instant, it may be different the next instant.  Which leaves it
>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>> total_mapcount() path than in the existing page_mapcount() path.
>> I suspect that for this to be safe (before your patch and more so
>> after),
>> it will be necessary to shift the "admittedly racy check" down after the
>> get_page_unless_zero() (and check the sequence of operations when a
>> compound page is initialized).
>
> Grabbing a reference first sounds like the right approach to me.

I think you're right. Without a page reference I don't think it is even
safe to look at struct page, at least not without synchronisation
against memory hot unplug which could remove the struct page. From a
quick glance I didn't see anything here that obviously did that though.

>> The races I'm talking about are much much rarer than the condition
>> you
>> are trying to avoid, so it's frustrating; but such races are real,
>> and increasing stable's exposure to them is not so good.
>
> Such checks are always racy and the code has to be able to deal with
> false negatives/postives (we're not even holding the page lock); as
> you state, we just don't want to trigger undefined behavior/BUG.
>
>
> I'm also curious how that migration code handles a THP that's in the
> swapcache. It better should handle such pages correctly, for example,
> by removing them from the swapcache first, otherwise that could block
> migration.
>
>
> For example, in mm/ksm.c:write_protect_page() we have
>
> "page_mapcount(page) + 1 + swapped != page_count(page)"
>
> page_mapcount() and "swapped==0/1" makes sense to me, because KSM only
> cares about order-0 pages, so no need for THP games.
>
>
> But we do have an even better helper in place already:
> mm/huge_memory.c:can_split_folio()
>
> Which cares about
>
> a) Swapcache for THP: each subpage could be in the swapcache
> b) Requires the caller to hold one reference to be safe
>
> But I am a bit confused about the "extra_pins" for !anon. Where do the
> folio_nr_pages() references come from?
>
> So *maybe* it makes sense to factor out can_split_folio() and call it
> something like: "folio_maybe_additionally_referenced"  [to clearly
> distinguish it from "folio_maybe_dma_pinned" that cares about actual
> page pinning (read/write page content)].
>
> Such a function could return false positives/negatives due to races
> and the caller would have to hold one reference and be able to deal
> with the semantics.
  
Matthew Wilcox Nov. 24, 2022, 3:33 a.m. UTC | #8
On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
> > On 23.11.22 06:14, Hugh Dickins wrote:
> >> On Wed, 23 Nov 2022, Gavin Shan wrote:
> >> 
> >>> The issue is reported when removing memory through virtio_mem device.
> >>> The transparent huge page, experienced copy-on-write fault, is wrongly
> >>> regarded as pinned. The transparent huge page is escaped from being
> >>> isolated in isolate_migratepages_block(). The transparent huge page
> >>> can't be migrated and the corresponding memory block can't be put
> >>> into offline state.
> >>>
> >>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
> >>> the transparent huge page can be isolated and migrated, and the memory
> >>> block can be put into offline state.
> >>>
> >>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> >>> Cc: stable@vger.kernel.org   # v5.8+
> >>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> >>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> Interesting, good catch, looked right to me: except for the Fixes
> >> line
> >> and mention of v5.8.  That CoW change may have added a case which easily
> >> demonstrates the problem, but it would have been the wrong test on a THP
> >> for long before then - but only in v5.7 were compound pages allowed
> >> through at all to reach that test, so I think it should be
> >> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for
> >> CMA allocations")
> >> Cc: stable@vger.kernel.org   # v5.7+
> >> Oh, no, stop: this is not so easy, even in the latest tree.
> >> Because at the time of that "admittedly racy check", we have no hold
> >> at all on the page in question: and if it's PageLRU or PageCompound
> >> at one instant, it may be different the next instant.  Which leaves it
> >> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
> >> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
> >> total_mapcount() path than in the existing page_mapcount() path.
> >> I suspect that for this to be safe (before your patch and more so
> >> after),
> >> it will be necessary to shift the "admittedly racy check" down after the
> >> get_page_unless_zero() (and check the sequence of operations when a
> >> compound page is initialized).
> >
> > Grabbing a reference first sounds like the right approach to me.
> 
> I think you're right. Without a page reference I don't think it is even
> safe to look at struct page, at least not without synchronisation
> against memory hot unplug which could remove the struct page. From a
> quick glance I didn't see anything here that obviously did that though.

Memory hotplug is the offending party here.  It has to make sure that
everything else is definitely quiescent before removing the struct pages.
Otherwise you can't even try_get a refcount.
  
David Hildenbrand Nov. 24, 2022, 8:46 a.m. UTC | #9
On 24.11.22 01:14, Gavin Shan wrote:
> On 11/23/22 4:56 PM, David Hildenbrand wrote:
>> On 23.11.22 06:14, Hugh Dickins wrote:
>>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>>
>>>> The issue is reported when removing memory through virtio_mem device.
>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>> regarded as pinned. The transparent huge page is escaped from being
>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>> can't be migrated and the corresponding memory block can't be put
>>>> into offline state.
>>>>
>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>> the transparent huge page can be isolated and migrated, and the memory
>>>> block can be put into offline state.
>>>>
>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>>> Cc: stable@vger.kernel.org   # v5.8+
>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>
>>> Interesting, good catch, looked right to me: except for the Fixes line
>>> and mention of v5.8.  That CoW change may have added a case which easily
>>> demonstrates the problem, but it would have been the wrong test on a THP
>>> for long before then - but only in v5.7 were compound pages allowed
>>> through at all to reach that test, so I think it should be
>>>
>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
>>> Cc: stable@vger.kernel.org   # v5.7+
>>>
> 
> Right, commit 1da2f328fa64 looks more accurate in this particular
> case, I will fix it up in next revision.
> 
>>> Oh, no, stop: this is not so easy, even in the latest tree.
>>>
>>> Because at the time of that "admittedly racy check", we have no hold
>>> at all on the page in question: and if it's PageLRU or PageCompound
>>> at one instant, it may be different the next instant.  Which leaves it
>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>>> total_mapcount() path than in the existing page_mapcount() path.
>>>
>>> I suspect that for this to be safe (before your patch and more so after),
>>> it will be necessary to shift the "admittedly racy check" down after the
>>> get_page_unless_zero() (and check the sequence of operations when a
>>> compound page is initialized).
>>
>> Grabbing a reference first sounds like the right approach to me.
>>
> 
> Yeah, it sounds reasonable to me to grab a page->__refcount in the
> first place. Looking at isolate_migratepages_block(), the page's refcount
> is increased by get_page_unless_zero(), but it's too late. To increase
> the page's refcount at the first place in the function will be conflicting
> with hugetlb page and non-LRU page. I mean there will be a series to refactor
> the code so that the page's refcount can be grabbed in the first place.
> 
> So I plan to post a followup series to refactor the code and grab
> the page's refcount in the first place. In this way, the fix can be
> merged as soon as possible. David and Hugh, please let me know if
> it's reasonable plan? :)


Can't you just temporarily grab the refcount and drop it again? I mean, 
it's all racy either way and the code has to be able to cope with such 
races.
  
David Hildenbrand Nov. 24, 2022, 8:49 a.m. UTC | #10
On 24.11.22 04:33, Matthew Wilcox wrote:
> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
>>
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 23.11.22 06:14, Hugh Dickins wrote:
>>>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>>>
>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>> can't be migrated and the corresponding memory block can't be put
>>>>> into offline state.
>>>>>
>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>> block can be put into offline state.
>>>>>
>>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>>>> Cc: stable@vger.kernel.org   # v5.8+
>>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> Interesting, good catch, looked right to me: except for the Fixes
>>>> line
>>>> and mention of v5.8.  That CoW change may have added a case which easily
>>>> demonstrates the problem, but it would have been the wrong test on a THP
>>>> for long before then - but only in v5.7 were compound pages allowed
>>>> through at all to reach that test, so I think it should be
>>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for
>>>> CMA allocations")
>>>> Cc: stable@vger.kernel.org   # v5.7+
>>>> Oh, no, stop: this is not so easy, even in the latest tree.
>>>> Because at the time of that "admittedly racy check", we have no hold
>>>> at all on the page in question: and if it's PageLRU or PageCompound
>>>> at one instant, it may be different the next instant.  Which leaves it
>>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>>>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>>>> total_mapcount() path than in the existing page_mapcount() path.
>>>> I suspect that for this to be safe (before your patch and more so
>>>> after),
>>>> it will be necessary to shift the "admittedly racy check" down after the
>>>> get_page_unless_zero() (and check the sequence of operations when a
>>>> compound page is initialized).
>>>
>>> Grabbing a reference first sounds like the right approach to me.
>>
>> I think you're right. Without a page reference I don't think it is even
>> safe to look at struct page, at least not without synchronisation
>> against memory hot unplug which could remove the struct page. From a
>> quick glance I didn't see anything here that obviously did that though.
> 
> Memory hotplug is the offending party here.  It has to make sure that
> everything else is definitely quiescent before removing the struct pages.
> Otherwise you can't even try_get a refcount.

At least alloc_contig_range() and memory offlining are mutually 
exclusive due to MIGRATE_ISOLATE. I recall that ordinary memory 
compaction similarly deals with isolated pageblocks (or some other 
mechanism I forgot) to not race with memory offlining. Wouldn't worry 
about that for now.
  
David Hildenbrand Nov. 24, 2022, 8:50 a.m. UTC | #11
On 23.11.22 17:07, Matthew Wilcox wrote:
> On Wed, Nov 23, 2022 at 09:56:38AM +0100, David Hildenbrand wrote:
>> But we do have an even better helper in place already:
>> mm/huge_memory.c:can_split_folio()
>>
>> Which cares about
>>
>> a) Swapcache for THP: each subpage could be in the swapcache
>> b) Requires the caller to hold one reference to be safe
>>
>> But I am a bit confused about the "extra_pins" for !anon. Where do the
>> folio_nr_pages() references come from?
> 
> When we add a folio to the page cache, we increment its refcount by
> folio_nr_pages() instead of by 1.  I suspect this is no longer needed
> (if it was ever needed) and it could be changed.  See
> __filemap_add_folio():
> 
>          long nr = 1;
>          if (!huge) {
>                  nr = folio_nr_pages(folio);
>          folio_ref_add(folio, nr);
> 
>> So *maybe* it makes sense to factor out can_split_folio() and call it
>> something like: "folio_maybe_additionally_referenced"  [to clearly
>> distinguish it from "folio_maybe_dma_pinned" that cares about actual page
>> pinning (read/write page content)].
>>
>> Such a function could return false positives/negatives due to races and the
>> caller would have to hold one reference and be able to deal with the
>> semantics.
> 
> I don't like the 'pextra_pins' parameter to a generic function ...

Right, that part should remain khugepaged-specific. The assumption would 
be that the caller of the generic function holds exactly one additional 
reference.
  
Gavin Shan Nov. 24, 2022, 9:44 a.m. UTC | #12
On 11/24/22 4:46 PM, David Hildenbrand wrote:
> On 24.11.22 01:14, Gavin Shan wrote:
>> On 11/23/22 4:56 PM, David Hildenbrand wrote:
>>> On 23.11.22 06:14, Hugh Dickins wrote:
>>>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>>>
>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>> can't be migrated and the corresponding memory block can't be put
>>>>> into offline state.
>>>>>
>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>> block can be put into offline state.
>>>>>
>>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>>>> Cc: stable@vger.kernel.org   # v5.8+
>>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>
>>>> Interesting, good catch, looked right to me: except for the Fixes line
>>>> and mention of v5.8.  That CoW change may have added a case which easily
>>>> demonstrates the problem, but it would have been the wrong test on a THP
>>>> for long before then - but only in v5.7 were compound pages allowed
>>>> through at all to reach that test, so I think it should be
>>>>
>>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
>>>> Cc: stable@vger.kernel.org   # v5.7+
>>>>
>>
>> Right, commit 1da2f328fa64 looks more accurate in this particular
>> case, I will fix it up in next revision.
>>
>>>> Oh, no, stop: this is not so easy, even in the latest tree.
>>>>
>>>> Because at the time of that "admittedly racy check", we have no hold
>>>> at all on the page in question: and if it's PageLRU or PageCompound
>>>> at one instant, it may be different the next instant.  Which leaves it
>>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>>>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>>>> total_mapcount() path than in the existing page_mapcount() path.
>>>>
>>>> I suspect that for this to be safe (before your patch and more so after),
>>>> it will be necessary to shift the "admittedly racy check" down after the
>>>> get_page_unless_zero() (and check the sequence of operations when a
>>>> compound page is initialized).
>>>
>>> Grabbing a reference first sounds like the right approach to me.
>>>
>>
>> Yeah, it sounds reasonable to me to grab a page->__refcount in the
>> first place. Looking at isolate_migratepages_block(), the page's refcount
>> is increased by get_page_unless_zero(), but it's too late. To increase
>> the page's refcount at the first place in the function will be conflicting
>> with hugetlb page and non-LRU page. I mean there will be a series to refactor
>> the code so that the page's refcount can be grabbed in the first place.
>>
>> So I plan to post a followup series to refactor the code and grab
>> the page's refcount in the first place. In this way, the fix can be
>> merged as soon as possible. David and Hugh, please let me know if
>> it's reasonable plan? :)
> 
> 
> Can't you just temporarily grab the refcount and drop it again? I mean, it's all racy either way and the code has to be able to cope with such races.
> 

Well, we can do this by moving the hunk of code that increases the
page's refcount ahead of the check.


   if (unlikely(!get_page_unless_zero(page)))
       goto isolate_fail;

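   /* the -1 accounts for the reference taken just above */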
   if (!mapping && (page_count(page) - 1) > total_mapcount(page))
       goto isolate_fail_put;

Thanks,
Gavin
  
Alistair Popple Nov. 25, 2022, 12:58 a.m. UTC | #13
David Hildenbrand <david@redhat.com> writes:

> On 24.11.22 04:33, Matthew Wilcox wrote:
>> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
>>>
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 23.11.22 06:14, Hugh Dickins wrote:
>>>>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>>>>
>>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>>> can't be migrated and the corresponding memory block can't be put
>>>>>> into offline state.
>>>>>>
>>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>>> block can be put into offline state.
>>>>>>
>>>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>>>>> Cc: stable@vger.kernel.org   # v5.8+
>>>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> Interesting, good catch, looked right to me: except for the Fixes
>>>>> line
>>>>> and mention of v5.8.  That CoW change may have added a case which easily
>>>>> demonstrates the problem, but it would have been the wrong test on a THP
>>>>> for long before then - but only in v5.7 were compound pages allowed
>>>>> through at all to reach that test, so I think it should be
>>>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for
>>>>> CMA allocations")
>>>>> Cc: stable@vger.kernel.org   # v5.7+
>>>>> Oh, no, stop: this is not so easy, even in the latest tree.
>>>>> Because at the time of that "admittedly racy check", we have no hold
>>>>> at all on the page in question: and if it's PageLRU or PageCompound
>>>>> at one instant, it may be different the next instant.  Which leaves it
>>>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>>>>> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
>>>>> total_mapcount() path than in the existing page_mapcount() path.
>>>>> I suspect that for this to be safe (before your patch and more so
>>>>> after),
>>>>> it will be necessary to shift the "admittedly racy check" down after the
>>>>> get_page_unless_zero() (and check the sequence of operations when a
>>>>> compound page is initialized).
>>>>
>>>> Grabbing a reference first sounds like the right approach to me.
>>>
>>> I think you're right. Without a page reference I don't think it is even
>>> safe to look at struct page, at least not without synchronisation
>>> against memory hot unplug which could remove the struct page. From a
>>> quick glance I didn't see anything here that obviously did that though.
>> Memory hotplug is the offending party here.  It has to make sure
>> that
>> everything else is definitely quiescent before removing the struct pages.
>> Otherwise you can't even try_get a refcount.

Sure, I might be missing something but how can memory hotplug possibly
synchronise against some process (eg. try_to_compact_pages) doing
try_get(pfn_to_page(random_pfn)) without that process either informing
memory hotplug that it needs pages to remain valid long enough to get a
reference or detecting that memory hotplug is in the process of
offlining pages?

> At least alloc_contig_range() and memory offlining are mutually
> exclusive due to MIGRATE_ISOLTAE. I recall that ordinary memory
> compaction similarly deals with isolated pageblocks (or some other
> mechanism I forgot) to not race with memory offlining. Wouldn't worry
> about that for now.

Sorry, agree - to be clear this discussion isn't relevant for this patch
but reviewing it got me thinking about how this works. I still don't see
how alloc_contig_range() is totally safe against memory offlining
though. From what I can see we have the following call path to set
MIGRATE_ISOLATE:

alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))

There's nothing in that call stack that prevents offlining of the page,
hence the struct page may be freed by this point. Am I missing something
else here?

try_to_compact_pages() has a similar issue which is what I noticed
reviewing this patch:

try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
isolate_migratepages() -> isolate_migratepages_block() ->
PageHuge(pfn_to_page(pfn))

Again I couldn't see anything in that path that would hold the page
stable long enough to safely perform the pfn_to_page() and get a
reference. And after a bit of fiddling I was able to trigger the below
oops running the compaction_test selftest whilst offlining memory so I
don't think it is safe:

Entering kdb (current=0xffff8882452fb6c0, pid 5646) on processor 1 Oops: (null)
due to oops @ 0xffffffff81af6486
CPU: 1 PID: 5646 Comm: compaction_test Not tainted 6.0.0-01424-gf3ec7e734795 #110
Hardware name: Gigabyte Technology Co., Ltd. B150M-D3H/B150M-D3H-CF, BIOS F24 04/11/2018
RIP: 0010:PageHuge+0xa6/0x180
Code: 00 0f 85 d0 00 00 00 48 8b 43 08 a8 01 0f 85 97 00 00 00 66 90 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 02 7e 7f 31 c0 80 7b 50 02 0f 94 c0 5b 41 5c
RSP: 0000:ffffc9001252efa8 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: fffffffffffffffe RCX: ffffffff81af63f9
RDX: 0000000000000009 RSI: 0000000000000008 RDI: 000000000000004e
RBP: ffffc9001252efb8 R08: 0000000000000000 R09: ffffea000f690007
R10: fffff94001ed2000 R11: 0000000000000001 R12: ffffea000f690008
R13: 0000000000000ab3 R14: ffffea000f690000 R15: 00000000003da400
FS:  00007f83e08b7740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb6e1cbb3e0 CR3: 0000000344044003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 isolate_migratepages_block+0x43c/0x3dc0
 ? reclaim_throttle+0x7a0/0x7a0
 ? __reset_isolation_suitable+0x350/0x350
 compact_zone+0xb73/0x31f0
 ? compaction_suitable+0x1e0/0x1e0
 compact_zone_order+0x127/0x240
 ? compact_zone+0x31f0/0x31f0
 ? __this_cpu_preempt_check+0x13/0x20
 ? cpuusage_write+0x380/0x380
 ? __kasan_check_read+0x11/0x20
 try_to_compact_pages+0x23b/0x770
 ? psi_task_change+0x201/0x300
 __alloc_pages_direct_compact+0x15d/0x650
 ? get_page_from_freelist+0x3fa0/0x3fa0
 ? psi_task_change+0x201/0x300
 ? _raw_spin_unlock+0x19/0x40
 __alloc_pages_slowpath.constprop.0+0x9e1/0x2260
 ? warn_alloc+0x1a0/0x1a0
 ? __zone_watermark_ok+0x430/0x430
 ? prepare_alloc_pages+0x40b/0x620
 __alloc_pages+0x42f/0x540
 ? __alloc_pages_slowpath.constprop.0+0x2260/0x2260
 alloc_buddy_huge_page.isra.0+0x7c/0x130
 alloc_fresh_huge_page+0x1f1/0x4e0
 alloc_pool_huge_page+0xab/0x2d0
 __nr_hugepages_store_common+0x37a/0xed0
 ? return_unused_surplus_pages+0x330/0x330
 ? __kasan_check_write+0x14/0x20
 ? _raw_spin_lock_irqsave+0x99/0x100
 ? proc_doulongvec_minmax+0x54/0x80
 hugetlb_sysctl_handler_common+0x247/0x320
 ? nr_hugepages_store+0xf0/0xf0
 ? alloc_huge_page+0xbf0/0xbf0
 hugetlb_sysctl_handler+0x20/0x30
 proc_sys_call_handler+0x451/0x650
 ? unregister_sysctl_table+0x1c0/0x1c0
 ? apparmor_file_permission+0x124/0x280
 ? security_file_permission+0x72/0x90
 proc_sys_write+0x13/0x20
 vfs_write+0x7ca/0xd80
 ? kernel_write+0x4d0/0x4d0
 ? do_sys_openat2+0x114/0x450
 ? __kasan_check_write+0x14/0x20
 ? down_write+0xb4/0x130
 ksys_write+0x116/0x220
 ? __kasan_check_write+0x14/0x20
 ? __ia32_sys_read+0xb0/0xb0
 ? debug_smp_processor_id+0x17/0x20
 ? fpregs_assert_state_consistent+0x52/0xc0
 __x64_sys_write+0x73/0xb0
 ? syscall_exit_to_user_mode+0x26/0x50
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f83e09b2190
Code: 40 00 48 8b 15 71 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d 51 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
RSP: 002b:00007ffe4c08e478 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffe4c08e648 RCX: 00007f83e09b2190
RDX: 0000000000000006 RSI: 000055d7575611f8 RDI: 0000000000000003
RBP: 00007ffe4c08e4c0 R08: 00007f83e0a8cc60 R09: 0000000000000000
R10: 00007f83e08d40b8 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffe4c08e658 R14: 000055d757562df0 R15: 00007f83e0adc020
 </TASK>
  
David Hildenbrand Nov. 25, 2022, 8:54 a.m. UTC | #14
>>>> I think you're right. Without a page reference I don't think it is even
>>>> safe to look at struct page, at least not without synchronisation
>>>> against memory hot unplug which could remove the struct page. From a
>>>> quick glance I didn't see anything here that obviously did that though.
>>> Memory hotplug is the offending party here.  It has to make sure
>>> that
>>> everything else is definitely quiescent before removing the struct pages.
>>> Otherwise you can't even try_get a refcount.
> 
> Sure, I might be missing something but how can memory hotplug possibly
> synchronise against some process (eg. try_to_compact_pages) doing
> try_get(pfn_to_page(random_pfn)) without that process either informing
> memory hotplug that it needs pages to remain valid long enough to get a
> reference or detecting that memory hotplug is in the process of
> offlining pages?

It currently doesn't, and it's been mostly a known theoretical problem 
so far. We've been ignoring these kind of problems because they are not 
easy to sort out and so far never popped up in practice.

First, the correct approach is using pfn_to_online_page() instead of 
pfn_to_page() when in doubt whether the page might already be offline. 
While we're already using pfn_to_online_page() in quite a few PFN 
walkers, chances are good that we're still missing some.

Second, essentially anybody (PFN walker) doing a pfn_to_online_page() is 
prone to races with memory offlining. I've discussed this in the past 
with Michal and Dan and one idea was to use RCU to synchronize PFN 
walkers and pfn_to_online_page(), essentially synchronizing clearing of 
the SECTION_IS_ONLINE flag.

Nobody worked on that so far because we've never had actual BUG reports. 
These kind of races are rare, but they are theoretically possible.
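
A minimal sketch of that safer walker pattern (illustrative only; as
discussed here, it narrows the race with offlining but does not fully
close it without additional synchronization):

    struct page *page = pfn_to_online_page(pfn); /* NULL if section offline */

    if (!page)
            return;
    if (!get_page_unless_zero(page))             /* free or being freed */
            return;
    /* ... inspect the page state under the reference ... */
    put_page(page);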

> 
>> At least alloc_contig_range() and memory offlining are mutually
>> exclusive due to MIGRATE_ISOLTAE. I recall that ordinary memory
>> compaction similarly deals with isolated pageblocks (or some other
>> mechanism I forgot) to not race with memory offlining. Wouldn't worry
>> about that for now.
> 
> Sorry, agree - to be clear this discussion isn't relevant for this patch
> but reviewing it got me thinking about how this works. I still don't see
> how alloc_contig_range() is totally safe against memory offlining
> though. From what I can see we have the following call path to set
> MIGRATE_ISOLATE:
> 
> alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
> isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))
> 
> There's nothing in that call stack that prevent offlining of the page,
> hence the struct page may be freed by this point. Am I missing something
> else here?

Good point. I even had a patch at some point that converted some calls 
in there to pfn_to_online_page(), but discarded it.

IIRC, two alloc_contig_range() users are safe because:

1) virtio-mem synchronizes against memory offlining using the memory 
notifier. While memory offlining is in progress, it won't call 
alloc_contig_range().

2) CMA regions will always fail to offline because they have MIGRATE_CMA 
set.

What remains is alloc_contig_pages(), for example, used for memtrace on 
ppc, kfence, and allocation of gigantic pages. That might indeed be 
racy. At least from kfence_init_late() it's unlikely (impossible?) that 
we'll have concurrent memory offlining.

> 
> try_to_compact_pages() has a similar issue which is what I noticed
> reviewing this patch:

Yes, I can spot pfn_to_online_page() in __reset_isolation_pfn(), but of 
course we are likely missing proper synchronization and checks later. I 
wonder if we could use the memory notifier to temporarily pause any 
running compaction and to restart it once memory offlining succeeded.
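
Purely to illustrate that idea (a hypothetical sketch; compaction has no
such pause/resume hooks today, so both callbacks are assumptions):

    static int compaction_memory_cb(struct notifier_block *nb,
                                    unsigned long action, void *arg)
    {
            switch (action) {
            case MEM_GOING_OFFLINE:
                    /* hypothetical: drain and pause running compaction */
                    break;
            case MEM_OFFLINE:
            case MEM_CANCEL_OFFLINE:
                    /* hypothetical: let compaction resume */
                    break;
            }
            return NOTIFY_OK;
    }

    static struct notifier_block compaction_memory_nb = {
            .notifier_call = compaction_memory_cb,
    };

    /* registered once at init: register_memory_notifier(&compaction_memory_nb); */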

> 
> try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
> isolate_migratepages() -> isolate_migratepages_block() ->
> PageHuge(pfn_to_page(pfn))
> 
> Again I couldn't see anything in that path that would hold the page
> stable long enough to safely perform the pfn_to_page() and get a
> reference. And after a bit of fiddling I was able to trigger the below
> oops running the compaction_test selftest whilst offlining memory so I
> don't think it is safe:

Thanks for finding a reproducer. This is exactly the kind of BUG that we 
speculated about in the past but that we haven't seen in practice yet.


That said, I'd be very happy if someone had time to work on proper 
synchronization, and I'd be happy to help with 
brainstorming/reviewing :)
  
Alistair Popple Dec. 1, 2022, 10:35 p.m. UTC | #15
David Hildenbrand <david@redhat.com> writes:

>>>>> I think you're right. Without a page reference I don't think it is even
>>>>> safe to look at struct page, at least not without synchronisation
>>>>> against memory hot unplug which could remove the struct page. From a
>>>>> quick glance I didn't see anything here that obviously did that though.
>>>> Memory hotplug is the offending party here.  It has to make sure
>>>> that
>>>> everything else is definitely quiescent before removing the struct pages.
>>>> Otherwise you can't even try_get a refcount.
>> Sure, I might be missing something but how can memory hotplug
>> possibly
>> synchronise against some process (eg. try_to_compact_pages) doing
>> try_get(pfn_to_page(random_pfn)) without that process either informing
>> memory hotplug that it needs pages to remain valid long enough to get a
>> reference or detecting that memory hotplug is in the process of
>> offlining pages?
>
> It currently doesn't, and it's been mostly a known theoretical problem
> so far. We've been ignoring these kind of problems because they are
> not easy to sort out and so far never popped up in practice.
>
> First, the correct approach is using pfn_to_online_page() instead of
> pfn_to_page() when in doubt whether the page might already be
> offline. While we're using pfn_to_online_page() already in quite some
> PFN walkers, changes are good that we're still missing some.
>
> Second, essentially anybody (PFN walker) doing a pfn_to_online_page()
> is prone to races with memory offlining. I've discussed this in the
> past with Michal and Dan and one idea was to use RCU to synchronize
> PFN walkers and pfn_to_online_page(), essentially synchronizing
> clearing of the SECTION_IS_ONLINE flag.
>
> Nobody worked on that so far because we've never had actual BUG
> reports. These kind of races are rare, but they are theoretically
> possible.

Thanks for providing some background context here. I have been digging
into these details a bit because GPU drivers using device-private memory
tend to exercise memory hot-unplug paths more frequently than what I
suspect is normally the case. I haven't actually seen any real BUG
reports caused by this though, and the places where they're likely
would involve device-private pages at the moment anyway. So I agree
such races are rare.

[...]

>> try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
>> isolate_migratepages() -> isolate_migratepages_block() ->
>> PageHuge(pfn_to_page(pfn))
>> Again I couldn't see anything in that path that would hold the page
>> stable long enough to safely perform the pfn_to_page() and get a
>> reference. And after a bit of fiddling I was able to trigger the below
>> oops running the compaction_test selftest whilst offlining memory so I
>> don't think it is safe:
>
> Thanks for finding a reproducer. This is exactly the kind of BUG that
> we speculated about in the past but that we haven't seen in practice
> yet.

No worries. The reproducer consisted of running the compaction_test
selftest in a loop on a single NUMA node whilst cycling every memory
block in that node online/offline in parallel (ie. not a particularly
realistic real world workload). Eventually that led to a couple of
different kernel OOPS/BUG backtraces.

> Having that said, I'd be very happy if someone would have time to work
> on proper synchronization and I'd be happy to help
> brainstorming/reviewing :)

I would like to see the synchronisation cleaned up here and am happy to
work on it. I'm unlikely to have much bandwidth to do so before the new
year but any brainstorming/review would certainly be appreciated when I
do find some time though!
  

Patch

diff --git a/mm/compaction.c b/mm/compaction.c
index c51f7f545afe..c408b5e04c1d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,7 @@  isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * admittedly racy check.
 		 */
 		mapping = page_mapping(page);
-		if (!mapping && page_count(page) > page_mapcount(page))
+		if (!mapping && page_count(page) > total_mapcount(page))
 			goto isolate_fail;
 
 		/*