Message ID | 20221124095523.31061-1-gshan@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp3298013wrr; Thu, 24 Nov 2022 02:03:33 -0800 (PST) X-Google-Smtp-Source: AA0mqf5iw8bVxorSAhcZWJo6qEDvXj5FLchYyE3hilSlbV6T6hJ5NsFhC16Kjpl7b88HygBML0hj X-Received: by 2002:a17:90a:f0c2:b0:218:ac70:fdd5 with SMTP id fa2-20020a17090af0c200b00218ac70fdd5mr19640198pjb.123.1669284213408; Thu, 24 Nov 2022 02:03:33 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1669284213; cv=none; d=google.com; s=arc-20160816; b=hZwmYrbo3i2V5TicruCSCi1IBQfcCjV1oXrOXp9/jvLyvp4nEtt1ALaMq0k8qiGP2A s7mIEqxJ4d+Cdgh9M2RQ3SeqFHiQcPr5FobGD/to2lhMox7v3CNEEfpCiYyh7YxZd1qT Iu9tp3evZYwRLJCRiCraARgvKWTu/Ceuk09FyyOQJMyKHuGgWceS8CParUWAjOo2jZxK OC6hkrlggAUdFivF3IFzRSHOyJ72a25iSnRSjdM95rER/BnGokOFCYtL/MiXP2GOWbze ClvO/2GQmn43ew/RI48e90KaGZBPWnrinRnli9rdjFo9GI8wprxfDj2l5vemfoHhRg9E BRsQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=60QDRRHbYW5CIGKF/Qr6kydVK9e2+lWs3YQKxRzQMXc=; b=Ar2cjihtbSY0X3GiNaqRMa60jcbbHfJgOdYUPgM8vNg2qwgBeHKwk5YXf1HZcH2B6/ cBiTfjJpC7ul6sqzXkSJBXrPLJQ8jIJnHO1DyccVDbU0pZ5n68iKA37wdJpUF539Fg/6 ITTx5M9StUVbFcG/DS7qoS1X05woA5UVBS4My1AtdhlavyL8jv8UKeUEWC7c5/4TJ8YC axzjivuOGs6jJHokX2orPzISr1WLVGRQMxtjsk6Uq+Zen0teHtlXCPZbFA3dbSiEqH4R FMqKyTnZhKUd3X1Aru56rIAzSs02oqTdQfsIiFTSUrNe12OfIn4lMyj7tBq+ZT3RRrq7 9fTA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=S9wj6D3X; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id o12-20020a1709026b0c00b001893e9d712dsi507018plk.557.2022.11.24.02.03.20; Thu, 24 Nov 2022 02:03:33 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=S9wj6D3X; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229671AbiKXJ4s (ORCPT <rfc822;fengqi706@gmail.com> + 99 others); Thu, 24 Nov 2022 04:56:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41154 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229555AbiKXJ4q (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 24 Nov 2022 04:56:46 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C677912EBE5 for <linux-kernel@vger.kernel.org>; Thu, 24 Nov 2022 01:55:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1669283749; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=60QDRRHbYW5CIGKF/Qr6kydVK9e2+lWs3YQKxRzQMXc=; b=S9wj6D3XPLSoazx33TkWn8VTdWPEXHFZmi2kDwuPOfedqqSlXSjthL2+Kymy6GqM34jYEK NlStzKQ0GCU39MtBzvV11Jj5STTZtjsZvvN92JDWhIQYgtPPQcqmxOLhuVF8NgfByYLJkM qsu3by9FtmtYbPyMcNUGaJWTwxmiU+Q= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-499-By9IKOSyMZuvRh8Ile7czA-1; Thu, 24 Nov 2022 04:55:44 -0500 X-MC-Unique: By9IKOSyMZuvRh8Ile7czA-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id C8C2B29AB3E5; Thu, 24 Nov 2022 09:55:43 +0000 (UTC) Received: from gshan.redhat.com (vpn2-54-95.bne.redhat.com [10.64.54.95]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 6303140C6DC7; Thu, 24 Nov 2022 09:55:39 +0000 (UTC) From: Gavin Shan <gshan@redhat.com> To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org, william.kucharski@oracle.com, ziy@nvidia.com, kirill.shutemov@linux.intel.com, david@redhat.com, zhenyzha@redhat.com, apopple@nvidia.com, hughd@google.com, willy@infradead.org, shan.gavin@gmail.com Subject: [PATCH v2] mm: migrate: Fix THP's mapcount on isolation Date: Thu, 24 Nov 2022 17:55:23 +0800 Message-Id: <20221124095523.31061-1-gshan@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.2 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1750246827300294072?= X-GMAIL-MSGID: =?utf-8?q?1750371363224019534?= |
Series |
[v2] mm: migrate: Fix THP's mapcount on isolation
|
|
Commit Message
Gavin Shan
Nov. 24, 2022, 9:55 a.m. UTC
The issue is reported when removing memory through virtio_mem device.
The transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page
can't be migrated and the corresponding memory block can't be put
into offline state.
Fix it by replacing page_mapcount() with total_mapcount(). With this,
the transparent huge page can be isolated and migrated, and the memory
block can be put into offline state. Besides, The page's refcount is
increased a bit earlier to avoid the page is released when the check
is executed.
Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
v2: Corrected fix tag and increase page's refcount before the check
---
mm/compaction.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
Comments
On 24.11.22 10:55, Gavin Shan wrote: > The issue is reported when removing memory through virtio_mem device. > The transparent huge page, experienced copy-on-write fault, is wrongly > regarded as pinned. The transparent huge page is escaped from being > isolated in isolate_migratepages_block(). The transparent huge page > can't be migrated and the corresponding memory block can't be put > into offline state. > > Fix it by replacing page_mapcount() with total_mapcount(). With this, > the transparent huge page can be isolated and migrated, and the memory > block can be put into offline state. Besides, The page's refcount is > increased a bit earlier to avoid the page is released when the check > is executed. Did you look into handling pages that are in the swapcache case as well? See is_refcount_suitable() in mm/khugepaged.c. Should be easy to reproduce, let me know if you need inspiration. > > Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations") > Cc: stable@vger.kernel.org # v5.7+ > Reported-by: Zhenyu Zhang <zhenyzha@redhat.com> > Suggested-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Gavin Shan <gshan@redhat.com> > --- > v2: Corrected fix tag and increase page's refcount before the check > --- > mm/compaction.c | 22 +++++++++++----------- > 1 file changed, 11 insertions(+), 11 deletions(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index c51f7f545afe..1f6da31dd9a5 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, > goto isolate_fail; > } > > + /* > + * Be careful not to clear PageLRU until after we're > + * sure the page is not being freed elsewhere -- the > + * page release code relies on it. > + */ > + if (unlikely(!get_page_unless_zero(page))) > + goto isolate_fail; > + > /* > * Migration will fail if an anonymous page is pinned in memory, > * so avoid taking lru_lock and isolating it unnecessarily in an > * admittedly racy check. > */ > mapping = page_mapping(page); > - if (!mapping && page_count(page) > page_mapcount(page)) > - goto isolate_fail; > + if (!mapping && (page_count(page) - 1) > total_mapcount(page)) > + goto isolate_fail_put; > > /* > * Only allow to migrate anonymous pages in GFP_NOFS context > * because those do not depend on fs locks. > */ > if (!(cc->gfp_mask & __GFP_FS) && mapping) > - goto isolate_fail; > - > - /* > - * Be careful not to clear PageLRU until after we're > - * sure the page is not being freed elsewhere -- the > - * page release code relies on it. > - */ > - if (unlikely(!get_page_unless_zero(page))) > - goto isolate_fail; > + goto isolate_fail_put; > > /* Only take pages on LRU: a check now makes later tests safe */ > if (!PageLRU(page))
On 11/24/22 6:09 PM, David Hildenbrand wrote: > On 24.11.22 10:55, Gavin Shan wrote: >> The issue is reported when removing memory through virtio_mem device. >> The transparent huge page, experienced copy-on-write fault, is wrongly >> regarded as pinned. The transparent huge page is escaped from being >> isolated in isolate_migratepages_block(). The transparent huge page >> can't be migrated and the corresponding memory block can't be put >> into offline state. >> >> Fix it by replacing page_mapcount() with total_mapcount(). With this, >> the transparent huge page can be isolated and migrated, and the memory >> block can be put into offline state. Besides, The page's refcount is >> increased a bit earlier to avoid the page is released when the check >> is executed. > > Did you look into handling pages that are in the swapcache case as well? > > See is_refcount_suitable() in mm/khugepaged.c. > > Should be easy to reproduce, let me know if you need inspiration. > Nope, I didn't look into the case. Please elaborate the details so that I can reproduce it firstly. >> >> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations") >> Cc: stable@vger.kernel.org # v5.7+ >> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com> >> Suggested-by: David Hildenbrand <david@redhat.com> >> Signed-off-by: Gavin Shan <gshan@redhat.com> >> --- >> v2: Corrected fix tag and increase page's refcount before the check >> --- >> mm/compaction.c | 22 +++++++++++----------- >> 1 file changed, 11 insertions(+), 11 deletions(-) >> >> diff --git a/mm/compaction.c b/mm/compaction.c >> index c51f7f545afe..1f6da31dd9a5 100644 >> --- a/mm/compaction.c >> +++ b/mm/compaction.c >> @@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, >> goto isolate_fail; >> } >> + /* >> + * Be careful not to clear PageLRU until after we're >> + * sure the page is not being freed elsewhere -- the >> + * page release code relies on it. >> + */ >> + if (unlikely(!get_page_unless_zero(page))) >> + goto isolate_fail; >> + >> /* >> * Migration will fail if an anonymous page is pinned in memory, >> * so avoid taking lru_lock and isolating it unnecessarily in an >> * admittedly racy check. >> */ >> mapping = page_mapping(page); >> - if (!mapping && page_count(page) > page_mapcount(page)) >> - goto isolate_fail; >> + if (!mapping && (page_count(page) - 1) > total_mapcount(page)) >> + goto isolate_fail_put; >> /* >> * Only allow to migrate anonymous pages in GFP_NOFS context >> * because those do not depend on fs locks. >> */ >> if (!(cc->gfp_mask & __GFP_FS) && mapping) >> - goto isolate_fail; >> - >> - /* >> - * Be careful not to clear PageLRU until after we're >> - * sure the page is not being freed elsewhere -- the >> - * page release code relies on it. >> - */ >> - if (unlikely(!get_page_unless_zero(page))) >> - goto isolate_fail; >> + goto isolate_fail_put; >> /* Only take pages on LRU: a check now makes later tests safe */ >> if (!PageLRU(page)) > Thanks, Gavin
On 24.11.22 11:21, Gavin Shan wrote: > On 11/24/22 6:09 PM, David Hildenbrand wrote: >> On 24.11.22 10:55, Gavin Shan wrote: >>> The issue is reported when removing memory through virtio_mem device. >>> The transparent huge page, experienced copy-on-write fault, is wrongly >>> regarded as pinned. The transparent huge page is escaped from being >>> isolated in isolate_migratepages_block(). The transparent huge page >>> can't be migrated and the corresponding memory block can't be put >>> into offline state. >>> >>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>> the transparent huge page can be isolated and migrated, and the memory >>> block can be put into offline state. Besides, The page's refcount is >>> increased a bit earlier to avoid the page is released when the check >>> is executed. >> >> Did you look into handling pages that are in the swapcache case as well? >> >> See is_refcount_suitable() in mm/khugepaged.c. >> >> Should be easy to reproduce, let me know if you need inspiration. >> > > Nope, I didn't look into the case. Please elaborate the details so that > I can reproduce it firstly. A simple reproducer would be (on a system with ordinary swap (not zram)) 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP 2) Enable THP for that region (MADV_HUGEPAGE) 3) Populate a THP (e.g., write access) 4) PTE-map the THP, for example, using MADV_FREE on the last subpage 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT 6) Read-access to some subpages to fault them in from the swapcache Now you'd have a THP, which 1) Is partially PTE-mapped into the page table 2) Is in the swapcache (each subpage should have one reference from the swapache) Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem).
On 24 Nov 2022, at 5:43, David Hildenbrand wrote: > On 24.11.22 11:21, Gavin Shan wrote: >> On 11/24/22 6:09 PM, David Hildenbrand wrote: >>> On 24.11.22 10:55, Gavin Shan wrote: >>>> The issue is reported when removing memory through virtio_mem device. >>>> The transparent huge page, experienced copy-on-write fault, is wrongly >>>> regarded as pinned. The transparent huge page is escaped from being >>>> isolated in isolate_migratepages_block(). The transparent huge page >>>> can't be migrated and the corresponding memory block can't be put >>>> into offline state. >>>> >>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>>> the transparent huge page can be isolated and migrated, and the memory >>>> block can be put into offline state. Besides, The page's refcount is >>>> increased a bit earlier to avoid the page is released when the check >>>> is executed. >>> >>> Did you look into handling pages that are in the swapcache case as well? >>> >>> See is_refcount_suitable() in mm/khugepaged.c. >>> >>> Should be easy to reproduce, let me know if you need inspiration. >>> >> >> Nope, I didn't look into the case. Please elaborate the details so that >> I can reproduce it firstly. > > > A simple reproducer would be (on a system with ordinary swap (not zram)) > > 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP > > 2) Enable THP for that region (MADV_HUGEPAGE) > > 3) Populate a THP (e.g., write access) > > 4) PTE-map the THP, for example, using MADV_FREE on the last subpage > > 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT Added the original THP swapout code author, Ying. At this step, the THP will be split, right? https://elixir.bootlin.com/linux/latest/source/mm/vmscan.c#L1786 Even if a THP has PMD mapping, IIRC, it is split in the add_to_swap() then swapped out. But I cannot find that split code now. > > 6) Read-access to some subpages to fault them in from the swapcache > > > Now you'd have a THP, which > > 1) Is partially PTE-mapped into the page table > 2) Is in the swapcache (each subpage should have one reference from the swapache) > > > Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem). > > -- > Thanks, > > David / dhildenb -- Best Regards, Yan Zi
On 11/24/22 6:43 PM, David Hildenbrand wrote: > On 24.11.22 11:21, Gavin Shan wrote: >> On 11/24/22 6:09 PM, David Hildenbrand wrote: >>> On 24.11.22 10:55, Gavin Shan wrote: >>>> The issue is reported when removing memory through virtio_mem device. >>>> The transparent huge page, experienced copy-on-write fault, is wrongly >>>> regarded as pinned. The transparent huge page is escaped from being >>>> isolated in isolate_migratepages_block(). The transparent huge page >>>> can't be migrated and the corresponding memory block can't be put >>>> into offline state. >>>> >>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>>> the transparent huge page can be isolated and migrated, and the memory >>>> block can be put into offline state. Besides, The page's refcount is >>>> increased a bit earlier to avoid the page is released when the check >>>> is executed. >>> >>> Did you look into handling pages that are in the swapcache case as well? >>> >>> See is_refcount_suitable() in mm/khugepaged.c. >>> >>> Should be easy to reproduce, let me know if you need inspiration. >>> >> >> Nope, I didn't look into the case. Please elaborate the details so that >> I can reproduce it firstly. > > > A simple reproducer would be (on a system with ordinary swap (not zram)) > > 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP > > 2) Enable THP for that region (MADV_HUGEPAGE) > > 3) Populate a THP (e.g., write access) > > 4) PTE-map the THP, for example, using MADV_FREE on the last subpage > > 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT > > 6) Read-access to some subpages to fault them in from the swapcache > > > Now you'd have a THP, which > > 1) Is partially PTE-mapped into the page table > 2) Is in the swapcache (each subpage should have one reference from the swapache) > > > Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem). > Thanks for the details. Step (4) and (5) can be actually combined. To swap part of the THP (e.g. one sub-page) will force the THP to be split. I followed your steps in the attached program, there is no issue to do memory hot-remove through virtio-mem with or without this patch. # numactl -p 1 testsuite mm swap -k Any key to split THP Any key to swap sub-pages Any key to read the swapped sub-pages Page[000]: 0xffffffffffffffff Page[001]: 0xffffffffffffffff : Page[255]: 0xffffffffffffffff Any key to exit // hold here and the program doesn't exit (qemu) qom-set vm1 requested-size 0 [ 356.005396] virtio_mem virtio1: plugged size: 0x40000000 [ 356.005996] virtio_mem virtio1: requested size: 0x0 [ 356.350299] Fallback order for Node 0: 0 1 [ 356.350810] Fallback order for Node 1: 1 0 [ 356.351260] Built 2 zonelists, mobility grouping on. Total pages: 491343 [ 356.351998] Policy zone: DMA Thanks, Gavin
On 24.11.22 13:38, Zi Yan wrote: > > On 24 Nov 2022, at 5:43, David Hildenbrand wrote: > >> On 24.11.22 11:21, Gavin Shan wrote: >>> On 11/24/22 6:09 PM, David Hildenbrand wrote: >>>> On 24.11.22 10:55, Gavin Shan wrote: >>>>> The issue is reported when removing memory through virtio_mem device. >>>>> The transparent huge page, experienced copy-on-write fault, is wrongly >>>>> regarded as pinned. The transparent huge page is escaped from being >>>>> isolated in isolate_migratepages_block(). The transparent huge page >>>>> can't be migrated and the corresponding memory block can't be put >>>>> into offline state. >>>>> >>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>>>> the transparent huge page can be isolated and migrated, and the memory >>>>> block can be put into offline state. Besides, The page's refcount is >>>>> increased a bit earlier to avoid the page is released when the check >>>>> is executed. >>>> >>>> Did you look into handling pages that are in the swapcache case as well? >>>> >>>> See is_refcount_suitable() in mm/khugepaged.c. >>>> >>>> Should be easy to reproduce, let me know if you need inspiration. >>>> >>> >>> Nope, I didn't look into the case. Please elaborate the details so that >>> I can reproduce it firstly. >> >> >> A simple reproducer would be (on a system with ordinary swap (not zram)) >> >> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP >> >> 2) Enable THP for that region (MADV_HUGEPAGE) >> >> 3) Populate a THP (e.g., write access) >> >> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage >> >> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT > > Added the original THP swapout code author, Ying. > > At this step, the THP will be split, right? > > https://elixir.bootlin.com/linux/latest/source/mm/vmscan.c#L1786 > > Even if a THP has PMD mapping, IIRC, it is split in the add_to_swap() > then swapped out. But I cannot find that split code now. I recall there was some sequence to achieve it. Maybe it was swapping out the PMD first and not triggering a PTE-mapping first. mm/vmscan.c:shrink_folio_list() if (folio_test_large(folio)) { /* cannot split folio, skip it */ if (!can_split_folio(folio, NULL)) goto activate_locked; /* * Split folios without a PMD map right * away. Chances are some or all of the * tail pages can be freed without IO. */ if (!folio_entire_mapcount(folio) && split_folio_to_list(folio, folio_list)) goto activate_locked; } } So the sequence might have to be 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP 2) Enable THP for that region (MADV_HUGEPAGE) 3) Populate a THP (e.g., write access) 4) Trigger swapout of the THP, for example, using MADV_PAGEOUT 5) Access some subpage As we don't have PMD swap entries, we will PTE-map the THP during try_to_unmap() IIRC. Independent of that, the check we have here also doesn't consider ordinary order-0 pages that might be in the swapache.
On 24.11.22 13:55, Gavin Shan wrote: > On 11/24/22 6:43 PM, David Hildenbrand wrote: >> On 24.11.22 11:21, Gavin Shan wrote: >>> On 11/24/22 6:09 PM, David Hildenbrand wrote: >>>> On 24.11.22 10:55, Gavin Shan wrote: >>>>> The issue is reported when removing memory through virtio_mem device. >>>>> The transparent huge page, experienced copy-on-write fault, is wrongly >>>>> regarded as pinned. The transparent huge page is escaped from being >>>>> isolated in isolate_migratepages_block(). The transparent huge page >>>>> can't be migrated and the corresponding memory block can't be put >>>>> into offline state. >>>>> >>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>>>> the transparent huge page can be isolated and migrated, and the memory >>>>> block can be put into offline state. Besides, The page's refcount is >>>>> increased a bit earlier to avoid the page is released when the check >>>>> is executed. >>>> >>>> Did you look into handling pages that are in the swapcache case as well? >>>> >>>> See is_refcount_suitable() in mm/khugepaged.c. >>>> >>>> Should be easy to reproduce, let me know if you need inspiration. >>>> >>> >>> Nope, I didn't look into the case. Please elaborate the details so that >>> I can reproduce it firstly. >> >> >> A simple reproducer would be (on a system with ordinary swap (not zram)) >> >> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP >> >> 2) Enable THP for that region (MADV_HUGEPAGE) >> >> 3) Populate a THP (e.g., write access) >> >> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage >> >> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT >> >> 6) Read-access to some subpages to fault them in from the swapcache >> >> >> Now you'd have a THP, which >> >> 1) Is partially PTE-mapped into the page table >> 2) Is in the swapcache (each subpage should have one reference from the swapache) >> >> >> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem). >> > > Thanks for the details. Step (4) and (5) can be actually combined. To swap part of > the THP (e.g. one sub-page) will force the THP to be split. > > I followed your steps in the attached program, there is no issue to do memory hot-remove > through virtio-mem with or without this patch. Interesting. But I don't really see how we could pass this check with a page that's in the swapcache, maybe I'm missing something else. I'll try to see if I can reproduce it.
On 24.11.22 14:22, David Hildenbrand wrote: > On 24.11.22 13:55, Gavin Shan wrote: >> On 11/24/22 6:43 PM, David Hildenbrand wrote: >>> On 24.11.22 11:21, Gavin Shan wrote: >>>> On 11/24/22 6:09 PM, David Hildenbrand wrote: >>>>> On 24.11.22 10:55, Gavin Shan wrote: >>>>>> The issue is reported when removing memory through virtio_mem device. >>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly >>>>>> regarded as pinned. The transparent huge page is escaped from being >>>>>> isolated in isolate_migratepages_block(). The transparent huge page >>>>>> can't be migrated and the corresponding memory block can't be put >>>>>> into offline state. >>>>>> >>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>>>>> the transparent huge page can be isolated and migrated, and the memory >>>>>> block can be put into offline state. Besides, The page's refcount is >>>>>> increased a bit earlier to avoid the page is released when the check >>>>>> is executed. >>>>> >>>>> Did you look into handling pages that are in the swapcache case as well? >>>>> >>>>> See is_refcount_suitable() in mm/khugepaged.c. >>>>> >>>>> Should be easy to reproduce, let me know if you need inspiration. >>>>> >>>> >>>> Nope, I didn't look into the case. Please elaborate the details so that >>>> I can reproduce it firstly. >>> >>> >>> A simple reproducer would be (on a system with ordinary swap (not zram)) >>> >>> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP >>> >>> 2) Enable THP for that region (MADV_HUGEPAGE) >>> >>> 3) Populate a THP (e.g., write access) >>> >>> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage >>> >>> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT >>> >>> 6) Read-access to some subpages to fault them in from the swapcache >>> >>> >>> Now you'd have a THP, which >>> >>> 1) Is partially PTE-mapped into the page table >>> 2) Is in the swapcache (each subpage should have one reference from the swapache) >>> >>> >>> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem). >>> >> >> Thanks for the details. Step (4) and (5) can be actually combined. To swap part of >> the THP (e.g. one sub-page) will force the THP to be split. >> >> I followed your steps in the attached program, there is no issue to do memory hot-remove >> through virtio-mem with or without this patch. > > Interesting. But I don't really see how we could pass this check with a > page that's in the swapcache, maybe I'm missing something else. > > I'll try to see if I can reproduce it. > After some unsuccessful attempts and many head-scratches, I realized that it's quite simple why we don't have to worry about swapcache pages here: page_mapping() is != NULL for pages in the swapcache: folio_mapping() makes this rather obvious: if (unlikely(folio_test_swapcache(folio)) return swap_address_space(folio_swap_entry(folio)); I think the get_page_unless_zero() might also be a fix for the page_mapping() call, smells like something could blow up on concurrent page freeing. (what about concurrent removal from the swapcache? nobody knows :) ) Thanks Gavin! Acked-by: David Hildenbrand <david@redhat.com>
With the patch applied, I'm unable to hit memory hot-remove failure in the environment where the issue was initially found. Tested-by: Zhenyu Zhang <zhenyzha@redhat.com> On Thu, Nov 24, 2022 at 10:09 PM David Hildenbrand <david@redhat.com> wrote: > > On 24.11.22 14:22, David Hildenbrand wrote: > > On 24.11.22 13:55, Gavin Shan wrote: > >> On 11/24/22 6:43 PM, David Hildenbrand wrote: > >>> On 24.11.22 11:21, Gavin Shan wrote: > >>>> On 11/24/22 6:09 PM, David Hildenbrand wrote: > >>>>> On 24.11.22 10:55, Gavin Shan wrote: > >>>>>> The issue is reported when removing memory through virtio_mem device. > >>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly > >>>>>> regarded as pinned. The transparent huge page is escaped from being > >>>>>> isolated in isolate_migratepages_block(). The transparent huge page > >>>>>> can't be migrated and the corresponding memory block can't be put > >>>>>> into offline state. > >>>>>> > >>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, > >>>>>> the transparent huge page can be isolated and migrated, and the memory > >>>>>> block can be put into offline state. Besides, The page's refcount is > >>>>>> increased a bit earlier to avoid the page is released when the check > >>>>>> is executed. > >>>>> > >>>>> Did you look into handling pages that are in the swapcache case as well? > >>>>> > >>>>> See is_refcount_suitable() in mm/khugepaged.c. > >>>>> > >>>>> Should be easy to reproduce, let me know if you need inspiration. > >>>>> > >>>> > >>>> Nope, I didn't look into the case. Please elaborate the details so that > >>>> I can reproduce it firstly. > >>> > >>> > >>> A simple reproducer would be (on a system with ordinary swap (not zram)) > >>> > >>> 1) mmap a region (MAP_ANON|MAP_PRIVATE) that can hold a THP > >>> > >>> 2) Enable THP for that region (MADV_HUGEPAGE) > >>> > >>> 3) Populate a THP (e.g., write access) > >>> > >>> 4) PTE-map the THP, for example, using MADV_FREE on the last subpage > >>> > >>> 5) Trigger swapout of the THP, for example, using MADV_PAGEOUT > >>> > >>> 6) Read-access to some subpages to fault them in from the swapcache > >>> > >>> > >>> Now you'd have a THP, which > >>> > >>> 1) Is partially PTE-mapped into the page table > >>> 2) Is in the swapcache (each subpage should have one reference from the swapache) > >>> > >>> > >>> Now we could test, if alloc_contig_range() will still succeed (e.g., using virtio-mem). > >>> > >> > >> Thanks for the details. Step (4) and (5) can be actually combined. To swap part of > >> the THP (e.g. one sub-page) will force the THP to be split. > >> > >> I followed your steps in the attached program, there is no issue to do memory hot-remove > >> through virtio-mem with or without this patch. > > > > Interesting. But I don't really see how we could pass this check with a > > page that's in the swapcache, maybe I'm missing something else. > > > > I'll try to see if I can reproduce it. > > > > After some unsuccessful attempts and many head-scratches, I realized > that it's quite simple why we don't have to worry about swapcache pages > here: > > page_mapping() is != NULL for pages in the swapcache: folio_mapping() > makes this rather obvious: > > if (unlikely(folio_test_swapcache(folio)) > return swap_address_space(folio_swap_entry(folio)); > > > I think the get_page_unless_zero() might also be a fix for the > page_mapping() call, smells like something could blow up on concurrent > page freeing. (what about concurrent removal from the swapcache? nobody > knows :) ) > > > Thanks Gavin! > > Acked-by: David Hildenbrand <david@redhat.com> > > > -- > Thanks, > > David / dhildenb >
diff --git a/mm/compaction.c b/mm/compaction.c index c51f7f545afe..1f6da31dd9a5 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; } + /* + * Be careful not to clear PageLRU until after we're + * sure the page is not being freed elsewhere -- the + * page release code relies on it. + */ + if (unlikely(!get_page_unless_zero(page))) + goto isolate_fail; + /* * Migration will fail if an anonymous page is pinned in memory, * so avoid taking lru_lock and isolating it unnecessarily in an * admittedly racy check. */ mapping = page_mapping(page); - if (!mapping && page_count(page) > page_mapcount(page)) - goto isolate_fail; + if (!mapping && (page_count(page) - 1) > total_mapcount(page)) + goto isolate_fail_put; /* * Only allow to migrate anonymous pages in GFP_NOFS context * because those do not depend on fs locks. */ if (!(cc->gfp_mask & __GFP_FS) && mapping) - goto isolate_fail; - - /* - * Be careful not to clear PageLRU until after we're - * sure the page is not being freed elsewhere -- the - * page release code relies on it. - */ - if (unlikely(!get_page_unless_zero(page))) - goto isolate_fail; + goto isolate_fail_put; /* Only take pages on LRU: a check now makes later tests safe */ if (!PageLRU(page))