Message ID | 20221123005752.161003-1-gshan@redhat.com |
---|---|
State | New |
From | Gavin Shan <gshan@redhat.com> |
To | linux-mm@kvack.org |
Cc | linux-kernel@vger.kernel.org, akpm@linux-foundation.org, william.kucharski@oracle.com, ziy@nvidia.com, kirill.shutemov@linux.intel.com, david@redhat.com, zhenyzha@redhat.com, shan.gavin@gmail.com |
Subject | [PATCH] mm: migrate: Fix THP's mapcount on isolation |
Date | Wed, 23 Nov 2022 08:57:52 +0800 |
Series | mm: migrate: Fix THP's mapcount on isolation |
Commit Message
Gavin Shan
Nov. 23, 2022, 12:57 a.m. UTC
The issue is reported when removing memory through a virtio_mem device. A transparent huge page that experienced a copy-on-write fault is wrongly regarded as pinned and escapes isolation in isolate_migratepages_block(). The page can't be migrated and the corresponding memory block can't be put into the offline state.

Fix it by replacing page_mapcount() with total_mapcount(). With this, the transparent huge page can be isolated and migrated, and the memory block can be put into the offline state.
Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
Cc: stable@vger.kernel.org # v5.8+
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
mm/compaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
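For context on the one-line change: a CoW fault on an anon THP (after 3917c80280c9) can leave the compound page mapped by many PTEs rather than a single PMD. page_mapcount() on the head page then sees only one mapping, while every PTE-mapped subpage holds a reference, so "page_count(page) > page_mapcount(page)" is true even without any pin. The toy userspace model below illustrates the arithmetic; it is a sketch with simplified numbers, not kernel code (the kernel's refcount has more contributors, e.g. the swapcache).

	#include <stdio.h>

	/*
	 * Toy model of a 2 MiB anon THP (512 subpages) whose PMD mapping was
	 * split by a CoW fault, leaving the compound page PTE-mapped.
	 * Illustrative numbers only, not the kernel's exact accounting.
	 */
	#define NR_SUBPAGES 512

	int main(void)
	{
		int page_count = NR_SUBPAGES;     /* roughly one reference per PTE-mapped subpage */
		int page_mapcount = 1;            /* the head subpage's own PTE mapping */
		int total_mapcount = NR_SUBPAGES; /* sum of all subpages' mapcounts */

		printf("old check: %d > %d -> %s\n", page_count, page_mapcount,
		       page_count > page_mapcount ? "falsely looks pinned" : "ok");
		printf("new check: %d > %d -> %s\n", page_count, total_mapcount,
		       page_count > total_mapcount ? "looks pinned" : "can be isolated");
		return 0;
	}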
Comments
Gavin Shan <gshan@redhat.com> writes:

> The issue is reported when removing memory through a virtio_mem device. A transparent huge page that experienced a copy-on-write fault is wrongly regarded as pinned and escapes isolation in isolate_migratepages_block(). The page can't be migrated and the corresponding memory block can't be put into the offline state.
>
> Fix it by replacing page_mapcount() with total_mapcount(). With this, the transparent huge page can be isolated and migrated, and the memory block can be put into the offline state.
>
> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> Cc: stable@vger.kernel.org # v5.8+
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c51f7f545afe..c408b5e04c1d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -990,7 +990,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  	 * admittedly racy check.
>  	 */
>  	mapping = page_mapping(page);
> -	if (!mapping && page_count(page) > page_mapcount(page))
> +	if (!mapping && page_count(page) > total_mapcount(page))

We have several versions of these checks for pinned pages open-coded around the place. See for example migrate_vma_check_page() and folio_expected_refs(). It looks like you could use a variant of migrate_vma_check_page() which would also check non-anon pins, although I don't know the compaction code well enough to know if that's useful.

Either way it would be nice if we had a common helper for these kinds of checks. I guess that would be harder to backport, and the change itself looks ok. But why isn't the fixes tag 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages")?

>  		goto isolate_fail;
>
>  	/*
Hi Alistair,

On 11/23/22 12:26 PM, Alistair Popple wrote:

> Gavin Shan <gshan@redhat.com> writes:
>
>> The issue is reported when removing memory through a virtio_mem device. [...]
>>
>> -	if (!mapping && page_count(page) > page_mapcount(page))
>> +	if (!mapping && page_count(page) > total_mapcount(page))
>
> We have several versions of these checks for pinned pages open-coded around the place. See for example migrate_vma_check_page() and folio_expected_refs(). [...]
>
> Either way it would be nice if we had a common helper for these kinds of checks. I guess that would be harder to backport, and the change itself looks ok. But why isn't the fixes tag 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages")?

Nice to see your comments. folio_expected_refs() only returns the mapcount for file-mapped pages, and migrate_vma_check_page() doesn't cover THP (there is a 'FIXME' there). A unified helper would help maintenance. I think the issue can be fixed in place, with a followup patch to introduce the unified helper, to keep backporting simple.

Right, I was thinking of 119d6d59dcc0, which was merged back in the early days, in v3.15. I doubt we even had THP migration support at that time. 3917c80280c9 changed the behavior of THP CoW handling to split the THP; without 3917c80280c9, no splitting is expected and the original condition is the correct check for anon pins.

>>  		goto isolate_fail;
>>
>>  	/*

Thanks,
Gavin
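To make the "common helper" idea concrete, such a helper might look roughly like the sketch below. The name and exact semantics are hypothetical (this is not an existing kernel API); it assumes the caller holds exactly one folio reference and borrows the swapcache accounting from can_split_folio(), which comes up later in the thread.

	/*
	 * Hypothetical unified pin check, sketched for discussion only.
	 * Assumes the caller holds exactly one reference on the folio.
	 * Racy by nature: callers must tolerate false positives/negatives.
	 */
	static inline bool folio_maybe_extra_referenced(struct folio *folio)
	{
		long expected = folio_mapcount(folio) + 1;	/* mappings + caller's ref */

		/* Each subpage of an anon THP may hold a swapcache reference. */
		if (folio_test_swapcache(folio))
			expected += folio_nr_pages(folio);

		return folio_ref_count(folio) > expected;
	}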
On Wed, 23 Nov 2022, Gavin Shan wrote:

> The issue is reported when removing memory through a virtio_mem device. A transparent huge page that experienced a copy-on-write fault is wrongly regarded as pinned and escapes isolation in isolate_migratepages_block(). The page can't be migrated and the corresponding memory block can't be put into the offline state.
>
> Fix it by replacing page_mapcount() with total_mapcount(). With this, the transparent huge page can be isolated and migrated, and the memory block can be put into the offline state.
>
> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> Cc: stable@vger.kernel.org # v5.8+
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>

Interesting, good catch, looked right to me: except for the Fixes line and mention of v5.8. That CoW change may have added a case which easily demonstrates the problem, but it would have been the wrong test on a THP for long before then - but only in v5.7 were compound pages allowed through at all to reach that test, so I think it should be

Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Cc: stable@vger.kernel.org # v5.7+

Oh, no, stop: this is not so easy, even in the latest tree.

Because at the time of that "admittedly racy check", we have no hold at all on the page in question: and if it's PageLRU or PageCompound at one instant, it may be different the next instant. Which leaves it vulnerable to whatever BUG_ON()s there may be in the total_mapcount() path - needs research. *Perhaps* there are no more BUG_ON()s in the total_mapcount() path than in the existing page_mapcount() path.

I suspect that for this to be safe (before your patch and more so after), it will be necessary to shift the "admittedly racy check" down after the get_page_unless_zero() (and check the sequence of operations when a compound page is initialized).

The races I'm talking about are much much rarer than the condition you are trying to avoid, so it's frustrating; but such races are real, and increasing stable's exposure to them is not so good.

Sorry, I'm going to run away now: just raising these concerns without working on the solution.

Hugh

> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c51f7f545afe..c408b5e04c1d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -990,7 +990,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  	 * admittedly racy check.
>  	 */
>  	mapping = page_mapping(page);
> -	if (!mapping && page_count(page) > page_mapcount(page))
> +	if (!mapping && page_count(page) > total_mapcount(page))
>  		goto isolate_fail;
>
>  	/*
> --
> 2.23.0
On 23.11.22 06:14, Hugh Dickins wrote:

> On Wed, 23 Nov 2022, Gavin Shan wrote:
>
>> The issue is reported when removing memory through a virtio_mem device. [...]
>
> Interesting, good catch, looked right to me: except for the Fixes line and mention of v5.8. [...] so I think it should be
>
> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
> Cc: stable@vger.kernel.org # v5.7+
>
> Oh, no, stop: this is not so easy, even in the latest tree.
>
> Because at the time of that "admittedly racy check", we have no hold at all on the page in question: and if it's PageLRU or PageCompound at one instant, it may be different the next instant. Which leaves it vulnerable to whatever BUG_ON()s there may be in the total_mapcount() path - needs research. *Perhaps* there are no more BUG_ON()s in the total_mapcount() path than in the existing page_mapcount() path.
>
> I suspect that for this to be safe (before your patch and more so after), it will be necessary to shift the "admittedly racy check" down after the get_page_unless_zero() (and check the sequence of operations when a compound page is initialized).

Grabbing a reference first sounds like the right approach to me.

> The races I'm talking about are much much rarer than the condition you are trying to avoid, so it's frustrating; but such races are real, and increasing stable's exposure to them is not so good.

Such checks are always racy and the code has to be able to deal with false negatives/positives (we're not even holding the page lock); as you state, we just don't want to trigger undefined behavior/BUG.

I'm also curious how that migration code handles a THP that's in the swapcache. It had better handle such pages correctly, for example, by removing them from the swapcache first, otherwise that could block migration.

For example, in mm/ksm.c:write_protect_page() we have

	"page_mapcount(page) + 1 + swapped != page_count(page)"

page_mapcount() and "swapped==0/1" make sense to me, because KSM only cares about order-0 pages, so there's no need for THP games.

But we do have an even better helper in place already: mm/huge_memory.c:can_split_folio()

Which cares about

a) Swapcache for THP: each subpage could be in the swapcache
b) Requires the caller to hold one reference to be safe

But I am a bit confused about the "extra_pins" for !anon. Where do the folio_nr_pages() references come from?

So *maybe* it makes sense to factor out can_split_folio() and call it something like "folio_maybe_additionally_referenced" [to clearly distinguish it from "folio_maybe_dma_pinned", which cares about actual page pinning (read/write page content)].

Such a function could return false positives/negatives due to races, and the caller would have to hold one reference and be able to deal with the semantics.
On Wed, Nov 23, 2022 at 09:56:38AM +0100, David Hildenbrand wrote:

> But we do have an even better helper in place already: mm/huge_memory.c:can_split_folio()
>
> Which cares about
>
> a) Swapcache for THP: each subpage could be in the swapcache
> b) Requires the caller to hold one reference to be safe
>
> But I am a bit confused about the "extra_pins" for !anon. Where do the folio_nr_pages() references come from?

When we add a folio to the page cache, we increment its refcount by folio_nr_pages() instead of by 1. I suspect this is no longer needed (if it was ever needed) and it could be changed. See __filemap_add_folio():

	long nr = 1;
	if (!huge) {
		nr = folio_nr_pages(folio);
		folio_ref_add(folio, nr);

> So *maybe* it makes sense to factor out can_split_folio() and call it something like "folio_maybe_additionally_referenced" [to clearly distinguish it from "folio_maybe_dma_pinned", which cares about actual page pinning (read/write page content)].
>
> Such a function could return false positives/negatives due to races, and the caller would have to hold one reference and be able to deal with the semantics.

I don't like the 'pextra_pins' parameter to a generic function ...
On 11/23/22 4:56 PM, David Hildenbrand wrote:

> On 23.11.22 06:14, Hugh Dickins wrote:
>
>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>
>>> The issue is reported when removing memory through a virtio_mem device. [...]
>>
>> Interesting, good catch, looked right to me: except for the Fixes line and mention of v5.8. [...] so I think it should be
>>
>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
>> Cc: stable@vger.kernel.org # v5.7+

Right, commit 1da2f328fa64 looks more accurate in this particular case; I will fix it up in the next revision.

>> Oh, no, stop: this is not so easy, even in the latest tree.
>>
>> Because at the time of that "admittedly racy check", we have no hold at all on the page in question [...]
>>
>> I suspect that for this to be safe (before your patch and more so after), it will be necessary to shift the "admittedly racy check" down after the get_page_unless_zero() (and check the sequence of operations when a compound page is initialized).
>
> Grabbing a reference first sounds like the right approach to me.

Yeah, it sounds reasonable to me to grab a reference on page->_refcount in the first place. Looking at isolate_migratepages_block(), the page's refcount is increased by get_page_unless_zero(), but only after the racy check. Increasing the page's refcount at the very start of the function would conflict with the hugetlb and non-LRU page handling, so the code would need to be refactored before the refcount can be grabbed up front.

So I plan to post a followup series to refactor the code and grab the page's refcount in the first place. In this way, the fix can be merged as soon as possible. David and Hugh, please let me know if that's a reasonable plan? :)

	static int isolate_migratepages_block()
	{
		for (; low_pfn < end_pfn; low_pfn++) {
			:
			page = pfn_to_page(low_pfn);
			if (unlikely(!get_page_unless_zero(page)))	// grab page's refcount in the first place
				goto isolate_fail_put;
			:
			if (PageHuge(page) && cc->alloc_contig) {
				ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);	// refcount is increased by this function
				:
			}
			:
			if (!PageLRU(page)) {
				if (unlikely(__PageMovable(page)) && !PageIsolated(page)) {
					if (!isolate_movable_page(page, mode))	// refcount is increased here too
						goto isolate_success;
				}
			}
			:
			mapping = page_mapping(page);
			if (!mapping && page_count(page) > total_mapcount(page))
				goto isolate_fail;
			:
			if (unlikely(!get_page_unless_zero(page)))	// too late to grab the refcount?
				goto isolate_fail;
			:
		}
	}

[...]

Thanks,
Gavin
David Hildenbrand <david@redhat.com> writes:

> On 23.11.22 06:14, Hugh Dickins wrote:
>
>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>
>>> The issue is reported when removing memory through a virtio_mem device. [...]
>>
>> [...] I suspect that for this to be safe (before your patch and more so after), it will be necessary to shift the "admittedly racy check" down after the get_page_unless_zero() (and check the sequence of operations when a compound page is initialized).
>
> Grabbing a reference first sounds like the right approach to me.

I think you're right. Without a page reference I don't think it is even safe to look at struct page, at least not without synchronisation against memory hot unplug which could remove the struct page. From a quick glance I didn't see anything here that obviously did that though.

>> The races I'm talking about are much much rarer than the condition you are trying to avoid, so it's frustrating; but such races are real, and increasing stable's exposure to them is not so good.
>
> Such checks are always racy and the code has to be able to deal with false negatives/positives (we're not even holding the page lock); as you state, we just don't want to trigger undefined behavior/BUG.
>
> I'm also curious how that migration code handles a THP that's in the swapcache. It had better handle such pages correctly, for example, by removing them from the swapcache first, otherwise that could block migration.
>
> For example, in mm/ksm.c:write_protect_page() we have
>
> 	"page_mapcount(page) + 1 + swapped != page_count(page)"
>
> page_mapcount() and "swapped==0/1" make sense to me, because KSM only cares about order-0 pages, so there's no need for THP games.
>
> But we do have an even better helper in place already: mm/huge_memory.c:can_split_folio()
>
> Which cares about
>
> a) Swapcache for THP: each subpage could be in the swapcache
> b) Requires the caller to hold one reference to be safe
>
> But I am a bit confused about the "extra_pins" for !anon. Where do the folio_nr_pages() references come from?
>
> So *maybe* it makes sense to factor out can_split_folio() and call it something like "folio_maybe_additionally_referenced" [to clearly distinguish it from "folio_maybe_dma_pinned", which cares about actual page pinning (read/write page content)].
>
> Such a function could return false positives/negatives due to races, and the caller would have to hold one reference and be able to deal with the semantics.
On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:

> David Hildenbrand <david@redhat.com> writes:
>
>> On 23.11.22 06:14, Hugh Dickins wrote:
>>
>>> [...] I suspect that for this to be safe (before your patch and more so after), it will be necessary to shift the "admittedly racy check" down after the get_page_unless_zero() (and check the sequence of operations when a compound page is initialized).
>>
>> Grabbing a reference first sounds like the right approach to me.
>
> I think you're right. Without a page reference I don't think it is even safe to look at struct page, at least not without synchronisation against memory hot unplug which could remove the struct page. From a quick glance I didn't see anything here that obviously did that though.

Memory hotplug is the offending party here. It has to make sure that everything else is definitely quiescent before removing the struct pages. Otherwise you can't even try_get a refcount.
On 24.11.22 01:14, Gavin Shan wrote:

> On 11/23/22 4:56 PM, David Hildenbrand wrote:
>
>> Grabbing a reference first sounds like the right approach to me.
>
> Yeah, it sounds reasonable to me to grab a reference on page->_refcount in the first place. Looking at isolate_migratepages_block(), the page's refcount is increased by get_page_unless_zero(), but only after the racy check. [...]
>
> So I plan to post a followup series to refactor the code and grab the page's refcount in the first place. In this way, the fix can be merged as soon as possible. David and Hugh, please let me know if that's a reasonable plan? :)

Can't you just temporarily grab the refcount and drop it again? I mean, it's all racy either way and the code has to be able to cope with such races.
On 24.11.22 04:33, Matthew Wilcox wrote:

> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
>
>> I think you're right. Without a page reference I don't think it is even safe to look at struct page, at least not without synchronisation against memory hot unplug which could remove the struct page. From a quick glance I didn't see anything here that obviously did that though.
>
> Memory hotplug is the offending party here. It has to make sure that everything else is definitely quiescent before removing the struct pages. Otherwise you can't even try_get a refcount.

At least alloc_contig_range() and memory offlining are mutually exclusive due to MIGRATE_ISOLATE. I recall that ordinary memory compaction similarly deals with isolated pageblocks (or some other mechanism I forgot) to not race with memory offlining. I wouldn't worry about that for now.
On 23.11.22 17:07, Matthew Wilcox wrote:

> On Wed, Nov 23, 2022 at 09:56:38AM +0100, David Hildenbrand wrote:
>
>> But we do have an even better helper in place already: mm/huge_memory.c:can_split_folio()
>>
>> Which cares about
>>
>> a) Swapcache for THP: each subpage could be in the swapcache
>> b) Requires the caller to hold one reference to be safe
>>
>> But I am a bit confused about the "extra_pins" for !anon. Where do the folio_nr_pages() references come from?
>
> When we add a folio to the page cache, we increment its refcount by folio_nr_pages() instead of by 1. I suspect this is no longer needed (if it was ever needed) and it could be changed. See __filemap_add_folio():
>
> 	long nr = 1;
> 	if (!huge) {
> 		nr = folio_nr_pages(folio);
> 		folio_ref_add(folio, nr);
>
>> So *maybe* it makes sense to factor out can_split_folio() and call it something like "folio_maybe_additionally_referenced" [...]
>>
>> Such a function could return false positives/negatives due to races, and the caller would have to hold one reference and be able to deal with the semantics.
>
> I don't like the 'pextra_pins' parameter to a generic function ...

Right, that part should remain khugepaged specific. The assumption would be that the caller of the generic function holds exactly one additional reference.
On 11/24/22 4:46 PM, David Hildenbrand wrote:

> On 24.11.22 01:14, Gavin Shan wrote:
>
>> Yeah, it sounds reasonable to me to grab a reference on page->_refcount in the first place. Looking at isolate_migratepages_block(), the page's refcount is increased by get_page_unless_zero(), but only after the racy check. [...]
>>
>> So I plan to post a followup series to refactor the code and grab the page's refcount in the first place. In this way, the fix can be merged as soon as possible. David and Hugh, please let me know if that's a reasonable plan? :)
>
> Can't you just temporarily grab the refcount and drop it again? I mean, it's all racy either way and the code has to be able to cope with such races.

Well, we can do this by moving the hunk of code which increases the page's refcount ahead of the check:

	if (unlikely(!get_page_unless_zero(page)))
		goto isolate_fail;

	if (!mapping && (page_count(page) - 1) > total_mapcount(page))
		goto isolate_fail_put;

Thanks,
Gavin
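Spelling out the arithmetic of that reordering (the same two lines, annotated as a sketch): the reference just taken by get_page_unless_zero() has to be subtracted before comparing against the mapcount, and the failure path must now drop that reference, hence isolate_fail_put rather than isolate_fail.

	if (unlikely(!get_page_unless_zero(page)))
		goto isolate_fail;

	/*
	 * We now hold one extra reference ourselves, so an unpinned anon
	 * page satisfies page_count() == total_mapcount() + 1; subtracting
	 * 1 keeps the pin check equivalent to the original one.
	 */
	if (!mapping && (page_count(page) - 1) > total_mapcount(page))
		goto isolate_fail_put;	/* drops the reference we took */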
David Hildenbrand <david@redhat.com> writes:

> On 24.11.22 04:33, Matthew Wilcox wrote:
>
>> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
>>
>>> I think you're right. Without a page reference I don't think it is even safe to look at struct page, at least not without synchronisation against memory hot unplug which could remove the struct page. From a quick glance I didn't see anything here that obviously did that though.
>>
>> Memory hotplug is the offending party here. It has to make sure that everything else is definitely quiescent before removing the struct pages. Otherwise you can't even try_get a refcount.

Sure, I might be missing something, but how can memory hotplug possibly synchronise against some process (eg. try_to_compact_pages) doing try_get(pfn_to_page(random_pfn)) without that process either informing memory hotplug that it needs pages to remain valid long enough to get a reference, or detecting that memory hotplug is in the process of offlining pages?

> At least alloc_contig_range() and memory offlining are mutually exclusive due to MIGRATE_ISOLATE. I recall that ordinary memory compaction similarly deals with isolated pageblocks (or some other mechanism I forgot) to not race with memory offlining. I wouldn't worry about that for now.

Sorry, agree - to be clear, this discussion isn't relevant for this patch, but reviewing it got me thinking about how this works. I still don't see how alloc_contig_range() is totally safe against memory offlining though. From what I can see we have the following call path to set MIGRATE_ISOLATE:

alloc_contig_range(pfn) -> start_isolate_page_range(pfn) -> isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))

There's nothing in that call stack that prevents offlining of the page, hence the struct page may be freed by this point. Am I missing something else here?

try_to_compact_pages() has a similar issue, which is what I noticed reviewing this patch:

try_to_compact_pages() -> compact_zone_order() -> compact_zone() -> isolate_migratepages() -> isolate_migratepages_block() -> PageHuge(pfn_to_page(pfn))

Again I couldn't see anything in that path that would hold the page stable long enough to safely perform the pfn_to_page() and get a reference. And after a bit of fiddling I was able to trigger the below oops running the compaction_test selftest whilst offlining memory, so I don't think it is safe:

Entering kdb (current=0xffff8882452fb6c0, pid 5646) on processor 1
Oops: (null) due to oops @ 0xffffffff81af6486
CPU: 1 PID: 5646 Comm: compaction_test Not tainted 6.0.0-01424-gf3ec7e734795 #110
Hardware name: Gigabyte Technology Co., Ltd. B150M-D3H/B150M-D3H-CF, BIOS F24 04/11/2018
RIP: 0010:PageHuge+0xa6/0x180
Code: 00 0f 85 d0 00 00 00 48 8b 43 08 a8 01 0f 85 97 00 00 00 66 90 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 02 7e 7f 31 c0 80 7b 50 02 0f 94 c0 5b 41 5c
RSP: 0000:ffffc9001252efa8 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: fffffffffffffffe RCX: ffffffff81af63f9
RDX: 0000000000000009 RSI: 0000000000000008 RDI: 000000000000004e
RBP: ffffc9001252efb8 R08: 0000000000000000 R09: ffffea000f690007
R10: fffff94001ed2000 R11: 0000000000000001 R12: ffffea000f690008
R13: 0000000000000ab3 R14: ffffea000f690000 R15: 00000000003da400
FS:  00007f83e08b7740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb6e1cbb3e0 CR3: 0000000344044003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 isolate_migratepages_block+0x43c/0x3dc0
 ? reclaim_throttle+0x7a0/0x7a0
 ? __reset_isolation_suitable+0x350/0x350
 compact_zone+0xb73/0x31f0
 ? compaction_suitable+0x1e0/0x1e0
 compact_zone_order+0x127/0x240
 ? compact_zone+0x31f0/0x31f0
 ? __this_cpu_preempt_check+0x13/0x20
 ? cpuusage_write+0x380/0x380
 ? __kasan_check_read+0x11/0x20
 try_to_compact_pages+0x23b/0x770
 ? psi_task_change+0x201/0x300
 __alloc_pages_direct_compact+0x15d/0x650
 ? get_page_from_freelist+0x3fa0/0x3fa0
 ? psi_task_change+0x201/0x300
 ? _raw_spin_unlock+0x19/0x40
 __alloc_pages_slowpath.constprop.0+0x9e1/0x2260
 ? warn_alloc+0x1a0/0x1a0
 ? __zone_watermark_ok+0x430/0x430
 ? prepare_alloc_pages+0x40b/0x620
 __alloc_pages+0x42f/0x540
 ? __alloc_pages_slowpath.constprop.0+0x2260/0x2260
 alloc_buddy_huge_page.isra.0+0x7c/0x130
 alloc_fresh_huge_page+0x1f1/0x4e0
 alloc_pool_huge_page+0xab/0x2d0
 __nr_hugepages_store_common+0x37a/0xed0
 ? return_unused_surplus_pages+0x330/0x330
 ? __kasan_check_write+0x14/0x20
 ? _raw_spin_lock_irqsave+0x99/0x100
 ? proc_doulongvec_minmax+0x54/0x80
 hugetlb_sysctl_handler_common+0x247/0x320
 ? nr_hugepages_store+0xf0/0xf0
 ? alloc_huge_page+0xbf0/0xbf0
 hugetlb_sysctl_handler+0x20/0x30
 proc_sys_call_handler+0x451/0x650
 ? unregister_sysctl_table+0x1c0/0x1c0
 ? apparmor_file_permission+0x124/0x280
 ? security_file_permission+0x72/0x90
 proc_sys_write+0x13/0x20
 vfs_write+0x7ca/0xd80
 ? kernel_write+0x4d0/0x4d0
 ? do_sys_openat2+0x114/0x450
 ? __kasan_check_write+0x14/0x20
 ? down_write+0xb4/0x130
 ksys_write+0x116/0x220
 ? __kasan_check_write+0x14/0x20
 ? __ia32_sys_read+0xb0/0xb0
 ? debug_smp_processor_id+0x17/0x20
 ? fpregs_assert_state_consistent+0x52/0xc0
 __x64_sys_write+0x73/0xb0
 ? syscall_exit_to_user_mode+0x26/0x50
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f83e09b2190
Code: 40 00 48 8b 15 71 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d 51 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
RSP: 002b:00007ffe4c08e478 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffe4c08e648 RCX: 00007f83e09b2190
RDX: 0000000000000006 RSI: 000055d7575611f8 RDI: 0000000000000003
RBP: 00007ffe4c08e4c0 R08: 00007f83e0a8cc60 R09: 0000000000000000
R10: 00007f83e08d40b8 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffe4c08e658 R14: 000055d757562df0 R15: 00007f83e0adc020
 </TASK>
>>>> I think you're right. Without a page reference I don't think it is even safe to look at struct page, at least not without synchronisation against memory hot unplug which could remove the struct page. From a quick glance I didn't see anything here that obviously did that though.
>>>
>>> Memory hotplug is the offending party here. It has to make sure that everything else is definitely quiescent before removing the struct pages. Otherwise you can't even try_get a refcount.
>
> Sure, I might be missing something, but how can memory hotplug possibly synchronise against some process (eg. try_to_compact_pages) doing try_get(pfn_to_page(random_pfn)) without that process either informing memory hotplug that it needs pages to remain valid long enough to get a reference, or detecting that memory hotplug is in the process of offlining pages?

It currently doesn't, and it's been mostly a known theoretical problem so far. We've been ignoring these kinds of problems because they are not easy to sort out and so far never popped up in practice.

First, the correct approach is using pfn_to_online_page() instead of pfn_to_page() when in doubt whether the page might already be offline. While we're using pfn_to_online_page() already in quite some PFN walkers, chances are good that we're still missing some.

Second, essentially anybody (PFN walker) doing a pfn_to_online_page() is prone to races with memory offlining. I've discussed this in the past with Michal and Dan, and one idea was to use RCU to synchronize PFN walkers and pfn_to_online_page(), essentially synchronizing the clearing of the SECTION_IS_ONLINE flag.

Nobody worked on that so far because we've never had actual BUG reports. These kinds of races are rare, but they are theoretically possible.

>> At least alloc_contig_range() and memory offlining are mutually exclusive due to MIGRATE_ISOLATE. I recall that ordinary memory compaction similarly deals with isolated pageblocks (or some other mechanism I forgot) to not race with memory offlining. I wouldn't worry about that for now.
>
> Sorry, agree - to be clear, this discussion isn't relevant for this patch, but reviewing it got me thinking about how this works. I still don't see how alloc_contig_range() is totally safe against memory offlining though. From what I can see we have the following call path to set MIGRATE_ISOLATE:
>
> alloc_contig_range(pfn) -> start_isolate_page_range(pfn) -> isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))
>
> There's nothing in that call stack that prevents offlining of the page, hence the struct page may be freed by this point. Am I missing something else here?

Good point. I even had at some point a patch that converted some pfn_to_online_page() calls in there, but discarded it. IIRC, two alloc_contig_range() users are safe because:

1) virtio-mem synchronizes against memory offlining using the memory notifier. While memory offlining is in progress, it won't call alloc_contig_range().

2) CMA regions will always fail to offline because they have MIGRATE_CMA set.

What remains is alloc_contig_pages(), for example, used for memtrace on ppc, kfence, and the allocation of gigantic pages. That might indeed be racy. At least from kfence_init_late() it's unlikely (impossible?) that we'll have concurrent memory offlining.

> try_to_compact_pages() has a similar issue, which is what I noticed reviewing this patch:

Yes, I can spot pfn_to_online_page() in __reset_isolation_pfn(), but of course, we are likely missing proper synchronization and checks later. I wonder if we could use the memory notifier to temporarily pause any running compaction and to restart it once memory offlining succeeded.

> try_to_compact_pages() -> compact_zone_order() -> compact_zone() -> isolate_migratepages() -> isolate_migratepages_block() -> PageHuge(pfn_to_page(pfn))
>
> Again I couldn't see anything in that path that would hold the page stable long enough to safely perform the pfn_to_page() and get a reference. And after a bit of fiddling I was able to trigger the below oops running the compaction_test selftest whilst offlining memory, so I don't think it is safe:

Thanks for finding a reproducer. This is exactly the kind of BUG that we speculated about in the past but that we haven't seen in practice yet.

Having that said, I'd be very happy if someone would have time to work on proper synchronization, and I'd be happy to help brainstorming/reviewing :)
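As a rough sketch of the safer shape being discussed here (take the reference first, only then inspect the page), a PFN walker could be structured as below. This is illustrative only: pfn_to_online_page() and get_page_unless_zero() are real kernel helpers, but as noted above, the window between the online check and the reference grab still races with offlining unless extra synchronization (e.g. the RCU idea) is added.

	/*
	 * Sketch of a more defensive PFN walk, for discussion only.
	 * Still racy against offlining between the online check and the
	 * reference grab; closing that needs the synchronization above.
	 */
	static void walk_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_online_page(pfn);

			if (!page)				/* hole or offline section */
				continue;
			if (!get_page_unless_zero(page))	/* free or being freed */
				continue;

			/* Only now is it reasonably safe to inspect page state. */
			if (!page_mapping(page) &&
			    page_count(page) - 1 > total_mapcount(page)) {
				put_page(page);			/* probably pinned; skip */
				continue;
			}

			/* ... isolate/migrate the page ... */
			put_page(page);
		}
	}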
David Hildenbrand <david@redhat.com> writes:

>> Sure, I might be missing something, but how can memory hotplug possibly synchronise against some process (eg. try_to_compact_pages) doing try_get(pfn_to_page(random_pfn)) without that process either informing memory hotplug that it needs pages to remain valid long enough to get a reference, or detecting that memory hotplug is in the process of offlining pages?
>
> It currently doesn't, and it's been mostly a known theoretical problem so far. We've been ignoring these kinds of problems because they are not easy to sort out and so far never popped up in practice.
>
> First, the correct approach is using pfn_to_online_page() instead of pfn_to_page() when in doubt whether the page might already be offline. While we're using pfn_to_online_page() already in quite some PFN walkers, chances are good that we're still missing some.
>
> Second, essentially anybody (PFN walker) doing a pfn_to_online_page() is prone to races with memory offlining. I've discussed this in the past with Michal and Dan, and one idea was to use RCU to synchronize PFN walkers and pfn_to_online_page(), essentially synchronizing the clearing of the SECTION_IS_ONLINE flag.
>
> Nobody worked on that so far because we've never had actual BUG reports. These kinds of races are rare, but they are theoretically possible.

Thanks for providing some background context here. I have been digging into these details a bit because GPU drivers using device-private memory tend to exercise memory hot unplug paths more frequently than what I suspect is normally the case. I haven't actually seen any real BUG reports caused by this though, and the places where they're likely to occur would involve device-private pages at the moment anyway. So I agree such races are rare.

[...]

>> Again I couldn't see anything in that path that would hold the page stable long enough to safely perform the pfn_to_page() and get a reference. And after a bit of fiddling I was able to trigger the below oops running the compaction_test selftest whilst offlining memory, so I don't think it is safe:
>
> Thanks for finding a reproducer. This is exactly the kind of BUG that we speculated about in the past but that we haven't seen in practice yet.

No worries. The reproducer consisted of running the compaction_test selftest in a loop on a single NUMA node whilst cycling every memory block in that node online/offline in parallel (ie. not a particularly realistic real-world workload). Eventually that led to a couple of different kernel OOPS/BUG backtraces.

> Having that said, I'd be very happy if someone would have time to work on proper synchronization and I'd be happy to help brainstorming/reviewing :)

I would like to see the synchronisation cleaned up here and am happy to work on it. I'm unlikely to have much bandwidth to do so before the new year, but any brainstorming/review would certainly be appreciated when I do find some time!
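For reference, the offline/online half of such a reproducer can be driven purely through sysfs while the compaction_test selftest runs in a loop alongside it. The sketch below is an assumption-laden illustration, not the actual reproducer: the block range 512..1023 is hypothetical and would need to be replaced with the memory blocks of the NUMA node under test, and it must run as root.

	#include <stdio.h>

	/*
	 * Sketch of the memory-block cycling half of the reproducer.
	 * The block range is an assumption; blocks that are absent or
	 * busy (EBUSY on write) are simply skipped. Run as root.
	 */
	int main(void)
	{
		const char *states[] = { "offline", "online" };
		char path[128];

		for (;;) {
			for (int s = 0; s < 2; s++) {
				for (int blk = 512; blk < 1024; blk++) {
					snprintf(path, sizeof(path),
						 "/sys/devices/system/memory/memory%d/state",
						 blk);
					FILE *f = fopen(path, "w");
					if (!f)
						continue;
					fprintf(f, "%s\n", states[s]);
					fclose(f);
				}
			}
		}
		return 0;
	}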
diff --git a/mm/compaction.c b/mm/compaction.c
index c51f7f545afe..c408b5e04c1d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	 * admittedly racy check.
 	 */
 	mapping = page_mapping(page);
-	if (!mapping && page_count(page) > page_mapcount(page))
+	if (!mapping && page_count(page) > total_mapcount(page))
 		goto isolate_fail;
 
 	/*