Message ID | 20230307052036.1520708-4-stevensd@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp2262716wrd; Mon, 6 Mar 2023 21:58:12 -0800 (PST) X-Google-Smtp-Source: AK7set/t4gZhdhkDi3+DgLDyFZ4eRf19Fqc+qCYnm7XttvsTyvdFGCKG9R65q/fwXl02aLnvMKOn X-Received: by 2002:a17:907:7ea5:b0:8a5:3d1e:6302 with SMTP id qb37-20020a1709077ea500b008a53d1e6302mr15594190ejc.56.1678168692342; Mon, 06 Mar 2023 21:58:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1678168692; cv=none; d=google.com; s=arc-20160816; b=DnBTTlTVw0DQDBHFJgnL86fi43ohQqP3LGM2v1ijbhHH4LlDeJvz7/GQJHgdEabPoz nVZ+MHtM8DgGjxrroaip6saiVlL9neWja4ve6levB/BGodGjXYXD0AiHztI/ntF/ceNq lYN5BUBwLg4JlHvc6fySUF/Ny7nabzb3zGNZWq9plDfWAwkgrB6yw4X0AwBCXTKyLe2+ ECy937tXx38aCytpkpajnpJcyFoNo6QrEApvJC7wpY8BGr46vw6iVPfB7JSsODLVre5W XQMvOKZaXnad0+QdkD9/nNQ1VTPKdBrd9hynoKFlImWXCFwHBfg+I1Flsbocbx+P/w61 X+MQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=eFAeWPY2kcSCdT3O6CTUVlduCnpUJhZej+FbmoWYNkg=; b=iyLXn6u+s1C2GdT2tRnjDIR3dFwVB4NNdKz419GB6VMH9nx+26gTHKAicf/mrXLOJA yL008kt4Bm3FFcvRpp/3FfnU4+rj6Dk0vmFuxclZh+Qkj9lX/DsvhEIHyCDsCydWmNOb +kq7nfEzl+SuaudUKLZQMEBI6Ffe1o9RwEsB1feWf4L/14kxteseFebmQBUxU87zNPq1 M05FQWdUjYLx58Q7Dvu29ky5gs1cp7/JdII+wryY/SNgkvYmxz8fsIi8Z0nqXeitEffQ ir/bi6lHimOBdkXwuvLEVMQfcRxYE4Sj4KWWtXq3KsK46Fy6bHhCjpYiwTkZwQbA6XSx P+hw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=Q9sGR9Jp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id u13-20020a170906408d00b008d0dbf15be3si9837936ejj.676.2023.03.06.21.57.49; Mon, 06 Mar 2023 21:58:12 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=Q9sGR9Jp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230135AbjCGFV2 (ORCPT <rfc822;toshivichauhan@gmail.com> + 99 others); Tue, 7 Mar 2023 00:21:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33252 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230336AbjCGFVN (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 7 Mar 2023 00:21:13 -0500 Received: from mail-pf1-x42a.google.com (mail-pf1-x42a.google.com [IPv6:2607:f8b0:4864:20::42a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5212662318 for <linux-kernel@vger.kernel.org>; Mon, 6 Mar 2023 21:21:05 -0800 (PST) Received: by mail-pf1-x42a.google.com with SMTP id fa28so7298030pfb.12 for <linux-kernel@vger.kernel.org>; Mon, 06 Mar 2023 21:21:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; t=1678166464; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=eFAeWPY2kcSCdT3O6CTUVlduCnpUJhZej+FbmoWYNkg=; b=Q9sGR9JplEaeNqsSC0pRNCxNi/C7wmGhoCkA56L/loqSr4HfTSKbd24kgSyhErzSqH 0I6wjvacPrWo2mFsQrRLwwJXlg3UaeAhxOZ0HtAJPSslGOMFmzE06GPMU9FLNZYobvT/ 3o1S3aGwC8Ks5mnW1V4r0yFDwjDaYhzpIxfho= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1678166464; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=eFAeWPY2kcSCdT3O6CTUVlduCnpUJhZej+FbmoWYNkg=; b=fQpuaLLZp5uFy/G1KFCXyF+9k/Libhtx7fTkFudRveCKob2qEpITpCrUgEjjL1RH4P s71xOT2B5I+XIBTSwn8OZFGOsDX4j+fx4i17gMDR5d/s7gO8yirPBe9t1wqY9mWBAmYY T2niQjAD50ev+nftsBNiMOUXj1yczVVosEFhwUi6U5W1LkT3vOcbfF3TbFSbGYMMMEQW V5konlkTnq7WwPb0bK9U/JP3Ovdq2neYXyTgKuf0PsleNL7DYdnT0OiR3dj7679eZKLJ aw9UH2lImX+2GcdkeAlOk0P/xZTdctuGrZH043PZSPyh7oIZIf1k2pRrv5INq+0M1t5Y jMhw== X-Gm-Message-State: AO0yUKUTDfnFNvXqod1vxPycCsanxVDzXz7KkuvQux8PWkLdj9WOwHpa 7M0WACEfzStgpe2WBJnte9OosQ== X-Received: by 2002:a62:6542:0:b0:5a8:ad9d:83f with SMTP id z63-20020a626542000000b005a8ad9d083fmr10412674pfb.24.1678166464453; Mon, 06 Mar 2023 21:21:04 -0800 (PST) Received: from localhost ([2401:fa00:8f:203:1f73:9034:ce28:4421]) by smtp.gmail.com with UTF8SMTPSA id c26-20020aa78c1a000000b005a8f1187112sm7117378pfd.58.2023.03.06.21.21.01 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 06 Mar 2023 21:21:04 -0800 (PST) From: David Stevens <stevensd@chromium.org> X-Google-Original-From: David Stevens <stevensd@google.com> To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com>, Matthew Wilcox <willy@infradead.org>, "Kirill A . Shutemov" <kirill@shutemov.name>, Yang Shi <shy828301@gmail.com>, David Hildenbrand <david@redhat.com>, Hugh Dickins <hughd@google.com>, Jiaqi Yan <jiaqiyan@google.com>, linux-kernel@vger.kernel.org, David Stevens <stevensd@chromium.org> Subject: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag Date: Tue, 7 Mar 2023 14:20:36 +0900 Message-Id: <20230307052036.1520708-4-stevensd@google.com> X-Mailer: git-send-email 2.40.0.rc0.216.gc4246ad0f0-goog In-Reply-To: <20230307052036.1520708-1-stevensd@google.com> References: <20230307052036.1520708-1-stevensd@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1759687414815222670?= X-GMAIL-MSGID: =?utf-8?q?1759687414815222670?= |
Series |
mm/khugepaged: fix khugepaged+shmem races
|
|
Commit Message
David Stevens
March 7, 2023, 5:20 a.m. UTC
From: David Stevens <stevensd@chromium.org> Make sure that collapse_file doesn't interfere with checking the uptodate flag in the page cache by only inserting hpage into the page cache after it has been updated and marked uptodate. This is achieved by simply not replacing present pages with hpage when iterating over the target range. The present pages are already locked, so replacing the with the locked hpage before the collapse is finalized is unnecessary. This fixes a race where folio_seek_hole_data would mistake hpage for an fallocated but unwritten page. This race is visible to userspace via data temporarily disappearing from SEEK_DATA/SEEK_HOLE. Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: David Stevens <stevensd@chromium.org> Acked-by: Peter Xu <peterx@redhat.com> --- mm/khugepaged.c | 50 ++++++++++++------------------------------------- 1 file changed, 12 insertions(+), 38 deletions(-)
Comments
On Tue, 7 Mar 2023, David Stevens wrote: > From: David Stevens <stevensd@chromium.org> > > Make sure that collapse_file doesn't interfere with checking the > uptodate flag in the page cache by only inserting hpage into the page > cache after it has been updated and marked uptodate. This is achieved by > simply not replacing present pages with hpage when iterating over the > target range. The present pages are already locked, so replacing the > with the locked hpage before the collapse is finalized is unnecessary. > > This fixes a race where folio_seek_hole_data would mistake hpage for > an fallocated but unwritten page. This race is visible to userspace via > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > Signed-off-by: David Stevens <stevensd@chromium.org> > Acked-by: Peter Xu <peterx@redhat.com> NAK to this patch, I'm afraid: it deadlocks. What I know it to deadlock against, does not make the most persuasive argument: cgroup v1 deprecated memcg moving, where mc_handle_file_pte() uses filemap_get_incore_folio() while holding page table lock, and spins around doing "goto repeat" in filemap_get_entry() while folio_try_get_rcu() keeps failing because collapse_file()'s old page has been left in the xarray with its refcount frozen to 0. Meanwhile, collapse_file() is spinning to get that page table lock, to unmap pte of a later page. mincore()'s use of filemap_get_incore_folio() would be liable to hit the same deadlock. If we think for longer, we may find more examples. But even when not actually deadlocking, it's wasting lots of CPU on concurrent lookups (e.g. faults) spinning in filemap_get_entry(). I don't suppose it's entirely accurate, but think of keeping a folio refcount frozen to 0 as like holding a spinlock (and this lock sadly out of sight from lockdep). The pre-existing code works because the old page with refcount frozen to 0 is immediately replaced in the xarray by an entry for the new hpage, so the old page is no longer discoverable: and the new hpage is locked, not with a spinlock but the usual folio/page lock, on which concurrent lookups will sleep. Your discovery of the SEEK_DATA/SEEK_HOLE issue is important - thank you - but I believe collapse_file() should be left as is, and the fix made instead in mapping_seek_hole_data() or folio_seek_hole_data(): I believe that should not jump to assume that a !uptodate folio is a hole (as was reasonable to assume for shmem, before collapsing to huge got added), but should lock the folio if !uptodate, and check again after getting the lock - if still !uptodate, it's a shmem hole, not a transient race with collapse_file(). I was (pleased but) a little surprised when Matthew found in 5.12 that shmem_seek_hole_data() could be generalized to filemap_seek_hole_data(): he will have a surer grasp of what's safe or unsafe to assume of !uptodate in non-shmem folios. On an earlier audit, for different reasons, I did also run across lib/buildid.c build_id_parse() using find_get_page() without checking PageUptodate() - looks as if it might do the wrong thing if it races with khugepaged collapsing text to huge, and should probably have a similar fix. Hugh > --- > mm/khugepaged.c | 50 ++++++++++++------------------------------------- > 1 file changed, 12 insertions(+), 38 deletions(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 51ae399f2035..bdde0a02811b 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1930,12 +1930,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > } > } while (1); > > - /* > - * At this point the hpage is locked and not up-to-date. > - * It's safe to insert it into the page cache, because nobody would > - * be able to map it or use it in another way until we unlock it. > - */ > - > xas_set(&xas, start); > for (index = start; index < end; index++) { > page = xas_next(&xas); > @@ -2104,13 +2098,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > } > > /* > - * Add the page to the list to be able to undo the collapse if > - * something go wrong. > + * Accumulate the pages that are being collapsed. > */ > list_add_tail(&page->lru, &pagelist); > - > - /* Finally, replace with the new page. */ > - xas_store(&xas, hpage); > continue; > out_unlock: > unlock_page(page); > @@ -2149,8 +2139,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > goto rollback; > > /* > - * Replacing old pages with new one has succeeded, now we > - * attempt to copy the contents. > + * The old pages are locked, so they won't change anymore. > */ > index = start; > list_for_each_entry(page, &pagelist, lru) { > @@ -2230,11 +2219,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > /* nr_none is always 0 for non-shmem. */ > __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none); > } > - /* Join all the small entries into a single multi-index entry. */ > - xas_set_order(&xas, start, HPAGE_PMD_ORDER); > - xas_store(&xas, hpage); > - xas_unlock_irq(&xas); > > + /* > + * Mark hpage as uptodate before inserting it into the page cache so > + * that it isn't mistaken for an fallocated but unwritten page. > + */ > folio = page_folio(hpage); > folio_mark_uptodate(folio); > folio_ref_add(folio, HPAGE_PMD_NR - 1); > @@ -2243,6 +2232,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > folio_mark_dirty(folio); > folio_add_lru(folio); > > + /* Join all the small entries into a single multi-index entry. */ > + xas_set_order(&xas, start, HPAGE_PMD_ORDER); > + xas_store(&xas, hpage); > + xas_unlock_irq(&xas); > + > /* > * Remove pte page tables, so we can re-fault the page as huge. > */ > @@ -2267,36 +2261,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > rollback: > /* Something went wrong: roll back page cache changes */ > - xas_lock_irq(&xas); > if (nr_none) { > mapping->nrpages -= nr_none; > shmem_uncharge(mapping->host, nr_none); > } > > - xas_set(&xas, start); > - end = index; > - for (index = start; index < end; index++) { > - xas_next(&xas); > - page = list_first_entry_or_null(&pagelist, > - struct page, lru); > - if (!page || xas.xa_index < page->index) { > - nr_none--; > - continue; > - } > - > - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > - > + list_for_each_entry_safe(page, tmp, &pagelist, lru) { > /* Unfreeze the page. */ > list_del(&page->lru); > page_ref_unfreeze(page, 2); > - xas_store(&xas, page); > - xas_pause(&xas); > - xas_unlock_irq(&xas); > unlock_page(page); > putback_lru_page(page); > - xas_lock_irq(&xas); > } > - VM_BUG_ON(nr_none); > /* > * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only. > * This undo is not needed unless failure is due to SCAN_COPY_MC. > @@ -2304,8 +2280,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > if (!is_shmem && result == SCAN_COPY_MC) > filemap_nr_thps_dec(mapping); > > - xas_unlock_irq(&xas); > - > hpage->mapping = NULL; > > unlock_page(hpage); > -- > 2.40.0.rc0.216.gc4246ad0f0-goog
On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote: > On an earlier audit, for different reasons, I did also run across > lib/buildid.c build_id_parse() using find_get_page() without checking > PageUptodate() - looks as if it might do the wrong thing if it races > with khugepaged collapsing text to huge, and should probably have a > similar fix. That shouldn't be using find_get_page(). It should probably use read_cache_folio() which will actually read in the data if it's not present in the page cache, and return an ERR_PTR if the data couldn't be read.
On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote: > > On an earlier audit, for different reasons, I did also run across > > lib/buildid.c build_id_parse() using find_get_page() without checking > > PageUptodate() - looks as if it might do the wrong thing if it races > > with khugepaged collapsing text to huge, and should probably have a > > similar fix. > > That shouldn't be using find_get_page(). It should probably use > read_cache_folio() which will actually read in the data if it's not > present in the page cache, and return an ERR_PTR if the data couldn't > be read. build_id_parse() can be called from NMI, so I don't think we can let read_cache_folio() read-in the data. Thanks, Song
On Thu, 23 Mar 2023, Song Liu wrote: > On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote: > > > On an earlier audit, for different reasons, I did also run across > > > lib/buildid.c build_id_parse() using find_get_page() without checking > > > PageUptodate() - looks as if it might do the wrong thing if it races > > > with khugepaged collapsing text to huge, and should probably have a > > > similar fix. > > > > That shouldn't be using find_get_page(). It should probably use > > read_cache_folio() which will actually read in the data if it's not > > present in the page cache, and return an ERR_PTR if the data couldn't > > be read. > > build_id_parse() can be called from NMI, so I don't think we can let > read_cache_folio() read-in the data. Interesting. This being the same Layering_Violation_ID which is asking for a home in everyone's struct file? (Okay, I'm being disagreeable, no need to answer!) I think even the current find_get_page() is unsafe from NMI: imagine that NMI interrupting a sequence (maybe THP collapse or splitting, maybe page migration, maybe others) when the page refcount has been frozen to 0: you'll just have to reboot the machine? I guess the RCU-safety of find_get_page() implies that its XArray basics are safe in NMI; but you need a low-level variant (xas_find()?) which does none of the "goto retry"s, and fails immediately if anything is wrong - including !PageUptodate. Hugh
On Thu, Mar 23, 2023 at 06:56:34PM -0700, Hugh Dickins wrote: > On Thu, 23 Mar 2023, Song Liu wrote: > > On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote: > > > > On an earlier audit, for different reasons, I did also run across > > > > lib/buildid.c build_id_parse() using find_get_page() without checking > > > > PageUptodate() - looks as if it might do the wrong thing if it races > > > > with khugepaged collapsing text to huge, and should probably have a > > > > similar fix. > > > > > > That shouldn't be using find_get_page(). It should probably use > > > read_cache_folio() which will actually read in the data if it's not > > > present in the page cache, and return an ERR_PTR if the data couldn't > > > be read. > > > > build_id_parse() can be called from NMI, so I don't think we can let > > read_cache_folio() read-in the data. > > Interesting. > > This being the same Layering_Violation_ID which is asking for a home in > everyone's struct file? (Okay, I'm being disagreeable, no need to answer!) Yes, that's the one. Part of the BPF "splatter stuff all across the kernel that we don't really understand" (see, I can be just as disagreeable!) > I think even the current find_get_page() is unsafe from NMI: imagine that > NMI interrupting a sequence (maybe THP collapse or splitting, maybe page > migration, maybe others) when the page refcount has been frozen to 0: > you'll just have to reboot the machine? Correct; if it's been frozen by this CPU, we'll spin forever. I think the other conditions where we retry are temporary and caused by something _another_ CPU is doing. For example, if _this_ CPU is in the middle of modifying the tree when an NMI happens, we won't hit the xas_retry() case. That can only happen if we've observed something another CPU did, and then a second write happened from that same other CPU. The easiest example of that would be that we hit this kind of race: CPU 0 CPU 1 load root of tree, points to node store entry in root of tree wmb(); store RETRY entry in node load entry from node, observe RETRY entry The retry will happen and we'll observe the new state of the tree with the entry we're looking for in the root. If CPU 1 takes an NMI and interrupts itself, it will always see a consistent tree. > I guess the RCU-safety of find_get_page() implies that its XArray basics > are safe in NMI; but you need a low-level variant (xas_find()?) which > does none of the "goto retry"s, and fails immediately if anything is > wrong - including !PageUptodate. The Uptodate flag check needs to be done by the caller; the find_get_page() family return !uptodate pages. But find_get_page() does not advertise itself as NMI-safe. And I think it's wrong to try to make it NMI-safe. Most of the kernel is not NMI-safe. I think it's incumbent on the BPF people to get the information they need ahead of taking the NMI. NMI handlers are not supposed to be doing a huge amount of work! I don't really understand why it needs to do work in NMI context; surely it can note the location of the fault and queue work to be done later (eg on irq-enable, task-switch or return-to-user)
> On Mar 23, 2023, at 8:30 PM, Matthew Wilcox <willy@infradead.org> wrote: [...] > > The Uptodate flag check needs to be done by the caller; the > find_get_page() family return !uptodate pages. > > But find_get_page() does not advertise itself as NMI-safe. And I > think it's wrong to try to make it NMI-safe. Most of the kernel is > not NMI-safe. I think it's incumbent on the BPF people to get the > information they need ahead of taking the NMI. NMI handlers are not > supposed to be doing a huge amount of work! I don't really understand > why it needs to do work in NMI context; surely it can note the location of > the fault and queue work to be done later (eg on irq-enable, task-switch > or return-to-user) The use case here is a profiler (similar to perf-record). Parsing the build id in side the NMI makes the profiler a lot simpler. Otherwise, we will need some post processing for each sample. OTOH, it is totally fine if build_id_parse() fails some time, say < 5%. The profiler output is still useful in such cases. I guess the next step is to replace find_get_page() with a NMI-safe version? Thanks, Song
On Fri, Mar 24, 2023 at 06:03:37AM +0000, Song Liu wrote: > > > > On Mar 23, 2023, at 8:30 PM, Matthew Wilcox <willy@infradead.org> wrote: > > [...] > > > > > The Uptodate flag check needs to be done by the caller; the > > find_get_page() family return !uptodate pages. > > > > But find_get_page() does not advertise itself as NMI-safe. And I > > think it's wrong to try to make it NMI-safe. Most of the kernel is > > not NMI-safe. I think it's incumbent on the BPF people to get the > > information they need ahead of taking the NMI. NMI handlers are not > > supposed to be doing a huge amount of work! I don't really understand > > why it needs to do work in NMI context; surely it can note the location of > > the fault and queue work to be done later (eg on irq-enable, task-switch > > or return-to-user) > > The use case here is a profiler (similar to perf-record). Parsing the > build id in side the NMI makes the profiler a lot simpler. Otherwise, > we will need some post processing for each sample. Simpler for you, maybe. But this is an NMI! It's not supposed to be doing printf-formatting or whatever, much less poking around in the file cache. Like perf, it should record a sample and then convert that later. Maybe it can defer to a tasklet, but i think scheduling work is a better option. > OTOH, it is totally fine if build_id_parse() fails some time, say < 5%. > The profiler output is still useful in such cases. > > I guess the next step is to replace find_get_page() with a NMI-safe > version? No, absolutely not. Stop doing so much work in an NMI.
On Fri, Mar 24, 2023 at 4:08 AM Hugh Dickins <hughd@google.com> wrote: > > On Tue, 7 Mar 2023, David Stevens wrote: > > > From: David Stevens <stevensd@chromium.org> > > > > Make sure that collapse_file doesn't interfere with checking the > > uptodate flag in the page cache by only inserting hpage into the page > > cache after it has been updated and marked uptodate. This is achieved by > > simply not replacing present pages with hpage when iterating over the > > target range. The present pages are already locked, so replacing the > > with the locked hpage before the collapse is finalized is unnecessary. > > > > This fixes a race where folio_seek_hole_data would mistake hpage for > > an fallocated but unwritten page. This race is visible to userspace via > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Signed-off-by: David Stevens <stevensd@chromium.org> > > Acked-by: Peter Xu <peterx@redhat.com> > > NAK to this patch, I'm afraid: it deadlocks. > > What I know it to deadlock against, does not make the most persuasive > argument: cgroup v1 deprecated memcg moving, where mc_handle_file_pte() > uses filemap_get_incore_folio() while holding page table lock, and spins > around doing "goto repeat" in filemap_get_entry() while folio_try_get_rcu() > keeps failing because collapse_file()'s old page has been left in the > xarray with its refcount frozen to 0. Meanwhile, collapse_file() is > spinning to get that page table lock, to unmap pte of a later page. > > mincore()'s use of filemap_get_incore_folio() would be liable to hit > the same deadlock. If we think for longer, we may find more examples. > But even when not actually deadlocking, it's wasting lots of CPU on > concurrent lookups (e.g. faults) spinning in filemap_get_entry(). Ignoring my changes for now, these callers of filemap_get_incore_folio seem broken to some degree with respect to khugepaged. Mincore can show mlocked pages spuriously disappearing - this is pretty easy to reproduce with concurrent calls to MADV_COLLAPSE and mincore. As for the memcg code, I'm not sure how precise it is expected to be, but it seems likely that khugepaged can make task migration accounting less reliable (although I don't really understand the code). > I don't suppose it's entirely accurate, but think of keeping a folio > refcount frozen to 0 as like holding a spinlock (and this lock sadly out > of sight from lockdep). The pre-existing code works because the old page > with refcount frozen to 0 is immediately replaced in the xarray by an > entry for the new hpage, so the old page is no longer discoverable: > and the new hpage is locked, not with a spinlock but the usual > folio/page lock, on which concurrent lookups will sleep. Is it actually necessary to freeze the original pages? At least at a surface level, it seems that the arguments in 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page") would apply to the original pages as well. And if it is actually necessary to freeze the original pages, why is it not necessary to freeze the hugepage for the rollback case? Rolling back hugepage->original pages seems more-or-less symmetric to collapsing original pages->hugepage. > Your discovery of the SEEK_DATA/SEEK_HOLE issue is important - thank > you - but I believe collapse_file() should be left as is, and the fix > made instead in mapping_seek_hole_data() or folio_seek_hole_data(): > I believe that should not jump to assume that a !uptodate folio is a > hole (as was reasonable to assume for shmem, before collapsing to huge > got added), but should lock the folio if !uptodate, and check again > after getting the lock - if still !uptodate, it's a shmem hole, not > a transient race with collapse_file(). This sounds like it would work for lseek. I guess it could maybe be made to sort of work for mincore if we abort the walk, lock the page, restart the walk, and then re-validate the locked page. Although that's not exactly efficient. -David > I was (pleased but) a little surprised when Matthew found in 5.12 that > shmem_seek_hole_data() could be generalized to filemap_seek_hole_data(): > he will have a surer grasp of what's safe or unsafe to assume of > !uptodate in non-shmem folios. > > On an earlier audit, for different reasons, I did also run across > lib/buildid.c build_id_parse() using find_get_page() without checking > PageUptodate() - looks as if it might do the wrong thing if it races > with khugepaged collapsing text to huge, and should probably have a > similar fix. > > Hugh
> On Mar 24, 2023, at 6:31 AM, Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Mar 24, 2023 at 06:03:37AM +0000, Song Liu wrote: >> >> >>> On Mar 23, 2023, at 8:30 PM, Matthew Wilcox <willy@infradead.org> wrote: >> >> [...] >> >>> >>> The Uptodate flag check needs to be done by the caller; the >>> find_get_page() family return !uptodate pages. >>> >>> But find_get_page() does not advertise itself as NMI-safe. And I >>> think it's wrong to try to make it NMI-safe. Most of the kernel is >>> not NMI-safe. I think it's incumbent on the BPF people to get the >>> information they need ahead of taking the NMI. NMI handlers are not >>> supposed to be doing a huge amount of work! I don't really understand >>> why it needs to do work in NMI context; surely it can note the location of >>> the fault and queue work to be done later (eg on irq-enable, task-switch >>> or return-to-user) >> >> The use case here is a profiler (similar to perf-record). Parsing the >> build id in side the NMI makes the profiler a lot simpler. Otherwise, >> we will need some post processing for each sample. > > Simpler for you, maybe. But this is an NMI! It's not supposed to > be doing printf-formatting or whatever, much less poking around > in the file cache. Like perf, it should record a sample and then > convert that later. Maybe it can defer to a tasklet, but i think > scheduling work is a better option. > >> OTOH, it is totally fine if build_id_parse() fails some time, say < 5%. >> The profiler output is still useful in such cases. >> >> I guess the next step is to replace find_get_page() with a NMI-safe >> version? > > No, absolutely not. Stop doing so much work in an NMI. While I understand the concern, it is not something we can easily remove, as there are users rely on this feature. How about we discuss this at upcoming LSFMMBPF? Thanks, Song
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 51ae399f2035..bdde0a02811b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1930,12 +1930,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } } while (1); - /* - * At this point the hpage is locked and not up-to-date. - * It's safe to insert it into the page cache, because nobody would - * be able to map it or use it in another way until we unlock it. - */ - xas_set(&xas, start); for (index = start; index < end; index++) { page = xas_next(&xas); @@ -2104,13 +2098,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } /* - * Add the page to the list to be able to undo the collapse if - * something go wrong. + * Accumulate the pages that are being collapsed. */ list_add_tail(&page->lru, &pagelist); - - /* Finally, replace with the new page. */ - xas_store(&xas, hpage); continue; out_unlock: unlock_page(page); @@ -2149,8 +2139,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, goto rollback; /* - * Replacing old pages with new one has succeeded, now we - * attempt to copy the contents. + * The old pages are locked, so they won't change anymore. */ index = start; list_for_each_entry(page, &pagelist, lru) { @@ -2230,11 +2219,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, /* nr_none is always 0 for non-shmem. */ __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none); } - /* Join all the small entries into a single multi-index entry. */ - xas_set_order(&xas, start, HPAGE_PMD_ORDER); - xas_store(&xas, hpage); - xas_unlock_irq(&xas); + /* + * Mark hpage as uptodate before inserting it into the page cache so + * that it isn't mistaken for an fallocated but unwritten page. + */ folio = page_folio(hpage); folio_mark_uptodate(folio); folio_ref_add(folio, HPAGE_PMD_NR - 1); @@ -2243,6 +2232,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, folio_mark_dirty(folio); folio_add_lru(folio); + /* Join all the small entries into a single multi-index entry. */ + xas_set_order(&xas, start, HPAGE_PMD_ORDER); + xas_store(&xas, hpage); + xas_unlock_irq(&xas); + /* * Remove pte page tables, so we can re-fault the page as huge. */ @@ -2267,36 +2261,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, rollback: /* Something went wrong: roll back page cache changes */ - xas_lock_irq(&xas); if (nr_none) { mapping->nrpages -= nr_none; shmem_uncharge(mapping->host, nr_none); } - xas_set(&xas, start); - end = index; - for (index = start; index < end; index++) { - xas_next(&xas); - page = list_first_entry_or_null(&pagelist, - struct page, lru); - if (!page || xas.xa_index < page->index) { - nr_none--; - continue; - } - - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); - + list_for_each_entry_safe(page, tmp, &pagelist, lru) { /* Unfreeze the page. */ list_del(&page->lru); page_ref_unfreeze(page, 2); - xas_store(&xas, page); - xas_pause(&xas); - xas_unlock_irq(&xas); unlock_page(page); putback_lru_page(page); - xas_lock_irq(&xas); } - VM_BUG_ON(nr_none); /* * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only. * This undo is not needed unless failure is due to SCAN_COPY_MC. @@ -2304,8 +2280,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, if (!is_shmem && result == SCAN_COPY_MC) filemap_nr_thps_dec(mapping); - xas_unlock_irq(&xas); - hpage->mapping = NULL; unlock_page(hpage);