Message ID | 2e9996fa-d238-e7c-1194-834a2bd1f60@google.com |
---|---|
State | New |
Series | mm: free retracted page table by RCU |
Commit Message
Hugh Dickins
May 29, 2023, 6:25 a.m. UTC
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that. That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs. retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write(). In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault. But that does raise some
questions, and requires a more complicated pte_free_defer() for powerpc
(when its arch_needs_pgtable_deposit() for shmem and file THPs). Leave
that enhancement to a later release.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/khugepaged.c | 169 +++++++++++++++++-------------------------------
1 file changed, 60 insertions(+), 109 deletions(-)
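
Distilled from the diff (quoted in full in the replies below), the per-vma core of the new retract_page_tables() goes roughly as follows; this is a sketch only, with declarations, the vma interval tree loop and the unlocked anon_vma/alignment/uffd-wp checks omitted:

	pml = pmd_lock(mm, pmd);		/* pmd lock taken first */
	ptl = pte_lockptr(mm, pmd);
	if (ptl != pml)				/* pte lock nested inside, if split */
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* page table -> pmd_none, TLB flushed */
	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);

	mm_dec_nr_ptes(mm);
	page_table_check_pte_clear_range(mm, addr, pgt_pmd);
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));	/* freed only after an RCU grace period */

The point of pte_free_defer() here is that lockless walkers which have already mapped this page table (under rcu_read_lock() via pte_offset_map() or pte_offset_map_lock()) keep seeing a valid, empty page table until they drop their locks; only after that is it freed.
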
Comments
On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote: > Simplify shmem and file THP collapse's retract_page_tables(), and relax > its locking: to improve its success rate and to lessen impact on others. > > Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of > target_mm, leave that part of the work to madvise_collapse() calling > collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s > result code to arrange for that. That spares retract_page_tables() four > arguments; and since it will be successful in retracting all of the page > tables expected of it, no need to track and return a result code itself. > > It needs i_mmap_lock_read(mapping) for traversing the vma interval tree, > but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk() > allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for > THPs. retract_page_tables() just needs to use those same spinlocks to > exclude it briefly, while transitioning pmd from page table to none: so > restore its use of pmd_lock() inside of which pte lock is nested. > > Users of pte_offset_map_lock() etc all now allow for them to fail: > so retract_page_tables() now has no use for mmap_write_trylock() or > vma_try_start_write(). In common with rmap and page_vma_mapped_walk(), > it does not even need the mmap_read_lock(). > > But those users do expect the page table to remain a good page table, > until they unlock and rcu_read_unlock(): so the page table cannot be > freed immediately, but rather by the recently added pte_free_defer(). > > retract_page_tables() can be enhanced to replace_page_tables(), which > inserts the final huge pmd without mmap lock: going through an invalid > state instead of pmd_none() followed by fault. But that does raise some > questions, and requires a more complicated pte_free_defer() for powerpc > (when its arch_needs_pgtable_deposit() for shmem and file THPs). Leave > that enhancement to a later release. > > Signed-off-by: Hugh Dickins <hughd@google.com> > --- > mm/khugepaged.c | 169 +++++++++++++++++------------------------------- > 1 file changed, 60 insertions(+), 109 deletions(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 1083f0e38a07..4fd408154692 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > break; > case SCAN_PMD_NONE: > /* > - * In MADV_COLLAPSE path, possible race with khugepaged where > - * all pte entries have been removed and pmd cleared. If so, > - * skip all the pte checks and just update the pmd mapping. > + * All pte entries have been removed and pmd cleared. > + * Skip all the pte checks and just update the pmd mapping. 
> */ > goto maybe_install_pmd; > default: > @@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl > mmap_write_unlock(mm); > } > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > - struct mm_struct *target_mm, > - unsigned long target_addr, struct page *hpage, > - struct collapse_control *cc) > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > { > struct vm_area_struct *vma; > - int target_result = SCAN_FAIL; > > - i_mmap_lock_write(mapping); > + i_mmap_lock_read(mapping); > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > - int result = SCAN_FAIL; > - struct mm_struct *mm = NULL; > - unsigned long addr = 0; > - pmd_t *pmd; > - bool is_target = false; > + struct mm_struct *mm; > + unsigned long addr; > + pmd_t *pmd, pgt_pmd; > + spinlock_t *pml; > + spinlock_t *ptl; > > /* > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > - * got written to. These VMAs are likely not worth investing > - * mmap_write_lock(mm) as PMD-mapping is likely to be split > - * later. > + * got written to. These VMAs are likely not worth removing > + * page tables from, as PMD-mapping is likely to be split later. > * > - * Note that vma->anon_vma check is racy: it can be set up after > - * the check but before we took mmap_lock by the fault path. > - * But page lock would prevent establishing any new ptes of the > - * page, so we are safe. > - * > - * An alternative would be drop the check, but check that page > - * table is clear before calling pmdp_collapse_flush() under > - * ptl. It has higher chance to recover THP for the VMA, but > - * has higher cost too. It would also probably require locking > - * the anon_vma. > + * Note that vma->anon_vma check is racy: it can be set after > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > + * prevented establishing new ptes of the page. So we are safe > + * to remove page table below, without even checking it's empty. > */ > - if (READ_ONCE(vma->anon_vma)) { > - result = SCAN_PAGE_ANON; > - goto next; > - } > + if (READ_ONCE(vma->anon_vma)) > + continue; Not directly related to current patch, but I just realized there seems to have similar issue as what ab0c3f1251b4 wanted to fix. IIUC any shmem vma that used to have uprobe/bp installed will have anon_vma set here, then does it mean that any vma used to get debugged will never be able to merge into a thp (with either madvise or khugepaged)? I think it'll only make a difference when the page cache is not huge yet when bp was uninstalled, but then it becomes a thp candidate somehow. Even if so, I think the anon_vma should still be there. Did I miss something, or maybe that's not even a problem? > + > addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); > if (addr & ~HPAGE_PMD_MASK || > - vma->vm_end < addr + HPAGE_PMD_SIZE) { > - result = SCAN_VMA_CHECK; > - goto next; > - } > - mm = vma->vm_mm; > - is_target = mm == target_mm && addr == target_addr; > - result = find_pmd_or_thp_or_none(mm, addr, &pmd); > - if (result != SCAN_SUCCEED) > - goto next; > - /* > - * We need exclusive mmap_lock to retract page table. > - * > - * We use trylock due to lock inversion: we need to acquire > - * mmap_lock while holding page lock. Fault path does it in > - * reverse order. Trylock is a way to avoid deadlock. > - * > - * Also, it's not MADV_COLLAPSE's job to collapse other > - * mappings - let khugepaged take care of them later. 
> - */ > - result = SCAN_PTE_MAPPED_HUGEPAGE; > - if ((cc->is_khugepaged || is_target) && > - mmap_write_trylock(mm)) { > - /* trylock for the same lock inversion as above */ > - if (!vma_try_start_write(vma)) > - goto unlock_next; > - > - /* > - * Re-check whether we have an ->anon_vma, because > - * collapse_and_free_pmd() requires that either no > - * ->anon_vma exists or the anon_vma is locked. > - * We already checked ->anon_vma above, but that check > - * is racy because ->anon_vma can be populated under the > - * mmap lock in read mode. > - */ > - if (vma->anon_vma) { > - result = SCAN_PAGE_ANON; > - goto unlock_next; > - } > - /* > - * When a vma is registered with uffd-wp, we can't > - * recycle the pmd pgtable because there can be pte > - * markers installed. Skip it only, so the rest mm/vma > - * can still have the same file mapped hugely, however > - * it'll always mapped in small page size for uffd-wp > - * registered ranges. > - */ > - if (hpage_collapse_test_exit(mm)) { > - result = SCAN_ANY_PROCESS; > - goto unlock_next; > - } > - if (userfaultfd_wp(vma)) { > - result = SCAN_PTE_UFFD_WP; > - goto unlock_next; > - } > - collapse_and_free_pmd(mm, vma, addr, pmd); > - if (!cc->is_khugepaged && is_target) > - result = set_huge_pmd(vma, addr, pmd, hpage); > - else > - result = SCAN_SUCCEED; > - > -unlock_next: > - mmap_write_unlock(mm); > - goto next; > - } > - /* > - * Calling context will handle target mm/addr. Otherwise, let > - * khugepaged try again later. > - */ > - if (!is_target) { > - khugepaged_add_pte_mapped_thp(mm, addr); > + vma->vm_end < addr + HPAGE_PMD_SIZE) > continue; > - } > -next: > - if (is_target) > - target_result = result; > + > + mm = vma->vm_mm; > + if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) > + continue; > + > + if (hpage_collapse_test_exit(mm)) > + continue; > + /* > + * When a vma is registered with uffd-wp, we cannot recycle > + * the page table because there may be pte markers installed. > + * Other vmas can still have the same file mapped hugely, but > + * skip this one: it will always be mapped in small page size > + * for uffd-wp registered ranges. > + * > + * What if VM_UFFD_WP is set a moment after this check? No > + * problem, huge page lock is still held, stopping new mappings > + * of page which might then get replaced by pte markers: only > + * existing markers need to be protected here. (We could check > + * after getting ptl below, but this comment distracting there!) > + */ > + if (userfaultfd_wp(vma)) > + continue; IIUC here with the new code we only hold (1) hpage lock, and (2) i_mmap_lock read. Then could it possible that right after checking this and found !UFFD_WP, but then someone quickly (1) register uffd-wp on this vma, then UFFDIO_WRITEPROTECT to install some pte markers, before below pgtable locks taken? The thing is installation of pte markers may not need either of the locks iiuc.. Would taking mmap read lock help in this case? 
Thanks, > + > + /* Huge page lock is still held, so page table must be empty */ > + pml = pmd_lock(mm, pmd); > + ptl = pte_lockptr(mm, pmd); > + if (ptl != pml) > + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); > + pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); > + if (ptl != pml) > + spin_unlock(ptl); > + spin_unlock(pml); > + > + mm_dec_nr_ptes(mm); > + page_table_check_pte_clear_range(mm, addr, pgt_pmd); > + pte_free_defer(mm, pmd_pgtable(pgt_pmd)); > } > - i_mmap_unlock_write(mapping); > - return target_result; > + i_mmap_unlock_read(mapping); > } > > /** > @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > /* > * Remove pte page tables, so we can re-fault the page as huge. > + * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp(). > */ > - result = retract_page_tables(mapping, start, mm, addr, hpage, > - cc); > + retract_page_tables(mapping, start); > + if (cc && !cc->is_khugepaged) > + result = SCAN_PTE_MAPPED_HUGEPAGE; > unlock_page(hpage); > > /* > -- > 2.35.3 >
Thanks for looking, Peter: I was well aware of you dropping several hints that you wanted to see what's intended before passing judgment on earlier series, and I preferred to get on with showing this series, than go into detail in responses to you there - thanks for your patience :) On Mon, 29 May 2023, Peter Xu wrote: > On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote: ... > > @@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl > > mmap_write_unlock(mm); > > } > > > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > - struct mm_struct *target_mm, > > - unsigned long target_addr, struct page *hpage, > > - struct collapse_control *cc) > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > { > > struct vm_area_struct *vma; > > - int target_result = SCAN_FAIL; > > > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_read(mapping); > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > > - int result = SCAN_FAIL; > > - struct mm_struct *mm = NULL; > > - unsigned long addr = 0; > > - pmd_t *pmd; > > - bool is_target = false; > > + struct mm_struct *mm; > > + unsigned long addr; > > + pmd_t *pmd, pgt_pmd; > > + spinlock_t *pml; > > + spinlock_t *ptl; > > > > /* > > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > > - * got written to. These VMAs are likely not worth investing > > - * mmap_write_lock(mm) as PMD-mapping is likely to be split > > - * later. > > + * got written to. These VMAs are likely not worth removing > > + * page tables from, as PMD-mapping is likely to be split later. > > * > > - * Note that vma->anon_vma check is racy: it can be set up after > > - * the check but before we took mmap_lock by the fault path. > > - * But page lock would prevent establishing any new ptes of the > > - * page, so we are safe. > > - * > > - * An alternative would be drop the check, but check that page > > - * table is clear before calling pmdp_collapse_flush() under > > - * ptl. It has higher chance to recover THP for the VMA, but > > - * has higher cost too. It would also probably require locking > > - * the anon_vma. > > + * Note that vma->anon_vma check is racy: it can be set after > > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > > + * prevented establishing new ptes of the page. So we are safe > > + * to remove page table below, without even checking it's empty. > > */ > > - if (READ_ONCE(vma->anon_vma)) { > > - result = SCAN_PAGE_ANON; > > - goto next; > > - } > > + if (READ_ONCE(vma->anon_vma)) > > + continue; > > Not directly related to current patch, but I just realized there seems to > have similar issue as what ab0c3f1251b4 wanted to fix. > > IIUC any shmem vma that used to have uprobe/bp installed will have anon_vma > set here, then does it mean that any vma used to get debugged will never be > able to merge into a thp (with either madvise or khugepaged)? > > I think it'll only make a difference when the page cache is not huge yet > when bp was uninstalled, but then it becomes a thp candidate somehow. Even > if so, I think the anon_vma should still be there. > > Did I miss something, or maybe that's not even a problem? Finding vma->anon_vma set would discourage retract_page_tables() from doing its business with that previously uprobed area; but it does not stop collapse_pte_mapped_thp() (which uprobes unregister calls directly) from dealing with it, and MADV_COLLAPSE works on anon_vma'ed areas too. 
It's just a heuristic in retract_page_tables(), when it chooses to skip the anon_vma'ed areas as often not worth bothering with. As to vma merges: I haven't actually checked since the maple tree and other rewrites of vma merging, but previously one vma with anon_vma set could be merged with adjacent vma before or after without anon_vma set - the anon_vma comparison is not just equality of anon_vma, but allows NULL too - so the anon_vma will still be there, but extends to cover the wider extent. Right, I find is_mergeable_anon_vma() still following that rule. (And once vmas are merged, so that the whole of the huge page falls within a single vma, khugepaged can consider it, and do collapse_pte_mapped_thp() on it - before or after 11/12 I think.) As to whether it would even be a problem: generally no, the vma is supposed just to be an internal representation, and so long as the code resists proliferating them unnecessarily, occasional failures to merge should not matter. The one place that forever sticks in my mind as mattering (perhaps there are others I'm unaware of, but I'd call them bugs) is mremap(): which is sufficiently awkward and bug-prone already, that nobody ever had the courage to make it independent of vma boundaries; but ideally, it's mremap() that we should fix. But I may have written three answers, yet still missed your point. ... > > + > > + mm = vma->vm_mm; > > + if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) > > + continue; > > + > > + if (hpage_collapse_test_exit(mm)) > > + continue; > > + /* > > + * When a vma is registered with uffd-wp, we cannot recycle > > + * the page table because there may be pte markers installed. > > + * Other vmas can still have the same file mapped hugely, but > > + * skip this one: it will always be mapped in small page size > > + * for uffd-wp registered ranges. > > + * > > + * What if VM_UFFD_WP is set a moment after this check? No > > + * problem, huge page lock is still held, stopping new mappings > > + * of page which might then get replaced by pte markers: only > > + * existing markers need to be protected here. (We could check > > + * after getting ptl below, but this comment distracting there!) > > + */ > > + if (userfaultfd_wp(vma)) > > + continue; > > IIUC here with the new code we only hold (1) hpage lock, and (2) > i_mmap_lock read. Then could it possible that right after checking this > and found !UFFD_WP, but then someone quickly (1) register uffd-wp on this > vma, then UFFDIO_WRITEPROTECT to install some pte markers, before below > pgtable locks taken? > > The thing is installation of pte markers may not need either of the locks > iiuc.. > > Would taking mmap read lock help in this case? Isn't my comment above it a good enough answer? If I misunderstand the uffd-wp pte marker ("If"? certainly I don't understand it well enough, but I may or may not be too wrong about it here), and actually it can spring up in places where the page has not even been mapped yet, then I'd *much* rather just move that check down into the pte_locked area, than involve mmap read lock (which, though easier to acquire than its write lock, would I think take us back to square 1 in terms of needing trylock); but I did prefer not to have a big uffd-wp comment distracting from the code flow there. I expect now, that if I follow up UFFDIO_WRITEPROTECT, I shall indeed find it inserting pte markers where the page has not even been mapped yet. 
A "Yes" from you will save me looking, but probably I shall have to move that check down (oh well, the comment will be smaller there). Thanks, Hugh
On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote: > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > - struct mm_struct *target_mm, > - unsigned long target_addr, struct page *hpage, > - struct collapse_control *cc) > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > { > struct vm_area_struct *vma; > - int target_result = SCAN_FAIL; > > - i_mmap_lock_write(mapping); > + i_mmap_lock_read(mapping); > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > - int result = SCAN_FAIL; > - struct mm_struct *mm = NULL; > - unsigned long addr = 0; > - pmd_t *pmd; > - bool is_target = false; > + struct mm_struct *mm; > + unsigned long addr; > + pmd_t *pmd, pgt_pmd; > + spinlock_t *pml; > + spinlock_t *ptl; > > /* > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > - * got written to. These VMAs are likely not worth investing > - * mmap_write_lock(mm) as PMD-mapping is likely to be split > - * later. > + * got written to. These VMAs are likely not worth removing > + * page tables from, as PMD-mapping is likely to be split later. > * > - * Note that vma->anon_vma check is racy: it can be set up after > - * the check but before we took mmap_lock by the fault path. > - * But page lock would prevent establishing any new ptes of the > - * page, so we are safe. > - * > - * An alternative would be drop the check, but check that page > - * table is clear before calling pmdp_collapse_flush() under > - * ptl. It has higher chance to recover THP for the VMA, but > - * has higher cost too. It would also probably require locking > - * the anon_vma. > + * Note that vma->anon_vma check is racy: it can be set after > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > + * prevented establishing new ptes of the page. So we are safe > + * to remove page table below, without even checking it's empty. This "we are safe to remove page table below, without even checking it's empty" assumes that the only way to create new anonymous PTEs is to use existing file PTEs, right? What about private shmem VMAs that are registered with userfaultfd as VM_UFFD_MISSING? I think for those, the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without looking at the mapping and its pages (except for checking that the insertion point is before end-of-file), protected only by mmap_lock (shared) and pte_offset_map_lock(). > */ > - if (READ_ONCE(vma->anon_vma)) { > - result = SCAN_PAGE_ANON; > - goto next; > - } > + if (READ_ONCE(vma->anon_vma)) > + continue; > + > addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); > if (addr & ~HPAGE_PMD_MASK || > - vma->vm_end < addr + HPAGE_PMD_SIZE) { > - result = SCAN_VMA_CHECK; > - goto next; > - } > - mm = vma->vm_mm; > - is_target = mm == target_mm && addr == target_addr; > - result = find_pmd_or_thp_or_none(mm, addr, &pmd); > - if (result != SCAN_SUCCEED) > - goto next; > - /* > - * We need exclusive mmap_lock to retract page table. > - * > - * We use trylock due to lock inversion: we need to acquire > - * mmap_lock while holding page lock. Fault path does it in > - * reverse order. Trylock is a way to avoid deadlock. > - * > - * Also, it's not MADV_COLLAPSE's job to collapse other > - * mappings - let khugepaged take care of them later. 
> - */ > - result = SCAN_PTE_MAPPED_HUGEPAGE; > - if ((cc->is_khugepaged || is_target) && > - mmap_write_trylock(mm)) { > - /* trylock for the same lock inversion as above */ > - if (!vma_try_start_write(vma)) > - goto unlock_next; > - > - /* > - * Re-check whether we have an ->anon_vma, because > - * collapse_and_free_pmd() requires that either no > - * ->anon_vma exists or the anon_vma is locked. > - * We already checked ->anon_vma above, but that check > - * is racy because ->anon_vma can be populated under the > - * mmap lock in read mode. > - */ > - if (vma->anon_vma) { > - result = SCAN_PAGE_ANON; > - goto unlock_next; > - } > - /* > - * When a vma is registered with uffd-wp, we can't > - * recycle the pmd pgtable because there can be pte > - * markers installed. Skip it only, so the rest mm/vma > - * can still have the same file mapped hugely, however > - * it'll always mapped in small page size for uffd-wp > - * registered ranges. > - */ > - if (hpage_collapse_test_exit(mm)) { > - result = SCAN_ANY_PROCESS; > - goto unlock_next; > - } > - if (userfaultfd_wp(vma)) { > - result = SCAN_PTE_UFFD_WP; > - goto unlock_next; > - } > - collapse_and_free_pmd(mm, vma, addr, pmd); The old code called collapse_and_free_pmd(), which involves MMU notifier invocation... > - if (!cc->is_khugepaged && is_target) > - result = set_huge_pmd(vma, addr, pmd, hpage); > - else > - result = SCAN_SUCCEED; > - > -unlock_next: > - mmap_write_unlock(mm); > - goto next; > - } > - /* > - * Calling context will handle target mm/addr. Otherwise, let > - * khugepaged try again later. > - */ > - if (!is_target) { > - khugepaged_add_pte_mapped_thp(mm, addr); > + vma->vm_end < addr + HPAGE_PMD_SIZE) > continue; > - } > -next: > - if (is_target) > - target_result = result; > + > + mm = vma->vm_mm; > + if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) > + continue; > + > + if (hpage_collapse_test_exit(mm)) > + continue; > + /* > + * When a vma is registered with uffd-wp, we cannot recycle > + * the page table because there may be pte markers installed. > + * Other vmas can still have the same file mapped hugely, but > + * skip this one: it will always be mapped in small page size > + * for uffd-wp registered ranges. > + * > + * What if VM_UFFD_WP is set a moment after this check? No > + * problem, huge page lock is still held, stopping new mappings > + * of page which might then get replaced by pte markers: only > + * existing markers need to be protected here. (We could check > + * after getting ptl below, but this comment distracting there!) > + */ > + if (userfaultfd_wp(vma)) > + continue; > + > + /* Huge page lock is still held, so page table must be empty */ > + pml = pmd_lock(mm, pmd); > + ptl = pte_lockptr(mm, pmd); > + if (ptl != pml) > + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); > + pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); ... while the new code only does pmdp_collapse_flush(), which clears the pmd entry and does a TLB flush, but AFAICS doesn't use MMU notifiers. My understanding is that that's problematic - maybe (?) it is sort of okay with regards to classic MMU notifier users like KVM, but it's probably wrong for IOMMUv2 users, where an IOMMU directly consumes the normal page tables? (FWIW, last I looked, there also seemed to be some other issues with MMU notifier usage wrt IOMMUv2, see the thread <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.) 
> + if (ptl != pml) > + spin_unlock(ptl); > + spin_unlock(pml); > + > + mm_dec_nr_ptes(mm); > + page_table_check_pte_clear_range(mm, addr, pgt_pmd); > + pte_free_defer(mm, pmd_pgtable(pgt_pmd)); > } > - i_mmap_unlock_write(mapping); > - return target_result; > + i_mmap_unlock_read(mapping); > } > > /** > @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > /* > * Remove pte page tables, so we can re-fault the page as huge. > + * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp(). > */ > - result = retract_page_tables(mapping, start, mm, addr, hpage, > - cc); > + retract_page_tables(mapping, start); > + if (cc && !cc->is_khugepaged) > + result = SCAN_PTE_MAPPED_HUGEPAGE; > unlock_page(hpage); > > /* > -- > 2.35.3 >
On Tue, May 30, 2023 at 05:38:25PM -0700, Hugh Dickins wrote: > Thanks for looking, Peter: I was well aware of you dropping several hints > that you wanted to see what's intended before passing judgment on earlier > series, and I preferred to get on with showing this series, than go into > detail in responses to you there - thanks for your patience :) Not a problem at all here! > > On Mon, 29 May 2023, Peter Xu wrote: > > On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote: > ... > > > @@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl > > > mmap_write_unlock(mm); > > > } > > > > > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > > - struct mm_struct *target_mm, > > > - unsigned long target_addr, struct page *hpage, > > > - struct collapse_control *cc) > > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > > { > > > struct vm_area_struct *vma; > > > - int target_result = SCAN_FAIL; > > > > > > - i_mmap_lock_write(mapping); > > > + i_mmap_lock_read(mapping); > > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > > > - int result = SCAN_FAIL; > > > - struct mm_struct *mm = NULL; > > > - unsigned long addr = 0; > > > - pmd_t *pmd; > > > - bool is_target = false; > > > + struct mm_struct *mm; > > > + unsigned long addr; > > > + pmd_t *pmd, pgt_pmd; > > > + spinlock_t *pml; > > > + spinlock_t *ptl; > > > > > > /* > > > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > > > - * got written to. These VMAs are likely not worth investing > > > - * mmap_write_lock(mm) as PMD-mapping is likely to be split > > > - * later. > > > + * got written to. These VMAs are likely not worth removing > > > + * page tables from, as PMD-mapping is likely to be split later. > > > * > > > - * Note that vma->anon_vma check is racy: it can be set up after > > > - * the check but before we took mmap_lock by the fault path. > > > - * But page lock would prevent establishing any new ptes of the > > > - * page, so we are safe. > > > - * > > > - * An alternative would be drop the check, but check that page > > > - * table is clear before calling pmdp_collapse_flush() under > > > - * ptl. It has higher chance to recover THP for the VMA, but > > > - * has higher cost too. It would also probably require locking > > > - * the anon_vma. > > > + * Note that vma->anon_vma check is racy: it can be set after > > > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > > > + * prevented establishing new ptes of the page. So we are safe > > > + * to remove page table below, without even checking it's empty. > > > */ > > > - if (READ_ONCE(vma->anon_vma)) { > > > - result = SCAN_PAGE_ANON; > > > - goto next; > > > - } > > > + if (READ_ONCE(vma->anon_vma)) > > > + continue; > > > > Not directly related to current patch, but I just realized there seems to > > have similar issue as what ab0c3f1251b4 wanted to fix. > > > > IIUC any shmem vma that used to have uprobe/bp installed will have anon_vma > > set here, then does it mean that any vma used to get debugged will never be > > able to merge into a thp (with either madvise or khugepaged)? > > > > I think it'll only make a difference when the page cache is not huge yet > > when bp was uninstalled, but then it becomes a thp candidate somehow. Even > > if so, I think the anon_vma should still be there. > > > > Did I miss something, or maybe that's not even a problem? 
> > Finding vma->anon_vma set would discourage retract_page_tables() from > doing its business with that previously uprobed area; but it does not stop > collapse_pte_mapped_thp() (which uprobes unregister calls directly) from > dealing with it, This one is true to me. > and MADV_COLLAPSE works on anon_vma'ed areas too. It's just a heuristic > in retract_page_tables(), when it chooses to skip the anon_vma'ed areas > as often not worth bothering with. This is the one I'm unsure about. What I read (at least with current code base) is that both the khugepaged and madvise paths will rely on SCAN_PTE_MAPPED_HUGEPAGE returned (or to be returned) first, then only if so we will have the attempt to collapse: - For khugepaged, we'll add the candidate into pte_mapped_thp[] array only after we set "result=SCAN_PTE_MAPPED_HUGEPAGE": if (!is_target) { khugepaged_add_pte_mapped_thp(mm, addr); continue; } - For madvise, we fully rely on hpage_collapse_scan_file() retval to be SCAN_PTE_MAPPED_HUGEPAGE to trigger the collapse_pte_mapped_thp(). While the anon_vma check in retract_page_tables() is fairly early for each vma run, assuming the simplest case of 1 vma mapping IIUC it'll just never try to collapse such a vma? > > As to vma merges: I haven't actually checked since the maple tree and other > rewrites of vma merging, but previously one vma with anon_vma set could be > merged with adjacent vma before or after without anon_vma set - the > anon_vma comparison is not just equality of anon_vma, but allows NULL too - > so the anon_vma will still be there, but extends to cover the wider extent. > Right, I find is_mergeable_anon_vma() still following that rule. Yes. > > (And once vmas are merged, so that the whole of the huge page falls within > a single vma, khugepaged can consider it, and do collapse_pte_mapped_thp() > on it - before or after 11/12 I think.) > > As to whether it would even be a problem: generally no, the vma is supposed > just to be an internal representation, and so long as the code resists > proliferating them unnecessarily, occasional failures to merge should not > matter. The one place that forever sticks in my mind as mattering (perhaps > there are others I'm unaware of, but I'd call them bugs) is mremap(): which > is sufficiently awkward and bug-prone already, that nobody ever had the > courage to make it independent of vma boundaries; but ideally, it's > mremap() that we should fix. > > But I may have written three answers, yet still missed your point. Thanks for writing three answers. :) For me maybe what I worried is even simpler, please refer to above if that explains. Again, I don't think that's a problem specific to this series as it was there before or after, and I still keep thinking I could have just missed something. I'll also need to rethink too (by reading more carefully on the follow up ones) since I think this series changed some facts above, maybe it'll be different when whole set applied, which I'll do. > > ... > > > + > > > + mm = vma->vm_mm; > > > + if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) > > > + continue; > > > + > > > + if (hpage_collapse_test_exit(mm)) > > > + continue; > > > + /* > > > + * When a vma is registered with uffd-wp, we cannot recycle > > > + * the page table because there may be pte markers installed. > > > + * Other vmas can still have the same file mapped hugely, but > > > + * skip this one: it will always be mapped in small page size > > > + * for uffd-wp registered ranges. 
> > > + * > > > + * What if VM_UFFD_WP is set a moment after this check? No > > > + * problem, huge page lock is still held, stopping new mappings > > > + * of page which might then get replaced by pte markers: only > > > + * existing markers need to be protected here. (We could check > > > + * after getting ptl below, but this comment distracting there!) > > > + */ > > > + if (userfaultfd_wp(vma)) > > > + continue; > > > > IIUC here with the new code we only hold (1) hpage lock, and (2) > > i_mmap_lock read. Then could it possible that right after checking this > > and found !UFFD_WP, but then someone quickly (1) register uffd-wp on this > > vma, then UFFDIO_WRITEPROTECT to install some pte markers, before below > > pgtable locks taken? > > > > The thing is installation of pte markers may not need either of the locks > > iiuc.. > > > > Would taking mmap read lock help in this case? > > Isn't my comment above it a good enough answer? If I misunderstand the > uffd-wp pte marker ("If"? certainly I don't understand it well enough, > but I may or may not be too wrong about it here), and actually it can > spring up in places where the page has not even been mapped yet, then > I'd *much* rather just move that check down into the pte_locked area, > than involve mmap read lock (which, though easier to acquire than its > write lock, would I think take us back to square 1 in terms of needing > trylock); but I did prefer not to have a big uffd-wp comment distracting > from the code flow there. > > I expect now, that if I follow up UFFDIO_WRITEPROTECT, I shall indeed > find it inserting pte markers where the page has not even been mapped > yet. A "Yes" from you will save me looking, but probably I shall have > to move that check down (oh well, the comment will be smaller there). I think the answer is yes, as we need to be able to install markers just to avoid knowing the fact on "what's inside the page cache" for file mems (which IIRC originates from your suggestion even if in another format of swap encoding, but the idea should be similar). Moving it into the pgtable locked section looks fine here too for this specific issue. Thanks,
On Wed, May 31, 2023 at 05:34:58PM +0200, Jann Horn wrote: > On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote: > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > - struct mm_struct *target_mm, > > - unsigned long target_addr, struct page *hpage, > > - struct collapse_control *cc) > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > { > > struct vm_area_struct *vma; > > - int target_result = SCAN_FAIL; > > > > - i_mmap_lock_write(mapping); > > + i_mmap_lock_read(mapping); > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > > - int result = SCAN_FAIL; > > - struct mm_struct *mm = NULL; > > - unsigned long addr = 0; > > - pmd_t *pmd; > > - bool is_target = false; > > + struct mm_struct *mm; > > + unsigned long addr; > > + pmd_t *pmd, pgt_pmd; > > + spinlock_t *pml; > > + spinlock_t *ptl; > > > > /* > > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > > - * got written to. These VMAs are likely not worth investing > > - * mmap_write_lock(mm) as PMD-mapping is likely to be split > > - * later. > > + * got written to. These VMAs are likely not worth removing > > + * page tables from, as PMD-mapping is likely to be split later. > > * > > - * Note that vma->anon_vma check is racy: it can be set up after > > - * the check but before we took mmap_lock by the fault path. > > - * But page lock would prevent establishing any new ptes of the > > - * page, so we are safe. > > - * > > - * An alternative would be drop the check, but check that page > > - * table is clear before calling pmdp_collapse_flush() under > > - * ptl. It has higher chance to recover THP for the VMA, but > > - * has higher cost too. It would also probably require locking > > - * the anon_vma. > > + * Note that vma->anon_vma check is racy: it can be set after > > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > > + * prevented establishing new ptes of the page. So we are safe > > + * to remove page table below, without even checking it's empty. > > This "we are safe to remove page table below, without even checking > it's empty" assumes that the only way to create new anonymous PTEs is > to use existing file PTEs, right? What about private shmem VMAs that > are registered with userfaultfd as VM_UFFD_MISSING? I think for those, > the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without > looking at the mapping and its pages (except for checking that the > insertion point is before end-of-file), protected only by mmap_lock > (shared) and pte_offset_map_lock(). Hmm, yes. We probably need to keep that though, and 5b51072e97 explained the reason (to still respect file permissions). Maybe the anon_vma check can also be moved into the pgtable lock section, with some comments explaining (but it's getting a bit ugly..)? > > > > */ > > - if (READ_ONCE(vma->anon_vma)) { > > - result = SCAN_PAGE_ANON; > > - goto next; > > - } > > + if (READ_ONCE(vma->anon_vma)) > > + continue; > > + > > addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); > > if (addr & ~HPAGE_PMD_MASK || > > - vma->vm_end < addr + HPAGE_PMD_SIZE) { > > - result = SCAN_VMA_CHECK; > > - goto next; > > - } > > - mm = vma->vm_mm; > > - is_target = mm == target_mm && addr == target_addr; > > - result = find_pmd_or_thp_or_none(mm, addr, &pmd); > > - if (result != SCAN_SUCCEED) > > - goto next; > > - /* > > - * We need exclusive mmap_lock to retract page table. 
> > - * > > - * We use trylock due to lock inversion: we need to acquire > > - * mmap_lock while holding page lock. Fault path does it in > > - * reverse order. Trylock is a way to avoid deadlock. > > - * > > - * Also, it's not MADV_COLLAPSE's job to collapse other > > - * mappings - let khugepaged take care of them later. > > - */ > > - result = SCAN_PTE_MAPPED_HUGEPAGE; > > - if ((cc->is_khugepaged || is_target) && > > - mmap_write_trylock(mm)) { > > - /* trylock for the same lock inversion as above */ > > - if (!vma_try_start_write(vma)) > > - goto unlock_next; > > - > > - /* > > - * Re-check whether we have an ->anon_vma, because > > - * collapse_and_free_pmd() requires that either no > > - * ->anon_vma exists or the anon_vma is locked. > > - * We already checked ->anon_vma above, but that check > > - * is racy because ->anon_vma can be populated under the > > - * mmap lock in read mode. > > - */ > > - if (vma->anon_vma) { > > - result = SCAN_PAGE_ANON; > > - goto unlock_next; > > - } > > - /* > > - * When a vma is registered with uffd-wp, we can't > > - * recycle the pmd pgtable because there can be pte > > - * markers installed. Skip it only, so the rest mm/vma > > - * can still have the same file mapped hugely, however > > - * it'll always mapped in small page size for uffd-wp > > - * registered ranges. > > - */ > > - if (hpage_collapse_test_exit(mm)) { > > - result = SCAN_ANY_PROCESS; > > - goto unlock_next; > > - } > > - if (userfaultfd_wp(vma)) { > > - result = SCAN_PTE_UFFD_WP; > > - goto unlock_next; > > - } > > - collapse_and_free_pmd(mm, vma, addr, pmd); > > The old code called collapse_and_free_pmd(), which involves MMU > notifier invocation... > > > - if (!cc->is_khugepaged && is_target) > > - result = set_huge_pmd(vma, addr, pmd, hpage); > > - else > > - result = SCAN_SUCCEED; > > - > > -unlock_next: > > - mmap_write_unlock(mm); > > - goto next; > > - } > > - /* > > - * Calling context will handle target mm/addr. Otherwise, let > > - * khugepaged try again later. > > - */ > > - if (!is_target) { > > - khugepaged_add_pte_mapped_thp(mm, addr); > > + vma->vm_end < addr + HPAGE_PMD_SIZE) > > continue; > > - } > > -next: > > - if (is_target) > > - target_result = result; > > + > > + mm = vma->vm_mm; > > + if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) > > + continue; > > + > > + if (hpage_collapse_test_exit(mm)) > > + continue; > > + /* > > + * When a vma is registered with uffd-wp, we cannot recycle > > + * the page table because there may be pte markers installed. > > + * Other vmas can still have the same file mapped hugely, but > > + * skip this one: it will always be mapped in small page size > > + * for uffd-wp registered ranges. > > + * > > + * What if VM_UFFD_WP is set a moment after this check? No > > + * problem, huge page lock is still held, stopping new mappings > > + * of page which might then get replaced by pte markers: only > > + * existing markers need to be protected here. (We could check > > + * after getting ptl below, but this comment distracting there!) > > + */ > > + if (userfaultfd_wp(vma)) > > + continue; > > + > > + /* Huge page lock is still held, so page table must be empty */ > > + pml = pmd_lock(mm, pmd); > > + ptl = pte_lockptr(mm, pmd); > > + if (ptl != pml) > > + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); > > + pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); > > ... while the new code only does pmdp_collapse_flush(), which clears > the pmd entry and does a TLB flush, but AFAICS doesn't use MMU > notifiers. 
My understanding is that that's problematic - maybe (?) it > is sort of okay with regards to classic MMU notifier users like KVM, > but it's probably wrong for IOMMUv2 users, where an IOMMU directly > consumes the normal page tables? The iommuv2 wasn't "consuming" the pgtables? IIUC it relies on that to make sure no secondary (and illegal) tlb exists in the iommu tlbs. For this case if the pgtable _must_ be empty when reaching here (we'd better make sure of it..), maybe we're good? Because we should have just invalidated once when unmap all the pages in the thp range, so no existing tlb should generate anyway for either cpu or iommu hardwares. However OTOH, maybe it'll also be safer to just have the mmu notifiers like before (e.g., no idea whether anything can cache invalidate tlb translations from the empty pgtable)? As that doesn't seems to beat the purpose of the patchset as notifiers shouldn't fail. > > (FWIW, last I looked, there also seemed to be some other issues with > MMU notifier usage wrt IOMMUv2, see the thread > <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.) > > > > + if (ptl != pml) > > + spin_unlock(ptl); > > + spin_unlock(pml); > > + > > + mm_dec_nr_ptes(mm); > > + page_table_check_pte_clear_range(mm, addr, pgt_pmd); > > + pte_free_defer(mm, pmd_pgtable(pgt_pmd)); > > } > > - i_mmap_unlock_write(mapping); > > - return target_result; > > + i_mmap_unlock_read(mapping); > > } > > > > /** > > @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > > /* > > * Remove pte page tables, so we can re-fault the page as huge. > > + * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp(). > > */ > > - result = retract_page_tables(mapping, start, mm, addr, hpage, > > - cc); > > + retract_page_tables(mapping, start); > > + if (cc && !cc->is_khugepaged) > > + result = SCAN_PTE_MAPPED_HUGEPAGE; > > unlock_page(hpage); > > > > /* > > -- > > 2.35.3 > > >
On Wed, May 31, 2023 at 10:54 PM Peter Xu <peterx@redhat.com> wrote: > On Wed, May 31, 2023 at 05:34:58PM +0200, Jann Horn wrote: > > On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote: > > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > > - struct mm_struct *target_mm, > > > - unsigned long target_addr, struct page *hpage, > > > - struct collapse_control *cc) > > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > > { > > > struct vm_area_struct *vma; > > > - int target_result = SCAN_FAIL; > > > > > > - i_mmap_lock_write(mapping); > > > + i_mmap_lock_read(mapping); > > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > > > - int result = SCAN_FAIL; > > > - struct mm_struct *mm = NULL; > > > - unsigned long addr = 0; > > > - pmd_t *pmd; > > > - bool is_target = false; > > > + struct mm_struct *mm; > > > + unsigned long addr; > > > + pmd_t *pmd, pgt_pmd; > > > + spinlock_t *pml; > > > + spinlock_t *ptl; > > > > > > /* > > > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > > > - * got written to. These VMAs are likely not worth investing > > > - * mmap_write_lock(mm) as PMD-mapping is likely to be split > > > - * later. > > > + * got written to. These VMAs are likely not worth removing > > > + * page tables from, as PMD-mapping is likely to be split later. > > > * > > > - * Note that vma->anon_vma check is racy: it can be set up after > > > - * the check but before we took mmap_lock by the fault path. > > > - * But page lock would prevent establishing any new ptes of the > > > - * page, so we are safe. > > > - * > > > - * An alternative would be drop the check, but check that page > > > - * table is clear before calling pmdp_collapse_flush() under > > > - * ptl. It has higher chance to recover THP for the VMA, but > > > - * has higher cost too. It would also probably require locking > > > - * the anon_vma. > > > + * Note that vma->anon_vma check is racy: it can be set after > > > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > > > + * prevented establishing new ptes of the page. So we are safe > > > + * to remove page table below, without even checking it's empty. > > > > This "we are safe to remove page table below, without even checking > > it's empty" assumes that the only way to create new anonymous PTEs is > > to use existing file PTEs, right? What about private shmem VMAs that > > are registered with userfaultfd as VM_UFFD_MISSING? I think for those, > > the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without > > looking at the mapping and its pages (except for checking that the > > insertion point is before end-of-file), protected only by mmap_lock > > (shared) and pte_offset_map_lock(). > > Hmm, yes. We probably need to keep that though, and 5b51072e97 explained > the reason (to still respect file permissions). > > Maybe the anon_vma check can also be moved into the pgtable lock section, > with some comments explaining (but it's getting a bit ugly..)? Or check that all entries are pte_none() or something like that inside the pgtable-locked section? [...] > > The old code called collapse_and_free_pmd(), which involves MMU > > notifier invocation... [...] > > ... while the new code only does pmdp_collapse_flush(), which clears > > the pmd entry and does a TLB flush, but AFAICS doesn't use MMU > > notifiers. My understanding is that that's problematic - maybe (?) 
it > > is sort of okay with regards to classic MMU notifier users like KVM, > > but it's probably wrong for IOMMUv2 users, where an IOMMU directly > > consumes the normal page tables? > > The iommuv2 wasn't "consuming" the pgtables? My wording was confusing, I meant that as "the iommuv2 hardware directly uses/walks the page tables". > IIUC it relies on that to > make sure no secondary (and illegal) tlb exists in the iommu tlbs. > > For this case if the pgtable _must_ be empty when reaching here (we'd > better make sure of it..), maybe we're good? Because we should have just > invalidated once when unmap all the pages in the thp range, so no existing > tlb should generate anyway for either cpu or iommu hardwares. My headcanon is that there are approximately three reasons why we normally have to do iommuv2 invalidations and I think one or two of them might still apply here, though admittedly I haven't actually dug up documentation on how this stuff actually works for IOMMUv2, so maybe one of y'all can tell me that my concerns here are unfounded: 1. We have to flush normal TLB entries. This is probably not necessary if the page table contains no entries. 2. We might have to flush "paging-structure caches" / "intermediate table walk caches", if the IOMMU caches the physical addresses of page tables to skip some levels of page table walk. IDK if IOMMUs do that, but normal MMUs definitely do it, so I wouldn't be surprised if the IOMMUs did it too (or reserved the right to do it in a future hardware generation or whatever). 3. We have to *serialize* with page table walks performed by the IOMMU. We're doing an RCU barrier to synchronize against page table walks from the MMU, but without an appropriate mmu_notifier call, we have nothing to ensure that we aren't yanking a page table out from under an IOMMU page table walker while it's in the middle of its walk. Sure, this isn't very likely in practice, the IOMMU page table walker is probably pretty fast, but still we need some kind of explicit synchronization to make this robust, I think. > However OTOH, maybe it'll also be safer to just have the mmu notifiers like > before (e.g., no idea whether anything can cache invalidate tlb > translations from the empty pgtable)? As that doesn't seems to beat the > purpose of the patchset as notifiers shouldn't fail. > > > > > (FWIW, last I looked, there also seemed to be some other issues with > > MMU notifier usage wrt IOMMUv2, see the thread > > <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)
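
For concreteness, the "check that all entries are pte_none()" idea floated above could look something like the sketch below, run with pml (and ptl, if split) already held and before pmdp_collapse_flush(); this is purely illustrative, not part of the posted patch, and the unlock_skip label is hypothetical:

	pte_t *start_pte, *pte;
	int i;

	start_pte = pte_offset_map(pmd, addr);
	if (!start_pte)			/* not expected with the pmd lock held */
		goto unlock_skip;	/* hypothetical label: drop ptl/pml, continue */
	for (i = 0, pte = start_pte; i < HPAGE_PMD_NR; i++, pte++) {
		if (!pte_none(ptep_get(pte))) {
			pte_unmap(start_pte);
			goto unlock_skip;
		}
	}
	pte_unmap(start_pte);
	/* page table verified empty: safe to go on to pmdp_collapse_flush() */
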
On Thu, Jun 01, 2023 at 12:18:43AM +0200, Jann Horn wrote:
> 3. We have to *serialize* with page table walks performed by the
> IOMMU. We're doing an RCU barrier to synchronize against page table
> walks from the MMU, but without an appropriate mmu_notifier call, we
> have nothing to ensure that we aren't yanking a page table out from
> under an IOMMU page table walker while it's in the middle of its walk.
> Sure, this isn't very likely in practice, the IOMMU page table walker
> is probably pretty fast, but still we need some kind of explicit
> synchronization to make this robust, I think.

There is another thread talking about this.. Broadly we are saying that
we need to call mmu ops invalidate_range at any time the normal CPU TLB
would be invalidated. invalidate_range will not return until the iommu
HW is coherent with the current state of the page table.

Jason
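(To make that concrete, the kind of notifier bracketing being asked for would
look roughly like the sketch below, following the pattern of the old
collapse_and_free_pmd(). This is illustrative only, not the posted patch: the
helper name is invented, and the mmu_notifier_range_init() arguments shown are
the ~6.4-era signature, which differs on older kernels that still took a vma
argument.)

/*
 * Illustrative sketch only: clear and flush the (expected-empty) pmd
 * with an MMU notifier bracket around it, so secondary MMUs (KVM,
 * IOMMUv2-style IOMMUs) are made coherent before the page table is
 * freed.  Locals mirror those used in retract_page_tables().
 */
static pmd_t collapse_flush_with_notifier(struct mm_struct *mm,
                                          struct vm_area_struct *vma,
                                          unsigned long addr, pmd_t *pmd)
{
        struct mmu_notifier_range range;
        spinlock_t *pml, *ptl;
        pmd_t pgt_pmd;

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
                                addr, addr + HPAGE_PMD_SIZE);
        mmu_notifier_invalidate_range_start(&range);

        pml = pmd_lock(mm, pmd);
        ptl = pte_lockptr(mm, pmd);
        if (ptl != pml)
                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
        if (ptl != pml)
                spin_unlock(ptl);
        spin_unlock(pml);

        mmu_notifier_invalidate_range_end(&range);
        return pgt_pmd;
}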
On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote:
> > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
...
> > +                * Note that vma->anon_vma check is racy: it can be set after
> > +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > +                * prevented establishing new ptes of the page. So we are safe
> > +                * to remove page table below, without even checking it's empty.
> This "we are safe to remove page table below, without even checking
> it's empty" assumes that the only way to create new anonymous PTEs is
> to use existing file PTEs, right? What about private shmem VMAs that
> are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> looking at the mapping and its pages (except for checking that the
> insertion point is before end-of-file), protected only by mmap_lock
> (shared) and pte_offset_map_lock().

Right, from your comments and Peter's, thank you both, I can see that
userfaultfd breaks the usual assumptions here: so I'm putting an
if (unlikely(vma->anon_vma || userfaultfd_wp(vma)))
check in once we've got the ptlock; with a comment above it to point
the blame at uffd, though I gave up on describing all the detail.
And deleted this earlier "we are safe" paragraph.

You did suggest, in another mail, that perhaps there should be a scan
checking all pte_none() when we get the ptlock. I wasn't keen on yet
another debug scan for bugs and didn't add that, thinking I was going
to add a patch on the end to do so in page_table_check_pte_clear_range().
But when I came to write that patch, found that I'd been misled by its
name: it's about checking or adjusting some accounting, not really a
suitable place to check for pte_none() at all; so just scrapped it.

...
> > -               collapse_and_free_pmd(mm, vma, addr, pmd);
> The old code called collapse_and_free_pmd(), which involves MMU
> notifier invocation...
...
> > +               pml = pmd_lock(mm, pmd);
> > +               ptl = pte_lockptr(mm, pmd);
> > +               if (ptl != pml)
> > +                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> ... while the new code only does pmdp_collapse_flush(), which clears
> the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> notifiers. My understanding is that that's problematic - maybe (?) it
> is sort of okay with regards to classic MMU notifier users like KVM,
> but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> consumes the normal page tables?

Right, I intentionally left out the MMU notifier invocation, knowing
that we have already done an MMU notifier invocation when unmapping
any PTEs which were mapped: it was necessary for collapse_and_free_pmd()
in the collapse_pte_mapped_thp() case, but there was no notifier in this
case for many years, and I was glad to be rid of it.

However, I now see that you were adding it intentionally even for this
case in your f268f6cf875f; and from later comments in this thread, it
looks like there is still uncertainty about whether it is needed here,
but safer to assume that it is needed: I'll add it back.

>
> (FWIW, last I looked, there also seemed to be some other issues with
> MMU notifier usage wrt IOMMUv2, see the thread
> <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)
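(In other words, the plan described above moves the racy checks back under
the page table lock. A rough sketch of that shape is below; it is not the
exact code that was later posted, and the locals are those of the
retract_page_tables() loop body shown in the patch that follows.)

                pml = pmd_lock(mm, pmd);
                ptl = pte_lockptr(mm, pmd);
                if (ptl != pml)
                        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

                /*
                 * Sketch: uffd can install anonymous ptes (UFFDIO_COPY on
                 * a VM_UFFD_MISSING vma) or pte markers (uffd-wp) despite
                 * the huge page lock, so re-test now that the page table
                 * is locked, and leave this vma's page table alone if so.
                 */
                if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
                        if (ptl != pml)
                                spin_unlock(ptl);
                        spin_unlock(pml);
                        continue;       /* on to the next vma in i_mmap */
                }

                pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);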
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1083f0e38a07..4fd408154692 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
                break;
        case SCAN_PMD_NONE:
                /*
-                * In MADV_COLLAPSE path, possible race with khugepaged where
-                * all pte entries have been removed and pmd cleared. If so,
-                * skip all the pte checks and just update the pmd mapping.
+                * All pte entries have been removed and pmd cleared.
+                * Skip all the pte checks and just update the pmd mapping.
                 */
                goto maybe_install_pmd;
        default:
@@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
        mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-                              struct mm_struct *target_mm,
-                              unsigned long target_addr, struct page *hpage,
-                              struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
        struct vm_area_struct *vma;
-       int target_result = SCAN_FAIL;
 
-       i_mmap_lock_write(mapping);
+       i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-               int result = SCAN_FAIL;
-               struct mm_struct *mm = NULL;
-               unsigned long addr = 0;
-               pmd_t *pmd;
-               bool is_target = false;
+               struct mm_struct *mm;
+               unsigned long addr;
+               pmd_t *pmd, pgt_pmd;
+               spinlock_t *pml;
+               spinlock_t *ptl;
 
                /*
                 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-                * got written to. These VMAs are likely not worth investing
-                * mmap_write_lock(mm) as PMD-mapping is likely to be split
-                * later.
+                * got written to. These VMAs are likely not worth removing
+                * page tables from, as PMD-mapping is likely to be split later.
                 *
-                * Note that vma->anon_vma check is racy: it can be set up after
-                * the check but before we took mmap_lock by the fault path.
-                * But page lock would prevent establishing any new ptes of the
-                * page, so we are safe.
-                *
-                * An alternative would be drop the check, but check that page
-                * table is clear before calling pmdp_collapse_flush() under
-                * ptl. It has higher chance to recover THP for the VMA, but
-                * has higher cost too. It would also probably require locking
-                * the anon_vma.
+                * Note that vma->anon_vma check is racy: it can be set after
+                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
+                * prevented establishing new ptes of the page. So we are safe
+                * to remove page table below, without even checking it's empty.
                 */
-               if (READ_ONCE(vma->anon_vma)) {
-                       result = SCAN_PAGE_ANON;
-                       goto next;
-               }
+               if (READ_ONCE(vma->anon_vma))
+                       continue;
+
                addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                if (addr & ~HPAGE_PMD_MASK ||
-                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
-                       result = SCAN_VMA_CHECK;
-                       goto next;
-               }
-               mm = vma->vm_mm;
-               is_target = mm == target_mm && addr == target_addr;
-               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-               if (result != SCAN_SUCCEED)
-                       goto next;
-               /*
-                * We need exclusive mmap_lock to retract page table.
-                *
-                * We use trylock due to lock inversion: we need to acquire
-                * mmap_lock while holding page lock. Fault path does it in
-                * reverse order. Trylock is a way to avoid deadlock.
-                *
-                * Also, it's not MADV_COLLAPSE's job to collapse other
-                * mappings - let khugepaged take care of them later.
-                */
-               result = SCAN_PTE_MAPPED_HUGEPAGE;
-               if ((cc->is_khugepaged || is_target) &&
-                   mmap_write_trylock(mm)) {
-                       /* trylock for the same lock inversion as above */
-                       if (!vma_try_start_write(vma))
-                               goto unlock_next;
-
-                       /*
-                        * Re-check whether we have an ->anon_vma, because
-                        * collapse_and_free_pmd() requires that either no
-                        * ->anon_vma exists or the anon_vma is locked.
-                        * We already checked ->anon_vma above, but that check
-                        * is racy because ->anon_vma can be populated under the
-                        * mmap lock in read mode.
-                        */
-                       if (vma->anon_vma) {
-                               result = SCAN_PAGE_ANON;
-                               goto unlock_next;
-                       }
-                       /*
-                        * When a vma is registered with uffd-wp, we can't
-                        * recycle the pmd pgtable because there can be pte
-                        * markers installed. Skip it only, so the rest mm/vma
-                        * can still have the same file mapped hugely, however
-                        * it'll always mapped in small page size for uffd-wp
-                        * registered ranges.
-                        */
-                       if (hpage_collapse_test_exit(mm)) {
-                               result = SCAN_ANY_PROCESS;
-                               goto unlock_next;
-                       }
-                       if (userfaultfd_wp(vma)) {
-                               result = SCAN_PTE_UFFD_WP;
-                               goto unlock_next;
-                       }
-                       collapse_and_free_pmd(mm, vma, addr, pmd);
-                       if (!cc->is_khugepaged && is_target)
-                               result = set_huge_pmd(vma, addr, pmd, hpage);
-                       else
-                               result = SCAN_SUCCEED;
-
-unlock_next:
-                       mmap_write_unlock(mm);
-                       goto next;
-               }
-               /*
-                * Calling context will handle target mm/addr. Otherwise, let
-                * khugepaged try again later.
-                */
-               if (!is_target) {
-                       khugepaged_add_pte_mapped_thp(mm, addr);
+                   vma->vm_end < addr + HPAGE_PMD_SIZE)
                        continue;
-               }
-next:
-               if (is_target)
-                       target_result = result;
+
+               mm = vma->vm_mm;
+               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+                       continue;
+
+               if (hpage_collapse_test_exit(mm))
+                       continue;
+               /*
+                * When a vma is registered with uffd-wp, we cannot recycle
+                * the page table because there may be pte markers installed.
+                * Other vmas can still have the same file mapped hugely, but
+                * skip this one: it will always be mapped in small page size
+                * for uffd-wp registered ranges.
+                *
+                * What if VM_UFFD_WP is set a moment after this check? No
+                * problem, huge page lock is still held, stopping new mappings
+                * of page which might then get replaced by pte markers: only
+                * existing markers need to be protected here. (We could check
+                * after getting ptl below, but this comment distracting there!)
+                */
+               if (userfaultfd_wp(vma))
+                       continue;
+
+               /* Huge page lock is still held, so page table must be empty */
+               pml = pmd_lock(mm, pmd);
+               ptl = pte_lockptr(mm, pmd);
+               if (ptl != pml)
+                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+               if (ptl != pml)
+                       spin_unlock(ptl);
+               spin_unlock(pml);
+
+               mm_dec_nr_ptes(mm);
+               page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+               pte_free_defer(mm, pmd_pgtable(pgt_pmd));
        }
-       i_mmap_unlock_write(mapping);
-       return target_result;
+       i_mmap_unlock_read(mapping);
 }
 
 /**
@@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
        /*
         * Remove pte page tables, so we can re-fault the page as huge.
+        * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
         */
-       result = retract_page_tables(mapping, start, mm, addr, hpage,
-                                    cc);
+       retract_page_tables(mapping, start);
+       if (cc && !cc->is_khugepaged)
+               result = SCAN_PTE_MAPPED_HUGEPAGE;
        unlock_page(hpage);
 
        /*