Message ID | 20230922193639.10158-1-vishal.moola@gmail.com |
---|---|
Headers |
From: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Subject: [RFC PATCH 0/2] Remove compound_pagelist from khugepaged
Date: Fri, 22 Sep 2023 12:36:37 -0700
Message-Id: <20230922193639.10158-1-vishal.moola@gmail.com>
X-Mailer: git-send-email 2.40.1 |
Series | Remove compound_pagelist from khugepaged |
Message
Vishal Moola
Sept. 22, 2023, 7:36 p.m. UTC
Currently, khugepaged builds and uses a list to properly account for compound pages. Now we can use folios to account for these compound pages as singular large folio units instead. Removing compound_pagelist streamlines the scanning/freeing code in khugepaged and reduces the overall size of the kernel.

Vishal Moola (Oracle) (2):
  mm/khugepaged: Convert __collapse_huge_page_isolate() to use folios
  mm/khugepaged: Remove compound_pagelist

 mm/khugepaged.c | 116 ++++++++++++++++--------------------------------
 1 file changed, 38 insertions(+), 78 deletions(-)
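Stated as code, the idea in this cover letter is that khugepaged can account a large folio once and then step past the remaining PTEs that map it, rather than tracking already-seen compound pages on a list. The following is a minimal userspace sketch of that scan pattern, using toy pte/folio structures (not the kernel's types); it also makes explicit the assumption, discussed in the review comments below, that a large folio's subpages are mapped by consecutive PTEs.

```c
/* Minimal userspace model of the "skip the rest of a large folio" scan.
 * All types and helpers here are simplified stand-ins, not kernel APIs.
 */
#include <stdio.h>

struct folio { int id; int nr_pages; };              /* stand-in for struct folio */
struct pte   { struct folio *folio; int subpage; };  /* stand-in for a mapped PTE */

static int folio_nr_pages(const struct folio *f) { return f->nr_pages; }

int main(void)
{
	/* One 4-page large folio followed by two single-page folios,
	 * mapped consecutively -- the layout the series assumes. */
	struct folio large = { .id = 0, .nr_pages = 4 };
	struct folio a = { .id = 1, .nr_pages = 1 }, b = { .id = 2, .nr_pages = 1 };
	struct pte ptes[] = {
		{ &large, 0 }, { &large, 1 }, { &large, 2 }, { &large, 3 },
		{ &a, 0 }, { &b, 0 },
	};
	int n = sizeof(ptes) / sizeof(ptes[0]);

	for (int i = 0; i < n; i++) {
		struct folio *folio = ptes[i].folio;

		printf("account folio %d once (%d pages)\n",
		       folio->id, folio_nr_pages(folio));
		/* Skip the remaining PTEs of this folio instead of keeping
		 * a compound_pagelist to detect "already handled" folios. */
		i += folio_nr_pages(folio) - 1;
	}
	return 0;
}
```

The skip is only correct if every PTE of a large folio really is consecutive in the range; the review discussion below turns on exactly that assumption.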
Comments
On Fri, Sep 22, 2023 at 12:36:38PM -0700, Vishal Moola (Oracle) wrote:
> This is in preparation for the removal of the khugepaged compound_pagelist.
>
> Replaces 11 calls to compound_head() with 1, and removes 499 bytes of
> kernel text.

Looks good to me.

> @@ -634,32 +633,33 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                  * Isolate the page to avoid collapsing an hugepage
>                  * currently in use by the VM.
>                  */
> -               if (!isolate_lru_page(page)) {
> -                       unlock_page(page);
> +               if (!folio_isolate_lru(folio)) {
> +                       folio_unlock(folio);
>                         result = SCAN_DEL_PAGE_LRU;
>                         goto out;
>                 }
> -               mod_node_page_state(page_pgdat(page),
> -                               NR_ISOLATED_ANON + page_is_file_lru(page),
> -                               compound_nr(page));
> -               VM_BUG_ON_PAGE(!PageLocked(page), page);
> -               VM_BUG_ON_PAGE(PageLRU(page), page);
> +               node_stat_mod_folio(folio,
> +                               NR_ISOLATED_ANON + folio_is_file_lru(folio),
> +                               folio_nr_pages(folio));
> +               VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> +               VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> -               if (PageCompound(page))
> -                       list_add_tail(&page->lru, compound_pagelist);
> +               if (folio_test_large(folio))
> +                       list_add_tail(&folio->lru, compound_pagelist);
>  next:
>                 /*
>                  * If collapse was initiated by khugepaged, check that there is
>                  * enough young pte to justify collapsing the page
>                  */
>                 if (cc->is_khugepaged &&
> -                   (pte_young(pteval) || page_is_young(page) ||
> -                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> +                   (pte_young(pteval) || folio_test_young(folio) ||
> +                    folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>                                                                      address)))
>                         referenced++;
>
>                 if (pte_write(pteval))
>                         writable = true;
> +

Spurious change here. Other than that,

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
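For context on the text-size reduction mentioned in the patch description quoted above: page-based helpers such as PageLocked() and PageLRU() each resolve the head page of a compound page internally, whereas the folio conversion resolves it once up front and then works on the folio directly. The snippet below is a standalone toy model of that difference, with simplified stand-in structures rather than the real kernel definitions.

```c
/* Toy model of why folio conversion drops repeated compound_head() lookups.
 * struct page / struct folio and the helpers below are simplified stand-ins
 * for the kernel's, purely for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

struct page  { struct page *head; bool locked; bool lru; };
struct folio { struct page page; };  /* a folio is a head page */

static int head_lookups;

static struct page *compound_head(struct page *p)
{
	head_lookups++;              /* every page-based test pays this */
	return p->head;
}

/* Page API: each test resolves the head page again. */
static bool PageLocked(struct page *p) { return compound_head(p)->locked; }
static bool PageLRU(struct page *p)    { return compound_head(p)->lru; }

/* Folio API: the head is resolved once, then every test is direct. */
static struct folio *page_folio(struct page *p)
{
	return (struct folio *)compound_head(p);   /* one lookup, done once */
}
static bool folio_test_locked(struct folio *f) { return f->page.locked; }
static bool folio_test_lru(struct folio *f)    { return f->page.lru; }

int main(void)
{
	struct page pages[4];
	for (int i = 0; i < 4; i++)
		pages[i] = (struct page){ .head = &pages[0], .locked = true, .lru = false };

	struct page *tail = &pages[3];

	head_lookups = 0;
	bool ok = PageLocked(tail) && !PageLRU(tail);
	printf("page API:  ok=%d, %d head lookups\n", ok, head_lookups);   /* 2 */

	head_lookups = 0;
	struct folio *folio = page_folio(tail);
	ok = folio_test_locked(folio) && !folio_test_lru(folio);
	printf("folio API: ok=%d, %d head lookups\n", ok, head_lookups);   /* 1 */
	return 0;
}
```

In the real patch the same effect is what turns 11 compound_head() calls into 1 and shrinks the generated code.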
On Fri, Sep 22, 2023 at 9:33 PM Vishal Moola (Oracle) <vishal.moola@gmail.com> wrote:
>
> Currently, khugepaged builds a compound_pagelist while scanning, which
> is used to properly account for compound pages. We can now account
> for a compound page as a singular folio instead, so remove this list.
>
> Large folios are guaranteed to have consecutive ptes and addresses, so
> once the first pte of a large folio is found skip over the rest.

The address space may just map a partial folio, for example, in the
extreme case the HUGE_PMD size range may have HUGE_PMD_NR folios with
mapping one subpage from each folio per PTE. So assuming the PTE
mapped folio is mapped consecutively may be wrong.

Please refer to collapse_compound_extreme() in
tools/testing/selftests/mm/khugepaged.c.

>
> This helps convert khugepaged to use folios. It removes 3 compound_head
> calls in __collapse_huge_page_copy_succeeded(), and removes 980 bytes of
> kernel text.
>
> Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> ---
>  mm/khugepaged.c | 76 ++++++++++++-------------------------------
>  1 file changed, 18 insertions(+), 58 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f46a7a7c489f..b6c7d55a8231 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -498,10 +498,9 @@ static void release_pte_page(struct page *page)
>         release_pte_folio(page_folio(page));
>  }
>
> -static void release_pte_pages(pte_t *pte, pte_t *_pte,
> -               struct list_head *compound_pagelist)
> +static void release_pte_folios(pte_t *pte, pte_t *_pte)
>  {
> -       struct folio *folio, *tmp;
> +       struct folio *folio;
>
>         while (--_pte >= pte) {
>                 pte_t pteval = ptep_get(_pte);
> @@ -514,12 +513,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>                         continue;
>                 folio = pfn_folio(pfn);
>                 if (folio_test_large(folio))
> -                       continue;
> -               release_pte_folio(folio);
> -       }
> -
> -       list_for_each_entry_safe(folio, tmp, compound_pagelist, lru) {
> -               list_del(&folio->lru);
> +                       _pte -= folio_nr_pages(folio) - 1;
>                 release_pte_folio(folio);
>         }
>  }
> @@ -538,8 +532,7 @@ static bool is_refcount_suitable(struct page *page)
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                                         unsigned long address,
>                                         pte_t *pte,
> -                                       struct collapse_control *cc,
> -                                       struct list_head *compound_pagelist)
> +                                       struct collapse_control *cc)
>  {
>         struct folio *folio = NULL;
>         pte_t *_pte;
> @@ -588,19 +581,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         }
>                 }
>
> -               if (folio_test_large(folio)) {
> -                       struct folio *f;
> -
> -                       /*
> -                        * Check if we have dealt with the compound page
> -                        * already
> -                        */
> -                       list_for_each_entry(f, compound_pagelist, lru) {
> -                               if (folio == f)
> -                                       goto next;
> -                       }
> -               }
> -
>                 /*
>                  * We can do it before isolate_lru_page because the
>                  * page can't be freed from under us. NOTE: PG_lock
> @@ -644,9 +624,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>                 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> -               if (folio_test_large(folio))
> -                       list_add_tail(&folio->lru, compound_pagelist);
> -next:
>                 /*
>                  * If collapse was initiated by khugepaged, check that there is
>                  * enough young pte to justify collapsing the page
> @@ -660,6 +637,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 if (pte_write(pteval))
>                         writable = true;
>
> +               if (folio_test_large(folio)) {
> +                       _pte += folio_nr_pages(folio) - 1;
> +                       address += folio_size(folio) - PAGE_SIZE;
> +               }
>         }
>
>         if (unlikely(!writable)) {
> @@ -673,7 +654,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 return result;
>         }
>  out:
> -       release_pte_pages(pte, _pte, compound_pagelist);
> +       release_pte_folios(pte, _pte);
>         trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
>                                             referenced, writable, result);
>         return result;
> @@ -682,11 +663,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>                                                 struct vm_area_struct *vma,
>                                                 unsigned long address,
> -                                               spinlock_t *ptl,
> -                                               struct list_head *compound_pagelist)
> +                                               spinlock_t *ptl)
>  {
>         struct page *src_page;
> -       struct page *tmp;
>         pte_t *_pte;
>         pte_t pteval;
>
> @@ -706,8 +685,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>                         }
>                 } else {
>                         src_page = pte_page(pteval);
> -                       if (!PageCompound(src_page))
> -                               release_pte_page(src_page);
> +                       release_pte_page(src_page);
>                         /*
>                          * ptl mostly unnecessary, but preempt has to
>                          * be disabled to update the per-cpu stats
> @@ -720,23 +698,12 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>                         free_page_and_swap_cache(src_page);
>                 }
>         }
> -
> -       list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> -               list_del(&src_page->lru);
> -               mod_node_page_state(page_pgdat(src_page),
> -                               NR_ISOLATED_ANON + page_is_file_lru(src_page),
> -                               -compound_nr(src_page));
> -               unlock_page(src_page);
> -               free_swap_cache(src_page);
> -               putback_lru_page(src_page);
> -       }
>  }
>
>  static void __collapse_huge_page_copy_failed(pte_t *pte,
>                                              pmd_t *pmd,
>                                              pmd_t orig_pmd,
> -                                            struct vm_area_struct *vma,
> -                                            struct list_head *compound_pagelist)
> +                                            struct vm_area_struct *vma)
>  {
>         spinlock_t *pmd_ptl;
>
> @@ -753,7 +720,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>          * Release both raw and compound pages isolated
>          * in __collapse_huge_page_isolate.
>          */
> -       release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +       release_pte_folios(pte, pte + HPAGE_PMD_NR);
>  }
>
>  /*
> @@ -769,7 +736,6 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   * @vma: the original raw pages' virtual memory area
>   * @address: starting address to copy
>   * @ptl: lock on raw pages' PTEs
> - * @compound_pagelist: list that stores compound pages
>   */
>  static int __collapse_huge_page_copy(pte_t *pte,
>                                      struct page *page,
> @@ -777,8 +743,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
>                                      pmd_t orig_pmd,
>                                      struct vm_area_struct *vma,
>                                      unsigned long address,
> -                                    spinlock_t *ptl,
> -                                    struct list_head *compound_pagelist)
> +                                    spinlock_t *ptl)
>  {
>         struct page *src_page;
>         pte_t *_pte;
> @@ -804,11 +769,9 @@ static int __collapse_huge_page_copy(pte_t *pte,
>         }
>
>         if (likely(result == SCAN_SUCCEED))
> -               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -                                                   compound_pagelist);
> +               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl);
>         else
> -               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -                                                compound_pagelist);
> +               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma);
>
>         return result;
>  }
> @@ -1081,7 +1044,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                               int referenced, int unmapped,
>                               struct collapse_control *cc)
>  {
> -       LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
>         pte_t *pte;
>         pgtable_t pgtable;
> @@ -1168,8 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
>         pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>         if (pte) {
> -               result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -                                                     &compound_pagelist);
> +               result = __collapse_huge_page_isolate(vma, address, pte, cc);
>                 spin_unlock(pte_ptl);
>         } else {
>                 result = SCAN_PMD_NULL;
> @@ -1198,8 +1159,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         anon_vma_unlock_write(vma->anon_vma);
>
>         result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> -                                          vma, address, pte_ptl,
> -                                          &compound_pagelist);
> +                                          vma, address, pte_ptl);
>         pte_unmap(pte);
>         if (unlikely(result != SCAN_SUCCEED))
>                 goto out_up_write;
> --
> 2.40.1
>
On Tue, Sep 26, 2023 at 03:07:18PM -0700, Yang Shi wrote:
> On Fri, Sep 22, 2023 at 9:33 PM Vishal Moola (Oracle)
> <vishal.moola@gmail.com> wrote:
> >
> > Currently, khugepaged builds a compound_pagelist while scanning, which
> > is used to properly account for compound pages. We can now account
> > for a compound page as a singular folio instead, so remove this list.
> >
> > Large folios are guaranteed to have consecutive ptes and addresses, so
> > once the first pte of a large folio is found skip over the rest.
>
> The address space may just map a partial folio, for example, in the
> extreme case the HUGE_PMD size range may have HUGE_PMD_NR folios with
> mapping one subpage from each folio per PTE. So assuming the PTE
> mapped folio is mapped consecutively may be wrong.

How? You can do that with two VMAs, but this is limited to scanning
within a single VMA. If we've COWed a large folio, we currently do
so as a single page folio, and I'm not seeing any demand to change that.
If we did COW as a large folio, we'd COW every page in that folio.
How do we interleave two large folios in the same VMA?

> Please refer to collapse_compound_extreme() in
> tools/testing/selftests/mm/khugepaged.c.

I agree that running that part of the test-suite would be useful, but
could you point to which test specifically would create a problem here?
On Tue, Sep 26, 2023 at 3:07 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Sep 22, 2023 at 9:33 PM Vishal Moola (Oracle)
> <vishal.moola@gmail.com> wrote:
> >
> > Currently, khugepaged builds a compound_pagelist while scanning, which
> > is used to properly account for compound pages. We can now account
> > for a compound page as a singular folio instead, so remove this list.
> >
> > Large folios are guaranteed to have consecutive ptes and addresses, so
> > once the first pte of a large folio is found skip over the rest.
>
> The address space may just map a partial folio, for example, in the
> extreme case the HUGE_PMD size range may have HUGE_PMD_NR folios with
> mapping one subpage from each folio per PTE. So assuming the PTE
> mapped folio is mapped consecutively may be wrong.
>
> Please refer to collapse_compound_extreme() in
> tools/testing/selftests/mm/khugepaged.c.
> >
> > This helps convert khugepaged to use folios. It removes 3 compound_head
> > calls in __collapse_huge_page_copy_succeeded(), and removes 980 bytes of
> > kernel text.
> >
> > Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> > ---
> >  mm/khugepaged.c | 76 ++++++++++++-------------------------------
> >  1 file changed, 18 insertions(+), 58 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f46a7a7c489f..b6c7d55a8231 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -498,10 +498,9 @@ static void release_pte_page(struct page *page)
> >         release_pte_folio(page_folio(page));
> >  }
> >
> > -static void release_pte_pages(pte_t *pte, pte_t *_pte,
> > -               struct list_head *compound_pagelist)
> > +static void release_pte_folios(pte_t *pte, pte_t *_pte)
> >  {
> > -       struct folio *folio, *tmp;
> > +       struct folio *folio;
> >
> >         while (--_pte >= pte) {
> >                 pte_t pteval = ptep_get(_pte);
> > @@ -514,12 +513,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> >                         continue;
> >                 folio = pfn_folio(pfn);
> >                 if (folio_test_large(folio))
> > -                       continue;
> > -               release_pte_folio(folio);
> > -       }
> > -
> > -       list_for_each_entry_safe(folio, tmp, compound_pagelist, lru) {
> > -               list_del(&folio->lru);
> > +                       _pte -= folio_nr_pages(folio) - 1;
> >                 release_pte_folio(folio);
> >         }
> >  }
> > @@ -538,8 +532,7 @@ static bool is_refcount_suitable(struct page *page)
> >  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                                         unsigned long address,
> >                                         pte_t *pte,
> > -                                       struct collapse_control *cc,
> > -                                       struct list_head *compound_pagelist)
> > +                                       struct collapse_control *cc)
> >  {
> >         struct folio *folio = NULL;
> >         pte_t *_pte;
> > @@ -588,19 +581,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                         }
> >                 }
> >
> > -               if (folio_test_large(folio)) {
> > -                       struct folio *f;
> > -
> > -                       /*
> > -                        * Check if we have dealt with the compound page
> > -                        * already
> > -                        */
> > -                       list_for_each_entry(f, compound_pagelist, lru) {
> > -                               if (folio == f)
> > -                                       goto next;
> > -                       }
> > -               }
> > -
> >                 /*
> >                  * We can do it before isolate_lru_page because the
> >                  * page can't be freed from under us. NOTE: PG_lock
> > @@ -644,9 +624,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> >                 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> >
> > -               if (folio_test_large(folio))
> > -                       list_add_tail(&folio->lru, compound_pagelist);
> > -next:
> >                 /*
> >                  * If collapse was initiated by khugepaged, check that there is
> >                  * enough young pte to justify collapsing the page
> > @@ -660,6 +637,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                 if (pte_write(pteval))
> >                         writable = true;
> >
> > +               if (folio_test_large(folio)) {
> > +                       _pte += folio_nr_pages(folio) - 1;
> > +                       address += folio_size(folio) - PAGE_SIZE;
> > +               }
> >         }
> >
> >         if (unlikely(!writable)) {
> > @@ -673,7 +654,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                 return result;
> >         }
> >  out:
> > -       release_pte_pages(pte, _pte, compound_pagelist);
> > +       release_pte_folios(pte, _pte);
> >         trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
> >                                             referenced, writable, result);
> >         return result;
> > @@ -682,11 +663,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >                                                 struct vm_area_struct *vma,
> >                                                 unsigned long address,
> > -                                               spinlock_t *ptl,
> > -                                               struct list_head *compound_pagelist)
> > +                                               spinlock_t *ptl)
> >  {
> >         struct page *src_page;
> > -       struct page *tmp;
> >         pte_t *_pte;
> >         pte_t pteval;
> >
> > @@ -706,8 +685,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >                         }
> >                 } else {
> >                         src_page = pte_page(pteval);
> > -                       if (!PageCompound(src_page))
> > -                               release_pte_page(src_page);
> > +                       release_pte_page(src_page);

This line is problematic too. It may cause double unlock if I read it
correctly. The loop scans the mapped subpages from the same folio,
release_pte_page() is called for the same folio multiple times.

> >                         /*
> >                          * ptl mostly unnecessary, but preempt has to
> >                          * be disabled to update the per-cpu stats
> > @@ -720,23 +698,12 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >                         free_page_and_swap_cache(src_page);
> >                 }
> >         }
> > -
> > -       list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> > -               list_del(&src_page->lru);
> > -               mod_node_page_state(page_pgdat(src_page),
> > -                               NR_ISOLATED_ANON + page_is_file_lru(src_page),
> > -                               -compound_nr(src_page));
> > -               unlock_page(src_page);
> > -               free_swap_cache(src_page);
> > -               putback_lru_page(src_page);
> > -       }
> >  }
> >
> >  static void __collapse_huge_page_copy_failed(pte_t *pte,
> >                                              pmd_t *pmd,
> >                                              pmd_t orig_pmd,
> > -                                            struct vm_area_struct *vma,
> > -                                            struct list_head *compound_pagelist)
> > +                                            struct vm_area_struct *vma)
> >  {
> >         spinlock_t *pmd_ptl;
> >
> > @@ -753,7 +720,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >          * Release both raw and compound pages isolated
> >          * in __collapse_huge_page_isolate.
> >          */
> > -       release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> > +       release_pte_folios(pte, pte + HPAGE_PMD_NR);
> >  }
> >
> >  /*
> > @@ -769,7 +736,6 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >   * @vma: the original raw pages' virtual memory area
> >   * @address: starting address to copy
> >   * @ptl: lock on raw pages' PTEs
> > - * @compound_pagelist: list that stores compound pages
> >   */
> >  static int __collapse_huge_page_copy(pte_t *pte,
> >                                      struct page *page,
> > @@ -777,8 +743,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
> >                                      pmd_t orig_pmd,
> >                                      struct vm_area_struct *vma,
> >                                      unsigned long address,
> > -                                    spinlock_t *ptl,
> > -                                    struct list_head *compound_pagelist)
> > +                                    spinlock_t *ptl)
> >  {
> >         struct page *src_page;
> >         pte_t *_pte;
> > @@ -804,11 +769,9 @@ static int __collapse_huge_page_copy(pte_t *pte,
> >         }
> >
> >         if (likely(result == SCAN_SUCCEED))
> > -               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> > -                                                   compound_pagelist);
> > +               __collapse_huge_page_copy_succeeded(pte, vma, address, ptl);
> >         else
> > -               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> > -                                                compound_pagelist);
> > +               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma);
> >
> >         return result;
> >  }
> > @@ -1081,7 +1044,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                               int referenced, int unmapped,
> >                               struct collapse_control *cc)
> >  {
> > -       LIST_HEAD(compound_pagelist);
> >         pmd_t *pmd, _pmd;
> >         pte_t *pte;
> >         pgtable_t pgtable;
> > @@ -1168,8 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >
> >         pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> >         if (pte) {
> > -               result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                     &compound_pagelist);
> > +               result = __collapse_huge_page_isolate(vma, address, pte, cc);
> >                 spin_unlock(pte_ptl);
> >         } else {
> >                 result = SCAN_PMD_NULL;
> > @@ -1198,8 +1159,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         anon_vma_unlock_write(vma->anon_vma);
> >
> >         result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> > -                                          vma, address, pte_ptl,
> > -                                          &compound_pagelist);
> > +                                          vma, address, pte_ptl);
> >         pte_unmap(pte);
> >         if (unlikely(result != SCAN_SUCCEED))
> >                 goto out_up_write;
> > --
> > 2.40.1
> >
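To make the double-unlock concern above concrete: with the PageCompound() check removed, the copy-succeeded loop calls release_pte_page() once per PTE, so a large folio mapped by several PTEs in the range would be unlocked and put back once per mapping instead of once per folio. Below is a minimal standalone model of that pattern (toy structures, not kernel code); the second release trips the stand-in for the folio-not-locked assertion.

```c
/* Toy model of the double-unlock hazard: releasing a large folio once per
 * PTE that maps it, instead of once per folio. Structures and helpers are
 * simplified stand-ins for the kernel's, for illustration only.
 */
#include <assert.h>
#include <stdio.h>

struct folio { int id; int locked; };
struct pte   { struct folio *folio; };

static void folio_unlock(struct folio *f)
{
	/* Rough equivalent of the VM_BUG_ON hit when unlocking an unlocked folio. */
	assert(f->locked && "unlocking a folio that is not locked");
	f->locked = 0;
}

static void release_pte_folio(struct folio *f)
{
	folio_unlock(f);
	/* ...isolation counters and LRU putback would follow here... */
}

int main(void)
{
	struct folio large = { .id = 0, .locked = 1 };  /* isolated and locked once */
	struct pte ptes[4] = { { &large }, { &large }, { &large }, { &large } };

	for (int i = 0; i < 4; i++)
		release_pte_folio(ptes[i].folio);  /* second iteration asserts */

	printf("never reached\n");
	return 0;
}
```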
On Thu, Sep 28, 2023 at 12:33 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Sep 28, 2023 at 2:05 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Sep 26, 2023 at 03:07:18PM -0700, Yang Shi wrote:
> > > On Fri, Sep 22, 2023 at 9:33 PM Vishal Moola (Oracle)
> > > <vishal.moola@gmail.com> wrote:
> > > >
> > > > Currently, khugepaged builds a compound_pagelist while scanning, which
> > > > is used to properly account for compound pages. We can now account
> > > > for a compound page as a singular folio instead, so remove this list.
> > > >
> > > > Large folios are guaranteed to have consecutive ptes and addresses, so
> > > > once the first pte of a large folio is found skip over the rest.
> > >
> > > The address space may just map a partial folio, for example, in the
> > > extreme case the HUGE_PMD size range may have HUGE_PMD_NR folios with
> > > mapping one subpage from each folio per PTE. So assuming the PTE
> > > mapped folio is mapped consecutively may be wrong.
> >
> > How? You can do that with two VMAs, but this is limited to scanning
> > within a single VMA. If we've COWed a large folio, we currently do
> > so as a single page folio, and I'm not seeing any demand to change that.
> > If we did COW as a large folio, we'd COW every page in that folio.
> > How do we interleave two large folios in the same VMA?
>
> It is not about COW. The magic from mremap() may cause some corner
> cases. For example,
>
> We have a 2M VMA, every 4K of the VMA may be mapped to a subpage from
> different folios. Like:
>
> 0: #0 subpage of folio #0
> 1: #1 subpage of folio #1
> 2: #2 subpage of folio #2
> ....
> 511: #511 subpage of folio #511
>
> When khugepaged is scanning the VMA, it may just isolate and lock the
> folio #0, but skip all other folios since it assumes the VMA is just
> mapped by folio #0.
>
> This may trigger kernel bug when unlocking other folios which are
> actually not locked and maybe data corruption since the other folios
> may go away under us (unisolated, unlocked and unpinned).

Thanks for the review. I did not know this could happen; I'll drop
this patch for now until I can think of a better way to iterate
through ptes for large folios.
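For readers who want to see how a layout like the one Yang Shi describes can arise, the rough recipe, in the spirit of collapse_compound_extreme() in tools/testing/selftests/mm/khugepaged.c (the sketch below is not that test and is untested), is to fault in a separate THP for each slot and mremap() a single 4K subpage of it into the target range. Sizes, flags, and error handling here are simplified assumptions.

```c
/* Sketch: build a PMD-sized region where each PTE maps a subpage of a
 * different THP, roughly in the spirit of collapse_compound_extreme()
 * in tools/testing/selftests/mm/khugepaged.c. Illustrative only.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE  4096UL
#define HPAGE_SIZE (2UL * 1024 * 1024)
#define HPAGE_NR   (HPAGE_SIZE / PAGE_SIZE)

int main(void)
{
	/* Target range that khugepaged would later try to collapse. */
	char *target = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (target == MAP_FAILED) { perror("mmap target"); return 1; }

	for (unsigned long i = 0; i < HPAGE_NR; i++) {
		/* Fault in a separate (hopefully THP-backed) 2M region.
		 * The remainder of each source region stays mapped so the
		 * folio remains large. */
		char *src = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (src == MAP_FAILED) { perror("mmap src"); return 1; }
		madvise(src, HPAGE_SIZE, MADV_HUGEPAGE);
		memset(src, (int)i, HPAGE_SIZE);

		/* Move subpage #i of this THP into slot #i of the target,
		 * splitting the source PMD mapping into PTEs. */
		if (mremap(src + i * PAGE_SIZE, PAGE_SIZE, PAGE_SIZE,
			   MREMAP_MAYMOVE | MREMAP_FIXED,
			   target + i * PAGE_SIZE) == MAP_FAILED) {
			perror("mremap");
			return 1;
		}
	}

	puts("target range now interleaves subpages of many large folios");
	return 0;
}
```

Whether each source region is actually THP-backed depends on the system's THP settings; the point is only that a single VMA can end up with each PTE mapping a subpage of a different large folio, which is exactly what breaks the consecutive-PTE assumption this patch relied on.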