[19/46] hugetlb: add HGM support for follow_hugetlb_page

Message ID 20230105101844.1893104-20-jthoughton@google.com
State New
Series: Based on latest mm-unstable (85b44c25cd1e).

Commit Message

James Houghton Jan. 5, 2023, 10:18 a.m. UTC
  This enables high-granularity mapping support in GUP.

In case it is confusing: pfn_offset is vaddr's offset (in PAGE_SIZE
units) within the subpage that hpte points to.
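
As a worked example (hypothetical numbers, for illustration only):
suppose hpte maps a 2M subpage of a 1G hugepage and vaddr sits 20K
(0x5000) past the start of that subpage. With the masking the patch
does below:

	pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT
	           = 0x5000 >> PAGE_SHIFT	/* 4K pages */
	           = 5;

i.e. vaddr points at the sixth PAGE_SIZE page within the 2M subpage
that hpte maps.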

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 59 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 37 insertions(+), 22 deletions(-)
  

Comments

Peter Xu Jan. 5, 2023, 10:26 p.m. UTC | #1
On Thu, Jan 05, 2023 at 10:18:17AM +0000, James Houghton wrote:
> @@ -6649,13 +6655,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 */
>  		if (absent && (flags & FOLL_DUMP) &&
>  		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> -			if (pte)
> +			if (ptep)
>  				spin_unlock(ptl);
>  			hugetlb_vma_unlock_read(vma);
>  			remainder = 0;
>  			break;
>  		}
>  
> +		if (!absent && pte_present(pte) &&
> +				!hugetlb_pte_present_leaf(&hpte, pte)) {
> +			/* We raced with someone splitting the PTE, so retry. */
> +			spin_unlock(ptl);

vma unlock missing here.

> +			continue;
> +		}
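
For reference, a minimal sketch of this retry path with the missing
vma unlock added, mirroring the other early-exit paths in this
function, which drop the vma lock before continuing (this is a
suggested shape, not the committed fix):

	if (!absent && pte_present(pte) &&
			!hugetlb_pte_present_leaf(&hpte, pte)) {
		/* We raced with someone splitting the PTE, so retry. */
		spin_unlock(ptl);
		hugetlb_vma_unlock_read(vma);
		continue;
	}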
  
Peter Xu Jan. 12, 2023, 6:02 p.m. UTC | #2
On Thu, Jan 05, 2023 at 10:18:17AM +0000, James Houghton wrote:
> @@ -6731,22 +6746,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * and skip the same_page loop below.
>  		 */
>  		if (!pages && !vmas && !pfn_offset &&
> -		    (vaddr + huge_page_size(h) < vma->vm_end) &&
> -		    (remainder >= pages_per_huge_page(h))) {
> -			vaddr += huge_page_size(h);
> -			remainder -= pages_per_huge_page(h);
> -			i += pages_per_huge_page(h);
> +		    (vaddr + pages_per_hpte < vma->vm_end) &&
> +		    (remainder >= pages_per_hpte)) {
> +			vaddr += pages_per_hpte;

This silently breaks hugetlb GUP; it should be

                        vaddr += hugetlb_pte_size(&hpte);

It caused mysterious MISSING events when I was playing with this tree, and
I'm surprised the root cause turned out to be here.  So far the most
time-consuming one. :)

> +			remainder -= pages_per_hpte;
> +			i += pages_per_hpte;
>  			spin_unlock(ptl);
>  			hugetlb_vma_unlock_read(vma);
>  			continue;
>  		}
  
James Houghton Jan. 12, 2023, 6:06 p.m. UTC | #3
On Thu, Jan 12, 2023 at 1:02 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 05, 2023 at 10:18:17AM +0000, James Houghton wrote:
> > @@ -6731,22 +6746,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                * and skip the same_page loop below.
> >                */
> >               if (!pages && !vmas && !pfn_offset &&
> > -                 (vaddr + huge_page_size(h) < vma->vm_end) &&
> > -                 (remainder >= pages_per_huge_page(h))) {
> > -                     vaddr += huge_page_size(h);
> > -                     remainder -= pages_per_huge_page(h);
> > -                     i += pages_per_huge_page(h);
> > +                 (vaddr + pages_per_hpte < vma->vm_end) &&
> > +                 (remainder >= pages_per_hpte)) {
> > +                     vaddr += pages_per_hpte;
>
> This silently breaks hugetlb GUP; it should be
>
>                         vaddr += hugetlb_pte_size(&hpte);
>
> It caused mysterious MISSING events when I was playing with this tree, and
> I'm surprised the root cause turned out to be here.  So far the most
> time-consuming one. :)

Thanks Peter!! And the `vaddr + pages_per_hpte < vma->vm_end` should
be `vaddr + hugetlb_pte_size(&hpte) < vma->vm_end` too.
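
Putting both fixes together, the fast-path block would look something
like this (a sketch assembled from the two review comments, not the
committed fix):

	if (!pages && !vmas && !pfn_offset &&
	    (vaddr + hugetlb_pte_size(&hpte) < vma->vm_end) &&
	    (remainder >= pages_per_hpte)) {
		vaddr += hugetlb_pte_size(&hpte);
		remainder -= pages_per_hpte;
		i += pages_per_hpte;
		spin_unlock(ptl);
		hugetlb_vma_unlock_read(vma);
		continue;
	}

pages_per_hpte counts PAGE_SIZE pages (hugetlb_pte_size(&hpte) /
PAGE_SIZE in the patch below), so it is the right unit for remainder
and i, while vaddr advances in bytes and needs hugetlb_pte_size(&hpte).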
  

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73672d806172..30fea414d9ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6532,11 +6532,9 @@  static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
 }
 
 static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
-					       unsigned int flags, pte_t *pte,
+					       unsigned int flags, pte_t pteval,
 					       bool *unshare)
 {
-	pte_t pteval = huge_ptep_get(pte);
-
 	*unshare = false;
 	if (is_swap_pte(pteval))
 		return true;
@@ -6611,11 +6609,13 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int err = -EFAULT, refs;
 
 	while (vaddr < vma->vm_end && remainder) {
-		pte_t *pte;
+		pte_t *ptep, pte;
 		spinlock_t *ptl = NULL;
 		bool unshare = false;
 		int absent;
-		struct page *page;
+		unsigned long pages_per_hpte;
+		struct page *page, *subpage;
+		struct hugetlb_pte hpte;
 
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6632,13 +6632,19 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
 		 *
-		 * Note that page table lock is not held when pte is null.
+		 * hugetlb_full_walk will mask the address appropriately.
+		 *
+		 * Note that page table lock is not held when ptep is null.
 		 */
-		pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
-				   huge_page_size(h));
-		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
+		if (hugetlb_full_walk(&hpte, vma, vaddr)) {
+			ptep = NULL;
+			absent = true;
+		} else {
+			ptl = hugetlb_pte_lock(&hpte);
+			ptep = hpte.ptep;
+			pte = huge_ptep_get(ptep);
+			absent = huge_pte_none(pte);
+		}
 
 		/*
 		 * When coredumping, it suits get_dump_page if we just return
@@ -6649,13 +6655,20 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-			if (pte)
+			if (ptep)
 				spin_unlock(ptl);
 			hugetlb_vma_unlock_read(vma);
 			remainder = 0;
 			break;
 		}
 
+		if (!absent && pte_present(pte) &&
+				!hugetlb_pte_present_leaf(&hpte, pte)) {
+			/* We raced with someone splitting the PTE, so retry. */
+			spin_unlock(ptl);
+			continue;
+		}
+
 		/*
 		 * We need call hugetlb_fault for both hugepages under migration
 		 * (in which case hugetlb_fault waits for the migration,) and
@@ -6671,7 +6684,7 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vm_fault_t ret;
 			unsigned int fault_flags = 0;
 
-			if (pte)
+			if (ptep)
 				spin_unlock(ptl);
 			hugetlb_vma_unlock_read(vma);
 
@@ -6720,8 +6733,10 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
-		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-		page = pte_page(huge_ptep_get(pte));
+		pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+		subpage = pte_page(pte);
+		pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+		page = compound_head(subpage);
 
 		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			       !PageAnonExclusive(page), page);
@@ -6731,22 +6746,22 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * and skip the same_page loop below.
 		 */
 		if (!pages && !vmas && !pfn_offset &&
-		    (vaddr + huge_page_size(h) < vma->vm_end) &&
-		    (remainder >= pages_per_huge_page(h))) {
-			vaddr += huge_page_size(h);
-			remainder -= pages_per_huge_page(h);
-			i += pages_per_huge_page(h);
+		    (vaddr + pages_per_hpte < vma->vm_end) &&
+		    (remainder >= pages_per_hpte)) {
+			vaddr += pages_per_hpte;
+			remainder -= pages_per_hpte;
+			i += pages_per_hpte;
 			spin_unlock(ptl);
 			hugetlb_vma_unlock_read(vma);
 			continue;
 		}
 
 		/* vaddr may not be aligned to PAGE_SIZE */
-		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+		refs = min3(pages_per_hpte - pfn_offset, remainder,
 		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
 		if (pages || vmas)
-			record_subpages_vmas(nth_page(page, pfn_offset),
+			record_subpages_vmas(nth_page(subpage, pfn_offset),
 					     vma, refs,
 					     likely(pages) ? pages + i : NULL,
 					     vmas ? vmas + i : NULL);