[RFC,v2,02/12] mm/hugetlb: Move swap entry handling into vma lock for fault

Message ID 20221118011025.2178986-3-peterx@redhat.com
State New
Series mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Commit Message

Peter Xu Nov. 18, 2022, 1:10 a.m. UTC
In hugetlb_fault(), there used to be a special path at the entrance that
handled swap entries using huge_pte_offset().  That was unsafe, because
huge_pte_offset() on a pmd-sharable range can access freed pgtables when
neither the walker lock nor the vma lock is held.

The simplest way to make this safe is to move the swap handling to after
the vma lock is held.  We may now need to take the fault mutex for
migration or hwpoison entries as well (and also the vma lock, but that one
is needed regardless); however, neither of those is a hot path, so it
should be fine.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)
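
For context, the resulting ordering in hugetlb_fault() looks roughly like
the sketch below (simplified and heavily elided; the function and lock
names are the existing ones, but this is an illustration rather than the
exact code):

	/*
	 * Simplified sketch: both the fault mutex and the vma lock are taken
	 * before any swap entry is inspected, so huge_ptep_get() below cannot
	 * race with pmd unshare freeing the pgtable page.
	 */
	mapping = vma->vm_file->f_mapping;
	idx = vma_hugecache_offset(h, vma, haddr);
	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);	/* fault mutex */

	hugetlb_vma_lock_read(vma);			/* vma lock */
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
	entry = huge_ptep_get(ptep);

	if (!pte_present(entry)) {
		/* Swap entries are only handled with both locks held now. */
		if (unlikely(is_hugetlb_entry_migration(entry)))
			migration_entry_wait_huge(vma, ptep);
		else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
			ret = VM_FAULT_HWPOISON_LARGE |
			      VM_FAULT_SET_HINDEX(hstate_index(h));
		goto out_mutex;
	}
	/* ... the rest of the fault path ... */

out_mutex:
	hugetlb_vma_unlock_read(vma);
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	return ret;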
  

Comments

Peter Xu Nov. 18, 2022, 1:35 a.m. UTC | #1
On Thu, Nov 17, 2022 at 08:10:15PM -0500, Peter Xu wrote:
> In hugetlb_fault(), there used to be a special path at the entrance that
> handled swap entries using huge_pte_offset().  That was unsafe, because
> huge_pte_offset() on a pmd-sharable range can access freed pgtables when
> neither the walker lock nor the vma lock is held.
> 
> The simplest way to make this safe is to move the swap handling to after
> the vma lock is held.  We may now need to take the fault mutex for
> migration or hwpoison entries as well (and also the vma lock, but that one
> is needed regardless); however, neither of those is a hot path, so it
> should be fine.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 24 +++++++-----------------
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c3aab6d5b7aa..62ff3fc51d4e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5824,22 +5824,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int need_wait_lock = 0;
>  	unsigned long haddr = address & huge_page_mask(h);
>  
> -	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> -	if (ptep) {
> -		/*
> -		 * Since we hold no locks, ptep could be stale.  That is
> -		 * OK as we are only making decisions based on content and
> -		 * not actually modifying content here.
> -		 */
> -		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait_huge(vma, ptep);
> -			return 0;
> -		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> -			return VM_FAULT_HWPOISON_LARGE |
> -				VM_FAULT_SET_HINDEX(hstate_index(h));
> -	}
> -
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
> @@ -5886,8 +5870,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
>  	 * properly handle it.
>  	 */
> -	if (!pte_present(entry))
> +	if (!pte_present(entry)) {
> +		if (unlikely(is_hugetlb_entry_migration(entry)))
> +			migration_entry_wait_huge(vma, ptep);

Hmm no, that needs to release the vma lock and the fault mutex first...  So
now I remember why I had a note saying I need to rework the migration wait
code.

I'll try that in the next version: the idea is a callback that releases the
proper locks in migration_entry_wait_huge() right after the pgtable lock is
released, e.g. in migration_entry_wait_on_locked().
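
Something along these lines, as a rough sketch (the helper name and the new
arguments are hypothetical, just to show the shape of the callback):

	/*
	 * Hypothetical sketch only: the fault path passes down a callback
	 * that drops the locks it still holds, and the migration wait code
	 * invokes it right after the pgtable lock is released, before
	 * sleeping on the migration entry.
	 */
	static void hugetlb_fault_drop_locks(struct vm_area_struct *vma, u32 hash)
	{
		hugetlb_vma_unlock_read(vma);
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}

	/*
	 * migration_entry_wait_huge() would grow the extra arguments (or a
	 * small struct) and forward the callback down to
	 * migration_entry_wait_on_locked(), which calls it once the caller's
	 * locks are no longer needed:
	 */
	void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep,
				       void (*unlock)(struct vm_area_struct *, u32),
				       u32 hash);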

> +		else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> +			ret = VM_FAULT_HWPOISON_LARGE |
> +			    VM_FAULT_SET_HINDEX(hstate_index(h));
>  		goto out_mutex;
> +	}
>  
>  	/*
>  	 * If we are going to COW/unshare the mapping later, we examine the
> -- 
> 2.37.3
>
  

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3aab6d5b7aa..62ff3fc51d4e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5824,22 +5824,6 @@  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
 
-	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
-	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
-		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait_huge(vma, ptep);
-			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
-			return VM_FAULT_HWPOISON_LARGE |
-				VM_FAULT_SET_HINDEX(hstate_index(h));
-	}
-
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
@@ -5886,8 +5870,14 @@  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
 	 * properly handle it.
 	 */
-	if (!pte_present(entry))
+	if (!pte_present(entry)) {
+		if (unlikely(is_hugetlb_entry_migration(entry)))
+			migration_entry_wait_huge(vma, ptep);
+		else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+			ret = VM_FAULT_HWPOISON_LARGE |
+			    VM_FAULT_SET_HINDEX(hstate_index(h));
 		goto out_mutex;
+	}
 
 	/*
 	 * If we are going to COW/unshare the mapping later, we examine the