[03/10] mm/hugetlb: Document huge_pte_offset usage

Message ID 20221129193526.3588187-4-peterx@redhat.com
State New
Series [01/10] mm/hugetlb: Let vma_offset_start() to return start

Commit Message

Peter Xu Nov. 29, 2022, 7:35 p.m. UTC
huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as we hold
the mmap lock for either read or write, because that guarantees the
pgtable pages will stay valid during the walk.

But that's not true for hugetlbfs, especially for shared mappings:
hugetlbfs can have its pgtable freed by pmd unsharing.  This means that
even with the mmap lock held for the current mm, the PMD pgtable page can
still go away from under us if pmd unsharing is possible during the walk.

So we have two ways to make it safe even for a shared mapping:

  (1) If we're with the hugetlb vma lock held for either read/write, it's
      okay because pmd unshare cannot happen at all.

  (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
      okay because even if pmd unshare can happen, the pgtable page cannot
      be freed from under us.

Document it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
  

Comments

Mike Kravetz Nov. 30, 2022, 4:55 a.m. UTC | #1
On 11/29/22 14:35, Peter Xu wrote:
> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
> hugetlb address.
> 
> Normally, it's always safe to walk a generic pgtable as long as we're with
> the mmap lock held for either read or write, because that guarantees the
> pgtable pages will always be valid during the process.
> 
> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
> pgtable freed by pmd unsharing, it means that even with mmap lock held for
> current mm, the PMD pgtable page can still go away from under us if pmd
> unsharing is possible during the walk.
> 
> So we have two ways to make it safe even for a shared mapping:
> 
>   (1) If we're with the hugetlb vma lock held for either read/write, it's
>       okay because pmd unshare cannot happen at all.
> 
>   (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
>       okay because even if pmd unshare can happen, the pgtable page cannot
>       be freed from under us.
> 
> Document it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 551834cd5299..81efd9b9baa2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long addr, unsigned long sz);
> +/*
> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> + * Returns the pte_t* if found, or NULL if the address is not mapped.
> + *
> + * Since this function will walk all the pgtable pages (including not only
> + * high-level pgtable page, but also PUD entry that can be unshared
> + * concurrently for VM_SHARED), the caller of this function should be
> + * responsible for its thread safety.  One can follow this rule:
> + *
> + *  (1) For private mappings: pmd unsharing is not possible, so it'll
> + *      always be safe if we're with the mmap sem for either read or write.
> + *      This is normally always the case, IOW we don't need to do anything
> + *      special.
> + *
> + *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
> + *      pgtable page can go away from under us!  It can be done by a pmd
> + *      unshare with a follow up munmap() on the other process), then we
> + *      need either:
> + *
> + *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
> + *           won't happen upon the range (it also makes sure the pte_t we
> + *           read is the right and stable one), or,
> + *
> + *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
> + *           sure even if unshare happened the racy unmap() will wait until
> + *           i_mmap_rwsem is released.

Is that 100% correct?  IIUC, the page tables will be released via the
call to tlb_finish_mmu().  In most cases, the tlb_finish_mmu() call is
performed when holding i_mmap_rwsem.  However, in the final teardown of
a hugetlb vma via __unmap_hugepage_range_final, the tlb_finish_mmu call
is done outside the i_mmap_rwsem lock.  In this case, I think we are
still safe because nobody else should be walking the page table.

I really like the documentation.  However, if i_mmap_rwsem is not 100%
safe I would prefer not to document it here.  I don't think anyone
relies on this do they?
  
David Hildenbrand Nov. 30, 2022, 10:21 a.m. UTC | #2
On 29.11.22 20:35, Peter Xu wrote:
> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
> hugetlb address.
> 
> Normally, it's always safe to walk a generic pgtable as long as we're with
> the mmap lock held for either read or write, because that guarantees the
> pgtable pages will always be valid during the process.

With the addition that it's only safe to walk within VMA ranges while
holding the mmap lock in read mode. It's not safe to walk outside VMA
ranges.

But the point is that we're walking within a known hugetlbfs VMA, I 
assume, just adding it for completeness :)

> 
> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
> pgtable freed by pmd unsharing, it means that even with mmap lock held for
> current mm, the PMD pgtable page can still go away from under us if pmd
> unsharing is possible during the walk.
> 
> So we have two ways to make it safe even for a shared mapping:
> 
>    (1) If we're with the hugetlb vma lock held for either read/write, it's
>        okay because pmd unshare cannot happen at all.
> 
>    (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
>        okay because even if pmd unshare can happen, the pgtable page cannot
>        be freed from under us.
> 
> Document it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

In general, I like that documentation. Let's see if we can figure out 
what to do with the i_mmap_rwsem.
  
David Hildenbrand Nov. 30, 2022, 10:24 a.m. UTC | #3
On 29.11.22 20:35, Peter Xu wrote:
> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
> hugetlb address.
> 
> Normally, it's always safe to walk a generic pgtable as long as we're with
> the mmap lock held for either read or write, because that guarantees the
> pgtable pages will always be valid during the process.
> 
> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
> pgtable freed by pmd unsharing, it means that even with mmap lock held for
> current mm, the PMD pgtable page can still go away from under us if pmd
> unsharing is possible during the walk.
> 
> So we have two ways to make it safe even for a shared mapping:
> 
>    (1) If we're with the hugetlb vma lock held for either read/write, it's
>        okay because pmd unshare cannot happen at all.
> 
>    (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
>        okay because even if pmd unshare can happen, the pgtable page cannot
>        be freed from under us.
> 
> Document it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
>   1 file changed, 32 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 551834cd5299..81efd9b9baa2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
>   
>   pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   			unsigned long addr, unsigned long sz);
> +/*
> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> + * Returns the pte_t* if found, or NULL if the address is not mapped.
> + *
> + * Since this function will walk all the pgtable pages (including not only
> + * high-level pgtable page, but also PUD entry that can be unshared
> + * concurrently for VM_SHARED), the caller of this function should be
> + * responsible for its thread safety.  One can follow this rule:
> + *
> + *  (1) For private mappings: pmd unsharing is not possible, so it'll
> + *      always be safe if we're with the mmap sem for either read or write.
> + *      This is normally always the case, IOW we don't need to do anything
> + *      special.

Maybe worth mentioning that hugetlb_vma_lock_read() and friends already 
optimize for private mappings, to not take the VMA lock if not required.

Was happy to spot that optimization in there already :)
  
Peter Xu Nov. 30, 2022, 3:58 p.m. UTC | #4
Hi, Mike,

On Tue, Nov 29, 2022 at 08:55:21PM -0800, Mike Kravetz wrote:
> > + *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
> > + *      pgtable page can go away from under us!  It can be done by a pmd
> > + *      unshare with a follow up munmap() on the other process), then we
> > + *      need either:
> > + *
> > + *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
> > + *           won't happen upon the range (it also makes sure the pte_t we
> > + *           read is the right and stable one), or,
> > + *
> > + *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
> > + *           sure even if unshare happened the racy unmap() will wait until
> > + *           i_mmap_rwsem is released.
> 
> Is that 100% correct?  IIUC, the page tables will be released via the
> call to tlb_finish_mmu().  In most cases, the tlb_finish_mmu() call is
> performed when holding i_mmap_rwsem.  However, in the final teardown of
> a hugetlb vma via __unmap_hugepage_range_final, the tlb_finish_mmu call
> is done outside the i_mmap_rwsem lock.  In this case, I think we are
> still safe because nobody else should be walking the page table.
> 
> I really like the documentation.  However, if i_mmap_rwsem is not 100%
> safe I would prefer not to document it here.  I don't think anyone
> relies on this do they?

I think i_mmap_rwsem is 100% safe.

It's not in tlb_finish_mmu(), but when freeing the pgtables we need to
unlink current vma from the vma list first:

	free_pgtables
            unlink_file_vma
                i_mmap_lock_write
	tlb_finish_mmu

So it's not the same logic as how the RCU lock worked, but it's actually
better (even though with higher overhead) because vma unlink happens before
free_pgd_range(), so the pgtable locks are not freed yet (unlike RCU).

Thanks,
  
Peter Xu Nov. 30, 2022, 4:09 p.m. UTC | #5
On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote:
> On 29.11.22 20:35, Peter Xu wrote:
> > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
> > hugetlb address.
> > 
> > Normally, it's always safe to walk a generic pgtable as long as we're with
> > the mmap lock held for either read or write, because that guarantees the
> > pgtable pages will always be valid during the process.
> > 
> > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
> > pgtable freed by pmd unsharing, it means that even with mmap lock held for
> > current mm, the PMD pgtable page can still go away from under us if pmd
> > unsharing is possible during the walk.
> > 
> > So we have two ways to make it safe even for a shared mapping:
> > 
> >    (1) If we're with the hugetlb vma lock held for either read/write, it's
> >        okay because pmd unshare cannot happen at all.
> > 
> >    (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
> >        okay because even if pmd unshare can happen, the pgtable page cannot
> >        be freed from under us.
> > 
> > Document it.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
> >   1 file changed, 32 insertions(+)
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 551834cd5299..81efd9b9baa2 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
> >   pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> >   			unsigned long addr, unsigned long sz);
> > +/*
> > + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> > + * Returns the pte_t* if found, or NULL if the address is not mapped.
> > + *
> > + * Since this function will walk all the pgtable pages (including not only
> > + * high-level pgtable page, but also PUD entry that can be unshared
> > + * concurrently for VM_SHARED), the caller of this function should be
> > + * responsible for its thread safety.  One can follow this rule:
> > + *
> > + *  (1) For private mappings: pmd unsharing is not possible, so it'll
> > + *      always be safe if we're with the mmap sem for either read or write.
> > + *      This is normally always the case, IOW we don't need to do anything
> > + *      special.
> 
> Maybe worth mentioning that hugetlb_vma_lock_read() and friends already
> optimize for private mappings, to not take the VMA lock if not required.

Yes we can.  I assume this is not super urgent so I'll hold a while to see
whether there's anything else that needs amending for the documents.

Btw, even with hugetlb_vma_lock_read() checking SHARED, for a private-only
code path it's still better to not take the lock at all, because that still
contains a function jump which will be unnecessary.

Thanks,
  
David Hildenbrand Nov. 30, 2022, 4:11 p.m. UTC | #6
On 30.11.22 17:09, Peter Xu wrote:
> On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote:
>> On 29.11.22 20:35, Peter Xu wrote:
>>> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
>>> hugetlb address.
>>>
>>> Normally, it's always safe to walk a generic pgtable as long as we're with
>>> the mmap lock held for either read or write, because that guarantees the
>>> pgtable pages will always be valid during the process.
>>>
>>> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
>>> pgtable freed by pmd unsharing, it means that even with mmap lock held for
>>> current mm, the PMD pgtable page can still go away from under us if pmd
>>> unsharing is possible during the walk.
>>>
>>> So we have two ways to make it safe even for a shared mapping:
>>>
>>>     (1) If we're with the hugetlb vma lock held for either read/write, it's
>>>         okay because pmd unshare cannot happen at all.
>>>
>>>     (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
>>>         okay because even if pmd unshare can happen, the pgtable page cannot
>>>         be freed from under us.
>>>
>>> Document it.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
>>>    1 file changed, 32 insertions(+)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 551834cd5299..81efd9b9baa2 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
>>>    pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>>    			unsigned long addr, unsigned long sz);
>>> +/*
>>> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
>>> + * Returns the pte_t* if found, or NULL if the address is not mapped.
>>> + *
>>> + * Since this function will walk all the pgtable pages (including not only
>>> + * high-level pgtable page, but also PUD entry that can be unshared
>>> + * concurrently for VM_SHARED), the caller of this function should be
>>> + * responsible for its thread safety.  One can follow this rule:
>>> + *
>>> + *  (1) For private mappings: pmd unsharing is not possible, so it'll
>>> + *      always be safe if we're with the mmap sem for either read or write.
>>> + *      This is normally always the case, IOW we don't need to do anything
>>> + *      special.
>>
>> Maybe worth mentioning that hugetlb_vma_lock_read() and friends already
>> optimize for private mappings, to not take the VMA lock if not required.
> 
> Yes we can.  I assume this is not super urgent so I'll hold a while to see
> whether there's anything else that needs amending for the documents.
> 
> Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only
> code path it's still better to not take the lock at all, because that still
> contains a function jump which will be unnecessary.

IMHO it makes coding a lot more consistent and less error-prone when we
don't have to care about whether to take the lock or not (as an
optimization) and just have this handled "automatically".

Optimizing a jump out would rather smell like a micro-optimization.
  
Peter Xu Nov. 30, 2022, 4:25 p.m. UTC | #7
On Wed, Nov 30, 2022 at 05:11:36PM +0100, David Hildenbrand wrote:
> On 30.11.22 17:09, Peter Xu wrote:
> > On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote:
> > > On 29.11.22 20:35, Peter Xu wrote:
> > > > huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
> > > > hugetlb address.
> > > > 
> > > > Normally, it's always safe to walk a generic pgtable as long as we're with
> > > > the mmap lock held for either read or write, because that guarantees the
> > > > pgtable pages will always be valid during the process.
> > > > 
> > > > But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
> > > > pgtable freed by pmd unsharing, it means that even with mmap lock held for
> > > > current mm, the PMD pgtable page can still go away from under us if pmd
> > > > unsharing is possible during the walk.
> > > > 
> > > > So we have two ways to make it safe even for a shared mapping:
> > > > 
> > > >     (1) If we're with the hugetlb vma lock held for either read/write, it's
> > > >         okay because pmd unshare cannot happen at all.
> > > > 
> > > >     (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
> > > >         okay because even if pmd unshare can happen, the pgtable page cannot
> > > >         be freed from under us.
> > > > 
> > > > Document it.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >    include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
> > > >    1 file changed, 32 insertions(+)
> > > > 
> > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > > index 551834cd5299..81efd9b9baa2 100644
> > > > --- a/include/linux/hugetlb.h
> > > > +++ b/include/linux/hugetlb.h
> > > > @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
> > > >    pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> > > >    			unsigned long addr, unsigned long sz);
> > > > +/*
> > > > + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> > > > + * Returns the pte_t* if found, or NULL if the address is not mapped.
> > > > + *
> > > > + * Since this function will walk all the pgtable pages (including not only
> > > > + * high-level pgtable page, but also PUD entry that can be unshared
> > > > + * concurrently for VM_SHARED), the caller of this function should be
> > > > + * responsible for its thread safety.  One can follow this rule:
> > > > + *
> > > > + *  (1) For private mappings: pmd unsharing is not possible, so it'll
> > > > + *      always be safe if we're with the mmap sem for either read or write.
> > > > + *      This is normally always the case, IOW we don't need to do anything
> > > > + *      special.
> > > 
> > > Maybe worth mentioning that hugetlb_vma_lock_read() and friends already
> > > optimize for private mappings, to not take the VMA lock if not required.
> > 
> > Yes we can.  I assume this is not super urgent so I'll hold a while to see
> > whether there's anything else that needs amending for the documents.
> > 
> > Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only
> > code path it's still better to not take the lock at all, because that still
> > contains a function jump which will be unnecessary.
> 
> IMHO it makes coding a lot more consistent and less error-prone when we
> don't have to care about whether to take the lock or not (as an
> optimization) and just have this handled "automatically".
> 
> Optimizing a jump out would rather smell like a micro-optimization.

Or we can move the lock helpers into the headers, too.
  
David Hildenbrand Nov. 30, 2022, 4:31 p.m. UTC | #8
On 30.11.22 17:25, Peter Xu wrote:
> On Wed, Nov 30, 2022 at 05:11:36PM +0100, David Hildenbrand wrote:
>> On 30.11.22 17:09, Peter Xu wrote:
>>> On Wed, Nov 30, 2022 at 11:24:34AM +0100, David Hildenbrand wrote:
>>>> On 29.11.22 20:35, Peter Xu wrote:
>>>>> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
>>>>> hugetlb address.
>>>>>
>>>>> Normally, it's always safe to walk a generic pgtable as long as we're with
>>>>> the mmap lock held for either read or write, because that guarantees the
>>>>> pgtable pages will always be valid during the process.
>>>>>
>>>>> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
>>>>> pgtable freed by pmd unsharing, it means that even with mmap lock held for
>>>>> current mm, the PMD pgtable page can still go away from under us if pmd
>>>>> unsharing is possible during the walk.
>>>>>
>>>>> So we have two ways to make it safe even for a shared mapping:
>>>>>
>>>>>      (1) If we're with the hugetlb vma lock held for either read/write, it's
>>>>>          okay because pmd unshare cannot happen at all.
>>>>>
>>>>>      (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
>>>>>          okay because even if pmd unshare can happen, the pgtable page cannot
>>>>>          be freed from under us.
>>>>>
>>>>> Document it.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>>     include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 32 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>>>> index 551834cd5299..81efd9b9baa2 100644
>>>>> --- a/include/linux/hugetlb.h
>>>>> +++ b/include/linux/hugetlb.h
>>>>> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
>>>>>     pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>>     			unsigned long addr, unsigned long sz);
>>>>> +/*
>>>>> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
>>>>> + * Returns the pte_t* if found, or NULL if the address is not mapped.
>>>>> + *
>>>>> + * Since this function will walk all the pgtable pages (including not only
>>>>> + * high-level pgtable page, but also PUD entry that can be unshared
>>>>> + * concurrently for VM_SHARED), the caller of this function should be
>>>>> + * responsible for its thread safety.  One can follow this rule:
>>>>> + *
>>>>> + *  (1) For private mappings: pmd unsharing is not possible, so it'll
>>>>> + *      always be safe if we're with the mmap sem for either read or write.
>>>>> + *      This is normally always the case, IOW we don't need to do anything
>>>>> + *      special.
>>>>
>>>> Maybe worth mentioning that hugetlb_vma_lock_read() and friends already
>>>> optimize for private mappings, to not take the VMA lock if not required.
>>>
>>> Yes we can.  I assume this is not super urgent so I'll hold a while to see
>>> whether there's anything else that needs amending for the documents.
>>>
>>> Btw, even with hugetlb_vma_lock_read() checking SHARED for a private only
>>> code path it's still better to not take the lock at all, because that still
>>> contains a function jump which will be unnecessary.
>>
>> IMHO it makes coding a lot more consistent and less error-prone when we
>> don't have to care about whether to take the lock or not (as an
>> optimization) and just have this handled "automatically".
>>
>> Optimizing a jump out would rather smell like a micro-optimization.
> 
> Or we can move the lock helpers into the headers, too.

Ah, yes.
  
Mike Kravetz Dec. 5, 2022, 9:47 p.m. UTC | #9
On 11/30/22 10:58, Peter Xu wrote:
> Hi, Mike,
> 
> On Tue, Nov 29, 2022 at 08:55:21PM -0800, Mike Kravetz wrote:
> > > + *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
> > > + *      pgtable page can go away from under us!  It can be done by a pmd
> > > + *      unshare with a follow up munmap() on the other process), then we
> > > + *      need either:
> > > + *
> > > + *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
> > > + *           won't happen upon the range (it also makes sure the pte_t we
> > > + *           read is the right and stable one), or,
> > > + *
> > > + *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
> > > + *           sure even if unshare happened the racy unmap() will wait until
> > > + *           i_mmap_rwsem is released.
> > 
> > Is that 100% correct?  IIUC, the page tables will be released via the
> > call to tlb_finish_mmu().  In most cases, the tlb_finish_mmu() call is
> > performed when holding i_mmap_rwsem.  However, in the final teardown of
> > a hugetlb vma via __unmap_hugepage_range_final, the tlb_finish_mmu call
> > is done outside the i_mmap_rwsem lock.  In this case, I think we are
> > still safe because nobody else should be walking the page table.
> > 
> > I really like the documentation.  However, if i_mmap_rwsem is not 100%
> > safe I would prefer not to document it here.  I don't think anyone
> > relies on this do they?
> 
> I think i_mmap_rwsem is 100% safe.
> 
> It's not in tlb_finish_mmu(), but when freeing the pgtables we need to
> unlink current vma from the vma list first:
> 
> 	free_pgtables
>             unlink_file_vma
>                 i_mmap_lock_write
> 	tlb_finish_mmu

Thanks!

Sorry, I was thinking about page freeing not page table freeing.

Agree that is 100% safe.
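
The ordering discussed above (the vma is unlinked with i_mmap_rwsem held for write before the pgtable pages are freed) can be modelled in userspace. This is only a rough sketch, not kernel code: every model_* name is hypothetical, the rwsem is a single-threaded try-lock stand-in, and a false return marks the point where the real code would sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; names only mirror the call chain above. */
struct model_rwsem {
	int readers;   /* number of read holders */
	bool writer;   /* true while a writer holds the lock */
};

struct model_mapping {
	struct model_rwsem i_mmap_rwsem;
	bool vma_linked;      /* vma still linked into the file's mapping */
	bool pgtable_valid;   /* pgtable page not yet freed */
};

static bool model_down_read_trylock(struct model_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void model_up_read(struct model_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

static bool model_down_write_trylock(struct model_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void model_up_write(struct model_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}

/*
 * Models free_pgtables(): unlink under the write lock first, then free.
 * Returns false where the real code would sleep waiting for the lock.
 */
static bool model_free_pgtables(struct model_mapping *m)
{
	if (!model_down_write_trylock(&m->i_mmap_rwsem))
		return false;
	m->vma_linked = false;          /* models unlink_file_vma() */
	model_up_write(&m->i_mmap_rwsem);
	m->pgtable_valid = false;       /* models free_pgd_range() afterwards */
	return true;
}
```

In this model, a walker that holds the read side keeps pgtable_valid true for as long as it holds the lock, which is the property option (2.2) relies on.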
  

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..81efd9b9baa2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@  extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible for its thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so it'll
+ *      always be safe if we're with the mmap sem for either read or write.
+ *      This is normally always the case, IOW we don't need to do anything
+ *      special.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *      pgtable page can go away from under us!  It can be done by a pmd
+ *      unshare with a follow up munmap() on the other process), then we
+ *      need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ *           sure even if unshare happened the racy unmap() will wait until
+ *           i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest: it guarantees pte stability from the pmd
+ * sharing pov until the vma lock is released.  Option (2.2) doesn't protect
+ * against a concurrent pmd unshare, but it makes sure the pgtable page is
+ * safe to access.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
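
The caller-side rule in the comment above (lock, walk, use the result, unlock) might look roughly like the following userspace model. It is a sketch under stated assumptions rather than kernel code: all model_* names and the struct are hypothetical stand-ins, and the lock helper skips private mappings as described in the review for hugetlb_vma_lock_read().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel types. */
struct model_vma {
	bool shared;       /* models a VM_SHARED hugetlb mapping */
	int vma_readers;   /* models read holders of the hugetlb vma lock */
	int lock_calls;    /* how many times the lock was really taken */
	int pte_slot;      /* models the last-level entry for one address */
	bool mapped;       /* whether pte_slot is populated */
};

/* Models hugetlb_vma_lock_read(): a no-op for private mappings. */
static void model_vma_lock_read(struct model_vma *vma)
{
	if (!vma->shared)
		return;
	vma->vma_readers++;
	vma->lock_calls++;
}

static void model_vma_unlock_read(struct model_vma *vma)
{
	if (!vma->shared)
		return;
	assert(vma->vma_readers > 0);
	vma->vma_readers--;
}

/*
 * Models huge_pte_offset(): returns the slot, or NULL if not mapped.
 * For shared mappings the caller must hold the (modelled) vma lock.
 */
static int *model_huge_pte_offset(struct model_vma *vma)
{
	assert(!vma->shared || vma->vma_readers > 0);
	return vma->mapped ? &vma->pte_slot : NULL;
}

/* Caller pattern: lock, walk, use the result, unlock. */
static bool model_read_pte(struct model_vma *vma, int *out)
{
	bool found = false;
	int *ptep;

	model_vma_lock_read(vma);
	ptep = model_huge_pte_offset(vma);
	if (ptep) {
		*out = *ptep;   /* only dereference while still locked */
		found = true;
	}
	model_vma_unlock_read(vma);
	return found;
}
```

The point of the model is that the returned pointer is only dereferenced while the lock is still held; once the lock is dropped, the entry may no longer be stable for a shared mapping.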