[v2,10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

Message ID 20221207203158.651092-1-peterx@redhat.com
State New
Series [v2,01/10] mm/hugetlb: Let vma_offset_start() to return start

Commit Message

Peter Xu Dec. 7, 2022, 8:31 p.m. UTC
  Taking vma lock here is not needed for now because all potential hugetlb
walkers here should have i_mmap_rwsem held.  Document the fact.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/page_vma_mapped.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
  

Comments

John Hubbard Dec. 8, 2022, 12:16 a.m. UTC | #1
On 12/7/22 12:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held.  Document the fact.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/page_vma_mapped.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   		/* The only possible mapping was handled on last iteration */
>   		if (pvmw->pte)
>   			return not_found(pvmw);
> -
> -		/* when pud is not present, pte will be NULL */
> +		/*
> +		 * NOTE: we don't need explicit lock here to walk the
> +		 * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> +		 * When one day this rule breaks, one will get a warning
> +		 * in hugetlb_walk(), and then we'll figure out what to do.
> +		 */

Confused. Is this documentation actually intended to refer to hugetlb_walk()
itself, or just this call site? If the former, then let's move it over
to be right before hugetlb_walk().

>   		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>   		if (!pvmw->pte)
>   			return false;

thanks,
  
David Hildenbrand Dec. 8, 2022, 1:16 p.m. UTC | #2
On 07.12.22 21:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held.  Document the fact.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/page_vma_mapped.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   		/* The only possible mapping was handled on last iteration */
>   		if (pvmw->pte)
>   			return not_found(pvmw);
> -
> -		/* when pud is not present, pte will be NULL */
> +		/*
> +		 * NOTE: we don't need explicit lock here to walk the
> +		 * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> +		 * When one day this rule breaks, one will get a warning
> +		 * in hugetlb_walk(), and then we'll figure out what to do.
> +		 */
>   		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>   		if (!pvmw->pte)
>   			return false;

Would it make sense to squash that into the previous commit?
  
Peter Xu Dec. 8, 2022, 9:05 p.m. UTC | #3
On Wed, Dec 07, 2022 at 04:16:11PM -0800, John Hubbard wrote:
> On 12/7/22 12:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held.  Document the fact.
> > 
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/page_vma_mapped.c | 10 ++++++++--
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >   		/* The only possible mapping was handled on last iteration */
> >   		if (pvmw->pte)
> >   			return not_found(pvmw);
> > -
> > -		/* when pud is not present, pte will be NULL */
> > +		/*
> > +		 * NOTE: we don't need explicit lock here to walk the
> > +		 * hugetlb pgtable because either (1) potential callers of
> > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > +		 * When one day this rule breaks, one will get a warning
> > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > +		 */
> 
> Confused. Is this documentation actually intended to refer to hugetlb_walk()
> itself, or just this call site? If the former, then let's move it over
> to be right before hugetlb_walk().

It is for this specific code path, not hugetlb_walk().

The "holds i_mmap_rwsem" here is a true statement (not a requirement) because
PVMW rmap walkers always have that.  That already satisfies hugetlb_walk()'s
requirements even without holding the vma lock.
  
Peter Xu Dec. 8, 2022, 9:05 p.m. UTC | #4
On Thu, Dec 08, 2022 at 02:16:03PM +0100, David Hildenbrand wrote:
> On 07.12.22 21:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held.  Document the fact.
> > 
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/page_vma_mapped.c | 10 ++++++++--
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >   		/* The only possible mapping was handled on last iteration */
> >   		if (pvmw->pte)
> >   			return not_found(pvmw);
> > -
> > -		/* when pud is not present, pte will be NULL */
> > +		/*
> > +		 * NOTE: we don't need explicit lock here to walk the
> > +		 * hugetlb pgtable because either (1) potential callers of
> > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > +		 * When one day this rule breaks, one will get a warning
> > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > +		 */
> >   		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> >   		if (!pvmw->pte)
> >   			return false;
> 
> Would it make sense to squash that into the previous commit?

Sure thing.
  
John Hubbard Dec. 8, 2022, 9:54 p.m. UTC | #5
On 12/8/22 13:05, Peter Xu wrote:
>>> +		/*
>>> +		 * NOTE: we don't need explicit lock here to walk the
>>> +		 * hugetlb pgtable because either (1) potential callers of
>>> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
>>> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
>>> +		 * When one day this rule breaks, one will get a warning
>>> +		 * in hugetlb_walk(), and then we'll figure out what to do.
>>> +		 */
>>
>> Confused. Is this documentation actually intended to refer to hugetlb_walk()
>> itself, or just this call site? If the former, then let's move it over
>> to be right before hugetlb_walk().
> 
> It is for this specific code path, not hugetlb_walk().
> 
> The "holds i_mmap_rwsem" here is a true statement (not a requirement) because
> PVMW rmap walkers always have that.  That already satisfies hugetlb_walk()'s
> requirements even without holding the vma lock.
> 

It's really hard to understand. Do you have a few extra words to explain it?
I can help with actual comment wording perhaps, but I am still a bit in
the dark as to the actual meaning. :)

thanks,
  
Peter Xu Dec. 8, 2022, 10:21 p.m. UTC | #6
On Thu, Dec 08, 2022 at 01:54:27PM -0800, John Hubbard wrote:
> On 12/8/22 13:05, Peter Xu wrote:
> > > > +		/*
> > > > +		 * NOTE: we don't need explicit lock here to walk the
> > > > +		 * hugetlb pgtable because either (1) potential callers of
> > > > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > > > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > > > +		 * When one day this rule breaks, one will get a warning
> > > > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > > > +		 */
> > > 
> > > Confused. Is this documentation actually intended to refer to hugetlb_walk()
> > > itself, or just this call site? If the former, then let's move it over
> > > to be right before hugetlb_walk().
> > 
> > It is for this specific code path, not hugetlb_walk().
> > 
> > The "holds i_mmap_rwsem" here is a true statement (not a requirement) because
> > PVMW rmap walkers always have that.  That already satisfies hugetlb_walk()'s
> > requirements even without holding the vma lock.
> > 
> 
> It's really hard to understand. Do you have a few extra words to explain it?
> I can help with actual comment wording perhaps, but I am still a bit in
> the dark as to the actual meaning. :)

Firstly, this patch (to be squashed into the previous one) is trying to
document, in page_vma_mapped_walk(), why no further lock needs to be taken
before calling hugetlb_walk().

To call hugetlb_walk() we need either of the locks listed below (in either
read or write mode), according to the rules we set up for it in patch 3:

  (1) hugetlb vma lock
  (2) i_mmap_rwsem lock
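
As a reference, here is a rough sketch of how hugetlb_walk() could enforce
that rule via lockdep (this follows the upstream shape; helper names such as
__vma_shareable_lock() and the exact config guards are approximations and may
differ from what patch 3 of this series actually adds):

	static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
					  unsigned long addr, unsigned long sz)
	{
	#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_LOCKDEP)
		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

		/*
		 * Warn if the walker holds neither the hugetlb vma lock nor
		 * i_mmap_rwsem; this is the warning the comment refers to.
		 */
		if (__vma_shareable_lock(vma))
			WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
				     !lockdep_is_held(
					&vma->vm_file->f_mapping->i_mmap_rwsem));
	#endif
		return huge_pte_offset(vma->vm_mm, addr, sz);
	}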

page_vma_mapped_walk() is called at the sites below across the kernel:

__replace_page[179]            if (!page_vma_mapped_walk(&pvmw))
__damon_pa_mkold[24]           while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97]           while (page_vma_mapped_walk(&pvmw)) {
write_protect_page[1065]       if (!page_vma_mapped_walk(&pvmw))
remove_migration_pte[179]      while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318]        if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813]      while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958]      while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506]         while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881]       while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {

If we group them, we can see that most of them happen during an rmap walk
(i.e., they come from a higher rmap_walk() stack):

__damon_pa_mkold[24]           while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97]           while (page_vma_mapped_walk(&pvmw)) {
remove_migration_pte[179]      while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318]        if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813]      while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958]      while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506]         while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881]       while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {

Let's call it case (A).

We have another two special cases that are not during an rmap walk:

write_protect_page[1065]       if (!page_vma_mapped_walk(&pvmw))
__replace_page[179]            if (!page_vma_mapped_walk(&pvmw))

Let's call it case (B).

Case (A) is always safe because it always takes the i_mmap_rwsem lock in
read mode.  It's done in rmap_walk_file(), where:

	if (!locked) {
		if (i_mmap_trylock_read(mapping))
			goto lookup;

		if (rwc->try_lock) {
			rwc->contended = true;
			return;
		}

		i_mmap_lock_read(mapping);
	}

If locked==true it means the caller already holds the lock, so there is no
need to take it.  This justifies that all callers coming from rmap_walk() on a
hugetlb vma are already safe to call hugetlb_walk(), according to
hugetlb_walk()'s rule.
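
To make case (A) concrete, here is a simplified sketch (error handling and the
trylock path trimmed; not the verbatim upstream rmap_walk_file()) of how these
rmap walkers reach hugetlb_walk() with i_mmap_rwsem held:

	static void rmap_walk_file(struct folio *folio,
				   struct rmap_walk_control *rwc, bool locked)
	{
		struct address_space *mapping = folio_mapping(folio);
		pgoff_t pgoff_start = folio_pgoff(folio);
		pgoff_t pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
		struct vm_area_struct *vma;

		if (!locked)
			i_mmap_lock_read(mapping);	/* lock held from here... */

		vma_interval_tree_foreach(vma, &mapping->i_mmap,
					  pgoff_start, pgoff_end) {
			unsigned long address = vma_address(&folio->page, vma);

			/*
			 * rmap_one is e.g. folio_referenced_one() or
			 * try_to_unmap_one(); each calls
			 * page_vma_mapped_walk() -> hugetlb_walk() while
			 * i_mmap_rwsem is still held.
			 */
			if (!rwc->rmap_one(folio, vma, address, rwc->arg))
				break;
		}

		if (!locked)
			i_mmap_unlock_read(mapping);	/* ...held until here */
	}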

Case (B) contains two cases, one in the KSM path and one in the uprobe path,
and neither path (afaict) can get a hugetlb vma involved.  IOW, the branch

	if (unlikely(is_vm_hugetlb_page(vma))) {

in page_vma_mapped_walk() just should never trigger.

A shorter summary of the above is what will become the comment.

Hope it explains.  Thanks.
  
John Hubbard Dec. 9, 2022, 12:24 a.m. UTC | #7
On 12/8/22 14:21, Peter Xu wrote:
> 
> Firstly, this patch (to be squashed into the previous one) is trying to
> document, in page_vma_mapped_walk(), why no further lock needs to be taken
> before calling hugetlb_walk().
> 
> To call hugetlb_walk() we need either of the locks listed below (in either
> read or write mode), according to the rules we set up for it in patch 3:
> 
>    (1) hugetlb vma lock
>    (2) i_mmap_rwsem lock
> 
> page_vma_mapped_walk() is called at the sites below across the kernel:
> 
> __replace_page[179]            if (!page_vma_mapped_walk(&pvmw))
> __damon_pa_mkold[24]           while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97]           while (page_vma_mapped_walk(&pvmw)) {
> write_protect_page[1065]       if (!page_vma_mapped_walk(&pvmw))
> remove_migration_pte[179]      while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318]        if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813]      while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958]      while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506]         while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881]       while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {
> 
> If we group them, we can see that most of them happen during an rmap walk
> (i.e., they come from a higher rmap_walk() stack):
> 
> __damon_pa_mkold[24]           while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97]           while (page_vma_mapped_walk(&pvmw)) {
> remove_migration_pte[179]      while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318]        if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813]      while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958]      while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506]         while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881]       while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {
> 
> Let's call it case (A).
> 
> We have another two special cases that are not during an rmap walk:
> 
> write_protect_page[1065]       if (!page_vma_mapped_walk(&pvmw))
> __replace_page[179]            if (!page_vma_mapped_walk(&pvmw))
> 
> Let's call it case (B).
> 
> Case (A) is always safe because it always takes the i_mmap_rwsem lock in
> read mode.  It's done in rmap_walk_file(), where:
> 
> 	if (!locked) {
> 		if (i_mmap_trylock_read(mapping))
> 			goto lookup;
> 
> 		if (rwc->try_lock) {
> 			rwc->contended = true;
> 			return;
> 		}
> 
> 		i_mmap_lock_read(mapping);
> 	}
> 
> If locked==true it means the caller already holds the lock, so there is no
> need to take it.  This justifies that all callers coming from rmap_walk() on a
> hugetlb vma are already safe to call hugetlb_walk(), according to
> hugetlb_walk()'s rule.
> 
> Case (B) contains two cases, one in the KSM path and one in the uprobe path,
> and neither path (afaict) can get a hugetlb vma involved.  IOW, the branch
> 
> 	if (unlikely(is_vm_hugetlb_page(vma))) {
> 
> in page_vma_mapped_walk() just should never trigger.
> 
> A shorter summary of the above is what will become the comment.
> 
> Hope it explains.  Thanks.
> 

It does! And now for the comment, I think you'll find that this suffices:

		/*
		 * All callers that get here will already hold the i_mmap_rwsem.
		 * Therefore, no additional locks need to be taken before
		 * calling hugetlb_walk().
		 */

...which, considering all the data above, is probably the mother of
all summaries. :)  But really, it's all that people need to know here, and
it's readily understandable without wondering what KSM has to do with this,
for example.


thanks,
  
Peter Xu Dec. 9, 2022, 12:43 a.m. UTC | #8
On Thu, Dec 08, 2022 at 04:24:19PM -0800, John Hubbard wrote:
> It does! And now for the comment, I think you'll find that this suffices:
> 
> 		/*
> 		 * All callers that get here will already hold the i_mmap_rwsem.
> 		 * Therefore, no additional locks need to be taken before
> 		 * calling hugetlb_walk().
> 		 */
> 
> ...which, considering all the data above, is probably the mother of
> all summaries. :)  But really, it's all that people need to know here, and
> it's readily understandable without wondering what KSM has to do with this,
> for example.

I'm okay with the change. :)

I think what I'll do is move part of the original one into the commit
message, and take the new version in the code.

Thanks,
  

Patch

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e97b2e23bd28..2e59a0419d22 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -168,8 +168,14 @@  bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		/* The only possible mapping was handled on last iteration */
 		if (pvmw->pte)
 			return not_found(pvmw);
-
-		/* when pud is not present, pte will be NULL */
+		/*
+		 * NOTE: we don't need explicit lock here to walk the
+		 * hugetlb pgtable because either (1) potential callers of
+		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
+		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
+		 * When one day this rule breaks, one will get a warning
+		 * in hugetlb_walk(), and then we'll figure out what to do.
+		 */
 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte)
 			return false;