[RFC,00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Message ID

20221030212929.335473-1-peterx@redhat.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
        James Houghton <jthoughton@google.com>,
        Miaohe Lin <linmiaohe@huawei.com>,
        David Hildenbrand <david@redhat.com>,
        Muchun Song <songmuchun@bytedance.com>,
        Andrea Arcangeli <aarcange@redhat.com>,
        Nadav Amit <nadav.amit@gmail.com>,
        Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com,
        Rik van Riel <riel@surriel.com>
Subject: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for
 pmd unshare
Date: Sun, 30 Oct 2022 17:29:19 -0400
Message-Id: <20221030212929.335473-1-peterx@redhat.com>
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Message

Peter Xu Oct. 30, 2022, 9:29 p.m. UTC

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, so this series also depends on that one.  But
there are some huge_pte_offset() paths that still seem to be racy against
pmd unsharing (as they don't take the vma lock); more below.

Hopefully this series makes the resolution for pmd unsharing more
complete.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since it's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held for either read
or write.

For normal memory types that's enough, since any pgtable removal requires
the mmap write lock (e.g. munmap or mm destruction).  However, hugetlb has
the pmd unshare feature, which means the pgtable page we're walking can be
gone from under us during the walk, even after it has been unshared (in
this case it can only be the huge PUD page which contains 512 huge pmd
entries, with the vma mapped VM_SHARED).  That's possible because, even
though freeing the pgtable page requires the mmap write lock, that doesn't
help when we're walking another mm's pgtable, so we're still at risk even
while holding current->mm's mmap lock.

The recent work from Mike on the vma lock resolves most of this already.
It does so by forbidding pmd unsharing while the lock is held, so there is
no further risk of the pgtable page being freed.

But it only works if we take the vma lock at all the places around
huge_pte_offset().  There are already a bunch of them where we do so as of
the latest mm-unstable, but also a lot where we don't, for various
reasons.  E.g. it may not be applicable to contexts that are not allowed
to sleep, like FOLL_NOWAIT.

I have no report showing that such a race can be triggered, but code-wise
I don't see anything that stops the race from happening.  This series
tries to resolve that problem.

Resolution
==========

What this series proposes is that, besides using the vma lock, we can also
use RCU to protect the pgtable page from being freed from under us when
huge_pte_offset() is used.  The idea is similar to RCU fast-gup.  Note
that fast-gup was already safe regarding pmd unsharing even before the vma
lock work, because fast-gup relies on RCU to protect walking any pgtable
page, including another mm's.

Applying the same idea to huge_pte_offset() means that, with proper RCU
protection, the pte_t * pointer returned from huge_pte_offset() is always
safe to access and dereference, along with the pgtable lock that is bound
to the pgtable page.
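
As a rough userspace illustration of that rule (not code from this
series), the sketch below keeps the lookup and the dereference of its
result inside one RCU read-side section.  rcu_read_lock() and
rcu_read_unlock() are stubbed as a nesting counter, and
mock_huge_pte_offset(), fake_table and read_entry() are hypothetical
names; the real huge_pte_offset() takes an mm, an address and a size.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins so the sketch compiles outside the kernel. */
typedef unsigned long pte_t;

static int rcu_depth;	/* models RCU read-side nesting only */

static void rcu_read_lock(void)
{
	rcu_depth++;
}

static void rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/* Hypothetical one-level "pgtable": 512 entries, like a pmd page. */
#define PTRS_PER_TABLE 512
static pte_t fake_table[PTRS_PER_TABLE];

/* Mock lookup: the caller is responsible for keeping the table alive. */
static pte_t *mock_huge_pte_offset(unsigned long index)
{
	assert(rcu_depth > 0);	/* the rule: walk only while protected */
	if (index >= PTRS_PER_TABLE)
		return NULL;
	return &fake_table[index];
}

/* Read one entry; the pointer is never used after rcu_read_unlock(). */
static int read_entry(unsigned long index, pte_t *out)
{
	pte_t *ptep;
	int found = 0;

	rcu_read_lock();
	ptep = mock_huge_pte_offset(index);
	if (ptep) {
		*out = *ptep;	/* dereference while still protected */
		found = 1;
	}
	rcu_read_unlock();
	return found;
}
```

The point of the shape is that the returned pointer does not escape the
read-side section; only the copied value does.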

Patch Layout
============

Patch 1 is a trivial cleanup that I noticed when working on this.  Please
shout if anyone thinks I should post it separately; otherwise I'll just
carry it in this series.

Patch 2 is the core of the patchset, describing how we should use the
helper huge_pte_offset() correctly.  It's only a comment patch but should
be the most important one, as the follow-up patches just try to follow the
rule it sets up.

The remaining patches go over all the call sites of huge_pte_offset() to
make sure each one is either done with the vma lock held (which is
perfectly good enough for safety in this case; the last patch comments on
all those callers to make sure we won't miss a single case, and explains
why they're safe) or protected by RCU: each of those patches adds RCU
protection to one caller of huge_pte_offset().

Tests
=====

Only lightly tested on hugetlb kselftests including uffd, no more errors
triggered than current mm-unstable (hugetlb-madvise fails before/after
here, with error "Unexpected number of free huge pages line 207"; haven't
really got time to look into it).

Since this has so far only been discussed briefly with Mike in the other
thread, I'm marking this as RFC for now, as I could have missed something.

Comments welcome, thanks.

Peter Xu (10):
  mm/hugetlb: Let vma_offset_start() to return start
  mm/hugetlb: Comment huge_pte_offset() for its locking requirements
  mm/hugetlb: Make hugetlb_vma_maps_page() RCU-safe
  mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe
  mm/hugetlb: Make walk_hugetlb_range() RCU-safe
  mm/hugetlb: Make page_vma_mapped_walk() RCU-safe
  mm/hugetlb: Make hugetlb_follow_page_mask() RCU-safe
  mm/hugetlb: Make follow_hugetlb_page RCU-safe
  mm/hugetlb: Make hugetlb_fault() RCU-safe
  mm/hugetlb: Comment at rest huge_pte_offset() places

 arch/arm64/mm/hugetlbpage.c | 32 ++++++++++++++++++++++++++
 fs/hugetlbfs/inode.c        | 39 ++++++++++++++++++--------------
 fs/userfaultfd.c            |  4 ++++
 include/linux/rmap.h        |  3 +++
 mm/hugetlb.c                | 45 +++++++++++++++++++++++++++++++++++--
 mm/page_vma_mapped.c        |  7 +++++-
 mm/pagewalk.c               |  5 +++++
 7 files changed, 115 insertions(+), 20 deletions(-)

Comments

Mike Kravetz Nov. 4, 2022, 12:21 a.m. UTC | #1

On 10/30/22 17:29, Peter Xu wrote:
> Resolution
> ==========
> 
> What this patch proposed is, besides using the vma lock, we can also use
> RCU to protect the pgtable page from being freed from under us when
> huge_pte_offset() is used.  The idea is kind of similar to RCU fast-gup.
> Note that fast-gup is very safe regarding pmd unsharing even before vma
> lock, because fast-gup relies on RCU to protect walking any pgtable page,
> including another mm's.
> 
> To apply the same idea to huge_pte_offset(), it means with proper RCU
> protection the pte_t* pointer returned from huge_pte_offset() can also be
> always safe to access and de-reference, along with the pgtable lock that
> was bound to the pgtable page.
> 
> Patch Layout
> ============
> 
> Patch 1 is a trivial cleanup that I noticed when working on this.  Please
> shoot if anyone think I should just post it separately, or hopefully I can
> still just carry it over.
> 
> Patch 2 is the gut of the patchset, describing how we should use the helper
> huge_pte_offset() correctly. Only a comment patch but should be the most
> important one, as the follow up patches are just trying to follow the rule
> it setup here.
> 
> The rest patches resolve all the call sites of huge_pte_offset() to make
> sure either it's with the vma lock (which is perfectly good enough for
> safety in this case; the last patch commented on all those callers to make
> sure we won't miss a single case, and why they're safe).  Besides, each of
> the patch will add rcu protection to one caller of huge_pte_offset().
> 
> Tests
> =====
> 
> Only lightly tested on hugetlb kselftests including uffd, no more errors
> triggered than current mm-unstable (hugetlb-madvise fails before/after
> here, with error "Unexpected number of free huge pages line 207"; haven't
> really got time to look into it).

Do not worry about the madvise test failure, that is caused by a recent
change.

Unless I am missing something, the basic strategy in this series is to
wrap calls to huge_pte_offset and subsequent ptep access with
rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
been a loooong time since I had to look at rcu usage and may not know
what I am talking about.  However, I seem to recall that one needs to
somehow flag the data items being protected from update/freeing.  I
do not see anything like that in the huge_pmd_unshare routine where
pmd page pointer is updated.  Or, is it where the pmd page pointer is
referenced in huge_pte_offset?

Please ignore if you are certain of this rcu usage, otherwise I will
spend some time reeducating myself.

Peter Xu Nov. 4, 2022, 3:02 p.m. UTC | #2

Hi, Mike,

On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> On 10/30/22 17:29, Peter Xu wrote:
> > Resolution
> > ==========
> > 
> > What this patch proposed is, besides using the vma lock, we can also use
> > RCU to protect the pgtable page from being freed from under us when
> > huge_pte_offset() is used.  The idea is kind of similar to RCU fast-gup.
> > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > including another mm's.
> > 
> > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > always safe to access and de-reference, along with the pgtable lock that
> > was bound to the pgtable page.
> > 
> > Patch Layout
> > ============
> > 
> > Patch 1 is a trivial cleanup that I noticed when working on this.  Please
> > shoot if anyone think I should just post it separately, or hopefully I can
> > still just carry it over.
> > 
> > Patch 2 is the gut of the patchset, describing how we should use the helper
> > huge_pte_offset() correctly. Only a comment patch but should be the most
> > important one, as the follow up patches are just trying to follow the rule
> > it setup here.
> > 
> > The rest patches resolve all the call sites of huge_pte_offset() to make
> > sure either it's with the vma lock (which is perfectly good enough for
> > safety in this case; the last patch commented on all those callers to make
> > sure we won't miss a single case, and why they're safe).  Besides, each of
> > the patch will add rcu protection to one caller of huge_pte_offset().
> > 
> > Tests
> > =====
> > 
> > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > here, with error "Unexpected number of free huge pages line 207"; haven't
> > really got time to look into it).
> 
> Do not worry about the madvise test failure, that is caused by a recent
> change.
> 
> Unless I am missing something, the basic strategy in this series is to
> wrap calls to huge_pte_offset and subsequent ptep access with
> rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
> been a loooong time since I had to look at rcu usage and may not know
> what I am talking about.  However, I seem to recall that one needs to
> somehow flag the data items being protected from update/freeing.  I
> do not see anything like that in the huge_pmd_unshare routine where
> pmd page pointer is updated.  Or, is it where the pmd page pointer is
> referenced in huge_pte_offset?

Right.  The RCU usage proposed here is meant to protect the pmd pgtable
page, which is normally freed in an RCU pattern.  Please refer to
tlb_remove_table_free() (which can be called from tlb_finish_mmu()), where
it's released with the RCU API:

	call_rcu(&batch->rcu, tlb_remove_table_rcu);

I mentioned fast-gup only as a reference for the same usage, since fast-gup
has the same risk without RCU or a similar IPI-based protection, but I can
definitely be clearer, and I will enrich the cover letter in the next
post.

In short, my understanding is that pgtable pages (including the shared PUD
page for hugetlb) need to be freed with caution, because there can be
software walking the pages with no locks.  In our case, even though
huge_pte_offset() is called with the mmap lock held, due to pmd sharing it
isn't always the same mmap lock as the one held when the pgtable is freed,
so it's similar to having no lock here, imo.  So huge_pte_offset() needs
to be protected just like what we do with fast-gup.
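
A toy, single-threaded model of why this helps (an illustration only, not
how the kernel implements RCU): a table queued for freeing is only released
once no reader is inside a read-side section.  All toy_* names and
freed_tables are hypothetical.  Real RCU only waits for readers that
started before call_rcu(); the toy conservatively waits for all of them.

```c
#include <assert.h>
#include <stdlib.h>

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

static int readers;			/* active read-side sections */
static struct rcu_head *pending;	/* callbacks waiting to run */
static int freed_tables;		/* how many tables were released */

/* Run queued callbacks once no reader can still hold a pointer. */
static void toy_run_callbacks(void)
{
	while (readers == 0 && pending) {
		struct rcu_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

static void toy_rcu_read_lock(void)
{
	readers++;
}

static void toy_rcu_read_unlock(void)
{
	assert(readers > 0);
	readers--;
	toy_run_callbacks();
}

/* Queue a deferred free, in the spirit of call_rcu(). */
static void toy_call_rcu(struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
	toy_run_callbacks();
}

struct toy_table {
	struct rcu_head rcu;	/* first member, so the cast below is valid */
	unsigned long entries[512];
};

static void toy_free_table(struct rcu_head *head)
{
	free((struct toy_table *)head);
	freed_tables++;
}
```

A walker that is between toy_rcu_read_lock() and toy_rcu_read_unlock()
when the table is queued can keep touching it; the free only happens when
it leaves.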

Please also feel free to refer to the comment chunk at the start of
asm-generic/tlb.h for more information on the mmu gather API.

> 
> Please ignore if you are certain of this rcu usage, otherwise I will
> spend some time reeducating myself.

I'm not certain, and I'd like to get any form of comment. :)

Sorry if this RFC version is confusing, but if it at least explains the
problem we have, and we can agree on the problem first, then that'll
already be a step forward to me.  So far that's more important than how we
resolve it, using RCU or vma lock or anything else.

For a non-RFC series, I think I need to be more careful on some details,
e.g., the RCU protection for pgtable pages is only used when the arch
supports MMU_GATHER_RCU_TABLE_FREE.  I thought that was always supported,
at least for archs with pmd sharing enabled, but I was actually wrong:

arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE

arch/arm/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE
arch/powerpc/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE
arch/s390/Kconfig:      select MMU_GATHER_RCU_TABLE_FREE
arch/sparc/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if SMP
arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE        if PARAVIRT

I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll
need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock()
with local_irq_disable() on RISCV only for what this patchset wanted to
do).  In the next version, I plan to add a helper, let's name it
huge_pte_walker_lock() for now, and it should be one of the three options:

  - if !ARCH_WANT_HUGE_PMD_SHARE:      it's no-op
  - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
  - else:                              it should be local_irq_disable()

With that, I think we'll strictly follow what we have with fast-gup, and at
the same time it should add zero overhead on archs that do not have pmd
sharing.
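
A minimal userspace sketch of how those three options could be selected at
build time (my reading of the plan above, not the eventual patch).  The
kernel primitives are replaced by call counters, the two CONFIG_ macros
are defined by hand here to stand in for Kconfig, and
huge_pte_walker_unlock() is a hypothetical counterpart name.

```c
#include <assert.h>

/* Pretend Kconfig: edit these two lines to try the other branches. */
#define CONFIG_ARCH_WANT_HUGE_PMD_SHARE 1
#define CONFIG_MMU_GATHER_RCU_TABLE_FREE 1

/* Userspace stand-ins that only count calls. */
static int rcu_readers, irq_off;

static inline void rcu_read_lock(void)     { rcu_readers++; }
static inline void rcu_read_unlock(void)   { rcu_readers--; }
static inline void local_irq_disable(void) { irq_off++; }
static inline void local_irq_enable(void)  { irq_off--; }

#if !defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE)
/* No pmd sharing: nothing to protect against, so no overhead. */
static inline void huge_pte_walker_lock(void)   { }
static inline void huge_pte_walker_unlock(void) { }
#elif defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
/* Pgtable pages are freed via RCU: a read-side section is enough. */
static inline void huge_pte_walker_lock(void)   { rcu_read_lock(); }
static inline void huge_pte_walker_unlock(void) { rcu_read_unlock(); }
#else
/* Freeing relies on IPIs: keep this CPU from acknowledging one. */
static inline void huge_pte_walker_lock(void)   { local_irq_disable(); }
static inline void huge_pte_walker_unlock(void) { local_irq_enable(); }
#endif
```

With both macros defined as above, a lock/unlock pair goes through the RCU
stand-ins and never touches the irq counter.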

Hope the above helps a bit in filling in the missing pieces of the cover
letter.  And again, if anything is missing I'd be more than glad to know.

Thanks,

Mike Kravetz Nov. 4, 2022, 3:44 p.m. UTC | #3

On 11/04/22 11:02, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> > On 10/30/22 17:29, Peter Xu wrote:
> > > Resolution
> > > ==========
> > > 
> > > What this patch proposed is, besides using the vma lock, we can also use
> > > RCU to protect the pgtable page from being freed from under us when
> > > huge_pte_offset() is used.  The idea is kind of similar to RCU fast-gup.
> > > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > > including another mm's.
> > > 
> > > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > > always safe to access and de-reference, along with the pgtable lock that
> > > was bound to the pgtable page.
> > > 
> > > Patch Layout
> > > ============
> > > 
> > > Patch 1 is a trivial cleanup that I noticed when working on this.  Please
> > > shoot if anyone think I should just post it separately, or hopefully I can
> > > still just carry it over.
> > > 
> > > Patch 2 is the gut of the patchset, describing how we should use the helper
> > > huge_pte_offset() correctly. Only a comment patch but should be the most
> > > important one, as the follow up patches are just trying to follow the rule
> > > it setup here.
> > > 
> > > The rest patches resolve all the call sites of huge_pte_offset() to make
> > > sure either it's with the vma lock (which is perfectly good enough for
> > > safety in this case; the last patch commented on all those callers to make
> > > sure we won't miss a single case, and why they're safe).  Besides, each of
> > > the patch will add rcu protection to one caller of huge_pte_offset().
> > > 
> > > Tests
> > > =====
> > > 
> > > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > > here, with error "Unexpected number of free huge pages line 207"; haven't
> > > really got time to look into it).
> > 
> > Do not worry about the madvise test failure, that is caused by a recent
> > change.
> > 
> > Unless I am missing something, the basic strategy in this series is to
> > wrap calls to huge_pte_offset and subsequent ptep access with
> > rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
> > been a loooong time since I had to look at rcu usage and may not know
> > what I am talking about.  However, I seem to recall that one needs to
> > somehow flag the data items being protected from update/freeing.  I
> > do not see anything like that in the huge_pmd_unshare routine where
> > pmd page pointer is updated.  Or, is it where the pmd page pointer is
> > referenced in huge_pte_offset?
> 
> Right.  The RCU proposed here is trying to protect the pmd pgtable page
> that will normally be freed in rcu pattern.  Please refer to
> tlb_remove_table_free() (which can be called from tlb_finish_mmu()) where
> it's released with RCU API:
> 
> 	call_rcu(&batch->rcu, tlb_remove_table_rcu);
> 

Thanks!  That is the piece of the puzzle I was missing.

> I mentioned fast-gup just to refererence on the same usage as fast-gup has
> the same risk if without RCU or similar protections that is IPI-based, but
> I definitely can be even clearer, and I will enrich the cover letter in the
> next post.
> 
> In short, my understanding is pgtable pages (including the shared PUD page
> for hugetlb) needs to be freed with caution because there can be softwares
> that are walking the pages with no locks.  In our case, even though
> huge_pte_offset() is with the mmap lock, due to the pmd sharing it's not
> always having the same mmap lock as when the pgtable needs to be freed, so
> it's similar to having no lock here, imo.  Then huge_pte_offset() needs to
> be protected just like what we do with fast-gup.
> 
> Please also feel free to refer to the comment chunk at the start of
> asm-generic/tlb.h for more information on the mmu gather API.
> 
> > 
> > Please ignore if you are certain of this rcu usage, otherwise I will
> > spend some time reeducating myself.

Sorry for any misunderstanding.  I am very happy with the RFC and the
work you have done.  I was just missing the piece about rcu
synchronization when the page table was removed.

> I'm not certain, and I'd like to get any form of comment. :)
> 
> Sorry if this RFC version is confusing, but if it can try to at least
> explain what the problem we have and if we can agree on the problem first
> then that'll already be a step forward to me.  So far that's more important
> than how we resolve it, using RCU or vma lock or anything else.
> 
> For a non-rfc series, I think I need to be more careful on some details,
> e.g., the RCU protection for pgtable page is only used when the arch
> supports MMU_GATHER_RCU_TABLE_FREE.  I thought that's always supported at
> least for pmd sharing enabled archs, but I'm actually wrong:
> 
> arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
> arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE
> 
> arch/arm/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
> arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE
> arch/powerpc/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE
> arch/s390/Kconfig:      select MMU_GATHER_RCU_TABLE_FREE
> arch/sparc/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if SMP
> arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE        if PARAVIRT
> 
> I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll
> need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock()
> with local_irq_disable() on RISCV only for what this patchset wanted to
> do).  In the next version, I plan to add a helper, let's name it
> huge_pte_walker_lock() for now, and it should be one of the three options:
> 
>   - if !ARCH_WANT_HUGE_PMD_SHARE:      it's no-op
>   - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
>   - else:                              it should be local_irq_disable()
> 
> With that, I think we'll strictly follow what we have with fast-gup, at the
> meantime it should add zero overhead on archs that does not have pmd sharing.
> 
> Hope above helps a bit on extending the missing pieces of the cover
> letter.  Or again if anything missing I'd be more than glad to know..