arm64: mm: drop tlb flush operation when clearing the access bit

Message ID ae3115778a3fa10ec77152e18beed54fafe0f6e7.1698151516.git.baolin.wang@linux.alibaba.com
State New
Headers
Series arm64: mm: drop tlb flush operation when clearing the access bit |

Commit Message

Baolin Wang Oct. 24, 2023, 12:56 p.m. UTC
  Now ptep_clear_flush_young() is only called by folio_referenced() to
check if the folio was referenced, and now it will call a tlb flush on
ARM64 architecture. However the tlb flush can be expensive on ARM64
servers, especially for the systems with a large CPU numbers.

Similar to the x86 architecture, below comments also apply equally to
ARM64 architecture. So we can drop the tlb flush operation in
ptep_clear_flush_young() on ARM64 architecture to improve the performance.
"
/* Clearing the accessed bit without a TLB flush
 * doesn't cause data corruption. [ It could cause incorrect
 * page aging and the (mistaken) reclaim of hot pages, but the
 * chance of that should be relatively low. ]
 *
 * So as a performance optimization don't flush the TLB when
 * clearing the accessed bit, it will eventually be flushed by
 * a context switch or a VM operation anyway. [ In the rare
 * event of it not getting flushed for a long time the delay
 * shouldn't really matter because there's no real memory
 * pressure for swapout to react to. ]
 */
"
Running the thpscale to show some obvious improvements for compaction
latency with this patch:
                             base                   patched
Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
                       base       patched
Duration User         167.78      161.03
Duration System      1836.66     1673.01
Duration Elapsed     2074.58     2059.75

Barry Song submitted a similar patch [1] before, that replaces the
ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
folio_referenced_one(). However, I'm not sure if removing the tlb flush
operation is applicable to every architecture in kernel, so dropping
the tlb flush for ARM64 seems a sensible change.

Note: I am okay for both approach, if someone can help to ensure that
all architectures do not need the tlb flush when clearing the accessed
bit, then I also think Barry's patch is better (hope Barry can resend
his patch).

[1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)
  

Comments

Kefeng Wang Oct. 24, 2023, 1:48 p.m. UTC | #1
On 2023/10/24 20:56, Baolin Wang wrote:
> Now ptep_clear_flush_young() is only called by folio_referenced() to
> check if the folio was referenced, and now it will call a tlb flush on
> ARM64 architecture. However the tlb flush can be expensive on ARM64
> servers, especially for the systems with a large CPU numbers.
> 
> Similar to the x86 architecture, below comments also apply equally to
> ARM64 architecture. So we can drop the tlb flush operation in
> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> "
> /* Clearing the accessed bit without a TLB flush
>   * doesn't cause data corruption. [ It could cause incorrect
>   * page aging and the (mistaken) reclaim of hot pages, but the
>   * chance of that should be relatively low. ]
>   *
>   * So as a performance optimization don't flush the TLB when
>   * clearing the accessed bit, it will eventually be flushed by
>   * a context switch or a VM operation anyway. [ In the rare
>   * event of it not getting flushed for a long time the delay
>   * shouldn't really matter because there's no real memory
>   * pressure for swapout to react to. ]
>   */
> "
> Running the thpscale to show some obvious improvements for compaction
> latency with this patch:
>                               base                   patched
> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>                         base       patched
> Duration User         167.78      161.03
> Duration System      1836.66     1673.01
> Duration Elapsed     2074.58     2059.75
> 
> Barry Song submitted a similar patch [1] before, that replaces the
> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> folio_referenced_one(). However, I'm not sure if removing the tlb flush
> operation is applicable to every architecture in kernel, so dropping
> the tlb flush for ARM64 seems a sensible change.

At least x86/s390/riscv/powerpc already do it, also I think we could
change pmdp_clear_flush_young_notify() too, since it is same with
ptep_clear_flush_young_notify(),

> 
> Note: I am okay for both approach, if someone can help to ensure that
> all architectures do not need the tlb flush when clearing the accessed
> bit, then I also think Barry's patch is better (hope Barry can resend
> his patch).
> 
> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>   1 file changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0bd18de9fd97..2979d796ba9d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>   					 unsigned long address, pte_t *ptep)
>   {
> -	int young = ptep_test_and_clear_young(vma, address, ptep);
> -
> -	if (young) {
> -		/*
> -		 * We can elide the trailing DSB here since the worst that can
> -		 * happen is that a CPU continues to use the young entry in its
> -		 * TLB and we mistakenly reclaim the associated page. The
> -		 * window for such an event is bounded by the next
> -		 * context-switch, which provides a DSB to complete the TLB
> -		 * invalidation.
> -		 */
> -		flush_tlb_page_nosync(vma, address);
> -	}
> -
> -	return young;
> +	/*
> +	 * This comment is borrowed from x86, but applies equally to ARM64:
> +	 *
> +	 * Clearing the accessed bit without a TLB flush doesn't cause
> +	 * data corruption. [ It could cause incorrect page aging and
> +	 * the (mistaken) reclaim of hot pages, but the chance of that
> +	 * should be relatively low. ]
> +	 *
> +	 * So as a performance optimization don't flush the TLB when
> +	 * clearing the accessed bit, it will eventually be flushed by
> +	 * a context switch or a VM operation anyway. [ In the rare
> +	 * event of it not getting flushed for a long time the delay
> +	 * shouldn't really matter because there's no real memory
> +	 * pressure for swapout to react to. ]
> +	 */
> +	return ptep_test_and_clear_young(vma, address, ptep);
>   }
>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  
Yu Zhao Oct. 24, 2023, 10:32 p.m. UTC | #2
On Tue, Oct 24, 2023 at 6:56 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Now ptep_clear_flush_young() is only called by folio_referenced() to
> check if the folio was referenced, and now it will call a tlb flush on
> ARM64 architecture. However the tlb flush can be expensive on ARM64
> servers, especially for the systems with a large CPU numbers.
>
> Similar to the x86 architecture, below comments also apply equally to
> ARM64 architecture. So we can drop the tlb flush operation in
> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> "
> /* Clearing the accessed bit without a TLB flush
>  * doesn't cause data corruption. [ It could cause incorrect
>  * page aging and the (mistaken) reclaim of hot pages, but the
>  * chance of that should be relatively low. ]
>  *
>  * So as a performance optimization don't flush the TLB when
>  * clearing the accessed bit, it will eventually be flushed by
>  * a context switch or a VM operation anyway. [ In the rare
>  * event of it not getting flushed for a long time the delay
>  * shouldn't really matter because there's no real memory
>  * pressure for swapout to react to. ]
>  */
> "
> Running the thpscale to show some obvious improvements for compaction
> latency with this patch:
>                              base                   patched
> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>                        base       patched
> Duration User         167.78      161.03
> Duration System      1836.66     1673.01
> Duration Elapsed     2074.58     2059.75
>
> Barry Song submitted a similar patch [1] before, that replaces the
> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> folio_referenced_one(). However, I'm not sure if removing the tlb flush
> operation is applicable to every architecture in kernel, so dropping
> the tlb flush for ARM64 seems a sensible change.
>
> Note: I am okay for both approach, if someone can help to ensure that
> all architectures do not need the tlb flush when clearing the accessed
> bit, then I also think Barry's patch is better (hope Barry can resend
> his patch).
>
> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

+Minchan Kim

Minchan and I discussed this (again) yesterday -- I'm in favor and he
can voice his different opinion on this.
  
Barry Song Oct. 24, 2023, 11:16 p.m. UTC | #3
On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Now ptep_clear_flush_young() is only called by folio_referenced() to
> check if the folio was referenced, and now it will call a tlb flush on
> ARM64 architecture. However the tlb flush can be expensive on ARM64
> servers, especially for the systems with a large CPU numbers.
>
> Similar to the x86 architecture, below comments also apply equally to
> ARM64 architecture. So we can drop the tlb flush operation in
> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> "
> /* Clearing the accessed bit without a TLB flush
>  * doesn't cause data corruption. [ It could cause incorrect
>  * page aging and the (mistaken) reclaim of hot pages, but the
>  * chance of that should be relatively low. ]
>  *
>  * So as a performance optimization don't flush the TLB when
>  * clearing the accessed bit, it will eventually be flushed by
>  * a context switch or a VM operation anyway. [ In the rare
>  * event of it not getting flushed for a long time the delay
>  * shouldn't really matter because there's no real memory
>  * pressure for swapout to react to. ]
>  */
> "
> Running the thpscale to show some obvious improvements for compaction
> latency with this patch:
>                              base                   patched
> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>                        base       patched
> Duration User         167.78      161.03
> Duration System      1836.66     1673.01
> Duration Elapsed     2074.58     2059.75
>
> Barry Song submitted a similar patch [1] before, that replaces the
> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> folio_referenced_one(). However, I'm not sure if removing the tlb flush
> operation is applicable to every architecture in kernel, so dropping
> the tlb flush for ARM64 seems a sensible change.
>
> Note: I am okay for both approach, if someone can help to ensure that
> all architectures do not need the tlb flush when clearing the accessed
> bit, then I also think Barry's patch is better (hope Barry can resend
> his patch).
>

Thanks!

ptep_clear_flush_young() with "flush" in its name clearly says it needs a
flush. but it happens in arm64, all other code which needs a flush has
called other variants, for example __flush_tlb_page_nosync():

static inline void arch_tlbbatch_add_pending(struct
arch_tlbflush_unmap_batch *batch,
 struct mm_struct *mm, unsigned long uaddr)
{
 __flush_tlb_page_nosync(mm, uaddr);
}

so it seems folio_referenced is the only left user of this
ptep_clear_flush_young().
The fact makes Baolin's patch look safe now.

but this function still has "flush" in its name, so one day, one person might
call it with the understanding that it will flush tlb but actually it
won't. This is
bad smell in code.

I guess one side effect of not flushing tlb while clearing the access
flag is that
hardware won't see this cleared flag in the tlb, so it might not set this bit in
memory even though the bit has been cleared before in memory(but not in tlb)
while the page is accessed *again*.

next time, someone reads the access flag in memory again by folio_referenced,
he/she will see the page is cold as hardware has lost a chance to set
the bit again
since it finds tlb already has a true access flag.

But anyway, tlb is so small, it will be flushed by context switch and
other running
code soon. so it seems we don't actually require the access flag being instantly
updated. the time gap, in which access flag might lose the new set by hardware,
seems to be too short to really affect the accuracy of page reclamation. but its
cost is large.

(A). Constant flush cost vs. (B). very very occasional reclaimed hot
page,  B might
be a correct choice.

> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>  1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0bd18de9fd97..2979d796ba9d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>                                          unsigned long address, pte_t *ptep)
>  {
> -       int young = ptep_test_and_clear_young(vma, address, ptep);
> -
> -       if (young) {
> -               /*
> -                * We can elide the trailing DSB here since the worst that can
> -                * happen is that a CPU continues to use the young entry in its
> -                * TLB and we mistakenly reclaim the associated page. The
> -                * window for such an event is bounded by the next
> -                * context-switch, which provides a DSB to complete the TLB
> -                * invalidation.
> -                */
> -               flush_tlb_page_nosync(vma, address);
> -       }
> -
> -       return young;
> +       /*
> +        * This comment is borrowed from x86, but applies equally to ARM64:
> +        *
> +        * Clearing the accessed bit without a TLB flush doesn't cause
> +        * data corruption. [ It could cause incorrect page aging and
> +        * the (mistaken) reclaim of hot pages, but the chance of that
> +        * should be relatively low. ]
> +        *
> +        * So as a performance optimization don't flush the TLB when
> +        * clearing the accessed bit, it will eventually be flushed by
> +        * a context switch or a VM operation anyway. [ In the rare
> +        * event of it not getting flushed for a long time the delay
> +        * shouldn't really matter because there's no real memory
> +        * pressure for swapout to react to. ]
> +        */
> +       return ptep_test_and_clear_young(vma, address, ptep);
>  }
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> --
> 2.39.3
>

Thanks
Barry
  
Barry Song Oct. 24, 2023, 11:31 p.m. UTC | #4
On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > Now ptep_clear_flush_young() is only called by folio_referenced() to
> > check if the folio was referenced, and now it will call a tlb flush on
> > ARM64 architecture. However the tlb flush can be expensive on ARM64
> > servers, especially for the systems with a large CPU numbers.
> >
> > Similar to the x86 architecture, below comments also apply equally to
> > ARM64 architecture. So we can drop the tlb flush operation in
> > ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> > "
> > /* Clearing the accessed bit without a TLB flush
> >  * doesn't cause data corruption. [ It could cause incorrect
> >  * page aging and the (mistaken) reclaim of hot pages, but the
> >  * chance of that should be relatively low. ]
> >  *
> >  * So as a performance optimization don't flush the TLB when
> >  * clearing the accessed bit, it will eventually be flushed by
> >  * a context switch or a VM operation anyway. [ In the rare
> >  * event of it not getting flushed for a long time the delay
> >  * shouldn't really matter because there's no real memory
> >  * pressure for swapout to react to. ]
> >  */
> > "
> > Running the thpscale to show some obvious improvements for compaction
> > latency with this patch:
> >                              base                   patched
> > Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> > Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> > Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> > Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> > Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> > Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> > Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> > Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> > Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
> >                        base       patched
> > Duration User         167.78      161.03
> > Duration System      1836.66     1673.01
> > Duration Elapsed     2074.58     2059.75
> >
> > Barry Song submitted a similar patch [1] before, that replaces the
> > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> > folio_referenced_one(). However, I'm not sure if removing the tlb flush
> > operation is applicable to every architecture in kernel, so dropping
> > the tlb flush for ARM64 seems a sensible change.
> >
> > Note: I am okay for both approach, if someone can help to ensure that
> > all architectures do not need the tlb flush when clearing the accessed
> > bit, then I also think Barry's patch is better (hope Barry can resend
> > his patch).
> >
>
> Thanks!
>
> ptep_clear_flush_young() with "flush" in its name clearly says it needs a
> flush. but it happens in arm64, all other code which needs a flush has
> called other variants, for example __flush_tlb_page_nosync():
>
> static inline void arch_tlbbatch_add_pending(struct
> arch_tlbflush_unmap_batch *batch,
>  struct mm_struct *mm, unsigned long uaddr)
> {
>  __flush_tlb_page_nosync(mm, uaddr);
> }
>
> so it seems folio_referenced is the only left user of this
> ptep_clear_flush_young().
> The fact makes Baolin's patch look safe now.
>
> but this function still has "flush" in its name, so one day, one person might
> call it with the understanding that it will flush tlb but actually it
> won't. This is
> bad smell in code.
>
> I guess one side effect of not flushing tlb while clearing the access
> flag is that
> hardware won't see this cleared flag in the tlb, so it might not set this bit in
> memory even though the bit has been cleared before in memory(but not in tlb)
> while the page is accessed *again*.
>
> next time, someone reads the access flag in memory again by folio_referenced,
> he/she will see the page is cold as hardware has lost a chance to set
> the bit again
> since it finds tlb already has a true access flag.
>
> But anyway, tlb is so small, it will be flushed by context switch and
> other running
> code soon. so it seems we don't actually require the access flag being instantly
> updated. the time gap, in which access flag might lose the new set by hardware,
> seems to be too short to really affect the accuracy of page reclamation. but its
> cost is large.
>
> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> page,  B might
> be a correct choice.

Plus, I doubt B is really going to happen. as after a page is promoted to
the head of lru list or new generation, it needs a long time to slide back
to the inactive list tail or to the candidate generation of mglru. the time
should have been large enough for tlb to be flushed. If the page is really
hot, the hardware will get second, third, fourth etc opportunity to set an
access flag in the long time in which the page is re-moved to the tail
as the page can be accessed multiple times if it is really hot.

>
> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
> >  1 file changed, 16 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 0bd18de9fd97..2979d796ba9d 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >                                          unsigned long address, pte_t *ptep)
> >  {
> > -       int young = ptep_test_and_clear_young(vma, address, ptep);
> > -
> > -       if (young) {
> > -               /*
> > -                * We can elide the trailing DSB here since the worst that can
> > -                * happen is that a CPU continues to use the young entry in its
> > -                * TLB and we mistakenly reclaim the associated page. The
> > -                * window for such an event is bounded by the next
> > -                * context-switch, which provides a DSB to complete the TLB
> > -                * invalidation.
> > -                */
> > -               flush_tlb_page_nosync(vma, address);
> > -       }
> > -
> > -       return young;
> > +       /*
> > +        * This comment is borrowed from x86, but applies equally to ARM64:
> > +        *
> > +        * Clearing the accessed bit without a TLB flush doesn't cause
> > +        * data corruption. [ It could cause incorrect page aging and
> > +        * the (mistaken) reclaim of hot pages, but the chance of that
> > +        * should be relatively low. ]
> > +        *
> > +        * So as a performance optimization don't flush the TLB when
> > +        * clearing the accessed bit, it will eventually be flushed by
> > +        * a context switch or a VM operation anyway. [ In the rare
> > +        * event of it not getting flushed for a long time the delay
> > +        * shouldn't really matter because there's no real memory
> > +        * pressure for swapout to react to. ]
> > +        */
> > +       return ptep_test_and_clear_young(vma, address, ptep);
> >  }
> >
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > --
> > 2.39.3
> >
>
> Thanks
> Barry
  
Alistair Popple Oct. 25, 2023, 1:07 a.m. UTC | #5
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>> >
>> > Now ptep_clear_flush_young() is only called by folio_referenced() to
>> > check if the folio was referenced, and now it will call a tlb flush on
>> > ARM64 architecture. However the tlb flush can be expensive on ARM64
>> > servers, especially for the systems with a large CPU numbers.
>> >
>> > Similar to the x86 architecture, below comments also apply equally to
>> > ARM64 architecture. So we can drop the tlb flush operation in
>> > ptep_clear_flush_young() on ARM64 architecture to improve the performance.
>> > "
>> > /* Clearing the accessed bit without a TLB flush
>> >  * doesn't cause data corruption. [ It could cause incorrect
>> >  * page aging and the (mistaken) reclaim of hot pages, but the
>> >  * chance of that should be relatively low. ]
>> >  *
>> >  * So as a performance optimization don't flush the TLB when
>> >  * clearing the accessed bit, it will eventually be flushed by
>> >  * a context switch or a VM operation anyway. [ In the rare
>> >  * event of it not getting flushed for a long time the delay
>> >  * shouldn't really matter because there's no real memory
>> >  * pressure for swapout to react to. ]
>> >  */
>> > "
>> > Running the thpscale to show some obvious improvements for compaction
>> > latency with this patch:
>> >                              base                   patched
>> > Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>> > Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>> > Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>> > Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>> > Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>> > Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>> > Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>> > Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>> > Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>> >                        base       patched
>> > Duration User         167.78      161.03
>> > Duration System      1836.66     1673.01
>> > Duration Elapsed     2074.58     2059.75
>> >
>> > Barry Song submitted a similar patch [1] before, that replaces the
>> > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>> > folio_referenced_one(). However, I'm not sure if removing the tlb flush
>> > operation is applicable to every architecture in kernel, so dropping
>> > the tlb flush for ARM64 seems a sensible change.
>> >
>> > Note: I am okay for both approach, if someone can help to ensure that
>> > all architectures do not need the tlb flush when clearing the accessed
>> > bit, then I also think Barry's patch is better (hope Barry can resend
>> > his patch).
>> >
>>
>> Thanks!
>>
>> ptep_clear_flush_young() with "flush" in its name clearly says it needs a
>> flush. but it happens in arm64, all other code which needs a flush has
>> called other variants, for example __flush_tlb_page_nosync():
>>
>> static inline void arch_tlbbatch_add_pending(struct
>> arch_tlbflush_unmap_batch *batch,
>>  struct mm_struct *mm, unsigned long uaddr)
>> {
>>  __flush_tlb_page_nosync(mm, uaddr);
>> }
>>
>> so it seems folio_referenced is the only left user of this
>> ptep_clear_flush_young().
>> The fact makes Baolin's patch look safe now.
>>
>> but this function still has "flush" in its name, so one day, one person might
>> call it with the understanding that it will flush tlb but actually it
>> won't. This is
>> bad smell in code.
>>
>> I guess one side effect of not flushing tlb while clearing the access
>> flag is that
>> hardware won't see this cleared flag in the tlb, so it might not set this bit in
>> memory even though the bit has been cleared before in memory(but not in tlb)
>> while the page is accessed *again*.
>>
>> next time, someone reads the access flag in memory again by folio_referenced,
>> he/she will see the page is cold as hardware has lost a chance to set
>> the bit again
>> since it finds tlb already has a true access flag.
>>
>> But anyway, tlb is so small, it will be flushed by context switch and
>> other running
>> code soon. so it seems we don't actually require the access flag being instantly
>> updated. the time gap, in which access flag might lose the new set by hardware,
>> seems to be too short to really affect the accuracy of page reclamation. but its
>> cost is large.
>>
>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>> page,  B might
>> be a correct choice.
>
> Plus, I doubt B is really going to happen. as after a page is promoted to
> the head of lru list or new generation, it needs a long time to slide back
> to the inactive list tail or to the candidate generation of mglru. the time
> should have been large enough for tlb to be flushed. If the page is really
> hot, the hardware will get second, third, fourth etc opportunity to set an
> access flag in the long time in which the page is re-moved to the tail
> as the page can be accessed multiple times if it is really hot.

This might not be true if you have external hardware sharing the page
tables with software through either HMM or hardware supported ATS
though.

In those cases I think it's much more likely hardware can still be
accessing the page even after a context switch on the CPU say. So those
pages will tend to get reclaimed even though hardware is still actively
using them which would be quite expensive and I guess could lead to
thrashing as each page is reclaimed and then immediately faulted back
in.

Of course TLB flushes are equally (perhaps even more) expensive for this
kind of external HW so reducing them would still be beneficial. I wonder
if there's some way they could be deferred until the page is moved to
the inactive list say?

>>
>> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> > ---
>> >  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>> >  1 file changed, 16 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> > index 0bd18de9fd97..2979d796ba9d 100644
>> > --- a/arch/arm64/include/asm/pgtable.h
>> > +++ b/arch/arm64/include/asm/pgtable.h
>> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> >  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> >                                          unsigned long address, pte_t *ptep)
>> >  {
>> > -       int young = ptep_test_and_clear_young(vma, address, ptep);
>> > -
>> > -       if (young) {
>> > -               /*
>> > -                * We can elide the trailing DSB here since the worst that can
>> > -                * happen is that a CPU continues to use the young entry in its
>> > -                * TLB and we mistakenly reclaim the associated page. The
>> > -                * window for such an event is bounded by the next
>> > -                * context-switch, which provides a DSB to complete the TLB
>> > -                * invalidation.
>> > -                */
>> > -               flush_tlb_page_nosync(vma, address);
>> > -       }
>> > -
>> > -       return young;
>> > +       /*
>> > +        * This comment is borrowed from x86, but applies equally to ARM64:
>> > +        *
>> > +        * Clearing the accessed bit without a TLB flush doesn't cause
>> > +        * data corruption. [ It could cause incorrect page aging and
>> > +        * the (mistaken) reclaim of hot pages, but the chance of that
>> > +        * should be relatively low. ]
>> > +        *
>> > +        * So as a performance optimization don't flush the TLB when
>> > +        * clearing the accessed bit, it will eventually be flushed by
>> > +        * a context switch or a VM operation anyway. [ In the rare
>> > +        * event of it not getting flushed for a long time the delay
>> > +        * shouldn't really matter because there's no real memory
>> > +        * pressure for swapout to react to. ]
>> > +        */
>> > +       return ptep_test_and_clear_young(vma, address, ptep);
>> >  }
>> >
>> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> > --
>> > 2.39.3
>> >
>>
>> Thanks
>> Barry
  
Yin Fengwei Oct. 25, 2023, 1:39 a.m. UTC | #6
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0bd18de9fd97..2979d796ba9d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>                                          unsigned long address, pte_t *ptep)
>>  {
>> -       int young = ptep_test_and_clear_young(vma, address, ptep);
>> -
>> -       if (young) {
>> -               /*
>> -                * We can elide the trailing DSB here since the worst that can
>> -                * happen is that a CPU continues to use the young entry in its
>> -                * TLB and we mistakenly reclaim the associated page. The
>> -                * window for such an event is bounded by the next
>> -                * context-switch, which provides a DSB to complete the TLB
>> -                * invalidation.
>> -                */
>> -               flush_tlb_page_nosync(vma, address);
>> -       }
>> -
>> -       return young;
>> +       /*
>> +        * This comment is borrowed from x86, but applies equally to ARM64:
>> +        *
>> +        * Clearing the accessed bit without a TLB flush doesn't cause
>> +        * data corruption. [ It could cause incorrect page aging and
>> +        * the (mistaken) reclaim of hot pages, but the chance of that
>> +        * should be relatively low. ]
>> +        *
>> +        * So as a performance optimization don't flush the TLB when
>> +        * clearing the accessed bit, it will eventually be flushed by
>> +        * a context switch or a VM operation anyway. [ In the rare
>> +        * event of it not getting flushed for a long time the delay
>> +        * shouldn't really matter because there's no real memory
>> +        * pressure for swapout to react to. ]
>> +        */
>> +       return ptep_test_and_clear_young(vma, address, ptep);
>>  }
From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/:

This is blindly copied from x86 and isn't true for us: we don't invalidate
the TLB on context switch. That means our window for keeping the stale
entries around is potentially much bigger and might not be a great idea.


My understanding is that arm64 doesn't do invalidate the TLB during
context switch. The flush_tlb_page_nosync() here + DSB during context
switch make sure the TLB is invalidated during context switch.
So we can't remove flush_tlb_page_nosync() here? Or something was changed
for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
may be missed)? Thanks.


Regards
Yin, Fengwei
  
Baolin Wang Oct. 25, 2023, 1:44 a.m. UTC | #7
On 10/24/2023 9:48 PM, Kefeng Wang wrote:
> 
> 
> On 2023/10/24 20:56, Baolin Wang wrote:
>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>> check if the folio was referenced, and now it will call a tlb flush on
>> ARM64 architecture. However the tlb flush can be expensive on ARM64
>> servers, especially for the systems with a large CPU numbers.
>>
>> Similar to the x86 architecture, below comments also apply equally to
>> ARM64 architecture. So we can drop the tlb flush operation in
>> ptep_clear_flush_young() on ARM64 architecture to improve the 
>> performance.
>> "
>> /* Clearing the accessed bit without a TLB flush
>>   * doesn't cause data corruption. [ It could cause incorrect
>>   * page aging and the (mistaken) reclaim of hot pages, but the
>>   * chance of that should be relatively low. ]
>>   *
>>   * So as a performance optimization don't flush the TLB when
>>   * clearing the accessed bit, it will eventually be flushed by
>>   * a context switch or a VM operation anyway. [ In the rare
>>   * event of it not getting flushed for a long time the delay
>>   * shouldn't really matter because there's no real memory
>>   * pressure for swapout to react to. ]
>>   */
>> "
>> Running the thpscale to show some obvious improvements for compaction
>> latency with this patch:
>>                               base                   patched
>> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>>                         base       patched
>> Duration User         167.78      161.03
>> Duration System      1836.66     1673.01
>> Duration Elapsed     2074.58     2059.75
>>
>> Barry Song submitted a similar patch [1] before, that replaces the
>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>> folio_referenced_one(). However, I'm not sure if removing the tlb flush
>> operation is applicable to every architecture in kernel, so dropping
>> the tlb flush for ARM64 seems a sensible change.
> 
> At least x86/s390/riscv/powerpc already do it, also I think we could

Right.

> change pmdp_clear_flush_young_notify() too, since it is same with
> ptep_clear_flush_young_notify(),

Perhaps yes, but I'm still unsure if removing tlb flush for PMD entry is 
applicable to all architectures. Let's see the discussion in this 
thread. Thanks.
  
Barry Song Oct. 25, 2023, 1:44 a.m. UTC | #8
On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> <baolin.wang@linux.alibaba.com> wrote:
> >> >
> >> > Now ptep_clear_flush_young() is only called by folio_referenced() to
> >> > check if the folio was referenced, and now it will call a tlb flush on
> >> > ARM64 architecture. However the tlb flush can be expensive on ARM64
> >> > servers, especially for the systems with a large CPU numbers.
> >> >
> >> > Similar to the x86 architecture, below comments also apply equally to
> >> > ARM64 architecture. So we can drop the tlb flush operation in
> >> > ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> >> > "
> >> > /* Clearing the accessed bit without a TLB flush
> >> >  * doesn't cause data corruption. [ It could cause incorrect
> >> >  * page aging and the (mistaken) reclaim of hot pages, but the
> >> >  * chance of that should be relatively low. ]
> >> >  *
> >> >  * So as a performance optimization don't flush the TLB when
> >> >  * clearing the accessed bit, it will eventually be flushed by
> >> >  * a context switch or a VM operation anyway. [ In the rare
> >> >  * event of it not getting flushed for a long time the delay
> >> >  * shouldn't really matter because there's no real memory
> >> >  * pressure for swapout to react to. ]
> >> >  */
> >> > "
> >> > Running the thpscale to show some obvious improvements for compaction
> >> > latency with this patch:
> >> >                              base                   patched
> >> > Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> >> > Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> >> > Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> >> > Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> >> > Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> >> > Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> >> > Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> >> > Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> >> > Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
> >> >                        base       patched
> >> > Duration User         167.78      161.03
> >> > Duration System      1836.66     1673.01
> >> > Duration Elapsed     2074.58     2059.75
> >> >
> >> > Barry Song submitted a similar patch [1] before, that replaces the
> >> > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> >> > folio_referenced_one(). However, I'm not sure if removing the tlb flush
> >> > operation is applicable to every architecture in kernel, so dropping
> >> > the tlb flush for ARM64 seems a sensible change.
> >> >
> >> > Note: I am okay for both approach, if someone can help to ensure that
> >> > all architectures do not need the tlb flush when clearing the accessed
> >> > bit, then I also think Barry's patch is better (hope Barry can resend
> >> > his patch).
> >> >
> >>
> >> Thanks!
> >>
> >> ptep_clear_flush_young() with "flush" in its name clearly says it needs a
> >> flush. but it happens in arm64, all other code which needs a flush has
> >> called other variants, for example __flush_tlb_page_nosync():
> >>
> >> static inline void arch_tlbbatch_add_pending(struct
> >> arch_tlbflush_unmap_batch *batch,
> >>  struct mm_struct *mm, unsigned long uaddr)
> >> {
> >>  __flush_tlb_page_nosync(mm, uaddr);
> >> }
> >>
> >> so it seems folio_referenced is the only left user of this
> >> ptep_clear_flush_young().
> >> The fact makes Baolin's patch look safe now.
> >>
> >> but this function still has "flush" in its name, so one day, one person might
> >> call it with the understanding that it will flush tlb but actually it
> >> won't. This is
> >> bad smell in code.
> >>
> >> I guess one side effect of not flushing tlb while clearing the access
> >> flag is that
> >> hardware won't see this cleared flag in the tlb, so it might not set this bit in
> >> memory even though the bit has been cleared before in memory(but not in tlb)
> >> while the page is accessed *again*.
> >>
> >> next time, someone reads the access flag in memory again by folio_referenced,
> >> he/she will see the page is cold as hardware has lost a chance to set
> >> the bit again
> >> since it finds tlb already has a true access flag.
> >>
> >> But anyway, tlb is so small, it will be flushed by context switch and
> >> other running
> >> code soon. so it seems we don't actually require the access flag being instantly
> >> updated. the time gap, in which access flag might lose the new set by hardware,
> >> seems to be too short to really affect the accuracy of page reclamation. but its
> >> cost is large.
> >>
> >> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> page,  B might
> >> be a correct choice.
> >
> > Plus, I doubt B is really going to happen. as after a page is promoted to
> > the head of lru list or new generation, it needs a long time to slide back
> > to the inactive list tail or to the candidate generation of mglru. the time
> > should have been large enough for tlb to be flushed. If the page is really
> > hot, the hardware will get second, third, fourth etc opportunity to set an
> > access flag in the long time in which the page is re-moved to the tail
> > as the page can be accessed multiple times if it is really hot.
>
> This might not be true if you have external hardware sharing the page
> tables with software through either HMM or hardware supported ATS
> though.
>
> In those cases I think it's much more likely hardware can still be
> accessing the page even after a context switch on the CPU say. So those
> pages will tend to get reclaimed even though hardware is still actively
> using them which would be quite expensive and I guess could lead to
> thrashing as each page is reclaimed and then immediately faulted back
> in.

i am not quite sure i got your point. has the external hardware sharing cpu's
pagetable the ability to set access flag in page table entries by
itself? if yes,
I don't see how our approach will hurt as folio_referenced can notify the
hardware driver and the driver can flush its own tlb. If no, i don't see
either as the external hardware can't set access flags, that means we
have ignored its reference and only knows cpu's access even in the current
mainline code. so we are not getting worse.

so the external hardware can also see cpu's TLB? or cpu's tlb flush can
also broadcast to external hardware, then external hardware sees the
cleared access flag, thus, it can set access flag in page table when the
hardware access it?  If this is the case, I feel what you said is true.

>
> Of course TLB flushes are equally (perhaps even more) expensive for this
> kind of external HW so reducing them would still be beneficial. I wonder
> if there's some way they could be deferred until the page is moved to
> the inactive list say?
>
> >>
> >> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> >> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> > ---
> >> >  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
> >> >  1 file changed, 16 insertions(+), 15 deletions(-)
> >> >
> >> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> > index 0bd18de9fd97..2979d796ba9d 100644
> >> > --- a/arch/arm64/include/asm/pgtable.h
> >> > +++ b/arch/arm64/include/asm/pgtable.h
> >> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >> >  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >> >                                          unsigned long address, pte_t *ptep)
> >> >  {
> >> > -       int young = ptep_test_and_clear_young(vma, address, ptep);
> >> > -
> >> > -       if (young) {
> >> > -               /*
> >> > -                * We can elide the trailing DSB here since the worst that can
> >> > -                * happen is that a CPU continues to use the young entry in its
> >> > -                * TLB and we mistakenly reclaim the associated page. The
> >> > -                * window for such an event is bounded by the next
> >> > -                * context-switch, which provides a DSB to complete the TLB
> >> > -                * invalidation.
> >> > -                */
> >> > -               flush_tlb_page_nosync(vma, address);
> >> > -       }
> >> > -
> >> > -       return young;
> >> > +       /*
> >> > +        * This comment is borrowed from x86, but applies equally to ARM64:
> >> > +        *
> >> > +        * Clearing the accessed bit without a TLB flush doesn't cause
> >> > +        * data corruption. [ It could cause incorrect page aging and
> >> > +        * the (mistaken) reclaim of hot pages, but the chance of that
> >> > +        * should be relatively low. ]
> >> > +        *
> >> > +        * So as a performance optimization don't flush the TLB when
> >> > +        * clearing the accessed bit, it will eventually be flushed by
> >> > +        * a context switch or a VM operation anyway. [ In the rare
> >> > +        * event of it not getting flushed for a long time the delay
> >> > +        * shouldn't really matter because there's no real memory
> >> > +        * pressure for swapout to react to. ]
> >> > +        */
> >> > +       return ptep_test_and_clear_young(vma, address, ptep);
> >> >  }
> >> >
> >> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> > --
> >> > 2.39.3
> >> >
> >>
> >> Thanks
> >> Barry
>
Thanks
Barry
  
Alistair Popple Oct. 25, 2023, 1:58 a.m. UTC | #9
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>> >>
>> >> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>> >> <baolin.wang@linux.alibaba.com> wrote:

[...]

>> >> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>> >> page,  B might
>> >> be a correct choice.
>> >
>> > Plus, I doubt B is really going to happen. as after a page is promoted to
>> > the head of lru list or new generation, it needs a long time to slide back
>> > to the inactive list tail or to the candidate generation of mglru. the time
>> > should have been large enough for tlb to be flushed. If the page is really
>> > hot, the hardware will get second, third, fourth etc opportunity to set an
>> > access flag in the long time in which the page is re-moved to the tail
>> > as the page can be accessed multiple times if it is really hot.
>>
>> This might not be true if you have external hardware sharing the page
>> tables with software through either HMM or hardware supported ATS
>> though.
>>
>> In those cases I think it's much more likely hardware can still be
>> accessing the page even after a context switch on the CPU say. So those
>> pages will tend to get reclaimed even though hardware is still actively
>> using them which would be quite expensive and I guess could lead to
>> thrashing as each page is reclaimed and then immediately faulted back
>> in.
>
> i am not quite sure i got your point. has the external hardware sharing cpu's
> pagetable the ability to set access flag in page table entries by
> itself? if yes,
> I don't see how our approach will hurt as folio_referenced can notify the
> hardware driver and the driver can flush its own tlb. If no, i don't see
> either as the external hardware can't set access flags, that means we
> have ignored its reference and only knows cpu's access even in the current
> mainline code. so we are not getting worse.
>
> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> also broadcast to external hardware, then external hardware sees the
> cleared access flag, thus, it can set access flag in page table when the
> hardware access it?  If this is the case, I feel what you said is true.

Perhaps it would help if I gave a concrete example. Take for example the
ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
two ways depending on the specific HW implementation.

If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
invalidations. If BTM is not supported it relies on SW to explicitly
forward TLB invalidations via MMU notifiers.

Now consider the case where some external device is accessing mappings
via the SMMU. The access flag will be cached in the SMMU TLB. If we
clear the access flag without a TLB invalidate the access flag in the
CPU page table will not get updated because it's already set in the SMMU
TLB.

As an aside access flag updates happen in one of two ways. If the SMMU
HW supports hardware translation table updates (HTTU) then hardware will
manage updating access/dirty flags as required. If this is not supported
then SW is relied on to update these flags which in practice means
taking a minor fault. But I don't think that is relevant here - in
either case without a TLB invalidate neither of those things will
happen.

I suppose drivers could implement the clear_flush_young() MMU notifier
callback (none do at the moment AFAICT) but then won't that just lead to
the opposite problem - that every page ever used by an external device
remains active and unavailable for reclaim because the access flag never
gets cleared? I suppose they could do the flush then which would ensure
the page is marked inactive if it's not referenced between the two
folio_referenced calls().

But that requires changes to those drivers. SMMU from memory doesn't
even register for notifiers if BTM is supported.

 - Alistair

>>
>> Of course TLB flushes are equally (perhaps even more) expensive for this
>> kind of external HW so reducing them would still be beneficial. I wonder
>> if there's some way they could be deferred until the page is moved to
>> the inactive list say?
>>
>> >>
>> >> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>> >> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> >> > ---
>> >> >  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>> >> >  1 file changed, 16 insertions(+), 15 deletions(-)
>> >> >
>> >> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> >> > index 0bd18de9fd97..2979d796ba9d 100644
>> >> > --- a/arch/arm64/include/asm/pgtable.h
>> >> > +++ b/arch/arm64/include/asm/pgtable.h
>> >> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> >> >  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> >> >                                          unsigned long address, pte_t *ptep)
>> >> >  {
>> >> > -       int young = ptep_test_and_clear_young(vma, address, ptep);
>> >> > -
>> >> > -       if (young) {
>> >> > -               /*
>> >> > -                * We can elide the trailing DSB here since the worst that can
>> >> > -                * happen is that a CPU continues to use the young entry in its
>> >> > -                * TLB and we mistakenly reclaim the associated page. The
>> >> > -                * window for such an event is bounded by the next
>> >> > -                * context-switch, which provides a DSB to complete the TLB
>> >> > -                * invalidation.
>> >> > -                */
>> >> > -               flush_tlb_page_nosync(vma, address);
>> >> > -       }
>> >> > -
>> >> > -       return young;
>> >> > +       /*
>> >> > +        * This comment is borrowed from x86, but applies equally to ARM64:
>> >> > +        *
>> >> > +        * Clearing the accessed bit without a TLB flush doesn't cause
>> >> > +        * data corruption. [ It could cause incorrect page aging and
>> >> > +        * the (mistaken) reclaim of hot pages, but the chance of that
>> >> > +        * should be relatively low. ]
>> >> > +        *
>> >> > +        * So as a performance optimization don't flush the TLB when
>> >> > +        * clearing the accessed bit, it will eventually be flushed by
>> >> > +        * a context switch or a VM operation anyway. [ In the rare
>> >> > +        * event of it not getting flushed for a long time the delay
>> >> > +        * shouldn't really matter because there's no real memory
>> >> > +        * pressure for swapout to react to. ]
>> >> > +        */
>> >> > +       return ptep_test_and_clear_young(vma, address, ptep);
>> >> >  }
>> >> >
>> >> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> >> > --
>> >> > 2.39.3
>> >> >
>> >>
>> >> Thanks
>> >> Barry
>>
> Thanks
> Barry
  
Baolin Wang Oct. 25, 2023, 2:02 a.m. UTC | #10
On 10/25/2023 7:31 AM, Barry Song wrote:
> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>>> check if the folio was referenced, and now it will call a tlb flush on
>>> ARM64 architecture. However the tlb flush can be expensive on ARM64
>>> servers, especially for the systems with a large CPU numbers.
>>>
>>> Similar to the x86 architecture, below comments also apply equally to
>>> ARM64 architecture. So we can drop the tlb flush operation in
>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
>>> "
>>> /* Clearing the accessed bit without a TLB flush
>>>   * doesn't cause data corruption. [ It could cause incorrect
>>>   * page aging and the (mistaken) reclaim of hot pages, but the
>>>   * chance of that should be relatively low. ]
>>>   *
>>>   * So as a performance optimization don't flush the TLB when
>>>   * clearing the accessed bit, it will eventually be flushed by
>>>   * a context switch or a VM operation anyway. [ In the rare
>>>   * event of it not getting flushed for a long time the delay
>>>   * shouldn't really matter because there's no real memory
>>>   * pressure for swapout to react to. ]
>>>   */
>>> "
>>> Running the thpscale to show some obvious improvements for compaction
>>> latency with this patch:
>>>                               base                   patched
>>> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>>> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>>> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>>> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>>> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>>> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>>> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>>> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>>> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>>>                         base       patched
>>> Duration User         167.78      161.03
>>> Duration System      1836.66     1673.01
>>> Duration Elapsed     2074.58     2059.75
>>>
>>> Barry Song submitted a similar patch [1] before, that replaces the
>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush
>>> operation is applicable to every architecture in kernel, so dropping
>>> the tlb flush for ARM64 seems a sensible change.
>>>
>>> Note: I am okay for both approach, if someone can help to ensure that
>>> all architectures do not need the tlb flush when clearing the accessed
>>> bit, then I also think Barry's patch is better (hope Barry can resend
>>> his patch).
>>>
>>
>> Thanks!
>>
>> ptep_clear_flush_young() with "flush" in its name clearly says it needs a
>> flush. but it happens in arm64, all other code which needs a flush has
>> called other variants, for example __flush_tlb_page_nosync():
>>
>> static inline void arch_tlbbatch_add_pending(struct
>> arch_tlbflush_unmap_batch *batch,
>>   struct mm_struct *mm, unsigned long uaddr)
>> {
>>   __flush_tlb_page_nosync(mm, uaddr);
>> }
>>
>> so it seems folio_referenced is the only left user of this
>> ptep_clear_flush_young().
>> The fact makes Baolin's patch look safe now.
>>
>> but this function still has "flush" in its name, so one day, one person might
>> call it with the understanding that it will flush tlb but actually it
>> won't. This is
>> bad smell in code.

Agree. I think this is jsut a start, we can replace 
ptep_clear_flush_young() once other architectures have completed the 
conversion, if we can confirm that other architectures also do not 
require tlb flush when clearing the accessed bit.

>> I guess one side effect of not flushing tlb while clearing the access
>> flag is that
>> hardware won't see this cleared flag in the tlb, so it might not set this bit in
>> memory even though the bit has been cleared before in memory(but not in tlb)
>> while the page is accessed *again*. >>
>> next time, someone reads the access flag in memory again by folio_referenced,
>> he/she will see the page is cold as hardware has lost a chance to set
>> the bit again
>> since it finds tlb already has a true access flag.
>>
>> But anyway, tlb is so small, it will be flushed by context switch and
>> other running
>> code soon. so it seems we don't actually require the access flag being instantly
>> updated. the time gap, in which access flag might lose the new set by hardware,
>> seems to be too short to really affect the accuracy of page reclamation. but its
>> cost is large.
>>
>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>> page,  B might
>> be a correct choice.
> 
> Plus, I doubt B is really going to happen. as after a page is promoted to
> the head of lru list or new generation, it needs a long time to slide back
> to the inactive list tail or to the candidate generation of mglru. the time
> should have been large enough for tlb to be flushed. If the page is really
> hot, the hardware will get second, third, fourth etc opportunity to set an
> access flag in the long time in which the page is re-moved to the tail
> as the page can be accessed multiple times if it is really hot.

Thanks Barry, that's also what I thought. On the other hand, even if 
there is no tlb flush for a long time, I think the system is not under 
memory pressure at that time, so the incorrect page aging would not have 
much impact.
  
Baolin Wang Oct. 25, 2023, 2:43 a.m. UTC | #11
On 10/25/2023 9:58 AM, Alistair Popple wrote:
> 
> Barry Song <21cnbao@gmail.com> writes:
> 
>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
>>>
>>>
>>> Barry Song <21cnbao@gmail.com> writes:
>>>
>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>
>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
> 
> [...]
> 
>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>>>>> page,  B might
>>>>> be a correct choice.
>>>>
>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
>>>> the head of lru list or new generation, it needs a long time to slide back
>>>> to the inactive list tail or to the candidate generation of mglru. the time
>>>> should have been large enough for tlb to be flushed. If the page is really
>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
>>>> access flag in the long time in which the page is re-moved to the tail
>>>> as the page can be accessed multiple times if it is really hot.
>>>
>>> This might not be true if you have external hardware sharing the page
>>> tables with software through either HMM or hardware supported ATS
>>> though.
>>>
>>> In those cases I think it's much more likely hardware can still be
>>> accessing the page even after a context switch on the CPU say. So those
>>> pages will tend to get reclaimed even though hardware is still actively
>>> using them which would be quite expensive and I guess could lead to
>>> thrashing as each page is reclaimed and then immediately faulted back
>>> in.

That's possible, but the chance should be relatively low. At least on 
x86, I have not heard of this issue.

>> i am not quite sure i got your point. has the external hardware sharing cpu's
>> pagetable the ability to set access flag in page table entries by
>> itself? if yes,
>> I don't see how our approach will hurt as folio_referenced can notify the
>> hardware driver and the driver can flush its own tlb. If no, i don't see
>> either as the external hardware can't set access flags, that means we
>> have ignored its reference and only knows cpu's access even in the current
>> mainline code. so we are not getting worse.
>>
>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
>> also broadcast to external hardware, then external hardware sees the
>> cleared access flag, thus, it can set access flag in page table when the
>> hardware access it?  If this is the case, I feel what you said is true.
> 
> Perhaps it would help if I gave a concrete example. Take for example the
> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> two ways depending on the specific HW implementation.
> 
> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> invalidations. If BTM is not supported it relies on SW to explicitly
> forward TLB invalidations via MMU notifiers.

On our ARM64 hardware, we rely on BTM to maintain TLB coherency.

> Now consider the case where some external device is accessing mappings
> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> clear the access flag without a TLB invalidate the access flag in the
> CPU page table will not get updated because it's already set in the SMMU
> TLB.
> 
> As an aside access flag updates happen in one of two ways. If the SMMU
> HW supports hardware translation table updates (HTTU) then hardware will
> manage updating access/dirty flags as required. If this is not supported
> then SW is relied on to update these flags which in practice means
> taking a minor fault. But I don't think that is relevant here - in
> either case without a TLB invalidate neither of those things will
> happen.
> 
> I suppose drivers could implement the clear_flush_young() MMU notifier
> callback (none do at the moment AFAICT) but then won't that just lead to
> the opposite problem - that every page ever used by an external device
> remains active and unavailable for reclaim because the access flag never
> gets cleared? I suppose they could do the flush then which would ensure

Yes, I think so too. The reason there is currently no problem, perhaps I 
think, there are no actual use cases at the moment? At least on our 
Alibaba's fleet, SMMU and MMU do not share page tables now.

> the page is marked inactive if it's not referenced between the two
> folio_referenced calls().
> 
> But that requires changes to those drivers. SMMU from memory doesn't
> even register for notifiers if BTM is supported.
> 
>   - Alistair
> 
>>>
>>> Of course TLB flushes are equally (perhaps even more) expensive for this
>>> kind of external HW so reducing them would still be beneficial. I wonder
>>> if there's some way they could be deferred until the page is moved to
>>> the inactive list say?
>>>
>>>>>
>>>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>>   arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>>>>   1 file changed, 16 insertions(+), 15 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>                                           unsigned long address, pte_t *ptep)
>>>>>>   {
>>>>>> -       int young = ptep_test_and_clear_young(vma, address, ptep);
>>>>>> -
>>>>>> -       if (young) {
>>>>>> -               /*
>>>>>> -                * We can elide the trailing DSB here since the worst that can
>>>>>> -                * happen is that a CPU continues to use the young entry in its
>>>>>> -                * TLB and we mistakenly reclaim the associated page. The
>>>>>> -                * window for such an event is bounded by the next
>>>>>> -                * context-switch, which provides a DSB to complete the TLB
>>>>>> -                * invalidation.
>>>>>> -                */
>>>>>> -               flush_tlb_page_nosync(vma, address);
>>>>>> -       }
>>>>>> -
>>>>>> -       return young;
>>>>>> +       /*
>>>>>> +        * This comment is borrowed from x86, but applies equally to ARM64:
>>>>>> +        *
>>>>>> +        * Clearing the accessed bit without a TLB flush doesn't cause
>>>>>> +        * data corruption. [ It could cause incorrect page aging and
>>>>>> +        * the (mistaken) reclaim of hot pages, but the chance of that
>>>>>> +        * should be relatively low. ]
>>>>>> +        *
>>>>>> +        * So as a performance optimization don't flush the TLB when
>>>>>> +        * clearing the accessed bit, it will eventually be flushed by
>>>>>> +        * a context switch or a VM operation anyway. [ In the rare
>>>>>> +        * event of it not getting flushed for a long time the delay
>>>>>> +        * shouldn't really matter because there's no real memory
>>>>>> +        * pressure for swapout to react to. ]
>>>>>> +        */
>>>>>> +       return ptep_test_and_clear_young(vma, address, ptep);
>>>>>>   }
>>>>>>
>>>>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>> --
>>>>>> 2.39.3
>>>>>>
>>>>>
>>>>> Thanks
>>>>> Barry
>>>
>> Thanks
>> Barry
  
Baolin Wang Oct. 25, 2023, 3:03 a.m. UTC | #12
On 10/25/2023 9:39 AM, Yin, Fengwei wrote:
> 
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 0bd18de9fd97..2979d796ba9d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>                                           unsigned long address, pte_t *ptep)
>>>   {
>>> -       int young = ptep_test_and_clear_young(vma, address, ptep);
>>> -
>>> -       if (young) {
>>> -               /*
>>> -                * We can elide the trailing DSB here since the worst that can
>>> -                * happen is that a CPU continues to use the young entry in its
>>> -                * TLB and we mistakenly reclaim the associated page. The
>>> -                * window for such an event is bounded by the next
>>> -                * context-switch, which provides a DSB to complete the TLB
>>> -                * invalidation.
>>> -                */
>>> -               flush_tlb_page_nosync(vma, address);
>>> -       }
>>> -
>>> -       return young;
>>> +       /*
>>> +        * This comment is borrowed from x86, but applies equally to ARM64:
>>> +        *
>>> +        * Clearing the accessed bit without a TLB flush doesn't cause
>>> +        * data corruption. [ It could cause incorrect page aging and
>>> +        * the (mistaken) reclaim of hot pages, but the chance of that
>>> +        * should be relatively low. ]
>>> +        *
>>> +        * So as a performance optimization don't flush the TLB when
>>> +        * clearing the accessed bit, it will eventually be flushed by
>>> +        * a context switch or a VM operation anyway. [ In the rare
>>> +        * event of it not getting flushed for a long time the delay
>>> +        * shouldn't really matter because there's no real memory
>>> +        * pressure for swapout to react to. ]
>>> +        */
>>> +       return ptep_test_and_clear_young(vma, address, ptep);
>>>   }
>  From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/:
> 
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.
> 
> 
> My understanding is that arm64 doesn't do invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context

Yes, we only perform a TLB flush when the ASID is exhausted during 
context switch, and I think this is same with x86 IIUC.

> switch make sure the TLB is invalidated during context switch.
> So we can't remove flush_tlb_page_nosync() here? Or something was changed
> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
> may be missed)? Thanks.

IMHO, the tlb can be easily evicted or flushed if the system is under 
memory pressure, so like Barry said, the chance of reclaiming hot page 
is relatively low, at least on X86, we did not see any heavy refault issue.

For MGLRU, it uses ptep_test_and_clear_young() instead of 
ptep_clear_flush_young_notify(), and we did not find any problems until 
now since deploying to ARM servers.
  
Yin Fengwei Oct. 25, 2023, 3:08 a.m. UTC | #13
On 10/25/2023 11:03 AM, Baolin Wang wrote:
>>
>> My understanding is that arm64 doesn't do invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context
> 
> Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is same with x86 IIUC.
If we remove flush_tlb_page_nosync(), can we still claim TLB is flushed during
context switch for ARM64?

> 
>> switch make sure the TLB is invalidated during context switch.
>> So we can't remove flush_tlb_page_nosync() here? Or something was changed
>> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
>> may be missed)? Thanks.
> 
> IMHO, the tlb can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming hot page is relatively low, at least on X86, we did not see any heavy refault issue.
> 
> For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we did not find any problems until now since deploying to ARM servers.
  
Alistair Popple Oct. 25, 2023, 3:09 a.m. UTC | #14
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 10/25/2023 9:58 AM, Alistair Popple wrote:
>> Barry Song <21cnbao@gmail.com> writes:
>> 
>>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
>>>>
>>>>
>>>> Barry Song <21cnbao@gmail.com> writes:
>>>>
>>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>>>>>> <baolin.wang@linux.alibaba.com> wrote:
>> [...]
>> 
>>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>>>>>> page,  B might
>>>>>> be a correct choice.
>>>>>
>>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
>>>>> the head of lru list or new generation, it needs a long time to slide back
>>>>> to the inactive list tail or to the candidate generation of mglru. the time
>>>>> should have been large enough for tlb to be flushed. If the page is really
>>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
>>>>> access flag in the long time in which the page is re-moved to the tail
>>>>> as the page can be accessed multiple times if it is really hot.
>>>>
>>>> This might not be true if you have external hardware sharing the page
>>>> tables with software through either HMM or hardware supported ATS
>>>> though.
>>>>
>>>> In those cases I think it's much more likely hardware can still be
>>>> accessing the page even after a context switch on the CPU say. So those
>>>> pages will tend to get reclaimed even though hardware is still actively
>>>> using them which would be quite expensive and I guess could lead to
>>>> thrashing as each page is reclaimed and then immediately faulted back
>>>> in.
>
> That's possible, but the chance should be relatively low. At least on
> x86, I have not heard of this issue.

Personally I've never seen any x86 system that shares page tables with
external devices, other than with HMM. More on that below.

>>> i am not quite sure i got your point. has the external hardware sharing cpu's
>>> pagetable the ability to set access flag in page table entries by
>>> itself? if yes,
>>> I don't see how our approach will hurt as folio_referenced can notify the
>>> hardware driver and the driver can flush its own tlb. If no, i don't see
>>> either as the external hardware can't set access flags, that means we
>>> have ignored its reference and only knows cpu's access even in the current
>>> mainline code. so we are not getting worse.
>>>
>>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
>>> also broadcast to external hardware, then external hardware sees the
>>> cleared access flag, thus, it can set access flag in page table when the
>>> hardware access it?  If this is the case, I feel what you said is true.
>> Perhaps it would help if I gave a concrete example. Take for example
>> the
>> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
>> two ways depending on the specific HW implementation.
>> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
>> TLB
>> invalidations. If BTM is not supported it relies on SW to explicitly
>> forward TLB invalidations via MMU notifiers.
>
> On our ARM64 hardware, we rely on BTM to maintain TLB coherency.

Lucky you :-)

ARM64 SMMU architecture specification supports the possibilty of both,
as does the driver. Not that I think whether or not BTM is supported has
much relevance to this issue.

>> Now consider the case where some external device is accessing mappings
>> via the SMMU. The access flag will be cached in the SMMU TLB. If we
>> clear the access flag without a TLB invalidate the access flag in the
>> CPU page table will not get updated because it's already set in the SMMU
>> TLB.
>> As an aside access flag updates happen in one of two ways. If the
>> SMMU
>> HW supports hardware translation table updates (HTTU) then hardware will
>> manage updating access/dirty flags as required. If this is not supported
>> then SW is relied on to update these flags which in practice means
>> taking a minor fault. But I don't think that is relevant here - in
>> either case without a TLB invalidate neither of those things will
>> happen.
>> I suppose drivers could implement the clear_flush_young() MMU
>> notifier
>> callback (none do at the moment AFAICT) but then won't that just lead to
>> the opposite problem - that every page ever used by an external device
>> remains active and unavailable for reclaim because the access flag never
>> gets cleared? I suppose they could do the flush then which would ensure
>
> Yes, I think so too. The reason there is currently no problem, perhaps
> I think, there are no actual use cases at the moment? At least on our
> Alibaba's fleet, SMMU and MMU do not share page tables now.

We have systems that do. Also HMM (used by Nouveau and AMD among others)
is a SW implementation of page table sharing and would suffer similar
issues. That said if the flush is already being skipped on x86 then it's
already an issue for HMM. HMM based drivers can at least deal with this
by implementing the clear_flush_young() notifier though. The same
doesn't apply to eg. the SMMU driver.

>> the page is marked inactive if it's not referenced between the two
>> folio_referenced calls().
>> But that requires changes to those drivers. SMMU from memory doesn't
>> even register for notifiers if BTM is supported.
>>   - Alistair
>> 
>>>>
>>>> Of course TLB flushes are equally (perhaps even more) expensive for this
>>>> kind of external HW so reducing them would still be beneficial. I wonder
>>>> if there's some way they could be deferred until the page is moved to
>>>> the inactive list say?
>>>>
>>>>>>
>>>>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> ---
>>>>>>>   arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>>>>>   1 file changed, 16 insertions(+), 15 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>>                                           unsigned long address, pte_t *ptep)
>>>>>>>   {
>>>>>>> -       int young = ptep_test_and_clear_young(vma, address, ptep);
>>>>>>> -
>>>>>>> -       if (young) {
>>>>>>> -               /*
>>>>>>> -                * We can elide the trailing DSB here since the worst that can
>>>>>>> -                * happen is that a CPU continues to use the young entry in its
>>>>>>> -                * TLB and we mistakenly reclaim the associated page. The
>>>>>>> -                * window for such an event is bounded by the next
>>>>>>> -                * context-switch, which provides a DSB to complete the TLB
>>>>>>> -                * invalidation.
>>>>>>> -                */
>>>>>>> -               flush_tlb_page_nosync(vma, address);
>>>>>>> -       }
>>>>>>> -
>>>>>>> -       return young;
>>>>>>> +       /*
>>>>>>> +        * This comment is borrowed from x86, but applies equally to ARM64:
>>>>>>> +        *
>>>>>>> +        * Clearing the accessed bit without a TLB flush doesn't cause
>>>>>>> +        * data corruption. [ It could cause incorrect page aging and
>>>>>>> +        * the (mistaken) reclaim of hot pages, but the chance of that
>>>>>>> +        * should be relatively low. ]
>>>>>>> +        *
>>>>>>> +        * So as a performance optimization don't flush the TLB when
>>>>>>> +        * clearing the accessed bit, it will eventually be flushed by
>>>>>>> +        * a context switch or a VM operation anyway. [ In the rare
>>>>>>> +        * event of it not getting flushed for a long time the delay
>>>>>>> +        * shouldn't really matter because there's no real memory
>>>>>>> +        * pressure for swapout to react to. ]
>>>>>>> +        */
>>>>>>> +       return ptep_test_and_clear_young(vma, address, ptep);
>>>>>>>   }
>>>>>>>
>>>>>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> --
>>>>>>> 2.39.3
>>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Barry
>>>>
>>> Thanks
>>> Barry
  
Baolin Wang Oct. 25, 2023, 3:15 a.m. UTC | #15
On 10/25/2023 11:08 AM, Yin, Fengwei wrote:
> 
> 
> On 10/25/2023 11:03 AM, Baolin Wang wrote:
>>>
>>> My understanding is that arm64 doesn't do invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context
>>
>> Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is same with x86 IIUC.
> If we remove flush_tlb_page_nosync(), can we still claim TLB is flushed during
> context switch for ARM64?

To be more precise, it is necessary to add prerequisite conditions, such 
as when ASID is exhausted. I can update the comments.

>>> switch make sure the TLB is invalidated during context switch.
>>> So we can't remove flush_tlb_page_nosync() here? Or something was changed
>>> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
>>> may be missed)? Thanks.
>>
>> IMHO, the tlb can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming hot page is relatively low, at least on X86, we did not see any heavy refault issue.
>>
>> For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we did not find any problems until now since deploying to ARM servers.
  
Barry Song Oct. 25, 2023, 4:34 a.m. UTC | #16
On Wed, Oct 25, 2023 at 11:09 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 10/25/2023 11:03 AM, Baolin Wang wrote:
> >>
> >> My understanding is that arm64 doesn't do invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context
> >
> > Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is same with x86 IIUC.
> If we remove flush_tlb_page_nosync(), can we still claim TLB is flushed during
> context switch for ARM64?

No. in this case, we will have to depend on hardware replacement of
TLB on other CPUs. Considering LRU's list is
really long and TLB is really small,  a hot page needs a long time to
move-back to inactive tail, other CPUs' TLB is
likely to be replaced somewhere. But we do have a very small chance
other CPUs will lose setting access flags
in some corner cases. now i have more understanding why this nosync
tlb flush is still there though in practical,
removing it seems not to hurt.

>
> >
> >> switch make sure the TLB is invalidated during context switch.
> >> So we can't remove flush_tlb_page_nosync() here? Or something was changed
> >> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
> >> may be missed)? Thanks.
> >
> > IMHO, the tlb can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming hot page is relatively low, at least on X86, we did not see any heavy refault issue.
> >
> > For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we did not find any problems until now since deploying to ARM servers.

Thanks
Barry
  
Yu Zhao Oct. 25, 2023, 6:17 a.m. UTC | #17
On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>
> > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >>>>
> >>>>
> >>>> Barry Song <21cnbao@gmail.com> writes:
> >>>>
> >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>
> >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> [...]
> >>
> >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >>>>>> page,  B might
> >>>>>> be a correct choice.
> >>>>>
> >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >>>>> the head of lru list or new generation, it needs a long time to slide back
> >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >>>>> should have been large enough for tlb to be flushed. If the page is really
> >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >>>>> access flag in the long time in which the page is re-moved to the tail
> >>>>> as the page can be accessed multiple times if it is really hot.
> >>>>
> >>>> This might not be true if you have external hardware sharing the page
> >>>> tables with software through either HMM or hardware supported ATS
> >>>> though.
> >>>>
> >>>> In those cases I think it's much more likely hardware can still be
> >>>> accessing the page even after a context switch on the CPU say. So those
> >>>> pages will tend to get reclaimed even though hardware is still actively
> >>>> using them which would be quite expensive and I guess could lead to
> >>>> thrashing as each page is reclaimed and then immediately faulted back
> >>>> in.
> >
> > That's possible, but the chance should be relatively low. At least on
> > x86, I have not heard of this issue.
>
> Personally I've never seen any x86 system that shares page tables with
> external devices, other than with HMM. More on that below.
>
> >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >>> pagetable the ability to set access flag in page table entries by
> >>> itself? if yes,
> >>> I don't see how our approach will hurt as folio_referenced can notify the
> >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >>> either as the external hardware can't set access flags, that means we
> >>> have ignored its reference and only knows cpu's access even in the current
> >>> mainline code. so we are not getting worse.
> >>>
> >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >>> also broadcast to external hardware, then external hardware sees the
> >>> cleared access flag, thus, it can set access flag in page table when the
> >>> hardware access it?  If this is the case, I feel what you said is true.
> >> Perhaps it would help if I gave a concrete example. Take for example
> >> the
> >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> >> two ways depending on the specific HW implementation.
> >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> TLB
> >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> forward TLB invalidations via MMU notifiers.
> >
> > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
>
> Lucky you :-)
>
> ARM64 SMMU architecture specification supports the possibilty of both,
> as does the driver. Not that I think whether or not BTM is supported has
> much relevance to this issue.
>
> >> Now consider the case where some external device is accessing mappings
> >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> clear the access flag without a TLB invalidate the access flag in the
> >> CPU page table will not get updated because it's already set in the SMMU
> >> TLB.
> >> As an aside access flag updates happen in one of two ways. If the
> >> SMMU
> >> HW supports hardware translation table updates (HTTU) then hardware will
> >> manage updating access/dirty flags as required. If this is not supported
> >> then SW is relied on to update these flags which in practice means
> >> taking a minor fault. But I don't think that is relevant here - in
> >> either case without a TLB invalidate neither of those things will
> >> happen.
> >> I suppose drivers could implement the clear_flush_young() MMU
> >> notifier
> >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> the opposite problem - that every page ever used by an external device
> >> remains active and unavailable for reclaim because the access flag never
> >> gets cleared? I suppose they could do the flush then which would ensure
> >
> > Yes, I think so too. The reason there is currently no problem, perhaps
> > I think, there are no actual use cases at the moment? At least on our
> > Alibaba's fleet, SMMU and MMU do not share page tables now.
>
> We have systems that do.

Just curious: do those systems run the Linux kernel? If so, are pages
shared with SMMU pinned? If not, then how are IO PFs handled after
pages are reclaimed?
  
Barry Song Oct. 25, 2023, 6:27 a.m. UTC | #18
On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
> >
> >
> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >
> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> > >>>>
> > >>>>
> > >>>> Barry Song <21cnbao@gmail.com> writes:
> > >>>>
> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> > >> [...]
> > >>
> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> > >>>>>> page,  B might
> > >>>>>> be a correct choice.
> > >>>>>
> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> > >>>>> access flag in the long time in which the page is re-moved to the tail
> > >>>>> as the page can be accessed multiple times if it is really hot.
> > >>>>
> > >>>> This might not be true if you have external hardware sharing the page
> > >>>> tables with software through either HMM or hardware supported ATS
> > >>>> though.
> > >>>>
> > >>>> In those cases I think it's much more likely hardware can still be
> > >>>> accessing the page even after a context switch on the CPU say. So those
> > >>>> pages will tend to get reclaimed even though hardware is still actively
> > >>>> using them which would be quite expensive and I guess could lead to
> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> > >>>> in.
> > >
> > > That's possible, but the chance should be relatively low. At least on
> > > x86, I have not heard of this issue.
> >
> > Personally I've never seen any x86 system that shares page tables with
> > external devices, other than with HMM. More on that below.
> >
> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> > >>> pagetable the ability to set access flag in page table entries by
> > >>> itself? if yes,
> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> > >>> either as the external hardware can't set access flags, that means we
> > >>> have ignored its reference and only knows cpu's access even in the current
> > >>> mainline code. so we are not getting worse.
> > >>>
> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> > >>> also broadcast to external hardware, then external hardware sees the
> > >>> cleared access flag, thus, it can set access flag in page table when the
> > >>> hardware access it?  If this is the case, I feel what you said is true.
> > >> Perhaps it would help if I gave a concrete example. Take for example
> > >> the
> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> > >> two ways depending on the specific HW implementation.
> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> > >> TLB
> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> > >> forward TLB invalidations via MMU notifiers.
> > >
> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >
> > Lucky you :-)
> >
> > ARM64 SMMU architecture specification supports the possibilty of both,
> > as does the driver. Not that I think whether or not BTM is supported has
> > much relevance to this issue.
> >
> > >> Now consider the case where some external device is accessing mappings
> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> > >> clear the access flag without a TLB invalidate the access flag in the
> > >> CPU page table will not get updated because it's already set in the SMMU
> > >> TLB.
> > >> As an aside access flag updates happen in one of two ways. If the
> > >> SMMU
> > >> HW supports hardware translation table updates (HTTU) then hardware will
> > >> manage updating access/dirty flags as required. If this is not supported
> > >> then SW is relied on to update these flags which in practice means
> > >> taking a minor fault. But I don't think that is relevant here - in
> > >> either case without a TLB invalidate neither of those things will
> > >> happen.
> > >> I suppose drivers could implement the clear_flush_young() MMU
> > >> notifier
> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> > >> the opposite problem - that every page ever used by an external device
> > >> remains active and unavailable for reclaim because the access flag never
> > >> gets cleared? I suppose they could do the flush then which would ensure
> > >
> > > Yes, I think so too. The reason there is currently no problem, perhaps
> > > I think, there are no actual use cases at the moment? At least on our
> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >
> > We have systems that do.
>
> Just curious: do those systems run the Linux kernel? If so, are pages
> shared with SMMU pinned? If not, then how are IO PFs handled after
> pages are reclaimed?

it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
I/O PF, so finally
it runs the same codes to get page back just like CPU's PF.

years ago, we recommended a pin solution, but obviously there were lots of
push backs:
https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/

Thanks
Barry
  
Alistair Popple Oct. 25, 2023, 10:12 a.m. UTC | #19
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
>> >
>> >
>> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>> >
>> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
>> > >> Barry Song <21cnbao@gmail.com> writes:
>> > >>
>> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
>> > >>>>
>> > >>>>
>> > >>>> Barry Song <21cnbao@gmail.com> writes:
>> > >>>>
>> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
>> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
>> > >> [...]
>> > >>
>> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>> > >>>>>> page,  B might
>> > >>>>>> be a correct choice.
>> > >>>>>
>> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
>> > >>>>> the head of lru list or new generation, it needs a long time to slide back
>> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
>> > >>>>> should have been large enough for tlb to be flushed. If the page is really
>> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
>> > >>>>> access flag in the long time in which the page is re-moved to the tail
>> > >>>>> as the page can be accessed multiple times if it is really hot.
>> > >>>>
>> > >>>> This might not be true if you have external hardware sharing the page
>> > >>>> tables with software through either HMM or hardware supported ATS
>> > >>>> though.
>> > >>>>
>> > >>>> In those cases I think it's much more likely hardware can still be
>> > >>>> accessing the page even after a context switch on the CPU say. So those
>> > >>>> pages will tend to get reclaimed even though hardware is still actively
>> > >>>> using them which would be quite expensive and I guess could lead to
>> > >>>> thrashing as each page is reclaimed and then immediately faulted back
>> > >>>> in.
>> > >
>> > > That's possible, but the chance should be relatively low. At least on
>> > > x86, I have not heard of this issue.
>> >
>> > Personally I've never seen any x86 system that shares page tables with
>> > external devices, other than with HMM. More on that below.
>> >
>> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
>> > >>> pagetable the ability to set access flag in page table entries by
>> > >>> itself? if yes,
>> > >>> I don't see how our approach will hurt as folio_referenced can notify the
>> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
>> > >>> either as the external hardware can't set access flags, that means we
>> > >>> have ignored its reference and only knows cpu's access even in the current
>> > >>> mainline code. so we are not getting worse.
>> > >>>
>> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
>> > >>> also broadcast to external hardware, then external hardware sees the
>> > >>> cleared access flag, thus, it can set access flag in page table when the
>> > >>> hardware access it?  If this is the case, I feel what you said is true.
>> > >> Perhaps it would help if I gave a concrete example. Take for example
>> > >> the
>> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
>> > >> two ways depending on the specific HW implementation.
>> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
>> > >> TLB
>> > >> invalidations. If BTM is not supported it relies on SW to explicitly
>> > >> forward TLB invalidations via MMU notifiers.
>> > >
>> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
>> >
>> > Lucky you :-)
>> >
>> > ARM64 SMMU architecture specification supports the possibilty of both,
>> > as does the driver. Not that I think whether or not BTM is supported has
>> > much relevance to this issue.
>> >
>> > >> Now consider the case where some external device is accessing mappings
>> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
>> > >> clear the access flag without a TLB invalidate the access flag in the
>> > >> CPU page table will not get updated because it's already set in the SMMU
>> > >> TLB.
>> > >> As an aside access flag updates happen in one of two ways. If the
>> > >> SMMU
>> > >> HW supports hardware translation table updates (HTTU) then hardware will
>> > >> manage updating access/dirty flags as required. If this is not supported
>> > >> then SW is relied on to update these flags which in practice means
>> > >> taking a minor fault. But I don't think that is relevant here - in
>> > >> either case without a TLB invalidate neither of those things will
>> > >> happen.
>> > >> I suppose drivers could implement the clear_flush_young() MMU
>> > >> notifier
>> > >> callback (none do at the moment AFAICT) but then won't that just lead to
>> > >> the opposite problem - that every page ever used by an external device
>> > >> remains active and unavailable for reclaim because the access flag never
>> > >> gets cleared? I suppose they could do the flush then which would ensure
>> > >
>> > > Yes, I think so too. The reason there is currently no problem, perhaps
>> > > I think, there are no actual use cases at the moment? At least on our
>> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
>> >
>> > We have systems that do.
>>
>> Just curious: do those systems run the Linux kernel? If so, are pages
>> shared with SMMU pinned? If not, then how are IO PFs handled after
>> pages are reclaimed?

Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
and fault handling works as Barry notes below - a driver is notified of
a fault and calls handle_mm_fault() in response.

> it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> I/O PF, so finally
> it runs the same codes to get page back just like CPU's PF.
>
> years ago, we recommended a pin solution, but obviously there were lots of
> push backs:
> https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/

Right. Having to pin pages defeats the whole point of having hardware
that can handle page faults.

> Thanks
> Barry
  
Yu Zhao Oct. 25, 2023, 6:22 p.m. UTC | #20
On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
> >> >
> >> >
> >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> > >>>>>> page,  B might
> >> > >>>>>> be a correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >> > >>>>> access flag in the long time in which the page is re-moved to the tail
> >> > >>>>> as the page can be accessed multiple times if it is really hot.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >> > >>> pagetable the ability to set access flag in page table entries by
> >> > >>> itself? if yes,
> >> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >> > >>> either as the external hardware can't set access flags, that means we
> >> > >>> have ignored its reference and only knows cpu's access even in the current
> >> > >>> mainline code. so we are not getting worse.
> >> > >>>
> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >> > >>> also broadcast to external hardware, then external hardware sees the
> >> > >>> cleared access flag, thus, it can set access flag in page table when the
> >> > >>> hardware access it?  If this is the case, I feel what you said is true.
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the
> >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> > >> TLB
> >> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > ARM64 SMMU architecture specification supports the possibilty of both,
> >> > as does the driver. Not that I think whether or not BTM is supported has
> >> > much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >> As an aside access flag updates happen in one of two ways. If the
> >> > >> SMMU
> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
> >> > >> I suppose drivers could implement the clear_flush_young() MMU
> >> > >> notifier
> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
> >> > > I think, there are no actual use cases at the moment? At least on our
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally
> > it runs the same codes to get page back just like CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were lots of
> > push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Thanks. How would a DMA transaction be retried after the kernel
resolves an IO PF? I.e., does the h/w (PCIe spec, etc.) support auto
retries or is the s/w responsible for doing so? At least when I worked
on the PCI subsystem, I didn't know any device that was capable of
doing auto retries. (Pasha and I will have a talk on IOMMU at the
coming LPC, so this might be an interesting intersection between IOMMU
and MM to discuss.)
  
Alistair Popple Oct. 25, 2023, 11:32 p.m. UTC | #21
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple <apopple@nvidia.com> wrote:
>> >> > >> Now consider the case where some external device is accessing mappings
>> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
>> >> > >> clear the access flag without a TLB invalidate the access flag in the
>> >> > >> CPU page table will not get updated because it's already set in the SMMU
>> >> > >> TLB.
>> >> > >> As an aside access flag updates happen in one of two ways. If the
>> >> > >> SMMU
>> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
>> >> > >> manage updating access/dirty flags as required. If this is not supported
>> >> > >> then SW is relied on to update these flags which in practice means
>> >> > >> taking a minor fault. But I don't think that is relevant here - in
>> >> > >> either case without a TLB invalidate neither of those things will
>> >> > >> happen.
>> >> > >> I suppose drivers could implement the clear_flush_young() MMU
>> >> > >> notifier
>> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
>> >> > >> the opposite problem - that every page ever used by an external device
>> >> > >> remains active and unavailable for reclaim because the access flag never
>> >> > >> gets cleared? I suppose they could do the flush then which would ensure
>> >> > >
>> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
>> >> > > I think, there are no actual use cases at the moment? At least on our
>> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
>> >> >
>> >> > We have systems that do.
>> >>
>> >> Just curious: do those systems run the Linux kernel? If so, are pages
>> >> shared with SMMU pinned? If not, then how are IO PFs handled after
>> >> pages are reclaimed?
>>
>> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
>> and fault handling works as Barry notes below - a driver is notified of
>> a fault and calls handle_mm_fault() in response.
>>
>> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
>> > I/O PF, so finally
>> > it runs the same codes to get page back just like CPU's PF.
>> >
>> > years ago, we recommended a pin solution, but obviously there were lots of
>> > push backs:
>> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>>
>> Right. Having to pin pages defeats the whole point of having hardware
>> that can handle page faults.
>
> Thanks. How would a DMA transaction be retried after the kernel
> resolves an IO PF? I.e., does the h/w (PCIe spec, etc.) support auto
> retries or is the s/w responsible for doing so? At least when I worked
> on the PCI subsystem, I didn't know any device that was capable of
> doing auto retries. (Pasha and I will have a talk on IOMMU at the
> coming LPC, so this might be an interesting intersection between IOMMU
> and MM to discuss.)

Generally what happens if a device encounters a page fault is that it
notifies the kernel or driver (eg. via an interrupt) that it has
faulted. It is then up to SW to resolve the fault and tell HW to retry
the translation request once SW thinks the fault is resolved. I'm not
aware of HW that does automatic retries (although I'm a little unclear
what exactly is meant by automatic retry).

In the case of an IOMMU faulting (eg. SMMU on ARM) on a DMA access I
believe it stalls the transaction and SW is responsible for processing
the fault and signalling that the translation should be retried.

It's also possible for the device itself to detect a fault prior to
issuing a DMA request if it's using something like PCIe page request
services. Note my experience with this is more with non-PCIe devices
that are coherently attached, but the concepts are all much the same as
they all channel through the same IOMMU.

Unfortunately it doesn't look I will be at LPC this year otherwise it
would have been good to discuss. Happy to continue the discussion here
or via some other channel though. Hopefully I will be able to see your
talk online.
  
Anshuman Khandual Oct. 26, 2023, 4:55 a.m. UTC | #22
On 10/24/23 18:26, Baolin Wang wrote:
> Now ptep_clear_flush_young() is only called by folio_referenced() to
> check if the folio was referenced, and now it will call a tlb flush on
> ARM64 architecture. However the tlb flush can be expensive on ARM64
> servers, especially for the systems with a large CPU numbers.

TLB flush would be expensive on *any* platform with large CPU numbers ?

> 
> Similar to the x86 architecture, below comments also apply equally to
> ARM64 architecture. So we can drop the tlb flush operation in
> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> "
> /* Clearing the accessed bit without a TLB flush
>  * doesn't cause data corruption. [ It could cause incorrect
>  * page aging and the (mistaken) reclaim of hot pages, but the
>  * chance of that should be relatively low. ]
>  *
>  * So as a performance optimization don't flush the TLB when
>  * clearing the accessed bit, it will eventually be flushed by
>  * a context switch or a VM operation anyway. [ In the rare
>  * event of it not getting flushed for a long time the delay
>  * shouldn't really matter because there's no real memory
>  * pressure for swapout to react to. ]
>  */

If always true, this sounds generic enough for all platforms, why only
x86 and arm64 ?

> "
> Running the thpscale to show some obvious improvements for compaction
> latency with this patch:
>                              base                   patched
> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>                        base       patched
> Duration User         167.78      161.03
> Duration System      1836.66     1673.01
> Duration Elapsed     2074.58     2059.75

Could you please point to the test repo you are running ?

> 
> Barry Song submitted a similar patch [1] before, that replaces the
> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> folio_referenced_one(). However, I'm not sure if removing the tlb flush
> operation is applicable to every architecture in kernel, so dropping
> the tlb flush for ARM64 seems a sensible change.

The reasoning provided here sounds generic when true, hence there seems
to be no justification to keep it limited just for arm64 and x86. Also
what about pmdp_clear_flush_young_notify() when THP is enabled. Should
that also not do a TLB flush after clearing access bit ? Although arm64
does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on
the generic pmdp_clear_flush_young() which also does a TLB flush via
flush_pmd_tlb_range() while clearing the access bit.

> 
> Note: I am okay for both approach, if someone can help to ensure that
> all architectures do not need the tlb flush when clearing the accessed
> bit, then I also think Barry's patch is better (hope Barry can resend
> his patch).

This paragraph belongs after the '----' below and not part of the commit
message.

> 
> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>  1 file changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0bd18de9fd97..2979d796ba9d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  					 unsigned long address, pte_t *ptep)
>  {
> -	int young = ptep_test_and_clear_young(vma, address, ptep);
> -
> -	if (young) {
> -		/*
> -		 * We can elide the trailing DSB here since the worst that can
> -		 * happen is that a CPU continues to use the young entry in its
> -		 * TLB and we mistakenly reclaim the associated page. The
> -		 * window for such an event is bounded by the next
> -		 * context-switch, which provides a DSB to complete the TLB
> -		 * invalidation.
> -		 */
> -		flush_tlb_page_nosync(vma, address);
> -	}
> -
> -	return young;
> +	/*
> +	 * This comment is borrowed from x86, but applies equally to ARM64:
> +	 *
> +	 * Clearing the accessed bit without a TLB flush doesn't cause
> +	 * data corruption. [ It could cause incorrect page aging and
> +	 * the (mistaken) reclaim of hot pages, but the chance of that
> +	 * should be relatively low. ]
> +	 *
> +	 * So as a performance optimization don't flush the TLB when
> +	 * clearing the accessed bit, it will eventually be flushed by
> +	 * a context switch or a VM operation anyway. [ In the rare
> +	 * event of it not getting flushed for a long time the delay
> +	 * shouldn't really matter because there's no real memory
> +	 * pressure for swapout to react to. ]
> +	 */
> +	return ptep_test_and_clear_young(vma, address, ptep);
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE

There are three distinct concerns here

1) What are the chances of this misleading existing hot page reclaim process
2) How secondary MMU such as SMMU adapt to change in mappings without a flush
3) Could this break the architecture rule requiring a TLB flush after access
   bit clear on a page table entry
  
Barry Song Oct. 26, 2023, 5:54 a.m. UTC | #23
On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 10/24/23 18:26, Baolin Wang wrote:
> > Now ptep_clear_flush_young() is only called by folio_referenced() to
> > check if the folio was referenced, and now it will call a tlb flush on
> > ARM64 architecture. However the tlb flush can be expensive on ARM64
> > servers, especially for the systems with a large CPU numbers.
>
> TLB flush would be expensive on *any* platform with large CPU numbers ?
>
> >
> > Similar to the x86 architecture, below comments also apply equally to
> > ARM64 architecture. So we can drop the tlb flush operation in
> > ptep_clear_flush_young() on ARM64 architecture to improve the performance.
> > "
> > /* Clearing the accessed bit without a TLB flush
> >  * doesn't cause data corruption. [ It could cause incorrect
> >  * page aging and the (mistaken) reclaim of hot pages, but the
> >  * chance of that should be relatively low. ]
> >  *
> >  * So as a performance optimization don't flush the TLB when
> >  * clearing the accessed bit, it will eventually be flushed by
> >  * a context switch or a VM operation anyway. [ In the rare
> >  * event of it not getting flushed for a long time the delay
> >  * shouldn't really matter because there's no real memory
> >  * pressure for swapout to react to. ]
> >  */
>
> If always true, this sounds generic enough for all platforms, why only
> x86 and arm64 ?
>
> > "
> > Running the thpscale to show some obvious improvements for compaction
> > latency with this patch:
> >                              base                   patched
> > Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
> > Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
> > Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
> > Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
> > Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
> > Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
> > Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
> > Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
> > Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
> >                        base       patched
> > Duration User         167.78      161.03
> > Duration System      1836.66     1673.01
> > Duration Elapsed     2074.58     2059.75
>
> Could you please point to the test repo you are running ?
>
> >
> > Barry Song submitted a similar patch [1] before, that replaces the
> > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
> > folio_referenced_one(). However, I'm not sure if removing the tlb flush
> > operation is applicable to every architecture in kernel, so dropping
> > the tlb flush for ARM64 seems a sensible change.
>
> The reasoning provided here sounds generic when true, hence there seems
> to be no justification to keep it limited just for arm64 and x86. Also
> what about pmdp_clear_flush_young_notify() when THP is enabled. Should
> that also not do a TLB flush after clearing access bit ? Although arm64
> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on
> the generic pmdp_clear_flush_young() which also does a TLB flush via
> flush_pmd_tlb_range() while clearing the access bit.
>
> >
> > Note: I am okay for both approach, if someone can help to ensure that
> > all architectures do not need the tlb flush when clearing the accessed
> > bit, then I also think Barry's patch is better (hope Barry can resend
> > his patch).
>
> This paragraph belongs after the '----' below and not part of the commit
> message.
>
> >
> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
> >  1 file changed, 16 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 0bd18de9fd97..2979d796ba9d 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >                                        unsigned long address, pte_t *ptep)
> >  {
> > -     int young = ptep_test_and_clear_young(vma, address, ptep);
> > -
> > -     if (young) {
> > -             /*
> > -              * We can elide the trailing DSB here since the worst that can
> > -              * happen is that a CPU continues to use the young entry in its
> > -              * TLB and we mistakenly reclaim the associated page. The
> > -              * window for such an event is bounded by the next
> > -              * context-switch, which provides a DSB to complete the TLB
> > -              * invalidation.
> > -              */
> > -             flush_tlb_page_nosync(vma, address);
> > -     }
> > -
> > -     return young;
> > +     /*
> > +      * This comment is borrowed from x86, but applies equally to ARM64:
> > +      *
> > +      * Clearing the accessed bit without a TLB flush doesn't cause
> > +      * data corruption. [ It could cause incorrect page aging and
> > +      * the (mistaken) reclaim of hot pages, but the chance of that
> > +      * should be relatively low. ]
> > +      *
> > +      * So as a performance optimization don't flush the TLB when
> > +      * clearing the accessed bit, it will eventually be flushed by
> > +      * a context switch or a VM operation anyway. [ In the rare
> > +      * event of it not getting flushed for a long time the delay
> > +      * shouldn't really matter because there's no real memory
> > +      * pressure for swapout to react to. ]
> > +      */
> > +     return ptep_test_and_clear_young(vma, address, ptep);
> >  }
> >
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> There are three distinct concerns here
>
> 1) What are the chances of this misleading existing hot page reclaim process
> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush
> 3) Could this break the architecture rule requiring a TLB flush after access
>    bit clear on a page table entry

In terms of all of above concerns,  though 2 is different, which is an
issue between
cpu and non-cpu,
i feel kernel has actually dropped tlb flush at least for mglru, there
is no flush in
lru_gen_look_around(),

static bool folio_referenced_one(struct folio *folio,
                struct vm_area_struct *vma, unsigned long address, void *arg)
{
        ...

                if (pvmw.pte) {
                        if (lru_gen_enabled() &&
                            pte_young(ptep_get(pvmw.pte))) {
                                lru_gen_look_around(&pvmw);
                                referenced++;
                        }

                        if (ptep_clear_flush_young_notify(vma, address,
                                                pvmw.pte))
                                referenced++;
                }

        return true;
}

and so is in walk_pte_range() of vmscan.  linux has been surviving with
all above concerns for a while, believing it or not :-)

Thanks
Barry
  
Anshuman Khandual Oct. 26, 2023, 6:01 a.m. UTC | #24
On 10/26/23 11:24, Barry Song wrote:
> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 10/24/23 18:26, Baolin Wang wrote:
>>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>>> check if the folio was referenced, and now it will call a tlb flush on
>>> ARM64 architecture. However the tlb flush can be expensive on ARM64
>>> servers, especially for the systems with a large CPU numbers.
>>
>> TLB flush would be expensive on *any* platform with large CPU numbers ?
>>
>>>
>>> Similar to the x86 architecture, below comments also apply equally to
>>> ARM64 architecture. So we can drop the tlb flush operation in
>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
>>> "
>>> /* Clearing the accessed bit without a TLB flush
>>>  * doesn't cause data corruption. [ It could cause incorrect
>>>  * page aging and the (mistaken) reclaim of hot pages, but the
>>>  * chance of that should be relatively low. ]
>>>  *
>>>  * So as a performance optimization don't flush the TLB when
>>>  * clearing the accessed bit, it will eventually be flushed by
>>>  * a context switch or a VM operation anyway. [ In the rare
>>>  * event of it not getting flushed for a long time the delay
>>>  * shouldn't really matter because there's no real memory
>>>  * pressure for swapout to react to. ]
>>>  */
>>
>> If always true, this sounds generic enough for all platforms, why only
>> x86 and arm64 ?
>>
>>> "
>>> Running the thpscale to show some obvious improvements for compaction
>>> latency with this patch:
>>>                              base                   patched
>>> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>>> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>>> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>>> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>>> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>>> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>>> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>>> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>>> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>>>                        base       patched
>>> Duration User         167.78      161.03
>>> Duration System      1836.66     1673.01
>>> Duration Elapsed     2074.58     2059.75
>>
>> Could you please point to the test repo you are running ?
>>
>>>
>>> Barry Song submitted a similar patch [1] before, that replaces the
>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush
>>> operation is applicable to every architecture in kernel, so dropping
>>> the tlb flush for ARM64 seems a sensible change.
>>
>> The reasoning provided here sounds generic when true, hence there seems
>> to be no justification to keep it limited just for arm64 and x86. Also
>> what about pmdp_clear_flush_young_notify() when THP is enabled. Should
>> that also not do a TLB flush after clearing access bit ? Although arm64
>> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on
>> the generic pmdp_clear_flush_young() which also does a TLB flush via
>> flush_pmd_tlb_range() while clearing the access bit.
>>
>>>
>>> Note: I am okay for both approach, if someone can help to ensure that
>>> all architectures do not need the tlb flush when clearing the accessed
>>> bit, then I also think Barry's patch is better (hope Barry can resend
>>> his patch).
>>
>> This paragraph belongs after the '----' below and not part of the commit
>> message.
>>
>>>
>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>  1 file changed, 16 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 0bd18de9fd97..2979d796ba9d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>                                        unsigned long address, pte_t *ptep)
>>>  {
>>> -     int young = ptep_test_and_clear_young(vma, address, ptep);
>>> -
>>> -     if (young) {
>>> -             /*
>>> -              * We can elide the trailing DSB here since the worst that can
>>> -              * happen is that a CPU continues to use the young entry in its
>>> -              * TLB and we mistakenly reclaim the associated page. The
>>> -              * window for such an event is bounded by the next
>>> -              * context-switch, which provides a DSB to complete the TLB
>>> -              * invalidation.
>>> -              */
>>> -             flush_tlb_page_nosync(vma, address);
>>> -     }
>>> -
>>> -     return young;
>>> +     /*
>>> +      * This comment is borrowed from x86, but applies equally to ARM64:
>>> +      *
>>> +      * Clearing the accessed bit without a TLB flush doesn't cause
>>> +      * data corruption. [ It could cause incorrect page aging and
>>> +      * the (mistaken) reclaim of hot pages, but the chance of that
>>> +      * should be relatively low. ]
>>> +      *
>>> +      * So as a performance optimization don't flush the TLB when
>>> +      * clearing the accessed bit, it will eventually be flushed by
>>> +      * a context switch or a VM operation anyway. [ In the rare
>>> +      * event of it not getting flushed for a long time the delay
>>> +      * shouldn't really matter because there's no real memory
>>> +      * pressure for swapout to react to. ]
>>> +      */
>>> +     return ptep_test_and_clear_young(vma, address, ptep);
>>>  }
>>>
>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>
>> There are three distinct concerns here
>>
>> 1) What are the chances of this misleading existing hot page reclaim process
>> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush
>> 3) Could this break the architecture rule requiring a TLB flush after access
>>    bit clear on a page table entry
> 
> In terms of all of above concerns,  though 2 is different, which is an
> issue between
> cpu and non-cpu,
> i feel kernel has actually dropped tlb flush at least for mglru, there
> is no flush in
> lru_gen_look_around(),
> 
> static bool folio_referenced_one(struct folio *folio,
>                 struct vm_area_struct *vma, unsigned long address, void *arg)
> {
>         ...
> 
>                 if (pvmw.pte) {
>                         if (lru_gen_enabled() &&
>                             pte_young(ptep_get(pvmw.pte))) {
>                                 lru_gen_look_around(&pvmw);
>                                 referenced++;
>                         }
> 
>                         if (ptep_clear_flush_young_notify(vma, address,
>                                                 pvmw.pte))
>                                 referenced++;
>                 }
> 
>         return true;
> }
> 
> and so is in walk_pte_range() of vmscan.  linux has been surviving with
> all above concerns for a while, believing it or not :-)

Although the first two concerns could be worked upon in the SW, kernel surviving
after breaking arch rules explicitly is not a correct state to be in IMHO.
  
Robin Murphy Oct. 26, 2023, 12:30 p.m. UTC | #25
On 26/10/2023 7:01 am, Anshuman Khandual wrote:
> 
> 
> On 10/26/23 11:24, Barry Song wrote:
>> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual
>> <anshuman.khandual@arm.com> wrote:
>>>
>>>
>>>
>>> On 10/24/23 18:26, Baolin Wang wrote:
>>>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>>>> check if the folio was referenced, and now it will call a tlb flush on
>>>> ARM64 architecture. However the tlb flush can be expensive on ARM64
>>>> servers, especially for the systems with a large CPU numbers.
>>>
>>> TLB flush would be expensive on *any* platform with large CPU numbers ?
>>>
>>>>
>>>> Similar to the x86 architecture, below comments also apply equally to
>>>> ARM64 architecture. So we can drop the tlb flush operation in
>>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
>>>> "
>>>> /* Clearing the accessed bit without a TLB flush
>>>>   * doesn't cause data corruption. [ It could cause incorrect
>>>>   * page aging and the (mistaken) reclaim of hot pages, but the
>>>>   * chance of that should be relatively low. ]
>>>>   *
>>>>   * So as a performance optimization don't flush the TLB when
>>>>   * clearing the accessed bit, it will eventually be flushed by
>>>>   * a context switch or a VM operation anyway. [ In the rare
>>>>   * event of it not getting flushed for a long time the delay
>>>>   * shouldn't really matter because there's no real memory
>>>>   * pressure for swapout to react to. ]
>>>>   */
>>>
>>> If always true, this sounds generic enough for all platforms, why only
>>> x86 and arm64 ?
>>>
>>>> "
>>>> Running the thpscale to show some obvious improvements for compaction
>>>> latency with this patch:
>>>>                               base                   patched
>>>> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>>>> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>>>> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>>>> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>>>> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>>>> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>>>> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>>>> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>>>> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>>>>                         base       patched
>>>> Duration User         167.78      161.03
>>>> Duration System      1836.66     1673.01
>>>> Duration Elapsed     2074.58     2059.75
>>>
>>> Could you please point to the test repo you are running ?
>>>
>>>>
>>>> Barry Song submitted a similar patch [1] before, that replaces the
>>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush
>>>> operation is applicable to every architecture in kernel, so dropping
>>>> the tlb flush for ARM64 seems a sensible change.
>>>
>>> The reasoning provided here sounds generic when true, hence there seems
>>> to be no justification to keep it limited just for arm64 and x86. Also
>>> what about pmdp_clear_flush_young_notify() when THP is enabled. Should
>>> that also not do a TLB flush after clearing access bit ? Although arm64
>>> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on
>>> the generic pmdp_clear_flush_young() which also does a TLB flush via
>>> flush_pmd_tlb_range() while clearing the access bit.
>>>
>>>>
>>>> Note: I am okay for both approach, if someone can help to ensure that
>>>> all architectures do not need the tlb flush when clearing the accessed
>>>> bit, then I also think Barry's patch is better (hope Barry can resend
>>>> his patch).
>>>
>>> This paragraph belongs after the '----' below and not part of the commit
>>> message.
>>>
>>>>
>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>   arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>>   1 file changed, 16 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>                                         unsigned long address, pte_t *ptep)
>>>>   {
>>>> -     int young = ptep_test_and_clear_young(vma, address, ptep);
>>>> -
>>>> -     if (young) {
>>>> -             /*
>>>> -              * We can elide the trailing DSB here since the worst that can
>>>> -              * happen is that a CPU continues to use the young entry in its
>>>> -              * TLB and we mistakenly reclaim the associated page. The
>>>> -              * window for such an event is bounded by the next
>>>> -              * context-switch, which provides a DSB to complete the TLB
>>>> -              * invalidation.
>>>> -              */
>>>> -             flush_tlb_page_nosync(vma, address);
>>>> -     }
>>>> -
>>>> -     return young;
>>>> +     /*
>>>> +      * This comment is borrowed from x86, but applies equally to ARM64:
>>>> +      *
>>>> +      * Clearing the accessed bit without a TLB flush doesn't cause
>>>> +      * data corruption. [ It could cause incorrect page aging and
>>>> +      * the (mistaken) reclaim of hot pages, but the chance of that
>>>> +      * should be relatively low. ]
>>>> +      *
>>>> +      * So as a performance optimization don't flush the TLB when
>>>> +      * clearing the accessed bit, it will eventually be flushed by
>>>> +      * a context switch or a VM operation anyway. [ In the rare
>>>> +      * event of it not getting flushed for a long time the delay
>>>> +      * shouldn't really matter because there's no real memory
>>>> +      * pressure for swapout to react to. ]
>>>> +      */
>>>> +     return ptep_test_and_clear_young(vma, address, ptep);
>>>>   }
>>>>
>>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> There are three distinct concerns here
>>>
>>> 1) What are the chances of this misleading existing hot page reclaim process
>>> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush
>>> 3) Could this break the architecture rule requiring a TLB flush after access
>>>     bit clear on a page table entry
>>
>> In terms of all of above concerns,  though 2 is different, which is an
>> issue between
>> cpu and non-cpu,
>> i feel kernel has actually dropped tlb flush at least for mglru, there
>> is no flush in
>> lru_gen_look_around(),
>>
>> static bool folio_referenced_one(struct folio *folio,
>>                  struct vm_area_struct *vma, unsigned long address, void *arg)
>> {
>>          ...
>>
>>                  if (pvmw.pte) {
>>                          if (lru_gen_enabled() &&
>>                              pte_young(ptep_get(pvmw.pte))) {
>>                                  lru_gen_look_around(&pvmw);
>>                                  referenced++;
>>                          }
>>
>>                          if (ptep_clear_flush_young_notify(vma, address,
>>                                                  pvmw.pte))
>>                                  referenced++;
>>                  }
>>
>>          return true;
>> }
>>
>> and so is in walk_pte_range() of vmscan.  linux has been surviving with
>> all above concerns for a while, believing it or not :-)
> 
> Although the first two concerns could be worked upon in the SW, kernel surviving
> after breaking arch rules explicitly is not a correct state to be in IMHO.

AFAICS it's not breaking any rules as such. Indeed the architecture 
requires a TLBI to ensure that the access flag update is observed by the 
CPU, but what's being proposed here is relaxing Linux's own expectations 
to say actually we don't mind if the CPU *doesn't* observe the update 
straight away.

Thanks,
Robin.
  
Baolin Wang Oct. 26, 2023, 12:32 p.m. UTC | #26
On 10/26/2023 2:01 PM, Anshuman Khandual wrote:
> 
> 
> On 10/26/23 11:24, Barry Song wrote:
>> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual
>> <anshuman.khandual@arm.com> wrote:
>>>
>>>
>>>
>>> On 10/24/23 18:26, Baolin Wang wrote:
>>>> Now ptep_clear_flush_young() is only called by folio_referenced() to
>>>> check if the folio was referenced, and now it will call a tlb flush on
>>>> ARM64 architecture. However the tlb flush can be expensive on ARM64
>>>> servers, especially for the systems with a large CPU numbers.
>>>
>>> TLB flush would be expensive on *any* platform with large CPU numbers ?

Perhaps yes, but did not measure it on other platforms.

>>>> Similar to the x86 architecture, below comments also apply equally to
>>>> ARM64 architecture. So we can drop the tlb flush operation in
>>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance.
>>>> "
>>>> /* Clearing the accessed bit without a TLB flush
>>>>   * doesn't cause data corruption. [ It could cause incorrect
>>>>   * page aging and the (mistaken) reclaim of hot pages, but the
>>>>   * chance of that should be relatively low. ]
>>>>   *
>>>>   * So as a performance optimization don't flush the TLB when
>>>>   * clearing the accessed bit, it will eventually be flushed by
>>>>   * a context switch or a VM operation anyway. [ In the rare
>>>>   * event of it not getting flushed for a long time the delay
>>>>   * shouldn't really matter because there's no real memory
>>>>   * pressure for swapout to react to. ]
>>>>   */
>>>
>>> If always true, this sounds generic enough for all platforms, why only
>>> x86 and arm64 ?

I am not sure this is always true for every architectures.

>>>> "
>>>> Running the thpscale to show some obvious improvements for compaction
>>>> latency with this patch:
>>>>                               base                   patched
>>>> Amean     fault-both-1      1093.19 (   0.00%)     1084.57 *   0.79%*
>>>> Amean     fault-both-3      2566.22 (   0.00%)     2228.45 *  13.16%*
>>>> Amean     fault-both-5      3591.22 (   0.00%)     3146.73 *  12.38%*
>>>> Amean     fault-both-7      4157.26 (   0.00%)     4113.67 *   1.05%*
>>>> Amean     fault-both-12     6184.79 (   0.00%)     5218.70 *  15.62%*
>>>> Amean     fault-both-18     9103.70 (   0.00%)     7739.71 *  14.98%*
>>>> Amean     fault-both-24    12341.73 (   0.00%)    10684.23 *  13.43%*
>>>> Amean     fault-both-30    15519.00 (   0.00%)    13695.14 *  11.75%*
>>>> Amean     fault-both-32    16189.15 (   0.00%)    14365.73 *  11.26%*
>>>>                         base       patched
>>>> Duration User         167.78      161.03
>>>> Duration System      1836.66     1673.01
>>>> Duration Elapsed     2074.58     2059.75
>>>
>>> Could you please point to the test repo you are running ?

The test is based on v6.5 kernel.

>>>> Barry Song submitted a similar patch [1] before, that replaces the
>>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
>>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush
>>>> operation is applicable to every architecture in kernel, so dropping
>>>> the tlb flush for ARM64 seems a sensible change.
>>>
>>> The reasoning provided here sounds generic when true, hence there seems
>>> to be no justification to keep it limited just for arm64 and x86. Also

Right, but I can not ensure if this will break other architectures.

>>> what about pmdp_clear_flush_young_notify() when THP is enabled. Should
>>> that also not do a TLB flush after clearing access bit ? Although arm64

Yes, I think so.

>>> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on
>>> the generic pmdp_clear_flush_young() which also does a TLB flush via
>>> flush_pmd_tlb_range() while clearing the access bit.
>>>
>>>>
>>>> Note: I am okay for both approach, if someone can help to ensure that
>>>> all architectures do not need the tlb flush when clearing the accessed
>>>> bit, then I also think Barry's patch is better (hope Barry can resend
>>>> his patch).
>>>
>>> This paragraph belongs after the '----' below and not part of the commit
>>> message.

OK.

>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>   arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>>>>   1 file changed, 16 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0bd18de9fd97..2979d796ba9d 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>   static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>                                         unsigned long address, pte_t *ptep)
>>>>   {
>>>> -     int young = ptep_test_and_clear_young(vma, address, ptep);
>>>> -
>>>> -     if (young) {
>>>> -             /*
>>>> -              * We can elide the trailing DSB here since the worst that can
>>>> -              * happen is that a CPU continues to use the young entry in its
>>>> -              * TLB and we mistakenly reclaim the associated page. The
>>>> -              * window for such an event is bounded by the next
>>>> -              * context-switch, which provides a DSB to complete the TLB
>>>> -              * invalidation.
>>>> -              */
>>>> -             flush_tlb_page_nosync(vma, address);
>>>> -     }
>>>> -
>>>> -     return young;
>>>> +     /*
>>>> +      * This comment is borrowed from x86, but applies equally to ARM64:
>>>> +      *
>>>> +      * Clearing the accessed bit without a TLB flush doesn't cause
>>>> +      * data corruption. [ It could cause incorrect page aging and
>>>> +      * the (mistaken) reclaim of hot pages, but the chance of that
>>>> +      * should be relatively low. ]
>>>> +      *
>>>> +      * So as a performance optimization don't flush the TLB when
>>>> +      * clearing the accessed bit, it will eventually be flushed by
>>>> +      * a context switch or a VM operation anyway. [ In the rare
>>>> +      * event of it not getting flushed for a long time the delay
>>>> +      * shouldn't really matter because there's no real memory
>>>> +      * pressure for swapout to react to. ]
>>>> +      */
>>>> +     return ptep_test_and_clear_young(vma, address, ptep);
>>>>   }
>>>>
>>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> There are three distinct concerns here
>>>
>>> 1) What are the chances of this misleading existing hot page reclaim process
>>> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush
>>> 3) Could this break the architecture rule requiring a TLB flush after access
>>>     bit clear on a page table entry
>>
>> In terms of all of above concerns,  though 2 is different, which is an
>> issue between
>> cpu and non-cpu,
>> i feel kernel has actually dropped tlb flush at least for mglru, there
>> is no flush in
>> lru_gen_look_around(),
>>
>> static bool folio_referenced_one(struct folio *folio,
>>                  struct vm_area_struct *vma, unsigned long address, void *arg)
>> {
>>          ...
>>
>>                  if (pvmw.pte) {
>>                          if (lru_gen_enabled() &&
>>                              pte_young(ptep_get(pvmw.pte))) {
>>                                  lru_gen_look_around(&pvmw);
>>                                  referenced++;
>>                          }
>>
>>                          if (ptep_clear_flush_young_notify(vma, address,
>>                                                  pvmw.pte))
>>                                  referenced++;
>>                  }
>>
>>          return true;
>> }
>>
>> and so is in walk_pte_range() of vmscan.  linux has been surviving with
>> all above concerns for a while, believing it or not :-)
> 
> Although the first two concerns could be worked upon in the SW, kernel surviving
> after breaking arch rules explicitly is not a correct state to be in IMHO.

Not sure what's the meaning of "not a correct state", at least we 
(Alibaba) have not found this can cause any issues until now when using 
MGLRU on x86 and ARM64 platforms.
  
Barry Song Oct. 26, 2023, 11:48 p.m. UTC | #27
On Wed, Oct 25, 2023 at 6:16 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
> >> >
> >> >
> >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> > >>>>>> page,  B might
> >> > >>>>>> be a correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >> > >>>>> access flag in the long time in which the page is re-moved to the tail
> >> > >>>>> as the page can be accessed multiple times if it is really hot.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >> > >>> pagetable the ability to set access flag in page table entries by
> >> > >>> itself? if yes,
> >> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >> > >>> either as the external hardware can't set access flags, that means we
> >> > >>> have ignored its reference and only knows cpu's access even in the current
> >> > >>> mainline code. so we are not getting worse.
> >> > >>>
> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >> > >>> also broadcast to external hardware, then external hardware sees the
> >> > >>> cleared access flag, thus, it can set access flag in page table when the
> >> > >>> hardware access it?  If this is the case, I feel what you said is true.
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the
> >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> > >> TLB
> >> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > ARM64 SMMU architecture specification supports the possibilty of both,
> >> > as does the driver. Not that I think whether or not BTM is supported has
> >> > much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >> As an aside access flag updates happen in one of two ways. If the
> >> > >> SMMU
> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
> >> > >> I suppose drivers could implement the clear_flush_young() MMU
> >> > >> notifier
> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
> >> > > I think, there are no actual use cases at the moment? At least on our
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally
> > it runs the same codes to get page back just like CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were lots of
> > push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Right. But
Sometimes I feel that purpose and the technology for the purpose are
contradictory.
People want user space dma to have high bandwidth and performance by dropping
u/k copying and complicated syscalls managing those buffers. but IO PF somehow
defeats all the efforts to improve performance.

AFAIK, IO PF is much much slower than PF as it depends on interrupts and event
queues in hardware. Even a minor page fault due to page migration or
demand paging
can significantly decrease the speed of userspace DMA. So finally people have to
depend on some mlock-like and pre-paging ways to achieve the purpose
if performance
is the key concern.

Thanks
Barry
  
Will Deacon Nov. 7, 2023, 10:12 a.m. UTC | #28
On Wed, Oct 25, 2023 at 09:39:19AM +0800, Yin, Fengwei wrote:
> 
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 0bd18de9fd97..2979d796ba9d 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>                                          unsigned long address, pte_t *ptep)
> >>  {
> >> -       int young = ptep_test_and_clear_young(vma, address, ptep);
> >> -
> >> -       if (young) {
> >> -               /*
> >> -                * We can elide the trailing DSB here since the worst that can
> >> -                * happen is that a CPU continues to use the young entry in its
> >> -                * TLB and we mistakenly reclaim the associated page. The
> >> -                * window for such an event is bounded by the next
> >> -                * context-switch, which provides a DSB to complete the TLB
> >> -                * invalidation.
> >> -                */
> >> -               flush_tlb_page_nosync(vma, address);
> >> -       }
> >> -
> >> -       return young;
> >> +       /*
> >> +        * This comment is borrowed from x86, but applies equally to ARM64:
> >> +        *
> >> +        * Clearing the accessed bit without a TLB flush doesn't cause
> >> +        * data corruption. [ It could cause incorrect page aging and
> >> +        * the (mistaken) reclaim of hot pages, but the chance of that
> >> +        * should be relatively low. ]
> >> +        *
> >> +        * So as a performance optimization don't flush the TLB when
> >> +        * clearing the accessed bit, it will eventually be flushed by
> >> +        * a context switch or a VM operation anyway. [ In the rare
> >> +        * event of it not getting flushed for a long time the delay
> >> +        * shouldn't really matter because there's no real memory
> >> +        * pressure for swapout to react to. ]
> >> +        */
> >> +       return ptep_test_and_clear_young(vma, address, ptep);
> >>  }
> From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/:
> 
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.

I completely agree.

> My understanding is that arm64 doesn't do invalidate the TLB during
> context switch. The flush_tlb_page_nosync() here + DSB during context
> switch make sure the TLB is invalidated during context switch.
> So we can't remove flush_tlb_page_nosync() here? Or something was changed
> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
> may be missed)? Thanks.

As you point out, we already elide the DSB here but I don't think we should
remove the TLB invalidation entirely because then we lose the guarantee
that the update ever becomes visible to the page-table walker.

I'm surprised that the TLBI is showing up as a performance issue without
the DSB present. Is it because we're walking over a large VA range and
invalidating on a per-page basis? If so, we'd be better off batching them
up and doing the invalidation at the end (which will be upgraded to a
full-mm invalidation if the range is large enough).

Will
  
Barry Song Nov. 7, 2023, 8:50 p.m. UTC | #29
On Tue, Nov 7, 2023 at 6:12 PM Will Deacon <will@kernel.org> wrote:
>
> On Wed, Oct 25, 2023 at 09:39:19AM +0800, Yin, Fengwei wrote:
> >
> > >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > >> index 0bd18de9fd97..2979d796ba9d 100644
> > >> --- a/arch/arm64/include/asm/pgtable.h
> > >> +++ b/arch/arm64/include/asm/pgtable.h
> > >> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > >>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > >>                                          unsigned long address, pte_t *ptep)
> > >>  {
> > >> -       int young = ptep_test_and_clear_young(vma, address, ptep);
> > >> -
> > >> -       if (young) {
> > >> -               /*
> > >> -                * We can elide the trailing DSB here since the worst that can
> > >> -                * happen is that a CPU continues to use the young entry in its
> > >> -                * TLB and we mistakenly reclaim the associated page. The
> > >> -                * window for such an event is bounded by the next
> > >> -                * context-switch, which provides a DSB to complete the TLB
> > >> -                * invalidation.
> > >> -                */
> > >> -               flush_tlb_page_nosync(vma, address);
> > >> -       }
> > >> -
> > >> -       return young;
> > >> +       /*
> > >> +        * This comment is borrowed from x86, but applies equally to ARM64:
> > >> +        *
> > >> +        * Clearing the accessed bit without a TLB flush doesn't cause
> > >> +        * data corruption. [ It could cause incorrect page aging and
> > >> +        * the (mistaken) reclaim of hot pages, but the chance of that
> > >> +        * should be relatively low. ]
> > >> +        *
> > >> +        * So as a performance optimization don't flush the TLB when
> > >> +        * clearing the accessed bit, it will eventually be flushed by
> > >> +        * a context switch or a VM operation anyway. [ In the rare
> > >> +        * event of it not getting flushed for a long time the delay
> > >> +        * shouldn't really matter because there's no real memory
> > >> +        * pressure for swapout to react to. ]
> > >> +        */
> > >> +       return ptep_test_and_clear_young(vma, address, ptep);
> > >>  }
> > From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/:
> >
> > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > the TLB on context switch. That means our window for keeping the stale
> > entries around is potentially much bigger and might not be a great idea.
>
> I completely agree.
>
> > My understanding is that arm64 doesn't do invalidate the TLB during
> > context switch. The flush_tlb_page_nosync() here + DSB during context
> > switch make sure the TLB is invalidated during context switch.
> > So we can't remove flush_tlb_page_nosync() here? Or something was changed
> > for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
> > may be missed)? Thanks.
>
> As you point out, we already elide the DSB here but I don't think we should
> remove the TLB invalidation entirely because then we lose the guarantee
> that the update ever becomes visible to the page-table walker.
>
> I'm surprised that the TLBI is showing up as a performance issue without
> the DSB present. Is it because we're walking over a large VA range and
> invalidating on a per-page basis? If so, we'd be better off batching them

nop. in lru cases, there are thousands of pages in LRU list. doing vmscan,
we depend on rmap to find their PTEs, then read and clear AF to figure out
if a page is young. So it is not from a big VM area to those pages in this VA
range. There are just too many pages from lots of processes in LRU to be
scanned. The thing is done by rmap.

> up and doing the invalidation at the end (which will be upgraded to a
> full-mm invalidation if the range is large enough).

Those pages in LRU could be from hundreds of different processes, they are
not in just one process. i guess one possibility is that hardware has a limited
tlbi/nosync buffer, once the buffer is full, something similar with dsb will be
done automatically by hardware. So too many tlbi even without dsb can still
harm performance.

>
> Will

Thanks
Barry
  

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..2979d796ba9d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -905,21 +905,22 @@  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
-	int young = ptep_test_and_clear_young(vma, address, ptep);
-
-	if (young) {
-		/*
-		 * We can elide the trailing DSB here since the worst that can
-		 * happen is that a CPU continues to use the young entry in its
-		 * TLB and we mistakenly reclaim the associated page. The
-		 * window for such an event is bounded by the next
-		 * context-switch, which provides a DSB to complete the TLB
-		 * invalidation.
-		 */
-		flush_tlb_page_nosync(vma, address);
-	}
-
-	return young;
+	/*
+	 * This comment is borrowed from x86, but applies equally to ARM64:
+	 *
+	 * Clearing the accessed bit without a TLB flush doesn't cause
+	 * data corruption. [ It could cause incorrect page aging and
+	 * the (mistaken) reclaim of hot pages, but the chance of that
+	 * should be relatively low. ]
+	 *
+	 * So as a performance optimization don't flush the TLB when
+	 * clearing the accessed bit, it will eventually be flushed by
+	 * a context switch or a VM operation anyway. [ In the rare
+	 * event of it not getting flushed for a long time the delay
+	 * shouldn't really matter because there's no real memory
+	 * pressure for swapout to react to. ]
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE