Message ID: ae3115778a3fa10ec77152e18beed54fafe0f6e7.1698151516.git.baolin.wang@linux.alibaba.com
State: New
Headers:
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: catalin.marinas@arm.com, will@kernel.org
Cc: akpm@linux-foundation.org, v-songbaohua@oppo.com, yuzhao@google.com,
    baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
    linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
Date: Tue, 24 Oct 2023 20:56:35 +0800
Message-Id: <ae3115778a3fa10ec77152e18beed54fafe0f6e7.1698151516.git.baolin.wang@linux.alibaba.com>
Series: arm64: mm: drop tlb flush operation when clearing the access bit
Commit Message
Baolin Wang
Oct. 24, 2023, 12:56 p.m. UTC
Nowadays ptep_clear_flush_young() is only called by folio_referenced() to
check whether a folio was referenced, and it currently performs a TLB
flush on the ARM64 architecture. However, the TLB flush can be expensive
on ARM64 servers, especially on systems with a large number of CPUs.
Similar to the x86 architecture, the comments below apply equally to
ARM64, so we can drop the TLB flush operation in ptep_clear_flush_young()
on ARM64 to improve performance.
"
/* Clearing the accessed bit without a TLB flush
* doesn't cause data corruption. [ It could cause incorrect
* page aging and the (mistaken) reclaim of hot pages, but the
* chance of that should be relatively low. ]
*
* So as a performance optimization don't flush the TLB when
* clearing the accessed bit, it will eventually be flushed by
* a context switch or a VM operation anyway. [ In the rare
* event of it not getting flushed for a long time the delay
* shouldn't really matter because there's no real memory
* pressure for swapout to react to. ]
*/
"
Running the thpscale benchmark shows some obvious improvements in
compaction latency with this patch:
base patched
Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%*
Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%*
Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%*
Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%*
Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%*
Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%*
Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%*
Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%*
Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%*
base patched
Duration User 167.78 161.03
Duration System 1836.66 1673.01
Duration Elapsed 2074.58 2059.75
Barry Song previously submitted a similar patch [1] that replaces
ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
folio_referenced_one(). However, I'm not sure whether removing the TLB
flush operation is applicable to every architecture in the kernel, so
dropping the TLB flush only for ARM64 seems a sensible change.
Note: I am okay with both approaches. If someone can help ensure that
no architecture needs the TLB flush when clearing the accessed bit, then
I also think Barry's patch is better (I hope Barry can resend his patch).
[1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
Comments
On 2023/10/24 20:56, Baolin Wang wrote:
> [... commit message trimmed ...]
> However, I'm not sure if removing the tlb flush
> operation is applicable to every architecture in kernel, so dropping
> the tlb flush for ARM64 seems a sensible change.

At least x86/s390/riscv/powerpc already do it. Also, I think we could
change pmdp_clear_flush_young_notify() too, since it is the same as
ptep_clear_flush_young_notify().

> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>  1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0bd18de9fd97..2979d796ba9d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  					 unsigned long address, pte_t *ptep)
>  {
> -	int young = ptep_test_and_clear_young(vma, address, ptep);
> -
> -	if (young) {
> -		/*
> -		 * We can elide the trailing DSB here since the worst that can
> -		 * happen is that a CPU continues to use the young entry in its
> -		 * TLB and we mistakenly reclaim the associated page. The
> -		 * window for such an event is bounded by the next
> -		 * context-switch, which provides a DSB to complete the TLB
> -		 * invalidation.
> -		 */
> -		flush_tlb_page_nosync(vma, address);
> -	}
> -
> -	return young;
> +	/*
> +	 * This comment is borrowed from x86, but applies equally to ARM64:
> +	 *
> +	 * Clearing the accessed bit without a TLB flush doesn't cause
> +	 * data corruption. [ It could cause incorrect page aging and
> +	 * the (mistaken) reclaim of hot pages, but the chance of that
> +	 * should be relatively low. ]
> +	 *
> +	 * So as a performance optimization don't flush the TLB when
> +	 * clearing the accessed bit, it will eventually be flushed by
> +	 * a context switch or a VM operation anyway. [ In the rare
> +	 * event of it not getting flushed for a long time the delay
> +	 * shouldn't really matter because there's no real memory
> +	 * pressure for swapout to react to. ]
> +	 */
> +	return ptep_test_and_clear_young(vma, address, ptep);
>  }
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
On Tue, Oct 24, 2023 at 6:56 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> [... commit message trimmed ...]
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

+Minchan Kim

Minchan and I discussed this (again) yesterday -- I'm in favor and he
can voice his different opinion on this.
On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> [... commit message trimmed ...]
> Note: I am okay for both approach, if someone can help to ensure that
> all architectures do not need the tlb flush when clearing the accessed
> bit, then I also think Barry's patch is better (hope Barry can resend
> his patch).

Thanks!

ptep_clear_flush_young(), with "flush" in its name, clearly says it
needs a flush. But on arm64 all other code that needs a flush calls
other variants, for example __flush_tlb_page_nosync():

static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
					     struct mm_struct *mm, unsigned long uaddr)
{
	__flush_tlb_page_nosync(mm, uaddr);
}

So it seems folio_referenced() is the only remaining user of
ptep_clear_flush_young(), which makes Baolin's patch look safe now.
But the function still has "flush" in its name, so one day someone
might call it expecting it to flush the TLB when it actually won't.
That is a bad smell in the code.

I guess one side effect of not flushing the TLB while clearing the
access flag is that hardware won't see the cleared flag in the TLB, so
it might not set the bit in memory when the page is accessed *again*,
even though the bit has already been cleared in memory (but not in the
TLB). The next time someone reads the access flag in memory via
folio_referenced(), they will see the page as cold, because the
hardware lost its chance to set the bit again: it found the TLB entry
already had the access flag set.

But anyway, the TLB is small and will soon be flushed by context
switches and other running code, so it seems we don't actually require
the access flag to be updated instantly. The window in which the access
flag might miss a new hardware set seems too short to really affect the
accuracy of page reclamation, while the flush cost is large.

(A) a constant flush cost vs. (B) a very, very occasionally reclaimed
hot page: B might be the correct choice.

Thanks
Barry
On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> [... trimmed ...]
> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> page, B might be a correct choice.

Plus, I doubt B is really going to happen. After a page is promoted to
the head of the LRU list or to a new generation, it takes a long time
to slide back to the inactive list tail or to the candidate generation
of MGLRU; that time should be more than enough for the TLB to be
flushed. If the page is really hot, the hardware will get a second,
third, fourth, etc. opportunity to set the access flag during that long
period in which the page moves back toward the tail, since a really hot
page will be accessed multiple times.

Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:
> [... trimmed ...]
> Plus, I doubt B is really going to happen. as after a page is promoted to
> the head of lru list or new generation, it needs a long time to slide back
> to the inactive list tail or to the candidate generation of mglru. the time
> should have been large enough for tlb to be flushed.

This might not be true if you have external hardware sharing the page
tables with software through either HMM or hardware-supported ATS,
though. In those cases I think it's much more likely hardware can still
be accessing the page even after a context switch on the CPU, say.

So those pages will tend to get reclaimed even though hardware is still
actively using them, which would be quite expensive and I guess could
lead to thrashing as each page is reclaimed and then immediately
faulted back in.

Of course TLB flushes are equally (perhaps even more) expensive for
this kind of external HW, so reducing them would still be beneficial. I
wonder if there's some way they could be deferred until the page is
moved to the inactive list, say?
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index 0bd18de9fd97..2979d796ba9d 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >> unsigned long address, pte_t *ptep) >> { >> - int young = ptep_test_and_clear_young(vma, address, ptep); >> - >> - if (young) { >> - /* >> - * We can elide the trailing DSB here since the worst that can >> - * happen is that a CPU continues to use the young entry in its >> - * TLB and we mistakenly reclaim the associated page. The >> - * window for such an event is bounded by the next >> - * context-switch, which provides a DSB to complete the TLB >> - * invalidation. >> - */ >> - flush_tlb_page_nosync(vma, address); >> - } >> - >> - return young; >> + /* >> + * This comment is borrowed from x86, but applies equally to ARM64: >> + * >> + * Clearing the accessed bit without a TLB flush doesn't cause >> + * data corruption. [ It could cause incorrect page aging and >> + * the (mistaken) reclaim of hot pages, but the chance of that >> + * should be relatively low. ] >> + * >> + * So as a performance optimization don't flush the TLB when >> + * clearing the accessed bit, it will eventually be flushed by >> + * a context switch or a VM operation anyway. [ In the rare >> + * event of it not getting flushed for a long time the delay >> + * shouldn't really matter because there's no real memory >> + * pressure for swapout to react to. ] >> + */ >> + return ptep_test_and_clear_young(vma, address, ptep); >> } From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/: This is blindly copied from x86 and isn't true for us: we don't invalidate the TLB on context switch. That means our window for keeping the stale entries around is potentially much bigger and might not be a great idea. 
My understanding is that arm64 doesn't invalidate the TLB during a context
switch. The flush_tlb_page_nosync() here + the DSB during context switch make
sure the TLB is invalidated during context switch. So we can't remove
flush_tlb_page_nosync() here? Or has something changed for arm64 (I have zero
knowledge of the TLB on arm64, so some obvious thing may be missed)?

Thanks.

Regards
Yin, Fengwei
On 10/24/2023 9:48 PM, Kefeng Wang wrote: > > > On 2023/10/24 20:56, Baolin Wang wrote: >> Now ptep_clear_flush_young() is only called by folio_referenced() to >> check if the folio was referenced, and now it will call a tlb flush on >> ARM64 architecture. However the tlb flush can be expensive on ARM64 >> servers, especially for the systems with a large CPU numbers. >> >> Similar to the x86 architecture, below comments also apply equally to >> ARM64 architecture. So we can drop the tlb flush operation in >> ptep_clear_flush_young() on ARM64 architecture to improve the >> performance. >> " >> /* Clearing the accessed bit without a TLB flush >> * doesn't cause data corruption. [ It could cause incorrect >> * page aging and the (mistaken) reclaim of hot pages, but the >> * chance of that should be relatively low. ] >> * >> * So as a performance optimization don't flush the TLB when >> * clearing the accessed bit, it will eventually be flushed by >> * a context switch or a VM operation anyway. [ In the rare >> * event of it not getting flushed for a long time the delay >> * shouldn't really matter because there's no real memory >> * pressure for swapout to react to. 
] >> */ >> " >> Running the thpscale to show some obvious improvements for compaction >> latency with this patch: >> base patched >> Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* >> Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* >> Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* >> Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* >> Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* >> Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* >> Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* >> Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* >> Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* >> base patched >> Duration User 167.78 161.03 >> Duration System 1836.66 1673.01 >> Duration Elapsed 2074.58 2059.75 >> >> Barry Song submitted a similar patch [1] before, that replaces the >> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in >> folio_referenced_one(). However, I'm not sure if removing the tlb flush >> operation is applicable to every architecture in kernel, so dropping >> the tlb flush for ARM64 seems a sensible change. > > At least x86/s390/riscv/powerpc already do it, also I think we could Right. > change pmdp_clear_flush_young_notify() too, since it is same with > ptep_clear_flush_young_notify(), Perhaps yes, but I'm still unsure if removing tlb flush for PMD entry is applicable to all architectures. Let's see the discussion in this thread. Thanks.
On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: > > > Barry Song <21cnbao@gmail.com> writes: > > > On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: > >> > >> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang > >> <baolin.wang@linux.alibaba.com> wrote: > >> > > >> > Now ptep_clear_flush_young() is only called by folio_referenced() to > >> > check if the folio was referenced, and now it will call a tlb flush on > >> > ARM64 architecture. However the tlb flush can be expensive on ARM64 > >> > servers, especially for the systems with a large CPU numbers. > >> > > >> > Similar to the x86 architecture, below comments also apply equally to > >> > ARM64 architecture. So we can drop the tlb flush operation in > >> > ptep_clear_flush_young() on ARM64 architecture to improve the performance. > >> > " > >> > /* Clearing the accessed bit without a TLB flush > >> > * doesn't cause data corruption. [ It could cause incorrect > >> > * page aging and the (mistaken) reclaim of hot pages, but the > >> > * chance of that should be relatively low. ] > >> > * > >> > * So as a performance optimization don't flush the TLB when > >> > * clearing the accessed bit, it will eventually be flushed by > >> > * a context switch or a VM operation anyway. [ In the rare > >> > * event of it not getting flushed for a long time the delay > >> > * shouldn't really matter because there's no real memory > >> > * pressure for swapout to react to. 
] > >> > */ > >> > " > >> > Running the thpscale to show some obvious improvements for compaction > >> > latency with this patch: > >> > base patched > >> > Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* > >> > Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* > >> > Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* > >> > Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* > >> > Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* > >> > Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* > >> > Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* > >> > Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* > >> > Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* > >> > base patched > >> > Duration User 167.78 161.03 > >> > Duration System 1836.66 1673.01 > >> > Duration Elapsed 2074.58 2059.75 > >> > > >> > Barry Song submitted a similar patch [1] before, that replaces the > >> > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in > >> > folio_referenced_one(). However, I'm not sure if removing the tlb flush > >> > operation is applicable to every architecture in kernel, so dropping > >> > the tlb flush for ARM64 seems a sensible change. > >> > > >> > Note: I am okay for both approach, if someone can help to ensure that > >> > all architectures do not need the tlb flush when clearing the accessed > >> > bit, then I also think Barry's patch is better (hope Barry can resend > >> > his patch). > >> > > >> > >> Thanks! > >> > >> ptep_clear_flush_young() with "flush" in its name clearly says it needs a > >> flush. 
but it happens in arm64, all other code which needs a flush has > >> called other variants, for example __flush_tlb_page_nosync(): > >> > >> static inline void arch_tlbbatch_add_pending(struct > >> arch_tlbflush_unmap_batch *batch, > >> struct mm_struct *mm, unsigned long uaddr) > >> { > >> __flush_tlb_page_nosync(mm, uaddr); > >> } > >> > >> so it seems folio_referenced is the only left user of this > >> ptep_clear_flush_young(). > >> The fact makes Baolin's patch look safe now. > >> > >> but this function still has "flush" in its name, so one day, one person might > >> call it with the understanding that it will flush tlb but actually it > >> won't. This is > >> bad smell in code. > >> > >> I guess one side effect of not flushing tlb while clearing the access > >> flag is that > >> hardware won't see this cleared flag in the tlb, so it might not set this bit in > >> memory even though the bit has been cleared before in memory(but not in tlb) > >> while the page is accessed *again*. > >> > >> next time, someone reads the access flag in memory again by folio_referenced, > >> he/she will see the page is cold as hardware has lost a chance to set > >> the bit again > >> since it finds tlb already has a true access flag. > >> > >> But anyway, tlb is so small, it will be flushed by context switch and > >> other running > >> code soon. so it seems we don't actually require the access flag being instantly > >> updated. the time gap, in which access flag might lose the new set by hardware, > >> seems to be too short to really affect the accuracy of page reclamation. but its > >> cost is large. > >> > >> (A). Constant flush cost vs. (B). very very occasional reclaimed hot > >> page, B might > >> be a correct choice. > > > > Plus, I doubt B is really going to happen. as after a page is promoted to > > the head of lru list or new generation, it needs a long time to slide back > > to the inactive list tail or to the candidate generation of mglru. 
the time
> > should have been large enough for tlb to be flushed. If the page is really
> > hot, the hardware will get second, third, fourth etc opportunity to set an
> > access flag in the long time in which the page is re-moved to the tail
> > as the page can be accessed multiple times if it is really hot.
>
> This might not be true if you have external hardware sharing the page
> tables with software through either HMM or hardware supported ATS
> though.
>
> In those cases I think it's much more likely hardware can still be
> accessing the page even after a context switch on the CPU say. So those
> pages will tend to get reclaimed even though hardware is still actively
> using them which would be quite expensive and I guess could lead to
> thrashing as each page is reclaimed and then immediately faulted back
> in.

I am not quite sure I got your point. Does the external hardware sharing the
CPU's page table have the ability to set the access flag in page table
entries by itself? If yes, I don't see how our approach will hurt, as
folio_referenced can notify the hardware driver and the driver can flush its
own TLB. If no, I don't see it either: since the external hardware can't set
access flags, we have already ignored its references and only know about the
CPU's accesses even in the current mainline code, so we are not getting worse.

Or can the external hardware also see the CPU's TLB? Or is the CPU's TLB
flush also broadcast to the external hardware, so that the external hardware
sees the cleared access flag and can set the access flag in the page table
when it accesses the page? If that is the case, I feel what you said is true.

> Of course TLB flushes are equally (perhaps even more) expensive for this
> kind of external HW so reducing them would still be beneficial. I wonder
> if there's some way they could be deferred until the page is moved to
> the inactive list say?
> > >> > >> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ > >> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > >> > --- > >> > arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- > >> > 1 file changed, 16 insertions(+), 15 deletions(-) > >> > > >> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > >> > index 0bd18de9fd97..2979d796ba9d 100644 > >> > --- a/arch/arm64/include/asm/pgtable.h > >> > +++ b/arch/arm64/include/asm/pgtable.h > >> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, > >> > static inline int ptep_clear_flush_young(struct vm_area_struct *vma, > >> > unsigned long address, pte_t *ptep) > >> > { > >> > - int young = ptep_test_and_clear_young(vma, address, ptep); > >> > - > >> > - if (young) { > >> > - /* > >> > - * We can elide the trailing DSB here since the worst that can > >> > - * happen is that a CPU continues to use the young entry in its > >> > - * TLB and we mistakenly reclaim the associated page. The > >> > - * window for such an event is bounded by the next > >> > - * context-switch, which provides a DSB to complete the TLB > >> > - * invalidation. > >> > - */ > >> > - flush_tlb_page_nosync(vma, address); > >> > - } > >> > - > >> > - return young; > >> > + /* > >> > + * This comment is borrowed from x86, but applies equally to ARM64: > >> > + * > >> > + * Clearing the accessed bit without a TLB flush doesn't cause > >> > + * data corruption. [ It could cause incorrect page aging and > >> > + * the (mistaken) reclaim of hot pages, but the chance of that > >> > + * should be relatively low. ] > >> > + * > >> > + * So as a performance optimization don't flush the TLB when > >> > + * clearing the accessed bit, it will eventually be flushed by > >> > + * a context switch or a VM operation anyway. 
[ In the rare > >> > + * event of it not getting flushed for a long time the delay > >> > + * shouldn't really matter because there's no real memory > >> > + * pressure for swapout to react to. ] > >> > + */ > >> > + return ptep_test_and_clear_young(vma, address, ptep); > >> > } > >> > > >> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > >> > -- > >> > 2.39.3 > >> > > >> > >> Thanks > >> Barry > Thanks Barry
Barry Song <21cnbao@gmail.com> writes: > On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: >> >> >> Barry Song <21cnbao@gmail.com> writes: >> >> > On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: >> >> >> >> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang >> >> <baolin.wang@linux.alibaba.com> wrote: [...] >> >> (A). Constant flush cost vs. (B). very very occasional reclaimed hot >> >> page, B might >> >> be a correct choice. >> > >> > Plus, I doubt B is really going to happen. as after a page is promoted to >> > the head of lru list or new generation, it needs a long time to slide back >> > to the inactive list tail or to the candidate generation of mglru. the time >> > should have been large enough for tlb to be flushed. If the page is really >> > hot, the hardware will get second, third, fourth etc opportunity to set an >> > access flag in the long time in which the page is re-moved to the tail >> > as the page can be accessed multiple times if it is really hot. >> >> This might not be true if you have external hardware sharing the page >> tables with software through either HMM or hardware supported ATS >> though. >> >> In those cases I think it's much more likely hardware can still be >> accessing the page even after a context switch on the CPU say. So those >> pages will tend to get reclaimed even though hardware is still actively >> using them which would be quite expensive and I guess could lead to >> thrashing as each page is reclaimed and then immediately faulted back >> in. > > i am not quite sure i got your point. has the external hardware sharing cpu's > pagetable the ability to set access flag in page table entries by > itself? if yes, > I don't see how our approach will hurt as folio_referenced can notify the > hardware driver and the driver can flush its own tlb. 
If no, i don't see
> either as the external hardware can't set access flags, that means we
> have ignored its reference and only knows cpu's access even in the current
> mainline code. so we are not getting worse.
>
> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> also broadcast to external hardware, then external hardware sees the
> cleared access flag, thus, it can set access flag in page table when the
> hardware access it? If this is the case, I feel what you said is true.

Perhaps it would help if I gave a concrete example. Take for example the
ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of two
ways depending on the specific HW implementation.

If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
invalidations. If BTM is not supported it relies on SW to explicitly
forward TLB invalidations via MMU notifiers.

Now consider the case where some external device is accessing mappings
via the SMMU. The access flag will be cached in the SMMU TLB. If we clear
the access flag without a TLB invalidate, the access flag in the CPU page
table will not get updated because it's already set in the SMMU TLB.

As an aside, access flag updates happen in one of two ways. If the SMMU HW
supports hardware translation table updates (HTTU) then hardware will
manage updating access/dirty flags as required. If this is not supported
then SW is relied on to update these flags, which in practice means taking
a minor fault. But I don't think that is relevant here - in either case,
without a TLB invalidate neither of those things will happen.

I suppose drivers could implement the clear_flush_young() MMU notifier
callback (none do at the moment AFAICT), but then won't that just lead to
the opposite problem - that every page ever used by an external device
remains active and unavailable for reclaim because the access flag never
gets cleared?
I suppose they could do the flush then, which would ensure the page is
marked inactive if it's not referenced between the two folio_referenced()
calls. But that requires changes to those drivers. From memory, the SMMU
doesn't even register for notifiers if BTM is supported.

 - Alistair

>>
>> Of course TLB flushes are equally (perhaps even more) expensive for this
>> kind of external HW so reducing them would still be beneficial. I wonder
>> if there's some way they could be deferred until the page is moved to
>> the inactive list say?
>>
>> >>
>> >> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>> >> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> >> > ---
>> >> >  arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
>> >> >  1 file changed, 16 insertions(+), 15 deletions(-)
>> >> >
>> >> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> >> > index 0bd18de9fd97..2979d796ba9d 100644
>> >> > --- a/arch/arm64/include/asm/pgtable.h
>> >> > +++ b/arch/arm64/include/asm/pgtable.h
>> >> > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> >> >  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> >> >  					 unsigned long address, pte_t *ptep)
>> >> >  {
>> >> > -	int young = ptep_test_and_clear_young(vma, address, ptep);
>> >> > -
>> >> > -	if (young) {
>> >> > -		/*
>> >> > -		 * We can elide the trailing DSB here since the worst that can
>> >> > -		 * happen is that a CPU continues to use the young entry in its
>> >> > -		 * TLB and we mistakenly reclaim the associated page. The
>> >> > -		 * window for such an event is bounded by the next
>> >> > -		 * context-switch, which provides a DSB to complete the TLB
>> >> > -		 * invalidation.
>> >> > - */ >> >> > - flush_tlb_page_nosync(vma, address); >> >> > - } >> >> > - >> >> > - return young; >> >> > + /* >> >> > + * This comment is borrowed from x86, but applies equally to ARM64: >> >> > + * >> >> > + * Clearing the accessed bit without a TLB flush doesn't cause >> >> > + * data corruption. [ It could cause incorrect page aging and >> >> > + * the (mistaken) reclaim of hot pages, but the chance of that >> >> > + * should be relatively low. ] >> >> > + * >> >> > + * So as a performance optimization don't flush the TLB when >> >> > + * clearing the accessed bit, it will eventually be flushed by >> >> > + * a context switch or a VM operation anyway. [ In the rare >> >> > + * event of it not getting flushed for a long time the delay >> >> > + * shouldn't really matter because there's no real memory >> >> > + * pressure for swapout to react to. ] >> >> > + */ >> >> > + return ptep_test_and_clear_young(vma, address, ptep); >> >> > } >> >> > >> >> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> >> > -- >> >> > 2.39.3 >> >> > >> >> >> >> Thanks >> >> Barry >> > Thanks > Barry
On 10/25/2023 7:31 AM, Barry Song wrote: > On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: >> >> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang >> <baolin.wang@linux.alibaba.com> wrote: >>> >>> Now ptep_clear_flush_young() is only called by folio_referenced() to >>> check if the folio was referenced, and now it will call a tlb flush on >>> ARM64 architecture. However the tlb flush can be expensive on ARM64 >>> servers, especially for the systems with a large CPU numbers. >>> >>> Similar to the x86 architecture, below comments also apply equally to >>> ARM64 architecture. So we can drop the tlb flush operation in >>> ptep_clear_flush_young() on ARM64 architecture to improve the performance. >>> " >>> /* Clearing the accessed bit without a TLB flush >>> * doesn't cause data corruption. [ It could cause incorrect >>> * page aging and the (mistaken) reclaim of hot pages, but the >>> * chance of that should be relatively low. ] >>> * >>> * So as a performance optimization don't flush the TLB when >>> * clearing the accessed bit, it will eventually be flushed by >>> * a context switch or a VM operation anyway. [ In the rare >>> * event of it not getting flushed for a long time the delay >>> * shouldn't really matter because there's no real memory >>> * pressure for swapout to react to. 
] >>> */ >>> " >>> Running the thpscale to show some obvious improvements for compaction >>> latency with this patch: >>> base patched >>> Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* >>> Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* >>> Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* >>> Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* >>> Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* >>> Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* >>> Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* >>> Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* >>> Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* >>> base patched >>> Duration User 167.78 161.03 >>> Duration System 1836.66 1673.01 >>> Duration Elapsed 2074.58 2059.75 >>> >>> Barry Song submitted a similar patch [1] before, that replaces the >>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in >>> folio_referenced_one(). However, I'm not sure if removing the tlb flush >>> operation is applicable to every architecture in kernel, so dropping >>> the tlb flush for ARM64 seems a sensible change. >>> >>> Note: I am okay for both approach, if someone can help to ensure that >>> all architectures do not need the tlb flush when clearing the accessed >>> bit, then I also think Barry's patch is better (hope Barry can resend >>> his patch). >>> >> >> Thanks! >> >> ptep_clear_flush_young() with "flush" in its name clearly says it needs a >> flush. but it happens in arm64, all other code which needs a flush has >> called other variants, for example __flush_tlb_page_nosync(): >> >> static inline void arch_tlbbatch_add_pending(struct >> arch_tlbflush_unmap_batch *batch, >> struct mm_struct *mm, unsigned long uaddr) >> { >> __flush_tlb_page_nosync(mm, uaddr); >> } >> >> so it seems folio_referenced is the only left user of this >> ptep_clear_flush_young(). >> The fact makes Baolin's patch look safe now. 
>>
>> but this function still has "flush" in its name, so one day, one person might
>> call it with the understanding that it will flush tlb but actually it
>> won't. This is bad smell in code.

Agree. I think this is just a start; we can replace ptep_clear_flush_young()
once other architectures have completed the conversion, if we can confirm
that other architectures also do not require a tlb flush when clearing the
accessed bit.

>> I guess one side effect of not flushing tlb while clearing the access
>> flag is that hardware won't see this cleared flag in the tlb, so it might
>> not set this bit in memory even though the bit has been cleared before in
>> memory (but not in tlb) while the page is accessed *again*.
>>
>> next time, someone reads the access flag in memory again by folio_referenced,
>> he/she will see the page is cold as hardware has lost a chance to set
>> the bit again since it finds tlb already has a true access flag.
>>
>> But anyway, tlb is so small, it will be flushed by context switch and
>> other running code soon. so it seems we don't actually require the access
>> flag being instantly updated. the time gap, in which access flag might lose
>> the new set by hardware, seems to be too short to really affect the accuracy
>> of page reclamation. but its cost is large.
>>
>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
>> page, B might be a correct choice.
>
> Plus, I doubt B is really going to happen. as after a page is promoted to
> the head of lru list or new generation, it needs a long time to slide back
> to the inactive list tail or to the candidate generation of mglru. the time
> should have been large enough for tlb to be flushed. If the page is really
> hot, the hardware will get second, third, fourth etc opportunity to set an
> access flag in the long time in which the page is re-moved to the tail
> as the page can be accessed multiple times if it is really hot.
Thanks Barry, that's also what I thought. On the other hand, even if there is no tlb flush for a long time, I think the system is not under memory pressure at that time, so the incorrect page aging would not have much impact.
On 10/25/2023 9:58 AM, Alistair Popple wrote: > > Barry Song <21cnbao@gmail.com> writes: > >> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: >>> >>> >>> Barry Song <21cnbao@gmail.com> writes: >>> >>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: >>>>> >>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang >>>>> <baolin.wang@linux.alibaba.com> wrote: > > [...] > >>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot >>>>> page, B might >>>>> be a correct choice. >>>> >>>> Plus, I doubt B is really going to happen. as after a page is promoted to >>>> the head of lru list or new generation, it needs a long time to slide back >>>> to the inactive list tail or to the candidate generation of mglru. the time >>>> should have been large enough for tlb to be flushed. If the page is really >>>> hot, the hardware will get second, third, fourth etc opportunity to set an >>>> access flag in the long time in which the page is re-moved to the tail >>>> as the page can be accessed multiple times if it is really hot. >>> >>> This might not be true if you have external hardware sharing the page >>> tables with software through either HMM or hardware supported ATS >>> though. >>> >>> In those cases I think it's much more likely hardware can still be >>> accessing the page even after a context switch on the CPU say. So those >>> pages will tend to get reclaimed even though hardware is still actively >>> using them which would be quite expensive and I guess could lead to >>> thrashing as each page is reclaimed and then immediately faulted back >>> in. That's possible, but the chance should be relatively low. At least on x86, I have not heard of this issue. >> i am not quite sure i got your point. has the external hardware sharing cpu's >> pagetable the ability to set access flag in page table entries by >> itself? 
if yes, >> I don't see how our approach will hurt as folio_referenced can notify the >> hardware driver and the driver can flush its own tlb. If no, i don't see >> either as the external hardware can't set access flags, that means we >> have ignored its reference and only knows cpu's access even in the current >> mainline code. so we are not getting worse. >> >> so the external hardware can also see cpu's TLB? or cpu's tlb flush can >> also broadcast to external hardware, then external hardware sees the >> cleared access flag, thus, it can set access flag in page table when the >> hardware access it? If this is the case, I feel what you said is true. > > Perhaps it would help if I gave a concrete example. Take for example the > ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of > two ways depending on the specific HW implementation. > > If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB > invalidations. If BTM is not supported it relies on SW to explicitly > forward TLB invalidations via MMU notifiers. On our ARM64 hardware, we rely on BTM to maintain TLB coherency. > Now consider the case where some external device is accessing mappings > via the SMMU. The access flag will be cached in the SMMU TLB. If we > clear the access flag without a TLB invalidate the access flag in the > CPU page table will not get updated because it's already set in the SMMU > TLB. > > As an aside access flag updates happen in one of two ways. If the SMMU > HW supports hardware translation table updates (HTTU) then hardware will > manage updating access/dirty flags as required. If this is not supported > then SW is relied on to update these flags which in practice means > taking a minor fault. But I don't think that is relevant here - in > either case without a TLB invalidate neither of those things will > happen. 
> > I suppose drivers could implement the clear_flush_young() MMU notifier > callback (none do at the moment AFAICT) but then won't that just lead to > the opposite problem - that every page ever used by an external device > remains active and unavailable for reclaim because the access flag never > gets cleared? I suppose they could do the flush then which would ensure Yes, I think so too. Perhaps the reason there is currently no problem is that there are no actual use cases at the moment? At least in Alibaba's fleet, the SMMU and MMU do not share page tables now. > the page is marked inactive if it's not referenced between the two > folio_referenced() calls. > > But that requires changes to those drivers. SMMU from memory doesn't > even register for notifiers if BTM is supported. > > - Alistair > >>> >>> Of course TLB flushes are equally (perhaps even more) expensive for this >>> kind of external HW so reducing them would still be beneficial. I wonder >>> if there's some way they could be deferred until the page is moved to >>> the inactive list say? 
>>> >>>>> >>>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ >>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>> --- >>>>>> arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- >>>>>> 1 file changed, 16 insertions(+), 15 deletions(-) >>>>>> >>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>>>>> index 0bd18de9fd97..2979d796ba9d 100644 >>>>>> --- a/arch/arm64/include/asm/pgtable.h >>>>>> +++ b/arch/arm64/include/asm/pgtable.h >>>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >>>>>> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >>>>>> unsigned long address, pte_t *ptep) >>>>>> { >>>>>> - int young = ptep_test_and_clear_young(vma, address, ptep); >>>>>> - >>>>>> - if (young) { >>>>>> - /* >>>>>> - * We can elide the trailing DSB here since the worst that can >>>>>> - * happen is that a CPU continues to use the young entry in its >>>>>> - * TLB and we mistakenly reclaim the associated page. The >>>>>> - * window for such an event is bounded by the next >>>>>> - * context-switch, which provides a DSB to complete the TLB >>>>>> - * invalidation. >>>>>> - */ >>>>>> - flush_tlb_page_nosync(vma, address); >>>>>> - } >>>>>> - >>>>>> - return young; >>>>>> + /* >>>>>> + * This comment is borrowed from x86, but applies equally to ARM64: >>>>>> + * >>>>>> + * Clearing the accessed bit without a TLB flush doesn't cause >>>>>> + * data corruption. [ It could cause incorrect page aging and >>>>>> + * the (mistaken) reclaim of hot pages, but the chance of that >>>>>> + * should be relatively low. ] >>>>>> + * >>>>>> + * So as a performance optimization don't flush the TLB when >>>>>> + * clearing the accessed bit, it will eventually be flushed by >>>>>> + * a context switch or a VM operation anyway. 
[ In the rare >>>>>> + * event of it not getting flushed for a long time the delay >>>>>> + * shouldn't really matter because there's no real memory >>>>>> + * pressure for swapout to react to. ] >>>>>> + */ >>>>>> + return ptep_test_and_clear_young(vma, address, ptep); >>>>>> } >>>>>> >>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>> -- >>>>>> 2.39.3 >>>>>> >>>>> >>>>> Thanks >>>>> Barry >>> >> Thanks >> Barry
On 10/25/2023 9:39 AM, Yin, Fengwei wrote: > >>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>> index 0bd18de9fd97..2979d796ba9d 100644 >>> --- a/arch/arm64/include/asm/pgtable.h >>> +++ b/arch/arm64/include/asm/pgtable.h >>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >>> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >>> unsigned long address, pte_t *ptep) >>> { >>> - int young = ptep_test_and_clear_young(vma, address, ptep); >>> - >>> - if (young) { >>> - /* >>> - * We can elide the trailing DSB here since the worst that can >>> - * happen is that a CPU continues to use the young entry in its >>> - * TLB and we mistakenly reclaim the associated page. The >>> - * window for such an event is bounded by the next >>> - * context-switch, which provides a DSB to complete the TLB >>> - * invalidation. >>> - */ >>> - flush_tlb_page_nosync(vma, address); >>> - } >>> - >>> - return young; >>> + /* >>> + * This comment is borrowed from x86, but applies equally to ARM64: >>> + * >>> + * Clearing the accessed bit without a TLB flush doesn't cause >>> + * data corruption. [ It could cause incorrect page aging and >>> + * the (mistaken) reclaim of hot pages, but the chance of that >>> + * should be relatively low. ] >>> + * >>> + * So as a performance optimization don't flush the TLB when >>> + * clearing the accessed bit, it will eventually be flushed by >>> + * a context switch or a VM operation anyway. [ In the rare >>> + * event of it not getting flushed for a long time the delay >>> + * shouldn't really matter because there's no real memory >>> + * pressure for swapout to react to. ] >>> + */ >>> + return ptep_test_and_clear_young(vma, address, ptep); >>> } > From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/: > > This is blindly copied from x86 and isn't true for us: we don't invalidate > the TLB on context switch. 
That means our window for keeping the stale > entries around is potentially much bigger and might not be a great idea. > > > My understanding is that arm64 doesn't invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is the same as x86 IIUC. > switch make sure the TLB is invalidated during context switch. > So we can't remove flush_tlb_page_nosync() here? Or something was changed > for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing > may be missed)? Thanks. IMHO, the TLB can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming a hot page is relatively low; at least on x86, we did not see any heavy refault issues. For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we have not found any problems since deploying it to ARM servers.
On 10/25/2023 11:03 AM, Baolin Wang wrote: >> >> My understanding is that arm64 doesn't do invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context > > Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is same with x86 IIUC. If we remove flush_tlb_page_nosync(), can we still claim TLB is flushed during context switch for ARM64? > >> switch make sure the TLB is invalidated during context switch. >> So we can't remove flush_tlb_page_nosync() here? Or something was changed >> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing >> may be missed)? Thanks. > > IMHO, the tlb can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming hot page is relatively low, at least on X86, we did not see any heavy refault issue. > > For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we did not find any problems until now since deploying to ARM servers.
Baolin Wang <baolin.wang@linux.alibaba.com> writes: > On 10/25/2023 9:58 AM, Alistair Popple wrote: >> Barry Song <21cnbao@gmail.com> writes: >> >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: >>>> >>>> >>>> Barry Song <21cnbao@gmail.com> writes: >>>> >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: >>>>>> >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang >>>>>> <baolin.wang@linux.alibaba.com> wrote: >> [...] >> >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot >>>>>> page, B might >>>>>> be a correct choice. >>>>> >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to >>>>> the head of lru list or new generation, it needs a long time to slide back >>>>> to the inactive list tail or to the candidate generation of mglru. the time >>>>> should have been large enough for tlb to be flushed. If the page is really >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an >>>>> access flag in the long time in which the page is re-moved to the tail >>>>> as the page can be accessed multiple times if it is really hot. >>>> >>>> This might not be true if you have external hardware sharing the page >>>> tables with software through either HMM or hardware supported ATS >>>> though. >>>> >>>> In those cases I think it's much more likely hardware can still be >>>> accessing the page even after a context switch on the CPU say. So those >>>> pages will tend to get reclaimed even though hardware is still actively >>>> using them which would be quite expensive and I guess could lead to >>>> thrashing as each page is reclaimed and then immediately faulted back >>>> in. > > That's possible, but the chance should be relatively low. At least on > x86, I have not heard of this issue. Personally I've never seen any x86 system that shares page tables with external devices, other than with HMM. More on that below. >>> i am not quite sure i got your point. 
has the external hardware sharing cpu's >>> pagetable the ability to set access flag in page table entries by >>> itself? if yes, >>> I don't see how our approach will hurt as folio_referenced can notify the >>> hardware driver and the driver can flush its own tlb. If no, i don't see >>> either as the external hardware can't set access flags, that means we >>> have ignored its reference and only knows cpu's access even in the current >>> mainline code. so we are not getting worse. >>> >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can >>> also broadcast to external hardware, then external hardware sees the >>> cleared access flag, thus, it can set access flag in page table when the >>> hardware access it? If this is the case, I feel what you said is true. >> Perhaps it would help if I gave a concrete example. Take for example >> the >> ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of >> two ways depending on the specific HW implementation. >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU >> TLB >> invalidations. If BTM is not supported it relies on SW to explicitly >> forward TLB invalidations via MMU notifiers. > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency. Lucky you :-) The ARM64 SMMU architecture specification supports the possibility of both, as does the driver. Not that I think whether or not BTM is supported has much relevance to this issue. >> Now consider the case where some external device is accessing mappings >> via the SMMU. The access flag will be cached in the SMMU TLB. If we >> clear the access flag without a TLB invalidate the access flag in the >> CPU page table will not get updated because it's already set in the SMMU >> TLB. >> As an aside access flag updates happen in one of two ways. If the >> SMMU >> HW supports hardware translation table updates (HTTU) then hardware will >> manage updating access/dirty flags as required. 
If this is not supported >> then SW is relied on to update these flags which in practice means >> taking a minor fault. But I don't think that is relevant here - in >> either case without a TLB invalidate neither of those things will >> happen. >> I suppose drivers could implement the clear_flush_young() MMU >> notifier >> callback (none do at the moment AFAICT) but then won't that just lead to >> the opposite problem - that every page ever used by an external device >> remains active and unavailable for reclaim because the access flag never >> gets cleared? I suppose they could do the flush then which would ensure > > Yes, I think so too. The reason there is currently no problem, perhaps > I think, there are no actual use cases at the moment? At least on our > Alibaba's fleet, SMMU and MMU do not share page tables now. We have systems that do. Also HMM (used by Nouveau and AMD among others) is a SW implementation of page table sharing and would suffer similar issues. That said if the flush is already being skipped on x86 then it's already an issue for HMM. HMM based drivers can at least deal with this by implementing the clear_flush_young() notifier though. The same doesn't apply to eg. the SMMU driver. >> the page is marked inactive if it's not referenced between the two >> folio_referenced calls(). >> But that requires changes to those drivers. SMMU from memory doesn't >> even register for notifiers if BTM is supported. >> - Alistair >> >>>> >>>> Of course TLB flushes are equally (perhaps even more) expensive for this >>>> kind of external HW so reducing them would still be beneficial. I wonder >>>> if there's some way they could be deferred until the page is moved to >>>> the inactive list say? 
>>>> >>>>>> >>>>>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ >>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>>> --- >>>>>>> arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- >>>>>>> 1 file changed, 16 insertions(+), 15 deletions(-) >>>>>>> >>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>>>>>> index 0bd18de9fd97..2979d796ba9d 100644 >>>>>>> --- a/arch/arm64/include/asm/pgtable.h >>>>>>> +++ b/arch/arm64/include/asm/pgtable.h >>>>>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >>>>>>> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >>>>>>> unsigned long address, pte_t *ptep) >>>>>>> { >>>>>>> - int young = ptep_test_and_clear_young(vma, address, ptep); >>>>>>> - >>>>>>> - if (young) { >>>>>>> - /* >>>>>>> - * We can elide the trailing DSB here since the worst that can >>>>>>> - * happen is that a CPU continues to use the young entry in its >>>>>>> - * TLB and we mistakenly reclaim the associated page. The >>>>>>> - * window for such an event is bounded by the next >>>>>>> - * context-switch, which provides a DSB to complete the TLB >>>>>>> - * invalidation. >>>>>>> - */ >>>>>>> - flush_tlb_page_nosync(vma, address); >>>>>>> - } >>>>>>> - >>>>>>> - return young; >>>>>>> + /* >>>>>>> + * This comment is borrowed from x86, but applies equally to ARM64: >>>>>>> + * >>>>>>> + * Clearing the accessed bit without a TLB flush doesn't cause >>>>>>> + * data corruption. [ It could cause incorrect page aging and >>>>>>> + * the (mistaken) reclaim of hot pages, but the chance of that >>>>>>> + * should be relatively low. ] >>>>>>> + * >>>>>>> + * So as a performance optimization don't flush the TLB when >>>>>>> + * clearing the accessed bit, it will eventually be flushed by >>>>>>> + * a context switch or a VM operation anyway. 
[ In the rare >>>>>>> + * event of it not getting flushed for a long time the delay >>>>>>> + * shouldn't really matter because there's no real memory >>>>>>> + * pressure for swapout to react to. ] >>>>>>> + */ >>>>>>> + return ptep_test_and_clear_young(vma, address, ptep); >>>>>>> } >>>>>>> >>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>> -- >>>>>>> 2.39.3 >>>>>>> >>>>>> >>>>>> Thanks >>>>>> Barry >>>> >>> Thanks >>> Barry
On 10/25/2023 11:08 AM, Yin, Fengwei wrote: > > > On 10/25/2023 11:03 AM, Baolin Wang wrote: >>> >>> My understanding is that arm64 doesn't invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context >> >> Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is the same as x86 IIUC. > If we remove flush_tlb_page_nosync(), can we still claim TLB is flushed during > context switch for ARM64? To be more precise, it is necessary to add prerequisite conditions, such as the ASID being exhausted. I can update the comment. >>> switch make sure the TLB is invalidated during context switch. >>> So we can't remove flush_tlb_page_nosync() here? Or something was changed >>> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing >>> may be missed)? Thanks. >> >> IMHO, the TLB can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming a hot page is relatively low; at least on x86, we did not see any heavy refault issues. >> >> For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we have not found any problems since deploying it to ARM servers.
On Wed, Oct 25, 2023 at 11:09 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: > > > > On 10/25/2023 11:03 AM, Baolin Wang wrote: > >> > >> My understanding is that arm64 doesn't invalidate the TLB during > context switch. The flush_tlb_page_nosync() here + DSB during context > > > > Yes, we only perform a TLB flush when the ASID is exhausted during context switch, and I think this is the same as x86 IIUC. > If we remove flush_tlb_page_nosync(), can we still claim TLB is flushed during > context switch for ARM64? No. In this case, we would have to depend on hardware replacement of TLB entries on the other CPUs. Considering that the LRU list is really long and the TLB is really small, a hot page needs a long time to move back to the inactive tail, so the other CPUs' TLB entries are likely to have been replaced somewhere along the way. But there is a very small chance that other CPUs will miss setting the access flag in some corner cases. Now I have a better understanding of why this nosync TLB flush is still there, though in practice removing it does not seem to hurt. > > > > >> switch make sure the TLB is invalidated during context switch. > >> So we can't remove flush_tlb_page_nosync() here? Or something was changed > >> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing > >> may be missed)? Thanks. > > > > IMHO, the TLB can be easily evicted or flushed if the system is under memory pressure, so like Barry said, the chance of reclaiming a hot page is relatively low; at least on x86, we did not see any heavy refault issues. > > > > For MGLRU, it uses ptep_test_and_clear_young() instead of ptep_clear_flush_young_notify(), and we have not found any problems since deploying it to ARM servers. Thanks Barry
On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote: > > > Baolin Wang <baolin.wang@linux.alibaba.com> writes: > > > On 10/25/2023 9:58 AM, Alistair Popple wrote: > >> Barry Song <21cnbao@gmail.com> writes: > >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: > >>>> > >>>> > >>>> Barry Song <21cnbao@gmail.com> writes: > >>>> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: > >>>>>> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang > >>>>>> <baolin.wang@linux.alibaba.com> wrote: > >> [...] > >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot > >>>>>> page, B might > >>>>>> be a correct choice. > >>>>> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to > >>>>> the head of lru list or new generation, it needs a long time to slide back > >>>>> to the inactive list tail or to the candidate generation of mglru. the time > >>>>> should have been large enough for tlb to be flushed. If the page is really > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an > >>>>> access flag in the long time in which the page is re-moved to the tail > >>>>> as the page can be accessed multiple times if it is really hot. > >>>> > >>>> This might not be true if you have external hardware sharing the page > >>>> tables with software through either HMM or hardware supported ATS > >>>> though. > >>>> > >>>> In those cases I think it's much more likely hardware can still be > >>>> accessing the page even after a context switch on the CPU say. So those > >>>> pages will tend to get reclaimed even though hardware is still actively > >>>> using them which would be quite expensive and I guess could lead to > >>>> thrashing as each page is reclaimed and then immediately faulted back > >>>> in. > > > > That's possible, but the chance should be relatively low. At least on > > x86, I have not heard of this issue. 
> > Personally I've never seen any x86 system that shares page tables with > external devices, other than with HMM. More on that below. > > >>> i am not quite sure i got your point. has the external hardware sharing cpu's > >>> pagetable the ability to set access flag in page table entries by > >>> itself? if yes, > >>> I don't see how our approach will hurt as folio_referenced can notify the > >>> hardware driver and the driver can flush its own tlb. If no, i don't see > >>> either as the external hardware can't set access flags, that means we > >>> have ignored its reference and only knows cpu's access even in the current > >>> mainline code. so we are not getting worse. > >>> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can > >>> also broadcast to external hardware, then external hardware sees the > >>> cleared access flag, thus, it can set access flag in page table when the > >>> hardware access it? If this is the case, I feel what you said is true. > >> Perhaps it would help if I gave a concrete example. Take for example > >> the > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of > >> two ways depending on the specific HW implementation. > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU > >> TLB > >> invalidations. If BTM is not supported it relies on SW to explicitly > >> forward TLB invalidations via MMU notifiers. > > > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency. > > Lucky you :-) > > ARM64 SMMU architecture specification supports the possibilty of both, > as does the driver. Not that I think whether or not BTM is supported has > much relevance to this issue. > > >> Now consider the case where some external device is accessing mappings > >> via the SMMU. The access flag will be cached in the SMMU TLB. 
If we > >> clear the access flag without a TLB invalidate the access flag in the > >> CPU page table will not get updated because it's already set in the SMMU > >> TLB. > >> As an aside access flag updates happen in one of two ways. If the > >> SMMU > >> HW supports hardware translation table updates (HTTU) then hardware will > >> manage updating access/dirty flags as required. If this is not supported > >> then SW is relied on to update these flags which in practice means > >> taking a minor fault. But I don't think that is relevant here - in > >> either case without a TLB invalidate neither of those things will > >> happen. > >> I suppose drivers could implement the clear_flush_young() MMU > >> notifier > >> callback (none do at the moment AFAICT) but then won't that just lead to > >> the opposite problem - that every page ever used by an external device > >> remains active and unavailable for reclaim because the access flag never > >> gets cleared? I suppose they could do the flush then which would ensure > > > > Yes, I think so too. The reason there is currently no problem, perhaps > > I think, there are no actual use cases at the moment? At least on our > > Alibaba's fleet, SMMU and MMU do not share page tables now. > > We have systems that do. Just curious: do those systems run the Linux kernel? If so, are pages shared with SMMU pinned? If not, then how are IO PFs handled after pages are reclaimed?
On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote: > > On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote: > > > > > > Baolin Wang <baolin.wang@linux.alibaba.com> writes: > > > > > On 10/25/2023 9:58 AM, Alistair Popple wrote: > > >> Barry Song <21cnbao@gmail.com> writes: > > >> > > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: > > >>>> > > >>>> > > >>>> Barry Song <21cnbao@gmail.com> writes: > > >>>> > > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: > > >>>>>> > > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang > > >>>>>> <baolin.wang@linux.alibaba.com> wrote: > > >> [...] > > >> > > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot > > >>>>>> page, B might > > >>>>>> be a correct choice. > > >>>>> > > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to > > >>>>> the head of lru list or new generation, it needs a long time to slide back > > >>>>> to the inactive list tail or to the candidate generation of mglru. the time > > >>>>> should have been large enough for tlb to be flushed. If the page is really > > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an > > >>>>> access flag in the long time in which the page is re-moved to the tail > > >>>>> as the page can be accessed multiple times if it is really hot. > > >>>> > > >>>> This might not be true if you have external hardware sharing the page > > >>>> tables with software through either HMM or hardware supported ATS > > >>>> though. > > >>>> > > >>>> In those cases I think it's much more likely hardware can still be > > >>>> accessing the page even after a context switch on the CPU say. 
So those > > >>>> pages will tend to get reclaimed even though hardware is still actively > > >>>> using them which would be quite expensive and I guess could lead to > > >>>> thrashing as each page is reclaimed and then immediately faulted back > > >>>> in. > > > > > > That's possible, but the chance should be relatively low. At least on > > > x86, I have not heard of this issue. > > > > Personally I've never seen any x86 system that shares page tables with > > external devices, other than with HMM. More on that below. > > > > >>> i am not quite sure i got your point. has the external hardware sharing cpu's > > >>> pagetable the ability to set access flag in page table entries by > > >>> itself? if yes, > > >>> I don't see how our approach will hurt as folio_referenced can notify the > > >>> hardware driver and the driver can flush its own tlb. If no, i don't see > > >>> either as the external hardware can't set access flags, that means we > > >>> have ignored its reference and only knows cpu's access even in the current > > >>> mainline code. so we are not getting worse. > > >>> > > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can > > >>> also broadcast to external hardware, then external hardware sees the > > >>> cleared access flag, thus, it can set access flag in page table when the > > >>> hardware access it? If this is the case, I feel what you said is true. > > >> Perhaps it would help if I gave a concrete example. Take for example > > >> the > > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of > > >> two ways depending on the specific HW implementation. > > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU > > >> TLB > > >> invalidations. If BTM is not supported it relies on SW to explicitly > > >> forward TLB invalidations via MMU notifiers. > > > > > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency. 
> > > > Lucky you :-) > > > > ARM64 SMMU architecture specification supports the possibilty of both, > > as does the driver. Not that I think whether or not BTM is supported has > > much relevance to this issue. > > > > >> Now consider the case where some external device is accessing mappings > > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we > > >> clear the access flag without a TLB invalidate the access flag in the > > >> CPU page table will not get updated because it's already set in the SMMU > > >> TLB. > > >> As an aside access flag updates happen in one of two ways. If the > > >> SMMU > > >> HW supports hardware translation table updates (HTTU) then hardware will > > >> manage updating access/dirty flags as required. If this is not supported > > >> then SW is relied on to update these flags which in practice means > > >> taking a minor fault. But I don't think that is relevant here - in > > >> either case without a TLB invalidate neither of those things will > > >> happen. > > >> I suppose drivers could implement the clear_flush_young() MMU > > >> notifier > > >> callback (none do at the moment AFAICT) but then won't that just lead to > > >> the opposite problem - that every page ever used by an external device > > >> remains active and unavailable for reclaim because the access flag never > > >> gets cleared? I suppose they could do the flush then which would ensure > > > > > > Yes, I think so too. The reason there is currently no problem, perhaps > > > I think, there are no actual use cases at the moment? At least on our > > > Alibaba's fleet, SMMU and MMU do not share page tables now. > > > > We have systems that do. > > Just curious: do those systems run the Linux kernel? If so, are pages > shared with SMMU pinned? If not, then how are IO PFs handled after > pages are reclaimed? it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in I/O PF, so finally it runs the same codes to get page back just like CPU's PF. 
Years ago, we recommended a pin solution, but obviously there was a lot of pushback: https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/ Thanks Barry
Barry Song <21cnbao@gmail.com> writes: > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote: >> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote: >> > >> > >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes: >> > >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote: >> > >> Barry Song <21cnbao@gmail.com> writes: >> > >> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: >> > >>>> >> > >>>> >> > >>>> Barry Song <21cnbao@gmail.com> writes: >> > >>>> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: >> > >>>>>> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote: >> > >> [...] >> > >> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot >> > >>>>>> page, B might >> > >>>>>> be a correct choice. >> > >>>>> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to >> > >>>>> the head of lru list or new generation, it needs a long time to slide back >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time >> > >>>>> should have been large enough for tlb to be flushed. If the page is really >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an >> > >>>>> access flag in the long time in which the page is re-moved to the tail >> > >>>>> as the page can be accessed multiple times if it is really hot. >> > >>>> >> > >>>> This might not be true if you have external hardware sharing the page >> > >>>> tables with software through either HMM or hardware supported ATS >> > >>>> though. >> > >>>> >> > >>>> In those cases I think it's much more likely hardware can still be >> > >>>> accessing the page even after a context switch on the CPU say. 
So those >> > >>>> pages will tend to get reclaimed even though hardware is still actively >> > >>>> using them which would be quite expensive and I guess could lead to >> > >>>> thrashing as each page is reclaimed and then immediately faulted back >> > >>>> in. >> > > >> > > That's possible, but the chance should be relatively low. At least on >> > > x86, I have not heard of this issue. >> > >> > Personally I've never seen any x86 system that shares page tables with >> > external devices, other than with HMM. More on that below. >> > >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's >> > >>> pagetable the ability to set access flag in page table entries by >> > >>> itself? if yes, >> > >>> I don't see how our approach will hurt as folio_referenced can notify the >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see >> > >>> either as the external hardware can't set access flags, that means we >> > >>> have ignored its reference and only knows cpu's access even in the current >> > >>> mainline code. so we are not getting worse. >> > >>> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can >> > >>> also broadcast to external hardware, then external hardware sees the >> > >>> cleared access flag, thus, it can set access flag in page table when the >> > >>> hardware access it? If this is the case, I feel what you said is true. >> > >> Perhaps it would help if I gave a concrete example. Take for example >> > >> the >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of >> > >> two ways depending on the specific HW implementation. >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU >> > >> TLB >> > >> invalidations. If BTM is not supported it relies on SW to explicitly >> > >> forward TLB invalidations via MMU notifiers. >> > > >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency. 
>> > >> > Lucky you :-) >> > >> > ARM64 SMMU architecture specification supports the possibilty of both, >> > as does the driver. Not that I think whether or not BTM is supported has >> > much relevance to this issue. >> > >> > >> Now consider the case where some external device is accessing mappings >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we >> > >> clear the access flag without a TLB invalidate the access flag in the >> > >> CPU page table will not get updated because it's already set in the SMMU >> > >> TLB. >> > >> As an aside access flag updates happen in one of two ways. If the >> > >> SMMU >> > >> HW supports hardware translation table updates (HTTU) then hardware will >> > >> manage updating access/dirty flags as required. If this is not supported >> > >> then SW is relied on to update these flags which in practice means >> > >> taking a minor fault. But I don't think that is relevant here - in >> > >> either case without a TLB invalidate neither of those things will >> > >> happen. >> > >> I suppose drivers could implement the clear_flush_young() MMU >> > >> notifier >> > >> callback (none do at the moment AFAICT) but then won't that just lead to >> > >> the opposite problem - that every page ever used by an external device >> > >> remains active and unavailable for reclaim because the access flag never >> > >> gets cleared? I suppose they could do the flush then which would ensure >> > > >> > > Yes, I think so too. The reason there is currently no problem, perhaps >> > > I think, there are no actual use cases at the moment? At least on our >> > > Alibaba's fleet, SMMU and MMU do not share page tables now. >> > >> > We have systems that do. >> >> Just curious: do those systems run the Linux kernel? If so, are pages >> shared with SMMU pinned? If not, then how are IO PFs handled after >> pages are reclaimed? Yes, these systems all run Linux. 
Pages shared with SMMU aren't pinned and fault handling works as Barry
notes below - a driver is notified of a fault and calls handle_mm_fault()
in response.

> it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> I/O PF, so finally
> it runs the same codes to get page back just like CPU's PF.
>
> years ago, we recommended a pin solution, but obviously there were lots of
> push backs:
> https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/

Right. Having to pin pages defeats the whole point of having hardware
that can handle page faults.

> Thanks
> Barry
On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple <apopple@nvidia.com> wrote: > > > Barry Song <21cnbao@gmail.com> writes: > > > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote: > >> > >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote: > >> > > >> > > >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes: > >> > > >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote: > >> > >> Barry Song <21cnbao@gmail.com> writes: > >> > >> > >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: > >> > >>>> > >> > >>>> > >> > >>>> Barry Song <21cnbao@gmail.com> writes: > >> > >>>> > >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: > >> > >>>>>> > >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang > >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote: > >> > >> [...] > >> > >> > >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot > >> > >>>>>> page, B might > >> > >>>>>> be a correct choice. > >> > >>>>> > >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to > >> > >>>>> the head of lru list or new generation, it needs a long time to slide back > >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time > >> > >>>>> should have been large enough for tlb to be flushed. If the page is really > >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an > >> > >>>>> access flag in the long time in which the page is re-moved to the tail > >> > >>>>> as the page can be accessed multiple times if it is really hot. > >> > >>>> > >> > >>>> This might not be true if you have external hardware sharing the page > >> > >>>> tables with software through either HMM or hardware supported ATS > >> > >>>> though. > >> > >>>> > >> > >>>> In those cases I think it's much more likely hardware can still be > >> > >>>> accessing the page even after a context switch on the CPU say. 
So those > >> > >>>> pages will tend to get reclaimed even though hardware is still actively > >> > >>>> using them which would be quite expensive and I guess could lead to > >> > >>>> thrashing as each page is reclaimed and then immediately faulted back > >> > >>>> in. > >> > > > >> > > That's possible, but the chance should be relatively low. At least on > >> > > x86, I have not heard of this issue. > >> > > >> > Personally I've never seen any x86 system that shares page tables with > >> > external devices, other than with HMM. More on that below. > >> > > >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's > >> > >>> pagetable the ability to set access flag in page table entries by > >> > >>> itself? if yes, > >> > >>> I don't see how our approach will hurt as folio_referenced can notify the > >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see > >> > >>> either as the external hardware can't set access flags, that means we > >> > >>> have ignored its reference and only knows cpu's access even in the current > >> > >>> mainline code. so we are not getting worse. > >> > >>> > >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can > >> > >>> also broadcast to external hardware, then external hardware sees the > >> > >>> cleared access flag, thus, it can set access flag in page table when the > >> > >>> hardware access it? If this is the case, I feel what you said is true. > >> > >> Perhaps it would help if I gave a concrete example. Take for example > >> > >> the > >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of > >> > >> two ways depending on the specific HW implementation. > >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU > >> > >> TLB > >> > >> invalidations. If BTM is not supported it relies on SW to explicitly > >> > >> forward TLB invalidations via MMU notifiers. 
> >> > > > >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency. > >> > > >> > Lucky you :-) > >> > > >> > ARM64 SMMU architecture specification supports the possibilty of both, > >> > as does the driver. Not that I think whether or not BTM is supported has > >> > much relevance to this issue. > >> > > >> > >> Now consider the case where some external device is accessing mappings > >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we > >> > >> clear the access flag without a TLB invalidate the access flag in the > >> > >> CPU page table will not get updated because it's already set in the SMMU > >> > >> TLB. > >> > >> As an aside access flag updates happen in one of two ways. If the > >> > >> SMMU > >> > >> HW supports hardware translation table updates (HTTU) then hardware will > >> > >> manage updating access/dirty flags as required. If this is not supported > >> > >> then SW is relied on to update these flags which in practice means > >> > >> taking a minor fault. But I don't think that is relevant here - in > >> > >> either case without a TLB invalidate neither of those things will > >> > >> happen. > >> > >> I suppose drivers could implement the clear_flush_young() MMU > >> > >> notifier > >> > >> callback (none do at the moment AFAICT) but then won't that just lead to > >> > >> the opposite problem - that every page ever used by an external device > >> > >> remains active and unavailable for reclaim because the access flag never > >> > >> gets cleared? I suppose they could do the flush then which would ensure > >> > > > >> > > Yes, I think so too. The reason there is currently no problem, perhaps > >> > > I think, there are no actual use cases at the moment? At least on our > >> > > Alibaba's fleet, SMMU and MMU do not share page tables now. > >> > > >> > We have systems that do. > >> > >> Just curious: do those systems run the Linux kernel? If so, are pages > >> shared with SMMU pinned? 
If not, then how are IO PFs handled after > >> pages are reclaimed? > > Yes, these systems all run Linux. Pages shared with SMMU aren't pinned > and fault handling works as Barry notes below - a driver is notified of > a fault and calls handle_mm_fault() in response. > > > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in > > I/O PF, so finally > > it runs the same codes to get page back just like CPU's PF. > > > > years ago, we recommended a pin solution, but obviously there were lots of > > push backs: > > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/ > > Right. Having to pin pages defeats the whole point of having hardware > that can handle page faults. Thanks. How would a DMA transaction be retried after the kernel resolves an IO PF? I.e., does the h/w (PCIe spec, etc.) support auto retries or is the s/w responsible for doing so? At least when I worked on the PCI subsystem, I didn't know any device that was capable of doing auto retries. (Pasha and I will have a talk on IOMMU at the coming LPC, so this might be an interesting intersection between IOMMU and MM to discuss.)
Yu Zhao <yuzhao@google.com> writes: > On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple <apopple@nvidia.com> wrote: >> >> > >> Now consider the case where some external device is accessing mappings >> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we >> >> > >> clear the access flag without a TLB invalidate the access flag in the >> >> > >> CPU page table will not get updated because it's already set in the SMMU >> >> > >> TLB. >> >> > >> As an aside access flag updates happen in one of two ways. If the >> >> > >> SMMU >> >> > >> HW supports hardware translation table updates (HTTU) then hardware will >> >> > >> manage updating access/dirty flags as required. If this is not supported >> >> > >> then SW is relied on to update these flags which in practice means >> >> > >> taking a minor fault. But I don't think that is relevant here - in >> >> > >> either case without a TLB invalidate neither of those things will >> >> > >> happen. >> >> > >> I suppose drivers could implement the clear_flush_young() MMU >> >> > >> notifier >> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to >> >> > >> the opposite problem - that every page ever used by an external device >> >> > >> remains active and unavailable for reclaim because the access flag never >> >> > >> gets cleared? I suppose they could do the flush then which would ensure >> >> > > >> >> > > Yes, I think so too. The reason there is currently no problem, perhaps >> >> > > I think, there are no actual use cases at the moment? At least on our >> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now. >> >> > >> >> > We have systems that do. >> >> >> >> Just curious: do those systems run the Linux kernel? If so, are pages >> >> shared with SMMU pinned? If not, then how are IO PFs handled after >> >> pages are reclaimed? >> >> Yes, these systems all run Linux. 
Pages shared with SMMU aren't pinned >> and fault handling works as Barry notes below - a driver is notified of >> a fault and calls handle_mm_fault() in response. >> >> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in >> > I/O PF, so finally >> > it runs the same codes to get page back just like CPU's PF. >> > >> > years ago, we recommended a pin solution, but obviously there were lots of >> > push backs: >> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/ >> >> Right. Having to pin pages defeats the whole point of having hardware >> that can handle page faults. > > Thanks. How would a DMA transaction be retried after the kernel > resolves an IO PF? I.e., does the h/w (PCIe spec, etc.) support auto > retries or is the s/w responsible for doing so? At least when I worked > on the PCI subsystem, I didn't know any device that was capable of > doing auto retries. (Pasha and I will have a talk on IOMMU at the > coming LPC, so this might be an interesting intersection between IOMMU > and MM to discuss.) Generally what happens if a device encounters a page fault is that it notifies the kernel or driver (eg. via an interrupt) that it has faulted. It is then up to SW to resolve the fault and tell HW to retry the translation request once SW thinks the fault is resolved. I'm not aware of HW that does automatic retries (although I'm a little unclear what exactly is meant by automatic retry). In the case of an IOMMU faulting (eg. SMMU on ARM) on a DMA access I believe it stalls the transaction and SW is responsible for processing the fault and signalling that the translation should be retried. It's also possible for the device itself to detect a fault prior to issuing a DMA request if it's using something like PCIe page request services. Note my experience with this is more with non-PCIe devices that are coherently attached, but the concepts are all much the same as they all channel through the same IOMMU. 
Unfortunately it doesn't look like I will be at LPC this year, otherwise it would have been good to discuss. Happy to continue the discussion here or via some other channel though. Hopefully I will be able to see your talk online.
On 10/24/23 18:26, Baolin Wang wrote: > Now ptep_clear_flush_young() is only called by folio_referenced() to > check if the folio was referenced, and now it will call a tlb flush on > ARM64 architecture. However the tlb flush can be expensive on ARM64 > servers, especially for the systems with a large CPU numbers. TLB flush would be expensive on *any* platform with large CPU numbers ? > > Similar to the x86 architecture, below comments also apply equally to > ARM64 architecture. So we can drop the tlb flush operation in > ptep_clear_flush_young() on ARM64 architecture to improve the performance. > " > /* Clearing the accessed bit without a TLB flush > * doesn't cause data corruption. [ It could cause incorrect > * page aging and the (mistaken) reclaim of hot pages, but the > * chance of that should be relatively low. ] > * > * So as a performance optimization don't flush the TLB when > * clearing the accessed bit, it will eventually be flushed by > * a context switch or a VM operation anyway. [ In the rare > * event of it not getting flushed for a long time the delay > * shouldn't really matter because there's no real memory > * pressure for swapout to react to. ] > */ If always true, this sounds generic enough for all platforms, why only x86 and arm64 ? 
> " > Running the thpscale to show some obvious improvements for compaction > latency with this patch: > base patched > Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* > Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* > Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* > Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* > Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* > Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* > Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* > Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* > Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* > base patched > Duration User 167.78 161.03 > Duration System 1836.66 1673.01 > Duration Elapsed 2074.58 2059.75 Could you please point to the test repo you are running ? > > Barry Song submitted a similar patch [1] before, that replaces the > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in > folio_referenced_one(). However, I'm not sure if removing the tlb flush > operation is applicable to every architecture in kernel, so dropping > the tlb flush for ARM64 seems a sensible change. The reasoning provided here sounds generic when true, hence there seems to be no justification to keep it limited just for arm64 and x86. Also what about pmdp_clear_flush_young_notify() when THP is enabled. Should that also not do a TLB flush after clearing access bit ? Although arm64 does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on the generic pmdp_clear_flush_young() which also does a TLB flush via flush_pmd_tlb_range() while clearing the access bit. > > Note: I am okay for both approach, if someone can help to ensure that > all architectures do not need the tlb flush when clearing the accessed > bit, then I also think Barry's patch is better (hope Barry can resend > his patch). This paragraph belongs after the '----' below and not part of the commit message. 
> > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > --- > arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- > 1 file changed, 16 insertions(+), 15 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index 0bd18de9fd97..2979d796ba9d 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, > static inline int ptep_clear_flush_young(struct vm_area_struct *vma, > unsigned long address, pte_t *ptep) > { > - int young = ptep_test_and_clear_young(vma, address, ptep); > - > - if (young) { > - /* > - * We can elide the trailing DSB here since the worst that can > - * happen is that a CPU continues to use the young entry in its > - * TLB and we mistakenly reclaim the associated page. The > - * window for such an event is bounded by the next > - * context-switch, which provides a DSB to complete the TLB > - * invalidation. > - */ > - flush_tlb_page_nosync(vma, address); > - } > - > - return young; > + /* > + * This comment is borrowed from x86, but applies equally to ARM64: > + * > + * Clearing the accessed bit without a TLB flush doesn't cause > + * data corruption. [ It could cause incorrect page aging and > + * the (mistaken) reclaim of hot pages, but the chance of that > + * should be relatively low. ] > + * > + * So as a performance optimization don't flush the TLB when > + * clearing the accessed bit, it will eventually be flushed by > + * a context switch or a VM operation anyway. [ In the rare > + * event of it not getting flushed for a long time the delay > + * shouldn't really matter because there's no real memory > + * pressure for swapout to react to. 
] > + */
> + return ptep_test_and_clear_young(vma, address, ptep);
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE

There are three distinct concerns here

1) What are the chances of this misleading existing hot page reclaim process
2) How secondary MMU such as SMMU adapt to change in mappings without a flush
3) Could this break the architecture rule requiring a TLB flush after access
   bit clear on a page table entry
On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 10/24/23 18:26, Baolin Wang wrote: > > Now ptep_clear_flush_young() is only called by folio_referenced() to > > check if the folio was referenced, and now it will call a tlb flush on > > ARM64 architecture. However the tlb flush can be expensive on ARM64 > > servers, especially for the systems with a large CPU numbers. > > TLB flush would be expensive on *any* platform with large CPU numbers ? > > > > > Similar to the x86 architecture, below comments also apply equally to > > ARM64 architecture. So we can drop the tlb flush operation in > > ptep_clear_flush_young() on ARM64 architecture to improve the performance. > > " > > /* Clearing the accessed bit without a TLB flush > > * doesn't cause data corruption. [ It could cause incorrect > > * page aging and the (mistaken) reclaim of hot pages, but the > > * chance of that should be relatively low. ] > > * > > * So as a performance optimization don't flush the TLB when > > * clearing the accessed bit, it will eventually be flushed by > > * a context switch or a VM operation anyway. [ In the rare > > * event of it not getting flushed for a long time the delay > > * shouldn't really matter because there's no real memory > > * pressure for swapout to react to. ] > > */ > > If always true, this sounds generic enough for all platforms, why only > x86 and arm64 ? 
> > > " > > Running the thpscale to show some obvious improvements for compaction > > latency with this patch: > > base patched > > Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* > > Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* > > Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* > > Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* > > Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* > > Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* > > Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* > > Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* > > Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* > > base patched > > Duration User 167.78 161.03 > > Duration System 1836.66 1673.01 > > Duration Elapsed 2074.58 2059.75 > > Could you please point to the test repo you are running ? > > > > > Barry Song submitted a similar patch [1] before, that replaces the > > ptep_clear_flush_young_notify() with ptep_clear_young_notify() in > > folio_referenced_one(). However, I'm not sure if removing the tlb flush > > operation is applicable to every architecture in kernel, so dropping > > the tlb flush for ARM64 seems a sensible change. > > The reasoning provided here sounds generic when true, hence there seems > to be no justification to keep it limited just for arm64 and x86. Also > what about pmdp_clear_flush_young_notify() when THP is enabled. Should > that also not do a TLB flush after clearing access bit ? Although arm64 > does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on > the generic pmdp_clear_flush_young() which also does a TLB flush via > flush_pmd_tlb_range() while clearing the access bit. > > > > > Note: I am okay for both approach, if someone can help to ensure that > > all architectures do not need the tlb flush when clearing the accessed > > bit, then I also think Barry's patch is better (hope Barry can resend > > his patch). 
> > This paragraph belongs after the '----' below and not part of the commit > message. > > > > > [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > > --- > > arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- > > 1 file changed, 16 insertions(+), 15 deletions(-) > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > index 0bd18de9fd97..2979d796ba9d 100644 > > --- a/arch/arm64/include/asm/pgtable.h > > +++ b/arch/arm64/include/asm/pgtable.h > > @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, > > static inline int ptep_clear_flush_young(struct vm_area_struct *vma, > > unsigned long address, pte_t *ptep) > > { > > - int young = ptep_test_and_clear_young(vma, address, ptep); > > - > > - if (young) { > > - /* > > - * We can elide the trailing DSB here since the worst that can > > - * happen is that a CPU continues to use the young entry in its > > - * TLB and we mistakenly reclaim the associated page. The > > - * window for such an event is bounded by the next > > - * context-switch, which provides a DSB to complete the TLB > > - * invalidation. > > - */ > > - flush_tlb_page_nosync(vma, address); > > - } > > - > > - return young; > > + /* > > + * This comment is borrowed from x86, but applies equally to ARM64: > > + * > > + * Clearing the accessed bit without a TLB flush doesn't cause > > + * data corruption. [ It could cause incorrect page aging and > > + * the (mistaken) reclaim of hot pages, but the chance of that > > + * should be relatively low. ] > > + * > > + * So as a performance optimization don't flush the TLB when > > + * clearing the accessed bit, it will eventually be flushed by > > + * a context switch or a VM operation anyway. 
[ In the rare
> > + * event of it not getting flushed for a long time the delay
> > + * shouldn't really matter because there's no real memory
> > + * pressure for swapout to react to. ]
> > + */
> > + return ptep_test_and_clear_young(vma, address, ptep);
> > }
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> There are three distinct concerns here
>
> 1) What are the chances of this misleading existing hot page reclaim process
> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush
> 3) Could this break the architecture rule requiring a TLB flush after access
>    bit clear on a page table entry

In terms of all of the above concerns, though 2 is different, which is an
issue between cpu and non-cpu, i feel the kernel has actually dropped the
tlb flush at least for mglru; there is no flush in lru_gen_look_around():

static bool folio_referenced_one(struct folio *folio,
		struct vm_area_struct *vma, unsigned long address, void *arg)
{
	...

	if (pvmw.pte) {
		if (lru_gen_enabled() &&
		    pte_young(ptep_get(pvmw.pte))) {
			lru_gen_look_around(&pvmw);
			referenced++;
		}

		if (ptep_clear_flush_young_notify(vma, address,
						  pvmw.pte))
			referenced++;
	}

	return true;
}

and so it is in walk_pte_range() of vmscan. linux has been surviving with
all the above concerns for a while, believe it or not :-)

Thanks
Barry
On 10/26/23 11:24, Barry Song wrote: > On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: >> >> >> >> On 10/24/23 18:26, Baolin Wang wrote: >>> Now ptep_clear_flush_young() is only called by folio_referenced() to >>> check if the folio was referenced, and now it will call a tlb flush on >>> ARM64 architecture. However the tlb flush can be expensive on ARM64 >>> servers, especially for the systems with a large CPU numbers. >> >> TLB flush would be expensive on *any* platform with large CPU numbers ? >> >>> >>> Similar to the x86 architecture, below comments also apply equally to >>> ARM64 architecture. So we can drop the tlb flush operation in >>> ptep_clear_flush_young() on ARM64 architecture to improve the performance. >>> " >>> /* Clearing the accessed bit without a TLB flush >>> * doesn't cause data corruption. [ It could cause incorrect >>> * page aging and the (mistaken) reclaim of hot pages, but the >>> * chance of that should be relatively low. ] >>> * >>> * So as a performance optimization don't flush the TLB when >>> * clearing the accessed bit, it will eventually be flushed by >>> * a context switch or a VM operation anyway. [ In the rare >>> * event of it not getting flushed for a long time the delay >>> * shouldn't really matter because there's no real memory >>> * pressure for swapout to react to. ] >>> */ >> >> If always true, this sounds generic enough for all platforms, why only >> x86 and arm64 ? 
>> >>> " >>> Running the thpscale to show some obvious improvements for compaction >>> latency with this patch: >>> base patched >>> Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* >>> Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* >>> Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* >>> Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* >>> Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* >>> Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* >>> Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* >>> Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* >>> Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* >>> base patched >>> Duration User 167.78 161.03 >>> Duration System 1836.66 1673.01 >>> Duration Elapsed 2074.58 2059.75 >> >> Could you please point to the test repo you are running ? >> >>> >>> Barry Song submitted a similar patch [1] before, that replaces the >>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in >>> folio_referenced_one(). However, I'm not sure if removing the tlb flush >>> operation is applicable to every architecture in kernel, so dropping >>> the tlb flush for ARM64 seems a sensible change. >> >> The reasoning provided here sounds generic when true, hence there seems >> to be no justification to keep it limited just for arm64 and x86. Also >> what about pmdp_clear_flush_young_notify() when THP is enabled. Should >> that also not do a TLB flush after clearing access bit ? Although arm64 >> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on >> the generic pmdp_clear_flush_young() which also does a TLB flush via >> flush_pmd_tlb_range() while clearing the access bit. >> >>> >>> Note: I am okay for both approach, if someone can help to ensure that >>> all architectures do not need the tlb flush when clearing the accessed >>> bit, then I also think Barry's patch is better (hope Barry can resend >>> his patch). 
>> >> This paragraph belongs after the '----' below and not part of the commit >> message. >> >>> >>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ >>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>> --- >>> arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- >>> 1 file changed, 16 insertions(+), 15 deletions(-) >>> >>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>> index 0bd18de9fd97..2979d796ba9d 100644 >>> --- a/arch/arm64/include/asm/pgtable.h >>> +++ b/arch/arm64/include/asm/pgtable.h >>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >>> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >>> unsigned long address, pte_t *ptep) >>> { >>> - int young = ptep_test_and_clear_young(vma, address, ptep); >>> - >>> - if (young) { >>> - /* >>> - * We can elide the trailing DSB here since the worst that can >>> - * happen is that a CPU continues to use the young entry in its >>> - * TLB and we mistakenly reclaim the associated page. The >>> - * window for such an event is bounded by the next >>> - * context-switch, which provides a DSB to complete the TLB >>> - * invalidation. >>> - */ >>> - flush_tlb_page_nosync(vma, address); >>> - } >>> - >>> - return young; >>> + /* >>> + * This comment is borrowed from x86, but applies equally to ARM64: >>> + * >>> + * Clearing the accessed bit without a TLB flush doesn't cause >>> + * data corruption. [ It could cause incorrect page aging and >>> + * the (mistaken) reclaim of hot pages, but the chance of that >>> + * should be relatively low. ] >>> + * >>> + * So as a performance optimization don't flush the TLB when >>> + * clearing the accessed bit, it will eventually be flushed by >>> + * a context switch or a VM operation anyway. 
[ In the rare >>> + * event of it not getting flushed for a long time the delay >>> + * shouldn't really matter because there's no real memory >>> + * pressure for swapout to react to. ] >>> + */ >>> + return ptep_test_and_clear_young(vma, address, ptep); >>> } >>> >>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> >> There are three distinct concerns here >> >> 1) What are the chances of this misleading existing hot page reclaim process >> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush >> 3) Could this break the architecture rule requiring a TLB flush after access >> bit clear on a page table entry > > In terms of all of above concerns, though 2 is different, which is an > issue between > cpu and non-cpu, > i feel kernel has actually dropped tlb flush at least for mglru, there > is no flush in > lru_gen_look_around(), > > static bool folio_referenced_one(struct folio *folio, > struct vm_area_struct *vma, unsigned long address, void *arg) > { > ... > > if (pvmw.pte) { > if (lru_gen_enabled() && > pte_young(ptep_get(pvmw.pte))) { > lru_gen_look_around(&pvmw); > referenced++; > } > > if (ptep_clear_flush_young_notify(vma, address, > pvmw.pte)) > referenced++; > } > > return true; > } > > and so is in walk_pte_range() of vmscan. linux has been surviving with > all above concerns for a while, believing it or not :-) Although the first two concerns could be worked upon in the SW, kernel surviving after breaking arch rules explicitly is not a correct state to be in IMHO.
On 26/10/2023 7:01 am, Anshuman Khandual wrote: > > > On 10/26/23 11:24, Barry Song wrote: >> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual >> <anshuman.khandual@arm.com> wrote: >>> >>> >>> >>> On 10/24/23 18:26, Baolin Wang wrote: >>>> Now ptep_clear_flush_young() is only called by folio_referenced() to >>>> check if the folio was referenced, and now it will call a tlb flush on >>>> ARM64 architecture. However the tlb flush can be expensive on ARM64 >>>> servers, especially for the systems with a large CPU numbers. >>> >>> TLB flush would be expensive on *any* platform with large CPU numbers ? >>> >>>> >>>> Similar to the x86 architecture, below comments also apply equally to >>>> ARM64 architecture. So we can drop the tlb flush operation in >>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance. >>>> " >>>> /* Clearing the accessed bit without a TLB flush >>>> * doesn't cause data corruption. [ It could cause incorrect >>>> * page aging and the (mistaken) reclaim of hot pages, but the >>>> * chance of that should be relatively low. ] >>>> * >>>> * So as a performance optimization don't flush the TLB when >>>> * clearing the accessed bit, it will eventually be flushed by >>>> * a context switch or a VM operation anyway. [ In the rare >>>> * event of it not getting flushed for a long time the delay >>>> * shouldn't really matter because there's no real memory >>>> * pressure for swapout to react to. ] >>>> */ >>> >>> If always true, this sounds generic enough for all platforms, why only >>> x86 and arm64 ? 
>>> >>>> " >>>> Running the thpscale to show some obvious improvements for compaction >>>> latency with this patch: >>>> base patched >>>> Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* >>>> Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* >>>> Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* >>>> Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* >>>> Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* >>>> Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* >>>> Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* >>>> Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* >>>> Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* >>>> base patched >>>> Duration User 167.78 161.03 >>>> Duration System 1836.66 1673.01 >>>> Duration Elapsed 2074.58 2059.75 >>> >>> Could you please point to the test repo you are running ? >>> >>>> >>>> Barry Song submitted a similar patch [1] before, that replaces the >>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in >>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush >>>> operation is applicable to every architecture in kernel, so dropping >>>> the tlb flush for ARM64 seems a sensible change. >>> >>> The reasoning provided here sounds generic when true, hence there seems >>> to be no justification to keep it limited just for arm64 and x86. Also >>> what about pmdp_clear_flush_young_notify() when THP is enabled. Should >>> that also not do a TLB flush after clearing access bit ? Although arm64 >>> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on >>> the generic pmdp_clear_flush_young() which also does a TLB flush via >>> flush_pmd_tlb_range() while clearing the access bit. >>> >>>> >>>> Note: I am okay for both approach, if someone can help to ensure that >>>> all architectures do not need the tlb flush when clearing the accessed >>>> bit, then I also think Barry's patch is better (hope Barry can resend >>>> his patch). 
>>> >>> This paragraph belongs after the '----' below and not part of the commit >>> message. >>> >>>> >>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> --- >>>> arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- >>>> 1 file changed, 16 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>>> index 0bd18de9fd97..2979d796ba9d 100644 >>>> --- a/arch/arm64/include/asm/pgtable.h >>>> +++ b/arch/arm64/include/asm/pgtable.h >>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >>>> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >>>> unsigned long address, pte_t *ptep) >>>> { >>>> - int young = ptep_test_and_clear_young(vma, address, ptep); >>>> - >>>> - if (young) { >>>> - /* >>>> - * We can elide the trailing DSB here since the worst that can >>>> - * happen is that a CPU continues to use the young entry in its >>>> - * TLB and we mistakenly reclaim the associated page. The >>>> - * window for such an event is bounded by the next >>>> - * context-switch, which provides a DSB to complete the TLB >>>> - * invalidation. >>>> - */ >>>> - flush_tlb_page_nosync(vma, address); >>>> - } >>>> - >>>> - return young; >>>> + /* >>>> + * This comment is borrowed from x86, but applies equally to ARM64: >>>> + * >>>> + * Clearing the accessed bit without a TLB flush doesn't cause >>>> + * data corruption. [ It could cause incorrect page aging and >>>> + * the (mistaken) reclaim of hot pages, but the chance of that >>>> + * should be relatively low. ] >>>> + * >>>> + * So as a performance optimization don't flush the TLB when >>>> + * clearing the accessed bit, it will eventually be flushed by >>>> + * a context switch or a VM operation anyway. 
[ In the rare >>>> + * event of it not getting flushed for a long time the delay >>>> + * shouldn't really matter because there's no real memory >>>> + * pressure for swapout to react to. ] >>>> + */ >>>> + return ptep_test_and_clear_young(vma, address, ptep); >>>> } >>>> >>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>> >>> There are three distinct concerns here >>> >>> 1) What are the chances of this misleading existing hot page reclaim process >>> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush >>> 3) Could this break the architecture rule requiring a TLB flush after access >>> bit clear on a page table entry >> >> In terms of all of above concerns, though 2 is different, which is an >> issue between >> cpu and non-cpu, >> i feel kernel has actually dropped tlb flush at least for mglru, there >> is no flush in >> lru_gen_look_around(), >> >> static bool folio_referenced_one(struct folio *folio, >> struct vm_area_struct *vma, unsigned long address, void *arg) >> { >> ... >> >> if (pvmw.pte) { >> if (lru_gen_enabled() && >> pte_young(ptep_get(pvmw.pte))) { >> lru_gen_look_around(&pvmw); >> referenced++; >> } >> >> if (ptep_clear_flush_young_notify(vma, address, >> pvmw.pte)) >> referenced++; >> } >> >> return true; >> } >> >> and so is in walk_pte_range() of vmscan. linux has been surviving with >> all above concerns for a while, believing it or not :-) > > Although the first two concerns could be worked upon in the SW, kernel surviving > after breaking arch rules explicitly is not a correct state to be in IMHO. AFAICS it's not breaking any rules as such. Indeed the architecture requires a TLBI to ensure that the access flag update is observed by the CPU, but what's being proposed here is relaxing Linux's own expectations to say actually we don't mind if the CPU *doesn't* observe the update straight away. Thanks, Robin.
On 10/26/2023 2:01 PM, Anshuman Khandual wrote: > > > On 10/26/23 11:24, Barry Song wrote: >> On Thu, Oct 26, 2023 at 12:55 PM Anshuman Khandual >> <anshuman.khandual@arm.com> wrote: >>> >>> >>> >>> On 10/24/23 18:26, Baolin Wang wrote: >>>> Now ptep_clear_flush_young() is only called by folio_referenced() to >>>> check if the folio was referenced, and now it will call a tlb flush on >>>> ARM64 architecture. However the tlb flush can be expensive on ARM64 >>>> servers, especially for the systems with a large CPU numbers. >>> >>> TLB flush would be expensive on *any* platform with large CPU numbers ? Perhaps yes, but did not measure it on other platforms. >>>> Similar to the x86 architecture, below comments also apply equally to >>>> ARM64 architecture. So we can drop the tlb flush operation in >>>> ptep_clear_flush_young() on ARM64 architecture to improve the performance. >>>> " >>>> /* Clearing the accessed bit without a TLB flush >>>> * doesn't cause data corruption. [ It could cause incorrect >>>> * page aging and the (mistaken) reclaim of hot pages, but the >>>> * chance of that should be relatively low. ] >>>> * >>>> * So as a performance optimization don't flush the TLB when >>>> * clearing the accessed bit, it will eventually be flushed by >>>> * a context switch or a VM operation anyway. [ In the rare >>>> * event of it not getting flushed for a long time the delay >>>> * shouldn't really matter because there's no real memory >>>> * pressure for swapout to react to. ] >>>> */ >>> >>> If always true, this sounds generic enough for all platforms, why only >>> x86 and arm64 ? I am not sure this is always true for every architectures. 
>>>> " >>>> Running the thpscale to show some obvious improvements for compaction >>>> latency with this patch: >>>> base patched >>>> Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%* >>>> Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%* >>>> Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%* >>>> Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%* >>>> Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%* >>>> Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%* >>>> Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%* >>>> Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%* >>>> Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%* >>>> base patched >>>> Duration User 167.78 161.03 >>>> Duration System 1836.66 1673.01 >>>> Duration Elapsed 2074.58 2059.75 >>> >>> Could you please point to the test repo you are running ? The test is based on v6.5 kernel. >>>> Barry Song submitted a similar patch [1] before, that replaces the >>>> ptep_clear_flush_young_notify() with ptep_clear_young_notify() in >>>> folio_referenced_one(). However, I'm not sure if removing the tlb flush >>>> operation is applicable to every architecture in kernel, so dropping >>>> the tlb flush for ARM64 seems a sensible change. >>> >>> The reasoning provided here sounds generic when true, hence there seems >>> to be no justification to keep it limited just for arm64 and x86. Also Right, but I can not ensure if this will break other architectures. >>> what about pmdp_clear_flush_young_notify() when THP is enabled. Should >>> that also not do a TLB flush after clearing access bit ? Although arm64 Yes, I think so. >>> does not enable __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH, rather depends on >>> the generic pmdp_clear_flush_young() which also does a TLB flush via >>> flush_pmd_tlb_range() while clearing the access bit. 
>>> >>>> >>>> Note: I am okay for both approach, if someone can help to ensure that >>>> all architectures do not need the tlb flush when clearing the accessed >>>> bit, then I also think Barry's patch is better (hope Barry can resend >>>> his patch). >>> >>> This paragraph belongs after the '----' below and not part of the commit >>> message. OK. >>>> [1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/ >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> --- >>>> arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++--------------- >>>> 1 file changed, 16 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>>> index 0bd18de9fd97..2979d796ba9d 100644 >>>> --- a/arch/arm64/include/asm/pgtable.h >>>> +++ b/arch/arm64/include/asm/pgtable.h >>>> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >>>> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >>>> unsigned long address, pte_t *ptep) >>>> { >>>> - int young = ptep_test_and_clear_young(vma, address, ptep); >>>> - >>>> - if (young) { >>>> - /* >>>> - * We can elide the trailing DSB here since the worst that can >>>> - * happen is that a CPU continues to use the young entry in its >>>> - * TLB and we mistakenly reclaim the associated page. The >>>> - * window for such an event is bounded by the next >>>> - * context-switch, which provides a DSB to complete the TLB >>>> - * invalidation. >>>> - */ >>>> - flush_tlb_page_nosync(vma, address); >>>> - } >>>> - >>>> - return young; >>>> + /* >>>> + * This comment is borrowed from x86, but applies equally to ARM64: >>>> + * >>>> + * Clearing the accessed bit without a TLB flush doesn't cause >>>> + * data corruption. [ It could cause incorrect page aging and >>>> + * the (mistaken) reclaim of hot pages, but the chance of that >>>> + * should be relatively low. 
] >>>> + * >>>> + * So as a performance optimization don't flush the TLB when >>>> + * clearing the accessed bit, it will eventually be flushed by >>>> + * a context switch or a VM operation anyway. [ In the rare >>>> + * event of it not getting flushed for a long time the delay >>>> + * shouldn't really matter because there's no real memory >>>> + * pressure for swapout to react to. ] >>>> + */ >>>> + return ptep_test_and_clear_young(vma, address, ptep); >>>> } >>>> >>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>> >>> There are three distinct concerns here >>> >>> 1) What are the chances of this misleading existing hot page reclaim process >>> 2) How secondary MMU such as SMMU adapt to change in mappings without a flush >>> 3) Could this break the architecture rule requiring a TLB flush after access >>> bit clear on a page table entry >> >> In terms of all of above concerns, though 2 is different, which is an >> issue between >> cpu and non-cpu, >> i feel kernel has actually dropped tlb flush at least for mglru, there >> is no flush in >> lru_gen_look_around(), >> >> static bool folio_referenced_one(struct folio *folio, >> struct vm_area_struct *vma, unsigned long address, void *arg) >> { >> ... >> >> if (pvmw.pte) { >> if (lru_gen_enabled() && >> pte_young(ptep_get(pvmw.pte))) { >> lru_gen_look_around(&pvmw); >> referenced++; >> } >> >> if (ptep_clear_flush_young_notify(vma, address, >> pvmw.pte)) >> referenced++; >> } >> >> return true; >> } >> >> and so is in walk_pte_range() of vmscan. linux has been surviving with >> all above concerns for a while, believing it or not :-) > > Although the first two concerns could be worked upon in the SW, kernel surviving > after breaking arch rules explicitly is not a correct state to be in IMHO. Not sure what's the meaning of "not a correct state", at least we (Alibaba) have not found this can cause any issues until now when using MGLRU on x86 and ARM64 platforms.
On Wed, Oct 25, 2023 at 6:16 PM Alistair Popple <apopple@nvidia.com> wrote: > > > Barry Song <21cnbao@gmail.com> writes: > > > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote: > >> > >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote: > >> > > >> > > >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes: > >> > > >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote: > >> > >> Barry Song <21cnbao@gmail.com> writes: > >> > >> > >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote: > >> > >>>> > >> > >>>> > >> > >>>> Barry Song <21cnbao@gmail.com> writes: > >> > >>>> > >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote: > >> > >>>>>> > >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang > >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote: > >> > >> [...] > >> > >> > >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot > >> > >>>>>> page, B might > >> > >>>>>> be a correct choice. > >> > >>>>> > >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to > >> > >>>>> the head of lru list or new generation, it needs a long time to slide back > >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time > >> > >>>>> should have been large enough for tlb to be flushed. If the page is really > >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an > >> > >>>>> access flag in the long time in which the page is re-moved to the tail > >> > >>>>> as the page can be accessed multiple times if it is really hot. > >> > >>>> > >> > >>>> This might not be true if you have external hardware sharing the page > >> > >>>> tables with software through either HMM or hardware supported ATS > >> > >>>> though. > >> > >>>> > >> > >>>> In those cases I think it's much more likely hardware can still be > >> > >>>> accessing the page even after a context switch on the CPU say. 
So those > >> > >>>> pages will tend to get reclaimed even though hardware is still actively > >> > >>>> using them which would be quite expensive and I guess could lead to > >> > >>>> thrashing as each page is reclaimed and then immediately faulted back > >> > >>>> in. > >> > > > >> > > That's possible, but the chance should be relatively low. At least on > >> > > x86, I have not heard of this issue. > >> > > >> > Personally I've never seen any x86 system that shares page tables with > >> > external devices, other than with HMM. More on that below. > >> > > >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's > >> > >>> pagetable the ability to set access flag in page table entries by > >> > >>> itself? if yes, > >> > >>> I don't see how our approach will hurt as folio_referenced can notify the > >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see > >> > >>> either as the external hardware can't set access flags, that means we > >> > >>> have ignored its reference and only knows cpu's access even in the current > >> > >>> mainline code. so we are not getting worse. > >> > >>> > >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can > >> > >>> also broadcast to external hardware, then external hardware sees the > >> > >>> cleared access flag, thus, it can set access flag in page table when the > >> > >>> hardware access it? If this is the case, I feel what you said is true. > >> > >> Perhaps it would help if I gave a concrete example. Take for example > >> > >> the > >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of > >> > >> two ways depending on the specific HW implementation. > >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU > >> > >> TLB > >> > >> invalidations. If BTM is not supported it relies on SW to explicitly > >> > >> forward TLB invalidations via MMU notifiers. 
> >> > > > >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency. > >> > > >> > Lucky you :-) > >> > > >> > ARM64 SMMU architecture specification supports the possibilty of both, > >> > as does the driver. Not that I think whether or not BTM is supported has > >> > much relevance to this issue. > >> > > >> > >> Now consider the case where some external device is accessing mappings > >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we > >> > >> clear the access flag without a TLB invalidate the access flag in the > >> > >> CPU page table will not get updated because it's already set in the SMMU > >> > >> TLB. > >> > >> As an aside access flag updates happen in one of two ways. If the > >> > >> SMMU > >> > >> HW supports hardware translation table updates (HTTU) then hardware will > >> > >> manage updating access/dirty flags as required. If this is not supported > >> > >> then SW is relied on to update these flags which in practice means > >> > >> taking a minor fault. But I don't think that is relevant here - in > >> > >> either case without a TLB invalidate neither of those things will > >> > >> happen. > >> > >> I suppose drivers could implement the clear_flush_young() MMU > >> > >> notifier > >> > >> callback (none do at the moment AFAICT) but then won't that just lead to > >> > >> the opposite problem - that every page ever used by an external device > >> > >> remains active and unavailable for reclaim because the access flag never > >> > >> gets cleared? I suppose they could do the flush then which would ensure > >> > > > >> > > Yes, I think so too. The reason there is currently no problem, perhaps > >> > > I think, there are no actual use cases at the moment? At least on our > >> > > Alibaba's fleet, SMMU and MMU do not share page tables now. > >> > > >> > We have systems that do. > >> > >> Just curious: do those systems run the Linux kernel? If so, are pages > >> shared with SMMU pinned? 
> >> If not, then how are IO PFs handled after pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally it runs the same codes to get page back just like
> > CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were
> > lots of push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Right. But sometimes I feel that the purpose and the technology for the
purpose are contradictory. People want userspace DMA to have high
bandwidth and performance by dropping u/k copying and the complicated
syscalls managing those buffers, but IO PF somehow defeats all the
efforts to improve performance. AFAIK, IO PF is much, much slower than a
CPU PF as it depends on interrupts and event queues in hardware. Even a
minor page fault due to page migration or demand paging can
significantly decrease the speed of userspace DMA. So in the end people
have to depend on mlock-like and pre-paging approaches to achieve the
purpose if performance is the key concern.

Thanks
Barry
On Wed, Oct 25, 2023 at 09:39:19AM +0800, Yin, Fengwei wrote:
>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 0bd18de9fd97..2979d796ba9d 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>  					 unsigned long address, pte_t *ptep)
> >>  {
> >> -	int young = ptep_test_and_clear_young(vma, address, ptep);
> >> -
> >> -	if (young) {
> >> -		/*
> >> -		 * We can elide the trailing DSB here since the worst that can
> >> -		 * happen is that a CPU continues to use the young entry in its
> >> -		 * TLB and we mistakenly reclaim the associated page. The
> >> -		 * window for such an event is bounded by the next
> >> -		 * context-switch, which provides a DSB to complete the TLB
> >> -		 * invalidation.
> >> -		 */
> >> -		flush_tlb_page_nosync(vma, address);
> >> -	}
> >> -
> >> -	return young;
> >> +	/*
> >> +	 * This comment is borrowed from x86, but applies equally to ARM64:
> >> +	 *
> >> +	 * Clearing the accessed bit without a TLB flush doesn't cause
> >> +	 * data corruption. [ It could cause incorrect page aging and
> >> +	 * the (mistaken) reclaim of hot pages, but the chance of that
> >> +	 * should be relatively low. ]
> >> +	 *
> >> +	 * So as a performance optimization don't flush the TLB when
> >> +	 * clearing the accessed bit, it will eventually be flushed by
> >> +	 * a context switch or a VM operation anyway. [ In the rare
> >> +	 * event of it not getting flushed for a long time the delay
> >> +	 * shouldn't really matter because there's no real memory
> >> +	 * pressure for swapout to react to. ]
> >> +	 */
> >> +	return ptep_test_and_clear_young(vma, address, ptep);
> >>  }
>
> From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/:
>
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.

I completely agree.

> My understanding is that arm64 doesn't do invalidate the TLB during
> context switch. The flush_tlb_page_nosync() here + DSB during context
> switch make sure the TLB is invalidated during context switch.
> So we can't remove flush_tlb_page_nosync() here? Or something was changed
> for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
> may be missed)? Thanks.

As you point out, we already elide the DSB here but I don't think we should
remove the TLB invalidation entirely because then we lose the guarantee
that the update ever becomes visible to the page-table walker.

I'm surprised that the TLBI is showing up as a performance issue without
the DSB present. Is it because we're walking over a large VA range and
invalidating on a per-page basis? If so, we'd be better off batching them
up and doing the invalidation at the end (which will be upgraded to a
full-mm invalidation if the range is large enough).

Will
On Tue, Nov 7, 2023 at 6:12 PM Will Deacon <will@kernel.org> wrote:
>
> On Wed, Oct 25, 2023 at 09:39:19AM +0800, Yin, Fengwei wrote:
> >
> > >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > >> index 0bd18de9fd97..2979d796ba9d 100644
> > >> --- a/arch/arm64/include/asm/pgtable.h
> > >> +++ b/arch/arm64/include/asm/pgtable.h
> > >> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > >>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > >>  					 unsigned long address, pte_t *ptep)
> > >>  {
> > >> -	int young = ptep_test_and_clear_young(vma, address, ptep);
> > >> -
> > >> -	if (young) {
> > >> -		/*
> > >> -		 * We can elide the trailing DSB here since the worst that can
> > >> -		 * happen is that a CPU continues to use the young entry in its
> > >> -		 * TLB and we mistakenly reclaim the associated page. The
> > >> -		 * window for such an event is bounded by the next
> > >> -		 * context-switch, which provides a DSB to complete the TLB
> > >> -		 * invalidation.
> > >> -		 */
> > >> -		flush_tlb_page_nosync(vma, address);
> > >> -	}
> > >> -
> > >> -	return young;
> > >> +	/*
> > >> +	 * This comment is borrowed from x86, but applies equally to ARM64:
> > >> +	 *
> > >> +	 * Clearing the accessed bit without a TLB flush doesn't cause
> > >> +	 * data corruption. [ It could cause incorrect page aging and
> > >> +	 * the (mistaken) reclaim of hot pages, but the chance of that
> > >> +	 * should be relatively low. ]
> > >> +	 *
> > >> +	 * So as a performance optimization don't flush the TLB when
> > >> +	 * clearing the accessed bit, it will eventually be flushed by
> > >> +	 * a context switch or a VM operation anyway. [ In the rare
> > >> +	 * event of it not getting flushed for a long time the delay
> > >> +	 * shouldn't really matter because there's no real memory
> > >> +	 * pressure for swapout to react to. ]
> > >> +	 */
> > >> +	return ptep_test_and_clear_young(vma, address, ptep);
> > >>  }
> >
> > From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/:
> >
> > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > the TLB on context switch. That means our window for keeping the stale
> > entries around is potentially much bigger and might not be a great idea.
>
> I completely agree.
>
> > My understanding is that arm64 doesn't do invalidate the TLB during
> > context switch. The flush_tlb_page_nosync() here + DSB during context
> > switch make sure the TLB is invalidated during context switch.
> > So we can't remove flush_tlb_page_nosync() here? Or something was changed
> > for arm64 (I have zero knowledge to TLB on arm64. So some obvious thing
> > may be missed)? Thanks.
>
> As you point out, we already elide the DSB here but I don't think we should
> remove the TLB invalidation entirely because then we lose the guarantee
> that the update ever becomes visible to the page-table walker.
>
> I'm surprised that the TLBI is showing up as a performance issue without
> the DSB present. Is it because we're walking over a large VA range and
> invalidating on a per-page basis? If so, we'd be better off batching them

Nop. In the LRU case there are thousands of pages on the LRU list. Doing
vmscan, we depend on rmap to find their PTEs, then read and clear the AF
to figure out whether a page is young. So it is not a walk over one big
VM area invalidating the pages in that VA range; there are just too many
pages from lots of processes on the LRU to be scanned. The work is done
by rmap.

> up and doing the invalidation at the end (which will be upgraded to a
> full-mm invalidation if the range is large enough).

Those pages on the LRU could be from hundreds of different processes;
they are not in just one process. I guess one possibility is that the
hardware has a limited tlbi/nosync buffer; once the buffer is full,
something similar to a DSB is done automatically by the hardware. So too
many TLBIs, even without a DSB, can still harm performance.

> Will

Thanks
Barry
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..2979d796ba9d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
-	int young = ptep_test_and_clear_young(vma, address, ptep);
-
-	if (young) {
-		/*
-		 * We can elide the trailing DSB here since the worst that can
-		 * happen is that a CPU continues to use the young entry in its
-		 * TLB and we mistakenly reclaim the associated page. The
-		 * window for such an event is bounded by the next
-		 * context-switch, which provides a DSB to complete the TLB
-		 * invalidation.
-		 */
-		flush_tlb_page_nosync(vma, address);
-	}
-
-	return young;
+	/*
+	 * This comment is borrowed from x86, but applies equally to ARM64:
+	 *
+	 * Clearing the accessed bit without a TLB flush doesn't cause
+	 * data corruption. [ It could cause incorrect page aging and
+	 * the (mistaken) reclaim of hot pages, but the chance of that
+	 * should be relatively low. ]
+	 *
+	 * So as a performance optimization don't flush the TLB when
+	 * clearing the accessed bit, it will eventually be flushed by
+	 * a context switch or a VM operation anyway. [ In the rare
+	 * event of it not getting flushed for a long time the delay
+	 * shouldn't really matter because there's no real memory
+	 * pressure for swapout to react to. ]
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE