Message ID | 20230721094043.2506691-3-fengwei.yin@intel.com |
---|---|
State | New |
Headers |
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, minchan@kernel.org, yuzhao@google.com, willy@infradead.org, david@redhat.com, ryan.roberts@arm.com, shy828301@gmail.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries
Date: Fri, 21 Jul 2023 17:40:41 +0800
Message-Id: <20230721094043.2506691-3-fengwei.yin@intel.com>
In-Reply-To: <20230721094043.2506691-1-fengwei.yin@intel.com>
References: <20230721094043.2506691-1-fengwei.yin@intel.com>
Series |
fix large folio for madvise_cold_or_pageout()
Commit Message
Yin Fengwei
July 21, 2023, 9:40 a.m. UTC
Currently, in function madvise_cold_or_pageout_pte_range(), the
young bit of the pte/pmd is cleared without notifying the subscriber.

Use the notify-able API to make sure the subscriber is signaled about
the young bit clearing.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/madvise.c | 18 ++----------------
1 file changed, 2 insertions(+), 16 deletions(-)
Comments
On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> Currently, in function madvise_cold_or_pageout_pte_range(), the
> young bit of pte/pmd is cleared notify subscripter.
>
> Using notify-able API to make sure the subscripter is signaled about
> the young bit clearing.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/madvise.c | 18 ++----------------
>  1 file changed, 2 insertions(+), 16 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f12933ebcc24..b236e201a738 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                         return 0;
>                 }
>
> -               if (pmd_young(orig_pmd)) {
> -                       pmdp_invalidate(vma, addr, pmd);
> -                       orig_pmd = pmd_mkold(orig_pmd);
> -
> -                       set_pmd_at(mm, addr, pmd, orig_pmd);
> -                       tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> -               }
> -
> +               pmdp_clear_flush_young_notify(vma, addr, pmd);
>                 folio_clear_referenced(folio);
>                 folio_test_clear_young(folio);
>                 if (folio_test_active(folio))
> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>
>                 VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>
> -               if (pte_young(ptent)) {
> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                       tlb->fullmm);
> -                       ptent = pte_mkold(ptent);
> -                       set_pte_at(mm, addr, pte, ptent);
> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> -               }
> -
> +               ptep_clear_flush_young_notify(vma, addr, pte);

These two places are tricky.

I agree there is a problem here, i.e., we are not consulting the mmu
notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
known problem to me for a while (not a high priority one).

tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
not. But, on x86, we might see a performance improvement since
ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
be regressions though.

I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
prefers flush. So I'll let him chime in.

If we do end up with ptep_clear_young_notify(), please remove
mmu_gather -- it should have been done in this patch.
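The trade-off Yu describes (batched flush via mmu_gather versus an immediate flush in ptep_clear_flush_young(), which x86 skips entirely) can be modeled with a small self-contained sketch. The struct and function names below are illustrative stand-ins, not the actual kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one PTE's accessed (young) bit plus a TLB that may
 * still cache a stale copy. All names here are hypothetical. */
struct toy_pte { bool young; };
struct toy_tlb { bool pending_flush; };

/* Open-coded aging (the code the patch removes): clear the young bit
 * and queue a flush on the mmu_gather-like batch, to be issued once
 * for the whole range. */
static void clear_young_batched(struct toy_pte *pte, struct toy_tlb *tlb)
{
	if (pte->young) {
		pte->young = false;
		tlb->pending_flush = true;	/* flushed later, in one go */
	}
}

/* ptep_clear_flush_young()-style helper: clear the young bit and
 * flush right away (in this model; on real x86 the flush is skipped,
 * on arm64 it is deferred but not skipped). */
static void clear_young_flush(struct toy_pte *pte, struct toy_tlb *tlb)
{
	if (pte->young)
		pte->young = false;
	tlb->pending_flush = false;	/* nothing left pending */
}
```

The per-entry flush in the second variant is why a regression on ARM is plausible: over a large range, the batched variant issues one flush for N cleared entries while the flushing variant issues N.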
On 7/25/23 13:55, Yu Zhao wrote:
> [...]
> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
> prefers flush. So I'll let him chime in.
I am OK with either way, even though the no-flush way is more
efficient for arm64. Let's wait for Minchan's comment.

> If we do end up with ptep_clear_young_notify(), please remove
> mmu_gather -- it should have been done in this patch.
I suppose "remove mmu_gather" means to trigger the TLB flush in a
batched way, to make sure no stale data stays in the TLB for a long
time on the arm64 platform.

Regards
Yin, Fengwei
On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
> I am OK with either way, even though the no-flush way is more
> efficient for arm64. Let's wait for Minchan's comment.

Yes, and I don't think there would be any "negative" consequences
without tlb flushes when clearing the A-bit.

> I suppose "remove mmu_gather" means to trigger the TLB flush in a
> batched way, to make sure no stale data stays in the TLB for a long
> time on the arm64 platform.

In madvise_cold_or_pageout_pte_range(), we only need struct
mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
tlb after clearing the A-bit. There is no correctness issue, e.g.,
potential data corruption, involved there.
On 7/26/23 11:26, Yu Zhao wrote:
> [...]
> In madvise_cold_or_pageout_pte_range(), we only need struct
> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
> tlb after clearing the A-bit. There is no correctness issue, e.g.,
> potential data corruption, involved there.

From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/,
the reason arm64 didn't drop the whole TLB flush in
ptep_clear_flush_young() is to prevent stale data staying in the TLB.
I suppose there is no correctness issue there either.

So why is keeping stale data in the TLB fine in
madvise_cold_or_pageout_pte_range()?

Regards
Yin, Fengwei
On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
> So why is keeping stale data in the TLB fine in
> madvise_cold_or_pageout_pte_range()?

Sorry, I'm not sure I understand your question here.

In this patch, you removed tlb_remove_tlb_entry(), so we don't need
struct mmu_gather *tlb any more.

If you are asking why I prefer ptep_clear_young_notify() (no flush),
which also doesn't need tlb_remove_tlb_entry(), then the answer is
that the TLB size doesn't scale like DRAM does: the gap has been
growing exponentially. So there is no way TLB can hold stale entries
long enough to cause a measurable effect on the A-bit. This isn't a
conjecture -- it's been proven conversely: we encountered bugs (almost
every year) caused by missing TLB flushes and resulting in data
corruption. They were never easy to reproduce, meaning stale entries
never stayed long in TLB.
On 7/26/23 13:40, Yu Zhao wrote:
> [...]
> Sorry, I'm not sure I understand your question here.
Oh. Sorry for the confusion. I will explain my understanding and
question in detail.

> In this patch, you removed tlb_remove_tlb_entry(), so we don't need
> struct mmu_gather *tlb any more.
Yes. You are right.

> If you are asking why I prefer ptep_clear_young_notify() (no flush),
> which also doesn't need tlb_remove_tlb_entry(), then the answer is
> that the TLB size doesn't scale like DRAM does [...]

When I read https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/,
my understanding was that arm64 still keeps something in
ptep_clear_flush_young(): the TLB flush is finished by the next
context switch, to make sure no stale entries in the TLB survive
across the next context switch.

Should we keep the same behavior (no stale entries in the TLB across
the next context switch) for madvise_cold_or_pageout_pte_range()?

So two versions work for me (assuming we should keep the same
behavior):
1. using the xxx_flush_xxx() version, but with a possible regression
   on arm64.
2. using the non-flush version, but doing a batched TLB flush.

Hope this makes things clear. Thanks.

Regards
Yin, Fengwei
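Yin's option 2 above (clear the young bit over a whole range without per-entry flushes, then issue one batched flush) can be sketched as a small self-contained model; the struct and function names are illustrative, not the real kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a range of PTE young bits plus a counter of
 * issued TLB flushes. */
struct range_pte { bool young; };
struct range_tlb { int flushes; };

/* Clear the young bit across the range with no per-entry flush, then
 * issue a single flush if anything changed -- one flush instead of
 * one per cleared entry. */
static int range_clear_young_batched(struct range_pte *ptes, size_t n,
				     struct range_tlb *tlb)
{
	size_t i;
	int cleared = 0;

	for (i = 0; i < n; i++) {
		if (ptes[i].young) {
			ptes[i].young = false;
			cleared++;
		}
	}
	if (cleared)
		tlb->flushes++;		/* single batched flush */
	return cleared;
}
```

Yu's later reply argues even this single flush is unnecessary when only the A-bit is cleared, since no correctness (data-corruption) issue is involved.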
On Wed, Jul 26, 2023 at 12:21 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
> So two versions work for me (assuming we should keep the same
> behavior):
> 1. using the xxx_flush_xxx() version, but with a possible regression
>    on arm64.
> 2. using the non-flush version, but doing a batched TLB flush.

I see. I don't think we need to flush at all, i.e., the no flush
version *without* batched TLB flush.

So far nobody can actually prove that flushing TLB while clearing the
A-bit is beneficial, not even in theory :)
On 7/27/2023 11:28 AM, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 12:21 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>> On 7/26/23 13:40, Yu Zhao wrote:
>>> On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>> On 7/26/23 11:26, Yu Zhao wrote:
>>>>> On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 7/25/23 13:55, Yu Zhao wrote:
>>>>>>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>>>
>>>>>>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
>>>>>>>> young bit of the pte/pmd is cleared without notifying the
>>>>>>>> subscriber.
>>>>>>>>
>>>>>>>> Use the notify-able API to make sure the subscriber is signaled
>>>>>>>> about the young bit clearing.
>>>>>>>>
>>>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>>>> ---
>>>>>>>>  mm/madvise.c | 18 ++----------------
>>>>>>>>  1 file changed, 2 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>>>> index f12933ebcc24..b236e201a738 100644
>>>>>>>> --- a/mm/madvise.c
>>>>>>>> +++ b/mm/madvise.c
>>>>>>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>>                         return 0;
>>>>>>>>                 }
>>>>>>>>
>>>>>>>> -               if (pmd_young(orig_pmd)) {
>>>>>>>> -                       pmdp_invalidate(vma, addr, pmd);
>>>>>>>> -                       orig_pmd = pmd_mkold(orig_pmd);
>>>>>>>> -
>>>>>>>> -                       set_pmd_at(mm, addr, pmd, orig_pmd);
>>>>>>>> -                       tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>>>>>>>> -               }
>>>>>>>> -
>>>>>>>> +               pmdp_clear_flush_young_notify(vma, addr, pmd);
>>>>>>>>                 folio_clear_referenced(folio);
>>>>>>>>                 folio_test_clear_young(folio);
>>>>>>>>                 if (folio_test_active(folio))
>>>>>>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>>
>>>>>>>>                 VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>>
>>>>>>>> -               if (pte_young(ptent)) {
>>>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> -                                                       tlb->fullmm);
>>>>>>>> -                       ptent = pte_mkold(ptent);
>>>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> -               }
>>>>>>>> -
>>>>>>>> +               ptep_clear_flush_young_notify(vma, addr, pte);
>>>>>>>
>>>>>>> These two places are tricky.
>>>>>>>
>>>>>>> I agree there is a problem here, i.e., we are not consulting the mmu
>>>>>>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
>>>>>>> known problem to me for a while (not a high priority one).
>>>>>>>
>>>>>>> tlb_remove_tlb_entry() is a batched flush; ptep_clear_flush_young()
>>>>>>> is not. But, on x86, we might see a performance improvement since
>>>>>>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there
>>>>>>> might be regressions though.
>>>>>>>
>>>>>>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
>>>>>>> prefers flush. So I'll let him chime in.
>>>>>> I am OK with either way, even though the no-flush way here is more
>>>>>> efficient for arm64. Let's wait for Minchan's comment.
>>>>>
>>>>> Yes, and I don't think there would be any "negative" consequences
>>>>> without tlb flushes when clearing the A-bit.
>>>>>
>>>>>>> If we do end up with ptep_clear_young_notify(), please remove
>>>>>>> mmu_gather -- it should have been done in this patch.
>>>>>>
>>>>>> I suppose "remove mmu_gather" means to trigger the tlb flush
>>>>>> operation in a batched way to make sure there is no stale data in the
>>>>>> TLB for a long time on the arm64 platform.
>>>>>
>>>>> In madvise_cold_or_pageout_pte_range(), we only need struct
>>>>> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
>>>>> tlb after clearing the A-bit. There is no correctness issue, e.g.,
>>>>> potential data corruption, involved there.
>>>>
>>>> From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/,
>>>> the reason that arm64 didn't drop the whole tlb flush in
>>>> ptep_clear_flush_young() is to prevent stale data in the TLB. I suppose
>>>> there is no correctness issue there either.
>>>>
>>>> So why is keeping stale data in the TLB fine in
>>>> madvise_cold_or_pageout_pte_range()?
>>>
>>> Sorry, I'm not sure I understand your question here.
>> Oh. Sorry for the confusion. I will explain my understanding and
>> question in detail.
>>
>>>
>>> In this patch, you removed tlb_remove_tlb_entry(), so we don't need
>>> struct mmu_gather *tlb any more.
>> Yes. You are right.
>>
>>>
>>> If you are asking why I prefer ptep_clear_young_notify() (no flush),
>>> which also doesn't need tlb_remove_tlb_entry(), then the answer is
>>> that the TLB size doesn't scale like DRAM does: the gap has been
>>> growing exponentially. So there is no way the TLB can hold stale
>>> entries long enough to cause a measurable effect on the A-bit. This
>>> isn't a conjecture -- it's been proven conversely: we encountered bugs
>>> (almost every year) caused by missing TLB flushes and resulting in data
>>> corruption. They were never easy to reproduce, meaning stale entries
>>> never stayed long in the TLB.
>>
>> When I read https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/,
>> my understanding was that arm64 still keeps something in
>> ptep_clear_flush_young(). The reason is to finish the TLB flush by the
>> next context switch, to make sure no stale entries in the TLB survive
>> across the next context switch.
>>
>> Should we still keep the same behavior (no stale entries in the TLB
>> across the next context switch) for madvise_cold_or_pageout_pte_range()?
>>
>> So two versions work for me (assuming we should keep the same behavior):
>> 1. using the xxx_flush_xxx() version, but with a possible regression on arm64.
>> 2. using the non-flush version, but doing a batched TLB flush.
>
> I see. I don't think we need to flush at all, i.e., the no-flush
> version *without* a batched TLB flush. So far nobody can actually prove
> that flushing the TLB while clearing the A-bit is beneficial, not even
> in theory :)

I will just send the fix for folio_mapcount() (with your reviewed-by) as
it's a bug fix and it's better to be merged standalone.
The other three patches need more time for discussion.


Regards
Yin, Fengwei
diff --git a/mm/madvise.c b/mm/madvise.c
index f12933ebcc24..b236e201a738 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			return 0;
 		}
 
-		if (pmd_young(orig_pmd)) {
-			pmdp_invalidate(vma, addr, pmd);
-			orig_pmd = pmd_mkold(orig_pmd);
-
-			set_pmd_at(mm, addr, pmd, orig_pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-		}
-
+		pmdp_clear_flush_young_notify(vma, addr, pmd);
 		folio_clear_referenced(folio);
 		folio_test_clear_young(folio);
 		if (folio_test_active(folio))
@@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
 
-		if (pte_young(ptent)) {
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			ptent = pte_mkold(ptent);
-			set_pte_at(mm, addr, pte, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-		}
-
+		ptep_clear_flush_young_notify(vma, addr, pte);
 		/*
 		 * We are deactivating a folio for accelerating reclaiming.
 		 * VM couldn't reclaim the folio unless we clear PG_young.