Message ID | 20230109052816.405335-1-anshuman.khandual@arm.com |
---|---|
State | New |
Headers |
From: Anshuman Khandual <anshuman.khandual@arm.com>
To: linux-arm-kernel@lists.infradead.org, will@kernel.org, catalin.marinas@arm.com
Cc: Anshuman Khandual <anshuman.khandual@arm.com>, Mark Rutland <mark.rutland@arm.com>, Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org
Subject: [PATCH V2] arm64/mm: Intercept pfn changes in set_pte_at()
Date: Mon, 9 Jan 2023 10:58:16 +0530
Message-Id: <20230109052816.405335-1-anshuman.khandual@arm.com> |
Series | [V2] arm64/mm: Intercept pfn changes in set_pte_at() |
Commit Message
Anshuman Khandual
Jan. 9, 2023, 5:28 a.m. UTC
Changing the pfn of a mapped user page table entry without first going through
the break-before-make (BBM) procedure is unsafe. This patch updates set_pte_at()
to intercept such changes via an updated pgattr_change_is_safe(). The new
check happens in __check_racy_pte_update(), which has now been renamed
__check_safe_pte_update().
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
This applies on v6.2-rc3. This patch had some test time on an internal CI
system without any issues being reported.
Changes in V1:
https://lore.kernel.org/all/20221116031001.292236-1-anshuman.khandual@arm.com/
arch/arm64/include/asm/pgtable.h | 8 ++++++--
arch/arm64/mm/mmu.c | 8 +++++++-
2 files changed, 13 insertions(+), 3 deletions(-)
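For reference, below is a minimal sketch of the break-before-make (BBM) sequence the commit message above refers to, written against the generic Linux page-table helpers. It is illustrative only and not part of the patch; the vma, addr, ptep and new_pte arguments are assumed to be in hand.

/*
 * Illustrative BBM sequence for changing the pfn of a live user pte:
 * break (clear the entry and invalidate the TLB), then make (install
 * the new entry).
 */
static void change_pfn_with_bbm(struct vm_area_struct *vma, unsigned long addr,
				pte_t *ptep, pte_t new_pte)
{
	/* Break: invalidate the live entry and flush the TLB for this address. */
	ptep_clear_flush(vma, addr, ptep);

	/* Make: only now install the entry pointing at the new pfn. */
	set_pte_at(vma->vm_mm, addr, ptep, new_pte);
}

The check added by this patch warns when the "make" step is attempted without the "break", i.e. when both the old and the new entry are valid but their pfns differ.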
Comments
On 1/9/23 10:58, Anshuman Khandual wrote: > Changing pfn on a user page table mapped entry, without first going through > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > to intercept such changes, via an updated pgattr_change_is_safe(). This new > check happens via __check_racy_pte_update(), which has now been renamed as > __check_safe_pte_update(). > > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > --- > This applies on v6.2-rc3. This patch had some test time on an internal CI > system without any issues being reported. Gentle ping, any updates on this patch ? Still any concerns ? > > Changes in V1: > > https://lore.kernel.org/all/20221116031001.292236-1-anshuman.khandual@arm.com/ > > arch/arm64/include/asm/pgtable.h | 8 ++++++-- > arch/arm64/mm/mmu.c | 8 +++++++- > 2 files changed, 13 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index b4bbeed80fb6..832c9c8fb58f 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -275,6 +275,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte) > } > > extern void __sync_icache_dcache(pte_t pteval); > +bool pgattr_change_is_safe(u64 old, u64 new); > > /* > * PTE bits configuration in the presence of hardware Dirty Bit Management > @@ -292,7 +293,7 @@ extern void __sync_icache_dcache(pte_t pteval); > * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) > */ > > -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, > +static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep, > pte_t pte) > { > pte_t old_pte; > @@ -318,6 +319,9 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, > VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), > "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", > __func__, pte_val(old_pte), pte_val(pte)); > + VM_WARN_ONCE(!pgattr_change_is_safe(pte_val(old_pte), pte_val(pte)), > + "%s: unsafe attribute change: 0x%016llx -> 0x%016llx", > + __func__, pte_val(old_pte), pte_val(pte)); > } > > static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, > @@ -346,7 +350,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, > mte_sync_tags(old_pte, pte); > } > > - __check_racy_pte_update(mm, ptep, pte); > + __check_safe_pte_update(mm, ptep, pte); > > set_pte(ptep, pte); > } > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 14c87e8d69d8..a1d16b35c4f6 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -133,7 +133,7 @@ static phys_addr_t __init early_pgtable_alloc(int shift) > return phys; > } > > -static bool pgattr_change_is_safe(u64 old, u64 new) > +bool pgattr_change_is_safe(u64 old, u64 new) > { > /* > * The following mapping attributes may be updated in live > @@ -145,6 +145,12 @@ static bool pgattr_change_is_safe(u64 old, u64 new) > if (old == 0 || new == 0) > return true; > > + /* If old and new ptes are valid, pfn should not change */ > + if (pte_valid(__pte(old)) && pte_valid(__pte(new))) { > + if (pte_pfn(__pte(old)) != pte_pfn(__pte(new))) > + return false; > + } > + > /* live contiguous mappings may not be manipulated at all */ > if ((old | new) & PTE_CONT) > return false;
On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: > On 1/9/23 10:58, Anshuman Khandual wrote: > > Changing pfn on a user page table mapped entry, without first going through > > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > > to intercept such changes, via an updated pgattr_change_is_safe(). This new > > check happens via __check_racy_pte_update(), which has now been renamed as > > __check_safe_pte_update(). > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Will Deacon <will@kernel.org> > > Cc: Mark Rutland <mark.rutland@arm.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: linux-arm-kernel@lists.infradead.org > > Cc: linux-kernel@vger.kernel.org > > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > --- > > This applies on v6.2-rc3. This patch had some test time on an internal CI > > system without any issues being reported. > > Gentle ping, any updates on this patch ? Still any concerns ? I don't think we really got to the bottom of Mark's concerns with unreachable ptes on the stack, did we? I also have vague recollections of somebody (Robin?) running into issues with the vmap code not honouring BBM. So I think we should confirm/fix the vmap issue before we enable this check and also try to get some testing coverage to address Mark's worries. I think he has a syzkaller instance set up, so that sound like a good place to start. Will
On 2023-01-26 13:33, Will Deacon wrote: > On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: >> On 1/9/23 10:58, Anshuman Khandual wrote: >>> Changing pfn on a user page table mapped entry, without first going through >>> break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() >>> to intercept such changes, via an updated pgattr_change_is_safe(). This new >>> check happens via __check_racy_pte_update(), which has now been renamed as >>> __check_safe_pte_update(). >>> >>> Cc: Catalin Marinas <catalin.marinas@arm.com> >>> Cc: Will Deacon <will@kernel.org> >>> Cc: Mark Rutland <mark.rutland@arm.com> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: linux-arm-kernel@lists.infradead.org >>> Cc: linux-kernel@vger.kernel.org >>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> >>> --- >>> This applies on v6.2-rc3. This patch had some test time on an internal CI >>> system without any issues being reported. >> >> Gentle ping, any updates on this patch ? Still any concerns ? > > I don't think we really got to the bottom of Mark's concerns with > unreachable ptes on the stack, did we? I also have vague recollections > of somebody (Robin?) running into issues with the vmap code not honouring > BBM. Doesn't ring a bell, so either it wasn't me, or it was many years ago and about 5 levels deep into trying to fix something else :/ > So I think we should confirm/fix the vmap issue before we enable this check > and also try to get some testing coverage to address Mark's worries. I think > he has a syzkaller instance set up, so that sound like a good place to > start. I think we're also missing a subtlety here in that this restriction doesn't *always* apply. For instance if someone wants to move a page by making the mapping read-only, copying the contents to a new page, then pointing the RO mapping at that new page, that should technically not require BBM. Thanks, Robin.
Hi Annshuman, On Mon, Jan 09, 2023 at 10:58:16AM +0530, Anshuman Khandual wrote: > Changing pfn on a user page table mapped entry, without first going through > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > to intercept such changes, via an updated pgattr_change_is_safe(). This new > check happens via __check_racy_pte_update(), which has now been renamed as > __check_safe_pte_update(). > > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > --- > This applies on v6.2-rc3. This patch had some test time on an internal CI > system without any issues being reported. Can you elaborate on this a little bit? It's not entirely clear what that internal CI system has tested. It would be helpful if you could indicate: * What sort of testing has been done by the CI system? e.g. is this just booting, running LTP, something else? * Has this tried a bunch of configurations and/or machines? * If any targetted stress tests have been used? e.g. stress-ng's memory system tests? I'm assuming that's hitting LTP on a few machines/configs, which'd be reasonable. It'd just be nice to confirm exactly what has been tested. I've added this to my lcoal syzkaller instance's test branch, and I'll shout if that hits anything over the weekend. > Changes in V1: > > https://lore.kernel.org/all/20221116031001.292236-1-anshuman.khandual@arm.com/ Did you mean to list some cahnges here? > > arch/arm64/include/asm/pgtable.h | 8 ++++++-- > arch/arm64/mm/mmu.c | 8 +++++++- > 2 files changed, 13 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index b4bbeed80fb6..832c9c8fb58f 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -275,6 +275,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte) > } > > extern void __sync_icache_dcache(pte_t pteval); > +bool pgattr_change_is_safe(u64 old, u64 new); > > /* > * PTE bits configuration in the presence of hardware Dirty Bit Management > @@ -292,7 +293,7 @@ extern void __sync_icache_dcache(pte_t pteval); > * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) > */ > > -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, > +static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep, > pte_t pte) > { > pte_t old_pte; > @@ -318,6 +319,9 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, > VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), > "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", > __func__, pte_val(old_pte), pte_val(pte)); > + VM_WARN_ONCE(!pgattr_change_is_safe(pte_val(old_pte), pte_val(pte)), > + "%s: unsafe attribute change: 0x%016llx -> 0x%016llx", > + __func__, pte_val(old_pte), pte_val(pte)); > } > > static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, > @@ -346,7 +350,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, > mte_sync_tags(old_pte, pte); > } > > - __check_racy_pte_update(mm, ptep, pte); > + __check_safe_pte_update(mm, ptep, pte); > > set_pte(ptep, pte); > } > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 14c87e8d69d8..a1d16b35c4f6 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -133,7 +133,7 @@ static phys_addr_t 
__init early_pgtable_alloc(int shift) > return phys; > } > > -static bool pgattr_change_is_safe(u64 old, u64 new) > +bool pgattr_change_is_safe(u64 old, u64 new) > { > /* > * The following mapping attributes may be updated in live > @@ -145,6 +145,12 @@ static bool pgattr_change_is_safe(u64 old, u64 new) > if (old == 0 || new == 0) > return true; These checks above should really use pte_valid(); we were just being lazy when this was originally written since for the init_*() cases the memory should be zero initially. So could you make that: if (!pte_valid(__pte(old)) || !pte_valid(__pte(new))) return true; > + /* If old and new ptes are valid, pfn should not change */ > + if (pte_valid(__pte(old)) && pte_valid(__pte(new))) { > + if (pte_pfn(__pte(old)) != pte_pfn(__pte(new))) > + return false; > + } With the above change, it's clear that both must be valid to get this far, and this check can be reduced to: /* A live entry's pfn should not change */ if (pte_pfn(__pte(old)) != pte_pfn(__pte(new))) return false; With those changes: Acked-by: Mark Rutland <mark.rutland@arm.com> Thanks, Mark.
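Read together, Mark's two suggestions would shape the new hunk roughly as in the sketch below. This is only a reading of the review, not the code that was eventually merged; the attribute-mask comparison that ends the real pgattr_change_is_safe() in arch/arm64/mm/mmu.c is elided.

bool pgattr_change_is_safe(u64 old, u64 new)
{
	/*
	 * Creating or tearing down an entry is always safe; only transitions
	 * between two valid (live) entries need further checking.
	 */
	if (!pte_valid(__pte(old)) || !pte_valid(__pte(new)))
		return true;

	/* A live entry's pfn should not change without break-before-make. */
	if (pte_pfn(__pte(old)) != pte_pfn(__pte(new)))
		return false;

	/* live contiguous mappings may not be manipulated at all */
	if ((old | new) & PTE_CONT)
		return false;

	/* Remaining attribute-mask checks from mmu.c go here (elided). */
	return true;
}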
On Thu, Jan 26, 2023 at 01:33:22PM +0000, Will Deacon wrote: > On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: > > On 1/9/23 10:58, Anshuman Khandual wrote: > > > Changing pfn on a user page table mapped entry, without first going through > > > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > > > to intercept such changes, via an updated pgattr_change_is_safe(). This new > > > check happens via __check_racy_pte_update(), which has now been renamed as > > > __check_safe_pte_update(). > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > > Cc: Will Deacon <will@kernel.org> > > > Cc: Mark Rutland <mark.rutland@arm.com> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: linux-arm-kernel@lists.infradead.org > > > Cc: linux-kernel@vger.kernel.org > > > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > > --- > > > This applies on v6.2-rc3. This patch had some test time on an internal CI > > > system without any issues being reported. > > > > Gentle ping, any updates on this patch ? Still any concerns ? > > I don't think we really got to the bottom of Mark's concerns with > unreachable ptes on the stack, did we? I also have vague recollections > of somebody (Robin?) running into issues with the vmap code not honouring > BBM. > > So I think we should confirm/fix the vmap issue before we enable this check > and also try to get some testing coverage to address Mark's worries. I think > he has a syzkaller instance set up, so that sound like a good place to > start. I've thrown my Syzkaller instance at this patch; if it doesn't find anything by Monday I reckon we should pick this up. That said, I had some minor nits on the patch; I'm not sure if you'd be happy to apply the suggested changes when applying or if you'd prefer that Anshuman applies those locally and sense a v3. Thanks, Mark.
On 1/27/23 20:46, Mark Rutland wrote: > On Thu, Jan 26, 2023 at 01:33:22PM +0000, Will Deacon wrote: >> On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: >>> On 1/9/23 10:58, Anshuman Khandual wrote: >>>> Changing pfn on a user page table mapped entry, without first going through >>>> break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() >>>> to intercept such changes, via an updated pgattr_change_is_safe(). This new >>>> check happens via __check_racy_pte_update(), which has now been renamed as >>>> __check_safe_pte_update(). >>>> >>>> Cc: Catalin Marinas <catalin.marinas@arm.com> >>>> Cc: Will Deacon <will@kernel.org> >>>> Cc: Mark Rutland <mark.rutland@arm.com> >>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>> Cc: linux-arm-kernel@lists.infradead.org >>>> Cc: linux-kernel@vger.kernel.org >>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> >>>> --- >>>> This applies on v6.2-rc3. This patch had some test time on an internal CI >>>> system without any issues being reported. >>> >>> Gentle ping, any updates on this patch ? Still any concerns ? >> >> I don't think we really got to the bottom of Mark's concerns with >> unreachable ptes on the stack, did we? I also have vague recollections >> of somebody (Robin?) running into issues with the vmap code not honouring >> BBM. >> >> So I think we should confirm/fix the vmap issue before we enable this check >> and also try to get some testing coverage to address Mark's worries. I think >> he has a syzkaller instance set up, so that sound like a good place to >> start. > > I've thrown my Syzkaller instance at this patch; if it doesn't find anything by > Monday I reckon we should pick this up. > > That said, I had some minor nits on the patch; I'm not sure if you'd be happy > to apply the suggested changes when applying or if you'd prefer that Anshuman > applies those locally and sense a v3. I could send out a V3, running some stress-ng based memory tests with the suggested changes applied on the patch.
On Fri, Jan 27, 2023 at 03:16:52PM +0000, Mark Rutland wrote: > On Thu, Jan 26, 2023 at 01:33:22PM +0000, Will Deacon wrote: > > On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: > > > On 1/9/23 10:58, Anshuman Khandual wrote: > > > > Changing pfn on a user page table mapped entry, without first going through > > > > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > > > > to intercept such changes, via an updated pgattr_change_is_safe(). This new > > > > check happens via __check_racy_pte_update(), which has now been renamed as > > > > __check_safe_pte_update(). > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > > > Cc: Will Deacon <will@kernel.org> > > > > Cc: Mark Rutland <mark.rutland@arm.com> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: linux-arm-kernel@lists.infradead.org > > > > Cc: linux-kernel@vger.kernel.org > > > > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > > > --- > > > > This applies on v6.2-rc3. This patch had some test time on an internal CI > > > > system without any issues being reported. > > > > > > Gentle ping, any updates on this patch ? Still any concerns ? > > > > I don't think we really got to the bottom of Mark's concerns with > > unreachable ptes on the stack, did we? I also have vague recollections > > of somebody (Robin?) running into issues with the vmap code not honouring > > BBM. > > > > So I think we should confirm/fix the vmap issue before we enable this check > > and also try to get some testing coverage to address Mark's worries. I think > > he has a syzkaller instance set up, so that sound like a good place to > > start. > > I've thrown my Syzkaller instance at this patch; if it doesn't find anything by > Monday I reckon we should pick this up. FWIW, that hasn't hit anything so far. It would be good if we could explicitly nots which mm test suite and/or stress tests are happy with this, but otherwise this looks good to me. Thanks, Mark. > That said, I had some minor nits on the patch; I'm not sure if you'd be happy > to apply the suggested changes when applying or if you'd prefer that Anshuman > applies those locally and sense a v3. > > Thanks, > Mark.
On 1/27/23 20:44, Mark Rutland wrote: > Hi Annshuman, > > On Mon, Jan 09, 2023 at 10:58:16AM +0530, Anshuman Khandual wrote: >> Changing pfn on a user page table mapped entry, without first going through >> break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() >> to intercept such changes, via an updated pgattr_change_is_safe(). This new >> check happens via __check_racy_pte_update(), which has now been renamed as >> __check_safe_pte_update(). >> >> Cc: Catalin Marinas <catalin.marinas@arm.com> >> Cc: Will Deacon <will@kernel.org> >> Cc: Mark Rutland <mark.rutland@arm.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: linux-arm-kernel@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> >> --- >> This applies on v6.2-rc3. This patch had some test time on an internal CI >> system without any issues being reported. > > Can you elaborate on this a little bit? It's not entirely clear what that > internal CI system has tested. It would be helpful if you could indicate: Please find the details here, as learned from internal CI folks, > > * What sort of testing has been done by the CI system? e.g. is this just > booting, running LTP, something else? Tested on both host and guest, with CONFIG_DEBUG_VM enabled - Booting - LTP > > * Has this tried a bunch of configurations and/or machines? Tested on the following platforms - LTP test on JUNO (defconfig) - LTP test on SOFTIRON (debugrun config) - Kselftests arm64 KVM (BASEAEM with defconfig) > > * If any targetted stress tests have been used? e.g. stress-ng's memory system > tests? I did run stress-ng memory system tests. > > I'm assuming that's hitting LTP on a few machines/configs, which'd be > reasonable. It'd just be nice to confirm exactly what has been tested. > > I've added this to my lcoal syzkaller instance's test branch, and I'll shout if > that hits anything over the weekend. > >> Changes in V1: >> >> https://lore.kernel.org/all/20221116031001.292236-1-anshuman.khandual@arm.com/ > > Did you mean to list some cahnges here? Actually there was no change between V1 and V2, other than just rebasing. 
> >> >> arch/arm64/include/asm/pgtable.h | 8 ++++++-- >> arch/arm64/mm/mmu.c | 8 +++++++- >> 2 files changed, 13 insertions(+), 3 deletions(-) >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index b4bbeed80fb6..832c9c8fb58f 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -275,6 +275,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte) >> } >> >> extern void __sync_icache_dcache(pte_t pteval); >> +bool pgattr_change_is_safe(u64 old, u64 new); >> >> /* >> * PTE bits configuration in the presence of hardware Dirty Bit Management >> @@ -292,7 +293,7 @@ extern void __sync_icache_dcache(pte_t pteval); >> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) >> */ >> >> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, >> +static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep, >> pte_t pte) >> { >> pte_t old_pte; >> @@ -318,6 +319,9 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, >> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), >> "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", >> __func__, pte_val(old_pte), pte_val(pte)); >> + VM_WARN_ONCE(!pgattr_change_is_safe(pte_val(old_pte), pte_val(pte)), >> + "%s: unsafe attribute change: 0x%016llx -> 0x%016llx", >> + __func__, pte_val(old_pte), pte_val(pte)); >> } >> >> static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, >> @@ -346,7 +350,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, >> mte_sync_tags(old_pte, pte); >> } >> >> - __check_racy_pte_update(mm, ptep, pte); >> + __check_safe_pte_update(mm, ptep, pte); >> >> set_pte(ptep, pte); >> } >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index 14c87e8d69d8..a1d16b35c4f6 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -133,7 +133,7 @@ static phys_addr_t __init early_pgtable_alloc(int shift) >> return phys; >> } >> >> -static bool pgattr_change_is_safe(u64 old, u64 new) >> +bool pgattr_change_is_safe(u64 old, u64 new) >> { >> /* >> * The following mapping attributes may be updated in live >> @@ -145,6 +145,12 @@ static bool pgattr_change_is_safe(u64 old, u64 new) >> if (old == 0 || new == 0) >> return true; > > These checks above should really use pte_valid(); we were just being lazy when > this was originally written since for the init_*() cases the memory should be > zero initially. > > So could you make that: > > if (!pte_valid(__pte(old)) || !pte_valid(__pte(new))) > return true; > >> + /* If old and new ptes are valid, pfn should not change */ >> + if (pte_valid(__pte(old)) && pte_valid(__pte(new))) { >> + if (pte_pfn(__pte(old)) != pte_pfn(__pte(new))) >> + return false; >> + } > > With the above change, it's clear that both must be valid to get this far, and > this check can be reduced to: > > > /* A live entry's pfn should not change */ > if (pte_pfn(__pte(old)) != pte_pfn(__pte(new))) > return false; > > With those changes: > > Acked-by: Mark Rutland <mark.rutland@arm.com> Sent out the V3 as suggested. https://lore.kernel.org/all/20230130121457.1607675-1-anshuman.khandual@arm.com/
On Fri, Jan 27, 2023 at 12:43:17PM +0000, Robin Murphy wrote: > On 2023-01-26 13:33, Will Deacon wrote: > > On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: > > > On 1/9/23 10:58, Anshuman Khandual wrote: > > > > Changing pfn on a user page table mapped entry, without first going through > > > > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > > > > to intercept such changes, via an updated pgattr_change_is_safe(). This new > > > > check happens via __check_racy_pte_update(), which has now been renamed as > > > > __check_safe_pte_update(). > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > > > Cc: Will Deacon <will@kernel.org> > > > > Cc: Mark Rutland <mark.rutland@arm.com> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: linux-arm-kernel@lists.infradead.org > > > > Cc: linux-kernel@vger.kernel.org > > > > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > > > --- > > > > This applies on v6.2-rc3. This patch had some test time on an internal CI > > > > system without any issues being reported. > > > > > > Gentle ping, any updates on this patch ? Still any concerns ? > > > > I don't think we really got to the bottom of Mark's concerns with > > unreachable ptes on the stack, did we? I also have vague recollections > > of somebody (Robin?) running into issues with the vmap code not honouring > > BBM. > > Doesn't ring a bell, so either it wasn't me, or it was many years ago and > about 5 levels deep into trying to fix something else :/ Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. Catalin -- any clues? Will
On Tue, Jan 31, 2023 at 03:49:51PM +0000, Will Deacon wrote: > On Fri, Jan 27, 2023 at 12:43:17PM +0000, Robin Murphy wrote: > > On 2023-01-26 13:33, Will Deacon wrote: > > > On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: > > > > On 1/9/23 10:58, Anshuman Khandual wrote: > > > > > Changing pfn on a user page table mapped entry, without first going through > > > > > break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() > > > > > to intercept such changes, via an updated pgattr_change_is_safe(). This new > > > > > check happens via __check_racy_pte_update(), which has now been renamed as > > > > > __check_safe_pte_update(). > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > > > > Cc: Will Deacon <will@kernel.org> > > > > > Cc: Mark Rutland <mark.rutland@arm.com> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > Cc: linux-arm-kernel@lists.infradead.org > > > > > Cc: linux-kernel@vger.kernel.org > > > > > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > > > > --- > > > > > This applies on v6.2-rc3. This patch had some test time on an internal CI > > > > > system without any issues being reported. > > > > > > > > Gentle ping, any updates on this patch ? Still any concerns ? > > > > > > I don't think we really got to the bottom of Mark's concerns with > > > unreachable ptes on the stack, did we? I also have vague recollections > > > of somebody (Robin?) running into issues with the vmap code not honouring > > > BBM. > > > > Doesn't ring a bell, so either it wasn't me, or it was many years ago and > > about 5 levels deep into trying to fix something else :/ > > Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. Indeed. The discussion with Anshuman started from this thread: https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ We already trip over the existing checks even without Anshuman's patch, though only by chance. We are not setting the software PTE_DIRTY on the new pte (we don't bother with this bit for kernel mappings). Given that the vmemmap ptes are still live when such change happens and no-one came with a solution to the break-before-make problem, I propose we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 27b2592698b0..5263454a5794 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -100,7 +100,6 @@ config ARM64 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT select ARCH_WANT_FRAME_POINTERS select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WANTS_NO_INSTR select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, Jan 31, 2023 at 03:49:51PM +0000, Will Deacon wrote: >> On Fri, Jan 27, 2023 at 12:43:17PM +0000, Robin Murphy wrote: >>> On 2023-01-26 13:33, Will Deacon wrote: >>>> On Tue, Jan 24, 2023 at 11:11:49AM +0530, Anshuman Khandual wrote: >>>>> On 1/9/23 10:58, Anshuman Khandual wrote: >>>>>> Changing pfn on a user page table mapped entry, without first going through >>>>>> break-before-make (BBM) procedure is unsafe. This just updates set_pte_at() >>>>>> to intercept such changes, via an updated pgattr_change_is_safe(). This new >>>>>> check happens via __check_racy_pte_update(), which has now been renamed as >>>>>> __check_safe_pte_update(). >>>>>> >>>>>> Cc: Catalin Marinas <catalin.marinas@arm.com> >>>>>> Cc: Will Deacon <will@kernel.org> >>>>>> Cc: Mark Rutland <mark.rutland@arm.com> >>>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>>> Cc: linux-arm-kernel@lists.infradead.org >>>>>> Cc: linux-kernel@vger.kernel.org >>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> >>>>>> --- >>>>>> This applies on v6.2-rc3. This patch had some test time on an internal CI >>>>>> system without any issues being reported. >>>>> >>>>> Gentle ping, any updates on this patch ? Still any concerns ? >>>> >>>> I don't think we really got to the bottom of Mark's concerns with >>>> unreachable ptes on the stack, did we? I also have vague recollections >>>> of somebody (Robin?) running into issues with the vmap code not honouring >>>> BBM. >>> >>> Doesn't ring a bell, so either it wasn't me, or it was many years ago and >>> about 5 levels deep into trying to fix something else :/ >> >> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. > > Indeed. The discussion with Anshuman started from this thread: > > https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ > > We already trip over the existing checks even without Anshuman's patch, > though only by chance. We are not setting the software PTE_DIRTY on the > new pte (we don't bother with this bit for kernel mappings). > > Given that the vmemmap ptes are still live when such change happens and > no-one came with a solution to the break-before-make problem, I propose > we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: > cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index 27b2592698b0..5263454a5794 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -100,7 +100,6 @@ config ARM64 > select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > select ARCH_WANT_FRAME_POINTERS > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP Maybe it is a little overkill for HVO as it can significantly minimize the overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). So I don't think disabling it is a good approach. Indeed, HVO broke BBM, but the waring does not affect anything since the tail vmemmap pages are supposed to be read-only. So, I suggest skipping warnings if it is the vmemmap address in set_pte_at(). What do you think of? Muchun, Thanks. > select ARCH_WANT_LD_ORPHAN_WARN > select ARCH_WANTS_NO_INSTR > select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > -- > Catalin
On Thu, Feb 02, 2023 at 05:51:39PM +0800, Muchun Song wrote: > > On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: > >> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. > > > > Indeed. The discussion with Anshuman started from this thread: > > > > https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ > > > > We already trip over the existing checks even without Anshuman's patch, > > though only by chance. We are not setting the software PTE_DIRTY on the > > new pte (we don't bother with this bit for kernel mappings). > > > > Given that the vmemmap ptes are still live when such change happens and > > no-one came with a solution to the break-before-make problem, I propose > > we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: > > cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > index 27b2592698b0..5263454a5794 100644 > > --- a/arch/arm64/Kconfig > > +++ b/arch/arm64/Kconfig > > @@ -100,7 +100,6 @@ config ARM64 > > select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > > select ARCH_WANT_FRAME_POINTERS > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > > Maybe it is a little overkill for HVO as it can significantly minimize the > overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). > So I don't think disabling it is a good approach. Indeed, HVO broke BBM, > but the waring does not affect anything since the tail vmemmap pages are > supposed to be read-only. So, I suggest skipping warnings if it is the > vmemmap address in set_pte_at(). What do you think of? IIUC, vmemmap_remap_pte() not only makes the pte read-only but also changes the output address. Architecturally, this needs a BBM sequence. We can avoid going through an invalid pte if we first make the pte read-only, TLBI but keeping the same pfn, followed by a change of the pfn while keeping the pte readonly. This also assumes that the content of the page pointed at by the pte is the same at both old and new pfn.
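As a rough illustration of the staged sequence Catalin outlines (not the actual vmemmap_remap_pte(), and assuming the old and new pages hold identical contents), the remap could be split so the entry never goes through an invalid state and never changes permissions and pfn in the same step. The helper below is hypothetical.

/*
 * Hypothetical sketch: step 1 drops write permission while keeping the same
 * pfn, step 2 switches the pfn while the entry stays read-only. Assumes the
 * contents at the old and new pfn are identical.
 */
static void vmemmap_remap_ro_then_switch(pte_t *ptep, unsigned long addr,
					 struct page *reuse_page)
{
	pte_t old = READ_ONCE(*ptep);

	/* Step 1: make the live entry read-only, same pfn, then invalidate the TLB. */
	set_pte_at(&init_mm, addr, ptep, pte_wrprotect(old));
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

	/* Step 2: point the (still read-only) entry at the reused page. */
	set_pte_at(&init_mm, addr, ptep, mk_pte(reuse_page, PAGE_KERNEL_RO));
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}

Whether a pfn change on a still-live entry like step 2 is architecturally acceptable without a full BBM is exactly what the rest of the thread goes on to debate.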
> On Feb 2, 2023, at 18:45, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Thu, Feb 02, 2023 at 05:51:39PM +0800, Muchun Song wrote: >>> On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: >>>> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. >>> >>> Indeed. The discussion with Anshuman started from this thread: >>> >>> https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ >>> >>> We already trip over the existing checks even without Anshuman's patch, >>> though only by chance. We are not setting the software PTE_DIRTY on the >>> new pte (we don't bother with this bit for kernel mappings). >>> >>> Given that the vmemmap ptes are still live when such change happens and >>> no-one came with a solution to the break-before-make problem, I propose >>> we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: >>> cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: >>> >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>> index 27b2592698b0..5263454a5794 100644 >>> --- a/arch/arm64/Kconfig >>> +++ b/arch/arm64/Kconfig >>> @@ -100,7 +100,6 @@ config ARM64 >>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT >>> select ARCH_WANT_FRAME_POINTERS >>> select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) >>> - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP >> >> Maybe it is a little overkill for HVO as it can significantly minimize the >> overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). >> So I don't think disabling it is a good approach. Indeed, HVO broke BBM, >> but the waring does not affect anything since the tail vmemmap pages are >> supposed to be read-only. So, I suggest skipping warnings if it is the >> vmemmap address in set_pte_at(). What do you think of? > > IIUC, vmemmap_remap_pte() not only makes the pte read-only but also > changes the output address. Architecturally, this needs a BBM sequence. > We can avoid going through an invalid pte if we first make the pte > read-only, TLBI but keeping the same pfn, followed by a change of the > pfn while keeping the pte readonly. This also assumes that the content > of the page pointed at by the pte is the same at both old and new pfn. Right. I think using BBM is to avoid possibly creating multiple TLB entries for the same address for a extremely short period. But accessing either the old page or the new page is fine in this case. Is it acceptable for this special case without using BBM? Thanks, Muchun.
On Fri, Feb 03, 2023 at 10:40:18AM +0800, Muchun Song wrote: > > > > On Feb 2, 2023, at 18:45, Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > On Thu, Feb 02, 2023 at 05:51:39PM +0800, Muchun Song wrote: > >>> On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: > >>>> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. > >>> > >>> Indeed. The discussion with Anshuman started from this thread: > >>> > >>> https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ > >>> > >>> We already trip over the existing checks even without Anshuman's patch, > >>> though only by chance. We are not setting the software PTE_DIRTY on the > >>> new pte (we don't bother with this bit for kernel mappings). > >>> > >>> Given that the vmemmap ptes are still live when such change happens and > >>> no-one came with a solution to the break-before-make problem, I propose > >>> we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: > >>> cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: > >>> > >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > >>> index 27b2592698b0..5263454a5794 100644 > >>> --- a/arch/arm64/Kconfig > >>> +++ b/arch/arm64/Kconfig > >>> @@ -100,7 +100,6 @@ config ARM64 > >>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > >>> select ARCH_WANT_FRAME_POINTERS > >>> select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > >>> - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > >> > >> Maybe it is a little overkill for HVO as it can significantly minimize the > >> overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). > >> So I don't think disabling it is a good approach. Indeed, HVO broke BBM, > >> but the waring does not affect anything since the tail vmemmap pages are > >> supposed to be read-only. So, I suggest skipping warnings if it is the > >> vmemmap address in set_pte_at(). What do you think of? > > > > IIUC, vmemmap_remap_pte() not only makes the pte read-only but also > > changes the output address. Architecturally, this needs a BBM sequence. > > We can avoid going through an invalid pte if we first make the pte > > read-only, TLBI but keeping the same pfn, followed by a change of the > > pfn while keeping the pte readonly. This also assumes that the content > > of the page pointed at by the pte is the same at both old and new pfn. > > Right. I think using BBM is to avoid possibly creating multiple TLB entries > for the same address for a extremely short period. But accessing either the > old page or the new page is fine in this case. Is it acceptable for this > special case without using BBM? Sadly, the architecture allows the CPU to conjure up a mapping based on a combination of the old and the new descriptor (a process known as "amalgamation") so we _really_ need the BBM sequence. I'm in favour of disabling the optimisation now and bringing it back once we've got this fixed. Will
> On Feb 3, 2023, at 18:10, Will Deacon <will@kernel.org> wrote: > > On Fri, Feb 03, 2023 at 10:40:18AM +0800, Muchun Song wrote: >> >> >>> On Feb 2, 2023, at 18:45, Catalin Marinas <catalin.marinas@arm.com> wrote: >>> >>> On Thu, Feb 02, 2023 at 05:51:39PM +0800, Muchun Song wrote: >>>>> On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: >>>>>> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. >>>>> >>>>> Indeed. The discussion with Anshuman started from this thread: >>>>> >>>>> https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ >>>>> >>>>> We already trip over the existing checks even without Anshuman's patch, >>>>> though only by chance. We are not setting the software PTE_DIRTY on the >>>>> new pte (we don't bother with this bit for kernel mappings). >>>>> >>>>> Given that the vmemmap ptes are still live when such change happens and >>>>> no-one came with a solution to the break-before-make problem, I propose >>>>> we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: >>>>> cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: >>>>> >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>>>> index 27b2592698b0..5263454a5794 100644 >>>>> --- a/arch/arm64/Kconfig >>>>> +++ b/arch/arm64/Kconfig >>>>> @@ -100,7 +100,6 @@ config ARM64 >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT >>>>> select ARCH_WANT_FRAME_POINTERS >>>>> select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) >>>>> - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP >>>> >>>> Maybe it is a little overkill for HVO as it can significantly minimize the >>>> overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). >>>> So I don't think disabling it is a good approach. Indeed, HVO broke BBM, >>>> but the waring does not affect anything since the tail vmemmap pages are >>>> supposed to be read-only. So, I suggest skipping warnings if it is the >>>> vmemmap address in set_pte_at(). What do you think of? >>> >>> IIUC, vmemmap_remap_pte() not only makes the pte read-only but also >>> changes the output address. Architecturally, this needs a BBM sequence. >>> We can avoid going through an invalid pte if we first make the pte >>> read-only, TLBI but keeping the same pfn, followed by a change of the >>> pfn while keeping the pte readonly. This also assumes that the content >>> of the page pointed at by the pte is the same at both old and new pfn. >> >> Right. I think using BBM is to avoid possibly creating multiple TLB entries >> for the same address for a extremely short period. But accessing either the >> old page or the new page is fine in this case. Is it acceptable for this >> special case without using BBM? > > Sadly, the architecture allows the CPU to conjure up a mapping based on a > combination of the old and the new descriptor (a process known as > "amalgamation") so we _really_ need the BBM sequence. I am not familiar with ARM64, what's the user-visible effect if this "amalgamation" occurs? Muchun, Thanks. > > I'm in favour of disabling the optimisation now and bringing it back once > we've got this fixed. > > Will
On Mon, Feb 06, 2023 at 11:28:12AM +0800, Muchun Song wrote: > > > > On Feb 3, 2023, at 18:10, Will Deacon <will@kernel.org> wrote: > > > > On Fri, Feb 03, 2023 at 10:40:18AM +0800, Muchun Song wrote: > >> > >> > >>> On Feb 2, 2023, at 18:45, Catalin Marinas <catalin.marinas@arm.com> wrote: > >>> > >>> On Thu, Feb 02, 2023 at 05:51:39PM +0800, Muchun Song wrote: > >>>>> On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: > >>>>>> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. > >>>>> > >>>>> Indeed. The discussion with Anshuman started from this thread: > >>>>> > >>>>> https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ > >>>>> > >>>>> We already trip over the existing checks even without Anshuman's patch, > >>>>> though only by chance. We are not setting the software PTE_DIRTY on the > >>>>> new pte (we don't bother with this bit for kernel mappings). > >>>>> > >>>>> Given that the vmemmap ptes are still live when such change happens and > >>>>> no-one came with a solution to the break-before-make problem, I propose > >>>>> we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: > >>>>> cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: > >>>>> > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > >>>>> index 27b2592698b0..5263454a5794 100644 > >>>>> --- a/arch/arm64/Kconfig > >>>>> +++ b/arch/arm64/Kconfig > >>>>> @@ -100,7 +100,6 @@ config ARM64 > >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > >>>>> select ARCH_WANT_FRAME_POINTERS > >>>>> select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > >>>>> - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > >>>> > >>>> Maybe it is a little overkill for HVO as it can significantly minimize the > >>>> overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). > >>>> So I don't think disabling it is a good approach. Indeed, HVO broke BBM, > >>>> but the waring does not affect anything since the tail vmemmap pages are > >>>> supposed to be read-only. So, I suggest skipping warnings if it is the > >>>> vmemmap address in set_pte_at(). What do you think of? > >>> > >>> IIUC, vmemmap_remap_pte() not only makes the pte read-only but also > >>> changes the output address. Architecturally, this needs a BBM sequence. > >>> We can avoid going through an invalid pte if we first make the pte > >>> read-only, TLBI but keeping the same pfn, followed by a change of the > >>> pfn while keeping the pte readonly. This also assumes that the content > >>> of the page pointed at by the pte is the same at both old and new pfn. > >> > >> Right. I think using BBM is to avoid possibly creating multiple TLB entries > >> for the same address for a extremely short period. But accessing either the > >> old page or the new page is fine in this case. Is it acceptable for this > >> special case without using BBM? > > > > Sadly, the architecture allows the CPU to conjure up a mapping based on a > > combination of the old and the new descriptor (a process known as > > "amalgamation") so we _really_ need the BBM sequence. > > I am not familiar with ARM64, what's the user-visible effect if this > "amalgamation" occurs? The user-visible effects would probably be data corruption and instability, since the amalgamated TLB entry could result in a bogus physical address and bogus permissions. Will
> On Feb 7, 2023, at 22:31, Will Deacon <will@kernel.org> wrote: > > On Mon, Feb 06, 2023 at 11:28:12AM +0800, Muchun Song wrote: >> >> >>> On Feb 3, 2023, at 18:10, Will Deacon <will@kernel.org> wrote: >>> >>> On Fri, Feb 03, 2023 at 10:40:18AM +0800, Muchun Song wrote: >>>> >>>> >>>>> On Feb 2, 2023, at 18:45, Catalin Marinas <catalin.marinas@arm.com> wrote: >>>>> >>>>> On Thu, Feb 02, 2023 at 05:51:39PM +0800, Muchun Song wrote: >>>>>>> On Feb 1, 2023, at 20:20, Catalin Marinas <catalin.marinas@arm.com> wrote: >>>>>>>> Bah, sorry! Catalin reckons it may have been him talking about the vmemmap. >>>>>>> >>>>>>> Indeed. The discussion with Anshuman started from this thread: >>>>>>> >>>>>>> https://lore.kernel.org/all/20221025014215.3466904-1-mawupeng1@huawei.com/ >>>>>>> >>>>>>> We already trip over the existing checks even without Anshuman's patch, >>>>>>> though only by chance. We are not setting the software PTE_DIRTY on the >>>>>>> new pte (we don't bother with this bit for kernel mappings). >>>>>>> >>>>>>> Given that the vmemmap ptes are still live when such change happens and >>>>>>> no-one came with a solution to the break-before-make problem, I propose >>>>>>> we revert the arm64 part of commit 47010c040dec ("mm: hugetlb_vmemmap: >>>>>>> cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*"). We just need this hunk: >>>>>>> >>>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>>>>>> index 27b2592698b0..5263454a5794 100644 >>>>>>> --- a/arch/arm64/Kconfig >>>>>>> +++ b/arch/arm64/Kconfig >>>>>>> @@ -100,7 +100,6 @@ config ARM64 >>>>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT >>>>>>> select ARCH_WANT_FRAME_POINTERS >>>>>>> select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) >>>>>>> - select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP >>>>>> >>>>>> Maybe it is a little overkill for HVO as it can significantly minimize the >>>>>> overhead of vmemmap on ARM64 servers for some workloads (like qemu, DPDK). >>>>>> So I don't think disabling it is a good approach. Indeed, HVO broke BBM, >>>>>> but the waring does not affect anything since the tail vmemmap pages are >>>>>> supposed to be read-only. So, I suggest skipping warnings if it is the >>>>>> vmemmap address in set_pte_at(). What do you think of? >>>>> >>>>> IIUC, vmemmap_remap_pte() not only makes the pte read-only but also >>>>> changes the output address. Architecturally, this needs a BBM sequence. >>>>> We can avoid going through an invalid pte if we first make the pte >>>>> read-only, TLBI but keeping the same pfn, followed by a change of the >>>>> pfn while keeping the pte readonly. This also assumes that the content >>>>> of the page pointed at by the pte is the same at both old and new pfn. >>>> >>>> Right. I think using BBM is to avoid possibly creating multiple TLB entries >>>> for the same address for a extremely short period. But accessing either the >>>> old page or the new page is fine in this case. Is it acceptable for this >>>> special case without using BBM? >>> >>> Sadly, the architecture allows the CPU to conjure up a mapping based on a >>> combination of the old and the new descriptor (a process known as >>> "amalgamation") so we _really_ need the BBM sequence. >> >> I am not familiar with ARM64, what's the user-visible effect if this >> "amalgamation" occurs? > > The user-visible effects would probably be data corruption and instability, > since the amalgamated TLB entry could result in a bogus physical address and > bogus permissions. 
You mean the output address of amalgamated TLB entry is neither the old address (before updated) nor the new address (after updated)? So it is a bogus physical address? Is there any specifications to describe the rules of how to create a amalgamated TLB entry? Thanks. Muchun > > Will
On Wed, Feb 08, 2023 at 11:13:46AM +0800, Muchun Song wrote: > > On Feb 7, 2023, at 22:31, Will Deacon <will@kernel.org> wrote: > > On Mon, Feb 06, 2023 at 11:28:12AM +0800, Muchun Song wrote: > >> I am not familiar with ARM64, what's the user-visible effect if this > >> "amalgamation" occurs? > > > > The user-visible effects would probably be data corruption and instability, > > since the amalgamated TLB entry could result in a bogus physical address and > > bogus permissions. > > You mean the output address of amalgamated TLB entry is neither the old > address (before updated) nor the new address (after updated)? Yes, that is one possible result. > So it is a bogus physical address? Yes, that is one possible result. > Is there any specifications to describe the rules of how to create a > amalgamated TLB entry? Thanks. Unfortunately, this is not clearly specified in the ARM ARM, and we have to take a pessimistic reading here. We assume that amalgamation is some arbitrary function of the TLB entries which are hit (e.g. they might be OR'd together). This is something that I'd like to have clarified further by Arm's architects. The important thing to note is that amalgamation applies to *TLB entries*, not the translation table entries that they were derived from. Since the TLB format is micro-architecture dependent, and since the manner in which they might be combined is arbitrary, the results of combining could be arbitrary (and consequently, this is difficult to specify). The architecture *does* provide a few restrictions (e.g. Stage-1 entries within a VM can't escape Stage-2, NS entries can't create a secure physical address), but beyond that we cannot make any assumptions. So e.g. if you have 2 read-only entries for addresses A and B, amalgamation could result in read-write-execute for a distinct address C. It's not clear to me whether that could also affect hits for unrelated VAs. So the short answer is that we have to treat this as CONSTRAINED UNPREDICTABLE, and must avoid potential amalgamation by using Break-Before-Make. Thanks, Mark.
> On Feb 9, 2023, at 01:27, Mark Rutland <mark.rutland@arm.com> wrote: > > On Wed, Feb 08, 2023 at 11:13:46AM +0800, Muchun Song wrote: >>> On Feb 7, 2023, at 22:31, Will Deacon <will@kernel.org> wrote: >>> On Mon, Feb 06, 2023 at 11:28:12AM +0800, Muchun Song wrote: >>>> I am not familiar with ARM64, what's the user-visible effect if this >>>> "amalgamation" occurs? >>> >>> The user-visible effects would probably be data corruption and instability, >>> since the amalgamated TLB entry could result in a bogus physical address and >>> bogus permissions. >> >> You mean the output address of amalgamated TLB entry is neither the old >> address (before updated) nor the new address (after updated)? > > Yes, that is one possible result. > >> So it is a bogus physical address? > > Yes, that is one possible result. > >> Is there any specifications to describe the rules of how to create a >> amalgamated TLB entry? Thanks. > > Unfortunately, this is not clearly specified in the ARM ARM, and we have to > take a pessimistic reading here. We assume that amalgamation is some arbitrary > function of the TLB entries which are hit (e.g. they might be OR'd together). > This is something that I'd like to have clarified further by Arm's architects. > > The important thing to note is that amalgamation applies to *TLB entries*, not > the translation table entries that they were derived from. Since the TLB format > is micro-architecture dependent, and since the manner in which they might be > combined is arbitrary, the results of combining could be arbitrary (and > consequently, this is difficult to specify). > > The architecture *does* provide a few restrictions (e.g. Stage-1 entries within > a VM can't escape Stage-2, NS entries can't create a secure physical address), > but beyond that we cannot make any assumptions. > > So e.g. if you have 2 read-only entries for addresses A and B, amalgamation > could result in read-write-execute for a distinct address C. > > It's not clear to me whether that could also affect hits for unrelated VAs. > > So the short answer is that we have to treat this as CONSTRAINED UNPREDICTABLE, > and must avoid potential amalgamation by using Break-Before-Make. Thanks for your clear description. It's really helpful. > > Thanks, > Mark.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b4bbeed80fb6..832c9c8fb58f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -275,6 +275,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 }
 
 extern void __sync_icache_dcache(pte_t pteval);
+bool pgattr_change_is_safe(u64 old, u64 new);
 
 /*
  * PTE bits configuration in the presence of hardware Dirty Bit Management
@@ -292,7 +293,7 @@ extern void __sync_icache_dcache(pte_t pteval);
  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
  */
 
-static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
+static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 					   pte_t pte)
 {
 	pte_t old_pte;
@@ -318,6 +319,9 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
 	VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
 		     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
 		     __func__, pte_val(old_pte), pte_val(pte));
+	VM_WARN_ONCE(!pgattr_change_is_safe(pte_val(old_pte), pte_val(pte)),
+		     "%s: unsafe attribute change: 0x%016llx -> 0x%016llx",
+		     __func__, pte_val(old_pte), pte_val(pte));
 }
 
 static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
@@ -346,7 +350,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 		mte_sync_tags(old_pte, pte);
 	}
 
-	__check_racy_pte_update(mm, ptep, pte);
+	__check_safe_pte_update(mm, ptep, pte);
 
 	set_pte(ptep, pte);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 14c87e8d69d8..a1d16b35c4f6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -133,7 +133,7 @@ static phys_addr_t __init early_pgtable_alloc(int shift)
 	return phys;
 }
 
-static bool pgattr_change_is_safe(u64 old, u64 new)
+bool pgattr_change_is_safe(u64 old, u64 new)
 {
 	/*
 	 * The following mapping attributes may be updated in live
@@ -145,6 +145,12 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
 	if (old == 0 || new == 0)
 		return true;
 
+	/* If old and new ptes are valid, pfn should not change */
+	if (pte_valid(__pte(old)) && pte_valid(__pte(new))) {
+		if (pte_pfn(__pte(old)) != pte_pfn(__pte(new)))
+			return false;
+	}
+
 	/* live contiguous mappings may not be manipulated at all */
 	if ((old | new) & PTE_CONT)
 		return false;
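As a purely hypothetical illustration of the kind of update the new check catches: with CONFIG_DEBUG_VM enabled, rewriting a live pte so that it points at a different pfn without clearing it first would now trip the added VM_WARN_ONCE. The function, names and context below are made up for illustration and are not from the patch.

/*
 * Hypothetical example of an update the new check flags: rewriting a live
 * pte to a different pfn without break-before-make.
 */
static void unsafe_pfn_rewrite(struct mm_struct *mm, unsigned long addr,
			       pte_t *ptep, unsigned long new_pfn)
{
	pte_t old_pte = READ_ONCE(*ptep);	/* assume pte_valid(old_pte) */

	/*
	 * Same attributes, different pfn, old entry still live: with
	 * CONFIG_DEBUG_VM this now triggers the "unsafe attribute change"
	 * warning in __check_safe_pte_update().
	 */
	set_pte_at(mm, addr, ptep, pfn_pte(new_pfn, pte_pgprot(old_pte)));
}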