[-next,1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte

Message ID 20221025014215.3466904-1-mawupeng1@huawei.com
State New
Headers
Series [-next,1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte |

Commit Message

mawupeng Oct. 25, 2022, 1:42 a.m. UTC
  From: Ma Wupeng <mawupeng1@huawei.com>

Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
read-only to catch illegal write operation to the tail page.

However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
since this may lead to dirty state cleaned. This check is introduced by
commit 2f4b829c625e ("arm64: Add support for hardware updates of the
access and dirty pte bits") and the initial check is as follow:

  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));

Since we do need to mark this pte as read-only to catch illegal write
operation to the tail pages, use set_pte  to replace set_pte_at to bypass
this check.

The following shell command can be used to reproduce this WARN_ON in
6.1.0-rc1:

  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
  cat /proc/sys/vm/hugetlb_optimize_vmemmap

  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
  mkdir -p /root/hugetlb
  mount none /root/hugetlb -t hugetlbfs
  fallocate -l 2g /root/hugetlb/xx &

Here is the detail WARN_ON log:

------------[ cut here ]------------
__check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
Modules linked in:
CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
 vmemmap_remap_pte+0x118/0x120
 vmemmap_remap_range+0x30c/0x560
 hugetlb_vmemmap_optimize+0x158/0x408
 __prep_new_huge_page+0x24/0x150
 prep_new_huge_page+0x30/0x60
 alloc_fresh_huge_page+0x1c4/0x1e0
 alloc_surplus_huge_page+0xa0/0x168
 alloc_huge_page+0x264/0x5b8
 hugetlbfs_fallocate+0x294/0x680
 vfs_fallocate+0x12c/0x568
 ksys_fallocate+0x50/0xa0
 __arm64_sys_fallocate+0x28/0x38
 invoke_syscall+0x4c/0x110
 el0_svc_common.constprop.0+0x68/0x128
 do_el0_svc+0x34/0xd0
 el0_svc+0x48/0xb8
 el0t_64_sync_handler+0xb8/0xc0
 el0t_64_sync+0x18c/0x190
---[ end trace 0000000000000000 ]---

Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 mm/hugetlb_vmemmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Muchun Song Oct. 25, 2022, 6:36 a.m. UTC | #1
> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> 
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> read-only to catch illegal write operation to the tail page.
> 
> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()

Thanks for your finding this issue.

> since this may lead to dirty state cleaned. This check is introduced by
> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> access and dirty pte bits") and the initial check is as follow:
> 
>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> 
> Since we do need to mark this pte as read-only to catch illegal write
> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> this check.

In theory, the waring does not affect anything since the tail vmemmap
pages are supposed to be read-only. So, skipping this check for vmemmap
pages seem feasible. But I am not sure whether it is general to use
set_pte here instead of set_pte_at, I didn’t see any users of set_pte
from the common code routines except the code from arch/xxx. And this
issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
itself.

Something like (Just some thoughts from mine):

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b5df82aa99e6..df7716965a93 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
  */

-static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
+static inline void __check_racy_pte_update(struct mm_struct *mm,
+                                          unsigned long addr, pte_t *ptep,
                                           pte_t pte)
 {
        pte_t old_pte;
@@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
        if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
                return;

+       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
+           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
+               return;
+
        /*
         * Check for potential race with hardware updates of the pte
         * (ptep_set_access_flags safely changes valid ptes without going

> 
> The following shell command can be used to reproduce this WARN_ON in
> 6.1.0-rc1:
> 
>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
> 
>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>  mkdir -p /root/hugetlb
>  mount none /root/hugetlb -t hugetlbfs
>  fallocate -l 2g /root/hugetlb/xx &
> 
> Here is the detail WARN_ON log:
> 
> ------------[ cut here ]------------
> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
> Modules linked in:
> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> Call trace:
> vmemmap_remap_pte+0x118/0x120
> vmemmap_remap_range+0x30c/0x560
> hugetlb_vmemmap_optimize+0x158/0x408
> __prep_new_huge_page+0x24/0x150
> prep_new_huge_page+0x30/0x60
> alloc_fresh_huge_page+0x1c4/0x1e0
> alloc_surplus_huge_page+0xa0/0x168
> alloc_huge_page+0x264/0x5b8
> hugetlbfs_fallocate+0x294/0x680
> vfs_fallocate+0x12c/0x568
> ksys_fallocate+0x50/0xa0
> __arm64_sys_fallocate+0x28/0x38
> invoke_syscall+0x4c/0x110
> el0_svc_common.constprop.0+0x68/0x128
> do_el0_svc+0x34/0xd0
> el0_svc+0x48/0xb8
> el0t_64_sync_handler+0xb8/0xc0
> el0t_64_sync+0x18c/0x190
> ---[ end trace 0000000000000000 ]---
> 
> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

Actually, this commit does not pose the issue for arm64. I think the correct commit
which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.

Thanks.

> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
> mm/hugetlb_vmemmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..cb056265c31e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> struct page *page = pte_page(*pte);
> 
> list_add_tail(&page->lru, walk->vmemmap_pages);
> - set_pte_at(&init_mm, addr, pte, entry);
> + set_pte(pte, entry);
> }
> 
> /*
> -- 
> 2.25.1
> 
>
  
mawupeng Oct. 26, 2022, 3:01 a.m. UTC | #2
On 2022/10/25 14:36, Muchun Song wrote:
> 
> 
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>> read-only to catch illegal write operation to the tail page.
>>
>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> 
> Thanks for your finding this issue.
> 
>> since this may lead to dirty state cleaned. This check is introduced by
>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>> access and dirty pte bits") and the initial check is as follow:
>>
>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>
>> Since we do need to mark this pte as read-only to catch illegal write
>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>> this check.
> 
> In theory, the waring does not affect anything since the tail vmemmap
> pages are supposed to be read-only. So, skipping this check for vmemmap
> pages seem feasible. But I am not sure whether it is general to use
> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
> from the common code routines except the code from arch/xxx. And this
> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
> itself.
> 
> Something like (Just some thoughts from mine):
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b5df82aa99e6..df7716965a93 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>   */
> 
> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
> +static inline void __check_racy_pte_update(struct mm_struct *mm,
> +                                          unsigned long addr, pte_t *ptep,
>                                            pte_t pte)
>  {
>         pte_t old_pte;
> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>                 return;
> 
> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
> +               return;
> +
>         /*
>          * Check for potential race with hardware updates of the pte
>          * (ptep_set_access_flags safely changes valid ptes without going
> 
>>
>> The following shell command can be used to reproduce this WARN_ON in
>> 6.1.0-rc1:
>>
>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>
>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>  mkdir -p /root/hugetlb
>>  mount none /root/hugetlb -t hugetlbfs
>>  fallocate -l 2g /root/hugetlb/xx &
>>
>> Here is the detail WARN_ON log:
>>
>> ------------[ cut here ]------------
>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>> Modules linked in:
>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>> Call trace:
>> vmemmap_remap_pte+0x118/0x120
>> vmemmap_remap_range+0x30c/0x560
>> hugetlb_vmemmap_optimize+0x158/0x408
>> __prep_new_huge_page+0x24/0x150
>> prep_new_huge_page+0x30/0x60
>> alloc_fresh_huge_page+0x1c4/0x1e0
>> alloc_surplus_huge_page+0xa0/0x168
>> alloc_huge_page+0x264/0x5b8
>> hugetlbfs_fallocate+0x294/0x680
>> vfs_fallocate+0x12c/0x568
>> ksys_fallocate+0x50/0xa0
>> __arm64_sys_fallocate+0x28/0x38
>> invoke_syscall+0x4c/0x110
>> el0_svc_common.constprop.0+0x68/0x128
>> do_el0_svc+0x34/0xd0
>> el0_svc+0x48/0xb8
>> el0t_64_sync_handler+0xb8/0xc0
>> el0t_64_sync+0x18c/0x190
>> ---[ end trace 0000000000000000 ]---
>>
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> 
> Actually, this commit does not pose the issue for arm64. I think the correct commit
> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> 
> Thanks.

Thanks for pointing it out.

I have tested the commit, it sure is patch 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
This VM_WARN_ONCE will be produced if HUGETLB_PAGE_FREE_VMEMMAP for arm64 is enabled.

I will change my change log and fix this in next patch.

> 
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>> mm/hugetlb_vmemmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ba2a2596fb4e..cb056265c31e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>> struct page *page = pte_page(*pte);
>>
>> list_add_tail(&page->lru, walk->vmemmap_pages);
>> - set_pte_at(&init_mm, addr, pte, entry);
>> + set_pte(pte, entry);
>> }
>>
>> /*
>> -- 
>> 2.25.1
>>
>>
>
  
Anshuman Khandual Oct. 26, 2022, 5:06 a.m. UTC | #3
On 10/25/22 12:06, Muchun Song wrote:
> 
> 
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>> read-only to catch illegal write operation to the tail page.
>>
>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> 
> Thanks for your finding this issue.
> 
>> since this may lead to dirty state cleaned. This check is introduced by
>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>> access and dirty pte bits") and the initial check is as follow:
>>
>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>
>> Since we do need to mark this pte as read-only to catch illegal write
>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>> this check.
> 
> In theory, the waring does not affect anything since the tail vmemmap
> pages are supposed to be read-only. So, skipping this check for vmemmap

Tails vmemmap pages are supposed to be read-only, in practice but their
backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
warning would not have triggered.

        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
                     __func__, pte_val(old_pte), pte_val(pte));

Also, is not it true that the pte being remapped into a different page
as read only, than what it had originally (which will be freed up) i.e 
the PFN in 'old_pte' and 'pte' will be different. Hence is there still
a possibility for a race condition even when the PFN changes ?

> pages seem feasible. But I am not sure whether it is general to use
> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
> from the common code routines except the code from arch/xxx. And this
> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
> itself.

Right, should not change it to yet lower level platform helper set_pte()
just to work around this warning. Instead, __check_racy_pte_update() is
the right place where it should be fixed.

> 
> Something like (Just some thoughts from mine):
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b5df82aa99e6..df7716965a93 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>   */
> 
> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
> +static inline void __check_racy_pte_update(struct mm_struct *mm,
> +                                          unsigned long addr, pte_t *ptep,
>                                            pte_t pte)
>  {
>         pte_t old_pte;
> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>                 return;
> 
> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
> +               return;
> +
>         /*
>          * Check for potential race with hardware updates of the pte
>          * (ptep_set_access_flags safely changes valid ptes without going
> 
>>
>> The following shell command can be used to reproduce this WARN_ON in
>> 6.1.0-rc1:
>>
>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>
>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>  mkdir -p /root/hugetlb
>>  mount none /root/hugetlb -t hugetlbfs
>>  fallocate -l 2g /root/hugetlb/xx &
>>
>> Here is the detail WARN_ON log:
>>
>> ------------[ cut here ]------------
>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>> Modules linked in:
>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>> Call trace:
>> vmemmap_remap_pte+0x118/0x120
>> vmemmap_remap_range+0x30c/0x560
>> hugetlb_vmemmap_optimize+0x158/0x408
>> __prep_new_huge_page+0x24/0x150
>> prep_new_huge_page+0x30/0x60
>> alloc_fresh_huge_page+0x1c4/0x1e0
>> alloc_surplus_huge_page+0xa0/0x168
>> alloc_huge_page+0x264/0x5b8
>> hugetlbfs_fallocate+0x294/0x680
>> vfs_fallocate+0x12c/0x568
>> ksys_fallocate+0x50/0xa0
>> __arm64_sys_fallocate+0x28/0x38
>> invoke_syscall+0x4c/0x110
>> el0_svc_common.constprop.0+0x68/0x128
>> do_el0_svc+0x34/0xd0
>> el0_svc+0x48/0xb8
>> el0t_64_sync_handler+0xb8/0xc0
>> el0t_64_sync+0x18c/0x190
>> ---[ end trace 0000000000000000 ]---
>>
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> 
> Actually, this commit does not pose the issue for arm64. I think the correct commit
> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> 
> Thanks.
> 
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>> mm/hugetlb_vmemmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ba2a2596fb4e..cb056265c31e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>> struct page *page = pte_page(*pte);
>>
>> list_add_tail(&page->lru, walk->vmemmap_pages);
>> - set_pte_at(&init_mm, addr, pte, entry);
>> + set_pte(pte, entry);
>> }
>>
>> /*
>> -- 
>> 2.25.1
>>
>>
> 
>
  
Muchun Song Oct. 26, 2022, 7:01 a.m. UTC | #4
> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
> 
> 
> On 10/25/22 12:06, Muchun Song wrote:
>> 
>> 
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>> 
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>> 
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operation to the tail page.
>>> 
>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>> 
>> Thanks for your finding this issue.
>> 
>>> since this may lead to dirty state cleaned. This check is introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits") and the initial check is as follow:
>>> 
>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>> 
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>> this check.
>> 
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
> 
> Tails vmemmap pages are supposed to be read-only, in practice but their
> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> warning would not have triggered.

Right.

> 
>        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>                     __func__, pte_val(old_pte), pte_val(pte));
> 
> Also, is not it true that the pte being remapped into a different page
> as read only, than what it had originally (which will be freed up) i.e 
> the PFN in 'old_pte' and 'pte' will be different. Hence is there still

Right.

> a possibility for a race condition even when the PFN changes ?

Sorry, I didn't get this question. Did you mean the PTE is changed from
new (pte) to the old one (old_pte) by the hardware because of the update
of dirty bit when a concurrent write operation to the tail vmemmap page?

Muchun,
Thanks.

> 
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
> 
> Right, should not change it to yet lower level platform helper set_pte()
> just to work around this warning. Instead, __check_racy_pte_update() is
> the right place where it should be fixed.
> 
>> 
>> Something like (Just some thoughts from mine):
>> 
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>  */
>> 
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                          unsigned long addr, pte_t *ptep,
>>                                           pte_t pte)
>> {
>>        pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>        if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                return;
>> 
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>        /*
>>         * Check for potential race with hardware updates of the pte
>>         * (ptep_set_access_flags safely changes valid ptes without going
>> 
>>> 
>>> The following shell command can be used to reproduce this WARN_ON in
>>> 6.1.0-rc1:
>>> 
>>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>> 
>>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>> mkdir -p /root/hugetlb
>>> mount none /root/hugetlb -t hugetlbfs
>>> fallocate -l 2g /root/hugetlb/xx &
>>> 
>>> Here is the detail WARN_ON log:
>>> 
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>> vmemmap_remap_pte+0x118/0x120
>>> vmemmap_remap_range+0x30c/0x560
>>> hugetlb_vmemmap_optimize+0x158/0x408
>>> __prep_new_huge_page+0x24/0x150
>>> prep_new_huge_page+0x30/0x60
>>> alloc_fresh_huge_page+0x1c4/0x1e0
>>> alloc_surplus_huge_page+0xa0/0x168
>>> alloc_huge_page+0x264/0x5b8
>>> hugetlbfs_fallocate+0x294/0x680
>>> vfs_fallocate+0x12c/0x568
>>> ksys_fallocate+0x50/0xa0
>>> __arm64_sys_fallocate+0x28/0x38
>>> invoke_syscall+0x4c/0x110
>>> el0_svc_common.constprop.0+0x68/0x128
>>> do_el0_svc+0x34/0xd0
>>> el0_svc+0x48/0xb8
>>> el0t_64_sync_handler+0xb8/0xc0
>>> el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>> 
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>> 
>> Actually, this commit does not pose the issue for arm64. I think the correct commit
>> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>> 
>> Thanks.
>> 
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> struct page *page = pte_page(*pte);
>>> 
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> - set_pte_at(&init_mm, addr, pte, entry);
>>> + set_pte(pte, entry);
>>> }
>>> 
>>> /*
>>> -- 
>>> 2.25.1
  
Anshuman Khandual Oct. 26, 2022, 8:36 a.m. UTC | #5
On 10/26/22 12:31, Muchun Song wrote:
> 
> 
>> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 10/25/22 12:06, Muchun Song wrote:
>>>
>>>
>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>
>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>
>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>> read-only to catch illegal write operation to the tail page.
>>>>
>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>
>>> Thanks for your finding this issue.
>>>
>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>> access and dirty pte bits") and the initial check is as follow:
>>>>
>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>
>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>> this check.
>>>
>>> In theory, the waring does not affect anything since the tail vmemmap
>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>
>> Tails vmemmap pages are supposed to be read-only, in practice but their
>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>> warning would not have triggered.
> 
> Right.
> 
>>
>>        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>                     __func__, pte_val(old_pte), pte_val(pte));
>>
>> Also, is not it true that the pte being remapped into a different page
>> as read only, than what it had originally (which will be freed up) i.e 
>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> 
> Right.
> 
>> a possibility for a race condition even when the PFN changes ?
> 
> Sorry, I didn't get this question. Did you mean the PTE is changed from
> new (pte) to the old one (old_pte) by the hardware because of the update
> of dirty bit when a concurrent write operation to the tail vmemmap page?

No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
tails pages ? Is not there a PFN change, along with access permission change
involved in this remapping process ?
  
Muchun Song Oct. 26, 2022, 8:53 a.m. UTC | #6
> On Oct 26, 2022, at 16:36, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
> 
> 
> On 10/26/22 12:31, Muchun Song wrote:
>> 
>> 
>>> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>> 
>>> 
>>> 
>>> On 10/25/22 12:06, Muchun Song wrote:
>>>> 
>>>> 
>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>> 
>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>> 
>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>>> read-only to catch illegal write operation to the tail page.
>>>>> 
>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>> 
>>>> Thanks for your finding this issue.
>>>> 
>>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>> access and dirty pte bits") and the initial check is as follow:
>>>>> 
>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>> 
>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>>> this check.
>>>> 
>>>> In theory, the waring does not affect anything since the tail vmemmap
>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>> 
>>> Tails vmemmap pages are supposed to be read-only, in practice but their
>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>> warning would not have triggered.
>> 
>> Right.
>> 
>>> 
>>>       VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>                    "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>                    __func__, pte_val(old_pte), pte_val(pte));
>>> 
>>> Also, is not it true that the pte being remapped into a different page
>>> as read only, than what it had originally (which will be freed up) i.e 
>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
>> 
>> Right.
>> 
>>> a possibility for a race condition even when the PFN changes ?
>> 
>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>> new (pte) to the old one (old_pte) by the hardware because of the update
>> of dirty bit when a concurrent write operation to the tail vmemmap page?
> 
> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> tails pages ? Is not there a PFN change, along with access permission change
> involved in this remapping process ?

Alright, yes, both the PFN and the access permission are changed.
  
mawupeng Oct. 27, 2022, 1:42 a.m. UTC | #7
On 2022/10/26 13:06, Anshuman Khandual wrote:
> 
> 
> On 10/25/22 12:06, Muchun Song wrote:
>>
>>
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operation to the tail page.
>>>
>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>
>> Thanks for your finding this issue.
>>
>>> since this may lead to dirty state cleaned. This check is introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits") and the initial check is as follow:
>>>
>>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>> this check.
>>
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
> 
> Tails vmemmap pages are supposed to be read-only, in practice but their
> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> warning would not have triggered.
> 
>         VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>                      "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>                      __func__, pte_val(old_pte), pte_val(pte));

So, arm64 will trigger this warn_on in this condition and this condition is not
unusual such as this scenario?

It is true that we should change the login in arm64. but how to change it to
make the code more common?

Any thoughts?

Thanks for reviewing.

> 
> Also, is not it true that the pte being remapped into a different page
> as read only, than what it had originally (which will be freed up) i.e 
> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> a possibility for a race condition even when the PFN changes ?
> 
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
> 
> Right, should not change it to yet lower level platform helper set_pte()
> just to work around this warning. Instead, __check_racy_pte_update() is
> the right place where it should be fixed.
> 
>>
>> Something like (Just some thoughts from mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                          unsigned long addr, pte_t *ptep,
>>                                            pte_t pte)
>>  {
>>         pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                 return;
>>
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>         /*
>>          * Check for potential race with hardware updates of the pte
>>          * (ptep_set_access_flags safely changes valid ptes without going
>>
>>>
>>> The following shell command can be used to reproduce this WARN_ON in
>>> 6.1.0-rc1:
>>>
>>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>>
>>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>>  mkdir -p /root/hugetlb
>>>  mount none /root/hugetlb -t hugetlbfs
>>>  fallocate -l 2g /root/hugetlb/xx &
>>>
>>> Here is the detail WARN_ON log:
>>>
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>> vmemmap_remap_pte+0x118/0x120
>>> vmemmap_remap_range+0x30c/0x560
>>> hugetlb_vmemmap_optimize+0x158/0x408
>>> __prep_new_huge_page+0x24/0x150
>>> prep_new_huge_page+0x30/0x60
>>> alloc_fresh_huge_page+0x1c4/0x1e0
>>> alloc_surplus_huge_page+0xa0/0x168
>>> alloc_huge_page+0x264/0x5b8
>>> hugetlbfs_fallocate+0x294/0x680
>>> vfs_fallocate+0x12c/0x568
>>> ksys_fallocate+0x50/0xa0
>>> __arm64_sys_fallocate+0x28/0x38
>>> invoke_syscall+0x4c/0x110
>>> el0_svc_common.constprop.0+0x68/0x128
>>> do_el0_svc+0x34/0xd0
>>> el0_svc+0x48/0xb8
>>> el0t_64_sync_handler+0xb8/0xc0
>>> el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>>
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit does not pose the issue for arm64. I think the correct commit
>> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> struct page *page = pte_page(*pte);
>>>
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> - set_pte_at(&init_mm, addr, pte, entry);
>>> + set_pte(pte, entry);
>>> }
>>>
>>> /*
>>> -- 
>>> 2.25.1
>>>
>>>
>>
>>
>
  
Catalin Marinas Oct. 27, 2022, 10:50 a.m. UTC | #8
On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
> On 10/26/22 12:31, Muchun Song wrote:
> >> On 10/25/22 12:06, Muchun Song wrote:
> >>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> >>>> From: Ma Wupeng <mawupeng1@huawei.com>
> >>>>
> >>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> >>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> >>>> read-only to catch illegal write operation to the tail page.
> >>>>
> >>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> >>>
> >>> Thanks for your finding this issue.
> >>>
> >>>> since this may lead to dirty state cleaned. This check is introduced by
> >>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> >>>> access and dirty pte bits") and the initial check is as follow:
> >>>>
> >>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> >>>>
> >>>> Since we do need to mark this pte as read-only to catch illegal write
> >>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> >>>> this check.
> >>>
> >>> In theory, the waring does not affect anything since the tail vmemmap
> >>> pages are supposed to be read-only. So, skipping this check for vmemmap
> >>
> >> Tails vmemmap pages are supposed to be read-only, in practice but their
> >> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> >> warning would not have triggered.
> > 
> > Right.
> > 
> >>
> >>        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
> >>                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
> >>                     __func__, pte_val(old_pte), pte_val(pte));
> >>
> >> Also, is not it true that the pte being remapped into a different page
> >> as read only, than what it had originally (which will be freed up) i.e 
> >> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> > 
> > Right.
> > 
> >> a possibility for a race condition even when the PFN changes ?
> > 
> > Sorry, I didn't get this question. Did you mean the PTE is changed from
> > new (pte) to the old one (old_pte) by the hardware because of the update
> > of dirty bit when a concurrent write operation to the tail vmemmap page?
> 
> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> tails pages ? Is not there a PFN change, along with access permission change
> involved in this remapping process ?

For the record, as we discussed offline, changing the output address
(pfn) of a pte is not safe without break-before-make if at least one of
the mappings was writeable. The caller (vmemmap_remap_pte()) would need
to be fixed to first invalidate the pte and then write the new pte. I
assume no other CPU accesses this part of the vmemmap while the pte is
being remapped.
  
Muchun Song Oct. 28, 2022, 2:45 a.m. UTC | #9
> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
>> On 10/26/22 12:31, Muchun Song wrote:
>>>> On 10/25/22 12:06, Muchun Song wrote:
>>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>> 
>>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>>>> read-only to catch illegal write operation to the tail page.
>>>>>> 
>>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>>> 
>>>>> Thanks for your finding this issue.
>>>>> 
>>>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>>> access and dirty pte bits") and the initial check is as follow:
>>>>>> 
>>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>>> 
>>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>>>> this check.
>>>>> 
>>>>> In theory, the waring does not affect anything since the tail vmemmap
>>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>>> 
>>>> Tails vmemmap pages are supposed to be read-only, in practice but their
>>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>>> warning would not have triggered.
>>> 
>>> Right.
>>> 
>>>> 
>>>>       VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>>                    "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>>                    __func__, pte_val(old_pte), pte_val(pte));
>>>> 
>>>> Also, is not it true that the pte being remapped into a different page
>>>> as read only, than what it had originally (which will be freed up) i.e 
>>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
>>> 
>>> Right.
>>> 
>>>> a possibility for a race condition even when the PFN changes ?
>>> 
>>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>>> new (pte) to the old one (old_pte) by the hardware because of the update
>>> of dirty bit when a concurrent write operation to the tail vmemmap page?
>> 
>> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
>> tails pages ? Is not there a PFN change, along with access permission change
>> involved in this remapping process ?
> 
> For the record, as we discussed offline, changing the output address
> (pfn) of a pte is not safe without break-before-make if at least one of
> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
> to be fixed to first invalidate the pte and then write the new pte. I

Hi Catalin,

Could you expose more details about what issue it will be caused? I am
not familiar with arm64.

> assume no other CPU accesses this part of the vmemmap while the pte is
> being remapped.

However, there is no guarantee that no other CPU accesses this pte.
E.g. memory failure or memory compaction, both can obtain head page
from any tail struct pages (only read) anytime.

Thanks.

> 
> -- 
> Catalin
  
Catalin Marinas Oct. 28, 2022, 3:53 p.m. UTC | #10
On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
> >> On 10/26/22 12:31, Muchun Song wrote:
> >>>> On 10/25/22 12:06, Muchun Song wrote:
> >>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> >>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
> >>>>>> 
> >>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> >>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> >>>>>> read-only to catch illegal write operation to the tail page.
> >>>>>> 
> >>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> >>>>> 
> >>>>> Thanks for your finding this issue.
> >>>>> 
> >>>>>> since this may lead to dirty state cleaned. This check is introduced by
> >>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> >>>>>> access and dirty pte bits") and the initial check is as follow:
> >>>>>> 
> >>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> >>>>>> 
> >>>>>> Since we do need to mark this pte as read-only to catch illegal write
> >>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> >>>>>> this check.
> >>>>> 
> >>>>> In theory, the waring does not affect anything since the tail vmemmap
> >>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
> >>>> 
> >>>> Tails vmemmap pages are supposed to be read-only, in practice but their
> >>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> >>>> warning would not have triggered.
> >>> 
> >>> Right.
> >>> 
> >>>> 
> >>>>       VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
> >>>>                    "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
> >>>>                    __func__, pte_val(old_pte), pte_val(pte));
> >>>> 
> >>>> Also, is not it true that the pte being remapped into a different page
> >>>> as read only, than what it had originally (which will be freed up) i.e 
> >>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> >>> 
> >>> Right.
> >>> 
> >>>> a possibility for a race condition even when the PFN changes ?
> >>> 
> >>> Sorry, I didn't get this question. Did you mean the PTE is changed from
> >>> new (pte) to the old one (old_pte) by the hardware because of the update
> >>> of dirty bit when a concurrent write operation to the tail vmemmap page?
> >> 
> >> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> >> tails pages ? Is not there a PFN change, along with access permission change
> >> involved in this remapping process ?
> > 
> > For the record, as we discussed offline, changing the output address
> > (pfn) of a pte is not safe without break-before-make if at least one of
> > the mappings was writeable. The caller (vmemmap_remap_pte()) would need
> > to be fixed to first invalidate the pte and then write the new pte. I
> 
> Could you expose more details about what issue it will be caused? I am
> not familiar with arm64.

Well, it's not allowed by the architecture, so some CPU implementations
may do weird things like accessing incorrect memory or triggering TLB
conflict aborts if, for some reason, they end up with two entries in
the TLB for the same VA but pointing to different pfns. The hardware
expects an invalid PTE and TLB invalidation between such changes. In
practice most likely nothing happens and this works fine but we need to
stick to the architecture requirements in case some CPUs take advantage
of this requirement.

> > assume no other CPU accesses this part of the vmemmap while the pte is
> > being remapped.
> 
> However, there is no guarantee that no other CPU accesses this pte.
> E.g. memory failure or memory compaction, both can obtain head page
> from any tail struct pages (only read) anytime.

Oh, so we cannot safely go through a break-before-make sequence here
(zero the PTE, flush the TLB, write the new PTE) as some CPU may access
this pte.
  
mawupeng Oct. 29, 2022, 1:55 a.m. UTC | #11
On 2022/10/26 11:01, mawupeng wrote:
> 
> 
> On 2022/10/25 14:36, Muchun Song wrote:
>>
>>
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operation to the tail page.
>>>
>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>
>> Thanks for your finding this issue.
>>
>>> since this may lead to dirty state cleaned. This check is introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits") and the initial check is as follow:
>>>
>>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>> this check.
>>
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
>>
>> Something like (Just some thoughts from mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                          unsigned long addr, pte_t *ptep,
>>                                            pte_t pte)
>>  {
>>         pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                 return;
>>
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>         /*
>>          * Check for potential race with hardware updates of the pte
>>          * (ptep_set_access_flags safely changes valid ptes without going
>>

IMHO, arm64 or other archs do some work on the dirty bit and rdonly bit in
pte in commit 2f4b829c625e ("arm64: Add support for hardware updates of the
access and dirty pte bits").

Maybe we can use pte_wrprotect() to mark this pte read-only? It will add 
PTE_DIRTY bit for the new pte entry compare to the old one.

Here is the diff:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba2a2596fb4e..24a230895316 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
         * Remap the tail pages as read-only to catch illegal write operation
         * to the tail pages.
         */
-       pgprot_t pgprot = PAGE_KERNEL_RO;
-       pte_t entry = mk_pte(walk->reuse_page, pgprot);
+       pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));
        struct page *page = pte_page(*pte);
 
        list_add_tail(&page->lru, walk->vmemmap_pages);

>>>
>>> The following shell command can be used to reproduce this WARN_ON in
>>> 6.1.0-rc1:
>>>
>>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>>
>>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>>  mkdir -p /root/hugetlb
>>>  mount none /root/hugetlb -t hugetlbfs
>>>  fallocate -l 2g /root/hugetlb/xx &
>>>
>>> Here is the detail WARN_ON log:
>>>
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>> vmemmap_remap_pte+0x118/0x120
>>> vmemmap_remap_range+0x30c/0x560
>>> hugetlb_vmemmap_optimize+0x158/0x408
>>> __prep_new_huge_page+0x24/0x150
>>> prep_new_huge_page+0x30/0x60
>>> alloc_fresh_huge_page+0x1c4/0x1e0
>>> alloc_surplus_huge_page+0xa0/0x168
>>> alloc_huge_page+0x264/0x5b8
>>> hugetlbfs_fallocate+0x294/0x680
>>> vfs_fallocate+0x12c/0x568
>>> ksys_fallocate+0x50/0xa0
>>> __arm64_sys_fallocate+0x28/0x38
>>> invoke_syscall+0x4c/0x110
>>> el0_svc_common.constprop.0+0x68/0x128
>>> do_el0_svc+0x34/0xd0
>>> el0_svc+0x48/0xb8
>>> el0t_64_sync_handler+0xb8/0xc0
>>> el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>>
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit does not pose the issue for arm64. I think the correct commit
>> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
> 
> Thanks for pointing it out.
> 
> I have tested the commit, it sure is patch 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> This VM_WARN_ONCE will be produced if HUGETLB_PAGE_FREE_VMEMMAP for arm64 is enabled.
> 
> I will change my change log and fix this in next patch.
> 
>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> struct page *page = pte_page(*pte);
>>>
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> - set_pte_at(&init_mm, addr, pte, entry);
>>> + set_pte(pte, entry);
>>> }
>>>
>>> /*
>>> -- 
>>> 2.25.1
>>>
>>>
>>
  
Muchun Song Nov. 1, 2022, 9:29 a.m. UTC | #12
> On Oct 28, 2022, at 23:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
>> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
>>>> On 10/26/22 12:31, Muchun Song wrote:
>>>>>> On 10/25/22 12:06, Muchun Song wrote:
>>>>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>>>> 
>>>>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>>>>>> read-only to catch illegal write operation to the tail page.
>>>>>>>> 
>>>>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>>>>> 
>>>>>>> Thanks for your finding this issue.
>>>>>>> 
>>>>>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>>>>> access and dirty pte bits") and the initial check is as follow:
>>>>>>>> 
>>>>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>>>>> 
>>>>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>>>>>> this check.
>>>>>>> 
>>>>>>> In theory, the waring does not affect anything since the tail vmemmap
>>>>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>>>>> 
>>>>>> Tails vmemmap pages are supposed to be read-only, in practice but their
>>>>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>>>>> warning would not have triggered.
>>>>> 
>>>>> Right.
>>>>> 
>>>>>> 
>>>>>>      VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>>>>                   "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>>>>                   __func__, pte_val(old_pte), pte_val(pte));
>>>>>> 
>>>>>> Also, is not it true that the pte being remapped into a different page
>>>>>> as read only, than what it had originally (which will be freed up) i.e 
>>>>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
>>>>> 
>>>>> Right.
>>>>> 
>>>>>> a possibility for a race condition even when the PFN changes ?
>>>>> 
>>>>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>>>>> new (pte) to the old one (old_pte) by the hardware because of the update
>>>>> of dirty bit when a concurrent write operation to the tail vmemmap page?
>>>> 
>>>> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
>>>> tails pages ? Is not there a PFN change, along with access permission change
>>>> involved in this remapping process ?
>>> 
>>> For the record, as we discussed offline, changing the output address
>>> (pfn) of a pte is not safe without break-before-make if at least one of
>>> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
>>> to be fixed to first invalidate the pte and then write the new pte. I
>> 
>> Could you expose more details about what issue it will be caused? I am
>> not familiar with arm64.
> 
> Well, it's not allowed by the architecture, so some CPU implementations
> may do weird things like accessing incorrect memory or triggering TLB
> conflict aborts if, for some reason, they end up with two entries in
> the TLB for the same VA but pointing to different pfns. The hardware
> expects an invalid PTE and TLB invalidation between such changes. In
> practice most likely nothing happens and this works fine but we need to
> stick to the architecture requirements in case some CPUs take advantage
> of this requirement.

Got it. Thanks for your nice explanation.

> 
>>> assume no other CPU accesses this part of the vmemmap while the pte is
>>> being remapped.
>> 
>> However, there is no guarantee that no other CPU accesses this pte.
>> E.g. memory failure or memory compaction, both can obtain head page
>> from any tail struct pages (only read) anytime.
> 
> Oh, so we cannot safely go through a break-before-make sequence here
> (zero the PTE, flush the TLB, write the new PTE) as some CPU may access
> this pte.

Right.

Muchun

> 
> -- 
> Catalin
  
Catalin Marinas Nov. 1, 2022, 9:56 a.m. UTC | #13
On Sat, Oct 29, 2022 at 09:55:15AM +0800, mawupeng wrote:
> On 2022/10/26 11:01, mawupeng wrote:
> > On 2022/10/25 14:36, Muchun Song wrote:
> >>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> >>>
> >>> From: Ma Wupeng <mawupeng1@huawei.com>
> >>>
> >>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> >>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> >>> read-only to catch illegal write operation to the tail page.
> >>>
> >>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> >>
> >> Thanks for your finding this issue.
> >>
> >>> since this may lead to dirty state cleaned. This check is introduced by
> >>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> >>> access and dirty pte bits") and the initial check is as follow:
> >>>
> >>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> >>>
> >>> Since we do need to mark this pte as read-only to catch illegal write
> >>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> >>> this check.
[...]
> IMHO, arm64 or other archs do some work on the dirty bit and rdonly bit in
> pte in commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> access and dirty pte bits").
> 
> Maybe we can use pte_wrprotect() to mark this pte read-only? It will add 
> PTE_DIRTY bit for the new pte entry compare to the old one.
> 
> Here is the diff:
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..24a230895316 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>          * Remap the tail pages as read-only to catch illegal write operation
>          * to the tail pages.
>          */
> -       pgprot_t pgprot = PAGE_KERNEL_RO;
> -       pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +       pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));

This may silence the warning but we plan to add another to detect a
change in the pfn without going through a break-before-make sequence.
  

Patch

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba2a2596fb4e..cb056265c31e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -249,7 +249,7 @@  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	struct page *page = pte_page(*pte);
 
 	list_add_tail(&page->lru, walk->vmemmap_pages);
-	set_pte_at(&init_mm, addr, pte, entry);
+	set_pte(pte, entry);
 }
 
 /*