Message ID: 20221025014215.3466904-1-mawupeng1@huawei.com
State: New
Headers:
  From: Wupeng Ma <mawupeng1@huawei.com>
  To: akpm@linux-foundation.org, mike.kravetz@oracle.com, songmuchun@bytedance.com
  Cc: mhocko@suse.com, osalvador@suse.de, catalin.marinas@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, mawupeng1@huawei.com
  Subject: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  Date: Tue, 25 Oct 2022 09:42:15 +0800
  Message-ID: <20221025014215.3466904-1-mawupeng1@huawei.com>
Series: [-next,1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
Commit Message
mawupeng
Oct. 25, 2022, 1:42 a.m. UTC
From: Ma Wupeng <mawupeng1@huawei.com>

Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
each HugeTLB page") added vmemmap_remap_pte() to remap the tail vmemmap
pages read-only, so that illegal writes to the tail pages are caught.

However, on arm64 this triggers a WARN_ON in __check_racy_pte_update(),
because the remap can clear the dirty state of the PTE. The check was
introduced by commit 2f4b829c625e ("arm64: Add support for hardware
updates of the access and dirty pte bits"); its initial form was:

	BUG_ON(pte_write(*ptep) && !pte_dirty(pte));

Since we really do need to mark these PTEs read-only to catch illegal
writes to the tail pages, use set_pte() instead of set_pte_at() to bypass
the check.

The following shell commands reproduce the WARN_ON on 6.1.0-rc1:

	echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
	cat /proc/sys/vm/hugetlb_optimize_vmemmap

	echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
	mkdir -p /root/hugetlb
	mount none /root/hugetlb -t hugetlbfs
	fallocate -l 2g /root/hugetlb/xx &

Here is the full WARN_ON log:

	------------[ cut here ]------------
	__check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
	WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
	Modules linked in:
	CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
	Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
	Call trace:
	 vmemmap_remap_pte+0x118/0x120
	 vmemmap_remap_range+0x30c/0x560
	 hugetlb_vmemmap_optimize+0x158/0x408
	 __prep_new_huge_page+0x24/0x150
	 prep_new_huge_page+0x30/0x60
	 alloc_fresh_huge_page+0x1c4/0x1e0
	 alloc_surplus_huge_page+0xa0/0x168
	 alloc_huge_page+0x264/0x5b8
	 hugetlbfs_fallocate+0x294/0x680
	 vfs_fallocate+0x12c/0x568
	 ksys_fallocate+0x50/0xa0
	 __arm64_sys_fallocate+0x28/0x38
	 invoke_syscall+0x4c/0x110
	 el0_svc_common.constprop.0+0x68/0x128
	 do_el0_svc+0x34/0xd0
	 el0_svc+0x48/0xb8
	 el0t_64_sync_handler+0xb8/0xc0
	 el0t_64_sync+0x18c/0x190
	---[ end trace 0000000000000000 ]---

Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 mm/hugetlb_vmemmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comments
> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>
> From: Ma Wupeng <mawupeng1@huawei.com>
>
> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
> read-only to catch illegal write operations to the tail page.
>
> However this will lead to a WARN_ON on arm64 in __check_racy_pte_update()

Thanks for finding this issue.

> since this may lead to the dirty state being cleared. This check was
> introduced by commit 2f4b829c625e ("arm64: Add support for hardware
> updates of the access and dirty pte bits") and the initial check was:
>
> 	BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>
> Since we do need to mark this pte as read-only to catch illegal write
> operations to the tail pages, use set_pte instead of set_pte_at to bypass
> this check.

In theory, the warning does not affect anything, since the tail vmemmap
pages are supposed to be read-only, so skipping this check for vmemmap
pages seems feasible. But I am not sure it is appropriate to use set_pte
here instead of set_pte_at; I don't see any users of set_pte in common
code outside of arch/. And since this issue is specific to arm64, I
suggest fixing it in __check_racy_pte_update() itself.

Something like (just some thoughts of mine):

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b5df82aa99e6..df7716965a93 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
  * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
  */
 
-static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
+static inline void __check_racy_pte_update(struct mm_struct *mm,
+					   unsigned long addr, pte_t *ptep,
 					   pte_t pte)
 {
 	pte_t old_pte;
@@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
 	if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
 		return;
 
+	if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
+	    addr >= VMEMMAP_START && addr <= VMEMMAP_END)
+		return;
+
 	/*
 	 * Check for potential race with hardware updates of the pte
 	 * (ptep_set_access_flags safely changes valid ptes without going

> [... repro commands and WARN_ON log trimmed ...]
>
> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

Actually, this commit does not pose the issue for arm64. I think the
correct commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.

Thanks.

> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
> mm/hugetlb_vmemmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..cb056265c31e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> 	struct page *page = pte_page(*pte);
>
> 	list_add_tail(&page->lru, walk->vmemmap_pages);
> -	set_pte_at(&init_mm, addr, pte, entry);
> +	set_pte(pte, entry);
> }
>
> --
> 2.25.1
On 2022/10/25 14:36, Muchun Song wrote:
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>> [...]
>
> In theory, the warning does not affect anything since the tail vmemmap
> pages are supposed to be read-only. So, skipping this check for vmemmap
> pages seems feasible. But I am not sure whether it is general to use
> set_pte here instead of set_pte_at; I didn't see any users of set_pte
> from the common code routines except the code under arch/. And this
> issue is specific to arm64, so I suggest fixing it in
> __check_racy_pte_update() itself.
> [...]
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>
> Actually, this commit does not pose the issue for arm64. I think the
> correct commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.

Thanks for pointing it out. I have tested that commit, and it is indeed
1e63ac088f20f7a4425c430c31ecd3cf167fb3f2 that introduces the problem:
the VM_WARN_ONCE fires once HUGETLB_PAGE_FREE_VMEMMAP is enabled for
arm64. I will update my changelog and fix this in the next version of
the patch.
On 10/25/22 12:06, Muchun Song wrote:
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>> [...]
>> Since we do need to mark this pte as read-only to catch illegal write
>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>> this check.
>
> In theory, the warning does not affect anything since the tail vmemmap
> pages are supposed to be read-only. So, skipping this check for vmemmap
> pages seems feasible.

Tail vmemmap pages are supposed to be read-only in practice, but their
backing PTEs do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
would not have triggered:

	VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
		     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
		     __func__, pte_val(old_pte), pte_val(pte));

Also, is it not true that the pte is being remapped, as read-only, to a
different page than the one it mapped originally (which will be freed),
i.e. the PFN in 'old_pte' and 'pte' will be different? Hence, is there
still a possibility of a race condition even when the PFN changes?

> But I am not sure whether it is general to use set_pte here instead of
> set_pte_at [...] And this issue is specific to arm64, so I suggest
> fixing it in __check_racy_pte_update() itself.

Right, we should not drop to the lower-level platform helper set_pte()
just to work around this warning. Instead, __check_racy_pte_update() is
the right place to fix it.

> [... proposed diff and quoted commit message trimmed ...]
> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> [...]
> Tail vmemmap pages are supposed to be read-only in practice, but their
> backing PTEs do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> would not have triggered.

Right.

> Also, is it not true that the pte is being remapped, as read-only, to a
> different page than the one it mapped originally (which will be freed),
> i.e. the PFN in 'old_pte' and 'pte' will be different?

Right.

> Hence, is there still a possibility of a race condition even when the
> PFN changes?

Sorry, I didn't get this question. Did you mean that the PTE could be
changed from the new value (pte) back to the old one (old_pte) by the
hardware updating the dirty bit on a concurrent write to the tail
vmemmap page?

Thanks.
Muchun
On 10/26/22 12:31, Muchun Song wrote:
> [...]
> Sorry, I didn't get this question. Did you mean that the PTE could be
> changed from the new value (pte) back to the old one (old_pte) by the
> hardware updating the dirty bit on a concurrent write to the tail
> vmemmap page?

No, but does vmemmap_remap_pte() not reuse walk->reuse_page for all the
remaining tail pages? Is there not a PFN change, along with an access
permission change, involved in this remapping process?
> On Oct 26, 2022, at 16:36, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> [...]
> No, but does vmemmap_remap_pte() not reuse walk->reuse_page for all the
> remaining tail pages? Is there not a PFN change, along with an access
> permission change, involved in this remapping process?

Alright, yes, both the PFN and the access permission are changed.
On 2022/10/26 13:06, Anshuman Khandual wrote: > > > On 10/25/22 12:06, Muchun Song wrote: >> >> >>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: >>> >>> From: Ma Wupeng <mawupeng1@huawei.com> >>> >>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with >>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as >>> read-only to catch illegal write operation to the tail page. >>> >>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() >> >> Thanks for your finding this issue. >> >>> since this may lead to dirty state cleaned. This check is introduced by >>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the >>> access and dirty pte bits") and the initial check is as follow: >>> >>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); >>> >>> Since we do need to mark this pte as read-only to catch illegal write >>> operation to the tail pages, use set_pte to replace set_pte_at to bypass >>> this check. >> >> In theory, the waring does not affect anything since the tail vmemmap >> pages are supposed to be read-only. So, skipping this check for vmemmap > > Tails vmemmap pages are supposed to be read-only, in practice but their > backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE() > warning would not have triggered. > > VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), > "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", > __func__, pte_val(old_pte), pte_val(pte)); So, arm64 will trigger this WARN_ON under this condition, and the condition is not unusual, as this scenario shows. It is true that we should change the logic in arm64, but how should we change it to keep the code generic? Any thoughts? Thanks for reviewing. > > Also, is not it true that the pte being remapped into a different page > as read only, than what it had originally (which will be freed up) i.e > the PFN in 'old_pte' and 'pte' will be different.
Hence is there still > a possibility for a race condition even when the PFN changes ? > >> pages seem feasible. But I am not sure whether it is general to use >> set_pte here instead of set_pte_at, I didn’t see any users of set_pte >> from the common code routines except the code from arch/xxx. And this >> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update() >> itself. > > Right, should not change it to yet lower level platform helper set_pte() > just to work around this warning. Instead, __check_racy_pte_update() is > the right place where it should be fixed. > >> >> Something like (Just some thoughts from mine): >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index b5df82aa99e6..df7716965a93 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval); >> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) >> */ >> >> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, >> +static inline void __check_racy_pte_update(struct mm_struct *mm, >> + unsigned long addr, pte_t *ptep, >> pte_t pte) >> { >> pte_t old_pte; >> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, >> if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1) >> return; >> >> + if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && >> + addr >= VMEMMAP_START && addr <= VMEMMAP_END) >> + return; >> + >> /* >> * Check for potential race with hardware updates of the pte >> * (ptep_set_access_flags safely changes valid ptes without going >> >>> >>> The following shell command can be used to reproduce this WARN_ON in >>> 6.1.0-rc1: >>> >>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap >>> cat /proc/sys/vm/hugetlb_optimize_vmemmap >>> >>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages >>> mkdir -p /root/hugetlb >>> mount none /root/hugetlb -t hugetlbfs >>> 
fallocate -l 2g /root/hugetlb/xx & >>> >>> Here is the detail WARN_ON log: >>> >>> ------------[ cut here ]------------ >>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83 >>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120 >>> Modules linked in: >>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224 >>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >>> Call trace: >>> vmemmap_remap_pte+0x118/0x120 >>> vmemmap_remap_range+0x30c/0x560 >>> hugetlb_vmemmap_optimize+0x158/0x408 >>> __prep_new_huge_page+0x24/0x150 >>> prep_new_huge_page+0x30/0x60 >>> alloc_fresh_huge_page+0x1c4/0x1e0 >>> alloc_surplus_huge_page+0xa0/0x168 >>> alloc_huge_page+0x264/0x5b8 >>> hugetlbfs_fallocate+0x294/0x680 >>> vfs_fallocate+0x12c/0x568 >>> ksys_fallocate+0x50/0xa0 >>> __arm64_sys_fallocate+0x28/0x38 >>> invoke_syscall+0x4c/0x110 >>> el0_svc_common.constprop.0+0x68/0x128 >>> do_el0_svc+0x34/0xd0 >>> el0_svc+0x48/0xb8 >>> el0t_64_sync_handler+0xb8/0xc0 >>> el0t_64_sync+0x18c/0x190 >>> ---[ end trace 0000000000000000 ]--- >>> >>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page") >> >> Actually, this commit does not pose the issue for arm64. I think the correct commit >> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2. >> >> Thanks. 
>> >>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> >>> --- >>> mm/hugetlb_vmemmap.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c >>> index ba2a2596fb4e..cb056265c31e 100644 >>> --- a/mm/hugetlb_vmemmap.c >>> +++ b/mm/hugetlb_vmemmap.c >>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, >>> struct page *page = pte_page(*pte); >>> >>> list_add_tail(&page->lru, walk->vmemmap_pages); >>> - set_pte_at(&init_mm, addr, pte, entry); >>> + set_pte(pte, entry); >>> } >>> >>> /* >>> -- >>> 2.25.1 >>> >>> >> >> >
On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote: > On 10/26/22 12:31, Muchun Song wrote: > >> On 10/25/22 12:06, Muchun Song wrote: > >>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: > >>>> From: Ma Wupeng <mawupeng1@huawei.com> > >>>> > >>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with > >>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as > >>>> read-only to catch illegal write operation to the tail page. > >>>> > >>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() > >>> > >>> Thanks for your finding this issue. > >>> > >>>> since this may lead to dirty state cleaned. This check is introduced by > >>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the > >>>> access and dirty pte bits") and the initial check is as follow: > >>>> > >>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); > >>>> > >>>> Since we do need to mark this pte as read-only to catch illegal write > >>>> operation to the tail pages, use set_pte to replace set_pte_at to bypass > >>>> this check. > >>> > >>> In theory, the waring does not affect anything since the tail vmemmap > >>> pages are supposed to be read-only. So, skipping this check for vmemmap > >> > >> Tails vmemmap pages are supposed to be read-only, in practice but their > >> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE() > >> warning would not have triggered. > > > > Right. > > > >> > >> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), > >> "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", > >> __func__, pte_val(old_pte), pte_val(pte)); > >> > >> Also, is not it true that the pte being remapped into a different page > >> as read only, than what it had originally (which will be freed up) i.e > >> the PFN in 'old_pte' and 'pte' will be different. Hence is there still > > > > Right. > > > >> a possibility for a race condition even when the PFN changes ? 
> > > > Sorry, I didn't get this question. Did you mean the PTE is changed from > > new (pte) to the old one (old_pte) by the hardware because of the update > > of dirty bit when a concurrent write operation to the tail vmemmap page? > > No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining > tails pages ? Is not there a PFN change, along with access permission change > involved in this remapping process ? For the record, as we discussed offline, changing the output address (pfn) of a pte is not safe without break-before-make if at least one of the mappings was writeable. The caller (vmemmap_remap_pte()) would need to be fixed to first invalidate the pte and then write the new pte. I assume no other CPU accesses this part of the vmemmap while the pte is being remapped.
> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote: >> On 10/26/22 12:31, Muchun Song wrote: >>>> On 10/25/22 12:06, Muchun Song wrote: >>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: >>>>>> From: Ma Wupeng <mawupeng1@huawei.com> >>>>>> >>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with >>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as >>>>>> read-only to catch illegal write operation to the tail page. >>>>>> >>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() >>>>> >>>>> Thanks for your finding this issue. >>>>> >>>>>> since this may lead to dirty state cleaned. This check is introduced by >>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the >>>>>> access and dirty pte bits") and the initial check is as follow: >>>>>> >>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); >>>>>> >>>>>> Since we do need to mark this pte as read-only to catch illegal write >>>>>> operation to the tail pages, use set_pte to replace set_pte_at to bypass >>>>>> this check. >>>>> >>>>> In theory, the waring does not affect anything since the tail vmemmap >>>>> pages are supposed to be read-only. So, skipping this check for vmemmap >>>> >>>> Tails vmemmap pages are supposed to be read-only, in practice but their >>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE() >>>> warning would not have triggered. >>> >>> Right. >>> >>>> >>>> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), >>>> "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", >>>> __func__, pte_val(old_pte), pte_val(pte)); >>>> >>>> Also, is not it true that the pte being remapped into a different page >>>> as read only, than what it had originally (which will be freed up) i.e >>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still >>> >>> Right. 
>>> >>>> a possibility for a race condition even when the PFN changes ? >>> >>> Sorry, I didn't get this question. Did you mean the PTE is changed from >>> new (pte) to the old one (old_pte) by the hardware because of the update >>> of dirty bit when a concurrent write operation to the tail vmemmap page? >> >> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining >> tails pages ? Is not there a PFN change, along with access permission change >> involved in this remapping process ? > > For the record, as we discussed offline, changing the output address > (pfn) of a pte is not safe without break-before-make if at least one of > the mappings was writeable. The caller (vmemmap_remap_pte()) would need > to be fixed to first invalidate the pte and then write the new pte. I Hi Catalin, Could you share more details about what issue it would cause? I am not familiar with arm64. > assume no other CPU accesses this part of the vmemmap while the pte is > being remapped. However, there is no guarantee that no other CPU accesses this pte. E.g. memory failure or memory compaction can both obtain the head page from any tail struct page (read-only access) at any time. Thanks. > > -- > Catalin
On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote: > On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote: > >> On 10/26/22 12:31, Muchun Song wrote: > >>>> On 10/25/22 12:06, Muchun Song wrote: > >>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: > >>>>>> From: Ma Wupeng <mawupeng1@huawei.com> > >>>>>> > >>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with > >>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as > >>>>>> read-only to catch illegal write operation to the tail page. > >>>>>> > >>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() > >>>>> > >>>>> Thanks for your finding this issue. > >>>>> > >>>>>> since this may lead to dirty state cleaned. This check is introduced by > >>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the > >>>>>> access and dirty pte bits") and the initial check is as follow: > >>>>>> > >>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); > >>>>>> > >>>>>> Since we do need to mark this pte as read-only to catch illegal write > >>>>>> operation to the tail pages, use set_pte to replace set_pte_at to bypass > >>>>>> this check. > >>>>> > >>>>> In theory, the waring does not affect anything since the tail vmemmap > >>>>> pages are supposed to be read-only. So, skipping this check for vmemmap > >>>> > >>>> Tails vmemmap pages are supposed to be read-only, in practice but their > >>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE() > >>>> warning would not have triggered. > >>> > >>> Right. 
> >>> > >>>> > >>>> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), > >>>> "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", > >>>> __func__, pte_val(old_pte), pte_val(pte)); > >>>> > >>>> Also, is not it true that the pte being remapped into a different page > >>>> as read only, than what it had originally (which will be freed up) i.e > >>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still > >>> > >>> Right. > >>> > >>>> a possibility for a race condition even when the PFN changes ? > >>> > >>> Sorry, I didn't get this question. Did you mean the PTE is changed from > >>> new (pte) to the old one (old_pte) by the hardware because of the update > >>> of dirty bit when a concurrent write operation to the tail vmemmap page? > >> > >> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining > >> tails pages ? Is not there a PFN change, along with access permission change > >> involved in this remapping process ? > > > > For the record, as we discussed offline, changing the output address > > (pfn) of a pte is not safe without break-before-make if at least one of > > the mappings was writeable. The caller (vmemmap_remap_pte()) would need > > to be fixed to first invalidate the pte and then write the new pte. I > > Could you expose more details about what issue it will be caused? I am > not familiar with arm64. Well, it's not allowed by the architecture, so some CPU implementations may do weird things like accessing incorrect memory or triggering TLB conflict aborts if, for some reason, they end up with two entries in the TLB for the same VA but pointing to different pfns. The hardware expects an invalid PTE and TLB invalidation between such changes. In practice most likely nothing happens and this works fine but we need to stick to the architecture requirements in case some CPUs take advantage of this requirement. > > assume no other CPU accesses this part of the vmemmap while the pte is > > being remapped. 
> > However, there is no guarantee that no other CPU accesses this pte. > E.g. memory failure or memory compaction, both can obtain head page > from any tail struct pages (only read) anytime. Oh, so we cannot safely go through a break-before-make sequence here (zero the PTE, flush the TLB, write the new PTE) as some CPU may access this pte.
On 2022/10/26 11:01, mawupeng wrote: > > > On 2022/10/25 14:36, Muchun Song wrote: >> >> >>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: >>> >>> From: Ma Wupeng <mawupeng1@huawei.com> >>> >>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with >>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as >>> read-only to catch illegal write operation to the tail page. >>> >>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() >> >> Thanks for your finding this issue. >> >>> since this may lead to dirty state cleaned. This check is introduced by >>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the >>> access and dirty pte bits") and the initial check is as follow: >>> >>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); >>> >>> Since we do need to mark this pte as read-only to catch illegal write >>> operation to the tail pages, use set_pte to replace set_pte_at to bypass >>> this check. >> >> In theory, the waring does not affect anything since the tail vmemmap >> pages are supposed to be read-only. So, skipping this check for vmemmap >> pages seem feasible. But I am not sure whether it is general to use >> set_pte here instead of set_pte_at, I didn’t see any users of set_pte >> from the common code routines except the code from arch/xxx. And this >> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update() >> itself. 
>> >> Something like (Just some thoughts from mine): >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index b5df82aa99e6..df7716965a93 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval); >> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) >> */ >> >> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, >> +static inline void __check_racy_pte_update(struct mm_struct *mm, >> + unsigned long addr, pte_t *ptep, >> pte_t pte) >> { >> pte_t old_pte; >> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep, >> if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1) >> return; >> >> + if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && >> + addr >= VMEMMAP_START && addr <= VMEMMAP_END) >> + return; >> + >> /* >> * Check for potential race with hardware updates of the pte >> * (ptep_set_access_flags safely changes valid ptes without going >> IMHO, arm64 or other archs do some work on the dirty bit and rdonly bit in pte in commit 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits"). Maybe we can use pte_wrprotect() to mark this pte read-only? It will add PTE_DIRTY bit for the new pte entry compare to the old one. Here is the diff: diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index ba2a2596fb4e..24a230895316 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, * Remap the tail pages as read-only to catch illegal write operation * to the tail pages. 
*/ - pgprot_t pgprot = PAGE_KERNEL_RO; - pte_t entry = mk_pte(walk->reuse_page, pgprot); + pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL)); struct page *page = pte_page(*pte); list_add_tail(&page->lru, walk->vmemmap_pages); >>> >>> The following shell command can be used to reproduce this WARN_ON in >>> 6.1.0-rc1: >>> >>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap >>> cat /proc/sys/vm/hugetlb_optimize_vmemmap >>> >>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages >>> mkdir -p /root/hugetlb >>> mount none /root/hugetlb -t hugetlbfs >>> fallocate -l 2g /root/hugetlb/xx & >>> >>> Here is the detail WARN_ON log: >>> >>> ------------[ cut here ]------------ >>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83 >>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120 >>> Modules linked in: >>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224 >>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >>> Call trace: >>> vmemmap_remap_pte+0x118/0x120 >>> vmemmap_remap_range+0x30c/0x560 >>> hugetlb_vmemmap_optimize+0x158/0x408 >>> __prep_new_huge_page+0x24/0x150 >>> prep_new_huge_page+0x30/0x60 >>> alloc_fresh_huge_page+0x1c4/0x1e0 >>> alloc_surplus_huge_page+0xa0/0x168 >>> alloc_huge_page+0x264/0x5b8 >>> hugetlbfs_fallocate+0x294/0x680 >>> vfs_fallocate+0x12c/0x568 >>> ksys_fallocate+0x50/0xa0 >>> __arm64_sys_fallocate+0x28/0x38 >>> invoke_syscall+0x4c/0x110 >>> el0_svc_common.constprop.0+0x68/0x128 >>> do_el0_svc+0x34/0xd0 >>> el0_svc+0x48/0xb8 >>> el0t_64_sync_handler+0xb8/0xc0 >>> el0t_64_sync+0x18c/0x190 >>> ---[ end trace 0000000000000000 ]--- >>> >>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page") >> >> Actually, this commit does not pose the issue for arm64. I think the correct commit >> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2. >> >> Thanks. 
> > Thanks for pointing it out. > > I have tested the commit, it sure is patch 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2. > This VM_WARN_ONCE will be produced if HUGETLB_PAGE_FREE_VMEMMAP for arm64 is enabled. > > I will change my change log and fix this in next patch. > >> >>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> >>> --- >>> mm/hugetlb_vmemmap.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c >>> index ba2a2596fb4e..cb056265c31e 100644 >>> --- a/mm/hugetlb_vmemmap.c >>> +++ b/mm/hugetlb_vmemmap.c >>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, >>> struct page *page = pte_page(*pte); >>> >>> list_add_tail(&page->lru, walk->vmemmap_pages); >>> - set_pte_at(&init_mm, addr, pte, entry); >>> + set_pte(pte, entry); >>> } >>> >>> /* >>> -- >>> 2.25.1 >>> >>> >>
> On Oct 28, 2022, at 23:53, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote: >> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote: >>> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote: >>>> On 10/26/22 12:31, Muchun Song wrote: >>>>>> On 10/25/22 12:06, Muchun Song wrote: >>>>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: >>>>>>>> From: Ma Wupeng <mawupeng1@huawei.com> >>>>>>>> >>>>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with >>>>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as >>>>>>>> read-only to catch illegal write operation to the tail page. >>>>>>>> >>>>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() >>>>>>> >>>>>>> Thanks for your finding this issue. >>>>>>> >>>>>>>> since this may lead to dirty state cleaned. This check is introduced by >>>>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the >>>>>>>> access and dirty pte bits") and the initial check is as follow: >>>>>>>> >>>>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); >>>>>>>> >>>>>>>> Since we do need to mark this pte as read-only to catch illegal write >>>>>>>> operation to the tail pages, use set_pte to replace set_pte_at to bypass >>>>>>>> this check. >>>>>>> >>>>>>> In theory, the waring does not affect anything since the tail vmemmap >>>>>>> pages are supposed to be read-only. So, skipping this check for vmemmap >>>>>> >>>>>> Tails vmemmap pages are supposed to be read-only, in practice but their >>>>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE() >>>>>> warning would not have triggered. >>>>> >>>>> Right. 
>>>>> >>>>>> >>>>>> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte), >>>>>> "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx", >>>>>> __func__, pte_val(old_pte), pte_val(pte)); >>>>>> >>>>>> Also, is not it true that the pte being remapped into a different page >>>>>> as read only, than what it had originally (which will be freed up) i.e >>>>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still >>>>> >>>>> Right. >>>>> >>>>>> a possibility for a race condition even when the PFN changes ? >>>>> >>>>> Sorry, I didn't get this question. Did you mean the PTE is changed from >>>>> new (pte) to the old one (old_pte) by the hardware because of the update >>>>> of dirty bit when a concurrent write operation to the tail vmemmap page? >>>> >>>> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining >>>> tails pages ? Is not there a PFN change, along with access permission change >>>> involved in this remapping process ? >>> >>> For the record, as we discussed offline, changing the output address >>> (pfn) of a pte is not safe without break-before-make if at least one of >>> the mappings was writeable. The caller (vmemmap_remap_pte()) would need >>> to be fixed to first invalidate the pte and then write the new pte. I >> >> Could you expose more details about what issue it will be caused? I am >> not familiar with arm64. > > Well, it's not allowed by the architecture, so some CPU implementations > may do weird things like accessing incorrect memory or triggering TLB > conflict aborts if, for some reason, they end up with two entries in > the TLB for the same VA but pointing to different pfns. The hardware > expects an invalid PTE and TLB invalidation between such changes. In > practice most likely nothing happens and this works fine but we need to > stick to the architecture requirements in case some CPUs take advantage > of this requirement. Got it. Thanks for your nice explanation. 
> >>> assume no other CPU accesses this part of the vmemmap while the pte is >>> being remapped. >> >> However, there is no guarantee that no other CPU accesses this pte. >> E.g. memory failure or memory compaction, both can obtain head page >> from any tail struct pages (only read) anytime. > > Oh, so we cannot safely go through a break-before-make sequence here > (zero the PTE, flush the TLB, write the new PTE) as some CPU may access > this pte. Right. Muchun > > -- > Catalin
On Sat, Oct 29, 2022 at 09:55:15AM +0800, mawupeng wrote: > On 2022/10/26 11:01, mawupeng wrote: > > On 2022/10/25 14:36, Muchun Song wrote: > >>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote: > >>> > >>> From: Ma Wupeng <mawupeng1@huawei.com> > >>> > >>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with > >>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as > >>> read-only to catch illegal write operation to the tail page. > >>> > >>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update() > >> > >> Thanks for your finding this issue. > >> > >>> since this may lead to dirty state cleaned. This check is introduced by > >>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the > >>> access and dirty pte bits") and the initial check is as follow: > >>> > >>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte)); > >>> > >>> Since we do need to mark this pte as read-only to catch illegal write > >>> operation to the tail pages, use set_pte to replace set_pte_at to bypass > >>> this check. [...] > IMHO, arm64 or other archs do some work on the dirty bit and rdonly bit in > pte in commit 2f4b829c625e ("arm64: Add support for hardware updates of the > access and dirty pte bits"). > > Maybe we can use pte_wrprotect() to mark this pte read-only? It will add > PTE_DIRTY bit for the new pte entry compare to the old one. > > Here is the diff: > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > index ba2a2596fb4e..24a230895316 100644 > --- a/mm/hugetlb_vmemmap.c > +++ b/mm/hugetlb_vmemmap.c > @@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > * Remap the tail pages as read-only to catch illegal write operation > * to the tail pages. 
> */ > - pgprot_t pgprot = PAGE_KERNEL_RO; > - pte_t entry = mk_pte(walk->reuse_page, pgprot); > + pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL)); This may silence the warning but we plan to add another to detect a change in the pfn without going through a break-before-make sequence.
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index ba2a2596fb4e..cb056265c31e 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, struct page *page = pte_page(*pte); list_add_tail(&page->lru, walk->vmemmap_pages); - set_pte_at(&init_mm, addr, pte, entry); + set_pte(pte, entry); } /*