Message ID | 20231127084645.27017-2-songmuchun@bytedance.com |
---|---|
State | New |
Headers | From: Muchun Song <songmuchun@bytedance.com>; To: mike.kravetz@oracle.com, muchun.song@linux.dev, akpm@linux-foundation.org; Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org; Subject: [PATCH 1/4] mm: pagewalk: assert write mmap lock only for walking the user page tables; Date: Mon, 27 Nov 2023 16:46:42 +0800; Message-Id: <20231127084645.27017-2-songmuchun@bytedance.com>; In-Reply-To: <20231127084645.27017-1-songmuchun@bytedance.com> |
Series | Code simplification and clean-up for hugetlb vmemmap |
Commit Message
Muchun Song
Nov. 27, 2023, 8:46 a.m. UTC
Commit 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
introduced an assertion in walk_page_range_novma() to make sure all users
of the page table walker are safe. However, the race only exists when
walking the user page tables, and it makes little sense to hold a particular
process's mmap write lock against changes to the kernel page tables. So only
assert that at least the mmap read lock is held when walking the kernel page
tables. Users matching this case can then downgrade to an mmap read lock to
relieve contention on the mmap lock of init_mm; the next patch uses this for
hugetlb, which will then only hold the mmap read lock.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
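
To illustrate what the relaxed assertion allows, here is a minimal caller
sketch (not part of this patch; the walk ops and the address range are
placeholders supplied by a hypothetical caller):

	#include <linux/mm.h>
	#include <linux/mmap_lock.h>
	#include <linux/pagewalk.h>

	/*
	 * Sketch only: before this patch, walk_page_range_novma() asserted
	 * the mmap *write* lock even for init_mm; with this patch a read
	 * lock on init_mm satisfies the assertion, relieving contention on
	 * the init_mm mmap lock.
	 */
	static int walk_kernel_range_sketch(unsigned long start, unsigned long end,
					    const struct mm_walk_ops *ops)
	{
		int ret;

		mmap_read_lock(&init_mm);
		ret = walk_page_range_novma(&init_mm, start, end, ops, NULL, NULL);
		mmap_read_unlock(&init_mm);

		return ret;
	}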
Comments
On 2023/11/27 16:46, Muchun Song wrote:
> The 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
> introduces an assertion to walk_page_range_novma() to make all the users
> of page table walker is safe. However, the race only exists for walking the
> user page tables.
[...]
> +	if (mm == &init_mm)
> +		mmap_assert_locked(walk.mm);
> +	else
> +		mmap_assert_write_locked(walk.mm);

Maybe just use process_mm_walk_lock() and set correct page_walk_lock in
struct mm_walk_ops?
> On Dec 1, 2023, at 19:09, Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> On 2023/11/27 16:46, Muchun Song wrote:
[...]
>> +	if (mm == &init_mm)
>> +		mmap_assert_locked(walk.mm);
>> +	else
>> +		mmap_assert_write_locked(walk.mm);
>
> Maybe just use process_mm_walk_lock() and set correct page_walk_lock in
> struct mm_walk_ops?

No. You also need to make sure the users do not pass the wrong
walk_lock, so you also need to add something like following:

	if (mm == &init_mm)
		VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
	else
		VM_BUG_ON(walk_lock == PGWALK_RDLOCK);

I do not think the code will be simple.
On 2023/12/2 16:08, Muchun Song wrote:
>> On Dec 1, 2023, at 19:09, Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
[...]
>> Maybe just use process_mm_walk_lock() and set correct page_walk_lock in
>> struct mm_walk_ops?
>
> No. You also need to make sure the users do not pass the wrong
> walk_lock, so you also need to add something like following:

But all other walk_page_XX have been converted, see commit 49b0638502da
("mm: enable page walking API to lock vmas during the walk"); there is
nothing special about this one, the callers must pass the right
page_walk_lock to mm_walk_ops,

> 	if (mm == &init_mm)
> 		VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
> 	else
> 		VM_BUG_ON(walk_lock == PGWALK_RDLOCK);
>
> I do not think the code will be simple.

or adding the above lock check into process_mm_walk_lock() too.
> On Dec 2, 2023, at 17:25, Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> On 2023/12/2 16:08, Muchun Song wrote:
[...]
>> No. You also need to make sure the users do not pass the wrong
>> walk_lock, so you also need to add something like following:
>
> But all other walk_page_XX have been converted, see commit 49b0638502da
> ("mm: enable page walking API to lock vmas during the walk"); there is
> nothing special about this one, the callers must pass the right
> page_walk_lock to mm_walk_ops,

If you think this one is not special, why was it not converted by that
commit at the time?

>> 	if (mm == &init_mm)
>> 		VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
>> 	else
>> 		VM_BUG_ON(walk_lock == PGWALK_RDLOCK);
>>
>> I do not think the code will be simple.
>
> or adding the above lock check into process_mm_walk_lock() too.

No, that would be wrong: walk_page_range_novma() is special compared with
the other variants, and the check is only applicable to
walk_page_range_novma(), not to its variants.
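
To make the alternative under discussion concrete, here is a rough sketch
(illustrative only, not proposed code) of what folding the novma case into
the page_walk_lock mechanism from commit 49b0638502da might entail,
including the extra validation argued for above; the helper name is made up:

	/*
	 * Hypothetical helper: reuse ops->walk_lock for walk_page_range_novma()
	 * while still enforcing the init_mm distinction. The VM_BUG_ON()s are
	 * the additional checks discussed above.
	 */
	static void novma_assert_locked(struct mm_struct *mm,
					enum page_walk_lock walk_lock)
	{
		if (mm == &init_mm) {
			/* Kernel page tables: a read lock on init_mm is enough. */
			VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
			mmap_assert_locked(mm);
		} else {
			/* User page tables without VMAs: the write lock is required. */
			VM_BUG_ON(walk_lock == PGWALK_RDLOCK);
			mmap_assert_write_locked(mm);
		}
	}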
On 11/27/23 16:46, Muchun Song wrote:
> The 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
> introduces an assertion to walk_page_range_novma() to make all the users
> of page table walker is safe. However, the race only exists for walking the
> user page tables.
[...]

At first, I did not understand your motivation for this patch. But, it
makes sense as your next patch will replace hugetlb vmemmap specific walk
routines with walk_page_range_novma. Directly Cc'ing Steven and Jann in
case they have comments.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)

Looks fine to me,

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index b7d7e4fcfad7a..f46c80b18ce4f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -539,6 +539,11 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
  * not backed by VMAs. Because 'unusual' entries may be walked this function
  * will also not lock the PTEs for the pte_entry() callback. This is useful for
  * walking the kernel pages tables or page tables for firmware.
+ *
+ * Note: Be careful to walk the kernel pages tables, the caller may be need to
+ * take other effective approache (mmap lock may be insufficient) to prevent
+ * the intermediate kernel page tables belonging to the specified address range
+ * from being freed (e.g. memory hot-remove).
  */
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
@@ -556,7 +561,29 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 	if (start >= end || !walk.mm)
 		return -EINVAL;
 
-	mmap_assert_write_locked(walk.mm);
+	/*
+	 * 1) For walking the user virtual address space:
+	 *
+	 *    The mmap lock protects the page walker from changes to the page
+	 *    tables during the walk.  However a read lock is insufficient to
+	 *    protect those areas which don't have a VMA as munmap() detaches
+	 *    the VMAs before downgrading to a read lock and actually tearing
+	 *    down PTEs/page tables.  In which case, the mmap write lock should
+	 *    be hold.
+	 *
+	 * 2) For walking the kernel virtual address space:
+	 *
+	 *    The kernel intermediate page tables usually do not be freed, so
+	 *    the mmap map read lock is sufficient. But there are some exceptions.
+	 *    E.g. memory hot-remove. In which case, the mmap lock is insufficient
+	 *    to prevent the intermediate kernel pages tables belonging to the
+	 *    specified address range from being freed. The caller should take
+	 *    other actions to prevent this race.
+	 */
+	if (mm == &init_mm)
+		mmap_assert_locked(walk.mm);
+	else
+		mmap_assert_write_locked(walk.mm);
 
 	return walk_pgd_range(start, end, &walk);
 }
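
As a usage illustration (not taken from this series), a caller walking a
kernel address range under the new assertion could look roughly like the
following. The callback, the counting logic and the use of
get_online_mems() against memory hot-remove are assumptions of the sketch,
chosen to echo the caveat in the new kerneldoc note:

	#include <linux/memory_hotplug.h>
	#include <linux/mm.h>
	#include <linux/mmap_lock.h>
	#include <linux/pagewalk.h>

	/* Hypothetical callback: counts present PTEs in the walked range. */
	static int count_pte_entry(pte_t *pte, unsigned long addr,
				   unsigned long next, struct mm_walk *walk)
	{
		unsigned long *count = walk->private;

		if (pte_present(ptep_get(pte)))
			(*count)++;
		return 0;
	}

	static const struct mm_walk_ops count_walk_ops = {
		.pte_entry	= count_pte_entry,
	};

	/*
	 * Illustrative caller: with this patch applied, the mmap read lock on
	 * init_mm satisfies the assertion. Whether extra protection such as
	 * get_online_mems() is needed depends on whether the walked range can
	 * be hot-removed; that judgement is the caller's, per the kerneldoc
	 * note added above.
	 */
	static unsigned long count_kernel_ptes(unsigned long start, unsigned long end)
	{
		unsigned long count = 0;

		get_online_mems();	/* assumption: range may be hot-removable */
		mmap_read_lock(&init_mm);
		walk_page_range_novma(&init_mm, start, end, &count_walk_ops,
				      NULL, &count);
		mmap_read_unlock(&init_mm);
		put_online_mems();

		return count;
	}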