Message ID | 20221207203158.651092-1-peterx@redhat.com
State | New
Headers |
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Jann Horn <jannh@google.com>, Andrea Arcangeli <aarcange@redhat.com>, peterx@redhat.com, James Houghton <jthoughton@google.com>, Rik van Riel <riel@surriel.com>, Miaohe Lin <linmiaohe@huawei.com>, Nadav Amit <nadav.amit@gmail.com>, John Hubbard <jhubbard@nvidia.com>, Mike Kravetz <mike.kravetz@oracle.com>, David Hildenbrand <david@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk
Date: Wed, 7 Dec 2022 15:31:58 -0500
Message-Id: <20221207203158.651092-1-peterx@redhat.com>
In-Reply-To: <20221207203034.650899-1-peterx@redhat.com>
References: <20221207203034.650899-1-peterx@redhat.com>
Series | [v2,01/10] mm/hugetlb: Let vma_offset_start() to return start
Commit Message
Peter Xu
Dec. 7, 2022, 8:31 p.m. UTC
Taking vma lock here is not needed for now because all potential hugetlb
walkers here should have i_mmap_rwsem held. Document the fact.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/page_vma_mapped.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
Comments
On 12/7/22 12:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held. Document the fact.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/page_vma_mapped.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  		/* The only possible mapping was handled on last iteration */
>  		if (pvmw->pte)
>  			return not_found(pvmw);
> -
> -		/* when pud is not present, pte will be NULL */
> +		/*
> +		 * NOTE: we don't need explicit lock here to walk the
> +		 * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> +		 * When one day this rule breaks, one will get a warning
> +		 * in hugetlb_walk(), and then we'll figure out what to do.
> +		 */

Confused. Is this documentation actually intended to refer to hugetlb_walk()
itself, or just this call site? If the former, then let's move it over
to be right before hugetlb_walk().

>  		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>  		if (!pvmw->pte)
>  			return false;

thanks,
On 07.12.22 21:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held. Document the fact.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/page_vma_mapped.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  		/* The only possible mapping was handled on last iteration */
>  		if (pvmw->pte)
>  			return not_found(pvmw);
> -
> -		/* when pud is not present, pte will be NULL */
> +		/*
> +		 * NOTE: we don't need explicit lock here to walk the
> +		 * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> +		 * When one day this rule breaks, one will get a warning
> +		 * in hugetlb_walk(), and then we'll figure out what to do.
> +		 */
>  		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>  		if (!pvmw->pte)
>  			return false;

Would it make sense to squash that into the previous commit?
On Wed, Dec 07, 2022 at 04:16:11PM -0800, John Hubbard wrote:
> On 12/7/22 12:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held. Document the fact.
> >
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/page_vma_mapped.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  		/* The only possible mapping was handled on last iteration */
> >  		if (pvmw->pte)
> >  			return not_found(pvmw);
> > -
> > -		/* when pud is not present, pte will be NULL */
> > +		/*
> > +		 * NOTE: we don't need explicit lock here to walk the
> > +		 * hugetlb pgtable because either (1) potential callers of
> > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > +		 * When one day this rule breaks, one will get a warning
> > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > +		 */
>
> Confused. Is this documentation actually intended to refer to hugetlb_walk()
> itself, or just this call site? If the former, then let's move it over
> to be right before hugetlb_walk().

It is for this specific code path not hugetlb_walk().

The "holds i_mmap_rwsem" here is a true statement (not requirement) because
PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
requirements already even without holding the vma lock.
On Thu, Dec 08, 2022 at 02:16:03PM +0100, David Hildenbrand wrote:
> On 07.12.22 21:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held. Document the fact.
> >
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/page_vma_mapped.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  		/* The only possible mapping was handled on last iteration */
> >  		if (pvmw->pte)
> >  			return not_found(pvmw);
> > -
> > -		/* when pud is not present, pte will be NULL */
> > +		/*
> > +		 * NOTE: we don't need explicit lock here to walk the
> > +		 * hugetlb pgtable because either (1) potential callers of
> > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > +		 * When one day this rule breaks, one will get a warning
> > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > +		 */
> >  		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> >  		if (!pvmw->pte)
> >  			return false;
>
> Would it make sense to squash that into the previous commit?

Sure thing.
On 12/8/22 13:05, Peter Xu wrote:
>>> +		/*
>>> +		 * NOTE: we don't need explicit lock here to walk the
>>> +		 * hugetlb pgtable because either (1) potential callers of
>>> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
>>> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
>>> +		 * When one day this rule breaks, one will get a warning
>>> +		 * in hugetlb_walk(), and then we'll figure out what to do.
>>> +		 */
>>
>> Confused. Is this documentation actually intended to refer to hugetlb_walk()
>> itself, or just this call site? If the former, then let's move it over
>> to be right before hugetlb_walk().
>
> It is for this specific code path not hugetlb_walk().
>
> The "holds i_mmap_rwsem" here is a true statement (not requirement) because
> PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
> requirements already even without holding the vma lock.
>

It's really hard to understand. Do you have a few extra words to explain it?
I can help with actual comment wording perhaps, but I am still a bit in
the dark as to the actual meaning. :)

thanks,
On Thu, Dec 08, 2022 at 01:54:27PM -0800, John Hubbard wrote:
> On 12/8/22 13:05, Peter Xu wrote:
> > > > +		/*
> > > > +		 * NOTE: we don't need explicit lock here to walk the
> > > > +		 * hugetlb pgtable because either (1) potential callers of
> > > > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > > > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > > > +		 * When one day this rule breaks, one will get a warning
> > > > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > > > +		 */
> > >
> > > Confused. Is this documentation actually intended to refer to hugetlb_walk()
> > > itself, or just this call site? If the former, then let's move it over
> > > to be right before hugetlb_walk().
> >
> > It is for this specific code path not hugetlb_walk().
> >
> > The "holds i_mmap_rwsem" here is a true statement (not requirement) because
> > PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
> > requirements already even without holding the vma lock.
> >
>
> It's really hard to understand. Do you have a few extra words to explain it?
> I can help with actual comment wording perhaps, but I am still a bit in
> the dark as to the actual meaning. :)

Firstly, this patch (to be squashed into previous) is trying to document
page_vma_mapped_walk() on why it's not needed to further take any lock to
call hugetlb_walk().

To call hugetlb_walk() we need either of the locks listed below (in either
read or write mode), according to the rules we setup for it in patch 3:

  (1) hugetlb vma lock
  (2) i_mmap_rwsem lock

page_vma_mapped_walk() is called in below sites across the kernel:

__replace_page[179]                    if (!page_vma_mapped_walk(&pvmw))
__damon_pa_mkold[24]                   while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97]                   while (page_vma_mapped_walk(&pvmw)) {
write_protect_page[1065]               if (!page_vma_mapped_walk(&pvmw))
remove_migration_pte[179]              while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56]       while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318]                if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813]              while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958]              while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506]                 while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881]               while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205]   while (page_vma_mapped_walk(&pvmw)) {

If we group them, we can see that most of them are during a rmap walk
(i.e., comes from a higher rmap_walk() stack), they are:

__damon_pa_mkold[24]                   while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97]                   while (page_vma_mapped_walk(&pvmw)) {
remove_migration_pte[179]              while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56]       while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318]                if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813]              while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958]              while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506]                 while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881]               while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205]   while (page_vma_mapped_walk(&pvmw)) {

Let's call it case (A).

We have another two special cases that are not during a rmap walk, they
are:

write_protect_page[1065]               if (!page_vma_mapped_walk(&pvmw))
__replace_page[179]                    if (!page_vma_mapped_walk(&pvmw))

Let's call it case (B).

Case (A) is always safe because it always take the i_mmap_rwsem lock in
read mode.  It's done in rmap_walk_file() where:

        if (!locked) {
                if (i_mmap_trylock_read(mapping))
                        goto lookup;

                if (rwc->try_lock) {
                        rwc->contended = true;
                        return;
                }

                i_mmap_lock_read(mapping);
        }

If locked==true it means the caller already holds the lock, so no need to
take it. It justifies that all callers from rmap_walk() upon a hugetlb vma
is safe to call hugetlb_walk() already according to the rule of hugetlb_walk().

Case (B) contains two cases either in KSM path or uprobe path, and none of
the paths (afaict) can get a hugetlb vma involved. IOW, the whole path of

        if (unlikely(is_vm_hugetlb_page(vma))) {

In page_vma_mapped_walk() just should never trigger.

To summarize above into a shorter paragraph, it'll become the comment.

Hope it explains. Thanks.
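For context, hugetlb_walk() itself enforces the rule Peter refers to: it warns (under lockdep) when a caller holds neither the hugetlb vma lock nor i_mmap_rwsem. A simplified sketch of that helper is below; the guards and names such as __vma_shareable_lock and rw_sema are taken from the series and simplified here, so the exact code in "patch 3" may differ:

static inline pte_t *
hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
{
#if defined(CONFIG_HUGETLB_PAGE) && \
	defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	/*
	 * Walking hugetlb pgtables is only safe against pmd unsharing when
	 * the caller holds either the hugetlb vma lock or i_mmap_rwsem, in
	 * read or write mode; warn once if neither lock is held.
	 */
	if (__vma_shareable_lock(vma))
		WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
			     !lockdep_is_held(
				&vma->vm_file->f_mapping->i_mmap_rwsem));
#endif
	return huge_pte_offset(vma->vm_mm, addr, sz);
}

This is why the call site comment can say "one will get a warning in hugetlb_walk()" if the locking assumption ever breaks, rather than silently racing with pmd unshare.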
On 12/8/22 14:21, Peter Xu wrote:
>
> Firstly, this patch (to be squashed into previous) is trying to document
> page_vma_mapped_walk() on why it's not needed to further take any lock to
> call hugetlb_walk().
>
> To call hugetlb_walk() we need either of the locks listed below (in either
> read or write mode), according to the rules we setup for it in patch 3:
>
>   (1) hugetlb vma lock
>   (2) i_mmap_rwsem lock
>
> page_vma_mapped_walk() is called in below sites across the kernel:
>
> __replace_page[179]                    if (!page_vma_mapped_walk(&pvmw))
> __damon_pa_mkold[24]                   while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97]                   while (page_vma_mapped_walk(&pvmw)) {
> write_protect_page[1065]               if (!page_vma_mapped_walk(&pvmw))
> remove_migration_pte[179]              while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56]       while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318]                if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813]              while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958]              while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506]                 while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881]               while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205]   while (page_vma_mapped_walk(&pvmw)) {
>
> If we group them, we can see that most of them are during a rmap walk
> (i.e., comes from a higher rmap_walk() stack), they are:
>
> __damon_pa_mkold[24]                   while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97]                   while (page_vma_mapped_walk(&pvmw)) {
> remove_migration_pte[179]              while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56]       while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318]                if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813]              while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958]              while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506]                 while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881]               while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205]   while (page_vma_mapped_walk(&pvmw)) {
>
> Let's call it case (A).
>
> We have another two special cases that are not during a rmap walk, they
> are:
>
> write_protect_page[1065]               if (!page_vma_mapped_walk(&pvmw))
> __replace_page[179]                    if (!page_vma_mapped_walk(&pvmw))
>
> Let's call it case (B).
>
> Case (A) is always safe because it always take the i_mmap_rwsem lock in
> read mode.  It's done in rmap_walk_file() where:
>
>         if (!locked) {
>                 if (i_mmap_trylock_read(mapping))
>                         goto lookup;
>
>                 if (rwc->try_lock) {
>                         rwc->contended = true;
>                         return;
>                 }
>
>                 i_mmap_lock_read(mapping);
>         }
>
> If locked==true it means the caller already holds the lock, so no need to
> take it. It justifies that all callers from rmap_walk() upon a hugetlb vma
> is safe to call hugetlb_walk() already according to the rule of hugetlb_walk().
>
> Case (B) contains two cases either in KSM path or uprobe path, and none of
> the paths (afaict) can get a hugetlb vma involved. IOW, the whole path of
>
>         if (unlikely(is_vm_hugetlb_page(vma))) {
>
> In page_vma_mapped_walk() just should never trigger.
>
> To summarize above into a shorter paragraph, it'll become the comment.
>
> Hope it explains. Thanks.
>

It does! And now for the comment, I'll think you'll find that this suffices:

	/*
	 * All callers that get here will already hold the i_mmap_rwsem.
	 * Therefore, no additional locks need to be taken before
	 * calling hugetlb_walk().
	 */

...which, considering all the data above, is probably the mother of all
summaries. :) But really, it's all that people need to know here, and it's
readily understandable without wondering what KSM has to do with this, for
example.

thanks,
On Thu, Dec 08, 2022 at 04:24:19PM -0800, John Hubbard wrote:
> It does! And now for the comment, I'll think you'll find that this suffices:
>
> 	/*
> 	 * All callers that get here will already hold the i_mmap_rwsem.
> 	 * Therefore, no additional locks need to be taken before
> 	 * calling hugetlb_walk().
> 	 */
>
> ...which, considering all the data above, is probably the mother of
> all summaries. :) But really, it's all that people need to know here, and
> it's readily understandable without wondering what KSM has to do with this,
> for example.

I'm okay with the change. :)

I think what I'll do is I'll move part of the original one into commit
message, and take the new version in the code.

Thanks,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e97b2e23bd28..2e59a0419d22 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		/* The only possible mapping was handled on last iteration */
 		if (pvmw->pte)
 			return not_found(pvmw);
-
-		/* when pud is not present, pte will be NULL */
+		/*
+		 * NOTE: we don't need explicit lock here to walk the
+		 * hugetlb pgtable because either (1) potential callers of
+		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
+		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
+		 * When one day this rule breaks, one will get a warning
+		 * in hugetlb_walk(), and then we'll figure out what to do.
+		 */
 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte)
 			return false;
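Taken together with the review discussion, once this patch is squashed into the previous one and John's shorter wording is adopted, the call site would read roughly as below (a sketch of the agreed end state, not necessarily the exact committed text):

		/*
		 * All callers that get here will already hold the
		 * i_mmap_rwsem. Therefore, no additional locks need to be
		 * taken before calling hugetlb_walk().
		 */
		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
		if (!pvmw->pte)
			return false;

The longer justification (rmap walkers always holding i_mmap_rwsem in read mode, and KSM/uprobe never walking hugetlb vmas) moves into the commit message instead of the code, per Peter's final reply.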