Message ID | 20221129193526.3588187-10-peterx@redhat.com
State | New
Headers |
    From: Peter Xu <peterx@redhat.com>
    To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
    Cc: James Houghton <jthoughton@google.com>, Jann Horn <jannh@google.com>,
        peterx@redhat.com, Andrew Morton <akpm@linux-foundation.org>,
        Andrea Arcangeli <aarcange@redhat.com>, Rik van Riel <riel@surriel.com>,
        Nadav Amit <nadav.amit@gmail.com>, Miaohe Lin <linmiaohe@huawei.com>,
        Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>,
        David Hildenbrand <david@redhat.com>
    Subject: [PATCH 09/10] mm/hugetlb: Make page_vma_mapped_walk() safe to pmd unshare
    Date: Tue, 29 Nov 2022 14:35:25 -0500
    Message-Id: <20221129193526.3588187-10-peterx@redhat.com>
    In-Reply-To: <20221129193526.3588187-1-peterx@redhat.com>
    References: <20221129193526.3588187-1-peterx@redhat.com>
Series | [01/10] mm/hugetlb: Let vma_offset_start() to return start
Commit Message
Peter Xu
Nov. 29, 2022, 7:35 p.m. UTC
Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/rmap.h | 4 ++++
mm/page_vma_mapped.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
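
The commit message is terse about what can actually go wrong without the lock. As an editorial illustration (not part of the patch), the race being closed looks roughly like the sketch below. The helpers huge_pte_offset(), huge_pte_lock() and hugetlb_vma_lock_read() appear in the diff at the end of this page; huge_pmd_unshare() is the kernel function that tears down a shared PMD and can free the shared page table page.

    /*
     * Sketch of the race, two CPUs, one shared hugetlb PMD page:
     *
     *   CPU 0: pvmw/rmap walker                CPU 1: unmap path
     *   ------------------------               -----------------------------
     *   pte = huge_pte_offset(mm, addr, sz);
     *       // pte points into a PMD page
     *       // shared across several VMAs
     *                                          huge_pmd_unshare(...);
     *                                          // drops the shared PMD page's
     *                                          // refcount; the page table
     *                                          // page can now be freed
     *   ptl = huge_pte_lock(hstate, mm, pte);
     *       // dereferences a page table page
     *       // that may already be gone:
     *       // use-after-free
     *
     * Taking hugetlb_vma_lock_read() around the walk (this patch), or already
     * holding i_mmap_rwsem (see the discussion below), prevents the unshare
     * from running concurrently with the walk.
     */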
Comments
On 29.11.22 20:35, Peter Xu wrote:
> Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Looking at code like mm/damon/paddr.c:__damon_pa_mkold() and reading the
doc of page_vma_mapped_walk(), this might be broken.

Can't we get page_vma_mapped_walk() called multiple times?

Wouldn't we have to remember that we already took the lock to not lock
twice, and to see if we really have to unlock in
page_vma_mapped_walk_done() ?
On Wed, Nov 30, 2022 at 05:18:45PM +0100, David Hildenbrand wrote:
> On 29.11.22 20:35, Peter Xu wrote:
> > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> > to make sure the pgtable page will not be freed concurrently.
>
> Looking at code like mm/damon/paddr.c:__damon_pa_mkold() and reading the
> doc of page_vma_mapped_walk(), this might be broken.
>
> Can't we get page_vma_mapped_walk() called multiple times?

Yes it normally can, but not for hugetlbfs?  Feel free to check:

        if (unlikely(is_vm_hugetlb_page(vma))) {
                ...
                /* The only possible mapping was handled on last iteration */
                if (pvmw->pte)
                        return not_found(pvmw);
        }

> Wouldn't we have to remember that we already took the lock to not lock
> twice, and to see if we really have to unlock in
> page_vma_mapped_walk_done() ?
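
To spell out why one lock always pairs with exactly one unlock here, the caller-side flow looks roughly like the editorial sketch below (this is not code from the patch; it is the standard loop shape rmap users drive page_vma_mapped_walk() with):

        /*
         * First call: the hugetlb branch takes hugetlb_vma_lock_read(),
         * finds the single huge mapping, and returns true with pvmw->pte
         * and pvmw->ptl set.
         */
        while (page_vma_mapped_walk(&pvmw)) {
                /* ... act on the one and only hugetlb mapping ... */
        }
        /*
         * Second call: pvmw->pte is already set, so the hugetlb branch
         * returns not_found(pvmw), which calls page_vma_mapped_walk_done()
         * and drops both the page table spinlock and the vma read lock
         * exactly once.  Callers that break out of the loop early call
         * page_vma_mapped_walk_done() themselves, with the same effect.
         */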
On 30.11.22 17:32, Peter Xu wrote:
> On Wed, Nov 30, 2022 at 05:18:45PM +0100, David Hildenbrand wrote:
> > Can't we get page_vma_mapped_walk() called multiple times?
>
> Yes it normally can, but not for hugetlbfs?  Feel free to check:
>
>         if (unlikely(is_vm_hugetlb_page(vma))) {
>                 ...
>                 /* The only possible mapping was handled on last iteration */
>                 if (pvmw->pte)
>                         return not_found(pvmw);
>         }

Ah, I see, thanks.

Acked-by: David Hildenbrand <david@redhat.com>
On 11/29/22 14:35, Peter Xu wrote:
> Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.

I think this is going to cause try_to_unmap() to always fail for hugetlb
shared pages.  See try_to_unmap_one:

        while (page_vma_mapped_walk(&pvmw)) {
                ...
                if (folio_test_hugetlb(folio)) {
                        ...
                        /*
                         * To call huge_pmd_unshare, i_mmap_rwsem must be
                         * held in write mode.  Caller needs to explicitly
                         * do this outside rmap routines.
                         *
                         * We also must hold hugetlb vma_lock in write mode.
                         * Lock order dictates acquiring vma_lock BEFORE
                         * i_mmap_rwsem.  We can only try lock here and fail
                         * if unsuccessful.
                         */
                        if (!anon) {
                                VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
                                if (!hugetlb_vma_trylock_write(vma)) {
                                        page_vma_mapped_walk_done(&pvmw);
                                        ret = false;
                                }

Can not think of a great solution right now.
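
The failure mode Mike describes is plain reader-versus-write-trylock exclusion: with this patch, page_vma_mapped_walk() already holds the hugetlb vma lock for read when try_to_unmap_one() attempts hugetlb_vma_trylock_write(), so the trylock can never succeed. A minimal userspace analogue (a pthread rwlock standing in for the hugetlb vma lock, which is a kernel rw_semaphore; names chosen for illustration only) demonstrates the point:

        #include <pthread.h>
        #include <stdio.h>

        int main(void)
        {
                pthread_rwlock_t vma_lock = PTHREAD_RWLOCK_INITIALIZER;

                /* page_vma_mapped_walk() with this patch: lock taken for read */
                pthread_rwlock_rdlock(&vma_lock);

                /* try_to_unmap_one(): hugetlb_vma_trylock_write() analogue */
                if (pthread_rwlock_trywrlock(&vma_lock) != 0)
                        printf("write trylock fails while the read lock is held\n");
                else
                        pthread_rwlock_unlock(&vma_lock); /* never reached: a reader holds the lock */

                pthread_rwlock_unlock(&vma_lock);
                return 0;
        }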
On 12/05/22 15:52, Mike Kravetz wrote:
> On 11/29/22 14:35, Peter Xu wrote:
> > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> > to make sure the pgtable page will not be freed concurrently.
>
> I think this is going to cause try_to_unmap() to always fail for hugetlb
> shared pages.  See try_to_unmap_one:
>
> Can not think of a great solution right now.

Thought of this last night ...

Perhaps we do not need vma_lock in this code path (not sure about all
page_vma_mapped_walk calls).  Why?  We already hold i_mmap_rwsem.
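
A sketch of why the rmap paths are already covered (editorial illustration; the call chain is simplified, and some callers hold i_mmap_rwsem in write mode instead via TTU_RMAP_LOCKED, which excludes the unshare just the same):

        rmap_walk_file()
            i_mmap_lock_read(mapping);       /* i_mmap_rwsem taken shared      */
                try_to_unmap_one()/...       /* rmap_one callbacks             */
                    page_vma_mapped_walk();  /* hugetlb pgtable walk           */
            i_mmap_unlock_read(mapping);

        /*
         * huge_pmd_unshare() requires i_mmap_rwsem held in write mode (see the
         * comment quoted from try_to_unmap_one above), so it cannot run while
         * an rmap walker holds the rwsem shared -- the shared PMD page stays
         * valid without also taking the hugetlb vma lock on these paths.
         */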
On Tue, Dec 06, 2022 at 09:10:00AM -0800, Mike Kravetz wrote:
> On 12/05/22 15:52, Mike Kravetz wrote:
> > I think this is going to cause try_to_unmap() to always fail for hugetlb
> > shared pages.  See try_to_unmap_one:
> >
> > Can not think of a great solution right now.
>
> Thought of this last night ...
>
> Perhaps we do not need vma_lock in this code path (not sure about all
> page_vma_mapped_walk calls).  Why?  We already hold i_mmap_rwsem.

Exactly.  The only concern is when it's not in a rmap.

I'm actually preparing something that adds a new flag to PVMW, like:

        #define PVMW_HUGETLB_NEEDS_LOCK  (1 << 2)

But maybe we don't need that at all, since I had a closer look the only
outliers of not using a rmap is:

        __replace_page
        write_protect_page

I'm pretty sure ksm doesn't have hugetlb involved, then the other one is
uprobe (uprobe_write_opcode).  I think it's the same.  If it's true, we can
simply drop this patch.  Then we also have hugetlb_walk and the lock checks
there guarantee that we're safe anyways.

Potentially we can document this fact, which I also attached a comment
patch just for it to be appended to the end of the patchset.

Mike, let me know what do you think.

Andrew, if this patch to be dropped then the last patch may not cleanly
apply.  Let me know if you want a full repost of the things.

Thanks,
On Tue, Dec 06, 2022 at 12:39:53PM -0500, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 09:10:00AM -0800, Mike Kravetz wrote:
> > Perhaps we do not need vma_lock in this code path (not sure about all
> > page_vma_mapped_walk calls).  Why?  We already hold i_mmap_rwsem.
>
> Exactly.  The only concern is when it's not in a rmap.
>
> Potentially we can document this fact, which I also attached a comment
> patch just for it to be appended to the end of the patchset.
>
> Mike, let me know what do you think.
>
> Andrew, if this patch to be dropped then the last patch may not cleanly
> apply.  Let me know if you want a full repost of the things.

The document patch that can be appended to the end of this series attached.
I referenced hugetlb_walk() so it needs to be the last patch.
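
The attached comment patch itself is not reproduced on this page. For readers wondering what "the lock checks there" refers to: hugetlb_walk() is a wrapper around huge_pte_offset() introduced elsewhere in this series (per the discussion, the comment patch must come after it), and the kind of check being discussed is a lockdep assertion of roughly the following shape. This is a sketch, not the attached patch; helper and field names such as __vma_shareable_lock and rw_sema are taken from the hugetlb vma lock implementation and may differ in detail.

        static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
                                          unsigned long addr, unsigned long sz)
        {
        #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_LOCKDEP)
                struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

                /*
                 * Walking hugetlb pgtables is only safe against pmd unsharing
                 * when either the hugetlb vma lock or i_mmap_rwsem is held.
                 */
                if (__vma_shareable_lock(vma))
                        WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
                                     !lockdep_is_held(
                                         &vma->vm_file->f_mapping->i_mmap_rwsem));
        #endif
                return huge_pte_offset(vma->vm_mm, addr, sz);
        }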
On 12/06/22 12:43, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 12:39:53PM -0500, Peter Xu wrote:
> > But maybe we don't need that at all, since I had a closer look the only
> > outliers of not using a rmap is:
> >
> >         __replace_page
> >         write_protect_page
> >
> > I'm pretty sure ksm doesn't have hugetlb involved, then the other one is
> > uprobe (uprobe_write_opcode).  I think it's the same.  If it's true, we
> > can simply drop this patch.  Then we also have hugetlb_walk and the lock
> > checks there guarantee that we're safe anyways.
> >
> > Potentially we can document this fact, which I also attached a comment
> > patch just for it to be appended to the end of the patchset.
>
> The document patch that can be appended to the end of this series attached.
> I referenced hugetlb_walk() so it needs to be the last patch.
>
> --
> Peter Xu

Agree with dropping this patch and adding the document patch below.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Also, happy we have the warnings in place to catch incorrect locking.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..a50d18bb86aa 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -13,6 +13,7 @@
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/memremap.h>
+#include <linux/hugetlb.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);
+	/* This needs to be after unlock of the spinlock */
+	if (is_vm_hugetlb_page(pvmw->vma))
+		hugetlb_vma_unlock_read(pvmw->vma);
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..f94ec78b54ff 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (pvmw->pte)
 			return not_found(pvmw);
 
+		hugetlb_vma_lock_read(vma);
 		/* when pud is not present, pte will be NULL */
 		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
-		if (!pvmw->pte)
+		if (!pvmw->pte) {
+			hugetlb_vma_unlock_read(vma);
 			return false;
+		}
 
 		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
 		if (!check_pte(pvmw))