Message ID | 20221129193526.3588187-9-peterx@redhat.com |
---|---|
State | New |
Headers |
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: James Houghton <jthoughton@google.com>, Jann Horn <jannh@google.com>, peterx@redhat.com, Andrew Morton <akpm@linux-foundation.org>, Andrea Arcangeli <aarcange@redhat.com>, Rik van Riel <riel@surriel.com>, Nadav Amit <nadav.amit@gmail.com>, Miaohe Lin <linmiaohe@huawei.com>, Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>, David Hildenbrand <david@redhat.com>
Subject: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
Date: Tue, 29 Nov 2022 14:35:24 -0500
Message-Id: <20221129193526.3588187-9-peterx@redhat.com>
In-Reply-To: <20221129193526.3588187-1-peterx@redhat.com>
References: <20221129193526.3588187-1-peterx@redhat.com>
Series |
[01/10] mm/hugetlb: Let vma_offset_start() to return start
Commit Message
Peter Xu
Nov. 29, 2022, 7:35 p.m. UTC
Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/pagewalk.c | 2 ++
1 file changed, 2 insertions(+)
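For context, the hugetlb vma lock (introduced by the earlier hugetlb vma-lock series) is what excludes huge_pmd_unshare() from dropping a shared PMD page table page while a reader still holds pointers into it. Below is a minimal sketch of the access pattern this patch puts under that lock; it is illustrative only (walk_one_hugetlb_pte() is a made-up helper, not code from the patch), though the locking and lookup calls are the real hugetlb API:

/*
 * Illustrative sketch only -- not part of the patch.
 */
static void walk_one_hugetlb_pte(struct vm_area_struct *vma, unsigned long addr)
{
	struct hstate *h = hstate_vma(vma);
	pte_t *pte;

	hugetlb_vma_lock_read(vma);	/* excludes concurrent pmd unsharing */
	pte = huge_pte_offset(vma->vm_mm, addr, huge_page_size(h));
	if (pte) {
		spinlock_t *ptl = huge_pte_lock(h, vma->vm_mm, pte);

		/* pte and ptl are stable here: the PMD page cannot be freed */
		spin_unlock(ptl);
	}
	hugetlb_vma_unlock_read(vma);	/* after this, pte must not be touched */
}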
Comments
On 29.11.22 20:35, Peter Xu wrote:
> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>
On 11/29/22 14:35, Peter Xu wrote:
> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/pagewalk.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 7f1c9b274906..d98564a7be57 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
>  	const struct mm_walk_ops *ops = walk->ops;
>  	int err = 0;
>
> +	hugetlb_vma_lock_read(vma);
>  	do {
>  		next = hugetlb_entry_end(h, addr, end);
>  		pte = huge_pte_offset(walk->mm, addr & hmask, sz);

For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with
the vma_lock held.  I looked into the various hugetlb_entry routines, and
I am not sure about hmm_vma_walk_hugetlb_entry.  It seems like it could
possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault.  If this
can happen, then we may have an issue as hugetlb_fault will also need to
acquire the vma_lock in read mode.

I do not know the hmm code well enough to know if this may be an actual
issue?
On 12/5/22 15:33, Mike Kravetz wrote:
> On 11/29/22 14:35, Peter Xu wrote:
> > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > to make sure the pgtable page will not be freed concurrently.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/pagewalk.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 7f1c9b274906..d98564a7be57 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
> >  	const struct mm_walk_ops *ops = walk->ops;
> >  	int err = 0;
> >
> > +	hugetlb_vma_lock_read(vma);
> >  	do {
> >  		next = hugetlb_entry_end(h, addr, end);
> >  		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
>
> For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with
> the vma_lock held.  I looked into the various hugetlb_entry routines, and
> I am not sure about hmm_vma_walk_hugetlb_entry.  It seems like it could
> possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault.  If this
> can happen, then we may have an issue as hugetlb_fault will also need to
> acquire the vma_lock in read mode.
>
> I do not know the hmm code well enough to know if this may be an actual
> issue?

Oh, this sounds like a serious concern. If we add a new lock, and hold it
during callbacks that also need to take it, that's not going to work out,
right?

And yes, hmm_range_fault() and related things do a good job of revealing
this kind of deadlock. :)

thanks,
On Mon, Dec 05, 2022 at 03:52:51PM -0800, John Hubbard wrote:
> On 12/5/22 15:33, Mike Kravetz wrote:
> > On 11/29/22 14:35, Peter Xu wrote:
> > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > > to make sure the pgtable page will not be freed concurrently.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  mm/pagewalk.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > index 7f1c9b274906..d98564a7be57 100644
> > > --- a/mm/pagewalk.c
> > > +++ b/mm/pagewalk.c
> > > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
> > >  	const struct mm_walk_ops *ops = walk->ops;
> > >  	int err = 0;
> > > +	hugetlb_vma_lock_read(vma);
> > >  	do {
> > >  		next = hugetlb_entry_end(h, addr, end);
> > >  		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
> >
> > For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with
> > the vma_lock held.  I looked into the various hugetlb_entry routines, and
> > I am not sure about hmm_vma_walk_hugetlb_entry.  It seems like it could
> > possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault.  If this
> > can happen, then we may have an issue as hugetlb_fault will also need to
> > acquire the vma_lock in read mode.

Thanks for spotting that, Mike.

I used to notice that path special but that's when I was still using RCU
locks who doesn't have the issue.  Then I overlooked this one when
switchover.

> >
> > I do not know the hmm code well enough to know if this may be an actual
> > issue?
>
> Oh, this sounds like a serious concern. If we add a new lock, and hold it
> during callbacks that also need to take it, that's not going to work out,
> right?
>
> And yes, hmm_range_fault() and related things do a good job of revealing
> this kind of deadlock. :)

I've got a fixup attached.  John, since this got your attention please also
have a look too in case there's further issues.

Thanks,
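The fixup itself is not reproduced inline here, but judging from the mm/hmm.c hunk John quotes later in the thread, its effect on the tail of hmm_vma_walk_hugetlb_entry() is roughly the following sketch (a reconstruction for readability, not the verbatim attachment; the surrounding if-block is assumed from the pre-existing hmm code):

	/* Reconstructed sketch of the fixup in hmm_vma_walk_hugetlb_entry();
	 * see the mm/hmm.c diff quoted further down for the authoritative
	 * context lines. */
	if (required_fault) {
		spin_unlock(ptl);
		/* Drop the hugetlb vma lock: hmm_vma_fault() can end up in
		 * hugetlb_fault(), which takes it again in read mode. */
		hugetlb_vma_unlock_read(vma);
		ret = hmm_vma_fault(addr, end, required_fault, walk);
		/* Retake it so walk_hugetlb_range() can unlock as usual. */
		hugetlb_vma_lock_read(vma);
		return ret;
	}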
On 12/06/22 11:45, Peter Xu wrote: > On Mon, Dec 05, 2022 at 03:52:51PM -0800, John Hubbard wrote: > > On 12/5/22 15:33, Mike Kravetz wrote: > > > On 11/29/22 14:35, Peter Xu wrote: > > > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock > > > > to make sure the pgtable page will not be freed concurrently. > > > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > --- > > > > mm/pagewalk.c | 2 ++ > > > > 1 file changed, 2 insertions(+) > > > > > > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > > > index 7f1c9b274906..d98564a7be57 100644 > > > > --- a/mm/pagewalk.c > > > > +++ b/mm/pagewalk.c > > > > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > > > const struct mm_walk_ops *ops = walk->ops; > > > > int err = 0; > > > > + hugetlb_vma_lock_read(vma); > > > > do { > > > > next = hugetlb_entry_end(h, addr, end); > > > > pte = huge_pte_offset(walk->mm, addr & hmask, sz); > > > > > > For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with > > > the vma_lock held. I looked into the various hugetlb_entry routines, and > > > I am not sure about hmm_vma_walk_hugetlb_entry. It seems like it could > > > possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault. If this > > > can happen, then we may have an issue as hugetlb_fault will also need to > > > acquire the vma_lock in read mode. > > Thanks for spotting that, Mike. > > I used to notice that path special but that's when I was still using RCU > locks who doesn't have the issue. Then I overlooked this one when > switchover. > > > > > > > I do not know the hmm code well enough to know if this may be an actual > > > issue? > > > > Oh, this sounds like a serious concern. If we add a new lock, and hold it > > during callbacks that also need to take it, that's not going to work out, > > right? > > > > And yes, hmm_range_fault() and related things do a good job of revealing > > this kind of deadlock. :) > > I've got a fixup attached. John, since this got your attention please also > have a look too in case there's further issues. > > Thanks, > > -- > Peter Xu Thanks Peter. I am good with the fixup. When combined with original, Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
On 12/6/22 08:45, Peter Xu wrote:
> I've got a fixup attached.  John, since this got your attention please also
> have a look too in case there's further issues.
>

Well, one question: Normally, the pattern of "release_lock(A); call f();
acquire_lock(A);" is tricky, because one must revalidate that the state
protected by A has not changed while the lock was released.  However, in
this case, it's letting page fault handling proceed, which already
assumes that pages might be gone, so generally that seems OK.

However, I'm lagging behind on understanding what the vma lock actually
protects.  It seems to be a hugetlb-specific protection for concurrent
freeing of the page tables?

If so, then running a page fault handler seems safe.  If there's something
else it protects, then we might need to revalidate that after
re-acquiring the vma lock.

Also, scattering hugetlb-specific locks throughout mm seems like an
unfortuate thing, I wonder if there is a longer term plan to Not Do
That?

thanks,
On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote: > On 12/6/22 08:45, Peter Xu wrote: > > I've got a fixup attached. John, since this got your attention please also > > have a look too in case there's further issues. > > > > Well, one question: Normally, the pattern of "release_lock(A); call f(); > acquire_lock(A);" is tricky, because one must revalidate that the state > protected by A has not changed while the lock was released. However, in > this case, it's letting page fault handling proceed, which already > assumes that pages might be gone, so generally that seems OK. Yes it's tricky, but not as tricky in this case. I hope my documentation supplemented that (in the fixup patch): + * @hugetlb_entry: if set, called for each hugetlb entry. Note that + * currently the hook function is protected by hugetlb + * vma lock to make sure pte_t* and the spinlock is valid + * to access. If the hook function needs to yield the + * thread or retake the vma lock for some reason, it + * needs to properly release the vma lock manually, + * and retake it before the function returns. The vma lock here makes sure the pte_t and the pgtable spinlock being stable. Without the lock, they're prone to be freed in parallel. > > However, I'm lagging behind on understanding what the vma lock actually > protects. It seems to be a hugetlb-specific protection for concurrent > freeing of the page tables? Not exactly freeing, but unsharing. Mike probably has more to say. The series is here: https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t > If so, then running a page fault handler seems safe. If there's something > else it protects, then we might need to revalidate that after > re-acquiring the vma lock. Nothing to validate here. The only reason to take the vma lock is to match with the caller who assumes the lock taken, so either it'll be released very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset). > > Also, scattering hugetlb-specific locks throughout mm seems like an > unfortuate thing, I wonder if there is a longer term plan to Not Do > That? So far HMM is really the only one - normally hugetlb_entry() hook is pretty light, so not really throughout the whole mm yet. It's even not urgently needed for the other two places calling cond_sched(), I added it mostly just for completeness, and with the slight hope that maybe we can yield earlier for some pmd unsharers. But yes it's unfortunate, I just didn't come up with a good solution. Suggestion is always welcomed. Thanks,
On 12/6/22 13:51, Peter Xu wrote: > On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote: >> On 12/6/22 08:45, Peter Xu wrote: >>> I've got a fixup attached. John, since this got your attention please also >>> have a look too in case there's further issues. >>> >> >> Well, one question: Normally, the pattern of "release_lock(A); call f(); >> acquire_lock(A);" is tricky, because one must revalidate that the state >> protected by A has not changed while the lock was released. However, in >> this case, it's letting page fault handling proceed, which already >> assumes that pages might be gone, so generally that seems OK. > > Yes it's tricky, but not as tricky in this case. > > I hope my documentation supplemented that (in the fixup patch): > > + * @hugetlb_entry: if set, called for each hugetlb entry. Note that > + * currently the hook function is protected by hugetlb > + * vma lock to make sure pte_t* and the spinlock is valid > + * to access. If the hook function needs to yield the So far so good... > + * thread or retake the vma lock for some reason, it > + * needs to properly release the vma lock manually, > + * and retake it before the function returns. ...but you can actually delete this second sentence. It does not add any real information--clearly, if you must drop the lock, then you must "manually" drop the lock. And it still ignores my original question, which I don't think I've fully communicated. Basically, what can happen to the protected data during the time when the lock is not held? > > The vma lock here makes sure the pte_t and the pgtable spinlock being > stable. Without the lock, they're prone to be freed in parallel. > Yes, but think about this: if the vma lock protects against the pte going away, then: lock() get a pte unlock() ...let hmm_vma_fault() cond_resched() run... lock() ...whoops, something else release the pte that I'd previously retrieved. >> >> However, I'm lagging behind on understanding what the vma lock actually >> protects. It seems to be a hugetlb-specific protection for concurrent >> freeing of the page tables? > > Not exactly freeing, but unsharing. Mike probably has more to say. The > series is here: > > https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t > >> If so, then running a page fault handler seems safe. If there's something >> else it protects, then we might need to revalidate that after >> re-acquiring the vma lock. > > Nothing to validate here. The only reason to take the vma lock is to match > with the caller who assumes the lock taken, so either it'll be released > very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset). > ummm, see above. :) >> >> Also, scattering hugetlb-specific locks throughout mm seems like an >> unfortuate thing, I wonder if there is a longer term plan to Not Do >> That? > > So far HMM is really the only one - normally hugetlb_entry() hook is pretty > light, so not really throughout the whole mm yet. It's even not urgently > needed for the other two places calling cond_sched(), I added it mostly > just for completeness, and with the slight hope that maybe we can yield > earlier for some pmd unsharers. > > But yes it's unfortunate, I just didn't come up with a good solution. > Suggestion is always welcomed. > I guess it's on me to think of something cleaner, so if I do I'll pipe up. :) thanks,
On Tue, Dec 06, 2022 at 02:31:30PM -0800, John Hubbard wrote: > On 12/6/22 13:51, Peter Xu wrote: > > On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote: > > > On 12/6/22 08:45, Peter Xu wrote: > > > > I've got a fixup attached. John, since this got your attention please also > > > > have a look too in case there's further issues. > > > > > > > > > > Well, one question: Normally, the pattern of "release_lock(A); call f(); > > > acquire_lock(A);" is tricky, because one must revalidate that the state > > > protected by A has not changed while the lock was released. However, in > > > this case, it's letting page fault handling proceed, which already > > > assumes that pages might be gone, so generally that seems OK. > > > > Yes it's tricky, but not as tricky in this case. > > > > I hope my documentation supplemented that (in the fixup patch): > > > > + * @hugetlb_entry: if set, called for each hugetlb entry. Note that > > + * currently the hook function is protected by hugetlb > > + * vma lock to make sure pte_t* and the spinlock is valid > > + * to access. If the hook function needs to yield the [1] > > So far so good... > > > + * thread or retake the vma lock for some reason, it > > + * needs to properly release the vma lock manually, > > + * and retake it before the function returns. > > ...but you can actually delete this second sentence. It does not add > any real information--clearly, if you must drop the lock, then you must > "manually" drop the lock. > > And it still ignores my original question, which I don't think I've > fully communicated. Basically, what can happen to the protected data > during the time when the lock is not held? I thought I answered this one at [1] above. If not, I can extend the answer. What can happen is some thread can firstly unshare the pmd pgtable page (e.g. by clearing the PUD entry in current mm), then release the pmd pgtable page (e.g. by unmapping it) even if current thread is still accessing it. It will cause use-after-free on the pmd pgtable page on this thread in various ways. One way to trigger this is when the current thread tries to take the pgtable lock and it'll trigger warning like the call stack referenced in the cover letter of this series: https://lore.kernel.org/r/20221129193526.3588187-1-peterx@redhat.com Please also feel free to read the reproducer attached in the cover letter, it has details on how this can trigger (even though it's so hard to trigger so I added a delay in the kernel to make it trigger). The idea should be the same. > > > > > The vma lock here makes sure the pte_t and the pgtable spinlock being > > stable. Without the lock, they're prone to be freed in parallel. > > > > Yes, but think about this: if the vma lock protects against the pte > going away, then: > > lock() > get a pte > unlock() > > ...let hmm_vma_fault() cond_resched() run... > > lock() > ...whoops, something else release the pte that I'd previously > retrieved. Here the pte_t* is never referenced again after hugetlb_entry() returned. The loop looks like: do { next = hugetlb_entry_end(h, addr, end); pte = hugetlb_walk(vma, addr & hmask, sz); if (pte) err = ops->hugetlb_entry(pte, hmask, addr, next, walk); else if (ops->pte_hole) err = ops->pte_hole(addr, next, -1, walk); if (err) break; } while (addr = next, addr != end); After hugetlb_entry() returned, we'll _never_ touch that pte again we got from either huge_pte_offset() or hugetlb_walk() after this patchset applied. If we touch it, it's a potential bug as you mentioned. But we didn't. 
Hope it explains. > > > > > > > However, I'm lagging behind on understanding what the vma lock actually > > > protects. It seems to be a hugetlb-specific protection for concurrent > > > freeing of the page tables? > > > > Not exactly freeing, but unsharing. Mike probably has more to say. The > > series is here: > > > > https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t > > > > > If so, then running a page fault handler seems safe. If there's something > > > else it protects, then we might need to revalidate that after > > > re-acquiring the vma lock. > > > > Nothing to validate here. The only reason to take the vma lock is to match > > with the caller who assumes the lock taken, so either it'll be released > > very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset). > > > > ummm, see above. :) > > > > > > > Also, scattering hugetlb-specific locks throughout mm seems like an > > > unfortuate thing, I wonder if there is a longer term plan to Not Do > > > That? > > > > So far HMM is really the only one - normally hugetlb_entry() hook is pretty > > light, so not really throughout the whole mm yet. It's even not urgently > > needed for the other two places calling cond_sched(), I added it mostly > > just for completeness, and with the slight hope that maybe we can yield > > earlier for some pmd unsharers. > > > > But yes it's unfortunate, I just didn't come up with a good solution. > > Suggestion is always welcomed. > > > > I guess it's on me to think of something cleaner, so if I do I'll pipe > up. :) That'll be very much appricated. It's really that I don't know how to make this better, or I can rework the series as long as it hasn't land upstream. Thanks,
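To make the use-after-free scenario Peter describes above easier to follow, here is a rough sketch of the two racing sides. It is illustrative only: the walker shown is the pre-patch behaviour without the vma lock, and thread B's call path is condensed into a comment rather than a literal kernel path.

/*
 * Thread A: pre-patch walk_hugetlb_range() style access, no hugetlb vma
 * lock held (sketch, not literal kernel code).
 */
static void thread_a_walker(struct mm_walk *walk, unsigned long addr,
			    unsigned long hmask, unsigned long sz)
{
	pte_t *pte = huge_pte_offset(walk->mm, addr & hmask, sz);

	/* Window: thread B can run here. */

	if (pte) {
		spinlock_t *ptl = huge_pte_lock(hstate_vma(walk->vma),
						walk->mm, pte);
		/* If thread B already unshared and freed the shared PMD
		 * page table page, pte and ptl now point into freed memory:
		 * use-after-free. */
		spin_unlock(ptl);
	}
}

/*
 * Thread B: another mapper of the same shared PMD page table, e.g. an
 * unmap path.  It clears the PUD entry in thread A's mm via
 * huge_pmd_unshare() and the PMD page table page is freed once its last
 * reference is dropped.  In this series such paths hold the hugetlb vma
 * lock in write mode, which is exactly what thread A's read lock (added
 * by this patch) excludes during the window above.
 */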
On 12/6/22 16:07, Peter Xu wrote:
> I thought I answered this one at [1] above. If not, I can extend the
> answer.

...

>
> If we touch it, it's a potential bug as you mentioned.
>
> Hope it explains.

I think it's OK after all, because hmm_vma_fault() does revalidate after
it takes the vma lock, so that closes the loop that I was fretting over.

I was just also worried that I'd missed some other place, but it looks
like that's not the case.

So, good.

How about this incremental diff on top, as an attempt to clarify what's
going on? Or is this too much wordage? Sometimes I write too many words:

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 1f7c2011f6cb..27a6df448ee5 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -21,13 +21,16 @@ struct mm_walk;
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
  *			are skipped.
- * @hugetlb_entry:	if set, called for each hugetlb entry. Note that
- *			currently the hook function is protected by hugetlb
- *			vma lock to make sure pte_t* and the spinlock is valid
- *			to access. If the hook function needs to yield the
- *			thread or retake the vma lock for some reason, it
- *			needs to properly release the vma lock manually,
- *			and retake it before the function returns.
+ * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
+ *			function is called with the vma lock held, in order to
+ *			protect against a concurrent freeing of the pte_t* or
+ *			the ptl. In some cases, the hook function needs to drop
+ *			and retake the vma lock in order to avoid deadlocks
+ *			while calling other functions. In such cases the hook
+ *			function must either refrain from accessing the pte or
+ *			ptl after dropping the vma lock, or else revalidate
+ *			those items after re-acquiring the vma lock and before
+ *			accessing them.
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
diff --git a/mm/hmm.c b/mm/hmm.c
index dcd624f28bcf..b428f2011cfd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -497,7 +497,13 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spin_unlock(ptl);
 	hugetlb_vma_unlock_read(vma);

-	/* hmm_vma_fault() can retake the vma lock */
+	/*
+	 * Avoid deadlock: drop the vma lock before calling
+	 * hmm_vma_fault(), which will itself potentially take and drop
+	 * the vma lock. This is also correct from a protection point of
+	 * view, because there is no further use here of either pte or
+	 * ptl after dropping the vma lock.
+	 */
 	ret = hmm_vma_fault(addr, end, required_fault, walk);
 	hugetlb_vma_lock_read(vma);
 	return ret;

>> I guess it's on me to think of something cleaner, so if I do I'll pipe
>> up. :)
>
> That'll be very much appricated.
>
> It's really that I don't know how to make this better, or I can rework the
> series as long as it hasn't land upstream.
>

It's always 10x easier to notice an imperfection, than it is to improve on
it. :)

thanks,
On Tue, Dec 06, 2022 at 06:38:54PM -0800, John Hubbard wrote: > On 12/6/22 16:07, Peter Xu wrote: > > I thought I answered this one at [1] above. If not, I can extend the > > answer. > > [1] explains it, but it doesn't mention why it's safe to drop and reacquire. > > ... > > > > If we touch it, it's a potential bug as you mentioned. But we didn't. > > > > Hope it explains. > > I think it's OK after all, because hmm_vma_fault() does revalidate after > it takes the vma lock, so that closes the loop that I was fretting over. > > I was just also worried that I'd missed some other place, but it looks > like that's not the case. > > So, good. > > How about this incremental diff on top, as an attempt to clarify what's > going on? Or is this too much wordage? Sometimes I write too many words: Nop, that all looks good, thanks. I'll apply them in my new post. > > > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h > index 1f7c2011f6cb..27a6df448ee5 100644 > --- a/include/linux/pagewalk.h > +++ b/include/linux/pagewalk.h > @@ -21,13 +21,16 @@ struct mm_walk; > * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. > * Any folded depths (where PTRS_PER_P?D is equal to 1) > * are skipped. > - * @hugetlb_entry: if set, called for each hugetlb entry. Note that > - * currently the hook function is protected by hugetlb > - * vma lock to make sure pte_t* and the spinlock is valid > - * to access. If the hook function needs to yield the > - * thread or retake the vma lock for some reason, it > - * needs to properly release the vma lock manually, > - * and retake it before the function returns. > + * @hugetlb_entry: if set, called for each hugetlb entry. This hook > + * function is called with the vma lock held, in order to > + * protect against a concurrent freeing of the pte_t* or > + * the ptl. In some cases, the hook function needs to drop > + * and retake the vma lock in order to avoid deadlocks > + * while calling other functions. In such cases the hook > + * function must either refrain from accessing the pte or > + * ptl after dropping the vma lock, or else revalidate > + * those items after re-acquiring the vma lock and before > + * accessing them. > * @test_walk: caller specific callback function to determine whether > * we walk over the current vma or not. Returning 0 means > * "do page table walk over the current vma", returning > diff --git a/mm/hmm.c b/mm/hmm.c > index dcd624f28bcf..b428f2011cfd 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -497,7 +497,13 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, > spin_unlock(ptl); > hugetlb_vma_unlock_read(vma); > - /* hmm_vma_fault() can retake the vma lock */ > + /* > + * Avoid deadlock: drop the vma lock before calling > + * hmm_vma_fault(), which will itself potentially take and drop > + * the vma lock. This is also correct from a protection point of > + * view, because there is no further use here of either pte or > + * ptl after dropping the vma lock. > + */ > ret = hmm_vma_fault(addr, end, required_fault, walk); > hugetlb_vma_lock_read(vma); > return ret; > > > > I guess it's on me to think of something cleaner, so if I do I'll pipe > > > up. :) > > > > That'll be very much appricated. > > > > It's really that I don't know how to make this better, or I can rework the > > series as long as it hasn't land upstream. > > > > It's always 10x easier to notice an imperfection, than it is to improve on > it. :)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..d98564a7be57 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;

+	hugetlb_vma_lock_read(vma);
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 		if (err)
 			break;
 	} while (addr = next, addr != end);
+	hugetlb_vma_unlock_read(vma);

 	return err;
 }
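For pagewalk users, the visible effect of this change is that every .hugetlb_entry callback now runs with the hugetlb vma lock held in read mode. A minimal, hypothetical walker is sketched below; the callback, ops struct, and counter names are invented for illustration, while the pagewalk and hugetlb calls are the existing interfaces:

/* Hypothetical example walker, not part of this series. */
static int count_hugetlb_entry(pte_t *pte, unsigned long hmask,
			       unsigned long addr, unsigned long end,
			       struct mm_walk *walk)
{
	unsigned long *count = walk->private;

	/*
	 * With this patch applied, the hugetlb vma lock is held (read mode)
	 * here, so *pte and its page table page cannot be unshared or freed
	 * underneath us.  A callback that needs to sleep or fault must drop
	 * hugetlb_vma_lock_read(walk->vma) first and retake it before
	 * returning, as done for hmm in the fixup discussed above.
	 */
	if (!huge_pte_none(huge_ptep_get(pte)))
		(*count)++;
	return 0;
}

static const struct mm_walk_ops count_hugetlb_ops = {
	.hugetlb_entry = count_hugetlb_entry,
};

/* Usage (the caller of walk_page_range() must already hold mmap_lock):
 *	unsigned long nr = 0;
 *	walk_page_range(mm, vma->vm_start, vma->vm_end, &count_hugetlb_ops, &nr);
 */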