Message ID: <35e983f5-7ed3-b310-d949-9ae8b130cdab@google.com>
From: Hugh Dickins <hughd@google.com>
Date: Sun, 28 May 2023 23:11:07 -0700 (PDT)
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>, Mike Rapoport <rppt@kernel.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Matthew Wilcox <willy@infradead.org>, David Hildenbrand <david@redhat.com>, Suren Baghdasaryan <surenb@google.com>, Qi Zheng <zhengqi.arch@bytedance.com>, Yang Shi <shy828301@gmail.com>, Mel Gorman <mgorman@techsingularity.net>, Peter Xu <peterx@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Will Deacon <will@kernel.org>, Yu Zhao <yuzhao@google.com>, Alistair Popple <apopple@nvidia.com>, Ralph Campbell <rcampbell@nvidia.com>, Ira Weiny <ira.weiny@intel.com>, Steven Price <steven.price@arm.com>, SeongJae Park <sj@kernel.org>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Christophe Leroy <christophe.leroy@csgroup.eu>, Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>, Axel Rasmussen <axelrasmussen@google.com>, Anshuman Khandual <anshuman.khandual@arm.com>, Pasha Tatashin <pasha.tatashin@soleen.com>, Miaohe Lin <linmiaohe@huawei.com>, Minchan Kim <minchan@kernel.org>, Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>, Thomas Hellstrom <thomas.hellstrom@linux.intel.com>, Russell King <linux@armlinux.org.uk>, "David S. Miller" <davem@davemloft.net>, Michael Ellerman <mpe@ellerman.id.au>, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, Heiko Carstens <hca@linux.ibm.com>, Christian Borntraeger <borntraeger@linux.ibm.com>, Claudio Imbrenda <imbrenda@linux.ibm.com>, Alexander Gordeev <agordeev@linux.ibm.com>, Jann Horn <jannh@google.com>, linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 00/12] mm: free retracted page table by RCU
Series: mm: free retracted page table by RCU
Message
Hugh Dickins
May 29, 2023, 6:11 a.m. UTC
Here is the third series of patches to mm (and a few architectures), based on v6.4-rc3 with the preceding two series applied: in which khugepaged takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.

This follows on from the "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53bc83@google.com/
series of 23 posted on 2023-05-09, and the "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b370@google.com/
series of 31 posted on 2023-05-21. Those two series were "independent": neither depending for build or correctness on the other, but both series needed before this third one can safely make the effective changes. I'll send v2 of those two series in a couple of days, incorporating Acks and Revieweds and the minor fixes.

What is it all about? Some mmap_lock avoidance i.e. latency reduction. Initially just for the case of collapsing shmem or file pages to THPs: the usefulness of MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it currently requires. Likely to be relied upon later in other contexts e.g. freeing of empty page tables (but that's not work I'm doing). mmap_write_lock avoidance when collapsing to anon THPs? Perhaps, but again that's not work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in Google's data centre kernel for three years now: we do rely upon them.

Based on the preceding two series over v6.4-rc3, but good over v6.4-rc[1-4], current mm-everything or current linux-next.
01/12 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
02/12 mm/pgtable: add PAE safety to __pte_offset_map()
03/12 arm: adjust_pte() use pte_offset_map_nolock()
04/12 powerpc: assert_pte_locked() use pte_offset_map_nolock()
05/12 powerpc: add pte_free_defer() for pgtables sharing page
06/12 sparc: add pte_free_defer() for pgtables sharing page
07/12 s390: add pte_free_defer(), with use of mmdrop_async()
08/12 mm/pgtable: add pte_free_defer() for pgtable as page
09/12 mm/khugepaged: retract_page_tables() without mmap or vma lock
10/12 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
11/12 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
12/12 mm: delete mmap_write_trylock() and vma_try_start_write()

 arch/arm/mm/fault-armv.c            |   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c      |  18 ++
 arch/powerpc/mm/pgtable.c           |  16 +-
 arch/s390/include/asm/pgalloc.h     |   4 +
 arch/s390/mm/pgalloc.c              |  34 +++
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c             |  16 ++
 include/linux/mm.h                  |  17 --
 include/linux/mm_types.h            |   2 +-
 include/linux/mmap_lock.h           |  10 -
 include/linux/pgtable.h             |   6 +-
 include/linux/sched/mm.h            |   1 +
 kernel/fork.c                       |   2 +-
 mm/khugepaged.c                     | 425 ++++++----------------
 mm/pgtable-generic.c                |  44 +++-
 16 files changed, 253 insertions(+), 353 deletions(-)

Hugh
Comments
On Mon, May 29, 2023 at 8:11 AM Hugh Dickins <hughd@google.com> wrote:
> Here is the third series of patches to mm (and a few architectures), based
> on v6.4-rc3 with the preceding two series applied: in which khugepaged
> takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.

To clarify: Part of the design here is that when you look up a user page table with pte_offset_map_nolock() or pte_offset_map() without holding mmap_lock in write mode, and you later lock the page table yourself, you don't know whether you actually have the real page table or a detached table that is currently in its RCU grace period, right?

And detached tables are supposed to consist of only zeroed entries, and we assume that no relevant codepath will do anything bad if one of these functions spuriously returns a pointer to a page table full of zeroed entries?

So in particular, in handle_pte_fault() we can reach the "if (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a detached zeroed page table, but we're okay with that because in that case we know that !pte_none(vmf->orig_pte) && pte_none(*vmf->pte), which implies !pte_same(*vmf->pte, entry), which means we'll bail out?

If that's the intent, it might be good to add some comments, because at least to me that's not very obvious.
On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:11 AM Hugh Dickins <hughd@google.com> wrote:
> > Here is the third series of patches to mm (and a few architectures), based
> > on v6.4-rc3 with the preceding two series applied: in which khugepaged
> > takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
>
> To clarify: Part of the design here is that when you look up a user
> page table with pte_offset_map_nolock() or pte_offset_map() without
> holding mmap_lock in write mode, and you later lock the page table
> yourself, you don't know whether you actually have the real page table
> or a detached table that is currently in its RCU grace period, right?

Right. (And I'd rather not assume anything of mmap_lock, but there are one or two or three places that may still do so.)

> And detached tables are supposed to consist of only zeroed entries,
> and we assume that no relevant codepath will do anything bad if one of
> these functions spuriously returns a pointer to a page table full of
> zeroed entries?

(Nit that I expect you're well aware of: IIRC "zeroed" isn't 0 on s390.)

If someone is using pte_offset_map() without lock, they must be prepared to accept page-table-like changes. The limits of pte_offset_map_nolock() with later spin_lock(ptl): I'm still exploring: there's certainly an argument that one ought to do a pmd_same() check before proceeding, but I don't think anywhere needs that at present.

Whether the page table has to be full of zeroed entries when detached: I believe it is always like that at present (by the end of the series, when the collapse_pte_offset_map() oddity is fixed), but whether it needs to be so I'm not sure. Quite likely it will need to be; but I'm open to the possibility that all it needs is to be still a page table, with perhaps new entries from a new usage in it.

The most obvious vital thing (in the split ptlock case) is that it remains a struct page with a usable ptl spinlock embedded in it.

The question becomes more urgent when/if extending to replacing the pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc wants to deposit the page table for later use even in the shmem/file case (and all arches in the anon case): I did work out the details once before, but I'm not sure whether I would still agree with myself; and was glad to leave replacement out of this series, to revisit some time later.

> So in particular, in handle_pte_fault() we can reach the "if
> (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> detached zeroed page table, but we're okay with that because in that
> case we know that !pte_none(vmf->orig_pte) && pte_none(*vmf->pte),
> which implies !pte_same(*vmf->pte, entry), which means we'll bail
> out?

There is no current (even at end of series) circumstance in which we could be pointing to a detached page table there; but yes, I want to allow for that, and yes I agree with your analysis. But with the interesting unanswered question for the future, of what if the same value could be found there: would that imply it's safe to proceed, or would some further prevention be needed?

> If that's the intent, it might be good to add some comments, because
> at least to me that's not very obvious.

That's a very fair request; but I shall have difficulty deciding where to place such comments. I shall have to try, then you redirect me.

And I think we approach this in opposite ways: my nature is to put some infrastructure in place, and then look at it to see what we can get away with; whereas your nature is to define upfront what the possibilities are. We can expect some tussles!

Thanks,
Hugh
On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins <hughd@google.com> wrote:
> On Wed, 31 May 2023, Jann Horn wrote:
> > And detached tables are supposed to consist of only zeroed entries,
> > and we assume that no relevant codepath will do anything bad if one of
> > these functions spuriously returns a pointer to a page table full of
> > zeroed entries?
>
> (Nit that I expect you're well aware of: IIRC "zeroed" isn't 0 on s390.)

I was not aware, thanks. I only knew that on Intel's Knights Landing CPUs, the A/D bits are ignored by pte_none() due to some erratum.

> If someone is using pte_offset_map() without lock, they must be prepared
> to accept page-table-like changes. The limits of pte_offset_map_nolock()
> with later spin_lock(ptl): I'm still exploring: there's certainly an
> argument that one ought to do a pmd_same() check before proceeding,
> but I don't think anywhere needs that at present.
>
> Whether the page table has to be full of zeroed entries when detached:
> I believe it is always like that at present (by the end of the series,
> when the collapse_pte_offset_map() oddity is fixed), but whether it needs
> to be so I'm not sure. Quite likely it will need to be; but I'm open to
> the possibility that all it needs is to be still a page table, with
> perhaps new entries from a new usage in it.

My understanding is that at least handle_pte_fault(), the way it is currently written, would do bad things in that case:

// assume we enter with mmap_lock in read mode,
// for a write fault on a shared writable VMA without a page_mkwrite handler
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
        pte_t entry;

        if (unlikely(pmd_none(*vmf->pmd))) {
                [ not executed ]
        } else {
                /*
                 * A regular pmd is established and it can't morph into a huge
                 * pmd by anon khugepaged, since that takes mmap_lock in write
                 * mode; but shmem or file collapse to THP could still morph
                 * it into a huge pmd: just retry later if so.
                 */
                vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
                                                 vmf->address, &vmf->ptl);
                if (unlikely(!vmf->pte))
                        [not executed]

                [assume that at this point, a concurrent THP collapse operation
                 removes the page table, and the page table has now been reused
                 and contains a read-only PTE]

                // this reads page table contents protected solely by RCU
                vmf->orig_pte = ptep_get_lockless(vmf->pte);
                vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

                if (pte_none(vmf->orig_pte)) {
                        pte_unmap(vmf->pte);
                        vmf->pte = NULL;
                }
        }

        if (!vmf->pte)
                [not executed]

        if (!pte_present(vmf->orig_pte))
                [not executed]

        if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
                [not executed]

        spin_lock(vmf->ptl);
        entry = vmf->orig_pte;
        if (unlikely(!pte_same(*vmf->pte, entry))) {
                [not executed]
        }
        if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
                if (!pte_write(entry))
                        // This will go into wp_page_shared(),
                        // which will call wp_page_reuse(),
                        // which will upgrade the page to writable
                        return do_wp_page(vmf);
                [ not executed ]
        }
        [ not executed ]
}

That looks like we could end up racing such that a read-only PTE in the reused page table gets upgraded to writable, which would probably be a security bug.
But I guess if you added bailout checks to every page table lock operation, it'd be a different story, maybe?

> The most obvious vital thing (in the split ptlock case) is that it
> remains a struct page with a usable ptl spinlock embedded in it.
>
> The question becomes more urgent when/if extending to replacing the
> pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> wants to deposit the page table for later use even in the shmem/file
> case (and all arches in the anon case): I did work out the details once
> before, but I'm not sure whether I would still agree with myself; and was
> glad to leave replacement out of this series, to revisit some time later.
>
> > So in particular, in handle_pte_fault() we can reach the "if
> > (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> > detached zeroed page table, but we're okay with that because in that
> > case we know that !pte_none(vmf->orig_pte) && pte_none(*vmf->pte),
> > which implies !pte_same(*vmf->pte, entry), which means we'll bail
> > out?
>
> There is no current (even at end of series) circumstance in which we
> could be pointing to a detached page table there; but yes, I want to
> allow for that, and yes I agree with your analysis.

Hmm, what am I missing here?

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
        pte_t entry;

        if (unlikely(pmd_none(*vmf->pmd))) {
                [not executed]
        } else {
                /*
                 * A regular pmd is established and it can't morph into a huge
                 * pmd by anon khugepaged, since that takes mmap_lock in write
                 * mode; but shmem or file collapse to THP could still morph
                 * it into a huge pmd: just retry later if so.
                 */
                vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
                                                 vmf->address, &vmf->ptl);
                if (unlikely(!vmf->pte))
                        [not executed]
                // this reads a present readonly PTE
                vmf->orig_pte = ptep_get_lockless(vmf->pte);
                vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

                if (pte_none(vmf->orig_pte)) {
                        [not executed]
                }
        }

        [at this point, a concurrent THP collapse operation detaches the page table]
        // vmf->pte now points into a detached page table

        if (!vmf->pte)
                [not executed]

        if (!pte_present(vmf->orig_pte))
                [not executed]

        if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
                [not executed]

        spin_lock(vmf->ptl);
        entry = vmf->orig_pte;
        // vmf->pte still points into a detached page table
        if (unlikely(!pte_same(*vmf->pte, entry))) {
                update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
                goto unlock;
        }
        [...]
}

> But with the
> interesting unanswered question for the future, of what if the same
> value could be found there: would that imply it's safe to proceed,
> or would some further prevention be needed?

That would then hand the pointer to the detached page table to functions like do_wp_page(), which I think will do bad things (see above) if they are called on either a page table that has been reused in a different VMA with different protection flags (which could, for example, lead to pages becoming writable that should not be writable or such things) or on a page table that is no longer in use (because it would assume that PTEs referencing pages are refcounted when they actually aren't).

> > If that's the intent, it might be good to add some comments, because
> > at least to me that's not very obvious.
>
> That's a very fair request; but I shall have difficulty deciding where
> to place such comments. I shall have to try, then you redirect me.
>
> And I think we approach this in opposite ways: my nature is to put some
> infrastructure in place, and then look at it to see what we can get away
> with; whereas your nature is to define upfront what the possibilities are.
> We can expect some tussles!

Yeah. :P One of my strongly-held beliefs is that it's important when making changes to code to continuously ask oneself "If I had to explain the rules by which this code operates - who has to take which locks, who holds references to what and so on -, how complicated would those rules be?", and if that turns into a series of exception cases, that probably means there will be bugs, because someone will probably lose track of one of those exceptions.

So I would prefer it if we could have some rule like "whenever you lock an L1 page table, you must immediately recheck whether the page table is still referenced by the L2 page table, unless you know that you have a stable page reference for whatever reason", and then any code that operates on a locked page table doesn't have to worry about whether the page table might be detached.
On Fri, 2 Jun 2023, Jann Horn wrote:
> On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins <hughd@google.com> wrote:
> > The most obvious vital thing (in the split ptlock case) is that it
> > remains a struct page with a usable ptl spinlock embedded in it.
> >
> > The question becomes more urgent when/if extending to replacing the
> > pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> > wants to deposit the page table for later use even in the shmem/file
> > case (and all arches in the anon case): I did work out the details once
> > before, but I'm not sure whether I would still agree with myself; and was
> > glad to leave replacement out of this series, to revisit some time later.
> >
> > > So in particular, in handle_pte_fault() we can reach the "if
> > > (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> > > detached zeroed page table, but we're okay with that because in that
> > > case we know that !pte_none(vmf->orig_pte) && pte_none(*vmf->pte),
> > > which implies !pte_same(*vmf->pte, entry), which means we'll bail
> > > out?
> >
> > There is no current (even at end of series) circumstance in which we
> > could be pointing to a detached page table there; but yes, I want to
> > allow for that, and yes I agree with your analysis.
>
> Hmm, what am I missing here?

I spent quite a while trying to reconstruct what I had been thinking, what meaning of "detached" or "there" I had in mind when I asserted so confidently "There is no current (even at end of series) circumstance in which we could be pointing to a detached page table there". But had to give up and get on with more useful work.

Of course you are right, and that is what this series is about.
Hugh

> static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> {
>         pte_t entry;
>
>         if (unlikely(pmd_none(*vmf->pmd))) {
>                 [not executed]
>         } else {
>                 /*
>                  * A regular pmd is established and it can't morph into a huge
>                  * pmd by anon khugepaged, since that takes mmap_lock in write
>                  * mode; but shmem or file collapse to THP could still morph
>                  * it into a huge pmd: just retry later if so.
>                  */
>                 vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>                                                  vmf->address, &vmf->ptl);
>                 if (unlikely(!vmf->pte))
>                         [not executed]
>                 // this reads a present readonly PTE
>                 vmf->orig_pte = ptep_get_lockless(vmf->pte);
>                 vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>
>                 if (pte_none(vmf->orig_pte)) {
>                         [not executed]
>                 }
>         }
>
> [at this point, a concurrent THP collapse operation detaches the page table]
> // vmf->pte now points into a detached page table
>
>         if (!vmf->pte)
>                 [not executed]
>
>         if (!pte_present(vmf->orig_pte))
>                 [not executed]
>
>         if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>                 [not executed]
>
>         spin_lock(vmf->ptl);
>         entry = vmf->orig_pte;
>         // vmf->pte still points into a detached page table
>         if (unlikely(!pte_same(*vmf->pte, entry))) {
>                 update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>                 goto unlock;
>         }
>         [...]
> }