Message ID | 20221022114424.515572025@infradead.org |
---|---|
State | New |
Headers |
From: Peter Zijlstra <peterz@infradead.org>
Date: Sat, 22 Oct 2022 13:14:04 +0200
To: x86@kernel.org, willy@infradead.org, torvalds@linux-foundation.org, akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, linux-mm@kvack.org, aarcange@redhat.com, kirill.shutemov@linux.intel.com, jroedel@suse.de, ubizjak@gmail.com
Subject: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
Message-ID: <20221022114424.515572025@infradead.org> |
Series | Clean up pmd_get_atomic() and i386-PAE |
Commit Message
Peter Zijlstra
Oct. 22, 2022, 11:14 a.m. UTC
Improve the comment.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/pgtable.h | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
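For context, the helper whose comment this patch rewrites is the split-load read used on CONFIG_GUP_GET_PTE_LOW_HIGH architectures. A simplified sketch (paraphrased from include/linux/pgtable.h; this is not the patch's diff):

	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		pte_t pte;

		do {
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
			/* Re-read low; if it changed, the entry was updated under us. */
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}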
Comments
On 10/22/22 04:14, Peter Zijlstra wrote:
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
>
>  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
>  /*
> - * WARNING: only to be used in the get_user_pages_fast() implementation.
> - *
> - * With get_user_pages_fast(), we walk down the pagetables without taking any
> - * locks. For this we would like to load the pointers atomically, but sometimes
> - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
> - * we do have is the guarantee that a PTE will only either go from not present
> - * to present, or present to not present or both -- it will not switch to a
> - * completely different present page without a TLB flush in between; something
> - * that we are blocking by holding interrupts off.
> + * For walking the pagetables without holding any locks. Some architectures
> + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> + * instructions. We are guaranteed that a PTE will only either go from not
> + * present to present, or present to not present -- it will not switch to a
> + * completely different present page without a TLB flush inbetween; which we
> + * are blocking by holding interrupts off.

This is getting interesting. My latest understanding of this story is
that both the "before" and "after" versions of that comment are
incorrect! Because, as Jann Horn noticed recently [1], there might not
be any IPIs involved in a TLB flush, if x86 is running under a
hypervisor, and that breaks the chain of reasoning here.

[1] https://lore.kernel.org/all/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/

thanks,
On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> On 10/22/22 04:14, Peter Zijlstra wrote:
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> >
> >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> >  /*
> > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > - *
> > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > - * locks. For this we would like to load the pointers atomically, but sometimes
> > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
> > - * we do have is the guarantee that a PTE will only either go from not present
> > - * to present, or present to not present or both -- it will not switch to a
> > - * completely different present page without a TLB flush in between; something
> > - * that we are blocking by holding interrupts off.
> > + * For walking the pagetables without holding any locks. Some architectures
> > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > + * instructions. We are guaranteed that a PTE will only either go from not
> > + * present to present, or present to not present -- it will not switch to a
> > + * completely different present page without a TLB flush inbetween; which we
> > + * are blocking by holding interrupts off.
>
> This is getting interesting. My latest understanding of this story is
> that both the "before" and "after" versions of that comment are
> incorrect! Because, as Jann Horn noticed recently [1], there might not
> be any IPIs involved in a TLB flush, if x86 is running under a
> hypervisor, and that breaks the chain of reasoning here.

That mail doesn't really include enough detail. The way x86 HV TLB
flushing is supposed to work is by making use of
MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:

	vCPU0					vCPU1

					tlb_gather_mmu(&tlb, mm);

					....

	local_irq_disable();
	... starts page-table walk ...

	<schedules out; sets KVM_VCPU_PREEMPTED>

					tlb_finish_mmu(&tlb)
					  ...
					  kvm_flush_tlb_multi()
					    if (state & KVM_VCPU_PREEMPTED)
					      if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
					        __cpumask_clear_cpu(cpu, flushmask);

					  tlb_remove_table_sync_one() / call_rcu()

	<schedules back in>

	... continues page-table walk ...
	local_irq_enable();

If mmu gather is forced into tlb_remove_table_sync_one() (by memory
pressure), then you've got your IPI back, otherwise it does call_rcu()
and RCU itself will need vCPU0 to enable IRQs in order to make progress.

Either way around, the actual freeing of the pages is delayed until the
page-table walk is finished.

What am I missing?
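For readers without mm/mmu_gather.c open, the mechanism Peter refers to looks roughly like this (a simplified paraphrase of the MMU_GATHER_RCU_TABLE_FREE path, not the exact code):

	static void tlb_remove_table(struct mmu_gather *tlb, void *table)
	{
		struct mmu_table_batch **batch = &tlb->batch;

		if (*batch == NULL) {
			*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
			if (*batch == NULL) {
				/* Memory pressure: fall back to a synchronous IPI broadcast. */
				tlb_remove_table_sync_one();
				__tlb_remove_table(table);
				return;
			}
			(*batch)->nr = 0;
		}

		(*batch)->tables[(*batch)->nr++] = table;
		if ((*batch)->nr == MAX_TABLE_BATCH)
			tlb_table_flush(tlb);	/* frees the batched tables via call_rcu() */
	}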
On Mon, Oct 24, 2022 at 10:01 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote: > > On 10/22/22 04:14, Peter Zijlstra wrote: > > > --- a/include/linux/pgtable.h > > > +++ b/include/linux/pgtable.h > > > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep > > > > > > #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH > > > /* > > > - * WARNING: only to be used in the get_user_pages_fast() implementation. > > > - * > > > - * With get_user_pages_fast(), we walk down the pagetables without taking any > > > - * locks. For this we would like to load the pointers atomically, but sometimes > > > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What > > > - * we do have is the guarantee that a PTE will only either go from not present > > > - * to present, or present to not present or both -- it will not switch to a > > > - * completely different present page without a TLB flush in between; something > > > - * that we are blocking by holding interrupts off. > > > + * For walking the pagetables without holding any locks. Some architectures > > > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive > > > + * instructions. We are guaranteed that a PTE will only either go from not > > > + * present to present, or present to not present -- it will not switch to a > > > + * completely different present page without a TLB flush inbetween; which we > > > + * are blocking by holding interrupts off. > > > > > > This is getting interesting. My latest understanding of this story is > > that both the "before" and "after" versions of that comment are > > incorrect! Because, as Jann Horn noticed recently [1], there might not > > be any IPIs involved in a TLB flush, if x86 is running under a > > hypervisor, and that breaks the chain of reasoning here. > > That mail doesn't really include enough detail. The way x86 HV TLB > flushing is supposed to work is by making use of > MMU_GATHER_RCU_TABLE_FREE. Specifically, something like: > > > vCPU0 vCPU1 > > tlb_gather_mmut(&tlb, mm); > > .... > > local_irq_disable(); > ... starts page-table walk ... > > <schedules out; sets KVM_VCPU_PREEMPTED> > > tlb_finish_mmu(&tlb) > ... > kvm_flush_tlb_multi() > if (state & KVM_VCPU_PREEMPTED) > if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB)) > __cpumask_clear_cpu(cpu, flushmask); > > > tlb_remove_table_sync_one() / call_rcu() > > > <schedules back in> > > ... continues page-table walk ... > local_irq_enable(); > > If mmu gather is forced into tlb_remove_talbe_sync_one() (by memory > pressure), then you've got your IPI back, otherwise it does call_rcu() > and RCU itself will need vCPU0 to enable IRQs in order to make progress. > > Either way around, the actual freeing of the pages is delayed until the > page-table walk is finished. > > What am I missing? Unless I'm completely misunderstanding what's going on here, the whole "remove_table" thing only happens when you "remove a table", meaning you free an entire *pagetable*. Just zapping PTEs doesn't trigger that logic.
On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote: > > Unless I'm completely misunderstanding what's going on here, the whole > "remove_table" thing only happens when you "remove a table", meaning > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that > logic. I do have to admit that I'd be happier if this code - and the GUP code that also relies on "interrupts off" behavior - would just use a sequence counter instead. Relying on blocking IPI's is clever, but also clearly very subtle and somewhat dangerous. I think our GUP code is a *lot* more important than some "legacy x86-32 has problems in case you have an incredibly unlikely race that re-populates the page table with a different page that just happens to be exactly the same MOD-4GB", so honestly, I don't think the load-tearing is even worth worrying about - if you have hardware that is good enough at virtualizing things, it's almost certainly already 64-bit, and running 32-bit virtual machines with PAE you really only have yourself to blame. So I can't find it in myself to care about the 32-bit tearing thing, but this discussion makes me worried about Fast GUP. Note that even with proper atomic pte_t pte = ptep_get_lockless(ptep); in gup_pte_range(), and even if the page tables are RCU-free'd, that just means that the 'ptep' access itself is safe. But then you have the whole "the lookup of the page pointer is not atomic" wrt that. And right now that GUP code does rely on the "block IPI" to make it basically valid. I don't think it matters if GUP races with munmap or madvise() or something like that - if you get the old page, that's still a valid page, and the user only has himself to blame. But if we have memory pressure that causes vmscan to push out a page, and it gets replaced with a new page, and GUP gets the old page with no serialization, that sounds like a possible source of data inconsistency. I don't know if this can happen, but the whole "interrupts disabled doesn't actually block IPI's and synchronize with TLB flushes" really sounds like it would affect GUP too. And be much more serious there than on some x86-32 platform that nobody should be using anyway. Linus
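A sequence-counter variant along the lines Linus mentions might look something like the following; mm_pte_seq is an invented field and the whole thing is purely illustrative, not existing kernel code:

	static inline pte_t ptep_get_seq(struct mm_struct *mm, pte_t *ptep)
	{
		unsigned int seq;
		pte_t pte;

		do {
			seq = read_seqcount_begin(&mm->mm_pte_seq);	/* hypothetical field */
			pte = *ptep;	/* may tear on 32-bit PAE */
			/*
			 * Writers would have to bump mm_pte_seq around every PTE
			 * update, so a torn or stale read is caught by the retry.
			 */
		} while (read_seqcount_retry(&mm->mm_pte_seq, seq));

		return pte;
	}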
On Mon, Oct 24, 2022 at 10:19 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote: > > > > Unless I'm completely misunderstanding what's going on here, the whole > > "remove_table" thing only happens when you "remove a table", meaning > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that > > logic. > > I do have to admit that I'd be happier if this code - and the GUP code > that also relies on "interrupts off" behavior - would just use a > sequence counter instead. > > Relying on blocking IPI's is clever, but also clearly very subtle and > somewhat dangerous. > > I think our GUP code is a *lot* more important than some "legacy > x86-32 has problems in case you have an incredibly unlikely race that > re-populates the page table with a different page that just happens to > be exactly the same MOD-4GB", so honestly, I don't think the > load-tearing is even worth worrying about - if you have hardware that > is good enough at virtualizing things, it's almost certainly already > 64-bit, and running 32-bit virtual machines with PAE you really only > have yourself to blame. > > So I can't find it in myself to care about the 32-bit tearing thing, > but this discussion makes me worried about Fast GUP. > > Note that even with proper atomic > > pte_t pte = ptep_get_lockless(ptep); > > in gup_pte_range(), and even if the page tables are RCU-free'd, that > just means that the 'ptep' access itself is safe. > > But then you have the whole "the lookup of the page pointer is not > atomic" wrt that. And right now that GUP code does rely on the "block > IPI" to make it basically valid. > > I don't think it matters if GUP races with munmap or madvise() or > something like that - if you get the old page, that's still a valid > page, and the user only has himself to blame. > > But if we have memory pressure that causes vmscan to push out a page, > and it gets replaced with a new page, and GUP gets the old page with > no serialization, that sounds like a possible source of data > inconsistency. > > I don't know if this can happen, but the whole "interrupts disabled > doesn't actually block IPI's and synchronize with TLB flushes" really > sounds like it would affect GUP too. And be much more serious there > than on some x86-32 platform that nobody should be using anyway. That's why GUP-fast re-checks the PTE after it has grabbed a reference to the page. Pasting from a longer writeup that I plan to publish soon, describing how gup_pte_range() works: """ This guarantees that the page tables that are being walked aren't freed concurrently, but at the end of the walk, we have to grab a stable reference to the referenced page. For this we use the grab-reference-and-revalidate trick from above again: First we (locklessly) load the page table entry, then we grab a reference to the page that it points to (which can fail if the refcount is zero, in that case we bail), then we recheck that the page table entry is still the same, and if it changed in between, we drop the page reference and bail. This can, again, grab a reference to a page after it has already been freed and reallocated. 
The reason why this is fine is that the metadata structure that holds this refcount, `struct folio` (or `struct page`, depending on which kernel version you're looking at; in current kernels it's `folio` but `struct page` and `struct folio` are actually aliases for the same memory, basically, though that is supposed to maybe change at some point) is never freed; even when a page is freed and reallocated, the corresponding `struct folio` stays. This does have the fun consequence that whenever a page/folio has a non-zero refcount, the refcount can spuriously go up and then back down for a little bit. (Also it's technically not as simple as I just described it, because the `struct page` that the PTE points to might be a "tail page" of a `struct folio`. So actually we first read the PTE, the PTE gives us the `page*`, then from that we go to the `folio*`, then we try to grab a reference to the `folio`, then if that worked we check that the `page` still points to the same `folio`, and then we recheck that the PTE is still the same.) """
On Mon, Oct 24, 2022 at 1:24 PM Jann Horn <jannh@google.com> wrote: > > That's why GUP-fast re-checks the PTE after it has grabbed a reference > to the page. Bah. I should have known that. We got that through the PPC version that never had the whole IPI serialization thing (just the RCU freeing). I haven't looked at that code in much too long. Actually, by "much too long" I really mean "thankfully I haven't needed to" ;) Linus
On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote: > """ > This guarantees that the page tables that are being walked > aren't freed concurrently, but at the end of the walk, we > have to grab a stable reference to the referenced page. > For this we use the grab-reference-and-revalidate trick > from above again: > First we (locklessly) load the page > table entry, then we grab a reference to the page that it > points to (which can fail if the refcount is zero, in that > case we bail), then we recheck that the page table entry > is still the same, and if it changed in between, we drop the > page reference and bail. > This can, again, grab a reference to a page after it has > already been freed and reallocated. The reason why this is > fine is that the metadata structure that holds this refcount, > `struct folio` (or `struct page`, depending on which kernel > version you're looking at; in current kernels it's `folio` > but `struct page` and `struct folio` are actually aliases for > the same memory, basically, though that is supposed to maybe > change at some point) is never freed; even when a page is > freed and reallocated, the corresponding `struct folio` > stays. This does have the fun consequence that whenever a > page/folio has a non-zero refcount, the refcount can > spuriously go up and then back down for a little bit. > (Also it's technically not as simple as I just described it, > because the `struct page` that the PTE points to might be > a "tail page" of a `struct folio`. > So actually we first read the PTE, the PTE gives us the > `page*`, then from that we go to the `folio*`, then we > try to grab a reference to the `folio`, then if that worked > we check that the `page` still points to the same `folio`, > and then we recheck that the PTE is still the same.) > """ Nngh. In trying to make this description fit all kernels (with both pages and folios), you've complicated it maximally. Let's try a more simple explanation: First we (locklessly) load the page table entry, then we grab a reference to the folio that contains it (which can fail if the refcount is zero, in that case we bail), then we recheck that the page table entry is still the same, and if it changed in between, we drop the folio reference and bail. This can, again, grab a reference to a folio after it has already been freed and reallocated. The reason why this is fine is that the metadata structure that holds this refcount, `struct folio` is never freed; even when a folio is freed and reallocated, the corresponding `struct folio` stays. This does have the fun consequence that whenever a folio has a non-zero refcount, the refcount can spuriously go up and then back down for a little bit. (Also it's slightly more complex than I just described, because the PTE that we just loaded might be in the middle of being reallocated into a different folio. So actually we first read the PTE, translate the PTE into the `page*`, then from that we go to the `folio*`, then we try to grab a reference to the `folio`, then if that worked we check that the `page` is still in the same `folio`, and then we recheck that the PTE is still the same. Older kernels did not make a clear distinction between pages and folios, so it was even more confusing.) Better?
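In code, the grab-and-revalidate step described above is roughly the following (paraphrased from gup_pte_range() in mm/gup.c and wrapped in a hypothetical helper for illustration; exact helper names vary between kernel versions):

	static int gup_grab_and_revalidate(pte_t *ptep, unsigned int flags,
					   struct folio **foliop)
	{
		pte_t pte = ptep_get_lockless(ptep);
		struct page *page = pte_page(pte);
		struct folio *folio;

		/* Take a reference; this fails if the folio's refcount is zero. */
		folio = try_grab_folio(page, 1, flags);
		if (!folio)
			return 0;

		/* Re-check the PTE; if it changed under us, drop the ref and bail. */
		if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
			gup_put_folio(folio, 1, flags);
			return 0;
		}

		*foliop = folio;
		return 1;
	}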
Matthew Wilcox <willy@infradead.org> writes: > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote: >> """ >> This guarantees that the page tables that are being walked >> aren't freed concurrently, but at the end of the walk, we >> have to grab a stable reference to the referenced page. >> For this we use the grab-reference-and-revalidate trick >> from above again: >> First we (locklessly) load the page >> table entry, then we grab a reference to the page that it >> points to (which can fail if the refcount is zero, in that >> case we bail), then we recheck that the page table entry >> is still the same, and if it changed in between, we drop the >> page reference and bail. >> This can, again, grab a reference to a page after it has >> already been freed and reallocated. The reason why this is >> fine is that the metadata structure that holds this refcount, >> `struct folio` (or `struct page`, depending on which kernel >> version you're looking at; in current kernels it's `folio` >> but `struct page` and `struct folio` are actually aliases for >> the same memory, basically, though that is supposed to maybe >> change at some point) is never freed; even when a page is >> freed and reallocated, the corresponding `struct folio` >> stays. This does have the fun consequence that whenever a >> page/folio has a non-zero refcount, the refcount can >> spuriously go up and then back down for a little bit. >> (Also it's technically not as simple as I just described it, >> because the `struct page` that the PTE points to might be >> a "tail page" of a `struct folio`. >> So actually we first read the PTE, the PTE gives us the >> `page*`, then from that we go to the `folio*`, then we >> try to grab a reference to the `folio`, then if that worked >> we check that the `page` still points to the same `folio`, >> and then we recheck that the PTE is still the same.) >> """ > > Nngh. In trying to make this description fit all kernels (with > both pages and folios), you've complicated it maximally. Let's > try a more simple explanation: > > First we (locklessly) load the page table entry, then we grab a > reference to the folio that contains it (which can fail if the > refcount is zero, in that case we bail), then we recheck that the > page table entry is still the same, and if it changed in between, > we drop the folio reference and bail. > This can, again, grab a reference to a folio after it has > already been freed and reallocated. The reason why this is > fine is that the metadata structure that holds this refcount, > `struct folio` is never freed; even when a folio is > freed and reallocated, the corresponding `struct folio` > stays. I'm probably missing something obvious but how is that synchronised against memory hotplug? AFAICT if it isn't couldn't the pages be freed and memory removed? In that case the above would no longer hold because (I think) the metadata structure could have been freed. > This does have the fun consequence that whenever a > folio has a non-zero refcount, the refcount can > spuriously go up and then back down for a little bit. > (Also it's slightly more complex than I just described, > because the PTE that we just loaded might be in the middle of > being reallocated into a different folio. 
> So actually we first read the PTE, translate the PTE into the > `page*`, then from that we go to the `folio*`, then we > try to grab a reference to the `folio`, then if that worked > we check that the `page` is still in the same `folio`, > and then we recheck that the PTE is still the same. Older kernels > did not make a clear distinction between pages and folios, so > it was even more confusing.) > > > Better?
On Tue, Oct 25, 2022 at 06:54:10PM +1100, Alistair Popple wrote: > > First we (locklessly) load the page table entry, then we grab a > > reference to the folio that contains it (which can fail if the > > refcount is zero, in that case we bail), then we recheck that the > > page table entry is still the same, and if it changed in between, > > we drop the folio reference and bail. > > This can, again, grab a reference to a folio after it has > > already been freed and reallocated. The reason why this is > > fine is that the metadata structure that holds this refcount, > > `struct folio` is never freed; even when a folio is > > freed and reallocated, the corresponding `struct folio` > > stays. > > I'm probably missing something obvious but how is that synchronised > against memory hotplug? AFAICT if it isn't couldn't the pages be freed > and memory removed? In that case the above would no longer hold because > (I think) the metadata structure could have been freed. Note, this scheme is older than memory hot-plug, so if anybody is to blame it's the memory hotplug code. Anyway, since all that is done with IRQs disabled, all the hotplug stuff needs to do is rcu_synchronize() in order to ensure all active IRQ-disabled regions are finshed (between ensuring the memory is unused and taking out the struct page).
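Sketching what Peter describes for the unplug side (illustrative only; on current kernels synchronize_rcu() also waits out IRQ-disabled sections, which is what GUP-fast runs under):

	mem_hotplug_begin();
	/* ... ensure the pages in the range are genuinely unused ... */
	synchronize_rcu();	/* wait for all IRQs-off lockless walkers to finish */
	/* ... only now tear down the memmap / struct pages ... */
	mem_hotplug_done();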
On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@nvidia.com> wrote: > > > Matthew Wilcox <willy@infradead.org> writes: > > > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote: > >> """ > >> This guarantees that the page tables that are being walked > >> aren't freed concurrently, but at the end of the walk, we > >> have to grab a stable reference to the referenced page. > >> For this we use the grab-reference-and-revalidate trick > >> from above again: > >> First we (locklessly) load the page > >> table entry, then we grab a reference to the page that it > >> points to (which can fail if the refcount is zero, in that > >> case we bail), then we recheck that the page table entry > >> is still the same, and if it changed in between, we drop the > >> page reference and bail. > >> This can, again, grab a reference to a page after it has > >> already been freed and reallocated. The reason why this is > >> fine is that the metadata structure that holds this refcount, > >> `struct folio` (or `struct page`, depending on which kernel > >> version you're looking at; in current kernels it's `folio` > >> but `struct page` and `struct folio` are actually aliases for > >> the same memory, basically, though that is supposed to maybe > >> change at some point) is never freed; even when a page is > >> freed and reallocated, the corresponding `struct folio` > >> stays. This does have the fun consequence that whenever a > >> page/folio has a non-zero refcount, the refcount can > >> spuriously go up and then back down for a little bit. > >> (Also it's technically not as simple as I just described it, > >> because the `struct page` that the PTE points to might be > >> a "tail page" of a `struct folio`. > >> So actually we first read the PTE, the PTE gives us the > >> `page*`, then from that we go to the `folio*`, then we > >> try to grab a reference to the `folio`, then if that worked > >> we check that the `page` still points to the same `folio`, > >> and then we recheck that the PTE is still the same.) > >> """ > > > > Nngh. In trying to make this description fit all kernels (with > > both pages and folios), you've complicated it maximally. Let's > > try a more simple explanation: > > > > First we (locklessly) load the page table entry, then we grab a > > reference to the folio that contains it (which can fail if the > > refcount is zero, in that case we bail), then we recheck that the > > page table entry is still the same, and if it changed in between, > > we drop the folio reference and bail. > > This can, again, grab a reference to a folio after it has > > already been freed and reallocated. The reason why this is > > fine is that the metadata structure that holds this refcount, > > `struct folio` is never freed; even when a folio is > > freed and reallocated, the corresponding `struct folio` > > stays. Oh, thanks. You're right, trying to talk about kernels with folios made it unnecessarily complicated... > I'm probably missing something obvious but how is that synchronised > against memory hotplug? AFAICT if it isn't couldn't the pages be freed > and memory removed? In that case the above would no longer hold because > (I think) the metadata structure could have been freed. Hm... that's this codepath? 
arch_remove_memory -> __remove_pages -> __remove_section -> sparse_remove_section -> section_deactivate -> depopulate_section_memmap -> vmemmap_free -> remove_pagetable which then walks down the page tables and ends up freeing individual pages in remove_pte_table() using the confusingly-named free_pagetable()? I'm not sure what the synchronization against hotplug is - GUP-fast is running with IRQs disabled, but other codepaths might not, like get_ksm_page()? I don't know if that's holding something else for protection...
On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
> Unless I'm completely misunderstanding what's going on here, the whole
> "remove_table" thing only happens when you "remove a table", meaning
> you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> logic.

Aah; yes true. OTOH even if that were not so, I think it would still be
broken because the current code relies on the TLB flush to have
completed, whereas the RCU scheme is effectively async and can be
considered pending until the callback runs.

Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
for i386-pae builds.

Something like so... nobody in his right mind should care about i386-pae
virt performance much.

---
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 95fb85bea111..cbfb84e88251 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -473,6 +473,12 @@ static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 static bool pv_tlb_flush_supported(void)
 {
+	/*
+	 * i386-PAE split loads are incompatible with optimized TLB flushes.
+	 */
+	if (IS_ENABLED(CONFIG_GUP_GET_PTE_LOW_HIGH))
+		return false;
+
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote: > > > Unless I'm completely misunderstanding what's going on here, the whole > > "remove_table" thing only happens when you "remove a table", meaning > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that > > logic. > > Aah; yes true. OTOH even it that were not so, I think it would still be > broken because the current code relies on the TLB flush to have > completed, whereas the RCU scheme is effectively async and can be > considered pending until the callback runs. > > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi() > for i386-pae builds. > > Something like so... nobody in his right mind should care about i386-pae > virt performance much. I think Xen and HyperV have similar codepaths. hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls, xen_flush_tlb_multi() too. On top of that, I think that theoretically, Linux doesn't even ensure that you have a TLB flush in between tearing down one PTE and installing another PTE (see https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/), but I haven't tested that, and if it is true, I'm also not entirely sure if it's correct (in the sense that it only creates incoherent-TLB states when userspace is doing something stupid like racing MADV_DONTNEED and page faults on the same region). I think the more clearly correct fix would be to get rid of the split loads and use CMPXCHG16B instead (probably destroying the performance of GUP-fast completely), but that's complicated because some of the architectures that use the split loads path don't have cmpxchg_double (or at least don't have it wired up).
On Tue, Oct 25, 2022 at 04:18:20PM +0200, Jann Horn wrote:
> On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
> > > Unless I'm completely misunderstanding what's going on here, the whole
> > > "remove_table" thing only happens when you "remove a table", meaning
> > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > > logic.
> >
> > Aah; yes true. OTOH even if that were not so, I think it would still be
> > broken because the current code relies on the TLB flush to have
> > completed, whereas the RCU scheme is effectively async and can be
> > considered pending until the callback runs.
> >
> > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> > for i386-pae builds.
> >
> > Something like so... nobody in his right mind should care about i386-pae
> > virt performance much.
>
> I think Xen and HyperV have similar codepaths.
> hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
> xen_flush_tlb_multi() too.

Sure (not updated).

> On top of that, I think that theoretically, Linux doesn't even ensure
> that you have a TLB flush in between tearing down one PTE and
> installing another PTE (see
> https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
> but I haven't tested that, and if it is true, I'm also not entirely
> sure if it's correct (in the sense that it only creates incoherent-TLB
> states when userspace is doing something stupid like racing
> MADV_DONTNEED and page faults on the same region).
>
> I think the more clearly correct fix would be to get rid of the split
> loads and use CMPXCHG16B instead (probably destroying the performance
> of GUP-fast completely), but that's complicated because some of the
> architectures that use the split loads path don't have cmpxchg_double
> (or at least don't have it wired up).

cmpxchg8b; but no, I think we want to fix MADV_DONTNEED, incoherent TLB
states are a pain nobody needs.

Something like so should force TLB flushes before dropping pte_lock (not
looked at the various pmd level things yet).

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 95fb85bea111..cbfb84e88251 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -473,6 +473,12 @@ static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 static bool pv_tlb_flush_supported(void)
 {
+	/*
+	 * i386-PAE split loads are incompatible with optimized TLB flushes.
+	 */
+	if (IS_ENABLED(CONFIG_GUP_GET_PTE_LOW_HIGH))
+		return false;
+
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..397bc04e2d82 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3474,5 +3474,6 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  * default, the flag is not set.
  */
 #define ZAP_FLAG_DROP_MARKER	((__force zap_flags_t) BIT(0))
+#define ZAP_FLAG_FORCE_FLUSH	((__force zap_flags_t) BIT(1))
 
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..9bb63b3fbee1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				tlb_remove_tlb_entry(tlb, pte, addr);
 				zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 							      ptent);
+
+				if (!force_flush && !tlb->fullmm && details &&
+				    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
+					force_flush = 1;
+
 				if (unlikely(!page))
 					continue;
@@ -1749,6 +1754,9 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	struct maple_tree *mt = &vma->vm_mm->mm_mt;
 	unsigned long end = start + size;
 	struct mmu_notifier_range range;
+	struct zap_details details = {
+		.zap_flags = ZAP_FLAG_FORCE_FLUSH,
+	};
 	struct mmu_gather tlb;
 	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
@@ -1759,7 +1767,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, &details);
 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
@@ -1806,7 +1814,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size)
 {
 	if (!range_in_vma(vma, address, address + size) ||
-	    !(vma->vm_flags & VM_PFNMAP))
+			!(vma->vm_flags & VM_PFNMAP))
 		return;
 
 	zap_page_range_single(vma, address, size, NULL);
Jann Horn <jannh@google.com> writes: > On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@nvidia.com> wrote: >> >> >> Matthew Wilcox <willy@infradead.org> writes: >> >> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote: >> >> """ >> >> This guarantees that the page tables that are being walked >> >> aren't freed concurrently, but at the end of the walk, we >> >> have to grab a stable reference to the referenced page. >> >> For this we use the grab-reference-and-revalidate trick >> >> from above again: >> >> First we (locklessly) load the page >> >> table entry, then we grab a reference to the page that it >> >> points to (which can fail if the refcount is zero, in that >> >> case we bail), then we recheck that the page table entry >> >> is still the same, and if it changed in between, we drop the >> >> page reference and bail. >> >> This can, again, grab a reference to a page after it has >> >> already been freed and reallocated. The reason why this is >> >> fine is that the metadata structure that holds this refcount, >> >> `struct folio` (or `struct page`, depending on which kernel >> >> version you're looking at; in current kernels it's `folio` >> >> but `struct page` and `struct folio` are actually aliases for >> >> the same memory, basically, though that is supposed to maybe >> >> change at some point) is never freed; even when a page is >> >> freed and reallocated, the corresponding `struct folio` >> >> stays. This does have the fun consequence that whenever a >> >> page/folio has a non-zero refcount, the refcount can >> >> spuriously go up and then back down for a little bit. >> >> (Also it's technically not as simple as I just described it, >> >> because the `struct page` that the PTE points to might be >> >> a "tail page" of a `struct folio`. >> >> So actually we first read the PTE, the PTE gives us the >> >> `page*`, then from that we go to the `folio*`, then we >> >> try to grab a reference to the `folio`, then if that worked >> >> we check that the `page` still points to the same `folio`, >> >> and then we recheck that the PTE is still the same.) >> >> """ >> > >> > Nngh. In trying to make this description fit all kernels (with >> > both pages and folios), you've complicated it maximally. Let's >> > try a more simple explanation: >> > >> > First we (locklessly) load the page table entry, then we grab a >> > reference to the folio that contains it (which can fail if the >> > refcount is zero, in that case we bail), then we recheck that the >> > page table entry is still the same, and if it changed in between, >> > we drop the folio reference and bail. >> > This can, again, grab a reference to a folio after it has >> > already been freed and reallocated. The reason why this is >> > fine is that the metadata structure that holds this refcount, >> > `struct folio` is never freed; even when a folio is >> > freed and reallocated, the corresponding `struct folio` >> > stays. > > Oh, thanks. You're right, trying to talk about kernels with folios > made it unnecessarily complicated... > >> I'm probably missing something obvious but how is that synchronised >> against memory hotplug? AFAICT if it isn't couldn't the pages be freed >> and memory removed? In that case the above would no longer hold because >> (I think) the metadata structure could have been freed. > > Hm... that's this codepath? 
> > arch_remove_memory -> __remove_pages -> __remove_section -> > sparse_remove_section -> section_deactivate -> > depopulate_section_memmap -> vmemmap_free -> remove_pagetable > which then walks down the page tables and ends up freeing individual > pages in remove_pte_table() using the confusingly-named > free_pagetable()? Right. section_deactivate() will also clear SECTION_HAS_MEM_MAP which would trigger VM_BUG_ON(!pfn_valid(pte_pfn(pte))) in gup_pte_range(). > I'm not sure what the synchronization against hotplug is - GUP-fast is > running with IRQs disabled, but other codepaths might not, like > get_ksm_page()? I don't know if that's holding something else for protection... I was thinking about this from the ZONE_DEVICE perspective (ie. memunmap_pages -> pageunmap_range -> arch_remove_memory -> ...). That runs with IRQs enabled, and I couldn't see any other synchronization. pageunmap_range() does call mem_hotplug_begin() which takes hotplug locks but GUP-fast doesn't take those locks. So based on Peter's response I think I need to add a rcu_synchronize() call to pageunmap_range() right after calling mem_hotplug_begin(). I could just add it to mem_hotplug_begin() but offline_pages() calls that too early (before pages have been isolated) so will need a separate change.
On Tue, Oct 25, 2022 at 5:06 PM Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Oct 25, 2022 at 04:18:20PM +0200, Jann Horn wrote: > > On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote: > > > > > > > Unless I'm completely misunderstanding what's going on here, the whole > > > > "remove_table" thing only happens when you "remove a table", meaning > > > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that > > > > logic. > > > > > > Aah; yes true. OTOH even it that were not so, I think it would still be > > > broken because the current code relies on the TLB flush to have > > > completed, whereas the RCU scheme is effectively async and can be > > > considered pending until the callback runs. > > > > > > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi() > > > for i386-pae builds. > > > > > > Something like so... nobody in his right mind should care about i386-pae > > > virt performance much. > > > > I think Xen and HyperV have similar codepaths. > > hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls, > > xen_flush_tlb_multi() too. > > Sure (not updated). > > > On top of that, I think that theoretically, Linux doesn't even ensure > > that you have a TLB flush in between tearing down one PTE and > > installing another PTE (see > > https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/), > > but I haven't tested that, and if it is true, I'm also not entirely > > sure if it's correct (in the sense that it only creates incoherent-TLB > > states when userspace is doing something stupid like racing > > MADV_DONTNEED and page faults on the same region). > > > > I think the more clearly correct fix would be to get rid of the split > > loads and use CMPXCHG16B instead (probably destroying the performance > > of GUP-fast completely), but that's complicated because some of the > > architectures that use the split loads path don't have cmpxchg_double > > (or at least don't have it wired up). > > cmpxchg8b; but no, I think we want to fix MADV_DONTNEED, incoherent TLB > states are a pain nobody needs. > > Something like so should force TLB flushes before dropping pte_lock (not > looked at the various pmd level things yet). [...] > #endif /* _LINUX_MM_H */ > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..9bb63b3fbee1 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > tlb_remove_tlb_entry(tlb, pte, addr); > zap_install_uffd_wp_if_needed(vma, addr, pte, details, > ptent); > + > + if (!force_flush && !tlb->fullmm && details && > + details->zap_flags & ZAP_FLAG_FORCE_FLUSH) > + force_flush = 1; > + Hmm... I guess that might work, assuming that there is no other codepath we might race with that first turns the present PTE into a non-present PTE but keeps the flush queued for later. At least codepaths that use the tlb_batched infrastructure are unproblematic...
On Oct 25, 2022, at 6:06 PM, Peter Zijlstra <peterz@infradead.org> wrote: > if (!force_flush && !tlb->fullmm && details && > + details->zap_flags & ZAP_FLAG_FORCE_FLUSH) > + force_flush = 1; Isn’t it too big of a hammer? At the same time, the whole reasoning about TLB flushes is not getting any simpler. We had cases in which MADV_DONTNEED and another concurrent operation that effectively zapped PTEs (e.g., another MADV_DONTNEED) caused the zap_pte_range() to skip entries since pte_none() was true. To resolve these cases we relied on tlb_finish_mmu() to flush the range when needed (i.e., flush the whole range when mm_tlb_flush_nested()). Now, I do not have a specific broken scenario in mind following this change, but it is all sounds to me a bit dangerous and at same time can potentially introduce new overheads. One alternative may be using mm_tlb_flush_pending() when setting a new PTE to check for pending flushes and flushing the TLB if that is the case. This is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this might induce some overheads. As noted before, it is possible to track pending TLB flushes in VMA/page-table granularity, with different tradeoffs of overheads.
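A minimal sketch of the mm_tlb_flush_pending() alternative Nadav mentions; set_pte_checked() is a hypothetical helper, not existing kernel code:

	static inline void set_pte_checked(struct vm_area_struct *vma,
					   unsigned long addr, pte_t *ptep, pte_t pte)
	{
		/*
		 * If a TLB flush was deferred for this mm, complete it before the
		 * new PTE becomes visible -- similar in spirit to ptep_clear_flush().
		 */
		if (mm_tlb_flush_pending(vma->vm_mm))
			flush_tlb_range(vma, addr, addr + PAGE_SIZE);

		set_pte_at(vma->vm_mm, addr, ptep, pte);
	}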
On Wed, Oct 26, 2022 at 06:45:16PM +0200, Jann Horn wrote:
> > #endif /* _LINUX_MM_H */
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f88c351aecd4..9bb63b3fbee1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  				tlb_remove_tlb_entry(tlb, pte, addr);
> >  				zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> >  							      ptent);
> > +
> > +				if (!force_flush && !tlb->fullmm && details &&
> > +				    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> > +					force_flush = 1;
> > +
>
> Hmm... I guess that might work, assuming that there is no other
> codepath we might race with that first turns the present PTE into a
> non-present PTE but keeps the flush queued for later. At least
> codepaths that use the tlb_batched infrastructure are unproblematic...

So I thought the general rule was that if you modify a PTE and have not
unmapped things -- IOW, there's actual concurrency possible on the thing,
then the TLB invalidate needs to happen under pte_lock, since that is
what controls concurrency at the pte level.

As it stands MADV_DONTNEED seems to blatantly violate that general rule.

Then again; I could've missed something and the rules changed?
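The general rule Peter states, as a hedged sketch using standard helpers (illustrative only): the entry is cleared and the TLB invalidated while the PTE lock is still held.

	spinlock_t *ptl;
	pte_t old_pte;
	pte_t *ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);

	/* Clear the entry and flush the TLB before the lock is dropped. */
	old_pte = ptep_clear_flush(vma, addr, ptep);

	pte_unmap_unlock(ptep, ptl);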
On Wed, Oct 26, 2022 at 10:43:21PM +0300, Nadav Amit wrote: > On Oct 25, 2022, at 6:06 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > if (!force_flush && !tlb->fullmm && details && > > + details->zap_flags & ZAP_FLAG_FORCE_FLUSH) > > + force_flush = 1; > > Isn’t it too big of a hammer? It is the obvious hammer :-) TLB invalidate under pte_lock when clearing. > At the same time, the whole reasoning about TLB flushes is not getting any > simpler. We had cases in which MADV_DONTNEED and another concurrent > operation that effectively zapped PTEs (e.g., another MADV_DONTNEED) caused > the zap_pte_range() to skip entries since pte_none() was true. To resolve > these cases we relied on tlb_finish_mmu() to flush the range when needed > (i.e., flush the whole range when mm_tlb_flush_nested()). Yeah, whoever thought that allowing concurrency there was a great idea :/ And I must admit to hating the pending thing with a passion. And that mm_tlb_flush_nested() thing in tlb_finish_mmu() is a giant hack at the best of times. Also; I feel it's part of the problem here; it violates the basic rules we've had for a very long time. > Now, I do not have a specific broken scenario in mind following this change, > but it is all sounds to me a bit dangerous and at same time can potentially > introduce new overheads. I'll take correctness over being fast. As you say, this whole TLB thing is getting out of hand. > One alternative may be using mm_tlb_flush_pending() when setting a new PTE > to check for pending flushes and flushing the TLB if that is the case. This > is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this > might induce some overheads. As noted before, it is possible to track > pending TLB flushes in VMA/page-table granularity, with different tradeoffs > of overheads. Right; I just don't believe in VMAs for this, they're *waaay* to big.
On Oct 27, 2022, at 12:27 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> One alternative may be using mm_tlb_flush_pending() when setting a new PTE >> to check for pending flushes and flushing the TLB if that is the case. This >> is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this >> might induce some overheads. As noted before, it is possible to track >> pending TLB flushes in VMA/page-table granularity, with different tradeoffs >> of overheads. > > Right; I just don't believe in VMAs for this, they're *waaay* to big. Well, I did it for VMA in an RFC only because I was pushed. I thought and do think that page-table granularity is the right one.
On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra <peterz@infradead.org> wrote: > > So I thought the general rule was that if you modify a PTE and have not > unmapped things -- IOW, there's actual concurrency possible on the > thing, then the TLB invalidate needs to happen under pte_lock, since > that is what controls concurrency at the pte level. Yeah, I think we should really think about the TLB issue and make the rules very clear. A lot of our thinking behind "fix TLB races" have been about "use-after-free wrt TLB on another CPU", and the design in zap_page_ranges() is almost entirely about that: making sure that the TLB flush happens before the underlying pages are free'd. I say "almost entirely", because you can see the effects of the *other* kind of race in if (!PageAnon(page)) { if (pte_dirty(ptent)) { force_flush = 1; set_page_dirty(page); } where it isn't about the lifetime of the page, but about the state coherency wrt other users. But I think even that code is quite questionable, and we really should write down the rules much more strictly somewhere. For example, wrt that pte_dirty() causing force_flush, I'd love to have some core set of rules that would explain (a) why it is only relevant for shared pages (b) do even shared pages need it for the "fullmm" case when we tear down the whole VM? (c) what are the force_flush interactions wrt holding the mmap lock for writing vs reading? In the above, I think (b)/(c) are related - I suspect "fullmm" is mostly equivalent to "mmap held for writing", in that it guarantees that there should be no concurrent faults or other active sw ops on the table. But "fullmm" is probably even stronger than "mmap write-lock" in that it should also mean "no other CPU can be actively using this" - either for hardware page table walking, or for GUP. Both both have the vmscan code and rmap that can still get to the pages - and the vmscan code clearly only cares about the page table lock. But the vmscan code itself also shouldn't care about some stale TLB value, so ... > As it stands MADV_DONTNEED seems to blatatly violate that general rule. So let's talk about exactly what and why the TLB would need to be flushed before dropping the page table lock. For example, MADV_DONTNEED does this all with just the mmap lock held for reading, so we *unless* we have that 'force_flush', we can (a) have another CPU continue to use the old stale TLB entry for quite a while (b) yet another CPU (that didn't have a TLB entry, or wanted to write to a read-only one ) could take a page fault, and install a *new* PTE entry in the same slot, all at the same time. Now, that's clearly *very* confusing. But being confusing may not mean "wrong" - we're still delaying the free of the old entry, so there's no use-after-free. The biggest problem I can see is that this means that user space memory ordering might be screwed up: the CPU in (a) will see not just an old TLB entry, but simply old *data*, when the CPU in (b) may be writing to that same address with new data. So I think we clearly do *NOT* serialize as much as we could for MADV_DONTNEED, and I think the above is a real semantic result of that. But equally arguably if you do this kind of unserialized MADV_DONTNEED in user space, that's a case of "you asked for the breakage - you get to keep both pieces". So is that ever an actual problem? If the worst issue is that some CPU's may see old ("pre-DONTNEED") data, while other CPU's then see new data while the MADV_DONTNEED is executing, I think maybe *THAT* part is fine. 
But because it's so very confusing, maybe we have other problems in this area, which is why I think it would be really good to have an actual set of real hard documented rules about this all, and exactly when we need to flush TLBs synchronously with the page table lock, and when we do not. Anybody willing to try to write up the rules (and have each rule document *why* it's a rule - not just "by fiat", but an actual "these are the rules and this is *why* they are the rules"). Because right now I think all of our rules are almost entirely just encoded in the code, with a couple of comments, and a few people who just remember why we do what we do. Linus
Just two quick remarks; it's far too late to really think :-)

On Thu, Oct 27, 2022 at 11:13:55AM -0700, Linus Torvalds wrote:

> But "fullmm" is probably even stronger than "mmap write-lock" in that
> it should also mean "no other CPU can be actively using this" - either
> for hardware page table walking, or for GUP.

IIRC fullmm is really: this is the last user and we're taking the whole mm down -- IOW exit().

> For example, MADV_DONTNEED does this all with just the mmap lock held
> for reading, so unless we have that 'force_flush', we can
>
>  (a) have another CPU continue to use the old stale TLB entry for quite a while
>
>  (b) yet another CPU (that didn't have a TLB entry, or wanted to write
> to a read-only one) could take a page fault, and install a *new* PTE
> entry in the same slot, all at the same time.
>
> Now, that's clearly *very* confusing. But being confusing may not mean
> "wrong" - we're still delaying the free of the old entry, so there's
> no use-after-free.

Do we worry about CPU errata where things go sideways if multiple CPUs have inconsistent TLB state?
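As a rough illustration of where that "fullmm" distinction is made (simplified, invented call-site sketches around the real tlb_gather_mmu()/tlb_gather_mmu_fullmm()/tlb_finish_mmu() APIs - not the actual exit_mmap() or unmap_region() code): tlb_gather_mmu_fullmm() is only used when the whole address space is going away.

static void sketch_exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;

	tlb_gather_mmu_fullmm(&tlb, mm);	/* exit(): last user of the mm */
	/* ... unmap_vmas() + free_pgtables() ... */
	tlb_finish_mmu(&tlb);
}

static void sketch_unmap_range(struct mm_struct *mm)
{
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);		/* munmap()/MADV_DONTNEED-style paths */
	/* ... zap only the requested range ... */
	tlb_finish_mmu(&tlb);
}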
On Thu, Oct 27, 2022 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Oct 27, 2022 at 11:13:55AM -0700, Linus Torvalds wrote: > > > But "fullmm" is probably even stronger than "mmap write-lock" in that > > it should also mean "no other CPU can be actively using this" - either > > for hardware page table walking, or for GUP. > > IIRC fullmm is really: this is the last user and we're taking the whole > mm down -- IOW exit(). Yes. But that doesn't mean that it's entirely "just ours" - vmscan can still see the entries due to rmap, I think. So there can still be some concurrency concerns, but it's limited. > Do we worry about CPU errata where things go side-ways if multiple CPUs > have inconsistent TLB state? Yeah, we should definitely worry about those, since I think they have been known to cause machine checks etc, which then crashes the machine because the machine check architecture is broken garbage. "User gets the odd memory ordering they asked for" is different from "user can crash machine because of bad machine check architecture" ;) That said, I don't think this is a real worry here. Because if I recall the errata correctly, they are not about "different TLB contents", but "different cacheability for the same physical page". Because different TLB contents are normal and even expected, I think. Things like kmap_local etc already end up doing some lazy TLB flushing. No? I think it's only "somebody did an UC access to a cacheline I have" that ends up being bad. Note the *WILD HANDWAVING* above - I didn't actually look up the errata. The above is from my dim memories of the issues we had, and I might just be wrong. Linus
On Oct 27, 2022, at 11:13 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Anybody willing to try to write up the rules (and have each rule
> document *why* it's a rule - not just "by fiat", but an actual "these
> are the rules and this is *why* they are the rules").
>
> Because right now I think all of our rules are almost entirely just
> encoded in the code, with a couple of comments, and a few people who
> just remember why we do what we do.

I think it might be easier to come up with new rules instead of phrasing the existing ones. The approach I suggested before [1] is something like:

1. Make x86's TLB-generation mechanism generic. Turn the TLB-generation into a "pending TLB-generation".

2. For each mm, track a "completed TLB-generation", updated whenever an actual flush takes place.

3. When you defer a TLB-flush, while holding the PTL:

   a. Increase the TLB-generation.

   b. Save the updated "table generation" in a new field in the page-table's page-struct.

4. When you are about to rely on a PTE value that is read from a page-table, first check if a TLB flush is needed. The check is performed by comparing the "table generation" with the "completed generation". If the "completed generation" is behind the "table generation", a TLB flush is needed.

[ You rely on the PTE value when you install new PTEs or change them ]

That's about it. I might not have covered some issues with fast-GUP, but in general I think it is a simple scheme. The thing I like about this scheme the most is that it avoids relying on almost all the OS data-structures (e.g., PageAnon()), making it much easier to grasp.

I can revive the patch-set if the overall approach is agreeable.

[1] https://lore.kernel.org/lkml/20210131001132.3368247-1-namit@vmware.com/
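As a very rough sketch of that bookkeeping (all names here - the pending/completed counters and the per-page-table generation field - are invented for illustration; this is not existing kernel code):

struct mm_tlb_gen {
	atomic64_t pending;	/* bumped whenever a flush is deferred */
	atomic64_t completed;	/* advanced once a flush has actually finished */
};

struct pt_page_meta {
	u64 tlb_gen;		/* hypothetical field in the page-table page's struct page */
};

/* Step 3: called with the PTL held, when a TLB flush is deferred. */
static inline void defer_tlb_flush(struct mm_tlb_gen *mg, struct pt_page_meta *pt)
{
	WRITE_ONCE(pt->tlb_gen, atomic64_inc_return(&mg->pending));
}

/* Step 4: may we trust a PTE read from this page table, or is a flush needed first? */
static inline bool pt_needs_flush(struct mm_tlb_gen *mg, struct pt_page_meta *pt)
{
	return READ_ONCE(pt->tlb_gen) > (u64)atomic64_read(&mg->completed);
}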
On Thu, Oct 27, 2022 at 1:15 PM Nadav Amit <nadav.amit@gmail.com> wrote: > > I think it might be easier to come up with new rules instead of phrasing the > existing ones. I'm ok with that, but I think you are missing a very important issue: all the cases where we can short-circuit TLB invalidations *entirely*. You don't mention those at all. Those optimizations are *very* important. Process exit is one of the most performance-critical pieces of code in the kernel on some loads, because a lot of traditional unix loads have a *ton* of small fork/exec/exit sequences, and the whole "do just one TLB flush" was at least historically quite a big deal. So one very big issue here is when zap_page_tables() can end up skipping TLB flushes entirely, because nobody cares. And no, the fix is not to turn it into some "just increment a generation number". We want to avoid *even that* cost for the whole "we don't actually need a TLB flush at all, until we actually free the pages". So there are two levels of tlb flush optimizations (a) avoiding them entirely in the first place (b) the whole "once you have to flush, keep track of lazy modes and TLB generations, and flush ranges" And honestly, I think you ignored (a), and that's where we do exactly those kinds of "this case doesn't need to flush AT ALL" things. So when you say > The thing I like about this scheme > the most is that it avoids relying on almost all the OS data-structures > (e.g., PageAnon()), making it much easier to grasp. I think it's because you've ignored a big part of the whole issue. Linus
On Oct 27, 2022, at 1:31 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So there are two levels of tlb flush optimizations
>
>  (a) avoiding them entirely in the first place
>
>  (b) the whole "once you have to flush, keep track of lazy modes and
> TLB generations, and flush ranges"
>
> And honestly, I think you ignored (a), and that's where we do exactly
> those kinds of "this case doesn't need to flush AT ALL" things.

I did try to avoid TLB flushes by introducing pte_needs_flush() and avoiding flushes based on the architectural PTE changes. There are even more x86 arch-based opportunities to further avoid TLB flushes (and then only flush the TLB if a spurious #PF occurs).

Personally, I still think that making decisions on flushes based on (mostly) only the arch state makes the code more robust against misuse (e.g., see the various confusions between mmu_gather's fullmm and need_flush_all). Having said that, I will follow your feedback that the extra complexity is worth the extra performance.

Anyhow, admittedly, I need to give it more thought. For instance, with respect to the code that you mentioned (in zap_pte_range), after reading it again, it seems strange: how is it OK to defer the TLB flush until after the rmap was removed, even if it is done while the PTL is held? folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page might later be re-dirtied using a stale cached PTE. Oh well.
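To make the kind of check pte_needs_flush() enables concrete, here is a deliberately simplified sketch of the idea (not the actual x86 implementation, and the helper name is changed to make that obvious): decide purely from the architectural PTE change whether a stale TLB entry could matter.

static bool sketch_pte_needs_flush(pte_t oldpte, pte_t newpte)
{
	/* No TLB entry can exist if the old PTE was not present or never accessed. */
	if (!pte_present(oldpte) || !pte_young(oldpte))
		return false;

	/* Mapping a different page: any cached translation is stale. */
	if (pte_pfn(oldpte) != pte_pfn(newpte))
		return true;

	/* Dropping presence or write permission must become visible. */
	if (!pte_present(newpte))
		return true;
	if (pte_write(oldpte) && !pte_write(newpte))
		return true;

	/* Changes that only grant permissions can rely on a spurious-#PF fix-up. */
	return false;
}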
On Oct 27, 2022, at 2:44 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> Anyhow, admittedly, I need to give it more thought. For instance, with
> respect to the code that you mentioned (in zap_pte_range), after reading it
> again, it seems strange: how is it OK to defer the TLB flush until after
> the rmap was removed, even if it is done while the PTL is held?
> folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the
> page might later be re-dirtied using a stale cached PTE. Oh well.

Peter,

So it appears to be a problem - flushing after removing the rmap. I attach a PoC that shows the problem.

The problem is in the following code of zap_pte_range():

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
		…
	}
	page_remove_rmap(page, vma, false);

Once we remove the rmap, rmap_walk() would not acquire the page-table lock anymore. As a result, nothing prevents the kernel from performing writeback and cleaning the page-struct dirty-bit before the TLB is actually flushed. So if there is an additional thread that has the dirty PTE cached in the TLB, it can keep writing to the page and nothing (PTE/page-struct) will keep track that the page has been dirtied.

In other words, writes to the memory-mapped file after munmap()/MADV_DONTNEED started can be lost. This affects both munmap() and MADV_DONTNEED.

One might argue that if you don't keep writing after munmap()/MADV_DONTNEED it's your fault (questionable), but if that's the case why do we bother with force_flush at all?

If we want it to behave correctly - i.e., writes after munmap/MADV_DONTNEED to propagate to the file - we need to collect dirty pages and remove their rmap only after we flush the TLB (and before we release the ptl). mmu_gather would probably need to be changed for this matter.

Thoughts?

---

Some details about the PoC (below):

We have 3 threads that use a 2MB range:

1. Maintains a counter in the first 4KB of the 2MB range and checks that it actually updates the memory.

2. Dirties pages in the 2MB range (just to make the race more likely).

3. Either (i) maps the same mapping at the first 4KB (to test munmap indirectly); or (ii) runs MADV_DONTNEED on the first 4KB.

In addition, a child process runs msync and fdatasync to writeback the first 4KB.

The PoC should be run with a file on a block RAM device. It manages to trigger the issue relatively reliably (within 60 seconds) with munmap() and slightly less reliably with MADV_DONTNEED. I have no idea if it works in a VM, with deep C-states, etc.
-- >8 --

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long*)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and allow to drop caches.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void*) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = malloc(PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usages: %s [filename]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}
	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
		   -1, 0);
	if (ptr == MAP_FAILED)
		handle_error("mmap anon");
	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");
	p = ptr;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	// Run the flushing process in a different process, so msync() would
	// not require mmap_lock.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	sleep(60);
	stop = true;

	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);

	kill(flusher_pid, SIGTERM);
	printf("Finished without an error");
	exit(0);
}
On Fri, Oct 28, 2022 at 4:57 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> The problem is in the following code of zap_pte_range():
>
>	if (!PageAnon(page)) {
>		if (pte_dirty(ptent)) {
>			force_flush = 1;
>			set_page_dirty(page);
>		}
>		…
>	}
>	page_remove_rmap(page, vma, false);
>
> Once we remove the rmap, rmap_walk() would not acquire the page-table lock
> anymore. As a result, nothing prevents the kernel from performing writeback
> and cleaning the page-struct dirty-bit before the TLB is actually flushed.

Hah. The original reason for force_flush there was similar, with a race wrt page_mkclean(), because this code doesn't take the page lock that would normally serialize things, because the page lock is so painful and ends up having nasty nasty interactions with slow IO operations.

So all the page-dirty handling really *wants* to use the page lock, and for the IO side (writeback) that ends up being acceptable and works well, but from that "serialize VM state" angle it's horrendous.

So instead the code intentionally serialized on the rmap data structures which page_mkclean() also walks, and as you point out, that's broken. It's not broken at the point where we do set_page_dirty(), but it *becomes* broken when we drop the rmap, and the problem is exactly that "we still have the dirty bit hidden in the TLB state" issue that you pinpoint.

I think the proper fix (or at least _a_ proper fix) would be to actually carry the dirty bit along to the __tlb_remove_page() point, and actually treat it exactly the same way as the page pointer itself - set the page dirty after the TLB flush, the same way we can free the page after the TLB flush.

We could easily hide said dirty bit in the low bits of the "batch->pages[]" array or something like that. We'd just have to add the 'dirty' argument to __tlb_remove_page_size() and friends.

Hmm?

Your idea of "do the page_remove_rmap() late instead" would also work, but the reason I think just squirrelling away the dirty bit is the "proper" fix is that it would get rid of the whole need for 'force_flush' in this area entirely. So we'd not only fix that race you noticed, we'd actually do so and reduce the number of TLB flushes too.

I don't know. Maybe I'm missing something fundamental, and my idea is just stupid.

Linus
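A minimal sketch of that "hide a bit in the low bits of the pointer" trick (the names here are illustrative only, not necessarily how the actual patch structures it): page pointers are aligned, so bit 0 is free to carry the dirty flag through the batch array until after the flush.

struct encoded_page;	/* opaque: a struct page pointer plus one flag bit */

static inline struct encoded_page *encode_page(struct page *page, bool dirty)
{
	return (struct encoded_page *)((unsigned long)page | dirty);
}

static inline struct page *encoded_page_ptr(struct encoded_page *ep)
{
	return (struct page *)((unsigned long)ep & ~1ul);
}

static inline bool encoded_page_dirty(struct encoded_page *ep)
{
	return (unsigned long)ep & 1;
}

/* After the TLB flush, when processing one batch entry: */
static inline void finish_one_page(struct encoded_page *ep)
{
	struct page *page = encoded_page_ptr(ep);

	if (encoded_page_dirty(ep))
		set_page_dirty(page);
	free_page_and_swap_cache(page);
}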
On Oct 28, 2022, at 5:42 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I think the proper fix (or at least _a_ proper fix) would be to
> actually carry the dirty bit along to the __tlb_remove_page() point,
> and actually treat it exactly the same way as the page pointer itself
> - set the page dirty after the TLB flush, the same way we can free the
> page after the TLB flush.
>
> We could easily hide said dirty bit in the low bits of the
> "batch->pages[]" array or something like that. We'd just have to add
> the 'dirty' argument to __tlb_remove_page_size() and friends.

Thank you for your quick response. I was slow to respond due to jet lag.

Anyhow, I am not sure whether the solution that you propose would work. Please let me know if my understanding makes sense.

Let's assume that we do not call set_page_dirty() before we remove the rmap but only after we invalidate the page [*]. Let's assume that shrink_page_list() is called after the page's rmap is removed and the page is no longer mapped, but before set_page_dirty() was actually called.

In such a case, shrink_page_list() would consider the page clean, and would indeed keep the page (since __remove_mapping() would find an elevated page refcount), which appears to give us a chance to mark the page as dirty later.

However, IIUC, in this case shrink_page_list() might still call filemap_release_folio() and release the buffers, so calling set_page_dirty() afterwards - after the actual TLB invalidation took place - would fail.

> Your idea of "do the page_remove_rmap() late instead" would also work,
> but the reason I think just squirrelling away the dirty bit is the
> "proper" fix is that it would get rid of the whole need for
> 'force_flush' in this area entirely. So we'd not only fix that race
> you noticed, we'd actually do so and reduce the number of TLB flushes
> too.

I'm all for reducing the number of TLB flushes, and your solution does sound better in general. I proposed something that I considered to have the path of least resistance (i.e., the least chance of breaking something). I can do what you proposed, but I am not sure how to deal with the buffers being removed.

One more note: this issue, I think, also affects migrate_vma_collect_pmd(). Alistair recently addressed an issue there, but in my prior feedback to him I missed this issue.

[*] Note that for our scenario this would be pretty much the same if we also called set_page_dirty() before removing the rmap, but the page was cleaned while the TLB invalidation had still not been performed.
On Sat, Oct 29, 2022 at 11:05 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Anyhow, I am not sure whether the solution that you propose would work.
> Please let me know if my understanding makes sense.

Let me include my (UNTESTED! PROBABLY GARBAGE!) patch here just as an "I meant something like this".

Note that it's untested, but I tried to make it intentionally use different names and an opaque type in the 'mmu_gather' data structure so that at least these bit games end up being type-safe and more likely to be correct if it compiles.

And I will say that it does compile for me - but only in my normal x86-64 config. I haven't tested anything else, and I guarantee it won't build on s390, for example, because s390 has its own mmu_gather functions.

> Let's assume that we do not call set_page_dirty() before we remove the rmap
> but only after we invalidate the page [*]. Let's assume that
> shrink_page_list() is called after the page's rmap is removed and the page
> is no longer mapped, but before set_page_dirty() was actually called.
>
> In such a case, shrink_page_list() would consider the page clean, and would
> indeed keep the page (since __remove_mapping() would find an elevated page
> refcount), which appears to give us a chance to mark the page as dirty
> later.

Right. That is no different from any other function (like "write()") having looked up the page.

> However, IIUC, in this case shrink_page_list() might still call
> filemap_release_folio() and release the buffers, so calling set_page_dirty()
> afterwards - after the actual TLB invalidation took place - would fail.

I'm not seeing why.

That would imply that any "look up page, do set_page_dirty()" is broken. They don't have rmap either. And we have a number of them all over (eg think "GUP users" etc).

But hey, you were right about the stale TLB case, so I may just be missing something.

I *think* the important thing is just that we need to mark the page dirty from the page tables _after_ we've flushed the TLB, just to make sure that there can be no subsequent dirtiers that then get lost.

Anyway, I think the best documentation for "this is what I meant" is simply the patch. Does this affect your PoC on your setup?

I tried to run your program on my machine (WITHOUT this patch - I have compiled this patch, but I haven't even booted it - it really could be horribly broken). But it doesn't fail for me even without the patch and I just get "Finished without an error" over and over again - but you said it had to be run on a RAM block device, which I didn't do, so my testing is very suspect.

Again: THIS PATCH IS UNTESTED. I feel like it might actually work, if only because I tried to be so careful with the type system. But that "might actually work" is probably me being wildly optimistic, and it also depends on the whole concept being ok in the first place.

So realistically, think of this patch more as a "document in code what Linus meant with his incoherent ramblings" rather than anything else.

Hmm?

Linus
On Sat, Oct 29, 2022 at 11:36 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Anyway, I think the best documentation for "this is what I meant" is > simply the patch. Does this affect your PoC on your setup? Here's a slightly cleaned up set with preliminary commit messages, and an explanation for why some of the 'struct page' declarations were moved around a bit in case you wondered about that part of the change in the full patch. The end result should be the same, so if you already looked at the previous unified patch, never mind. But this one tries to make for a better patch series. Still not tested in any way, shape, or form. I decided I wanted to send this one before booting into this and possibly blowing up ;^) Linus
On Sat, Oct 29, 2022 at 11:58 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Still not tested in any way, shape, or form. I decided I wanted to > send this one before booting into this and possibly blowing up ;^) Well, it boots, and I see no difference with your PoC code. It didn't fail for me before, it doesn't fail for me with those patches. Again, the "it doesn't fail for me" is probably because I'm running it incorrectly, although for all I know there can also be hardware differences. I'm testing on an older AMD threadripper, and as I'm sure you are very aware, some AMD cores used to have special support for keeping the TLB coherent with the actual page table contents in order to then avoid TLB flushes entirely. Those things ended up being buggy and disabled, but my point is that hardware differences can obviously actively hide this issue by making the TLB contents track page table changes. So even if I were to run it the same way you do, I might not see the failure due to just running it on different hardware with different TLB and timing. Anyway, the patches don't seem to cause any *obvious* problems. That's not to say that they are correct, or that they fix anything, but it's certainly a fairly simple and straightforward patch, and it "feels right" to me. Sadly, reality doesn't always agree with my feelings. Damn. Linus
On Oct 29, 2022, at 12:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Oct 29, 2022 at 11:58 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Still not tested in any way, shape, or form. I decided I wanted to
>> send this one before booting into this and possibly blowing up ;^)
>
> Well, it boots, and I see no difference with your PoC code.
>
> It didn't fail for me before, it doesn't fail for me with those patches.
>
> Again, the "it doesn't fail for me" is probably because I'm running it
> incorrectly, although for all I know there can also be hardware
> differences.

Please give me some time to test it. I presume you ran it with a block RAM device (not tmpfs) and not in a virtual machine (which can have an effect beyond the Intel/AMD implementation differences).

But even if your patches work and the tests pass, I am not sure it means that everything is fine. I did not try to trigger a race with shrink_page_list(), and doing that might be harder than the race I tried to create before. I need to do some tracing to understand what I was missing in my understanding of shrink_page_list() - assuming that I am mistaken about the buffers being potentially released.

I would note that my concern about releasing the buffers is partially driven by issues that were reported before [1]. I am actually not sure how they were resolved.

[1] https://lore.kernel.org/all/20180103100430.GE4911@quack2.suse.cz/
On 10/29/22 11:36, Linus Torvalds wrote:
>> In such a case, shrink_page_list() would consider the page clean, and would
>> indeed keep the page (since __remove_mapping() would find an elevated page
>> refcount), which appears to give us a chance to mark the page as dirty
>> later.
>
> Right. That is no different from any other function (like "write()")
> having looked up the page.
>
>> However, IIUC, in this case shrink_page_list() might still call
>> filemap_release_folio() and release the buffers, so calling set_page_dirty()
>> afterwards - after the actual TLB invalidation took place - would fail.
>
> I'm not seeing why.
>
> That would imply that any "look up page, do set_page_dirty()" is
> broken. They don't have rmap either. And we have a number of them all
> over (eg think "GUP users" etc).

Yes, we do have a bunch of "look up page, do set_page_dirty()" cases. And I think that many (most?) of them are in fact broken! Because: the dirtiness of a page is something that the filesystem believes it is managing, and so filesystem coordination is, in general, required in order to mark a page as dirty.

Jan Kara's 2018 analysis [1] (which launched the pin_user_pages() effort) shows a nice clear example. And since then, I've come to believe that most of the gup/pup call sites have it wrong:

 a) pin_user_pages()

 b) /* access page contents */

 c) set_page_dirty() or set_page_dirty_lock()   // PROBLEM HERE

 d) unpin_user_page()

ext4 has since papered over the problem, by soldiering on if it finds a page without writeback buffers when it expected to be able to writeback a dirty page. But you get the idea. And I think that applies beyond the gup/pup situation.

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u

thanks,
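Spelled out as code, the pattern being criticized looks roughly like this (pin_user_pages(), set_page_dirty_lock() and unpin_user_page() are the real APIs; device_dma_into_page() is a made-up stand-in for whatever writes into the page):

extern void device_dma_into_page(struct page *page);	/* hypothetical */

static int example_dma_write(unsigned long uaddr)
{
	struct page *page;
	long ret;

	/* a) pin the user page (operates on current->mm) */
	ret = pin_user_pages(uaddr, 1, FOLL_WRITE, &page, NULL);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* b) access the page contents, e.g. by DMA */
	device_dma_into_page(page);

	/*
	 * c) mark it dirty - with no filesystem coordination, and possibly
	 *    long after the original mapping and rmap are gone
	 */
	set_page_dirty_lock(page);

	/* d) drop the pin */
	unpin_user_page(page);
	return 0;
}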
On Sat, Oct 29, 2022 at 12:39 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> ext4 has since papered over the problem, by soldiering on if it finds a
> page without writeback buffers when it expected to be able to writeback
> a dirty page. But you get the idea.

I suspect that "soldiering on" is the right thing to do, but yes, our 'mkdirty' vs 'mkclean' thing has always been problematic.

I think we always needed a page lock for it, but PG_lock itself doesn't work (as mentioned earlier) because the VM can't serialize with IO, and needs the lock to basically be a spinlock.

The page table lock kind of took its place, and then the rmap removal makes for problems (since it is what basically ends up being the shared place to look it up).

I can think of three options:

 (a) filesystems just deal with it

 (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too

 (c) we could actually add a spinlock (hashed on the page?) for this

I think (a) is basically our current expectation.

And (b) would be fairly easy - same model as that dirty bit patch, just a 'do page_remove_rmap too' - except page_remove_rmap() wants the vma as well (and we delay the TLB flush over multiple vma's, so it's not just a "save vma in mmu_gather").

Doing (c) doesn't look hard, except for the "new lock" thing, which is always a potential huge disaster. If it's only across set_page_dirty() and page_mkclean(), though, and uses some simple page-based hash, it sounds fairly benign.

Linus
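A bare-bones sketch of what option (c) could look like (the names are invented; only hash_ptr() and the spinlock primitives are real kernel APIs): hash the struct page pointer into a small fixed array of spinlocks, and take the resulting lock around both the dirtying and the cleaning side.

#include <linux/hash.h>
#include <linux/spinlock.h>

#define DIRTY_LOCK_HASH_BITS	8

/* Initialized with spin_lock_init() at boot in a real implementation. */
static spinlock_t dirty_locks[1 << DIRTY_LOCK_HASH_BITS];

static inline spinlock_t *page_dirty_lock(struct page *page)
{
	return &dirty_locks[hash_ptr(page, DIRTY_LOCK_HASH_BITS)];
}

/* Taken around set_page_dirty() - and, symmetrically, around page_mkclean(). */
static void example_mark_dirty(struct page *page)
{
	spinlock_t *lock = page_dirty_lock(page);

	spin_lock(lock);
	set_page_dirty(page);
	spin_unlock(lock);
}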
On Sat, Oct 29, 2022 at 1:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I can think of three options: > > (a) filesystems just deal with it > > (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too > > (c) we could actually add a spinlock (hashed on the page?) for this > > I think (a) is basically our current expectation. Side note: anybody doing gup + set_page_dirty() won't be fixed by b/c anyway, so I think (a) is basically the only thing. And that's true even if you do a page pinning gup, since the source of the gup may be actively unmapped after the gup. So a filesystem that thinks that only write, or a rmap-accessible mmap can turn the page dirty really seems to be fundamentally broken. And I think that has always been the case, it's just that filesystem writers may not have been happy with it, and may not have had test-cases for it. It's not surprising that the filesystem people then try to blame users. Linus
On 10/29/22 13:30, Linus Torvalds wrote: >> I can think of three options: >> >> (a) filesystems just deal with it >> >> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too >> >> (c) we could actually add a spinlock (hashed on the page?) for this >> >> I think (a) is basically our current expectation. > > Side note: anybody doing gup + set_page_dirty() won't be fixed by b/c > anyway, so I think (a) is basically the only thing. > > And that's true even if you do a page pinning gup, since the source of > the gup may be actively unmapped after the gup. I was just now writing a response that favored (c) over (b), precisely because of that, yes. :) > > So a filesystem that thinks that only write, or a rmap-accessible mmap > can turn the page dirty really seems to be fundamentally broken. > > And I think that has always been the case, it's just that filesystem > writers may not have been happy with it, and may not have had > test-cases for it. > > It's not surprising that the filesystem people then try to blame users. > > Linus Yes, lots of unhappy debates about this over the years. However, I remain intrigued by (c), because if we had a "dirty page lock" that is looked up by page (much like looking up the ptl), it seems like a building block that would potentially help solve the whole thing. The above points about "file system needs to coordinate with mm about what's allowed to be dirtied, including gup/dma cases", those are still true and not yet solved, yes. But having a solid point of synchronization for this, definitely looks interesting. Of course, without working through this more thoroughly, it's not fair to impose this constraint on the current discussion, understood. :) thanks,
On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>
> And (b) would be fairly easy - same model as that dirty bit patch,
> just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
> vma as well (and we delay the TLB flush over multiple vma's, so it's
> not just a "save vma in mmu_gather").

(b) sounds reasonable and may potentially allow future performance improvements (batching, doing stuff without locks).

It does appear to break a potential hidden assumption that the rmap is removed while the ptl is acquired (at least in the several instances I sampled). Yet page_vma_mapped_walk() checks the PTE before calling the function, so it should be fine.

I'll give it a try.
On Sat, Oct 29, 2022 at 01:15:26PM -0700, Linus Torvalds wrote:
> On Sat, Oct 29, 2022 at 12:39 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > ext4 has since papered over the problem, by soldiering on if it finds a
> > page without writeback buffers when it expected to be able to writeback
> > a dirty page. But you get the idea.
>
> I suspect that "soldiering on" is the right thing to do, but yes, our
> 'mkdirty' vs 'mkclean' thing has always been problematic.
>
> ...
>
> (a) filesystems just deal with it

It should be noted that "soldiering on" just means that the kernel will not crash or BUG. It may mean that the dirty page will not get written back (since at the time it is discovered we are in a context where we may not allocate memory or block, even though allocating blocks may be needed if the file system uses delayed allocation).

Furthermore, since the file system does not know that one or more pages have been dirtied behind its back, if the file system is almost full, some writes may silently fail --- including writes where the userspace application was implicitly promised that the write would succeed by having the write(2) system call return without errors.

If people are OK with that, it's fine. Just don't complain to the file system maintainers. :-)

- Ted

P.S. The reason why this isn't an utter disaster is because normally users of remote DMA (RDMA) use preallocated and pre-written/initialized files. And there aren't _that_ many other users of gup. So long as this remains the case, we might be happy to let sleeping canines lie. Just please, dear $DEITY, let's not have any additional users of gup until we have a better solution.
On Oct 29, 2022, at 1:56 PM, Nadav Amit <nadav.amit@gmail.com> wrote: > On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too >> >> And (b) would be fairly easy - same model as that dirty bit patch, >> just a 'do page_remove_rmap too' - except page_remove_rmap() wants the >> vma as well (and we delay the TLB flush over multiple vma's, so it's >> not just a "save vma in mmu_gather”). > > (b) sounds reasonable and may potentially allow future performance > improvements (batching, doing stuff without locks). > > It does appear to break a potential hidden assumption that rmap is removed > while the ptl is acquired (at least in the several instances I samples). > Yet, anyhow page_vma_mapped_walk() checks the PTE before calling the > function, so it should be fine. > > I’ll give it a try. I have just seen John’s and your emails. It seems (b) fell off. (a) is out of my “zone”, and anyhow assuming it would not be solved soon, deferring page_remove_rmap() might cause regressions. (c) might be more intrusive and potentially induce overheads. If we need a small backportable solution, I think the approach that I proposed (marking the page dirty after the invalidation, before the PTL is released) is the simplest one. Please advise how to proceed.
On Sat, Oct 29, 2022 at 1:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
> >
> > And (b) would be fairly easy - same model as that dirty bit patch,
> > just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
> > vma as well (and we delay the TLB flush over multiple vma's, so it's
> > not just a "save vma in mmu_gather").
>
> (b) sounds reasonable and may potentially allow future performance
> improvements (batching, doing stuff without locks).

So the thing is, I think (b) makes sense from a TLB flush standpoint, but as mentioned, if filesystems _really_ want to feel in control, none of these locks fundamentally help, because the whole "gup + set_page_dirty()" situation still exists.

And that fundamentally does not hold any locks between the lookup and the set_page_dirty(), and fundamentally doesn't stop an rmap from happening in between, so while the zap_page_range() and dirty TLB case can be dealt with that way, you don't actually fix any fundamental issues.

Now, neither does my (c) case on its own, but as John Hubbard alluded to, *if* we had some sane serialization for a struct page, maybe that could then be used to at least avoid the issue with "rmap no longer exists", and make the filesystem handling case a bit better.

IOW, I really do think that not only is (a) the current solution, it's the *correct* solution. But it is possible that (c) could then be used as a way to make (a) more palatable to filesystems, in that at least then there would be some way to serialize with "set_page_dirty()".

But (b) cannot be used for that - because GUP fundamentally breaks that rmap association.

It's all nasty.

Linus
On Oct 29, 2022, at 12:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> It didn't fail for me before, it doesn't fail for me with those patches.
For the record, I tried to run the PoC on another machine, and it indeed did
not fail.
Turns out I had a small bug in one of the mechanisms that were intended to
make the failure more likely: in the remap thread I should have mmap()ed or
madvise()d HPAGE_SIZE rather than PAGE_SIZE, to increase the time
zap_pte_range() spends and thereby the probability of the race.
I am still trying to figure out how to address this issue, and whether the
rmap_walk() users that do not use PVMW_SYNC are an issue.
---
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;
#define PAGE_SIZE (4096ul)
#define PAGES_PER_PMD (512)
#define HPAGE_SIZE (PAGE_SIZE * PAGES_PER_PMD)
// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST
void *dirtying_thread(void *arg)
{
int i;
while (!stop) {
for (i = 1; i < PAGES_PER_PMD; i++) {
*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
}
}
return NULL;
}
void *checking_thread(void *arg)
{
volatile unsigned long *ul_p = (volatile unsigned long*)p;
unsigned long cnt = 0;
while (!stop) {
*ul_p = cnt;
if (*ul_p != cnt) {
printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
kill(flusher_pid, SIGTERM);
exit(0);
}
cnt++;
}
return NULL;
}
void *remap_thread(void *arg)
{
void *ptr;
struct timespec t = {
.tv_nsec = 10000,
};
while (!stop) {
#ifdef MUNMAP_TEST
ptr = mmap(p, HPAGE_SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
if (ptr == MAP_FAILED)
handle_error("remap_thread");
#else
if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
handle_error("MADV_DONTNEED");
nanosleep(&t, NULL);
#endif
}
return NULL;
}
void flushing_process(void)
{
// Remove the pages to speed up rmap_walk and allow to drop caches.
if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
handle_error("MADV_DONTNEED");
while (true) {
if (msync(p, PAGE_SIZE, MS_SYNC))
handle_error("msync");
if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
handle_error("posix_fadvise");
}
}
int main(int argc, char *argv[])
{
void *(*thread_funcs[])(void*) = {
&dirtying_thread,
&checking_thread,
&remap_thread,
};
int r, i;
int rc1, rc2;
unsigned long addr;
void *ptr;
char *page = malloc(PAGE_SIZE);
int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
pid_t pid;
if (argc < 2) {
fprintf(stderr, "usages: %s [filename]\n", argv[0]);
exit(EXIT_FAILURE);
}
fd = open(argv[1], O_RDWR|O_CREAT, 0666);
if (fd == -1)
handle_error("open fd");
for (i = 0; i < PAGES_PER_PMD; i++) {
if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
handle_error("write");
}
free(page);
ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
-1, 0);
if (ptr == MAP_FAILED)
handle_error("mmap anon");
addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
printf("starting...\n");
ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
if (ptr == MAP_FAILED)
handle_error("mmap file - start");
p = ptr;
for (i = 0; i < n_threads; i++) {
r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
if (r)
handle_error("pthread_create");
}
// Run the flushing process in a different process, so msync() would
// not require mmap_lock.
pid = fork();
if (pid == 0)
flushing_process();
flusher_pid = pid;
sleep(60);
stop = true;
for (i = 0; i < n_threads; i++)
pthread_join(threads[i], NULL);
kill(flusher_pid, SIGTERM);
printf("Finished without an error\n");
exit(0);
}
On Oct 29, 2022, at 11:58 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Oct 29, 2022 at 11:36 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> Anyway, I think the best documentation for "this is what I meant" is >> simply the patch. Does this affect your PoC on your setup? > > Here's a slightly cleaned up set with preliminary commit messages, and > an explanation for why some of the 'struct page' declarations were > moved around a bit in case you wondered about that part of the change > in the full patch. > > The end result should be the same, so if you already looked at the > previous unified patch, never mind. But this one tries to make for a > better patch series. > > Still not tested in any way, shape, or form. I decided I wanted to > send this one before booting into this and possibly blowing up ;^) Running the PoC on Linux 6.0.6 with these patches caused the following splat on the following line: WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio)); Although I did not hit the warning on the next line (!folio_buffers(folio)), the commit log for the warning that actually triggered also leads to the same patch of Jan Kara that is intended to check if a page is dirtied without buffers (the scenario we are concerned about). Author: Jan Kara <jack@suse.cz> Date: Thu Dec 1 11:46:40 2016 -0500 ext4: warn when page is dirtied without buffers Warn when a page is dirtied without buffers (as that will likely lead to a crash in ext4_writepages()) or when it gets newly dirtied without the page being locked (as there is nothing that prevents buffers to get stripped just before calling set_page_dirty() under memory pressure). [ 908.444806] ------------[ cut here ]------------ [ 908.451010] WARNING: CPU: 16 PID: 2113 at fs/ext4/inode.c:3634 ext4_dirty_folio+0x74/0x80 [ 908.460343] Modules linked in: [ 908.463856] CPU: 16 PID: 2113 Comm: poc Not tainted 6.0.6+ #21 [ 908.470521] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.13.0 05/14/2021 [ 908.479202] RIP: 0010:ext4_dirty_folio+0x74/0x80 [ 908.484489] Code: d5 ee ff 41 5c 41 5d 5d c3 cc cc cc cc be 08 00 00 00 4c 89 e7 e8 bc 03 e0 ff 4c 89 e7 e8 f4 f8 df ff 49 8b 04 24 a8 08 75 bc <0f> 0b eb b8 0f 0b eb c6 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 [ 908.505851] RSP: 0018:ffff88a1197df9a8 EFLAGS: 00010246 [ 908.511826] RAX: 0057ffffc0002014 RBX: ffffffff83414b60 RCX: ffffffff818ceafc [ 908.519964] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffffea00fffd9f40 [ 908.528103] RBP: ffff88a1197df9b8 R08: 0000000000000001 R09: fffff9401fffb3e9 [ 908.536239] R10: ffffea00fffd9f47 R11: fffff9401fffb3e8 R12: ffffea00fffd9f40 [ 908.544376] R13: ffff88a087d368d8 R14: ffff88a1197dfb08 R15: ffff88a1197dfb00 [ 908.552509] FS: 00007ff7caa68700(0000) GS:ffff8897edc00000(0000) knlGS:0000000000000000 [ 908.561731] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 908.568299] CR2: 00007ff7caa67ed8 CR3: 00000020cc970001 CR4: 00000000001706e0 [ 908.576437] Call Trace: [ 908.579252] <TASK> [ 908.581683] folio_mark_dirty+0x69/0xa0 [ 908.586097] set_page_dirty+0x2a/0x90 [ 908.590301] tlb_flush_mmu+0xc1/0x320 [ 908.594517] tlb_finish_mmu+0x49/0x190 [ 908.598822] unmap_region+0x1fa/0x250 [ 908.603029] ? anon_vma_compatible+0x120/0x120 [ 908.608110] ? __kasan_check_read+0x11/0x20 [ 908.612926] ? __vma_rb_erase+0x38a/0x610 [ 908.617547] __do_munmap+0x313/0x770 [ 908.621669] mmap_region+0x227/0xa50 [ 908.625774] ? down_read+0x320/0x320 [ 908.629874] ? lock_acquire+0x19a/0x450 [ 908.634285] ? 
__x64_sys_brk+0x4e0/0x4e0 [ 908.641552] ? thp_get_unmapped_area+0xca/0x150 [ 908.649404] ? cap_mmap_addr+0x1d/0x90 [ 908.656373] ? security_mmap_addr+0x3c/0x50 [ 908.663781] ? get_unmapped_area+0x173/0x1f0 [ 908.671248] ? arch_get_unmapped_area+0x330/0x330 [ 908.679231] do_mmap+0x3c3/0x610 [ 908.685519] vm_mmap_pgoff+0x177/0x230 [ 908.692303] ? randomize_page+0x70/0x70 [ 908.699133] ksys_mmap_pgoff+0x241/0x2a0 [ 908.706011] __x64_sys_mmap+0x8d/0xb0 [ 908.712594] do_syscall_64+0x3b/0x90 [ 908.719090] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 908.727201] RIP: 0033:0x7ff7cbf868e6 [ 908.733559] Code: 00 00 00 00 f3 0f 1e fa 41 f7 c1 ff 0f 00 00 75 2b 55 48 89 fd 53 89 cb 48 85 ff 74 37 41 89 da 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 62 5b 5d c3 0f 1f 80 00 00 00 00 48 8b 05 71 [ 908.759522] RSP: 002b:00007ff7caa67ea8 EFLAGS: 00000206 ORIG_RAX: 0000000000000009 [ 908.770475] RAX: ffffffffffffffda RBX: 0000000000008011 RCX: 00007ff7cbf868e6 [ 908.780919] RDX: 0000000000000003 RSI: 0000000000200000 RDI: 00007ff7cbc00000 [ 908.791344] RBP: 00007ff7cbc00000 R08: 0000000000000003 R09: 0000000000000000 [ 908.801751] R10: 0000000000008011 R11: 0000000000000206 R12: 00007ffed51cbc4e [ 908.812118] R13: 00007ffed51cbc4f R14: 00007ffed51cbc50 R15: 00007ff7caa67fc0 [ 908.822523] </TASK> [ 908.827213] irq event stamp: 4169 [ 908.833101] hardirqs last enabled at (4183): [<ffffffff8133f028>] __up_console_sem+0x68/0x80 [ 908.844884] hardirqs last disabled at (4194): [<ffffffff8133f00d>] __up_console_sem+0x4d/0x80 [ 908.856622] softirqs last enabled at (4154): [<ffffffff83000430>] __do_softirq+0x430/0x5db [ 908.868167] softirqs last disabled at (4149): [<ffffffff8125fd89>] irq_exit_rcu+0xe9/0x120 [ 908.879611] ---[ end trace 0000000000000000 ]---
On Sat, Oct 29, 2022 at 7:17 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Running the PoC on Linux 6.0.6 with these patches caused the following splat
> on the following line:
>
> WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio));

Yeah, this is a sign of that "folio_mkclean() serializes with folio_mark_dirty() using rmap and the page table lock" issue.

And page_remove_rmap() could *almost* be called later, but it does have code that also depends on the page table lock, although it looks like realistically that's just because it "knows" that means that preemption is disabled, so it uses non-atomic statistics updates.

I say "knows" in quotes, because that's what the comment says, but it turns out that __mod_node_page_state() has to deal with CONFIG_RT anyway and does that

	preempt_disable_nested();
	...
	preempt_enable_nested();

thing.

And then it wants to see the vma, although that's actually only to see if it's 'mlock'ed, so we could just squirrel that away.

So we *could* move page_remove_rmap() later into the TLB flush region, but then we would have lost the page table lock anyway, so then folio_mkclean() can come in regardless.

So that doesn't even help.

End result: we do want to do the set_page_dirty() and the page_remove_rmap() under the page table lock, because that is what serializes folio_mkclean().

And we'd _like_ to do the TLB flush before the page_remove_rmap(), but we *really* don't want to do that for every page.

So my current gut feel is that we should just say that if you do MADV_DONTNEED or a munmap() (which includes re-mmap()'ing over the area) while some other thread is still writing to that memory region, you may lose writes.

IOW, just accept the behavior that Nadav's test-program tries to show, and say "look, you're doing insane things, we've never given you any other semantics, it's your problem" to any user program that does that.

If a user program does MADV_DONTNEED on an area that it is actively using at the same time in another thread, that sounds really really bogus. Same goes doubly for 'munmap()' or over-mapping.

Linus
On Sun, Oct 30, 2022 at 11:19 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And we'd _like_ to do the TLB flush before the page_remove_rmap(), but we
> *really* don't want to do that for every page.

Hmm. I have yet another crazy idea.

We could keep the current placement of the TLB flush, to just before we drop the page table lock.

And we could do all the things we do in 'page_remove_rmap()' right now *except* for the mapcount stuff.

And only move the mapcount code to the page freeing stage.

Because all the rmap() walk synchronization really needs is that 'page->_mapcount' is still elevated, and if it is, it will serialize with the page table lock.

And it turns out that 'page_remove_rmap()' already treats the case we care about differently, and all it does is

	lock_page_memcg(page);

	if (!PageAnon(page)) {
		page_remove_file_rmap(page, compound);
		goto out;
	}
	...
out:
	unlock_page_memcg(page);

	munlock_vma_page(page, vma, compound);

for that case. And that 'page_remove_file_rmap()' is literally the code that modifies the _mapcount.

Annoyingly, this is all complicated by that 'compound' argument, but that's always false in that zap_page_range() case.

So what we *could* do is make a new version of page_remove_rmap(), which is specialized for this case: no 'compound' argument (always false), and doesn't call 'page_remove_file_rmap()', because we'll do that for the !PageAnon(page) case later, after the TLB flush.

That would keep the existing TLB flush logic, keep the existing 'mark page dirty', and would just make sure that 'folio_mkclean()' ends up being serialized with the TLB flush simply because it will take the page table lock, because we delay the '._mapcount' update until afterwards.

Annoyingly, the organization of 'page_remove_rmap()' is a bit ugly, and we have several other callers that want the existing logic, so while the above sounds conceptually simple, I think the patch would be a bit messy.

Linus
On Oct 30, 2022, at 11:19 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And page_remove_rmap() could *almost* be called later, but it does
> have code that also depends on the page table lock, although it looks
> like realistically that's just because it "knows" that means that
> preemption is disabled, so it uses non-atomic statistics updates.
>
> And then it wants to see the vma, although that's actually only to see
> if it's 'mlock'ed, so we could just squirrel that away.
>
> So we *could* move page_remove_rmap() later into the TLB flush region,
> but then we would have lost the page table lock anyway, so then
> folio_mkclean() can come in regardless.
>
> So that doesn't even help.

Well, if you combine it with the per-page-table stale TLB detection mechanism that I proposed, I think this could work.

Reminder (feel free to skip): you would have a per-mm "completed TLB-generation" in addition to the current one, which would be renamed to "pending TLB-generation". Whenever you update the page-tables in a manner that might require a TLB flush, you would increase the "pending TLB-generation" and save the pending TLB-generation in the page-table's page-struct. All of that is done once, under the page-table lock. When you finish a TLB-flush, you update the "completed TLB-generation".

Then in page_vma_mkclean_one(), you would check if the page-table's TLB-generation is greater than the completed TLB-generation, which would indicate that TLB entries for PTEs in this table might be stale. In that case you would just flush the TLB.

[ Of course you could instead just flush if mm_tlb_flush_pending(), but nobody likes that mechanism: it has a very coarse granularity and can therefore lead to many unnecessary TLB flushes. ]

Indeed, there would potentially be some overhead, since the mm's TLB-generation cacheline is already highly contended in extreme cases. But I think it is worth it to have simple logic that allows us to reason about correctness.

My intuition is that although you appear to be right that we can just mark this case as an "extreme case nobody cares about", it might have, now or in the future, other implications that are hard to predict and prevent.
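The page_vma_mkclean_one() side of that check could look roughly like this (a sketch only; 'pt_tlb_gen' and 'completed_tlb_gen' are the same invented generation values as in the earlier sketch, passed in here to keep the example self-contained):

static void flush_if_pt_may_be_stale(struct vm_area_struct *vma,
				     unsigned long addr,
				     u64 pt_tlb_gen, u64 completed_tlb_gen)
{
	/*
	 * pt_tlb_gen: generation recorded in the page-table page when a
	 * flush was deferred; completed_tlb_gen: the mm-wide counter of
	 * finished flushes.  If the table has run ahead, a TLB entry for
	 * one of its PTEs may still be stale, so flush before trusting it.
	 */
	if (pt_tlb_gen > completed_tlb_gen)
		flush_tlb_page(vma, addr);
}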
On Sun, Oct 30, 2022 at 11:51 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> We could keep the current placement of the TLB flush, to just before
> we drop the page table lock.
>
> And we could do all the things we do in 'page_remove_rmap()' right now
> *except* for the mapcount stuff.
>
> And only move the mapcount code to the page freeing stage.

So I actually have a three-commit series to do the rmap simplification, but let me just post the end result of that series, because the end result is actually smaller than the individual commits (I did it as three incremental commits just to make it more obvious to me how to get to that end result).

The three commits end up being

    mm: introduce simplified versions of 'page_remove_rmap()'
    mm: inline simpler case of page_remove_file_rmap()
    mm: re-unify the simplified page_zap_*_rmap() function

and the end result of them is this attached patch.

I'm *claiming* that the attached patch is semantically identical to what we did before it, just _hugely_ simplified. Basically, that new 'page_zap_pte_rmap()' does the same things that 'page_remove_rmap()' did, except it is limited to only last-level PTE entries (and munlock_vma_page() has to be called separately).

The simplification comes from 'compound' being false, from it always being about small pages, and from the atomic mapcount decrement having been moved outside the memcg lock, since it is independent of it.

Anyway, this simplification patch basically means that the *next* step could be to just move that 'page_zap_pte_rmap()' after the TLB flush, and now it's trivial and no longer scary.

I did *not* do that yet, because it still needs that "encoded_page[]" array - except now it doesn't encode the 'dirty' bit, now it would encode the 'do a page->_mapcount decrement' bit.

I didn't do that part, because I needed to do the rc3 release, plus I'd like to have somebody look at this introductory patch first.

Linus
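For readers without the attached patch, a sketch of what such a stripped-down helper could look like, based purely on the description above (small pages only, mapcount decrement outside the memcg lock, mlock handled separately by the caller); this is an illustration of the idea, not necessarily the patch as posted:

void page_zap_pte_rmap(struct page *page)
{
	/* Not the last mapping of the page?  Then there's nothing to account. */
	if (!atomic_add_negative(-1, &page->_mapcount))
		return;

	/* Last mapping gone: update the mapped-page statistics. */
	lock_page_memcg(page);
	__dec_lruvec_page_state(page,
		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
	unlock_page_memcg(page);
}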
On Sun, Oct 30, 2022 at 3:47 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Anyway, this simplification patch basically means that the *next* step
> could be to just move that 'page_zap_pte_rmap()' after the TLB flush,
> and now it's trivial and no longer scary.

So here's that next patch: it's patch 4/4 in this series.

Patches 1-3 are the patches that I already sent out as one (smaller) combined patch. I'm including them here as the whole series in case somebody else wants to follow along with how I did the simplified version of page_remove_rmap().

So if you already looked at the previous patch and don't have any questions about that one, you could just apply PATCH 4/4 on top of that one. Or you can do the whole series with commit messages and (hopefully) clearer individual steps to the simplified version of page_remove_rmap().

I haven't actually tested 4/4 yet, so this is yet another of my "I think this should work" patches. The reason I haven't actually tested it is partly because I never recreated the original problem Nadav reported, and partly because the meat of patch 4/4 is just the same "encode an extra flag bit in the low bit of the page pointer" that I _did_ test, just doing the "remove rmap" instead of "set dirty".

In other words, I *think* this should make Nadav's test-case happy, and avoid the warning he saw.

If somebody wants a git branch, I guess I can push that too, but it would be a non-stable branch only for testing.

Also, it's worth noting that zap_pte_range() does this sanity test:

	if (unlikely(page_mapcount(page) < 0))
		print_bad_pte(vma, addr, ptent, page);

and that is likely worthless now (because it hasn't actually decremented the mapcount yet). I didn't remove it, because I wasn't sure which option was best:

 (a) just remove it entirely

 (b) change the "< 0" to "<= 0"

 (c) move it to clean_and_free_pages_and_swap_cache(), which actually does the page_zap_pte_rmap() now.

So I'm in no way claiming this series is any kind of final word, but I do mostly like how the code ended up looking.

Now, this whole "remove rmap after flush" would probably make sense for some of the largepage cases too, and there's room for another bit in there (or more, if you align 'struct page'), and that whole code could use page size hints etc too.

But I suspect that the main case we really care about is individual small pages, just because that's the case where I think it would be much worse to do any fancy TLB tracking. The largepage cases presumably aren't as critical, since there are by definition fewer of those.

Comments?

Linus
On Oct 30, 2022, at 6:47 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > The reason I haven't actually tested it is partly because I never > recreated the original problem Navav reported, and partly because the > meat of patch 4/4 is just the same "encode an extra flag bit in the > low bit of the page pointer" that I _did_ test, just doing the "remove > rmap" instead of "set dirty". > > In other words, I *think* this should make Nadav's test-case happy, > and avoid the warning he saw. I am sorry for not managing to make it reproducible on your system. The fact that you did not get the warning that I got means that it is not a hardware-TLB differences issue (at least not only that), but the race does not happen on your system (assuming you used ext4 on the BRD). Anyhow, I ran the tests with the patches and there are no failures. Thanks for addressing this issue. I understand from the code that you decided to drop the deferring of set_page_dirty(), which could - at least for the munmap case (where mmap_lock is taken for write) - prevent the need for “force_flush” and potentially save TLB flushes. I was just wondering whether the reason for that is that you wanted to have small backportable and conservative patches, or whether you changed your mind about it.
On Oct 30, 2022, at 9:09 PM, Nadav Amit <nadav.amit@gmail.com> wrote: > I understand from the code that you decided to drop the deferring of > set_page_dirty(), which could - at least for the munmap case (where > mmap_lock is taken for write) - prevent the need for “force_flush” and > potentially save TLB flushes. > > I was just wondering whether the reason for that is that you wanted > to have small backportable and conservative patches, or whether you > changed your mind about it. Please ignore this silly question. I understand - the buffers might still be dropped.
On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I am sorry for not managing to make it reproducible on your system.

Heh, that's very much *not* your fault. Honestly, I didn't try very much or very hard. I felt like I understood the cause of the problem sufficiently that I didn't really need to have a reproducer, and I much prefer to just think the solution through and try to make it really robust.

Or, put another way - I'm just lazy.

> Anyhow, I ran the tests with the patches and there are no failures.

Lovely.

> Thanks for addressing this issue.

Well, I'm not sure the issue is "addressed" yet. I think the patch series is likely the right thing to do, but others may disagree with this approach. And regardless of that, this still leaves some questions open.

 (a) there's the issue of s390, which does its own version of __tlb_remove_page_size.

I *think* s390 basically does the TLB flush synchronously in zap_pte_range(), and that it would be for that reason trivial to just add that 'flags' argument to the s390 __tlb_remove_page_size(), and make it do

        if (flags & TLB_ZAP_RMAP)
                page_zap_pte_rmap(page);

at the top synchronously too. But some s390 person would need to look at it. I *think* the issue is literally that straightforward and not a big deal, but it's probably not even worth bothering the s390 people until VM people have decided "yes, this makes sense".

 (b) the issue I mentioned with the currently useless "page_mapcount(page) < 0" test with that patch.

Again, this is mostly just janitorial stuff associated with that patch series.

 (c) whether to worry about back-porting

I don't *think* this is worth backporting, but if it causes other changes, then maybe..

> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for “force_flush” and
> potentially save TLB flushes.

I really liked my dirty patch, but your warning case really made it obvious that it was just broken.

The thing is, moving the "set_page_dirty()" to later is really nice, and really makes a *lot* of sense from a conceptual standpoint: only after that TLB flush do we really have no more people who can dirty it.

BUT.

Even if we just used another bit in the array for "dirty", and did the set_page_dirty() later (but still before getting rid of the rmap), that wouldn't actually *work*.

Why? Because the race with folio_mkclean() would just come back.

Yes, now we'd have the rmap data, so mkclean would be forced to serialize with the page table lock. But if we get rid of the "force_flush" for the dirty bit, that serialization won't help, simply because we've *dropped* the page table lock before we actually then do the set_page_dirty() again.

So the mkclean serialization needs *both* the late rmap dropping _and_ the page table lock being kept.

So deferring set_page_dirty() is conceptually the right thing to do from a pure "just track dirty bit" standpoint, but it doesn't work with the way we currently expect mkclean to work.

> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

See above: I still think it would be the right thing in a perfect world. But with the current folio_mkclean(), we just can't do it.

I had completely forgotten / repressed that horror-show.
So the current ordering rules are basically that we need to do set_page_dirty() *and* we need to flush the TLBs before dropping the page table lock. That's what gets us serialized with "mkclean".

The whole "drop rmap" can then happen at any later time; the only important thing is that it is kept to at least after the TLB flush.

We could do the rmap drop still inside the page table lock, but honestly, it just makes more sense to do it as we free the batched pages anyway.

Am I missing something still?

And again, this is about our horrid serialization between folio_mkclean and set_page_dirty(). It's related to how GUP + set_page_dirty() is also fundamentally problematic. So that dirty bit situation *may* change if the rules for folio_mkclean() change...

              Linus
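Schematically, the constraint being described works out to something like the outline below. It is only an illustration of the ordering, not the actual zap_pte_range() code, and the local variable names are assumed for the sketch:

        /*
         * Illustrative outline of the ordering constraint above; not the
         * real zap_pte_range() flow.
         */
        spin_lock(ptl);                         /* page table lock */
        ptent = ptep_get_and_clear(mm, addr, pte);
        if (pte_dirty(ptent))
                set_page_dirty(page);           /* must happen under the ptl ... */
        flush_tlb_range(vma, start, end);       /* ... and so must the TLB flush,
                                                 * to serialize with folio_mkclean() */
        spin_unlock(ptl);

        page_zap_pte_rmap(page);                /* dropping the rmap can happen any
                                                 * time after the flush, e.g. while
                                                 * freeing the batched pages */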
On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:
>  include/linux/rmap.h |  1 +
>  mm/memory.c          |  3 ++-
>  mm/rmap.c            | 24 ++++++++++++++++++++++++
>  3 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bd3504d11b15..f62af001707c 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,7 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> 		unsigned long address);
>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
> 		bool compound);
> +void page_zap_pte_rmap(struct page *);
>  void page_remove_rmap(struct page *, struct vm_area_struct *,
> 		bool compound);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..c893f5ffc5a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1452,8 +1452,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> 			    likely(!(vma->vm_flags & VM_SEQ_READ)))
> 				mark_page_accessed(page);
> 		}
> +		page_zap_pte_rmap(page);
> +		munlock_vma_page(page, vma, false);
> 		rss[mm_counter(page)]--;
> -		page_remove_rmap(page, vma, false);
> 		if (unlikely(page_mapcount(page) < 0))
> 			print_bad_pte(vma, addr, ptent, page);
> 		if (unlikely(__tlb_remove_page(tlb, page))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2ec925e5fa6a..28b51a31ebb0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1412,6 +1412,30 @@ static void page_remove_anon_compound_rmap(struct page *page)
> 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
>  }
>
> +/**
> + * page_zap_pte_rmap - take down a pte mapping from a page
> + * @page: page to remove mapping from
> + *
> + * This is the simplified form of page_remove_rmap(), that only
> + * deals with last-level pages, so 'compound' is always false,
> + * and the caller does 'munlock_vma_page(page, vma, compound)'
> + * separately.
> + *
> + * This allows for a much simpler calling convention and code.
> + *
> + * The caller holds the pte lock.
> + */
> +void page_zap_pte_rmap(struct page *page)
> +{

One could consider adding something like:

#ifdef USE_SPLIT_PTE_PTLOCKS
	lockdep_assert_held(ptlock_ptr(page))
#endif

> +	if (!atomic_add_negative(-1, &page->_mapcount))
> +		return;
> +
> +	lock_page_memcg(page);
> +	__dec_lruvec_page_state(page,
> +			PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> +	unlock_page_memcg(page);
> +}

Took me a little while, but yes, .compound=false seems to reduce to this.
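Folded into the new helper, that suggestion would read roughly as below. This is only a sketch of the proposal; as noted later in the thread, the memcg locking here goes away again in subsequent patches, so the assertion never went in this form:

        void page_zap_pte_rmap(struct page *page)
        {
        #ifdef USE_SPLIT_PTE_PTLOCKS
                /* the caller is required to hold the pte lock */
                lockdep_assert_held(ptlock_ptr(page));
        #endif
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;

                lock_page_memcg(page);
                __dec_lruvec_page_state(page,
                                PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
                unlock_page_memcg(page);
        }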
On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote: > Also, it's worth noting that zap_pte_range() does this sanity test: > > if (unlikely(page_mapcount(page) < 0)) > print_bad_pte(vma, addr, ptent, page); > > and that is likely worthless now (because it hasn't actually > decremented the mapcount yet). I didn't remove it, because I wasn't > sure which option was best: > > (a) just remove it entirely > > (b) change the "< 0" to "<= 0" > > (c) move it to clean_and_free_pages_and_swap_cache() that actually > does the page_zap_pte_rmap() now. I'm leaning towards (c); simply because the error case is so terrifying I feel we should check for it (and I do have vague memories of us actually hitting something like this in the very distant past).
On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote: > - there's no 'compund' argument any more, because it was always false > > - we can get rid of the tests for 'compund' and 'PageAnon()' since we > know that they are You're making up new words ;-) s/compund/compound/g
On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index ba1d08a908a4..c893f5ffc5a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1451,9 +1451,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> 			if (pte_young(ptent) &&
> 			    likely(!(vma->vm_flags & VM_SEQ_READ)))
> 				mark_page_accessed(page);
> +		}
> +		page_zap_pte_rmap(page);
> 		munlock_vma_page(page, vma, false);
> 		rss[mm_counter(page)]--;
> 		if (unlikely(page_mapcount(page) < 0))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 69de6c833d5c..28b51a31ebb0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1413,47 +1413,26 @@ static void page_remove_anon_compound_rmap(struct page *page)
>  }
>
>  /**
> + * page_zap_pte_rmap - take down a pte mapping from a page
>   * @page: page to remove mapping from
>   *
> + * This is the simplified form of page_remove_rmap(), that only
> + * deals with last-level pages, so 'compound' is always false,
> + * and the caller does 'munlock_vma_page(page, vma, compound)'
> + * separately.
>   *
> + * This allows for a much simpler calling convention and code.
>   *
>   * The caller holds the pte lock.
>   */
> +void page_zap_pte_rmap(struct page *page)
>  {
>  	if (!atomic_add_negative(-1, &page->_mapcount))
>  		return;
>
>  	lock_page_memcg(page);
> +	__dec_lruvec_page_state(page,
> +			PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
>  	unlock_page_memcg(page);
>  }

So we *could* use atomic_add_return() and include the print_bad_pte() thing in this function -- however that turns the whole thing into a mess again :/
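For illustration only, the atomic_add_return() variant being alluded to might look roughly like the sketch below; the extra vma/addr/ptent plumbing it needs (and the fact that print_bad_pte() is local to mm/memory.c) is what makes it a mess. This is not code from the series:

        /* A sketch of the rejected atomic_add_return() variant, not the real code. */
        void page_zap_pte_rmap(struct page *page, struct vm_area_struct *vma,
                               unsigned long addr, pte_t ptent)
        {
                int mapcount = atomic_add_return(-1, &page->_mapcount);

                /* _mapcount is -1 for an unmapped page, so < -1 means underflow */
                if (unlikely(mapcount < -1))
                        print_bad_pte(vma, addr, ptent, page); /* static to mm/memory.c */

                if (mapcount != -1)
                        return;

                lock_page_memcg(page);
                __dec_lruvec_page_state(page,
                                PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
                unlock_page_memcg(page);
        }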
On Oct 30, 2022, at 10:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > So the current ordering rules are basically that we need to do > set_page_dirty() *and* we need to flush the TLB's before dropping the > page table lock. That's what gets us serialized with "mkclean”. I understand. I am still not sure whether ordering the set_page_dirty() and dropping the mapcount reference cannot suffice for the reclaim logic not to free the buffers if the page is dirtied. According to the code, shrink_page_list() first checks for folio_mapped() and then for folio_test_dirty() to check whether pageout() is necessary. IIUC, the buffers are not dropped up to this point and set_page_dirty() would always set the page-struct dirty bit. IOW: In shrink_page_list(), when we decide on whether to pageout(), we should see whether the page is dirty (give or take smp_rmb()). But this is an optimization and I do not know all the cases in which buffers might be dropped. My intuition says that they cannot be dropped while mapcount != 0, but I need to further explore it.
On Mon, Oct 31, 2022 at 2:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:
>
> > + * This is the simplified form of page_remove_rmap(), that only
> > + * deals with last-level pages, so 'compound' is always false,
> > + * and the caller does 'munlock_vma_page(page, vma, compound)'
> > + * separately.
> > + *
> > + * This allows for a much simpler calling convention and code.
> > + *
> > + * The caller holds the pte lock.
> > + */
> > +void page_zap_pte_rmap(struct page *page)
> > +{
>
> One could consider adding something like:
>
> #ifdef USE_SPLIT_PTE_PTLOCKS
> lockdep_assert_held(ptlock_ptr(page))
> #endif

Yes, except that the page lock goes away in the next few patches and gets replaced by just using the safe dec_lruvec_page_state() instead, so it's not really worth it.

> > +	if (!atomic_add_negative(-1, &page->_mapcount))
> > +		return;
> > +
> > +	lock_page_memcg(page);
> > +	__dec_lruvec_page_state(page,
> > +			PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> > +	unlock_page_memcg(page);
> > +}
>
> Took me a little while, but yes, .compound=false seems to reduce to
> this.

Yeah - it's why I kept that thing as three separate patches, because even if each of the patches isn't "immediately obvious", you can at least go back and follow along and see what I did.

The *full* simplification end result just looks like magic.

Admittedly, I think a lot of that "looks like magic" is because the rmap code has seriously cruftified over the years. Go back a decade, and we literally used to do pretty much exactly what the simplified form does.

The transformation to complexity hell starts with commit 89c06bd52fb9 ("memcg: use new logic for page stat accounting"), but just look at what it looked like before that:

    git show 89c06bd52fb9^:mm/rmap.c

gets you the state back when it was simple. And look at what it did:

    void page_remove_rmap(struct page *page)
    {
            /* page still mapped by someone else? */
            if (!atomic_add_negative(-1, &page->_mapcount))
                    return;

            ... statistics go here ..

so in the end the simplified form of this page_zap_pte_rmap() really isn't *that* surprising. In fact, that old code handled PageHuge() fairly naturally too, and almost all the mess comes from the memcg accounting - and locking - changes.

And I actually really wanted to go one step further, and try to then batch up the page state accounting too. It's kind of stupid how we do all that memcg locking for each page, and with this new setup we have one nice array of pages that we *could* try to just batch things with.

The pages in normal situations *probably* mostly all share the same memcg and node, so just optimistically trying to do something like "as long as it's the same memcg as last time, just keep the lock". Instead of locking and unlocking for every single page.

But just looking at it exhausted me enough that I don't think I'll go there.

Put another way: after this series, it's not the 'rmap' code that makes me go Ugh, it's the memcg tracking..

(But the hugepage rmap code is incredibly nasty stuff still, and I think the whole 'compound=true' case would be better with somebody taking a look at that case too, but that somebody won't be me).

              Linus
On Mon, Oct 31, 2022 at 2:39 AM Peter Zijlstra <peterz@infradead.org> wrote: > > You're making up new words ;-) > > s/compund/compound/g Oops. Fixed. Thanks, Linus
On Mon, Oct 31, 2022 at 2:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > (c) move it to clean_and_free_pages_and_swap_cache() that actually
> > does the page_zap_pte_rmap() now.
>
> I'm leaning towards (c); simply because the error case is so terrifying
> I feel we should check for it (and I do have vague memories of us
> actually hitting something like this in the very distant past).

Ok. At that point we no longer have the pte or the virtual address, so it's not going to be exactly the same debug output.

But I think it ends up being fairly natural to do

        VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);

instead, and I've fixed that last patch up to do that.

              Linus
On Mon, Oct 31, 2022 at 8:43 AM Nadav Amit <nadav.amit@gmail.com> wrote: > > On Oct 30, 2022, at 10:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > So the current ordering rules are basically that we need to do > > set_page_dirty() *and* we need to flush the TLB's before dropping the > > page table lock. That's what gets us serialized with "mkclean”. > > I understand. I am still not sure whether ordering the set_page_dirty() and > dropping the mapcount reference cannot suffice for the reclaim logic not to > free the buffers if the page is dirtied. Ahh, ok. > According to the code, shrink_page_list() first checks for folio_mapped() > and then for folio_test_dirty() to check whether pageout() is necessary. > IIUC, the buffers are not dropped up to this point and set_page_dirty() > would always set the page-struct dirty bit. > > IOW: In shrink_page_list(), when we decide on whether to pageout(), we > should see whether the page is dirty (give or take smp_rmb()). > > But this is an optimization and I do not know all the cases in which buffers > might be dropped. My intuition says that they cannot be dropped while > mapcount != 0, but I need to further explore it. Yes, the above sounds like one fairly good way out of the whole forced TLB flushing for dirty pages, while still keeping the filesystem code happy. But at this point it's an independent issue. And I really would like any fix like that to also fix the whole issue with GUP too, which seems related. Linus
Updated subject line, and here's the link to the original discussion for new people: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ok. At that point we no longer have the pte or the virtual address, so > it's not going to be exactly the same debug output. > > But I think it ends up being fairly natural to do > > VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page); > > instead, and I've fixed that last patch up to do that. Ok, so I've got a fixed set of patches based on the feedback from PeterZ, and also tried to do the s390 updates for this blindly, and pushed them out into a git branch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix If people really want to see the patches in email again, I can do that, but most of you already have, and the changes are either trivial fixes or the s390 updates. For the s390 people that I've now added to the participant list maybe the git tree is fine - and the fundamental explanation of the problem is in that top-most commit (with the three preceding commits being prep-work). Or that link to the thread about this all. That top-most commit is also where I tried to fix things up for s390 that uses its own non-gathering TLB flush due to CONFIG_MMU_GATHER_NO_GATHER. NOTE NOTE NOTE! Unlike my regular git branch, this one may end up rebased etc for further comments and fixes. So don't consider that stable, it's still more of an RFC branch. At a minimum I'll update it with Ack's etc, assuming I get those, and my s390 changes are entirely untested and probably won't work. As far as I can tell, s390 doesn't actually *have* the problem that causes this change, because of its synchronous TLB flush, but it obviously needs to deal with the change of rmap zapping logic. Also added a few people who are explicitly listed as being mmu_gather maintainers. Maybe people saw the discussion on the linux-mm list, but let's make it explicit. Do people have any objections to this approach, or other suggestions? I do *not* consider this critical, so it's a "queue for 6.2" issue for me. It probably makes most sense to queue in the -MM tree (after the thing is acked and people agree), but I can keep that branch alive too and just deal with it all myself as well. Anybody? Linus
Am 31.10.22 um 19:43 schrieb Linus Torvalds: > Updated subject line, and here's the link to the original discussion > for new people: > > https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ > > On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Ok. At that point we no longer have the pte or the virtual address, so >> it's not going to be exactly the same debug output. >> >> But I think it ends up being fairly natural to do >> >> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page); >> >> instead, and I've fixed that last patch up to do that. > > Ok, so I've got a fixed set of patches based on the feedback from > PeterZ, and also tried to do the s390 updates for this blindly, and > pushed them out into a git branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix > > If people really want to see the patches in email again, I can do > that, but most of you already have, and the changes are either trivial > fixes or the s390 updates. > > For the s390 people that I've now added to the participant list maybe > the git tree is fine - and the fundamental explanation of the problem > is in that top-most commit (with the three preceding commits being > prep-work). Or that link to the thread about this all. Adding Gerald. > > That top-most commit is also where I tried to fix things up for s390 > that uses its own non-gathering TLB flush due to > CONFIG_MMU_GATHER_NO_GATHER. > > NOTE NOTE NOTE! Unlike my regular git branch, this one may end up > rebased etc for further comments and fixes. So don't consider that > stable, it's still more of an RFC branch. > > At a minimum I'll update it with Ack's etc, assuming I get those, and > my s390 changes are entirely untested and probably won't work. > > As far as I can tell, s390 doesn't actually *have* the problem that > causes this change, because of its synchronous TLB flush, but it > obviously needs to deal with the change of rmap zapping logic. > > Also added a few people who are explicitly listed as being mmu_gather > maintainers. Maybe people saw the discussion on the linux-mm list, but > let's make it explicit. > > Do people have any objections to this approach, or other suggestions? > > I do *not* consider this critical, so it's a "queue for 6.2" issue for me. > > It probably makes most sense to queue in the -MM tree (after the thing > is acked and people agree), but I can keep that branch alive too and > just deal with it all myself as well. > > Anybody? > > Linus It certainly needs a build fix for s390: In file included from kernel/sched/core.c:78: ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size': ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration] 50 | page_zap_pte_rmap(page); | ^~~~~~~~~~~~~~~~~
Am 02.11.22 um 10:14 schrieb Christian Borntraeger: > Am 31.10.22 um 19:43 schrieb Linus Torvalds: >> Updated subject line, and here's the link to the original discussion >> for new people: >> >> https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ >> >> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> >>> Ok. At that point we no longer have the pte or the virtual address, so >>> it's not going to be exactly the same debug output. >>> >>> But I think it ends up being fairly natural to do >>> >>> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page); >>> >>> instead, and I've fixed that last patch up to do that. >> >> Ok, so I've got a fixed set of patches based on the feedback from >> PeterZ, and also tried to do the s390 updates for this blindly, and >> pushed them out into a git branch: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix >> >> If people really want to see the patches in email again, I can do >> that, but most of you already have, and the changes are either trivial >> fixes or the s390 updates. >> >> For the s390 people that I've now added to the participant list maybe >> the git tree is fine - and the fundamental explanation of the problem >> is in that top-most commit (with the three preceding commits being >> prep-work). Or that link to the thread about this all. > > Adding Gerald. now the correct Gerald.... > >> >> That top-most commit is also where I tried to fix things up for s390 >> that uses its own non-gathering TLB flush due to >> CONFIG_MMU_GATHER_NO_GATHER. >> >> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up >> rebased etc for further comments and fixes. So don't consider that >> stable, it's still more of an RFC branch. >> >> At a minimum I'll update it with Ack's etc, assuming I get those, and >> my s390 changes are entirely untested and probably won't work. >> >> As far as I can tell, s390 doesn't actually *have* the problem that >> causes this change, because of its synchronous TLB flush, but it >> obviously needs to deal with the change of rmap zapping logic. >> >> Also added a few people who are explicitly listed as being mmu_gather >> maintainers. Maybe people saw the discussion on the linux-mm list, but >> let's make it explicit. >> >> Do people have any objections to this approach, or other suggestions? >> >> I do *not* consider this critical, so it's a "queue for 6.2" issue for me. >> >> It probably makes most sense to queue in the -MM tree (after the thing >> is acked and people agree), but I can keep that branch alive too and >> just deal with it all myself as well. >> >> Anybody? >> >> Linus > > It certainly needs a build fix for s390: > > > In file included from kernel/sched/core.c:78: > ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size': > ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration] > 50 | page_zap_pte_rmap(page); > | ^~~~~~~~~~~~~~~~~
On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger <borntraeger@linux.ibm.com> wrote: > > It certainly needs a build fix for s390: > > In file included from kernel/sched/core.c:78: > ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size': > ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration] > 50 | page_zap_pte_rmap(page); > | ^~~~~~~~~~~~~~~~~ Hmm. I'm not sure if I can add a #include <linux/rmap.h> to that s390 asm header file without causing more issues. The minimal damage would probably be to duplicate the declaration of page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used. Not pretty to have two different declarations of that thing, but anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which is much of mm) would then verify the consistency of them. So I'll do that minimal fix and update that branch, but if s390 people end up having a better fix, please holler. Linus
On Wed, Nov 2, 2022 at 10:55 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I'll do that minimal fix and update that branch, but if s390 people > end up having a better fix, please holler. I've updated the branch with that, so hopefully s390 builds now. I also fixed a typo in the commit message and added Peter's ack. Other than that it's all the same it was before. Linus
On Wed, 2 Nov 2022 10:55:10 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger > <borntraeger@linux.ibm.com> wrote: > > > > It certainly needs a build fix for s390: > > > > In file included from kernel/sched/core.c:78: > > ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size': > > ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration] > > 50 | page_zap_pte_rmap(page); > > | ^~~~~~~~~~~~~~~~~ > > Hmm. I'm not sure if I can add a > > #include <linux/rmap.h> > > to that s390 asm header file without causing more issues. > > The minimal damage would probably be to duplicate the declaration of > page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used. > > Not pretty to have two different declarations of that thing, but > anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which > is much of mm) would then verify the consistency of them. > > So I'll do that minimal fix and update that branch, but if s390 people > end up having a better fix, please holler. It compiles now with your duplicate declaration, but adding the #include also did not cause any damage, so that should also be OK.
On Mon, 31 Oct 2022 11:43:30 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Updated subject line, and here's the link to the original discussion > for new people: > > https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ > > On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Ok. At that point we no longer have the pte or the virtual address, so > > it's not going to be exactly the same debug output. > > > > But I think it ends up being fairly natural to do > > > > VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page); > > > > instead, and I've fixed that last patch up to do that. > > Ok, so I've got a fixed set of patches based on the feedback from > PeterZ, and also tried to do the s390 updates for this blindly, and > pushed them out into a git branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix > > If people really want to see the patches in email again, I can do > that, but most of you already have, and the changes are either trivial > fixes or the s390 updates. > > For the s390 people that I've now added to the participant list maybe > the git tree is fine - and the fundamental explanation of the problem > is in that top-most commit (with the three preceding commits being > prep-work). Or that link to the thread about this all. > > That top-most commit is also where I tried to fix things up for s390 > that uses its own non-gathering TLB flush due to > CONFIG_MMU_GATHER_NO_GATHER. > > NOTE NOTE NOTE! Unlike my regular git branch, this one may end up > rebased etc for further comments and fixes. So don't consider that > stable, it's still more of an RFC branch. > > At a minimum I'll update it with Ack's etc, assuming I get those, and > my s390 changes are entirely untested and probably won't work. > > As far as I can tell, s390 doesn't actually *have* the problem that > causes this change, because of its synchronous TLB flush, but it > obviously needs to deal with the change of rmap zapping logic. Correct, we need to flush already when we change a PTE, which is done in ptep_get_and_clear() etc. Only exception would be lazy flushing when only one active thread is attached, then we would flush later in flush_tlb_mm/range(), or as soon as another thread is attached (IIRC). So it seems straight forward to just call page_zap_pte_rmap() from our private __tlb_remove_page_size() implementation. Just wondering a bit why you did not also add the VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like in the generic change. Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
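Concretely, the s390 hook being discussed amounts to something like the sketch below in arch/s390/include/asm/tlb.h. It is reconstructed from the build error quoted earlier and from the description of the s390 version later in the thread, so the exact signature and surrounding code are assumptions rather than the actual patch:

        /*
         * Approximate shape of the s390 __tlb_remove_page_size() change
         * being discussed; s390 flushes the TLB synchronously, so the
         * rmap can be zapped right here.  Details are assumed.
         */
        static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
                                                  struct page *page,
                                                  unsigned int flags, int page_size)
        {
                if (flags & TLB_ZAP_RMAP)
                        page_zap_pte_rmap(page);
                free_page_and_swap_cache(page);
                return false;
        }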
On Wed, Nov 2, 2022 at 3:31 PM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > Just wondering a bit why you did not also add the > VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like > in the generic change. Heh, I had considered dropping it entirely even from the generic code, since I don't remember seeing that ever trigger, but PeterZ convinced me otherwise. For the s390 side I really wanted to keep things minimal since I (obviously) didn't even built-test it, so.. I'm perfectly happy with s390 people adding it later, of course. Linus
On 31.10.22 19:43, Linus Torvalds wrote: > Updated subject line, and here's the link to the original discussion > for new people: > > https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/ > > On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Ok. At that point we no longer have the pte or the virtual address, so >> it's not going to be exactly the same debug output. >> >> But I think it ends up being fairly natural to do >> >> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page); >> >> instead, and I've fixed that last patch up to do that. > > Ok, so I've got a fixed set of patches based on the feedback from > PeterZ, and also tried to do the s390 updates for this blindly, and > pushed them out into a git branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix > > If people really want to see the patches in email again, I can do > that, but most of you already have, and the changes are either trivial > fixes or the s390 updates. > > For the s390 people that I've now added to the participant list maybe > the git tree is fine - and the fundamental explanation of the problem > is in that top-most commit (with the three preceding commits being > prep-work). Or that link to the thread about this all. > > That top-most commit is also where I tried to fix things up for s390 > that uses its own non-gathering TLB flush due to > CONFIG_MMU_GATHER_NO_GATHER. > > NOTE NOTE NOTE! Unlike my regular git branch, this one may end up > rebased etc for further comments and fixes. So don't consider that > stable, it's still more of an RFC branch. > > At a minimum I'll update it with Ack's etc, assuming I get those, and > my s390 changes are entirely untested and probably won't work. > > As far as I can tell, s390 doesn't actually *have* the problem that > causes this change, because of its synchronous TLB flush, but it > obviously needs to deal with the change of rmap zapping logic. > > Also added a few people who are explicitly listed as being mmu_gather > maintainers. Maybe people saw the discussion on the linux-mm list, but > let's make it explicit. > > Do people have any objections to this approach, or other suggestions? > > I do *not* consider this critical, so it's a "queue for 6.2" issue for me. > > It probably makes most sense to queue in the -MM tree (after the thing > is acked and people agree), but I can keep that branch alive too and > just deal with it all myself as well. > > Anybody? Happy to see that we're still decrementing the mapcount before decrementingthe refcount, I was briefly concerned. I was not able to come up quickly with something that would be fundamentally wrong here, but devil is in the detail. Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON instead of VM_BUG_ON). I agree that 6.2 is good enough and that upstreaming this via the -MM tree would be a good way to move forward.
On Thu, Nov 3, 2022 at 2:52 AM David Hildenbrand <david@redhat.com> wrote: > > Happy to see that we're still decrementing the mapcount before > decrementingthe refcount, I was briefly concerned. Oh, that would have been horribly wrong. > I was not able to come up quickly with something that would be > fundamentally wrong here, but devil is in the detail. So I tried to be very careful. The biggest change in the whole series (visible in last patch, but there in the prep-patches too) is how it narrows down some lock coverage. Now, that locking didn't *do* anything valid, but I did try to point it out when it happens - how first the mapcount is decremented outside the memcg lock (in preparatory patches), and then later on the __dec_lruvec_page_state() turns into a dec_lruvec_page_state() because it's then done outside the page table lock. The locking in the second case did do something - not locking-wise, but simply the "running under the spinlock means we are not preemptable without RT". And in the memcg case it was just plain overly wide lock regions. None of the other changes should have any real semantic meaning *apart* from just keeping ->_mapcount elevated slightly longer. > Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is > unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON > instead of VM_BUG_ON). That VM_BUG_ON() is a case of "what to do if this ever triggers?" So a WARN_ON() would be fatal too, it's some seriously bogus stuff to try to put bits in that won't fit. It probably should be removed, since the value should always be pretty much a simple constant. It was more of a "let's be careful with new code, not for production". Probably like pretty much all VM_BUG_ON's are - test code that just got left around. I considered just getting rid of ENCODE_PAGE_BITS entirely, since there is only one bit. But it was always "let's keep the option open for dirty bits etc", so I kept it, but I agree that the name isn't wonderful. And in fact I wanted the encoding to really be done by the caller (so that TLB_ZAP_RMAP wouldn't be a new argument, but the 'page' argument to __tlb_remove_page_*() would simply be an 'encoded page' pointer, but that would have caused the patch to be much bigger (and expanded the s390 side too). Which I didn't want to do. Long-term that's probably still the right thing to do, including passing the encoded pointers all the way to free_pages_and_swap_cache(). Because it's pretty disgusting how it cleans up that array in place and then does that cast to a new array type, but it is also disgusting how it traverses that array multiple times (ie free_pages_and_swap_cache() will just have *another* loop). But again, those changes would have made the patch bigger, which I didn't want at this point (and 'release_pages()' would need that clean-in-place anyway, unless we changed *that* too and made the whole page encoding be something widely available). That's also why I then went with that fairly generic "ENCODE_PAGE_BITS" name. The *use* of it right now is very very specific to just the TLB gather, and the TLB_ZAP_RMAP bit shows that in the name. But then I went for a fairly generic "encode extra bits in the page pointer" name because it felt like it might expand beyond the initial intentionally small patch in the long run. So it's a combination of "we might want to expand on this in the future" and yet also "I really want to keep the changes localized in this patch". 
And the two are kind of inverses of each other, which hopefully explains the seemingly disparate naming logic. Linus
On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> But again, those changes would have made the patch bigger, which I
> didn't want at this point (and 'release_pages()' would need that
> clean-in-place anyway, unless we changed *that* too and made the whole
> page encoding be something widely available).

And just to clarify: this is not just me trying to expand the reach of my patch.

I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits that TLB_ZAP_RMAP is.

Except the mlock code does *not* show that in the type system, and instead just passes a "struct page **" array around in pvec->pages, and then you'd just better know that "oh, it's not *really* just a page pointer".

So I really think that the "array of encoded page pointers" thing is a generic notion that we *already* have.

It's just that we've done it disgustingly in the past, and I didn't want to do that disgusting thing again.

So I would hope that some day the mlock code would use the same page pointer encoding logic to actually make the whole "this is not a page pointer that you can use directly, it has low bits set for flags" thing very explicit.

I am *not* sure if then the actual encoded bits would be unified. Probably not - you might have very different and distinct uses of the encode_page() thing where the bits mean different things in different contexts.

Anyway, this is me just explaining the thinking behind it all. The page bit encoding is a very generic thing (well, "very generic" in this case means "has at least one other independent user"), explaining the very generic naming.

But at the same time, the particular _patch_ was meant to be very targeted.

So slightly schizophrenic name choices as a result.

              Linus
On 03.11.22 18:09, Linus Torvalds wrote: > On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> But again, those changes would have made the patch bigger, which I >> didn't want at this point (and 'release_pages()' would need that >> clean-in-place anyway, unless we changed *that* too and made the whole >> page encoding be something widely available). > > And just to clarify: this is not just me trying to expand the reach of my patch. > > I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE > and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits > that TLB_ZAP_RMAP is. > > Except the mlock code does *not* show that in the type system, and > instead just passes a "struct page **" array around in pvec->pages, > and then you'd just better know that "oh, it's not *really* just a > page pointer". > > So I really think that the "array of encoded page pointers" thing is a > generic notion that we *already* have. > > It's just that we've done it disgustingly in the past, and I didn't > want to do that disgusting thing again. > > So I would hope that the nasty things that the mlock code would some > day use the same page pointer encoding logic to actually make the > whole "this is not a page pointer that you can use directly, it has > low bits set for flags" very explicit. > > I am *not* sure if then the actual encoded bits would be unified. > Probably not - you might have very different and distinct uses of the > encode_page() thing where the bits mean different things in different > contexts. > > Anyway, this is me just explaining the thinking behind it all. The > page bit encoding is a very generic thing (well, "very generic" in > this case means "has at least one other independent user"), explaining > the very generic naming. > > But at the same time, the particular _patch_ was meant to be very targeted. > > So slightly schizophrenic name choices as a result. Thanks for the explanation. I brought it up because the generic name somehow felt weird in include/asm-generic/tlb.h. Skimming over the code I'd have expected something like TLB_ENCODE_PAGE_BITS, so making the "very generic" things "very specific" as long as it lives in tlb.h :)
On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:

[...]

> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
>
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

I rather have a question about the generic part (had to master the code quoting).

> static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> {
> 	for (unsigned int i = 0; i < nr; i++) {
> 		struct encoded_page *encoded = pages[i];
> 		unsigned int flags = encoded_page_flags(encoded);
> 		if (flags) {
> 			/* Clean the flagged pointer in-place */
> 			struct page *page = encoded_page_ptr(encoded);
> 			pages[i] = encode_page(page, 0);
>
> 			/* The flag bit being set means that we should zap the rmap */

Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390 version? (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling page_zap_pte_rmap() without such a check would be a bug).

> 			page_zap_pte_rmap(page);
> 			VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
> 		}
> 	}
>
> 	/*
> 	 * Now all entries have been un-encoded, and changed to plain
> 	 * page pointers, so we can cast the 'encoded_page' array to
> 	 * a plain page array and free them
> 	 */
> 	free_pages_and_swap_cache((struct page **)pages, nr);
> }

Thanks!
On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev <agordeev@linux.ibm.com> wrote:
>
> I rather have a question about the generic part (had to master the code quoting).

Sure.

Although now I think the series is in Andrew's -mm tree, or just about to get moved in there, so I'm not going to touch my actual branch any more.

> > static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> > {
> > 	for (unsigned int i = 0; i < nr; i++) {
> > 		struct encoded_page *encoded = pages[i];
> > 		unsigned int flags = encoded_page_flags(encoded);
> > 		if (flags) {
> > 			/* Clean the flagged pointer in-place */
> > 			struct page *page = encoded_page_ptr(encoded);
> > 			pages[i] = encode_page(page, 0);
> >
> > 			/* The flag bit being set means that we should zap the rmap */
>
> Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390 version?
> (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling
> page_zap_pte_rmap() without such a check would be a bug).

No major reason.

This is basically the same issue as the naming, which I touched on in

  https://lore.kernel.org/all/CAHk-=wiDg_1up8K4PhK4+kzPN7xJG297=nw+tvgrGn7aVgZdqw@mail.gmail.com/

and the follow-up note about how I hope the "encoded page pointer with flags" thing gets used by the mlock code some day too.

IOW, there's kind of a generic "I have extra flags associated with the pointer", and then the specific "this case uses this flag", and depending on which mindset you have at the time, you might do one or the other.

So in that clean_and_free_pages_and_swap_cache(), the code basically knows "I have a pointer with extra flags", and it's written that way.

And that's partly historical, because it actually started with the original code tracking the dirty bit as the extra piece of information, and then transformed into this "no, the information is TLB_ZAP_RMAP".

So "unsigned int flags" at one point was "bool dirty" instead, but then became more of a "I think this whole page pointer with flags is general", and the naming changed, and I had both cases in mind, and then the code is perhaps not so specifically named.

I'm not sure the zap_page_range() case will ever use more than one flag, but the mlock case already has two different flags. So the "encode_page" thing is very much written to be about more than just the zap_page_range() case.

But yes, that code could (and maybe should) use "if (flags & TLB_ZAP_RMAP)" to make it clear that in this case, the single flags bit is that one bit.

But the "if ()" there actually does two conceptually *separate* things: it needs to clean the pointer in-place (which happens regardless of "which" flag bit is set), and then it does that page_zap_pte_rmap(), which is just for the TLB_ZAP_RMAP bit.

So to be really explicit about it, you'd have two different tests: one for "do I have flags that need to be cleaned up" and then an inner test for each flag. And since there is only one flag in this particular use case, it's essentially that inner test that I dropped as pointless.

In contrast, in the s390 version, that bit was never encoded as a general "flags associated with a page pointer" in the first place, so there was never any such duality. There is only TLB_ZAP_RMAP.

Hope that explains the thinking.

              Linus
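Spelled out the way Linus describes, the "really explicit" two-test form would look roughly like the sketch below. It is derived from the code quoted above with the inner per-flag test added, and is not what the branch actually does (the branch drops the inner test because TLB_ZAP_RMAP is currently the only flag):

        static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
        {
                for (unsigned int i = 0; i < nr; i++) {
                        struct encoded_page *encoded = pages[i];
                        unsigned int flags = encoded_page_flags(encoded);

                        if (flags) {
                                /* Clean the flagged pointer in-place, whatever the flag is */
                                struct page *page = encoded_page_ptr(encoded);
                                pages[i] = encode_page(page, 0);

                                /* ... and then act on each individual flag */
                                if (flags & TLB_ZAP_RMAP) {
                                        page_zap_pte_rmap(page);
                                        VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
                                }
                        }
                }

                free_pages_and_swap_cache((struct page **)pages, nr);
        }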
Adding Stephen to Cc, for next-20221107 alert. Adding Johannes to Cc, particularly for lock_page_memcg discussion. On Fri, 4 Nov 2022, Linus Torvalds wrote: > On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev > <agordeev@linux.ibm.com> wrote: > > > > I rather have a question to the generic part (had to master the code quotting). > > Sure. > > Although now I think the series in in Andrew's -mm tree, or just about > to get moved in there, so I'm not going to touch my actual branch any > more. Linus, we've been exchanging about my mm/rmap.c mods in private mail, I need to raise some points about your mods here in public mail. Four topics - munlock (good), anon_vma (bad), mm-unstable (bad), lock_page_memcg (uncertain). I'm asking Andrew here to back your patchset out of mm-unstable for now, until we have its fixes in: otherwise I'm worried that next-20221107 will waste everyone's time. munlock (good) -------------- You've separated out the munlock_vma_page() from the rest of PTE remove rmap work. Worried me at first, but I think that's fine: bundling it into page_remove_rmap() was mainly so that we wouldn't forget to do it anywhere (and other rmap funcs already took vma arg). Certainly during development, I had been using page mapcount somewhere inside munlock_vma_page(), either for optimization or for sanity check, I forget; but gave up on that as unnecessary complication, and I think it became impossible with the pagevec; so not an issue now. If it were my change, I'd certainly do it by changing the name of the vma arg in page_remove_rmap() from vma to munlock_vma, not doing the munlock when that is NULL, and avoid the need for duplicating code: whereas you're very pleased with cutting out the unnecessary stuff in slimmed-down page_zap_pte_rmap(). Differing tastes (or perhaps you'll say taste versus no taste). anon_vma (bad) -------------- My first reaction to your patchset was, that it wasn't obvious to me that delaying the decrementation of mapcount is safe: I needed time to think about it. But happily, testing saved me from needing much thought. The first problem, came immediately in the first iteration of my huge tmpfs kbuild swapping loads, was BUG at mm/migrate.c:1105! VM_BUG_ON_FOLIO(folio_test_anon(src) && !folio_test_ksm(src) && !anon_vma, src); Okay, that's interesting but not necessarily fatal, we can very easily make it a recoverable condition. I didn't bother to think on that one, just patched around it. The second problem was more significant: came after nearly five hours of load, BUG NULL dereference (address 8) in down_read_trylock() in rmap_walk_anon(), while kswapd was doing a folio_referenced(). That one goes right to the heart of the matter, and instinct had been correct to worry about delaying the decrementation of mapcount, that is, extending the life of page_mapped() or folio_mapped(). See folio_lock_anon_vma_read(): folio_mapped() plays a key role in establishing the continued validity of an anon_vma. See comments above folio_get_anon_vma(), some by me but most by PeterZ IIRC. I believe what has happened is that your patchset has, very intentionally, kept the page as "folio_mapped" until after free_pgtables() does its unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read() that the anon_vma is safe to use when actually it has been freed. (It looked like a page table when I peeped at it.) 
I'm not certain, but I think that you made page_zap_pte_rmap() handle anon as well as file, just for the righteous additional simplification; but I'm afraid that (without opening a huge anon_vma refcounting can of worms) that unification has to be reverted, and anon left to go the same old way it did before. I didn't look into whether reverting one of your patches would achieve that, I just adjusted the code in zap_pte_range() to go the old way for PageAnon; and that has been running successfully, hitting neither BUG, for 15 hours now. mm-unstable (bad) ----------------- Aside from that PageAnon issue, mm-unstable is in an understandably bad state because you could not have foreseen my subpages_mapcount addition to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the PageCompound (but not the "compound") case too. I rushed you and akpm an emergency patch for that on Friday night, but you, let's say, had reservations about it. So I haven't posted it, and while the PageAnon issue remains, I think your patchset has to be removed from mm-unstable and linux-next anyway. What happens to mm-unstable with page_zap_pte_rmap() not handling subpages_mapcount? In my CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y case, I get a "Bad page state" even before reaching first login after boot, "page dumped because: nonzero subpages_mapcount". Yes, I make it worse for myself by having a BUG() in bad_page(); others will limp along with more and more of those "Bad page" messages. lock_page_memcg (uncertain) --------------------------- Johannes, please help! Linus has got quite upset by lock_page_memcg(), its calls from mm/rmap.c anyway, and most particularly by the way in which it is called at the start of page_remove_rmap(), before anyone's critical atomic_add_negative(), yet its use is to guarantee the stability of page memcg while doing the stats updates, done only when atomic_add_negative() says so. I do have one relevant insight on this. It (or its antecedents under other names) date from the days when we did "reparenting" of memcg charges from an LRU: and in those days the lock_page_memcg() before mapcount adjustment was vital, to pair with the uses of folio_mapped() or page_mapped() in mem_cgroup_move_account() - those "mapped" checks are precisely around the stats which the rmap functions affect. But nowadays mem_cgroup_move_account() is only called, with page table lock held, on matching pages found in a task's page table: so its "mapped" checks are redundant - I've sometimes thought in the past of removing them, but held back, because I always have the notion (not hope!) that "reparenting" may one day be brought back from the grave. I'm too out of touch with memcg to know where that notion stands today. I've gone through a multiverse of opinions on those lock_page_memcg()s in the last day: I currently believe that Linus is right, that the lock_page_memcg()s could and should be moved just before the stats updates. But I am not 100% certain of that - is there still some reason why it's important that the page memcg at the instant of the critical mapcount transition be kept unchanged until the stats are updated? I've tried running scenarios through my mind but given up. (And note that the answer might be different with Linus's changes than without them: since he delays the mapcount decrementation until long after pte was removed and page table lock dropped). 
And I do wish that we could settle this lock_page_memcg() question in an entirely separate patch: as it stands, page_zap_pte_rmap() gets to benefit from Linus's insight (or not), and all the other rmap functions carry on with the mis?placed lock_page_memcg() as before. Let's press Send, Hugh
[ Editing down to just the bare-bones problem cases ]

On Sun, Nov 6, 2022 at 1:06 PM Hugh Dickins <hughd@google.com> wrote:
>
> anon_vma (bad)
> --------------
>
> See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
> establishing the continued validity of an anon_vma. See comments
> above folio_get_anon_vma(), some by me but most by PeterZ IIRC.
>
> I believe what has happened is that your patchset has, very intentionally,
> kept the page as "folio_mapped" until after free_pgtables() does its
> unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read()
> that the anon_vma is safe to use when actually it has been freed.
> (It looked like a page table when I peeped at it.)
>
> I'm not certain, but I think that you made page_zap_pte_rmap() handle
> anon as well as file, just for the righteous additional simplification;
> but I'm afraid that (without opening a huge anon_vma refcounting can of
> worms) that unification has to be reverted, and anon left to go the
> same old way it did before.

Indeed. I made them separate initially, just because the only case that mattered for the dirty bit was the file-mapped case.

But then the two functions ended up being basically the identical function, so I unified them again.

But the anonvma lifetime issue looks very real, and so doing the "delay rmap only for file mappings" seems sane. In fact, I wonder if we should delay it only for *dirty* file mappings, since it doesn't matter for the clean case. Hmm.

I already threw away my branch (since Andrew picked the patches up), so a question for Andrew: do you want me to re-do the branch entirely, or do you want me to just send you an incremental patch?

To make for minimal changes, I'd drop the 're-unification' patch, and then small updates to the zap_pte_range() code to keep the anon (and possibly non-dirty) case synchronous.

And btw, this one is interesting: for anonymous (and non-dirty file-mapped) pages, we actually can end up delaying the final page free (and the rmap zapping) all the way to "tlb_finish_mmu()". Normally we still have the vma's all available, but yes, free_pgtables() can and does happen before the final TLB flush.

The file-mapped dirty case doesn't have that issue - not just because it doesn't have an anonvma at all, but because it also does that "force_flush" thing that just means that the page freeing never gets delayed that far in the first place.

> mm-unstable (bad)
> -----------------
> Aside from that PageAnon issue, mm-unstable is in an understandably bad
> state because you could not have foreseen my subpages_mapcount addition
> to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the
> PageCompound (but not the "compound") case too. I rushed you and akpm
> an emergency patch for that on Friday night, but you, let's say, had
> reservations about it. So I haven't posted it, and while the PageAnon
> issue remains, I think your patchset has to be removed from mm-unstable
> and linux-next anyway.

So I think I'm fine with your patch, I just want to move the memcg accounting to outside of it.

I can re-do my series on top of mm-unstable, I guess. That's probably the easiest way to handle this all.

Andrew - can you remove those patches again, and I'll create a new series for you?

              Linus
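The direction sketched in that reply - zap the rmap synchronously for anon pages and only defer it (via the encoded-pointer flag) for file-backed pages - would look roughly like the fragment below inside zap_pte_range(). This is an illustration of the idea under discussion, not the patch that was eventually merged, and the __tlb_remove_page() signature with a flags argument is assumed:

        /*
         * Illustrative zap_pte_range() fragment: only file-backed pages get
         * the deferred rmap treatment; anon pages are zapped right away so
         * that folio_mapped() stops protecting the anon_vma before
         * free_pgtables() runs.
         */
        unsigned int delay_rmap = 0;

        if (!PageAnon(page)) {
                delay_rmap = TLB_ZAP_RMAP;      /* drop the rmap after the TLB flush */
        } else {
                page_zap_pte_rmap(page);        /* anon: drop it synchronously */
        }
        munlock_vma_page(page, vma, false);
        rss[mm_counter(page)]--;
        if (unlikely(__tlb_remove_page(tlb, page, delay_rmap)))
                force_flush = 1;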
On Sun, 6 Nov 2022 14:34:51 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > mm-unstable (bad) > > ----------------- > > Aside from that PageAnon issue, mm-unstable is in an understandably bad > > state because you could not have foreseen my subpages_mapcount addition > > to page_remove_rmap(). page_zap_pte_rmap() now needs to handle the > > PageCompound (but not the "compound") case too. I rushed you and akpm > > an emergency patch for that on Friday night, but you, let's say, had > > reservations about it. So I haven't posted it, and while the PageAnon > > issue remains, I think your patchset has to be removed from mm-unstable > > and linux-next anyway. > > So I think I'm fine with your patch, I just want to move the memcg > accounting to outside of it. > > I can re-do my series on top of mm-unstable, I guess. That's probably > the easiest way to handle this all. > > Andrew - can you remove those patches again, and I'll create a new > series for you? Yes, I've removed both serieses and shall push the tree out within half an hour from now.
Hi all, On Sun, 6 Nov 2022 15:14:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > Yes, I've removed both serieses and shall push the tree out within half > an hour from now. And I have picked up the new version for today's linux-next. Thanks all.
Hi Hugh! On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote: > See folio_lock_anon_vma_read(): folio_mapped() plays a key role in > establishing the continued validity of an anon_vma. See comments > above folio_get_anon_vma(), some by me but most by PeterZ IIRC. Ohhh, you're quite right. Unfortunately I seem to have completely forgotten about that :-(
On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Yes, I've removed both serieses and shall push the tree out within half > an hour from now. Oh, can you please just put Hugh's series back in? I don't love the mapcount changes, but I'm going to re-do my parts so that there is no clash with them. And while I'd much rather see the mapcounts done another way, that's a completely separate issue. Linus
Hello,

On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote:
> lock_page_memcg (uncertain)
> ---------------------------
> Johannes, please help! Linus has got quite upset by lock_page_memcg(),
> its calls from mm/rmap.c anyway, and most particularly by the way
> in which it is called at the start of page_remove_rmap(), before
> anyone's critical atomic_add_negative(), yet its use is to guarantee
> the stability of page memcg while doing the stats updates, done only
> when atomic_add_negative() says so.

As you mentioned, the pte lock historically wasn't always taken on the move side. And so the reason lock_page_memcg() covers the mapcount update is that move_account() needs to atomically see either a) the page is mapped and counted, or b) unmapped and uncounted. If we lock after the mapcount dec, move_account() could miss pending updates that need to be transferred, and the scheme breaks thusly:

  memcg1->nr_mapped = 1
  memcg2->nr_mapped = 0

  page_remove_rmap:                     mem_cgroup_move_account():
    if atomic_add_negative(page->mapcount):
                                          lock_page_memcg()
                                          if page->mapcount:    // NOT TAKEN
                                            memcg1->nr_mapped--
                                            memcg2->nr_mapped++
                                          page->memcg = memcg2
                                          unlock_page_memcg()
      lock_page_memcg()
      page->memcg->nr_mapped--            // UNDERFLOW memcg2->nr_mapped
      unlock_page_memcg()

> I do have one relevant insight on this. It (or its antecedents under
> other names) date from the days when we did "reparenting" of memcg
> charges from an LRU: and in those days the lock_page_memcg() before
> mapcount adjustment was vital, to pair with the uses of folio_mapped()
> or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
> are precisely around the stats which the rmap functions affect.
>
> But nowadays mem_cgroup_move_account() is only called, with page table
> lock held, on matching pages found in a task's page table: so its
> "mapped" checks are redundant - I've sometimes thought in the past of
> removing them, but held back, because I always have the notion (not
> hope!) that "reparenting" may one day be brought back from the grave.
> I'm too out of touch with memcg to know where that notion stands today.
>
> I've gone through a multiverse of opinions on those lock_page_memcg()s
> in the last day: I currently believe that Linus is right, that the
> lock_page_memcg()s could and should be moved just before the stats
> updates. But I am not 100% certain of that - is there still some
> reason why it's important that the page memcg at the instant of the
> critical mapcount transition be kept unchanged until the stats are
> updated? I've tried running scenarios through my mind but given up.

Okay, I think there are two options.

- If we don't want to codify the pte lock requirement on the move side, then moving the lock_page_memcg() like that would break the locking scheme, as per above.

- If we DO want to codify the pte lock requirement, we should just remove the lock_page_memcg() altogether, as it's fully redundant.

I'm leaning toward the second option. If somebody brings back reparenting they can bring the lock back along with it.

[ If it's even still necessary by then. It's conceivable reparenting is brought back only for cgroup2, where the race wouldn't matter because of the hierarchical stats. The reparenting side wouldn't have to move page state to the parent - it's already there. And whether rmap would see the dying child or the parent doesn't matter much either: the parent will see the update anyway, directly or recursively, and we likely don't care to balance the books on a dying cgroup.
It's then just a matter of lifetime - which should be guaranteed also, as long as the pte lock prevents an RCU quiescent state. ] So how about something like below? UNTESTED, just for illustration. This is cgroup1 code, which I haven't looked at too closely in a while. If you can't spot an immediate hole in it, I'd go ahead and test it and send a proper patch. --- From 88a32b1b5737630fb981114f6333d8fd057bd8e9 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Mon, 7 Nov 2022 12:05:09 -0500 Subject: [PATCH] mm: remove lock_page_memcg() from rmap rmap changes (mapping and unmapping) of a page currently take lock_page_memcg() to serialize 1) update of the mapcount and the cgroup mapped counter with 2) cgroup moving the page and updating the old cgroup and the new cgroup counters based on page_mapped(). Before b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), we used to reassign all pages that could be found on a cgroup's LRU list on deletion - something that rmap didn't naturally serialize against. Since that commit, however, the only pages that get moved are those mapped into page tables of a task that's being migrated. In that case, the pte lock is always held (and we know the page is mapped), which keeps rmap changes at bay already. The additional lock_page_memcg() by rmap is redundant. Remove it. NOT-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/memcontrol.c | 27 ++++++++++++--------------- mm/rmap.c | 30 ++++++++++++------------------ 2 files changed, 24 insertions(+), 33 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2d8549ae1b30..f7716e9038e9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5666,7 +5666,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * - * The caller must make sure the page is not on LRU (isolate_page() is useful.) + * This function acquires folio_lock() and folio_lock_memcg(). The + * caller must exclude all other possible ways of accessing + * page->memcg, such as LRU isolation (to lock out isolation) and + * having the page mapped and pte-locked (to lock out rmap). * * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" * from old cgroup. 
@@ -5685,6 +5688,7 @@ static int mem_cgroup_move_account(struct page *page, VM_BUG_ON(from == to); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); VM_BUG_ON(compound && !folio_test_large(folio)); + VM_WARN_ON_ONCE(!folio_mapped(folio)); /* * Prevent mem_cgroup_migrate() from looking at @@ -5705,30 +5709,23 @@ static int mem_cgroup_move_account(struct page *page, folio_memcg_lock(folio); if (folio_test_anon(folio)) { - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); - if (folio_test_transhuge(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_THPS, - -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_THPS, - nr_pages); - } + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); + if (folio_test_transhuge(folio)) { + __mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages); } } else { __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); + __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); if (folio_test_swapbacked(folio)) { __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages); __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages); } - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); - } - if (folio_test_dirty(folio)) { struct address_space *mapping = folio_mapping(folio); diff --git a/mm/rmap.c b/mm/rmap.c index 2ec925e5fa6a..60c31375f274 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1197,11 +1197,6 @@ void page_add_anon_rmap(struct page *page, bool compound = flags & RMAP_COMPOUND; bool first; - if (unlikely(PageKsm(page))) - lock_page_memcg(page); - else - VM_BUG_ON_PAGE(!PageLocked(page), page); - if (compound) { atomic_t *mapcount; VM_BUG_ON_PAGE(!PageLocked(page), page); @@ -1217,19 +1212,19 @@ void page_add_anon_rmap(struct page *page, if (first) { int nr = compound ? thp_nr_pages(page) : 1; /* - * We use the irq-unsafe __{inc|mod}_zone_page_stat because - * these counters are not modified in interrupt context, and - * pte lock(a spinlock) is held, which implies preemption - * disabled. + * We use the irq-unsafe __{inc|mod}_zone_page_stat + * because these counters are not modified in + * interrupt context, and pte lock(a spinlock) is + * held, which implies preemption disabled. + * + * The pte lock also stabilizes page->memcg wrt + * mem_cgroup_move_account(). 
*/ if (compound) __mod_lruvec_page_state(page, NR_ANON_THPS, nr); __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr); } - if (unlikely(PageKsm(page))) - unlock_page_memcg(page); - /* address might be in next vma when migration races vma_adjust */ else if (first) __page_set_anon_rmap(page, vma, address, @@ -1290,7 +1285,6 @@ void page_add_file_rmap(struct page *page, int i, nr = 0; VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); - lock_page_memcg(page); if (compound && PageTransHuge(page)) { int nr_pages = thp_nr_pages(page); @@ -1311,6 +1305,7 @@ void page_add_file_rmap(struct page *page, if (nr == nr_pages && PageDoubleMap(page)) ClearPageDoubleMap(page); + /* The pte lock stabilizes page->memcg */ if (PageSwapBacked(page)) __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, nr_pages); @@ -1328,7 +1323,6 @@ void page_add_file_rmap(struct page *page, out: if (nr) __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); - unlock_page_memcg(page); mlock_vma_page(page, vma, compound); } @@ -1356,6 +1350,7 @@ static void page_remove_file_rmap(struct page *page, bool compound) } if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) goto out; + /* The pte lock stabilizes page->memcg */ if (PageSwapBacked(page)) __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, -nr_pages); @@ -1423,8 +1418,6 @@ static void page_remove_anon_compound_rmap(struct page *page) void page_remove_rmap(struct page *page, struct vm_area_struct *vma, bool compound) { - lock_page_memcg(page); - if (!PageAnon(page)) { page_remove_file_rmap(page, compound); goto out; @@ -1443,6 +1436,9 @@ void page_remove_rmap(struct page *page, * We use the irq-unsafe __{inc|mod}_zone_page_stat because * these counters are not modified in interrupt context, and * pte lock(a spinlock) is held, which implies preemption disabled. + * + * The pte lock also stabilizes page->memcg wrt + * mem_cgroup_move_account(). */ __dec_lruvec_page_state(page, NR_ANON_MAPPED); @@ -1459,8 +1455,6 @@ void page_remove_rmap(struct page *page, * faster for those pages still in swapcache. */ out: - unlock_page_memcg(page); - munlock_vma_page(page, vma, compound); }
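The serialization argument behind the patch above can be reduced to a toy model: once both the rmap update and the charge move run under the same PTE lock, neither can observe the other mid-update, which is what makes the extra lock_page_memcg() redundant. The snippet below is only an illustration of that claim - the types, names and pthread lock are invented stand-ins, not anything in the kernel - and it ignores compound pages and the real stat machinery entirely.

#include <pthread.h>

/* Toy model: "ptl" stands in for the page table lock that both the
 * unmap path and mem_cgroup_move_account() hold in the codified scheme. */
struct toy_page {
	pthread_mutex_t	ptl;
	int		mapcount;
	int		*memcg_nr_mapped;	/* mapped-page stat of the owning memcg */
};

static void toy_remove_rmap(struct toy_page *page)
{
	pthread_mutex_lock(&page->ptl);
	page->mapcount--;			/* atomic_add_negative() in reality */
	if (page->mapcount == 0)
		(*page->memcg_nr_mapped)--;	/* page's memcg is stable here */
	pthread_mutex_unlock(&page->ptl);
}

static void toy_move_account(struct toy_page *page, int *to_nr_mapped)
{
	pthread_mutex_lock(&page->ptl);		/* move side now always holds ptl */
	/* The page is mapped and the rmap side is excluded, so the mapped
	 * counter can be transferred unconditionally - no folio_mapped()
	 * check and no lock_page_memcg() needed. */
	(*page->memcg_nr_mapped)--;
	(*to_nr_mapped)++;
	page->memcg_nr_mapped = to_nr_mapped;
	pthread_mutex_unlock(&page->ptl);
}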
On Mon, Nov 7, 2022 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> - If we DO want to codify the pte lock requirement, we should just
> remove the lock_page_memcg() altogether, as it's fully redundant.
>
> I'm leaning toward that second option.

The thing is, that's very much the case we do *not* want.

We need to delay the rmap removal until at least after the TLB flush. At least for dirty file-mapped pages - because the page cleaning needs to see that they exist as mapped entities until all CPUs have *actually* dropped them.

Now, we do the TLB flush still under the page table lock, so we could still then do the rmap removal before dropping the lock. But it would be much cleaner from the TLB flushing standpoint to delay it until the page freeing, which ends up being delayed until after the lock is dropped.

That said, if always doing the rmap removal under the page table lock means that that memcg lock can just be deleted in that whole path, I will certainly bow to _that_ simplification instead, and just handle the dirty pages after the TLB flush but before the page table drop.

Linus
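Concretely, the alternative described here keeps the rmap removal inside the PTE-locked section, just moved to after the TLB flush. The outline below is hypothetical - it names no real helpers beyond the ones already mentioned in the thread and is not the eventual patch - but it shows where each step would sit.

/* Hypothetical outline only - not the actual zap_pte_range() code. */
static void zap_pte_range_outline(void)
{
	/* pte_offset_map_lock()                                         */
	/*   1. clear each PTE (ptep_get_and_clear_full) and batch up    */
	/*      the pages that were mapped there                         */
	/*   2. flush the TLB while still holding the PTE lock           */
	/*   3. page_remove_rmap() + dirty accounting for the batch:     */
	/*      a dirty file page stays "mapped" until no CPU can still  */
	/*      write through a stale TLB entry                          */
	/* pte_unmap_unlock()                                            */
	/*   4. the actual page frees can still be deferred, e.g. to     */
	/*      tlb_finish_mmu(), since they no longer need rmap state   */
}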
On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Yes, I've removed both serieses and shall push the tree out within half > > an hour from now. > > Oh, can you please just put Hugh's series back in? > Done, all pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable.
Hi all, On Mon, 7 Nov 2022 15:02:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Done, all pushed out to > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable. Will be in today's linux-next.
On Mon, Nov 7, 2022 at 12:29 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That said, if always doing the rmap removal under the page table lock > means that that memcg lock can just be deleted in that whole path, I > will certainly bow to _that_ simplification instead, and just handle > the dirty pages after the TLB flush but before the page table drop. Ok, so I think I have a fairly clean way to do this. Let me try to make that series look reasonable, although it might be until tomorrow. I'll need to massage my mess into not just prettier code, but a sane history. Linus
On Mon, Nov 7, 2022 at 3:47 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Ok, so I think I have a fairly clean way to do this.
>
> Let me try to make that series look reasonable, although it might be
> until tomorrow. I'll need to massage my mess into not just prettier
> code, but a sane history.

Ugh. Ok, so massaging it into a saner form and splitting it out into a pretty history took a lot longer than writing the initial ugly "something like this".

Anyway, the end result looks very different from the previous series, since I very consciously tried to keep away from rmap changes to not clash with Hugh's work, and also made sure we still do the page_remove_rmap() under the page table lock, just *after* the TLB flush.

It does share some basic stuff - I'm still using that "struct encoded_page" thing, I just moved it into a separate commit, and moved it to where it conceptually belongs (I'd love to say that I also made the mlock.c code use it, but I didn't - that 'pagevec' thing is a pretty pointless abstraction and I didn't want to fight it).

So you'll see some familiar support structures and scaffolding, but the actual approach to zap_pte_range() is very different.

The approach taken to the "s390 is very different" issue is also completely different. It should actually be better for s390: we used to cause that "force_flush" whenever we hit a dirty shared page, and s390 doesn't care. The new model makes that "s390 doesn't care" be part of the whole design, so now s390 treats dirty shared pages basically as if they weren't anything special at all. Which they aren't on s390.

But, as with the previous case, I don't even try to cross-compile for s390, so my attempts at handling s390 well are just that: attempts. The code may be broken. Of course, the code may be broken on x86-64 too, but at least there I've compiled it and am running it right now.

Oh, and because I copied large parts of the commit message from the previous approach (because the problem description is the same), I noticed that I also kept the "Acked-by:". Those are bogus, because the code is sufficiently different that any previous acks are just not valid any more, but I just hadn't fixed that yet.

The meat of it all is in that last patch, the rest is just type system cleanups to prep for it. But it was also that last patch that I spent hours on just tweaking it to look sensible. The *code* was pretty easy. Making it have sensible naming and a sensible abstraction interface that worked well for s390 too, that was 90% of the effort. But also I hope the end result is reasonably easy to review as a result.

I'm sending this out because I'm stepping away from the keyboard, because that whole "let's massage it into something legible" was really somewhat exhausting. You don't see all the small side turns it took only to go "that's ugly, let's try again" ;)

Anybody interested in giving this a peek?

(Patch 2/4 might make some people pause. It's fairly small and simple. It's effective and makes it easy to do some of the later changes. And it's also quite different from our usual model. It was "inspired" by the signed-vs-unsigned char thread from a few weeks ago. But patch 4/4 is the one that matters).

Linus
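The "struct encoded_page" thing mentioned above is, going only by the description in this thread, a pointer-tagging trick: page pointers are aligned, so the low bit(s) are free to carry per-entry information (such as "rmap removal still pending") through the existing TLB batch arrays without growing them. A minimal sketch of that idea follows - the helper names and the single-bit layout are assumptions for illustration, not the series' actual API.

struct page;		/* opaque for this sketch */
struct encoded_page;	/* only ever passed around, never dereferenced */

#define ENCODED_PAGE_BITS	1ul	/* assumed: one flag bit in the low bits */

static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
	/* page pointers are at least word-aligned, so the low bit is free */
	return (struct encoded_page *)((unsigned long)page | (flags & ENCODED_PAGE_BITS));
}

static inline unsigned long encoded_page_flags(const struct encoded_page *ep)
{
	return (unsigned long)ep & ENCODED_PAGE_BITS;
}

static inline struct page *encoded_page_ptr(const struct encoded_page *ep)
{
	return (struct page *)((unsigned long)ep & ~ENCODED_PAGE_BITS);
}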
On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I'm sending this out because I'm stepping away from the keyboard,
> because that whole "let's massage it into something legible" was
> really somewhat exhausting. You don't see all the small side turns it
> took only to go "that's ugly, let's try again" ;)

Ok, I actually sent the individual patches with 'git-send-email', although I only sent them to the mailing list and to people that were mentioned in the commit descriptions.

I hope that makes review easier.

See

  https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org

for the series if you weren't mentioned and are interested.

Oh, and because I decided to just use the email in this thread as the reference and cover letter, it turns out that this all confuses 'b4', because it actually walks up the whole thread all the way to the original 13-patch series by PeterZ that started this whole discussion.

I've seen that before with other people's patch series, but now that it happened to my own, I'm cc'ing Konstantin here too to see if there's some magic for b4 to say "look, I pointed you to a msg-id that is clearly a new series, don't walk all the way up and then take patches from a completely different one".

Oh well. I guess I should just have not been lazy and done a cover-letter and a whole new thread. My bad.

Konstantin, please help me look like less of a tool?

Linus
On Tue, Nov 08, 2022 at 11:56:13AM -0800, Linus Torvalds wrote:
> Oh, and because I decided to just use the email in this thread as the
> reference and cover letter, it turns out that this all confuses 'b4',
> because it actually walks up the whole thread all the way to the
> original 13-patch series by PeterZ that started this whole discussion.
>
> I've seen that before with other people's patch series, but now that it
> happened to my own, I'm cc'ing Konstantin here too to see if there's
> some magic for b4 to say "look, I pointed you to a msg-id that is
> clearly a new series, don't walk all the way up and then take patches
> from a completely different one".

Yes, --no-parent.

It's slightly more complicated in your case because the patches aren't threaded to the first patch/cover letter, but you can choose an arbitrary msgid upthread and tell b4 to ignore anything that came before it. E.g.:

  b4 am -o/tmp --no-parent 20221108194139.57604-1-torvalds@linux-foundation.org

-K
On Tue, Nov 8, 2022 at 12:03 PM Konstantin Ryabitsev <mricon@kernel.org> wrote:
>
> Yes, --no-parent.

Ahh, that's new. I guess I need to update my ancient b4-0.8.0 install..

But yes, with that, and the manual parent lookup (because otherwise "--no-parent" will fetch *just* the patch itself, not even walking up the single parent chain), it works. Maybe a "--single-parent" or "--deadbeat-parent" option would be a good idea?

Anyway, with a more recent b4 version, the command

  b4 am --no-parent CAHk-=wh6MxaCA4pXpt1F5Bn2__6MxCq0Dr-rES4i=MOL9ibjpg@mail.gmail.com

gets that series and only that series.

Linus
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
 /*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks. For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
+ * For walking the pagetables without holding any locks. Some architectures
+ * (eg x86-32 PAE) cannot load the entries atomically without using expensive
+ * instructions. We are guaranteed that a PTE will only either go from not
+ * present to present, or present to not present -- it will not switch to a
+ * completely different present page without a TLB flush inbetween; which we
+ * are blocking by holding interrupts off.
  *
  * Setting ptes from not present to present goes:
  *