Message ID | 28eb289f-ea2c-8eb9-63bb-9f7d7b9ccc11@google.com
---|---
State | New
Series | mm: free retracted page table by RCU
Commit Message
Hugh Dickins
May 29, 2023, 6:20 a.m. UTC
Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
arch/powerpc/include/asm/pgalloc.h | 4 ++++
arch/powerpc/mm/pgtable-frag.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+)
Comments
On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	call_rcu(&page->rcu_head, pte_free_now);
> +}

This can't be safe (on ppc).  IIRC you might have up to 16x4k page
tables sharing one 64kB page.  So if you have two page tables from the
same page being defer-freed simultaneously, you'll reuse the rcu_head
and I cannot imagine things go well from that point.

I have no idea how to solve this problem.
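Context for the numbers above: powerpc's pte-fragment allocator carves one
struct page into PTE_FRAG_NR separate page tables, which therefore all share
that page's single rcu_head.  A rough sketch of the arithmetic for the common
64K-page configuration (the exact constants live in the powerpc headers; the
values here simply restate Matthew's "16x4k" figure):

	/* one 64kB struct page backs PTE_FRAG_NR distinct page tables */
	#define PTE_FRAG_SIZE_SHIFT	12	/* 4kB of PTEs per page table */
	#define PTE_FRAG_SIZE		(1UL << PTE_FRAG_SIZE_SHIFT)
	#define PTE_FRAG_NR		(PAGE_SIZE >> PTE_FRAG_SIZE_SHIFT)	/* 64k / 4k = 16 */

A second pte_free_defer() on any of those fragments, before the first grace
period has ended, would queue the already-queued rcu_head again.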
On Mon, 29 May 2023, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	call_rcu(&page->rcu_head, pte_free_now);
> > +}
>
> This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> tables sharing one 64kB page.  So if you have two page tables from the
> same page being defer-freed simultaneously, you'll reuse the rcu_head
> and I cannot imagine things go well from that point.

Oh yes, of course, thanks for catching that so quickly.
So my s390 and sparc implementations will be equally broken.

>
> I have no idea how to solve this problem.

I do: I'll have to go back to the more complicated implementation we
actually ran with on powerpc - I was thinking those complications just
related to deposit/withdraw matters, forgetting the one-rcu_head issue.

It uses large (0x10000) increments of the page refcount, avoiding
call_rcu() when already active.

It's not a complication I had wanted to explain or test for now,
but we shall have to.  Should apply equally well to sparc, but s390
more of a problem, since s390 already has its own refcount cleverness.

Thanks, I must dash, out much of the day.
Hugh
On Mon, 29 May 2023 07:36:40 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:
> On Mon, 29 May 2023, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	page = virt_to_page(pgtable);
> > > +	call_rcu(&page->rcu_head, pte_free_now);
> > > +}
> >
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.
>
> Oh yes, of course, thanks for catching that so quickly.
> So my s390 and sparc implementations will be equally broken.
>
> >
> > I have no idea how to solve this problem.
>
> I do: I'll have to go back to the more complicated implementation we
> actually ran with on powerpc - I was thinking those complications just
> related to deposit/withdraw matters, forgetting the one-rcu_head issue.
>
> It uses large (0x10000) increments of the page refcount, avoiding
> call_rcu() when already active.
>
> It's not a complication I had wanted to explain or test for now,
> but we shall have to.  Should apply equally well to sparc, but s390
> more of a problem, since s390 already has its own refcount cleverness.

Yes, we have 2 pagetables in one 4K page, which could result in same
rcu_head reuse. It might be possible to use the cleverness from our
page_table_free() function, e.g. to only do the call_rcu() once, for
the case where both 2K pagetable fragments become unused, similar to
how we decide when to actually call __free_page().

However, it might be much worse, and page->rcu_head from a pagetable
page cannot be used at all for s390, because we also use page->lru
to keep our list of free 2K pagetable fragments. I always get confused
by struct page unions, so not completely sure, but it seems to me that
page->rcu_head would overlay with page->lru, right?
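Gerald's question turns on the struct page unions.  A heavily abridged sketch
of the relevant part of include/linux/mm_types.h (kernels of this era; most
fields and variants trimmed) shows that rcu_head does share storage with lru,
while the pagetable variant keeps pt_frag_refcount and the ptl in the same
space:

	struct page {
		unsigned long flags;
		union {
			struct {		/* page cache / anon; lru also reused for lists */
				struct list_head lru;
				/* ... */
			};
			struct {		/* page table pages */
				unsigned long _pt_pad_1;	/* overlays compound_head */
				pgtable_t pmd_huge_pte;		/* THP deposit, protected by ptl */
				unsigned long _pt_pad_2;	/* overlays mapping */
				union {
					struct mm_struct *pt_mm;	/* x86 pgds only */
					atomic_t pt_frag_refcount;	/* powerpc */
				};
				/* spinlock_t *ptl or spinlock_t ptl */
			};
			struct rcu_head rcu_head;	/* overlays lru (and so compound_head) */
			/* ... other variants omitted ... */
		};
		/* ... */
	};

So for s390, which strings its free 2K fragments onto page->lru, queueing
page->rcu_head would indeed scribble on that list - as Hugh confirms below.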
On Thu, 1 Jun 2023, Gerald Schaefer wrote:
> On Mon, 29 May 2023 07:36:40 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 29 May 2023, Matthew Wilcox wrote:
> > > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = virt_to_page(pgtable);
> > > > +	call_rcu(&page->rcu_head, pte_free_now);
> > > > +}
> > >
> > > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > > tables sharing one 64kB page.  So if you have two page tables from the
> > > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > > and I cannot imagine things go well from that point.
> >
> > Oh yes, of course, thanks for catching that so quickly.
> > So my s390 and sparc implementations will be equally broken.
> >
> > >
> > > I have no idea how to solve this problem.
> >
> > I do: I'll have to go back to the more complicated implementation we
> > actually ran with on powerpc - I was thinking those complications just
> > related to deposit/withdraw matters, forgetting the one-rcu_head issue.
> >
> > It uses large (0x10000) increments of the page refcount, avoiding
> > call_rcu() when already active.
> >
> > It's not a complication I had wanted to explain or test for now,
> > but we shall have to.  Should apply equally well to sparc, but s390
> > more of a problem, since s390 already has its own refcount cleverness.
>
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().

Yes, I expect that it will be possible to mesh in with s390's cleverness
there; but I may not be clever enough to do so myself - it was easier to
get right by going my own way - except that the multiply-used rcu_head
soon showed that I'd not got it right at all :-(

>
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

However, I believe you are right that it's worse.

I'm glad to hear that you get confused by the struct page unions, me too,
I preferred the old pre-union days when we could see at a glance which
fields overlaid.  (Perhaps I'm nostalgically exaggerating that "see at a
glance" ease.)

But I think I remember the discussions when rcu_head, and compound_head
at lru.next, came in: with the agreement that rcu_head.next would at
least be 2-aligned to avoid PageTail - ah, it's even commented in the
fundamental include/linux/types.h.

Sigh.  I don't at this moment know what to do for s390: it is frustrating
to be held up by just the one architecture.  But big thanks to you,
Gerald, for bringing this to light.

Hugh
On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	call_rcu(&page->rcu_head, pte_free_now);
> > +}
>
> This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> tables sharing one 64kB page.  So if you have two page tables from the
> same page being defer-freed simultaneously, you'll reuse the rcu_head
> and I cannot imagine things go well from that point.
>
> I have no idea how to solve this problem.

Maybe power and s390 should allocate a side structure, sort of a
pre-memdesc thing to store enough extra data?

If we can get enough bytes then something like this would let a single
rcu head be shared to manage the free bits.

struct 64k_page {
	u8 free_pages;
	u8 pending_rcu_free_pages;
	struct rcu_head head;
}

free_sub_page(sub_id)
	if (atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
		call_rcu(&64k_page->head)

rcu_func()
	64k_page->free_pages |= atomic_xchg(0, &64k_page->pending_rcu_free_pages)

	if (64k_pages->free_pages == all_ones)
		free_pgea(64k_page);

Jason
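A compilable sketch of the side-structure idea above (struct pt_desc, its
allocation and the function names are all hypothetical - nothing like this
exists in the tree, and the kernel-vs-user dtor distinction is omitted).
The point is only that one shared rcu_head suffices if a pending bitmask
decides who queues it; note the queueing condition is the 0 -> nonzero
transition, i.e. the inverse of the pseudocode's test:

	struct pt_desc {			/* hypothetical side structure */
		atomic_t free_frags;		/* count of fragments already freed */
		atomic_t pending_rcu_frags;	/* bitmask of fragments awaiting RCU */
		struct rcu_head rcu;
		struct page *page;
	};

	static void pt_desc_rcu(struct rcu_head *head)
	{
		struct pt_desc *pt = container_of(head, struct pt_desc, rcu);
		unsigned int moved = atomic_xchg(&pt->pending_rcu_frags, 0);

		/*
		 * Any fragment freed after this xchg sees an empty mask and
		 * queues the (now idle) rcu_head itself: never requeue here.
		 */
		if (atomic_add_return(hweight32(moved), &pt->free_frags) == PTE_FRAG_NR) {
			__free_page(pt->page);
			kfree(pt);
		}
	}

	static void pt_desc_free_fragment(struct pt_desc *pt, unsigned int frag)
	{
		/* only the 0 -> nonzero transition queues the shared rcu_head */
		if (!atomic_fetch_or(1U << frag, &pt->pending_rcu_frags))
			call_rcu(&pt->rcu, pt_desc_rcu);
	}

Hugh's objection below is not to these mechanics, but to paying an extra
allocation per pagetable page for them.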
On Fri, 2 Jun 2023, Jason Gunthorpe wrote:
> On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	page = virt_to_page(pgtable);
> > > +	call_rcu(&page->rcu_head, pte_free_now);
> > > +}
> >
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.
> >
> > I have no idea how to solve this problem.
>
> Maybe power and s390 should allocate a side structure, sort of a
> pre-memdesc thing to store enough extra data?
>
> If we can get enough bytes then something like this would let a single
> rcu head be shared to manage the free bits.
>
> struct 64k_page {
> 	u8 free_pages;
> 	u8 pending_rcu_free_pages;
> 	struct rcu_head head;
> }
>
> free_sub_page(sub_id)
> 	if (atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
> 		call_rcu(&64k_page->head)
>
> rcu_func()
> 	64k_page->free_pages |= atomic_xchg(0, &64k_page->pending_rcu_free_pages)
>
> 	if (64k_pages->free_pages == all_ones)
> 		free_pgea(64k_page);

Or simply allocate as many rcu_heads as page tables.

I have not thought through your suggestion above, because I'm against
asking s390, or any other architecture, to degrade its page table
implementation by demanding more memory, just for the sake of my patch
series.  In a future memdesc world it might turn out to be reasonable,
but not for this (if I can possibly avoid it).

Below is what I believe to be the correct powerpc patch (built but not
retested).  sparc I thought was going to be an equal problem, but turns
out not: I'll comment on 06/12.  And let's move s390 discussion to 07/12.

[PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 +++
 arch/powerpc/mm/pgtable-frag.c     | 51 ++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
 		__free_page(page);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+	int refcount;
+
+	page = container_of(head, struct page, rcu_head);
+	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+				     &page->pt_frag_refcount);
+	if (refcount < PTE_FREE_DEFERRED) {
+		pte_fragment_free((unsigned long *)page_address(page), 0);
+		return;
+	}
+	/*
+	 * One page may be shared between PTE_FRAG_NR pagetables.
+	 * At least one more call to pte_free_defer() came in while we
+	 * were already deferring, so the free must be deferred again;
+	 * but just for one grace period, however many calls came in.
+	 */
+	while (refcount >= PTE_FREE_DEFERRED + PTE_FREE_DEFERRED) {
+		refcount = atomic_sub_return(PTE_FREE_DEFERRED,
+					     &page->pt_frag_refcount);
+	}
+	/* Remove that refcount of 1 left for fragment freeing above */
+	atomic_dec(&page->pt_frag_refcount);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	/*
+	 * One page may be shared between PTE_FRAG_NR pagetables: only queue
+	 * it once for freeing, but note whenever the free must be deferred.
+	 *
+	 * (This would be much simpler if the struct page had an rcu_head for
+	 * each fragment, or if we could allocate a separate array for that.)
+	 *
+	 * Convert our refcount of 1 to a refcount of PTE_FREE_DEFERRED, and
+	 * proceed to call_rcu() only when the rcu_head is not already in use.
+	 */
+	if (atomic_add_return(PTE_FREE_DEFERRED - 1, &page->pt_frag_refcount) <
+			      PTE_FREE_DEFERRED + PTE_FREE_DEFERRED)
+		call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
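To see how the PTE_FREE_DEFERRED accounting above behaves, here is one
interleaving worked through by hand (a reading aid only, not part of the
patch): a page whose two remaining fragments are both defer-freed before
the first grace period ends, so pt_frag_refcount starts at 2.

	pte_free_defer(A):  2       + 0xffff -> 0x10001, < 0x20000: call_rcu()
	pte_free_defer(B):  0x10001 + 0xffff -> 0x20000, not < 0x20000:
	                    rcu_head already in use, no second call_rcu()

	1st pte_free_now(): 0x20000 - 0xffff -> 0x10001, still >= 0x10000:
	                    B's deferral is outstanding; atomic_dec -> 0x10000,
	                    call_rcu() again on the now-idle rcu_head
	2nd pte_free_now(): 0x10000 - 0xffff -> 1, < 0x10000:
	                    pte_fragment_free() drops it to 0 and frees the page

However many deferrals pile up, only one extra grace period is added: the
while loop collapses surplus multiples of PTE_FREE_DEFERRED before re-queueing.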
On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:

> diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> index 20652daa1d7e..e4f58c5fc2ac 100644
> --- a/arch/powerpc/mm/pgtable-frag.c
> +++ b/arch/powerpc/mm/pgtable-frag.c
> @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
>  		__free_page(page);
>  	}
>  }
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> +
> +static void pte_free_now(struct rcu_head *head)
> +{
> +	struct page *page;
> +	int refcount;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> +				     &page->pt_frag_refcount);
> +	if (refcount < PTE_FREE_DEFERRED) {
> +		pte_fragment_free((unsigned long *)page_address(page), 0);
> +		return;
> +	}

From what I can tell power doesn't recycle the sub fragment into any
kind of free list. It just waits for the last fragment to be unused
and then frees the whole page.

So why not simply go into pte_fragment_free() and do the call_rcu directly:

	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
		if (!kernel)
			pgtable_pte_page_dtor(page);
		call_rcu(&page->rcu_head, free_page_rcu)

?

Jason
On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
>
> > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > index 20652daa1d7e..e4f58c5fc2ac 100644
> > --- a/arch/powerpc/mm/pgtable-frag.c
> > +++ b/arch/powerpc/mm/pgtable-frag.c
> > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> >  		__free_page(page);
> >  	}
> >  }
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > +
> > +static void pte_free_now(struct rcu_head *head)
> > +{
> > +	struct page *page;
> > +	int refcount;
> > +
> > +	page = container_of(head, struct page, rcu_head);
> > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > +				     &page->pt_frag_refcount);
> > +	if (refcount < PTE_FREE_DEFERRED) {
> > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > +		return;
> > +	}
>
> From what I can tell power doesn't recycle the sub fragment into any
> kind of free list. It just waits for the last fragment to be unused
> and then frees the whole page.
>
> So why not simply go into pte_fragment_free() and do the call_rcu directly:
>
> 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> 		if (!kernel)
> 			pgtable_pte_page_dtor(page);
> 		call_rcu(&page->rcu_head, free_page_rcu)

We need to be careful on the lock being freed in pgtable_pte_page_dtor(),
in Hugh's series IIUC we need the spinlock being there for the rcu section
alongside the page itself.  So even if to do so we'll need to also rcu call
pgtable_pte_page_dtor() when needed.
On Tue, Jun 06, 2023 at 03:03:31PM -0400, Peter Xu wrote:
> On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> >
> > > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > > index 20652daa1d7e..e4f58c5fc2ac 100644
> > > --- a/arch/powerpc/mm/pgtable-frag.c
> > > +++ b/arch/powerpc/mm/pgtable-frag.c
> > > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> > >  		__free_page(page);
> > >  	}
> > >  }
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > > +
> > > +static void pte_free_now(struct rcu_head *head)
> > > +{
> > > +	struct page *page;
> > > +	int refcount;
> > > +
> > > +	page = container_of(head, struct page, rcu_head);
> > > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > > +				     &page->pt_frag_refcount);
> > > +	if (refcount < PTE_FREE_DEFERRED) {
> > > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > > +		return;
> > > +	}
> >
> > From what I can tell power doesn't recycle the sub fragment into any
> > kind of free list. It just waits for the last fragment to be unused
> > and then frees the whole page.
> >
> > So why not simply go into pte_fragment_free() and do the call_rcu directly:
> >
> > 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> > 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> > 		if (!kernel)
> > 			pgtable_pte_page_dtor(page);
> > 		call_rcu(&page->rcu_head, free_page_rcu)
>
> We need to be careful on the lock being freed in pgtable_pte_page_dtor(),
> in Hugh's series IIUC we need the spinlock being there for the rcu section
> alongside the page itself.  So even if to do so we'll need to also rcu call
> pgtable_pte_page_dtor() when needed.

Er yes, I botched that, the dtor and the free_page should be in the
rcu callback function

Jason
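For clarity, the corrected shape of that alternative - with the destructor
moved into the RCU callback as conceded here - would look something like the
sketch below (free_page_rcu and its placement are illustrative, not code from
the series):

	static void free_page_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		/* ptl stays valid for lockless walkers until the grace period ends */
		pgtable_pte_page_dtor(page);
		__free_page(page);
	}

	/* ...and in pte_fragment_free(), for the user (!kernel) case: */
	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
		if (kernel)
			__free_page(page);
		else
			call_rcu(&page->rcu_head, free_page_rcu);
	}

Hugh's reply below explains why he keeps the deferral in pte_free_defer()
instead of pushing RCU into every pte_fragment_free().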
On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 06, 2023 at 03:03:31PM -0400, Peter Xu wrote:
> > On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> > >
> > > > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > > > index 20652daa1d7e..e4f58c5fc2ac 100644
> > > > --- a/arch/powerpc/mm/pgtable-frag.c
> > > > +++ b/arch/powerpc/mm/pgtable-frag.c
> > > > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> > > >  		__free_page(page);
> > > >  	}
> > > >  }
> > > > +
> > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > > > +
> > > > +static void pte_free_now(struct rcu_head *head)
> > > > +{
> > > > +	struct page *page;
> > > > +	int refcount;
> > > > +
> > > > +	page = container_of(head, struct page, rcu_head);
> > > > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > > > +				     &page->pt_frag_refcount);
> > > > +	if (refcount < PTE_FREE_DEFERRED) {
> > > > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > > > +		return;
> > > > +	}
> > >
> > > From what I can tell power doesn't recycle the sub fragment into any
> > > kind of free list. It just waits for the last fragment to be unused
> > > and then frees the whole page.

Yes, it's relatively simple in that way: not as sophisticated as s390.

> > >
> > > So why not simply go into pte_fragment_free() and do the call_rcu directly:
> > >
> > > 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> > > 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> > > 		if (!kernel)
> > > 			pgtable_pte_page_dtor(page);
> > > 		call_rcu(&page->rcu_head, free_page_rcu)
> >
> > We need to be careful on the lock being freed in pgtable_pte_page_dtor(),
> > in Hugh's series IIUC we need the spinlock being there for the rcu section
> > alongside the page itself.  So even if to do so we'll need to also rcu call
> > pgtable_pte_page_dtor() when needed.

Thanks, Peter, yes that's right.

>
> Er yes, I botched that, the dtor and the free_page should be in a the
> rcu callback function

But it was just a botched detail, and won't have answered Jason's doubt.

I had three (or perhaps it amounts to two) reasons for doing it this way:
none of which may seem good enough reasons to you.  Certainly I'd agree
that the way it's done seems... arcane.

One, as I've indicated before, I don't actually dare to go all the way
into RCU freeing of all page tables for powerpc (or any other): I should
think it's a good idea that everyone wants in the end, but I'm limited by
my time and competence - and dread of losing my way in the mmu_gather
TLB #ifdef maze.  It's work for someone else not me.

(pte_free_defer() do as you suggest, without changing pte_fragment_free()
itself?  No, that doesn't work out when defer does, say, the decrement of
pt_frag_refcount from 2 to 1, then pte_fragment_free() does the decrement
from 1 to 0: page freed without deferral.)

Two, this was the code I'd worked out before, and was used in production,
so I had confidence in it - it was just my mistake that I'd forgotten the
single rcu_head issue, and thought I could avoid it in the initial posting.
powerpc has changed around since then, but apparently not in any way that
affects this.  And it's too easy to agree in review that something can be
simpler, without bringing back to mind why the complications are there.

Three (just an explanation of why the old code was like this), powerpc
relies on THP's page table deposit+withdraw protocol, even for shmem/
file THPs.  I've skirted that issue in this series, by sticking with
retract_page_tables(), not attempting to insert huge pmd immediately.
But if huge pmd is inserted to replace ptetable pmd, then ptetable must
be deposited: pte_free_defer() as written protects the deposited ptetable
from then being freed without deferral (rather like in the example above).
But does not protect it from being withdrawn and reused within that grace
period.  Jann has grave doubts whether that can ever be allowed (or
perhaps I should grant him certainty, and examples that it cannot).
I did convince myself, back in the day, that it was safe here: but I'll
have to put in a lot more thought to re-justify it now, and on the way
may instead be completely persuaded by Jann.

Not very good reasons: good enough, or can you supply a better patch?

Thanks,
Hugh
diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..3a3dac77faf2 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,21 @@ void pte_fragment_free(unsigned long *table, int kernel)
 		__free_page(page);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_fragment_free((unsigned long *)page_to_virt(page), 0);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
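As a usage note, the effect of this patch only shows up in a later caller.
Schematically (this is the intended calling pattern, not the actual
retract_page_tables() hunk from the series), a collapsing caller clears the
pmd under its locks and then hands the detached page table to
pte_free_defer() rather than pte_free(), so a racing lockless walker under
rcu_read_lock() can still safely read the stale table until the grace period
ends:

	pmd_t pgt_pmd;
	spinlock_t *pml;

	pml = pmd_lock(mm, pmd);
	/* ... recheck that nothing maps the range ... */
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* clear *pmd, flush TLB */
	spin_unlock(pml);

	/* unreachable from the page tables now, but RCU readers may still walk it */
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));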