Message ID | 88c445ae-552-5243-31a4-2674bac62d4d@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp1309717vqr; Sun, 28 May 2023 23:22:57 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ5LjFsFcXMvmZKKBLMXlslI4D+T6rvtgURJxhKNiLiKYhiMHor6XLcGwm/xBxjEuLeWW+pY X-Received: by 2002:a05:6a00:10c5:b0:64d:1451:8220 with SMTP id d5-20020a056a0010c500b0064d14518220mr12873878pfu.20.1685341376874; Sun, 28 May 2023 23:22:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1685341376; cv=none; d=google.com; s=arc-20160816; b=hgmTdYT6X+nYC5IXcl7zoPkLb1qW7mvAbvapwUng87AsB/szHp0AHg8pBMWR8JPVsK dFlNGT23P4wW2jNhNc8jzmZsbRwfSqQcaUdyLZfDSHIXi8WxN8ndNU7VLf5fCMg28cUR q2gBOJrBT42P4RTXb0ik6lu8DEyBpOhzM5ZzqoI7vE3pI8Ia9casXTOb1oTzAWVBcao8 au73SRSpv0GWcnmFhAamgcEwGdvLpLN4p7/PpLL05BDky2Yd618V1qO2M12ChFDnrMN7 33wqfVgd7jHNZm/r7DV11pvZCyFZLXDjMwVrwsXqThGr446C7+sGTGES/X5PvjD597uq ZioA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:references:message-id:in-reply-to :subject:cc:to:from:date:dkim-signature; bh=2Xgtva00uaRjPQjQYWCbyYlUVerqtzlz4H3/JM5nRsA=; b=FddlZ4Pv8m784t7Cjm3F0yeTu7WNUs4IVJu5KdWjf7UVWv959hKMufDCwmAlnBAOGJ hVNsWmD5NgLjmpT/Ue11TdloXELnwDCzYz0DNLXNEZC5wh4yanEDjnrOVJX7o0lp1plV HMFgejTOFp70+Mth7AtSRckQpYNjr9m8Gv/gEUoZ0JLlfT6AaBCy5GmWGDTPYGSsuOFY o7oqeTMApkA0G/PdYZui2deoAaLDWCluabFUmqdsgHt1F6tWQHAhBBMBFo0XLaesuEok 64aWFCaIA8Ja8B/Ms5gXnLsgOhwC0NcAZB0siZmWC+oJ3MksQefjspFO+tyJbex9i8Xd aZRg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b="j75/ucEx"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id r25-20020a638f59000000b0053490e8df4dsi5925506pgn.104.2023.05.28.23.22.42; Sun, 28 May 2023 23:22:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b="j75/ucEx"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230335AbjE2GP0 (ORCPT <rfc822;callmefire3@gmail.com> + 99 others); Mon, 29 May 2023 02:15:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43924 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231428AbjE2GPY (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 29 May 2023 02:15:24 -0400 Received: from mail-yw1-x112e.google.com (mail-yw1-x112e.google.com [IPv6:2607:f8b0:4864:20::112e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E0EE0CF for <linux-kernel@vger.kernel.org>; Sun, 28 May 2023 23:15:11 -0700 (PDT) Received: by mail-yw1-x112e.google.com with SMTP id 00721157ae682-5658875abfaso41767437b3.1 for <linux-kernel@vger.kernel.org>; Sun, 28 May 2023 23:15:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1685340901; x=1687932901; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=2Xgtva00uaRjPQjQYWCbyYlUVerqtzlz4H3/JM5nRsA=; b=j75/ucEx6oo4OkFwpLoopPUGefT5zRQ/GQSssdYFJ4lLRni3rLEVhCMo1e8jwYMVcb YP6Bm+ySwzQyzGGQM2rSRov76G9UQ/Tn5RA33SDKqUYCIvPra4wwNlCuRu98iddHbdYN EF4ZzSOFNLEo4zXasuFIAULbNX6r1iMaUqjVeX4Zi3vuHqsR+CvgEyNVfPdXCJfUki2B 3dBz1c7QLHmuXQ+aR4qL+wJWgzXAMcn6OOeaRV4ZdMMOf1AGowm+XIRkGpuH6QAGrUcE pfw/89xaKUYoO8gOTy9XVNliAthq+Njc/D/eHcTCmEi6RvCXrpEvwwxwM3i0HwRhfzXq 7JqQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685340901; x=1687932901; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=2Xgtva00uaRjPQjQYWCbyYlUVerqtzlz4H3/JM5nRsA=; b=IQg33jI7+MSW1OA2dSVvkoo++t7C4stbp0FS9tMk9XFwBAtE1WSY/R0DBl9npzUsSH fSITtJslxGcWWzAWdYIfKwvsK0SI6Se8dEVr22Hn7frHm5t9bXzVQIre/dg/dCbyrFfm xSU9/rLYtD4ORPD2Jpnnlus6GzJXVMh1WHyvlZa0eUh7/7m96zVibRJpAKoB98DKeH3o k87BwY9gIMLzXjvf27gfV6TNyzZjDAlg6k9go9+WXdMi9nBZf7QDSLESMmjVjS4MHhCT EmOFZ0AlzwhBI7DNDMZEC2KFB1/2ZkUHw5GlKI0bokvcSeDrw4p8mDcVfVG4ViCOCf7t pkxA== X-Gm-Message-State: AC+VfDyUfMSnYg3GLbSjqCkc/Otfi/ZRED72CYqsUm7A0JZ8EGoJmSCt QZZaGm12NiHSYku6tYCesMeoog== X-Received: by 2002:a0d:e685:0:b0:55a:30f5:3d65 with SMTP id p127-20020a0de685000000b0055a30f53d65mr12605725ywe.41.1685340901250; Sun, 28 May 2023 23:15:01 -0700 (PDT) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id z7-20020a81a247000000b00560c2e3ec63sm3404765ywg.77.2023.05.28.23.14.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 28 May 2023 23:15:00 -0700 (PDT) Date: Sun, 28 May 2023 23:14:48 -0700 (PDT) From: Hugh Dickins <hughd@google.com> X-X-Sender: hugh@ripple.attlocal.net To: Andrew Morton <akpm@linux-foundation.org> cc: Mike Kravetz <mike.kravetz@oracle.com>, Mike Rapoport <rppt@kernel.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Matthew Wilcox <willy@infradead.org>, David Hildenbrand <david@redhat.com>, Suren Baghdasaryan <surenb@google.com>, Qi Zheng <zhengqi.arch@bytedance.com>, Yang Shi <shy828301@gmail.com>, Mel Gorman <mgorman@techsingularity.net>, Peter Xu <peterx@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Will Deacon <will@kernel.org>, Yu Zhao <yuzhao@google.com>, Alistair Popple <apopple@nvidia.com>, Ralph Campbell <rcampbell@nvidia.com>, Ira Weiny <ira.weiny@intel.com>, Steven Price <steven.price@arm.com>, SeongJae Park <sj@kernel.org>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Christophe Leroy <christophe.leroy@csgroup.eu>, Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>, Axel Rasmussen <axelrasmussen@google.com>, Anshuman Khandual <anshuman.khandual@arm.com>, Pasha Tatashin <pasha.tatashin@soleen.com>, Miaohe Lin <linmiaohe@huawei.com>, Minchan Kim <minchan@kernel.org>, Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>, Thomas Hellstrom <thomas.hellstrom@linux.intel.com>, Russell King <linux@armlinux.org.uk>, "David S. Miller" <davem@davemloft.net>, Michael Ellerman <mpe@ellerman.id.au>, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, Heiko Carstens <hca@linux.ibm.com>, Christian Borntraeger <borntraeger@linux.ibm.com>, Claudio Imbrenda <imbrenda@linux.ibm.com>, Alexander Gordeev <agordeev@linux.ibm.com>, Jann Horn <jannh@google.com>, linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s In-Reply-To: <35e983f5-7ed3-b310-d949-9ae8b130cdab@google.com> Message-ID: <88c445ae-552-5243-31a4-2674bac62d4d@google.com> References: <35e983f5-7ed3-b310-d949-9ae8b130cdab@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1767208519252291980?= X-GMAIL-MSGID: =?utf-8?q?1767208519252291980?= |
Series |
mm: free retracted page table by RCU
|
|
Commit Message
Hugh Dickins
May 29, 2023, 6:14 a.m. UTC
Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
include/linux/pgtable.h | 4 ++--
mm/pgtable-generic.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
Comments
On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote: > Before putting them to use (several commits later), add rcu_read_lock() > to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a > separate commit, since it risks exposing imbalances: prior commits have > fixed all the known imbalances, but we may find some have been missed. [...] > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > index c7ab18a5fb77..674671835631 100644 > --- a/mm/pgtable-generic.c > +++ b/mm/pgtable-generic.c > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) > { > pmd_t pmdval; > > - /* rcu_read_lock() to be added later */ > + rcu_read_lock(); > pmdval = pmdp_get_lockless(pmd); > if (pmdvalp) > *pmdvalp = pmdval; It might be a good idea to document that this series assumes that the first argument to __pte_offset_map() is a pointer into a second-level page table (and not a local copy of the entry) unless the containing VMA is known to not be THP-eligible or the page table is detached from the page table hierarchy or something like that. Currently a bunch of places pass references to local copies of the entry, and while I think all of these are fine, it would probably be good to at least document why these are allowed to do it while other places aren't. $ vgrep 'pte_offset_map(&' Index File Line Content 0 arch/sparc/mm/tlb.c 151 pte = pte_offset_map(&pmd, vaddr); 1 kernel/events/core.c 7501 ptep = pte_offset_map(&pmd, addr); 2 mm/gup.c 2460 ptem = ptep = pte_offset_map(&pmd, addr); 3 mm/huge_memory.c 2057 pte = pte_offset_map(&_pmd, haddr); 4 mm/huge_memory.c 2214 pte = pte_offset_map(&_pmd, haddr); 5 mm/page_table_check.c 240 pte_t *ptep = pte_offset_map(&pmd, addr);
On Wed, 31 May 2023, Jann Horn wrote: > On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote: > > Before putting them to use (several commits later), add rcu_read_lock() > > to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a > > separate commit, since it risks exposing imbalances: prior commits have > > fixed all the known imbalances, but we may find some have been missed. > [...] > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > > index c7ab18a5fb77..674671835631 100644 > > --- a/mm/pgtable-generic.c > > +++ b/mm/pgtable-generic.c > > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) > > { > > pmd_t pmdval; > > > > - /* rcu_read_lock() to be added later */ > > + rcu_read_lock(); > > pmdval = pmdp_get_lockless(pmd); > > if (pmdvalp) > > *pmdvalp = pmdval; > > It might be a good idea to document that this series assumes that the > first argument to __pte_offset_map() is a pointer into a second-level > page table (and not a local copy of the entry) unless the containing > VMA is known to not be THP-eligible or the page table is detached from > the page table hierarchy or something like that. Currently a bunch of > places pass references to local copies of the entry, and while I think > all of these are fine, it would probably be good to at least document > why these are allowed to do it while other places aren't. Thanks Jann: but I have to guess that here you are showing awareness of an important issue that I'm simply ignorant of. I have been haunted by a dim recollection that there is one architecture (arm-32?) which is fussy about the placement of the pmdval being examined (deduces info missing from the arch-independent interface, by following up the address?), but I couldn't track it down when I tried. Please tell me more; or better, don't spend your time explaining to me, but please just send a link to a good reference on the issue. I'll be unable to document what you ask there, without educating myself first. Thanks, Hugh > > $ vgrep 'pte_offset_map(&' > Index File Line Content > 0 arch/sparc/mm/tlb.c 151 pte = pte_offset_map(&pmd, vaddr); > 1 kernel/events/core.c 7501 ptep = pte_offset_map(&pmd, addr); > 2 mm/gup.c 2460 ptem = ptep = pte_offset_map(&pmd, addr); > 3 mm/huge_memory.c 2057 pte = pte_offset_map(&_pmd, haddr); > 4 mm/huge_memory.c 2214 pte = pte_offset_map(&_pmd, haddr); > 5 mm/page_table_check.c 240 pte_t *ptep = pte_offset_map(&pmd, addr);
On Fri, Jun 2, 2023 at 4:50 AM Hugh Dickins <hughd@google.com> wrote: > On Wed, 31 May 2023, Jann Horn wrote: > > On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote: > > > Before putting them to use (several commits later), add rcu_read_lock() > > > to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a > > > separate commit, since it risks exposing imbalances: prior commits have > > > fixed all the known imbalances, but we may find some have been missed. > > [...] > > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > > > index c7ab18a5fb77..674671835631 100644 > > > --- a/mm/pgtable-generic.c > > > +++ b/mm/pgtable-generic.c > > > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) > > > { > > > pmd_t pmdval; > > > > > > - /* rcu_read_lock() to be added later */ > > > + rcu_read_lock(); > > > pmdval = pmdp_get_lockless(pmd); > > > if (pmdvalp) > > > *pmdvalp = pmdval; > > > > It might be a good idea to document that this series assumes that the > > first argument to __pte_offset_map() is a pointer into a second-level > > page table (and not a local copy of the entry) unless the containing > > VMA is known to not be THP-eligible or the page table is detached from > > the page table hierarchy or something like that. Currently a bunch of > > places pass references to local copies of the entry, and while I think > > all of these are fine, it would probably be good to at least document > > why these are allowed to do it while other places aren't. > > Thanks Jann: but I have to guess that here you are showing awareness of > an important issue that I'm simply ignorant of. > > I have been haunted by a dim recollection that there is one architecture > (arm-32?) which is fussy about the placement of the pmdval being examined > (deduces info missing from the arch-independent interface, by following > up the address?), but I couldn't track it down when I tried. > > Please tell me more; or better, don't spend your time explaining to me, > but please just send a link to a good reference on the issue. I'll be > unable to document what you ask there, without educating myself first. Sorry, I think I was somewhat confused about what was going on when I wrote that message. After this series, __pte_offset_map() looks as follows, with added comments describing my understanding of the semantics: // `pmd` points to one of: // case 1: a pmd_t stored outside a page table, // referencing a page table detached by the caller // case 2: a pmd_t stored outside a page table, which the caller copied // from a page table in an RCU-critical section that extends // until at least the end of this function // case 3: a pmd_t stored inside a page table pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) { unsigned long __maybe_unused flags; pmd_t pmdval; // begin an RCU section; this is needed for case 3 rcu_read_lock(); config_might_irq_save(flags); // read the pmd_t. // if the pmd_t references a page table, this page table can not // go away because: // - in case 1, the caller is the main owner of the page table // - in case 2, because the caller // started an RCU read-side critical section before the caller // read the original pmd_t. (This pmdp_get_lockless() is just // reading a copied pmd_t off the stack.) // - in case 3, because we started an RCU section above before // reading the pmd_t out of the page table here pmdval = pmdp_get_lockless(pmd); config_might_irq_restore(flags); if (pmdvalp) *pmdvalp = pmdval; if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval))) goto nomap; if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval))) goto nomap; if (unlikely(pmd_bad(pmdval))) { pmd_clear_bad(pmd); goto nomap; } return __pte_map(&pmdval, addr); nomap: rcu_read_unlock(); return NULL; } case 1 is what happens in __page_table_check_pte_clear_range(), __split_huge_zero_page_pmd() and __split_huge_pmd_locked(). case 2 happens in lockless page table traversal (gup_pte_range() and perf_get_pgtable_size()). case 3 is normal page table traversal under mmap lock or mapping lock. I think having a function like this that can run in three different contexts in which it is protected in three different ways is somewhat hard to understand without comments. Though maybe I'm thinking about it the wrong way? Basically my point is: __pte_offset_map() normally requires that the pmd argument points into a page table so that the rcu_read_lock() can provide protection starting from the time the pmd_t is read from a page table. The exception are cases where the caller has taken its own precautions to ensure that the referenced page table can not have been freed.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index a1326e61d7ee..8b0fc7fdc46f 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address))) #define pte_unmap(pte) do { \ kunmap_local((pte)); \ - /* rcu_read_unlock() to be added later */ \ + rcu_read_unlock(); \ } while (0) #else static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address) @@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address) } static inline void pte_unmap(pte_t *pte) { - /* rcu_read_unlock() to be added later */ + rcu_read_unlock(); } #endif diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index c7ab18a5fb77..674671835631 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) { pmd_t pmdval; - /* rcu_read_lock() to be added later */ + rcu_read_lock(); pmdval = pmdp_get_lockless(pmd); if (pmdvalp) *pmdvalp = pmdval; @@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) } return __pte_map(&pmdval, addr); nomap: - /* rcu_read_unlock() to be added later */ + rcu_read_unlock(); return NULL; }