Message ID | 20230612151545.3317766-1-ryan.roberts@arm.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp2694774vqr; Mon, 12 Jun 2023 09:08:03 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ5d117Xf80wqScrl7BptEGgNm1xzidr6vWZK0qBlJp4jrx8f85lherQdl5uvQa0rrsyMrMR X-Received: by 2002:a17:903:124d:b0:1b0:471c:4c3 with SMTP id u13-20020a170903124d00b001b0471c04c3mr8097250plh.40.1686586083561; Mon, 12 Jun 2023 09:08:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686586083; cv=none; d=google.com; s=arc-20160816; b=o6putCDpxsNqIKZdEmi//HEwNSBrVxyy+evHYpRd60VaNnccco/g0C881Xd4KvnCSK X1citQz4Z9nrKxq6q0K9ErB3eFWWzgas9AC+zYPxkZM22cQXEYoO6wZ1BrNUqk8fsq0i l5mPBtUvwWYJrK6R+Z+sayUXDMAASMB4aGoQFY4hRziVRZTDXPolHRYm2ENrra0xyZqw CtrViFauej6WdSpfei0J8mms8wSjQuuSycjXDI54FMbWmj4UGGx3MXhxEnc/ML9QSBph sH34Zc7XiT6nrzfYzSbcg4zUkU4KqseNpP4cK+dPx/VK+hDk5xbOX9Corb2kJuFS9NH9 rK3A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=F8Is49mqR+xnx8M7svhWuLvI1W26lqHdsz/cDfr6iuM=; b=rKbXZwjPis0np4eZmLMJU/3jiY4w/I/Xx+6Z9jFPhEFP8cg00BS3gez5yy/7tBNqfL LbLSk6iy0xqjIC4hFkTBj4ztt9iaa8DrIvBQ00jANanTMNtsS2VrUnowolbYwggxUm9y z89FuwA7DwLOCeGqDetIokCAWlh/eV2cYraKobGGCCiJjAWlGeW1XAECWViRvf9YmE+s lpBNACQ2k32xN3YQMO2K6lPwTo1Rvu5894W6V8d/2akPWEY+M+flpMHDX4Ork/wbXyF1 yHNW4pF3ihoTKZobgrfESJhm8aUHpoX48H3r5qi2lIknoGqn6+bPjJoDEiNQ4EqA8LIr Gk7w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id q4-20020a170902dac400b001ae5fe35b78si4134967plx.497.2023.06.12.09.07.49; Mon, 12 Jun 2023 09:08:03 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234310AbjFLPQE (ORCPT <rfc822;rust.linux@gmail.com> + 99 others); Mon, 12 Jun 2023 11:16:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44398 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234242AbjFLPQD (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 12 Jun 2023 11:16:03 -0400 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 9F82498 for <linux-kernel@vger.kernel.org>; Mon, 12 Jun 2023 08:16:01 -0700 (PDT) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 4E5FE1FB; Mon, 12 Jun 2023 08:16:46 -0700 (PDT) Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com [10.1.196.26]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 628073F5A1; Mon, 12 Jun 2023 08:15:56 -0700 (PDT) From: Ryan Roberts <ryan.roberts@arm.com> To: Andrew Morton <akpm@linux-foundation.org>, SeongJae Park <sj@kernel.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Mike Rapoport <rppt@kernel.org>, Yu Zhao <yuzhao@google.com>, Jason Gunthorpe <jgg@ziepe.ca>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Dimitri Sivanich <dimitri.sivanich@hpe.com>, Alex Williamson <alex.williamson@redhat.com>, Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <muchun.song@linux.dev>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>, Andrey Ryabinin <ryabinin.a.a@gmail.com>, Alexander Potapenko <glider@google.com>, Andrey Konovalov <andreyknvl@gmail.com>, Dmitry Vyukov <dvyukov@google.com>, Vincenzo Frascino <vincenzo.frascino@arm.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Miaohe Lin <linmiaohe@huawei.com>, Pasha Tatashin <pasha.tatashin@soleen.com>, Uladzislau Rezki <urezki@gmail.com>, Christoph Hellwig <hch@infradead.org>, Lorenzo Stoakes <lstoakes@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, damon@lists.linux.dev Subject: [PATCH v3 0/3] Encapsulate PTE contents from non-arch code Date: Mon, 12 Jun 2023 16:15:42 +0100 Message-Id: <20230612151545.3317766-1-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1768513689280306864?= X-GMAIL-MSGID: =?utf-8?q?1768513689280306864?= |
Series |
Encapsulate PTE contents from non-arch code
|
|
Message
Ryan Roberts
June 12, 2023, 3:15 p.m. UTC
Hi All, (Including wider audience this time since changes touch a fair few subsystems) This is the second half of v3 of a series to improve the encapsulation of pte entries by disallowing non-arch code from directly dereferencing pte_t pointers. Based on earlier feedback, I split the series in 2; the first part, fixes for existing bugs, was already posted at [3] and merged into mm-stable. This second part contains the conversion from direct dereferences to instead use ptep_get()/ptep_get_lockless(). See the v1 cover letter at [1] for rationale for this work. Based on feedback at v2, I've removed the new ptep_deref() helper I originally added, and am now using the existing ptep_get() and ptep_get_lockless() helpers. Testing on Ampere Altra (arm64) showed no difference in performance when using ptep_deref() (*pte) vs ptep_get() (READ_ONCE(*pte)). Patches are based on mm-unstable (49e038b1919e) and a branch is available at [4] (Let me know if this is the wrong branch to target - I'm still not familiar with the details of the mm- dev process!). Note that Hugh Dickins's "mm: allow pte_offset_map[_lock]() to fail" (now in mm-unstable) patch set caused a number of conflicts which I've resolved. But due to that, you won't be able to apply these patches on top of Linus's tree. I have an alternate branch on top of v6.4-rc6 at [5]. Changes since v2 [2]: - Removed ptep_deref() helper - Converted ptep_deref() callsites to use ptep_get[_lockless]() Changes since v1 [1]: - Fixed sh build bug reported by 0-day CI [1] https://lore.kernel.org/linux-mm/20230511132113.80196-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20230518110727.2106156-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/all/20230602092949.545577-1-ryan.roberts@arm.com/ [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/ptep_get-mm-unstable-lkml_v3 [5] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/ptep_get-v6.4-rc6-lkml_v3 Thanks, Ryan Ryan Roberts (3): mm: ptdump should use ptep_get_lockless() mm: Move ptep_get() and pmdp_get() helpers mm: ptep_get() conversion .../drm/i915/gem/selftests/i915_gem_mman.c | 8 +- drivers/misc/sgi-gru/grufault.c | 2 +- drivers/vfio/vfio_iommu_type1.c | 7 +- drivers/xen/privcmd.c | 2 +- fs/proc/task_mmu.c | 33 +++--- fs/userfaultfd.c | 6 +- include/linux/hugetlb.h | 4 + include/linux/mm_inline.h | 2 +- include/linux/pgtable.h | 34 +++--- kernel/events/uprobes.c | 2 +- mm/damon/ops-common.c | 2 +- mm/damon/paddr.c | 2 +- mm/damon/vaddr.c | 10 +- mm/filemap.c | 2 +- mm/gup.c | 21 ++-- mm/highmem.c | 12 ++- mm/hmm.c | 2 +- mm/huge_memory.c | 4 +- mm/hugetlb.c | 2 +- mm/hugetlb_vmemmap.c | 6 +- mm/kasan/init.c | 9 +- mm/kasan/shadow.c | 10 +- mm/khugepaged.c | 22 ++-- mm/ksm.c | 22 ++-- mm/madvise.c | 6 +- mm/mapping_dirty_helpers.c | 4 +- mm/memcontrol.c | 4 +- mm/memory-failure.c | 26 ++--- mm/memory.c | 100 ++++++++++-------- mm/mempolicy.c | 6 +- mm/migrate.c | 14 +-- mm/migrate_device.c | 15 +-- mm/mincore.c | 2 +- mm/mlock.c | 6 +- mm/mprotect.c | 8 +- mm/mremap.c | 2 +- mm/page_table_check.c | 4 +- mm/page_vma_mapped.c | 27 +++-- mm/pgtable-generic.c | 2 +- mm/ptdump.c | 2 +- mm/rmap.c | 34 +++--- mm/sparse-vmemmap.c | 8 +- mm/swap_state.c | 8 +- mm/swapfile.c | 20 ++-- mm/userfaultfd.c | 4 +- mm/vmalloc.c | 6 +- mm/vmscan.c | 14 +-- virt/kvm/kvm_main.c | 11 +- 48 files changed, 316 insertions(+), 243 deletions(-) -- 2.25.1
Comments
On Mon, 12 Jun 2023 16:15:42 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote: > Hi All, > > (Including wider audience this time since changes touch a fair few subsystems) > > This is the second half of v3 of a series to improve the encapsulation of pte > entries by disallowing non-arch code from directly dereferencing pte_t pointers. That's basically all we have here for [0/N] cover letter content. I stole some words from the [3/3] changelog, so we now have: : A series to improve the encapsulation of pte entries by disallowing : non-arch code from directly dereferencing pte_t pointers. : : This means that by default, the accesses change from a C dereference to a : READ_ONCE(). This is technically the correct thing to do since where : pgtables are modified by HW (for access/dirty) they are volatile and : therefore we should always ensure READ_ONCE() semantics. : : But more importantly, by always using the helper, it can be overridden by : the architecture to fully encapsulate the contents of the pte. Arch code : is deliberately not converted, as the arch code knows best. It is : intended that arch code (arm64) will override the default with its own : implementation that can (e.g.) hide certain bits from the core code, or : determine young/dirty status by mixing in state from another source. > Based on earlier feedback, I split the series in 2; the first part, fixes for > existing bugs, was already posted at [3] and merged into mm-stable. This second > part contains the conversion from direct dereferences to instead use > ptep_get()/ptep_get_lockless(). > > See the v1 cover letter at [1] for rationale for this work. > > Based on feedback at v2, I've removed the new ptep_deref() helper I originally > added, and am now using the existing ptep_get() and ptep_get_lockless() helpers. > Testing on Ampere Altra (arm64) showed no difference in performance when using > ptep_deref() (*pte) vs ptep_get() (READ_ONCE(*pte)). > > Patches are based on mm-unstable (49e038b1919e) and a branch is available at [4] > (Let me know if this is the wrong branch to target - I'm still not familiar with > the details of the mm- dev process!). Note that Hugh Dickins's "mm: allow > pte_offset_map[_lock]() to fail" (now in mm-unstable) patch set caused a number > of conflicts which I've resolved. But due to that, you won't be able to apply > these patches on top of Linus's tree. I have an alternate branch on top of > v6.4-rc6 at [5]. Yep, that's all great, thanks. Is there some clever trick we can do to prevent new open-coded derefs of pte_t* from being introduced? I suppose we could convert pte_t to a single-member struct to force a compile error. That struct will get passed by value to ptep_get() so that's OK. But this isn't viable unless/until all architectures are converted :( Or we rely upon Ryan to grep the tree occasionally ;)
> On Jun 12, 2023, at 23:15, Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi All, > > (Including wider audience this time since changes touch a fair few subsystems) > > This is the second half of v3 of a series to improve the encapsulation of pte > entries by disallowing non-arch code from directly dereferencing pte_t pointers. > Based on earlier feedback, I split the series in 2; the first part, fixes for > existing bugs, was already posted at [3] and merged into mm-stable. This second > part contains the conversion from direct dereferences to instead use > ptep_get()/ptep_get_lockless(). > > See the v1 cover letter at [1] for rationale for this work. > > Based on feedback at v2, I've removed the new ptep_deref() helper I originally > added, and am now using the existing ptep_get() and ptep_get_lockless() helpers. When I first saw the name of ptep_get()/ptep_get_lockless(), I thought the pte seems like to be protected by the refcount mechanism (Why I have this though? Because Qi Zheng has proposed a approach to free pte page tables by using the refcount mechanism [1]). And your proposed name of ptep_deref() is intuitive for me, so I have another thought, should we rename ptep_get() to ptep_deref()? Just a thought from me, I'd like to hear if others object. Thanks. [1] https://lore.kernel.org/lkml/20211110105428.32458-7-zhengqi.arch@bytedance.com/
Hi Andrew, Thanks for taking the patches! On 12/06/2023 21:16, Andrew Morton wrote: > On Mon, 12 Jun 2023 16:15:42 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote: > >> Hi All, >> >> (Including wider audience this time since changes touch a fair few subsystems) >> >> This is the second half of v3 of a series to improve the encapsulation of pte >> entries by disallowing non-arch code from directly dereferencing pte_t pointers. > > That's basically all we have here for [0/N] cover letter content. I > stole some words from the [3/3] changelog, so we now have: > > : A series to improve the encapsulation of pte entries by disallowing > : non-arch code from directly dereferencing pte_t pointers. > : > : This means that by default, the accesses change from a C dereference to a > : READ_ONCE(). This is technically the correct thing to do since where > : pgtables are modified by HW (for access/dirty) they are volatile and > : therefore we should always ensure READ_ONCE() semantics. > : > : But more importantly, by always using the helper, it can be overridden by > : the architecture to fully encapsulate the contents of the pte. Arch code > : is deliberately not converted, as the arch code knows best. It is > : intended that arch code (arm64) will override the default with its own > : implementation that can (e.g.) hide certain bits from the core code, or > : determine young/dirty status by mixing in state from another source. I'm not sure I've understood what you are saying here? Is there some expectation for the content of the cover letter for mm patches, which I'm not aware of? If you can point me to some docs I'll make sure I'm conformant in future. > > >> Based on earlier feedback, I split the series in 2; the first part, fixes for >> existing bugs, was already posted at [3] and merged into mm-stable. This second >> part contains the conversion from direct dereferences to instead use >> ptep_get()/ptep_get_lockless(). >> >> See the v1 cover letter at [1] for rationale for this work. >> >> Based on feedback at v2, I've removed the new ptep_deref() helper I originally >> added, and am now using the existing ptep_get() and ptep_get_lockless() helpers. >> Testing on Ampere Altra (arm64) showed no difference in performance when using >> ptep_deref() (*pte) vs ptep_get() (READ_ONCE(*pte)). >> >> Patches are based on mm-unstable (49e038b1919e) and a branch is available at [4] >> (Let me know if this is the wrong branch to target - I'm still not familiar with >> the details of the mm- dev process!). Note that Hugh Dickins's "mm: allow >> pte_offset_map[_lock]() to fail" (now in mm-unstable) patch set caused a number >> of conflicts which I've resolved. But due to that, you won't be able to apply >> these patches on top of Linus's tree. I have an alternate branch on top of >> v6.4-rc6 at [5]. > > Yep, that's all great, thanks. > > > Is there some clever trick we can do to prevent new open-coded derefs > of pte_t* from being introduced? I've been thinking about this myself but don't have a good answer at the moment. I used 2 techniques to find them all, but neither are fullproof so required hand editing: 1) global find/replace of "pte_t *" to "pte_handle_t ", and define pte_handle_t as void *. Then the compiler will throw an error for an attempted dereference. This obbiously only checks code that is actually being compiled, and it doesn't prevent anyone from declaring a pte_t* and directly dereferencing in future. 2) Using the coccinelle script. This worked pretty well, but finds a few places where the pte_t pointer is not pointing to a page table, but a variable on the stack, and there it is legitimate to dereference (we don't want to use ptep_get()/ptep_set_at() there). And also flags up lots of arch code that needs to be ignored. There are coccinelle scripts kept in scripts/coccinelle that apprarently get run semi-regularly [1]. So could submit something there, but would need to add support for excluding the arch directory and for marking known false-positives, neither of which I think are currently available. Does anyone know better? > > I suppose we could convert pte_t to a single-member struct to force a > compile error. That struct will get passed by value to ptep_get() so > that's OK. But this isn't viable unless/until all architectures are > converted :( It already is defined that way for arm64 at least: typedef struct { pteval_t pte; } pte_t; But I'm not sure how that helps? You can still legally deref it in C. > > Or we rely upon Ryan to grep the tree occasionally ;) This is my least favourite option ;-) But it wouldn't be the end of the world in the short term. Once I achieve the full objective of using arm64's contpte capability, open coded derefs will manifest as bugs (at least for cases where the accuracy of the access/dirty bits matter), which will certainly make sure they get fixed. Thanks, Ryan
On 13/06/2023 03:16, Muchun Song wrote: > > >> On Jun 12, 2023, at 23:15, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi All, >> >> (Including wider audience this time since changes touch a fair few subsystems) >> >> This is the second half of v3 of a series to improve the encapsulation of pte >> entries by disallowing non-arch code from directly dereferencing pte_t pointers. >> Based on earlier feedback, I split the series in 2; the first part, fixes for >> existing bugs, was already posted at [3] and merged into mm-stable. This second >> part contains the conversion from direct dereferences to instead use >> ptep_get()/ptep_get_lockless(). >> >> See the v1 cover letter at [1] for rationale for this work. >> >> Based on feedback at v2, I've removed the new ptep_deref() helper I originally >> added, and am now using the existing ptep_get() and ptep_get_lockless() helpers. > > When I first saw the name of ptep_get()/ptep_get_lockless(), I thought > the pte seems like to be protected by the refcount mechanism (Why I have > this though? Because Qi Zheng has proposed a approach to free pte page tables > by using the refcount mechanism [1]). And your proposed name of ptep_deref() > is intuitive for me, so I have another thought, should we rename ptep_get() > to ptep_deref()? Just a thought from me, I'd like to hear if others object. I see your point, but I think any renaming exercise should be discussed and applied in the context of a separate patch series, given that ptep_get() and ptep_get_lockless() already exist in the code base. This would be a much bigger job, since it would need to cover all the arch code too. > > Thanks. > > [1] https://lore.kernel.org/lkml/20211110105428.32458-7-zhengqi.arch@bytedance.com/