Message ID | a690186fc37e1ea92556a7dbd0887fe201fcc709.1683067198.git.lstoakes@gmail.com |
---|---|
State | New |
Headers |
From: Lorenzo Stoakes <lstoakes@gmail.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Jens Axboe <axboe@kernel.dk>, Matthew Wilcox <willy@infradead.org>, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>, Leon Romanovsky <leon@kernel.org>, Christian Benvenuti <benve@cisco.com>, Nelson Escobar <neescoba@cisco.com>, Bernard Metzler <bmt@zurich.ibm.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, Bjorn Topel <bjorn@kernel.org>, Magnus Karlsson <magnus.karlsson@intel.com>, Maciej Fijalkowski <maciej.fijalkowski@intel.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, "David S . Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Christian Brauner <brauner@kernel.org>, Richard Cochran <richardcochran@gmail.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Jesper Dangaard Brouer <hawk@kernel.org>, John Fastabend <john.fastabend@gmail.com>, linux-fsdevel@vger.kernel.org, linux-perf-users@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, Oleg Nesterov <oleg@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>, John Hubbard <jhubbard@nvidia.com>, Jan Kara <jack@suse.cz>, "Kirill A . Shutemov" <kirill@shutemov.name>, Pavel Begunkov <asml.silence@gmail.com>, Mika Penttila <mpenttil@redhat.com>, David Hildenbrand <david@redhat.com>, Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>, Peter Xu <peterx@redhat.com>, Matthew Rosato <mjrosato@linux.ibm.com>, "Paul E . McKenney" <paulmck@kernel.org>, Christian Borntraeger <borntraeger@linux.ibm.com>, Lorenzo Stoakes <lstoakes@gmail.com>
Subject: [PATCH v8 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
Date: Tue, 2 May 2023 23:51:35 +0100
Message-Id: <a690186fc37e1ea92556a7dbd0887fe201fcc709.1683067198.git.lstoakes@gmail.com>
In-Reply-To: <cover.1683067198.git.lstoakes@gmail.com>
References: <cover.1683067198.git.lstoakes@gmail.com>
X-Mailer: git-send-email 2.40.1 |
Series | mm/gup: disallow GUP writing to file-backed mappings by default |
Commit Message
Lorenzo Stoakes
May 2, 2023, 10:51 p.m. UTC
Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.
This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.
We have access to less information in the fast path as we cannot examine
the VMA containing the mapping; however, we can determine whether the folio
is anonymous or belongs to a whitelisted filesystem - specifically
hugetlb and shmem mappings.
We take special care to ensure that both the folio and mapping are safe to
access when performing these checks and document folio_fast_pin_allowed()
accordingly.
It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast call, let alone combined with FOLL_LONGTERM,
so we can always rely on the fact that, if we fail to pin on the fast path,
the code will fall back to the slow path, which can perform the more
thorough check.
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
mm/gup.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 102 insertions(+)
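As an illustration of the decision logic the commit message describes, here is a minimal userspace sketch. The flag values and the `folio_info` struct are invented for the example (the real flags live in include/linux/mm.h and the real check reads folio state): only a write through a long-term pin is refused, and only when the folio is neither anonymous, hugetlb, nor shmem. Returning false corresponds to falling back to the slow path.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the kernel's definitions differ. */
#define FOLL_WRITE    0x01
#define FOLL_PIN      0x02
#define FOLL_LONGTERM 0x04

/* Hypothetical stand-in for the folio state the real check consults. */
struct folio_info {
	bool anon;
	bool hugetlb;
	bool shmem;
};

/*
 * Mirrors the decision structure of folio_fast_pin_allowed(): anything
 * short of a long-term writable pin is always fine; anon, hugetlb and
 * shmem folios are whitelisted; everything else (e.g. ext4/xfs page
 * cache) is declined so the slow path can do the thorough VMA check.
 */
static bool fast_pin_allowed(const struct folio_info *f, unsigned int flags)
{
	if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
	    (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
		return true;

	return f->hugetlb || f->anon || f->shmem;
}
```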
Comments
On Tue, 2 May 2023 23:51:35 +0100 Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
>
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
>
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous or belonging to a whitelisted filesystem - specifically
> hugetlb and shmem mappings.
>
> We take special care to ensure that both the folio and mapping are safe to
> access when performing these checks and document folio_fast_pin_allowed()
> accordingly.
>
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.

arm allnoconfig said

  mm/gup.c:115:13: warning: 'folio_fast_pin_allowed' defined but not used [-Wunused-function]
    115 | static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
        |             ^~~~~~~~~~~~~~~~~~~~~~

so I moved the definition inside CONFIG_ARCH_HAS_PTE_SPECIAL.

 mm/gup.c | 154 ++++++++++++++++++++++++++---------------------------
 1 file changed, 77 insertions(+), 77 deletions(-)

--- a/mm/gup.c~mm-gup-disallow-foll_longterm-gup-fast-writing-to-file-backed-mappings-fix
+++ a/mm/gup.c
@@ -96,83 +96,6 @@ retry:
 	return folio;
 }
 
-/*
- * Used in the GUP-fast path to determine whether a pin is permitted for a
- * specific folio.
- *
- * This call assumes the caller has pinned the folio, that the lowest page table
- * level still points to this folio, and that interrupts have been disabled.
- *
- * Writing to pinned file-backed dirty tracked folios is inherently problematic
- * (see comment describing the writable_file_mapping_allowed() function). We
- * therefore try to avoid the most egregious case of a long-term mapping doing
- * so.
- *
- * This function cannot be as thorough as that one as the VMA is not available
- * in the fast path, so instead we whitelist known good cases and if in doubt,
- * fall back to the slow path.
- */
-static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
-{
-	struct address_space *mapping;
-	unsigned long mapping_flags;
-
-	/*
-	 * If we aren't pinning then no problematic write can occur. A long term
-	 * pin is the most egregious case so this is the one we disallow.
-	 */
-	if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
-	    (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
-		return true;
-
-	/* The folio is pinned, so we can safely access folio fields. */
-
-	/* Neither of these should be possible, but check to be sure. */
-	if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
-		return false;
-
-	/* hugetlb mappings do not require dirty-tracking. */
-	if (folio_test_hugetlb(folio))
-		return true;
-
-	/*
-	 * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
-	 * cannot proceed, which means no actions performed under RCU can
-	 * proceed either.
-	 *
-	 * inodes and thus their mappings are freed under RCU, which means the
-	 * mapping cannot be freed beneath us and thus we can safely dereference
-	 * it.
-	 */
-	lockdep_assert_irqs_disabled();
-
-	/*
-	 * However, there may be operations which _alter_ the mapping, so ensure
-	 * we read it once and only once.
-	 */
-	mapping = READ_ONCE(folio->mapping);
-
-	/*
-	 * The mapping may have been truncated, in any case we cannot determine
-	 * if this mapping is safe - fall back to slow path to determine how to
-	 * proceed.
-	 */
-	if (!mapping)
-		return false;
-
-	/* Anonymous folios are fine, other non-file backed cases are not. */
-	mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
-	if (mapping_flags)
-		return mapping_flags == PAGE_MAPPING_ANON;
-
-	/*
-	 * At this point, we know the mapping is non-null and points to an
-	 * address_space object. The only remaining whitelisted file system is
-	 * shmem.
-	 */
-	return shmem_mapping(mapping);
-}
-
 /**
  * try_grab_folio() - Attempt to get or pin a folio.
  * @page: pointer to page to be grabbed
@@ -2474,6 +2397,83 @@ static void __maybe_unused undo_dev_page
 
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 /*
+ * Used in the GUP-fast path to determine whether a pin is permitted for a
+ * specific folio.
+ *
+ * This call assumes the caller has pinned the folio, that the lowest page table
+ * level still points to this folio, and that interrupts have been disabled.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases and if in doubt,
+ * fall back to the slow path.
+ */
+static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
+{
+	struct address_space *mapping;
+	unsigned long mapping_flags;
+
+	/*
+	 * If we aren't pinning then no problematic write can occur. A long term
+	 * pin is the most egregious case so this is the one we disallow.
+	 */
+	if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
+	    (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
+		return true;
+
+	/* The folio is pinned, so we can safely access folio fields. */
+
+	/* Neither of these should be possible, but check to be sure. */
+	if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
+		return false;
+
+	/* hugetlb mappings do not require dirty-tracking. */
+	if (folio_test_hugetlb(folio))
+		return true;
+
+	/*
+	 * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
+	 * cannot proceed, which means no actions performed under RCU can
+	 * proceed either.
+	 *
+	 * inodes and thus their mappings are freed under RCU, which means the
+	 * mapping cannot be freed beneath us and thus we can safely dereference
+	 * it.
+	 */
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * However, there may be operations which _alter_ the mapping, so ensure
+	 * we read it once and only once.
+	 */
+	mapping = READ_ONCE(folio->mapping);
+
+	/*
+	 * The mapping may have been truncated, in any case we cannot determine
+	 * if this mapping is safe - fall back to slow path to determine how to
+	 * proceed.
+	 */
+	if (!mapping)
+		return false;
+
+	/* Anonymous folios are fine, other non-file backed cases are not. */
+	mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
+	if (mapping_flags)
+		return mapping_flags == PAGE_MAPPING_ANON;
+
+	/*
+	 * At this point, we know the mapping is non-null and points to an
+	 * address_space object. The only remaining whitelisted file system is
+	 * shmem.
+	 */
+	return shmem_mapping(mapping);
+}
+
+/*
  * Fast-gup relies on pte change detection to avoid concurrent pgtable
  * operations.
  *
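The warning Andrew hit is the usual -Wunused-function pattern: a static helper whose only caller is compiled out on some configs. A minimal standalone sketch of the fix he applied (the CONFIG_DEMO macro is a hypothetical stand-in for CONFIG_ARCH_HAS_PTE_SPECIAL): defining the helper inside the same #ifdef as its caller means no configuration ever sees an unused static definition.

```c
#include <assert.h>

#define CONFIG_DEMO 1	/* hypothetical stand-in for CONFIG_ARCH_HAS_PTE_SPECIAL */

#ifdef CONFIG_DEMO
/*
 * The helper and its only caller live under the same config option, so
 * a build without CONFIG_DEMO never compiles an unused static function
 * and -Wunused-function stays quiet. The alternative would be marking
 * the helper __maybe_unused and leaving it outside the #ifdef.
 */
static int helper(int x)
{
	return x * 2;
}

static int caller(int x)
{
	return helper(x) + 1;
}
#endif
```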
On Tue 02-05-23 23:51:35, Lorenzo Stoakes wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
[...]
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>

The patch looks good to me now. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/gup.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 102 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 0ea9ebec9547..1ab369b5d889 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
>  #include <linux/migrate.h>
>  #include <linux/mm_inline.h>
>  #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>  
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>
> @@ -95,6 +96,83 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>  	return folio;
>  }
>  
> +/*
> + * Used in the GUP-fast path to determine whether a pin is permitted for a
> + * specific folio.
[...]
> +	return shmem_mapping(mapping);
> +}
> +
>  /**
>   * try_grab_folio() - Attempt to get or pin a folio.
>   * @page: pointer to page to be grabbed
> @@ -2464,6 +2542,11 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>  		goto pte_unmap;
>  	}
>  
> +	if (!folio_fast_pin_allowed(folio, flags)) {
> +		gup_put_folio(folio, 1, flags);
> +		goto pte_unmap;
> +	}
> +
>  	if (!pte_write(pte) && gup_must_unshare(NULL, flags, page)) {
>  		gup_put_folio(folio, 1, flags);
>  		goto pte_unmap;
> @@ -2656,6 +2739,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  		return 0;
>  	}
>  
> +	if (!folio_fast_pin_allowed(folio, flags)) {
> +		gup_put_folio(folio, refs, flags);
> +		return 0;
> +	}
> +
>  	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
>  		gup_put_folio(folio, refs, flags);
>  		return 0;
> @@ -2722,6 +2810,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		return 0;
>  	}
>  
> +	if (!folio_fast_pin_allowed(folio, flags)) {
> +		gup_put_folio(folio, refs, flags);
> +		return 0;
> +	}
>  	if (!pmd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
>  		gup_put_folio(folio, refs, flags);
>  		return 0;
> @@ -2762,6 +2854,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		return 0;
>  	}
>  
> +	if (!folio_fast_pin_allowed(folio, flags)) {
> +		gup_put_folio(folio, refs, flags);
> +		return 0;
> +	}
> +
>  	if (!pud_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
>  		gup_put_folio(folio, refs, flags);
>  		return 0;
> @@ -2797,6 +2894,11 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  		return 0;
>  	}
>  
> +	if (!folio_fast_pin_allowed(folio, flags)) {
> +		gup_put_folio(folio, refs, flags);
> +		return 0;
> +	}
> +
>  	*nr += refs;
>  	folio_set_referenced(folio);
>  	return 1;
> -- 
> 2.40.1
> 
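The quoted call-site hunks all follow the same shape: if the fast-path check declines, drop the folio reference and bail out, and the caller then retries via the slow path. A toy userspace sketch of that contract (the function names here are invented for illustration, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the fast/slow contract: the fast path may decline
 * (return 0) without it being an error, mirroring the
 * gup_put_folio() + bail-out pattern in the hunks above.
 */
static int pin_fast(bool fast_path_allows)
{
	if (!fast_path_allows)
		return 0;	/* decline; caller must fall back */
	return 1;		/* pinned on the fast path */
}

/* Stands in for the VMA-aware slow path, assumed able to do the
 * thorough writable_file_mapping_allowed()-style check. */
static int pin_slow(void)
{
	return 1;
}

static int pin_pages(bool fast_path_allows)
{
	int ret = pin_fast(fast_path_allows);

	return ret ? ret : pin_slow();
}
```

Because every fast-path refusal funnels into the slow path, the fast-path check only needs to be conservative, never exhaustive.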
[...]
> +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> +{
[...]
> +	/* The folio is pinned, so we can safely access folio fields. */
> +
> +	/* Neither of these should be possible, but check to be sure. */

You can easily have anon pages that are in the swapcache at this point
(especially because this function is called before our unsharing checks), so
the comment is misleading. And there is nothing wrong with pinning an anon
page that's still in the swapcache -- the following folio_test_anon() check
will allow them. The check made sense in page_mapping(), but here it's not
required.

I do agree regarding folio_test_slab(), though. Should we WARN in case we
would have one?

	if (WARN_ON_ONCE(folio_test_slab(folio)))
		return false;

> +	if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> +		return false;
[...]
> +	/* Anonymous folios are fine, other non-file backed cases are not. */
> +	mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> +	if (mapping_flags)
> +		return mapping_flags == PAGE_MAPPING_ANON;

KSM pages are also (shared) anonymous folios, and that check would fail --
which is OK (the following unsharing checks reject long-term pinning of
them), but it is a bit inconsistent with your comment and folio_test_anon().
It would be more consistent (with your comment and also the
folio_test_anon() implementation) to have here:

	return mapping_flags & PAGE_MAPPING_ANON;

[...]

In general, LGTM.

Acked-by: David Hildenbrand <david@redhat.com>
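David's point about `==` versus `&` can be made concrete with the tagged-pointer flag values. The numeric values below follow the scheme in include/linux/page-flags.h of that era (PAGE_MAPPING_KSM being ANON | MOVABLE), but treat them as illustrative: a KSM folio carries both the ANON and MOVABLE bits in its mapping pointer, so the strict equality test rejects it while the bitwise test accepts it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values mirroring the kernel's mapping-pointer tag bits. */
#define PAGE_MAPPING_ANON	0x1UL
#define PAGE_MAPPING_MOVABLE	0x2UL
#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* The check as posted in v8: only exactly-anon mappings pass, so a KSM
 * folio (ANON | MOVABLE) is sent to the slow path. */
static bool is_anon_strict(unsigned long mapping_flags)
{
	return mapping_flags == PAGE_MAPPING_ANON;
}

/* David's suggested form, consistent with folio_test_anon(): any
 * mapping with the ANON bit set passes, KSM included. */
static bool is_anon_loose(unsigned long mapping_flags)
{
	return mapping_flags & PAGE_MAPPING_ANON;
}
```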
On Tue, May 02, 2023 at 07:18:21PM -0700, Andrew Morton wrote:
> On Tue, 2 May 2023 23:51:35 +0100 Lorenzo Stoakes <lstoakes@gmail.com> wrote:
> [...]
>
> arm allnoconfig said
>
>   mm/gup.c:115:13: warning: 'folio_fast_pin_allowed' defined but not used [-Wunused-function]
>     115 | static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
>         |             ^~~~~~~~~~~~~~~~~~~~~~
>
> so I moved the definition inside CONFIG_ARCH_HAS_PTE_SPECIAL.
[...]

Ack, thanks for this. I think it doesn't quite cover all cases (the kernel
bot is moaning on -next for mips) and needs some more fiddly #ifdef stuff,
so I will spin a v9 shortly anyway to fix this up and address a few other
minor things.
On Thu, May 04, 2023 at 05:04:34PM +0200, David Hildenbrand wrote: > [...] > > > +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags) > > +{ > > + struct address_space *mapping; > > + unsigned long mapping_flags; > > + > > + /* > > + * If we aren't pinning then no problematic write can occur. A long term > > + * pin is the most egregious case so this is the one we disallow. > > + */ > > + if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) != > > + (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) > > + return true; > > + > > + /* The folio is pinned, so we can safely access folio fields. */ > > + > > + /* Neither of these should be possible, but check to be sure. */ > > You can easily have anon pages that are at the swapcache at this point > (especially, because this function is called before our unsharing checks), > the comment is misleading. Ack will update. > > And there is nothing wrong about pinning an anon page that's still in the > swapcache. The following folio_test_anon() check will allow them. > > The check made sense in page_mapping(), but here it's not required. Waaaaaaaaaait a second, you were saying before:- "Folios in the swap cache return the swap mapping" -- you might disallow pinning anonymous pages that are in the swap cache. I recall that there are corner cases where we can end up with an anon page that's mapped writable but still in the swap cache ... so you'd fallback to the GUP slow path (acceptable for these corner cases, I guess), however especially the comment is a bit misleading then. So are we allowing or disallowing pinning anon swap cache pages? :P I mean slow path would allow them if they are just marked anon so I'm inclined to allow them. > > I do agree regarding folio_test_slab(), though. Should we WARN in case we > would have one? 
> > if (WARN_ON_ONCE(folio_test_slab(folio))) > return false; > God help us if we have a slab page at this point, so agreed worth doing, it would surely have to arise from some dreadful bug/memory corruption. > > + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio))) > > + return false; > > + > > + /* hugetlb mappings do not require dirty-tracking. */ > > + if (folio_test_hugetlb(folio)) > > + return true; > > + > > + /* > > + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods > > + * cannot proceed, which means no actions performed under RCU can > > + * proceed either. > > + * > > + * inodes and thus their mappings are freed under RCU, which means the > > + * mapping cannot be freed beneath us and thus we can safely dereference > > + * it. > > + */ > > + lockdep_assert_irqs_disabled(); > > + > > + /* > > + * However, there may be operations which _alter_ the mapping, so ensure > > + * we read it once and only once. > > + */ > > + mapping = READ_ONCE(folio->mapping); > > + > > + /* > > + * The mapping may have been truncated, in any case we cannot determine > > + * if this mapping is safe - fall back to slow path to determine how to > > + * proceed. > > + */ > > + if (!mapping) > > + return false; > > + > > + /* Anonymous folios are fine, other non-file backed cases are not. */ > > + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS; > > + if (mapping_flags) > > + return mapping_flags == PAGE_MAPPING_ANON; > > KSM pages are also (shared) anonymous folios, and that check would fail -- > which is ok (the following unsharing checks reject long-term pinning them), > but a bit inconsistent with your comment and folio_test_anon(). > > It would be more consistent (with your comment and also the folio_test_anon > implementation) to have here: > > return mapping_flags & PAGE_MAPPING_ANON; > I explicitly excluded KSM out of fear that could be some breakage given they're wrprotect'd + expected to CoW though?
But I guess you mean they'd get picked up by the unshare and so it doesn't matter + we wouldn't want to exclude a PG_anon_exclusive case? I'll make the change in any case given the unshare check! I notice the gup_huge_pgd() doesn't do an unshare but I mean, a PGD-sized huge page probably isn't going to be CoW'd :P > > + > > + /* > > + * At this point, we know the mapping is non-null and points to an > > + * address_space object. The only remaining whitelisted file system is > > + * shmem. > > + */ > > + return shmem_mapping(mapping); > > +} > > + > > In general, LGTM > > Acked-by: David Hildenbrand <david@redhat.com> > Thanks! Will respin, addressing your comments and addressing the issue the kernel bot picked up with placement in the appropriate #ifdef's and send out a v9 shortly. > -- > Thanks, > > David / dhildenb >
>> And there is nothing wrong about pinning an anon page that's still in the >> swapcache. The following folio_test_anon() check will allow them. >> >> The check made sense in page_mapping(), but here it's not required. > > Waaaaaaaaaait a second, you were saying before:- > > "Folios in the swap cache return the swap mapping" -- you might disallow > pinning anonymous pages that are in the swap cache. > > I recall that there are corner cases where we can end up with an anon > page that's mapped writable but still in the swap cache ... so you'd > fallback to the GUP slow path (acceptable for these corner cases, I > guess), however especially the comment is a bit misleading then. > > So are we allowing or disallowing pinning anon swap cache pages? :P If we have an exclusive anon page that's still in the swap cache, sure! :) I think there are ways that can be done, and nothing would actually break. (I even wrote selftests in the cow selftests for that to make sure it works as expected) > > I mean slow path would allow them if they are just marked anon so I'm inclined > to allow them. Exactly my reasoning. The less checks the better (especially if ordinary GUP just allows for pinning it) :) > >> >> I do agree regarding folio_test_slab(), though. Should we WARN in case we >> would have one? >> >> if (WARN_ON_ONCE(folio_test_slab(folio))) >> return false; >> > > God help us if we have a slab page at this point, so agreed worth doing, it > would surely have to arise from some dreadful bug/memory corruption. > Or some nasty race condition that we managed to ignore with rechecking if the PTEs/PMDs changed :) >>> + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio))) >>> + return false; >>> + >>> + /* hugetlb mappings do not require dirty-tracking. */ >>> + if (folio_test_hugetlb(folio)) >>> + return true; >>> + >>> + /* >>> + * GUP-fast disables IRQs.
When IRQS are disabled, RCU grace periods >>> + * cannot proceed, which means no actions performed under RCU can >>> + * proceed either. >>> + * >>> + * inodes and thus their mappings are freed under RCU, which means the >>> + * mapping cannot be freed beneath us and thus we can safely dereference >>> + * it. >>> + */ >>> + lockdep_assert_irqs_disabled(); >>> + >>> + /* >>> + * However, there may be operations which _alter_ the mapping, so ensure >>> + * we read it once and only once. >>> + */ >>> + mapping = READ_ONCE(folio->mapping); >>> + >>> + /* >>> + * The mapping may have been truncated, in any case we cannot determine >>> + * if this mapping is safe - fall back to slow path to determine how to >>> + * proceed. >>> + */ >>> + if (!mapping) >>> + return false; >>> + >>> + /* Anonymous folios are fine, other non-file backed cases are not. */ >>> + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS; >>> + if (mapping_flags) >>> + return mapping_flags == PAGE_MAPPING_ANON; >> >> KSM pages are also (shared) anonymous folios, and that check would fail -- >> which is ok (the following unsharing checks reject long-term pinning them), >> but a bit inconsistent with your comment and folio_test_anon(). >> >> It would be more consistent (with your comment and also the folio_test_anon >> implementation) to have here: >> >> return mapping_flags & PAGE_MAPPING_ANON; >> > > I explicitly excluded KSM out of fear that could be some breakage given they're > wrprotect'd + expected to CoW though? But I guess you mean they'd get picked up > by the unshare and so it doesn't matter + we wouldn't want to exclude a > PG_anon_exclusive case? Yes, unsharing handles that in the ordinary GUP and GUP-fast case. And unsharing is neither GUP-fast nor even longterm specific (for anon pages). Reason I'm bringing this up is that I think it's best if we let folio_fast_pin_allowed() just check for what's absolutely GUP-fast specific.
> > I'll make the change in any case given the unshare check! > > I notice the gup_huge_pgd() doesn't do an unshare but I mean, a PGD-sized huge > page probably isn't going to be CoW'd :P I spotted exactly the same thing and wondered about that (after all I added all that unsharing logic ... so I should know). I'm sure there must be a reason I didn't add it ;) ... probably we should just add it even though it might essentially be dead code for now (at least the cow selftests would try with each and every hugetlb size and eventually reveal the problem on whatever arch ends up using that code ... ). Do you want to send a patch to add unsharing to gup_huge_pgd() as well?
On Fri, May 05, 2023 at 04:17:38PM +0200, David Hildenbrand wrote: > > > And there is nothing wrong about pinning an anon page that's still in the > > > swapcache. The following folio_test_anon() check will allow them. > > > > > > The check made sense in page_mapping(), but here it's not required. > > > > Waaaaaaaaaait a second, you were saying before:- > > > > "Folios in the swap cache return the swap mapping" -- you might disallow > > pinning anonymous pages that are in the swap cache. > > > > I recall that there are corner cases where we can end up with an anon > > page that's mapped writable but still in the swap cache ... so you'd > > fallback to the GUP slow path (acceptable for these corner cases, I > > guess), however especially the comment is a bit misleading then. > > > > So are we allowing or disallowing pinning anon swap cache pages? :P > > If we have an exclusive anon page that's still in the swap cache, sure! :) > > I think there are ways that can be done, and nothing would actually break. > (I even wrote selftests in the cow selftests for that to amke sure it works > as expected) > > > > > I mean slow path would allow them if they are just marked anon so I'm inclined > > to allow them. > > Exactly my reasoning. > > The less checks the better (especially if ordinary GUP just allows for > pinning it) :) Yeah a lot of my decision making on this has been trying to be conservative about what we filter for but you get this weird inversion whereby if you're too conservative about what you allow, you are actually being more outlandish about what you disallow and vice-versa. > > > > > > > > > I do agree regarding folio_test_slab(), though. Should we WARN in case we > > > would have one? > > > > > > if (WARN_ON_ONCE(folio_test_slab(folio))) > > > return false; > > > > > > > God help us if we have a slab page at this point, so agreed worth doing, it > > would surely have to arise from some dreadful bug/memory corruption. 
> > > > Or some nasty race condition that we managed to ignore with rechecking if > the PTEs/PMDs changed :) Well that should be sorted now :) > > > > > + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio))) > > > > + return false; > > > > + > > > > + /* hugetlb mappings do not require dirty-tracking. */ > > > > + if (folio_test_hugetlb(folio)) > > > > + return true; > > > > + > > > > + /* > > > > + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods > > > > + * cannot proceed, which means no actions performed under RCU can > > > > + * proceed either. > > > > + * > > > > + * inodes and thus their mappings are freed under RCU, which means the > > > > + * mapping cannot be freed beneath us and thus we can safely dereference > > > > + * it. > > > > + */ > > > > + lockdep_assert_irqs_disabled(); > > > > + > > > > + /* > > > > + * However, there may be operations which _alter_ the mapping, so ensure > > > > + * we read it once and only once. > > > > + */ > > > > + mapping = READ_ONCE(folio->mapping); > > > > + > > > > + /* > > > > + * The mapping may have been truncated, in any case we cannot determine > > > > + * if this mapping is safe - fall back to slow path to determine how to > > > > + * proceed. > > > > + */ > > > > + if (!mapping) > > > > + return false; > > > > + > > > > + /* Anonymous folios are fine, other non-file backed cases are not. */ > > > > + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS; > > > > + if (mapping_flags) > > > > + return mapping_flags == PAGE_MAPPING_ANON; > > > > > > KSM pages are also (shared) anonymous folios, and that check would fail -- > > > which is ok (the following unsharing checks rejects long-term pinning them), > > > but a bit inconstent with your comment and folio_test_anon(). 
> > > > > > It would be more consistent (with your comment and also the folio_test_anon > > > implementation) to have here: > > > > > > return mapping_flags & PAGE_MAPPING_ANON; > > > > > > > I explicitly excluded KSM out of fear that could be some breakage given they're > > wrprotect'd + expected to CoW though? But I guess you mean they'd get picked up > > by the unshare and so it doesn't matter + we wouldn't want to exclude an > > PG_anon_exclusive case? > > Yes, unsharing handles that in the ordinary GUP and GUP-fast case. And > unsharing is neither GUP-fast nor even longterm specific (for anon pages). > > Reason I'm brining this up is that I think it's best if we let > folio_fast_pin_allowed() just check for what's absolutely GUP-fast specific. Ack, indeed it's a separate thing, see above for the contradictory nature of wanting to be cautious but then accidentally making your change _more radical_ than you intended...! > > > > > I'll make the change in any case given the unshare check! > > > > I notice the gup_huge_pgd() doesn't do an unshare but I mean, a PGD-sized huge > > page probably isn't going to be CoW'd :P > > I spotted exactly the same thing and wondered about that (after all I added > all that unsharing logic ... so I should know). I'm sure there must be a > reason I didn't add it ;) > > ... probably we should just add it even though it might essentially be dead > code for now (at least the cow selftests would try with each and every > hugetlb size and eventually reveal the problem on whatever arch ends up > using that code ... ). > > Do you want to send a patch to add unsharing to gup_huge_pgd() as well? > Sure will do! > -- > Thanks, > > David / dhildenb >
diff --git a/mm/gup.c b/mm/gup.c index 0ea9ebec9547..1ab369b5d889 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -18,6 +18,7 @@ #include <linux/migrate.h> #include <linux/mm_inline.h> #include <linux/sched/mm.h> +#include <linux/shmem_fs.h> #include <asm/mmu_context.h> #include <asm/tlbflush.h> @@ -95,6 +96,83 @@ static inline struct folio *try_get_folio(struct page *page, int refs) return folio; } +/* + * Used in the GUP-fast path to determine whether a pin is permitted for a + * specific folio. + * + * This call assumes the caller has pinned the folio, that the lowest page table + * level still points to this folio, and that interrupts have been disabled. + * + * Writing to pinned file-backed dirty tracked folios is inherently problematic + * (see comment describing the writable_file_mapping_allowed() function). We + * therefore try to avoid the most egregious case of a long-term mapping doing + * so. + * + * This function cannot be as thorough as that one as the VMA is not available + * in the fast path, so instead we whitelist known good cases and if in doubt, + * fall back to the slow path. + */ +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags) +{ + struct address_space *mapping; + unsigned long mapping_flags; + + /* + * If we aren't pinning then no problematic write can occur. A long term + * pin is the most egregious case so this is the one we disallow. + */ + if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) != + (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) + return true; + + /* The folio is pinned, so we can safely access folio fields. */ + + /* Neither of these should be possible, but check to be sure. */ + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio))) + return false; + + /* hugetlb mappings do not require dirty-tracking. */ + if (folio_test_hugetlb(folio)) + return true; + + /* + * GUP-fast disables IRQs. 
When IRQS are disabled, RCU grace periods + * cannot proceed, which means no actions performed under RCU can + * proceed either. + * + * inodes and thus their mappings are freed under RCU, which means the + * mapping cannot be freed beneath us and thus we can safely dereference + * it. + */ + lockdep_assert_irqs_disabled(); + + /* + * However, there may be operations which _alter_ the mapping, so ensure + * we read it once and only once. + */ + mapping = READ_ONCE(folio->mapping); + + /* + * The mapping may have been truncated, in any case we cannot determine + * if this mapping is safe - fall back to slow path to determine how to + * proceed. + */ + if (!mapping) + return false; + + /* Anonymous folios are fine, other non-file backed cases are not. */ + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS; + if (mapping_flags) + return mapping_flags == PAGE_MAPPING_ANON; + + /* + * At this point, we know the mapping is non-null and points to an + * address_space object. The only remaining whitelisted file system is + * shmem. + */ + return shmem_mapping(mapping); +} + /** * try_grab_folio() - Attempt to get or pin a folio. 
* @page: pointer to page to be grabbed @@ -2464,6 +2542,11 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, goto pte_unmap; } + if (!folio_fast_pin_allowed(folio, flags)) { + gup_put_folio(folio, 1, flags); + goto pte_unmap; + } + if (!pte_write(pte) && gup_must_unshare(NULL, flags, page)) { gup_put_folio(folio, 1, flags); goto pte_unmap; @@ -2656,6 +2739,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, return 0; } + if (!folio_fast_pin_allowed(folio, flags)) { + gup_put_folio(folio, refs, flags); + return 0; + } + if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) { gup_put_folio(folio, refs, flags); return 0; @@ -2722,6 +2810,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0; } + if (!folio_fast_pin_allowed(folio, flags)) { + gup_put_folio(folio, refs, flags); + return 0; + } if (!pmd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) { gup_put_folio(folio, refs, flags); return 0; @@ -2762,6 +2854,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, return 0; } + if (!folio_fast_pin_allowed(folio, flags)) { + gup_put_folio(folio, refs, flags); + return 0; + } + if (!pud_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) { gup_put_folio(folio, refs, flags); return 0; @@ -2797,6 +2894,11 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 0; } + if (!folio_fast_pin_allowed(folio, flags)) { + gup_put_folio(folio, refs, flags); + return 0; + } + *nr += refs; folio_set_referenced(folio); return 1;