Message ID | b3a4441cade9770e00d24f5ecb75c8f4481785a4.1683044162.git.lstoakes@gmail.com |
---|---|
State | New |
Headers |
From: Lorenzo Stoakes <lstoakes@gmail.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Jens Axboe <axboe@kernel.dk>, Matthew Wilcox <willy@infradead.org>, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>, Leon Romanovsky <leon@kernel.org>, Christian Benvenuti <benve@cisco.com>, Nelson Escobar <neescoba@cisco.com>, Bernard Metzler <bmt@zurich.ibm.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, Bjorn Topel <bjorn@kernel.org>, Magnus Karlsson <magnus.karlsson@intel.com>, Maciej Fijalkowski <maciej.fijalkowski@intel.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, David S. Miller <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Christian Brauner <brauner@kernel.org>, Richard Cochran <richardcochran@gmail.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Jesper Dangaard Brouer <hawk@kernel.org>, John Fastabend <john.fastabend@gmail.com>, linux-fsdevel@vger.kernel.org, linux-perf-users@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, Oleg Nesterov <oleg@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>, John Hubbard <jhubbard@nvidia.com>, Jan Kara <jack@suse.cz>, Kirill A. Shutemov <kirill@shutemov.name>, Pavel Begunkov <asml.silence@gmail.com>, Mika Penttila <mpenttil@redhat.com>, David Hildenbrand <david@redhat.com>, Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>, Peter Xu <peterx@redhat.com>, Matthew Rosato <mjrosato@linux.ibm.com>, Paul E. McKenney <paulmck@kernel.org>, Christian Borntraeger <borntraeger@linux.ibm.com>, Lorenzo Stoakes <lstoakes@gmail.com>
Subject: [PATCH v7 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
Date: Tue, 2 May 2023 17:34:05 +0100
Message-Id: <b3a4441cade9770e00d24f5ecb75c8f4481785a4.1683044162.git.lstoakes@gmail.com>
In-Reply-To: <cover.1683044162.git.lstoakes@gmail.com>
References: <cover.1683044162.git.lstoakes@gmail.com> |
Series | mm/gup: disallow GUP writing to file-backed mappings by default |
Commit Message
Lorenzo Stoakes
May 2, 2023, 4:34 p.m. UTC
Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.
This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.
We have access to less information in the fast path, as we cannot examine
the VMA containing the mapping; however, we can determine whether the folio
is anonymous and then whitelist known-good mappings - specifically hugetlb
and shmem mappings.
While we obtain a stable folio for this check, the mapping itself might not
be stable, as a truncation could nullify it at any time. However, since
doing so requires mappings to be zapped, we can synchronise against a TLB
shootdown operation.
For some architectures TLB shootdown is synchronised by IPI, against which
we are protected as the GUP-fast operation is performed with interrupts
disabled. Equally, we are protected from architectures which specify
CONFIG_MMU_GATHER_RCU_TABLE_FREE as the interrupts being disabled imply an
RCU lock as well.
We whitelist anonymous mappings (and those which otherwise do not have a
valid mapping), shmem and hugetlb mappings, none of which require dirty
tracking so are safe to long-term pin.
It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast, let alone in combination with FOLL_LONGTERM,
so we can always rely on the fact that if we fail to pin on the fast path,
the code will fall back to the slow path which can perform the more
thorough check.
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
mm/gup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 60 insertions(+), 2 deletions(-)
Comments
David Hildenbrand
May 2, 2023, 5:13 p.m. UTC

[...]

> +{
> +	struct address_space *mapping;
> +
> +	/*
> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> +	 * to disappear from under us, as well as preventing RCU grace periods
> +	 * from making progress (i.e. implying rcu_read_lock()).
> +	 *
> +	 * This means we can rely on the folio remaining stable for all
> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +	 * and those that do not.
> +	 *
> +	 * We get the added benefit that given inodes, and thus address_space,
> +	 * objects are RCU freed, we can rely on the mapping remaining stable
> +	 * here with no risk of a truncation or similar race.
> +	 */
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * If no mapping can be found, this implies an anonymous or otherwise
> +	 * non-file backed folio so in this instance we permit the pin.
> +	 *
> +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> +	 * explicitly whitelist these.
> +	 *
> +	 * Other non dirty-tracked folios will be picked up on the slow path.
> +	 */
> +	mapping = folio_mapping(folio);
> +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);

"Folios in the swap cache return the swap mapping" -- you might disallow
pinning anonymous pages that are in the swap cache.

I recall that there are corner cases where we can end up with an anon page
that's mapped writable but still in the swap cache ... so you'd fall back
to the GUP slow path (acceptable for these corner cases, I guess), however
especially the comment is a bit misleading then.

So I'd suggest not dropping the folio_test_anon() check, or open-coding it
... which will make this piece of code most certainly easier to get when
staring at folio_mapping(). Or to spell it out in the comment (usually I
prefer code over comments).

> +}
> +
>  /**
>   * try_grab_folio() - Attempt to get or pin a folio.
>   * @page:	pointer to page to be grabbed
> @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>   */
>  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  {
> +	bool is_longterm = flags & FOLL_LONGTERM;
> +
>  	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
>  		return NULL;
>
> @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	 * right zone, so fail and let the caller fall back to the slow
>  	 * path.
>  	 */
> -	if (unlikely((flags & FOLL_LONGTERM) &&
> -		     !is_longterm_pinnable_page(page)))
> +	if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
>  		return NULL;
>
>  	/*
> @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	if (!folio)
>  		return NULL;
>
> +	/*
> +	 * Can this folio be safely pinned? We need to perform this
> +	 * check after the folio is stabilised.
> +	 */
> +	if ((flags & FOLL_WRITE) && is_longterm &&
> +	    !folio_longterm_write_pin_allowed(folio)) {
> +		folio_put_refs(folio, refs);
> +		return NULL;
> +	}

So we perform this check before validating whether the PTE changed.

Hmm, naturally, I would have done it afterwards.

IIRC, without IPI syncs during TLB flush (i.e.,
CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that

(1) We lookup the pte
(2) The page was unmapped and freed
(3) The page gets reallocated and used
(4) We pin the page
(5) We dereference page->mapping

If we then de-reference page->mapping that gets used by whoever allocated
it for something completely different (not a pointer to something
reasonable), I wonder if we might be in trouble.

Checking first whether the PTE changed makes sure that what we pinned and
what we're looking at is what we expected.

... I can spot that the page_is_secretmem() check is also done before
that. But it at least makes sure that it's still an LRU page before
staring at the mapping (making it a little safer?).

BUT, I keep messing up this part of the story. Maybe it all works as
expected because we will be synchronizing RCU somehow before actually
freeing the page in the !IPI case. ... but I think that's only true for
page tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Peter Zijlstra
May 2, 2023, 5:22 p.m. UTC

On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> [...]
>
> > +	mapping = folio_mapping(folio);
> > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>
> "Folios in the swap cache return the swap mapping" -- you might disallow
> pinning anonymous pages that are in the swap cache.
>
> [...]

So how stable is folio->mapping at this point? Can two subsequent reads
get different values? (e.g. an actual mapping and NULL)

If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
the compiler from emitting the load multiple times.
Lorenzo Stoakes
May 2, 2023, 5:31 p.m. UTC

On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> [...]
>
> I recall that there are corner cases where we can end up with an anon page
> that's mapped writable but still in the swap cache ... so you'd fallback to
> the GUP slow path (acceptable for these corner cases, I guess), however
> especially the comment is a bit misleading then.

How could that happen?

> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> ... which will make this piece of code most certainly easier to get when
> staring at folio_mapping(). Or to spell it out in the comment (usually I
> prefer code over comments).

I literally made this change based on your suggestion :) but perhaps I
misinterpreted what you meant.

I do spell it out in the comment that the page can be anonymous, but
perhaps explicitly checking the mapping flags is the way to go.

> [...]
>
> So we perform this change before validating whether the PTE changed.
>
> Hmm, naturally, I would have done it afterwards.
>
> IIRC, without IPI syncs during TLB flush (i.e.,
> CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
> (1) We lookup the pte
> (2) The page was unmapped and free
> (3) The page gets reallocated and used
> (4) We pin the page
> (5) We dereference page->mapping

But we have an implied RCU lock from disabled IRQs right? Unless that
CONFIG option does something odd (I've not really dug into its behaviour).
It feels like that would break GUP-fast as a whole.

> If we then de-reference page->mapping that gets used by whoever allocated it
> for something completely different (not a pointer to something reasonable),
> I wonder if we might be in trouble.
>
> Checking first, whether the PTE changed makes sure that what we pinned and
> what we're looking at is what we expected.
>
> ... I can spot that the page_is_secretmem() check is also done before that.
> But it at least makes sure that it's still an LRU page before staring at the
> mapping (making it a little safer?).

As do we :)

We also via try_get_folio() check to ensure that we aren't subject to a
split.

> BUT, I keep messing up this part of the story. Maybe it all works as
> expected because we will be synchronizing RCU somehow before actually
> freeing the page in the !IPI case. ... but I think that's only true for page
> tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.

My understanding based on what Peter said is that the IRQs being disabled
should prevent anything bad from happening here.
David Hildenbrand

On 02.05.23 19:22, Peter Zijlstra wrote:
> [...]
>
> So how stable is folio->mapping at this point? Can two subsequent reads
> get different values? (eg. an actual mapping and NULL)
>
> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> the compiler from emitting the load multiple times.

I can only talk about anon pages in this specific call order here (check
first, then test if the PTE changed in the meantime): we don't care if we
get two different values.

If we get a different value the second time, surely we (temporarily)
pinned an anon page that is no longer mapped (freed in the meantime). But
in that case (even if we read garbage folio->mapping and made the wrong
call here), we'll detect afterwards that the PTE changed, and unpin what
we (temporarily) pinned. As folio_test_anon() only checks two bits in
folio->mapping it's fine, because we won't dereference garbage
folio->mapping.

With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill
said it would be fairly stable, but I suspect that it could change
(especially if we call it before validating if the PTE changed as I
described further below).

Now, if we read folio->mapping after checking if the page we pinned is
still mapped (PTE unchanged), at least the page we pinned cannot be reused
in the meantime. I suspect that we can still read "NULL" on the second
read. But whatever we dereference from the first read should still be
valid, even if the second read would have returned NULL ("rcu freeing").
Lorenzo Stoakes

On Tue, May 02, 2023 at 07:22:32PM +0200, Peter Zijlstra wrote:
> [...]
>
> So how stable is folio->mapping at this point? Can two subsequent reads
> get different values? (eg. an actual mapping and NULL)
>
> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> the compiler from emitting the load multiple times.

Yes, that actually feels like a bit of a flaw in folio_mapping(). I
suppose we run the risk of the mapping being reset (e.g. to NULL) even if
any mapping we do get is guaranteed to be safe due to the RCU thing.

Based on David's feedback I think I'll recode to something more direct
where we READ_ONCE() the mapping, then check the mapping flags for anon,
avoid the swap case + check shmem.
David Hildenbrand
May 2, 2023, 5:38 p.m. UTC

On 02.05.23 19:31, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > [...]
> >
> > BUT, I keep messing up this part of the story. Maybe it all works as
> > expected because we will be synchronizing RCU somehow before actually
> > freeing the page in the !IPI case. ... but I think that's only true for page
> > tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
> My understanding based on what Peter said is that the IRQs being disabled
> should prevent anything bad from happening here.

... only if we verified that the PTE didn't change, IIUC.

IRQs disabled only protect you from the mapping getting freed and reused
(because mappings are freed via RCU IIUC). But as far as I can tell, it
doesn't protect you from the page itself getting freed and reused, and
whoever freed the page uses page->mapping to store something completely
different.

But, again, it's all complicated and confusing to me.
page_is_secretmem() also doesn't use a READ_ONCE() ...
On Tue, May 02, 2023 at 07:38:27PM +0200, David Hildenbrand wrote: > On 02.05.23 19:31, Lorenzo Stoakes wrote: > > On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote: > > > [...] > > > > > > > +{ > > > > + struct address_space *mapping; > > > > + > > > > + /* > > > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > > > > + * to disappear from under us, as well as preventing RCU grace periods > > > > + * from making progress (i.e. implying rcu_read_lock()). > > > > + * > > > > + * This means we can rely on the folio remaining stable for all > > > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > > + * and those that do not. > > > > + * > > > > + * We get the added benefit that given inodes, and thus address_space, > > > > + * objects are RCU freed, we can rely on the mapping remaining stable > > > > + * here with no risk of a truncation or similar race. > > > > + */ > > > > + lockdep_assert_irqs_disabled(); > > > > + > > > > + /* > > > > + * If no mapping can be found, this implies an anonymous or otherwise > > > > + * non-file backed folio so in this instance we permit the pin. > > > > + * > > > > + * shmem and hugetlb mappings do not require dirty-tracking so we > > > > + * explicitly whitelist these. > > > > + * > > > > + * Other non dirty-tracked folios will be picked up on the slow path. > > > > + */ > > > > + mapping = folio_mapping(folio); > > > > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio); > > > > > > "Folios in the swap cache return the swap mapping" -- you might disallow > > > pinning anonymous pages that are in the swap cache. > > > > > > I recall that there are corner cases where we can end up with an anon page > > > that's mapped writable but still in the swap cache ... so you'd fallback to > > > the GUP slow path (acceptable for these corner cases, I guess), however > > > especially the comment is a bit misleading then. > > > > How could that happen? 
> > > > > > > > So I'd suggest not dropping the folio_test_anon() check, or open-coding it > > > ... which will make this piece of code most certainly easier to get when > > > staring at folio_mapping(). Or to spell it out in the comment (usually I > > > prefer code over comments). > > > > I literally made this change based on your suggestion :) but perhaps I > > misinterpreted what you meant. > > > > I do spell it out in the comment that the page can be anonymous, But perhaps > > explicitly checking the mapping flags is the way to go. > > > > > > > > > +} > > > > + > > > > /** > > > > * try_grab_folio() - Attempt to get or pin a folio. > > > > * @page: pointer to page to be grabbed > > > > @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > > > */ > > > > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > > > > { > > > > + bool is_longterm = flags & FOLL_LONGTERM; > > > > + > > > > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) > > > > return NULL; > > > > @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > > > > * right zone, so fail and let the caller fall back to the slow > > > > * path. > > > > */ > > > > - if (unlikely((flags & FOLL_LONGTERM) && > > > > - !is_longterm_pinnable_page(page))) > > > > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page))) > > > > return NULL; > > > > /* > > > > @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > > > > if (!folio) > > > > return NULL; > > > > + /* > > > > + * Can this folio be safely pinned? We need to perform this > > > > + * check after the folio is stabilised. 
> > > > + */
> > > > +	if ((flags & FOLL_WRITE) && is_longterm &&
> > > > +	    !folio_longterm_write_pin_allowed(folio)) {
> > > > +		folio_put_refs(folio, refs);
> > > > +		return NULL;
> > > > +	}
> > >
> > > So we perform this change before validating whether the PTE changed.
> > >
> > > Hmm, naturally, I would have done it afterwards.
> > >
> > > IIRC, without IPI syncs during TLB flush (i.e.,
> > > CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
> > > (1) We lookup the pte
> > > (2) The page was unmapped and free
> > > (3) The page gets reallocated and used
> > > (4) We pin the page
> > > (5) We dereference page->mapping
> >
> > But we have an implied RCU lock from disabled IRQs right? Unless that CONFIG
> > option does something odd (I've not really dug into its behaviour). It feels
> > like that would break GUP-fast as a whole.
> >
> > >
> > > If we then de-reference page->mapping that gets used by whoever allocated it
> > > for something completely different (not a pointer to something reasonable),
> > > I wonder if we might be in trouble.
> > >
> > > Checking first, whether the PTE changed makes sure that what we pinned and
> > > what we're looking at is what we expected.
> > >
> > > ... I can spot that the page_is_secretmem() check is also done before that.
> > > But it at least makes sure that it's still an LRU page before staring at the
> > > mapping (making it a little safer?).
> >
> > As do we :)
> >
> > We also via try_get_folio() check to ensure that we aren't subject to a split.
> >
> > >
> > > BUT, I keep messing up this part of the story. Maybe it all works as
> > > expected because we will be synchronizing RCU somehow before actually
> > > freeing the page in the !IPI case. ... but I think that's only true for page
> > > tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
> >
> > My understanding based on what Peter said is that the IRQs being disabled should
> > prevent anything bad from happening here.
> >
> ... 
only if we verified that the PTE didn't change IIUC. IRQs disabled only > protect you from the mapping getting freed and reused (because mappings are > freed via RCU IIUC). > > But as far as I can tell, it doesn't protect you from the page itself > getting freed and reused, and whoever freed the page uses page->mapping to > store something completely different. Ack, and we'd not have mapping->inode to save us in an anon case either. I'd rather be as cautious as we can possibly be, so let's move this to after the 'PTE is the same' check then, will fix on respin. > > But, again, it's all complicated and confusing to me. > It's just a fiddly, complicated, delicate area I feel :) hence why I endeavour to take on board the community's views on this series to ensure we end up with the best possible implementation. > > page_is_secretmem() also doesn't use a READ_ONCE() ... Perhaps one for a follow up patch... > > -- > Thanks, > > David / dhildenb >
On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote: > On 02.05.23 19:22, Peter Zijlstra wrote: > > On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote: > > > [...] > > > > > > > +{ > > > > + struct address_space *mapping; > > > > + > > > > + /* > > > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > > > > + * to disappear from under us, as well as preventing RCU grace periods > > > > + * from making progress (i.e. implying rcu_read_lock()). > > > > + * > > > > + * This means we can rely on the folio remaining stable for all > > > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > > + * and those that do not. > > > > + * > > > > + * We get the added benefit that given inodes, and thus address_space, > > > > + * objects are RCU freed, we can rely on the mapping remaining stable > > > > + * here with no risk of a truncation or similar race. > > > > + */ > > > > + lockdep_assert_irqs_disabled(); > > > > + > > > > + /* > > > > + * If no mapping can be found, this implies an anonymous or otherwise > > > > + * non-file backed folio so in this instance we permit the pin. > > > > + * > > > > + * shmem and hugetlb mappings do not require dirty-tracking so we > > > > + * explicitly whitelist these. > > > > + * > > > > + * Other non dirty-tracked folios will be picked up on the slow path. > > > > + */ > > > > + mapping = folio_mapping(folio); > > > > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio); > > > > > > "Folios in the swap cache return the swap mapping" -- you might disallow > > > pinning anonymous pages that are in the swap cache. > > > > > > I recall that there are corner cases where we can end up with an anon page > > > that's mapped writable but still in the swap cache ... so you'd fallback to > > > the GUP slow path (acceptable for these corner cases, I guess), however > > > especially the comment is a bit misleading then. 
> > > > > > So I'd suggest not dropping the folio_test_anon() check, or open-coding it > > > ... which will make this piece of code most certainly easier to get when > > > staring at folio_mapping(). Or to spell it out in the comment (usually I > > > prefer code over comments). > > > > So how stable is folio->mapping at this point? Can two subsequent reads > > get different values? (eg. an actual mapping and NULL) > > > > If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid > > the compiler from emitting the load multiple times. > > I can only talk about anon pages in this specific call order here (check > first, then test if the PTE changed in the meantime): we don't care if we > get two different values. If we get a different value the second time, > surely we (temporarily) pinned an anon page that is no longer mapped (freed > in the meantime). But in that case (even if we read garbage folio->mapping > and made the wrong call here), we'll detect afterwards that the PTE changed, > and unpin what we (temporarily) pinned. As folio_test_anon() only checks two > bits in folio->mapping it's fine, because we won't dereference garbage > folio->mapping. > > With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill said > it would be fairly stable, but I suspect that it could change (especially if > we call it before validating if the PTE changed as I described further > below). > > Now, if we read folio->mapping after checking if the page we pinned is still > mapped (PTE unchanged), at least the page we pinned cannot be reused in the > meantime. I suspect that we can still read "NULL" on the second read. But > whatever we dereference from the first read should still be valid, even if > the second read would have returned NULL ("rcu freeing"). > On a specific point - if mapping turns out to be NULL after we confirm stable PTE, I'd be inclined to reject and let the slow path take care of it, would you agree that that's the correct approach? 
I guess you could take that to mean that the page is no longer mapped so is safe, but it feels like refusing it would be the safe course. > -- > Thanks, > > David / dhildenb >
On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote: > Now, if we read folio->mapping after checking if the page we pinned is still > mapped (PTE unchanged), at least the page we pinned cannot be reused in the > meantime. I suspect that we can still read "NULL" on the second read. But > whatever we dereference from the first read should still be valid, even if > the second read would have returned NULL ("rcu freeing"). Right, but given it's the compiler adding loads we're not sure what if anything it uses and it gets very hard to reason about the code. This is where READ_ONCE() helps, we instruct the compiler to only do a single load and we can still reason about the code.
On 02.05.23 20:59, Peter Zijlstra wrote: > On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote: >> Now, if we read folio->mapping after checking if the page we pinned is still >> mapped (PTE unchanged), at least the page we pinned cannot be reused in the >> meantime. I suspect that we can still read "NULL" on the second read. But >> whatever we dereference from the first read should still be valid, even if >> the second read would have returned NULL ("rcu freeing"). > > Right, but given it's the compiler adding loads we're not sure what if > anything it uses and it gets very hard to reason about the code. > > This is where READ_ONCE() helps, we instruct the compiler to only do a > single load and we can still reason about the code. I completely agree, and I think we should fix that in page_is_secretmem() as well.
On 02.05.23 20:17, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote: >> On 02.05.23 19:22, Peter Zijlstra wrote: >>> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote: >>>> [...] >>>> >>>>> +{ >>>>> + struct address_space *mapping; >>>>> + >>>>> + /* >>>>> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables >>>>> + * to disappear from under us, as well as preventing RCU grace periods >>>>> + * from making progress (i.e. implying rcu_read_lock()). >>>>> + * >>>>> + * This means we can rely on the folio remaining stable for all >>>>> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE >>>>> + * and those that do not. >>>>> + * >>>>> + * We get the added benefit that given inodes, and thus address_space, >>>>> + * objects are RCU freed, we can rely on the mapping remaining stable >>>>> + * here with no risk of a truncation or similar race. >>>>> + */ >>>>> + lockdep_assert_irqs_disabled(); >>>>> + >>>>> + /* >>>>> + * If no mapping can be found, this implies an anonymous or otherwise >>>>> + * non-file backed folio so in this instance we permit the pin. >>>>> + * >>>>> + * shmem and hugetlb mappings do not require dirty-tracking so we >>>>> + * explicitly whitelist these. >>>>> + * >>>>> + * Other non dirty-tracked folios will be picked up on the slow path. >>>>> + */ >>>>> + mapping = folio_mapping(folio); >>>>> + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio); >>>> >>>> "Folios in the swap cache return the swap mapping" -- you might disallow >>>> pinning anonymous pages that are in the swap cache. >>>> >>>> I recall that there are corner cases where we can end up with an anon page >>>> that's mapped writable but still in the swap cache ... so you'd fallback to >>>> the GUP slow path (acceptable for these corner cases, I guess), however >>>> especially the comment is a bit misleading then. 
>>>> >>>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it >>>> ... which will make this piece of code most certainly easier to get when >>>> staring at folio_mapping(). Or to spell it out in the comment (usually I >>>> prefer code over comments). >>> >>> So how stable is folio->mapping at this point? Can two subsequent reads >>> get different values? (eg. an actual mapping and NULL) >>> >>> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid >>> the compiler from emitting the load multiple times. >> >> I can only talk about anon pages in this specific call order here (check >> first, then test if the PTE changed in the meantime): we don't care if we >> get two different values. If we get a different value the second time, >> surely we (temporarily) pinned an anon page that is no longer mapped (freed >> in the meantime). But in that case (even if we read garbage folio->mapping >> and made the wrong call here), we'll detect afterwards that the PTE changed, >> and unpin what we (temporarily) pinned. As folio_test_anon() only checks two >> bits in folio->mapping it's fine, because we won't dereference garbage >> folio->mapping. >> >> With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill said >> it would be fairly stable, but I suspect that it could change (especially if >> we call it before validating if the PTE changed as I described further >> below). >> >> Now, if we read folio->mapping after checking if the page we pinned is still >> mapped (PTE unchanged), at least the page we pinned cannot be reused in the >> meantime. I suspect that we can still read "NULL" on the second read. But >> whatever we dereference from the first read should still be valid, even if >> the second read would have returned NULL ("rcu freeing"). 
>> > > On a specific point - if mapping turns out to be NULL after we confirm > stable PTE, I'd be inclined to reject and let the slow path take care of > it, would you agree that that's the correct approach? If it's not an anon page and the mapping is NULL, I'd say simply fallback to the slow path.
On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote: > On a specific point - if mapping turns out to be NULL after we confirm > stable PTE, I'd be inclined to reject and let the slow path take care of > it, would you agree that that's the correct approach? I think in general if GUP fast detects any kind of race it should bail to the slow path. The races it tries to resolve itself should have really safe and obvious solutions. I think this comment is misleading: > + /* > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > + * to disappear from under us, as well as preventing RCU grace periods > + * from making progress (i.e. implying rcu_read_lock()). True, but that is not important here since we are not reading page tables > + * This means we can rely on the folio remaining stable for all > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > + * and those that do not. Not really clear. We have a valid folio refcount here, that is all. > + * We get the added benefit that given inodes, and thus address_space, > + * objects are RCU freed, we can rely on the mapping remaining stable > + * here with no risk of a truncation or similar race. Which is the real point: 1) GUP-fast disables IRQs which means this is the same context as rcu_read_lock() 2) We have a valid ref on the folio due to normal GUP fast operation Thus derefing struct folio is OK 3) folio->mapping can be deref'd safely under RCU since mapping is RCU free'd It may be zero if we are racing a page free path Can it be zero for other reasons? If it can't be zero for any other reason then go to GUP slow and let it sort it out Otherwise you have to treat NULL as a success. Really what you are trying to do here is remove the folio lock which would normally protect folio->mapping. Ie this test really boils down to just 'folio_get_mapping_a_ops_rcu() == shmem_aops' The hugetlb test is done on a page flag which should be stable under the pageref. So.. 
Your function really ought to be doing this logic:

	// Should be impossible for a slab page to be in a VMA
	if (unlikely(folio_test_slab(folio)))
		return do gup slow;

	// Can a present PTE even be a swap cache?
	if (unlikely(folio_test_swapcache(folio)))
		return do gup slow;

	if (folio_test_hugetlb(folio))
		return safe for fast

	// Safe without the folio lock ?
	struct address_space *mapping = READ_ONCE(folio->mapping)
	if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
		return safe for fast
	if ((mapping & PAGE_MAPPING_FLAGS) == 0 && mapping)
		return mapping->a_ops == shmem_aops;

	// Depends on what mapping = NULL means
	return do gup slow

Jason
> +static bool folio_longterm_write_pin_allowed(struct folio *folio) > +{ > + struct address_space *mapping; > + > + /* > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > + * to disappear from under us, as well as preventing RCU grace periods > + * from making progress (i.e. implying rcu_read_lock()). > + * > + * This means we can rely on the folio remaining stable for all > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > + * and those that do not. > + * > + * We get the added benefit that given inodes, and thus address_space, > + * objects are RCU freed, we can rely on the mapping remaining stable > + * here with no risk of a truncation or similar race. > + */ > + lockdep_assert_irqs_disabled(); > + > + /* > + * If no mapping can be found, this implies an anonymous or otherwise > + * non-file backed folio so in this instance we permit the pin. > + * > + * shmem and hugetlb mappings do not require dirty-tracking so we > + * explicitly whitelist these. > + * > + * Other non dirty-tracked folios will be picked up on the slow path. > + */ > + mapping = folio_mapping(folio); > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio); > +} BTW, try_grab_folio() is also called from follow_hugetlb_page(), which is ordinary GUP and has interrupts enabled if I am not wrong.
On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote: > > > On a specific point - if mapping turns out to be NULL after we confirm > > stable PTE, I'd be inclined to reject and let the slow path take care of > > it, would you agree that that's the correct approach? > > I think in general if GUP fast detects any kind of race it should bail > to the slow path. > > The races it tries to resolve itself should have really safe and > obvious solutions. > > I think this comment is misleading: > > > + /* > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > > + * to disappear from under us, as well as preventing RCU grace periods > > + * from making progress (i.e. implying rcu_read_lock()). > > True, but that is not important here since we are not reading page > tables > > > + * This means we can rely on the folio remaining stable for all > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > > + * and those that do not. > > Not really clear. We have a valid folio refcount here, that is all. Some of this is a product of mixed signals from different commenters and my being perhaps a little _too_ willing to just go with the flow. With interrupts disabled and IPI blocked, plus the assurances that interrupts being disabled implied the RCU version of page table manipulation is also blocked, my understanding was that remapping in this process to another page could not occur. Of course the folio is 'stable' in the sense we have a refcount on it, but it is unlocked so things can change. I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI, because in the IPI case it seems to me you couldn't even clear the PTE entry before getting to the page table case. Otherwise, I'm a bit uncertain actually as to how we can get to the point where the folio->mapping is being manipulated. Is this why? 
> > > + * We get the added benefit that given inodes, and thus address_space, > > + * objects are RCU freed, we can rely on the mapping remaining stable > > + * here with no risk of a truncation or similar race. > > Which is the real point: > > 1) GUP-fast disables IRQs which means this is the same context as rcu_read_lock() > 2) We have a valid ref on the folio due to normal GUP fast operation > Thus derefing struct folio is OK > 3) folio->mapping can be deref'd safely under RCU since mapping is RCU free'd > It may be zero if we are racing a page free path > Can it be zero for other reasons? Zero? You mean NULL? In any case, I will try to clarify these, I do agree the _key_ point is that we can rely on safely derefing the mapping, at least READ_ONCE()'d, as use-after-free or dereffing garbage is the fear here. > > If it can't be zero for any other reason then go to GUP slow and let > it sort it out > > Otherwise you have to treat NULL as a success. > Well that was literally the question :) and I've got somewhat contradictory feedback. My instinct aligns with yours in that, just fallback to slow path, so that's what I'll do. But just wanted to confirm. > Really what you are trying to do here is remove the folio lock which > would normally protect folio->mapping. Ie this test really boils down > to just 'folio_get_mapping_a_ops_rcu() == shmem_aops' > > The hugetlb test is done on a page flag which should be stable under > the pageref. > > So.. Your function really ought to be doing this logic: > > // Should be impossible for a slab page to be in a VMA > if (unlikely(folio_test_slab(folio))) > return do gup slow; > > // Can a present PTE even be a swap cache? > if (unlikely(folio_test_swapcache(folio))) > return do gup slow; > > if (folio_test_hugetlb(folio)) > return safe for fast > > // Safe without the folio lock ? 
> 	struct address_space *mapping = READ_ONCE(folio->mapping)
> 	if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
> 		return safe for fast
> 	if ((mapping & PAGE_MAPPING_FLAGS) == 0 && mapping)
> 		return mapping->a_ops == shmem_aops;
>
> 	// Depends on what mapping = NULL means
> 	return do gup slow
>

Yeah this is how I was planning to implement it, or something along these
lines. The only question was whether my own view aligned with others to
avoid more spam :)

The READ_ONCE() approach is precisely how I wanted to do it in the first
instance, but feared feedback about duplication and wondered if it made much
difference, but now it's clear this is the right way.

> Jason
On 02.05.23 21:25, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote: >> On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote: >> >>> On a specific point - if mapping turns out to be NULL after we confirm >>> stable PTE, I'd be inclined to reject and let the slow path take care of >>> it, would you agree that that's the correct approach? >> >> I think in general if GUP fast detects any kind of race it should bail >> to the slow path. >> >> The races it tries to resolve itself should have really safe and >> obvious solutions. >> >> I think this comment is misleading: >> >>> + /* >>> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables >>> + * to disappear from under us, as well as preventing RCU grace periods >>> + * from making progress (i.e. implying rcu_read_lock()). >> >> True, but that is not important here since we are not reading page >> tables >> >>> + * This means we can rely on the folio remaining stable for all >>> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE >>> + * and those that do not. >> >> Not really clear. We have a valid folio refcount here, that is all. > > Some of this is a product of mixed signals from different commenters and > my being perhaps a little _too_ willing to just go with the flow. > > With interrupts disabled and IPI blocked, plus the assurances that > interrupts being disabled implied the RCU version of page table > manipulation is also blocked, my understanding was that remapping in this > process to another page could not occur. > > Of course the folio is 'stable' in the sense we have a refcount on it, but > it is unlocked so things can change. > > I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI, > because in the IPI case it seems to me you couldn't even clear the PTE > entry before getting to the page table case. 
> > Otherwise, I'm a bit uncertain actually as to how we can get to the point > where the folio->mapping is being manipulated. Is this why? I'll just stress again that I think there are cases where we unmap and free a page without synchronizing against GUP-fast using an IPI or RCU. That's one of the reasons why we recheck if the PTE changed to back off, so I've been told. I'm happy if someone proves me wrong and a page we just (temporarily) pinned cannot have been freed+reused in the meantime.
On Tue, May 02, 2023 at 09:33:45PM +0200, David Hildenbrand wrote: > On 02.05.23 21:25, Lorenzo Stoakes wrote: > > On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote: > > > On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote: > > > > > > > On a specific point - if mapping turns out to be NULL after we confirm > > > > stable PTE, I'd be inclined to reject and let the slow path take care of > > > > it, would you agree that that's the correct approach? > > > > > > I think in general if GUP fast detects any kind of race it should bail > > > to the slow path. > > > > > > The races it tries to resolve itself should have really safe and > > > obvious solutions. > > > > > > I think this comment is misleading: > > > > > > > + /* > > > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > > > > + * to disappear from under us, as well as preventing RCU grace periods > > > > + * from making progress (i.e. implying rcu_read_lock()). > > > > > > True, but that is not important here since we are not reading page > > > tables > > > > > > > + * This means we can rely on the folio remaining stable for all > > > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > > + * and those that do not. > > > > > > Not really clear. We have a valid folio refcount here, that is all. > > > > Some of this is a product of mixed signals from different commenters and > > my being perhaps a little _too_ willing to just go with the flow. > > > > With interrupts disabled and IPI blocked, plus the assurances that > > interrupts being disabled implied the RCU version of page table > > manipulation is also blocked, my understanding was that remapping in this > > process to another page could not occur. > > > > Of course the folio is 'stable' in the sense we have a refcount on it, but > > it is unlocked so things can change. 
> > > > I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI, > > because in the IPI case it seems to me you couldn't even clear the PTE > > entry before getting to the page table case. > > > > Otherwise, I'm a bit uncertain actually as to how we can get to the point > > where the folio->mapping is being manipulated. Is this why? > > I'll just stress again that I think there are cases where we unmap and free > a page without synchronizing against GUP-fast using an IPI or RCU. OK that explains why we need to be careful. Don't worry I am going to move the check after we confirm PTE entry hasn't changed. > > That's one of the reasons why we recheck if the PTE changed to back off, so > I've been told. > > I'm happy if someone proves me wrong and a page we just (temporarily) pinned > cannot have been freed+reused in the meantime. Let's play it safe for now :) > > -- > Thanks, > > David / dhildenb >
On Tue, May 02, 2023 at 09:17:53PM +0200, David Hildenbrand wrote: > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > +{ > > + struct address_space *mapping; > > + > > + /* > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables > > + * to disappear from under us, as well as preventing RCU grace periods > > + * from making progress (i.e. implying rcu_read_lock()). > > + * > > + * This means we can rely on the folio remaining stable for all > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE > > + * and those that do not. > > + * > > + * We get the added benefit that given inodes, and thus address_space, > > + * objects are RCU freed, we can rely on the mapping remaining stable > > + * here with no risk of a truncation or similar race. > > + */ > > + lockdep_assert_irqs_disabled(); > > + > > + /* > > + * If no mapping can be found, this implies an anonymous or otherwise > > + * non-file backed folio so in this instance we permit the pin. > > + * > > + * shmem and hugetlb mappings do not require dirty-tracking so we > > + * explicitly whitelist these. > > + * > > + * Other non dirty-tracked folios will be picked up on the slow path. > > + */ > > + mapping = folio_mapping(folio); > > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio); > > +} > > BTW, try_grab_folio() is also called from follow_hugetlb_page(), which is > ordinary GUP and has interrupts enabled if I am not wrong. It does hold the PTL though, so can't fiddle with the entry. But that does suggest folio_test_hugetlb() should be put _before_ the irq disabled assertion then :) > > -- > Thanks, > > David / dhildenb >
On Tue, May 02, 2023 at 09:33:45PM +0200, David Hildenbrand wrote:
> I'll just stress again that I think there are cases where we unmap and free
> a page without synchronizing against GUP-fast using an IPI or RCU.

AFAIK this is true on ARM64 and other arches that do not use IPIs for TLB
shootdown. Eg the broadcast TLBI described here:

https://developer.arm.com/documentation/101811/0102/Translation-Lookaside-Buffer-maintenance

  TLB invalidation of remote CPUs
  Is done at the interconnect level and does not trigger any interrupt.

So, arches that don't use IPI have to RCU free the page table entries to
work with GUP fast. They will set MMU_GATHER_RCU_TABLE_FREE.

The whole interrupt disable thing in GUP turns into nothing more than a
hacky RCU on those arches. The ugly bit is the comment:

 * We do not adopt an rcu_read_lock() here as we also want to block IPIs
 * that come from THPs splitting.

Which, I think, can be summarized in today's code base as serializing with
split_huge_page_to_list().

I don't know this code well, but the comment there says "Only caller must
hold pin on the @page" which is only possible if all the PTEs have been
replaced with migration entries or otherwise before we get here. So the IPI
the comment talks about, I suppose, is from the installation of the
migration entry?

However gup_huge_pmd() does the usual read, ref, check pattern, and
split_huge_page_to_list() uses page_ref_freeze() which is cmpxchg

So the three races seem to resolve themselves without IPI magic

- GUP sees the THP folio before the migration entry and +1's the ref
  split_huge_page_to_list() bails because the ref is elevated

- GUP fails to +1 the ref because it is zero and bails,
  split_huge_page_to_list() made it zero, so it runs

- GUP sees the THP folio, split_huge_page_to_list() ran to completion, and
  then GUP +1's a 4k page. The recheck of pmd_val will see the migration
  entry, or the new PTE table pointer, but never the original THP folio. So
  GUP will bail.
The speculative ref on the 4k page is harmless. I can't figure out what this 2014 comment means in today's code. Or where ARM64 hid the "IPI broadcast on THP split" :) See commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13 > That's one of the reasons why we recheck if the PTE changed to back off, so > I've been told. Yes, with no synchronizing point the refcount in GUP fast could be taken on the page after it has been free'd and reallocated, but this is only possible on RCU > I'm happy if someone proves me wrong and a page we just (temporarily) pinned > cannot have been freed+reused in the meantime. Sadly I think no.. We'd need to RCU free the page itself as well to make this true. Jason
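The "read, ref, check" vs page_ref_freeze() interaction described above resolves without IPIs because both sides are compare-and-swap based: a frozen (zero) count refuses new references, and an elevated count refuses freezing. A minimal userspace sketch of just that interaction, using C11 atomics as stand-ins for the kernel primitives (the names mirror the kernel's but these are illustrative stubs, not kernel code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* GUP-fast side: take a reference only if the count is not already
 * zero (the "ref" step of the read, ref, check pattern, in the spirit
 * of folio_try_get()). */
static bool try_get_ref(atomic_int *ref)
{
	int old = atomic_load(ref);
	do {
		if (old == 0)
			return false;	/* folio being split/freed: bail */
	} while (!atomic_compare_exchange_weak(ref, &old, old + 1));
	return true;
}

/* Split side: freeze the count by cmpxchg'ing the expected value down
 * to zero (in the spirit of page_ref_freeze()); fails if anyone holds
 * an extra reference. */
static bool ref_freeze(atomic_int *ref, int expected)
{
	int want = expected;
	return atomic_compare_exchange_strong(ref, &want, 0);
}
```

Whichever cmpxchg lands first wins, and the other side bails, which is the point of the three-race enumeration above.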
On Tue, May 02, 2023 at 08:25:57PM +0100, Lorenzo Stoakes wrote:

> Otherwise, I'm a bit uncertain actually as to how we can get to the point
> where the folio->mapping is being manipulated. Is this why?

On RCU architectures another thread zaps the PTEs and proceeds to
teardown and free the page. Part of that is clearing folio->mapping
under the folio lock.

The IPI approach would not get there, but we can't think in terms of IPI
since we have RCU architectures.

> In any case, I will try to clarify these, I do agree the _key_ point is
> that we can rely on safely derefing the mapping, at least READ_ONCE()'d, as
> use-after-free or dereffing garbage is the fear here.

I didn't chase it down, but as long as folio->mapping is always set to
something that is RCU free'd then this would be OK.

> Well that was literally the question :) and I've got somewhat contradictory
> feedback. My instinct aligns with yours in that, just fall back to the slow
> path, so that's what I'll do. But just wanted to confirm.

I don't know it well enough to say, it looks like folio->mapping is
almost always set to something from what I can see.. At least NULL
pointer and the anon bits set for instance.

In any case falling back to the slow path is always safe, I'd just go to
the trouble to check with testing that in normal process cases we don't
hit it.

> The READ_ONCE() approach is precisely how I wanted to do it in the first
> instance, but feared feedback about duplication and wondered if it made
> much difference, but now it's clear this is the right way.

The duplication is bad, and maybe we need more functions to counter it,
but GUP needs to know some of the details a little more, eg a NULL
return from folio_mapping() could indicate the anon bits being set,
which should not flow down to slow.

Jason
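The "anon bits" remark refers to the kernel encoding PAGE_MAPPING_ANON (and PAGE_MAPPING_MOVABLE) in the low bits of the mapping pointer, with folio_mapping() returning NULL when they are set. A hedged userspace sketch of that tagging (the flag values mirror the kernel's, but the types and helpers here are models, not the real implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Low-bit tags on the mapping pointer, values as in the kernel. */
#define PAGE_MAPPING_ANON	0x1UL
#define PAGE_MAPPING_MOVABLE	0x2UL
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

struct address_space { int dummy; };	/* stand-in, 4-byte aligned */

/* folio_mapping()-like helper: anon/movable "mappings" are hidden and
 * come back as NULL, so NULL alone cannot distinguish "no mapping at
 * all" from "anon folio" -- the caller must look at the bits. */
static struct address_space *model_folio_mapping(uintptr_t mapping)
{
	if (mapping & PAGE_MAPPING_FLAGS)
		return NULL;
	return (struct address_space *)mapping;
}

/* The detail GUP-fast would want: is the NULL due to the anon bit? */
static bool model_is_anon(uintptr_t mapping)
{
	return (mapping & PAGE_MAPPING_ANON) != 0;
}
```

This is why a NULL return from folio_mapping() "should not flow down to slow" blindly: an anon folio is fine to pin, and bouncing it to the slow path would be a needless fallback.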
diff --git a/mm/gup.c b/mm/gup.c
index 6e209ca10967..93b4aa39e5a5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>

 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -95,6 +96,52 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	return folio;
 }

+/*
+ * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
+ * FOLL_WRITE pin is permitted for a specific folio.
+ *
+ * This assumes the folio is stable and pinned.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writeable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases.
+ */
+static bool folio_longterm_write_pin_allowed(struct folio *folio)
+{
+	struct address_space *mapping;
+
+	/*
+	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
+	 * to disappear from under us, as well as preventing RCU grace periods
+	 * from making progress (i.e. implying rcu_read_lock()).
+	 *
+	 * This means we can rely on the folio remaining stable for all
+	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
+	 * and those that do not.
+	 *
+	 * We get the added benefit that given inodes, and thus address_space,
+	 * objects are RCU freed, we can rely on the mapping remaining stable
+	 * here with no risk of a truncation or similar race.
+	 */
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * If no mapping can be found, this implies an anonymous or otherwise
+	 * non-file backed folio so in this instance we permit the pin.
+	 *
+	 * shmem and hugetlb mappings do not require dirty-tracking so we
+	 * explicitly whitelist these.
+	 *
+	 * Other non dirty-tracked folios will be picked up on the slow path.
+	 */
+	mapping = folio_mapping(folio);
+	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
+}
+
 /**
  * try_grab_folio() - Attempt to get or pin a folio.
  * @page:    pointer to page to be grabbed
@@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+	bool is_longterm = flags & FOLL_LONGTERM;
+
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;

@@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	 * right zone, so fail and let the caller fall back to the slow
 	 * path.
 	 */
-	if (unlikely((flags & FOLL_LONGTERM) &&
-		     !is_longterm_pinnable_page(page)))
+	if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
 		return NULL;

 	/*
@@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	if (!folio)
 		return NULL;

+	/*
+	 * Can this folio be safely pinned? We need to perform this
+	 * check after the folio is stabilised.
+	 */
+	if ((flags & FOLL_WRITE) && is_longterm &&
+	    !folio_longterm_write_pin_allowed(folio)) {
+		folio_put_refs(folio, refs);
+		return NULL;
+	}
+
 	/*
 	 * When pinning a large folio, use an exact count to track it.
 	 *
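The refcounting discipline in the hunk above (stabilise first, check second, drop the refs on rejection so the caller falls back to the slow path) can be modelled in userspace as follows. Everything here is a stand-in stub; the FOLL_* values are illustrative, not the kernel's, and the point demonstrated is only that a rejected grab leaves the refcount balanced:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct folio { int refcount; bool write_pin_allowed; };

#define FOLL_WRITE	0x01u	/* illustrative values, not the kernel's */
#define FOLL_LONGTERM	0x02u

static bool folio_longterm_write_pin_allowed(struct folio *f)
{
	return f->write_pin_allowed;
}

static void folio_put_refs(struct folio *f, int refs)
{
	f->refcount -= refs;
}

/* Returns the folio on success, NULL to make the caller fall back to
 * the GUP slow path. */
static struct folio *try_grab_folio(struct folio *f, int refs,
				    unsigned int flags)
{
	bool is_longterm = flags & FOLL_LONGTERM;

	f->refcount += refs;	/* stabilise first, as in the patch */

	/* check only after the refs are held, mirroring the patch: the
	 * folio cannot be freed from under the check */
	if ((flags & FOLL_WRITE) && is_longterm &&
	    !folio_longterm_write_pin_allowed(f)) {
		folio_put_refs(f, refs);	/* no leaked references */
		return NULL;
	}
	return f;
}
```

A NULL return here is always safe for correctness; as noted earlier in the thread, the only concern is making sure common workloads don't take the fallback unnecessarily.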