Message ID | 6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp601228vqo; Thu, 27 Apr 2023 17:03:06 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ4whos7esdE1WSV8pAVdTq2DtFjffmrqPJtw1RBrPfp/tHUCoQc9GkK46MlyEeZJ7bRP3Vg X-Received: by 2002:a05:6a00:1905:b0:63d:3595:26db with SMTP id y5-20020a056a00190500b0063d359526dbmr4840373pfi.23.1682640186253; Thu, 27 Apr 2023 17:03:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682640186; cv=none; d=google.com; s=arc-20160816; b=su6Af3GerBqgUJupMtMe/Mcs5H16tFFTU85RxpspDg28WFDxS13aa6srwSfugl6RFD l5yYxiquIr6u4q44Sj5cCQxpMfMTG0J+UCqBi1IQwUfgtrK6hHoWg6kc/29REI9XMYb7 R/W9LEWvwPPIQmdxpuaGV4MgUR24hVuDxjDflog5qeSFuOzhPnTB+jBKjT8BcGZTs6uP RwnyfJbNKIuB1QDqx2iWdIcaXHNS2N35V+z7oIRlHS0s1GK2SM6WWs8Ip2JdyjsrjW8n dnXS6nMLDEJOco2GbRct7UGLhWFwg94oO/MYjYP/4K5CJXvWWwf8PD/fdf4tOQEAmfEE NaWw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=5XGoYes4lZe9LymJ1b7BEtzp9oEzXqkK69ZXVdNGO28=; b=0nr8VXScxHlrgR+Nn8Iyny7Sy0LUtnNfqwr6/DP5UbndaViu6QMtYjikG85Y/uOQ4P OCo3FKgW9DZvddMro8gqIaEl6Y6qNs0BIw0ITshAs+vy6tyUFVKyPr2AAPD1Ig8LkUco fueYh8UE4wTNCrDqT3MFaaHloSpMs2c184vvxesecn7JvGaJU12q8ibCNQGXFxAVNjDP ny8nQvyos6jC0PFZ60Bb1pLZ/SjaPDEjD4o/Q0OB5hYJ2FE7rjmdemNLGaAnVe3O/oug rrpHEwda3OBHSg5jzoRt0ezni5FxD5yp6DcEtei4HNHq36IkI/EZYNaSsFzfW1pF8Fv0 1DHw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=fL2eTb8M; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 195-20020a6302cc000000b00518462accf4si20666802pgc.536.2023.04.27.17.02.27; Thu, 27 Apr 2023 17:03:06 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=fL2eTb8M; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344432AbjD0Xml (ORCPT <rfc822;zxc52fgh@gmail.com> + 99 others); Thu, 27 Apr 2023 19:42:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47882 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344277AbjD0Xmj (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 27 Apr 2023 19:42:39 -0400 Received: from mail-wr1-x433.google.com (mail-wr1-x433.google.com [IPv6:2a00:1450:4864:20::433]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0D46930F1; Thu, 27 Apr 2023 16:42:38 -0700 (PDT) Received: by mail-wr1-x433.google.com with SMTP id ffacd0b85a97d-2f3fe12de15so5659696f8f.3; Thu, 27 Apr 2023 16:42:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1682638956; x=1685230956; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=5XGoYes4lZe9LymJ1b7BEtzp9oEzXqkK69ZXVdNGO28=; b=fL2eTb8MSCdfF5H27s7rWxS11eVMySNR+UgjCaYYPyw7IM6/KjLmevrV4cS/3ssY7q ZKuM9oVuXrzc+BmFpYkE8GriSs1WUGQwJZXOt4bIfKDFE3NIf3QyilBbtC4tYhmZQ6z2 YD7LxwXGZSHZUjSJ8NTdeMEnbrP9bqIhM6GJHNXATHwmBNO/vOdyGR/tezN+6/zgO1KJ wOGuBclWRddmRNkp7hLbCm7EG4IDKsn+mUVcbDp5Tl/lRWGpaN0hA7JAh5l8hXH/Un/R rBJNVbIRvBBJk7wb9yB9vJbKAgSdRYeKaqtk7BJV4Kg99hInez3GR83H+MT5mmZdZF9X h8oQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1682638956; x=1685230956; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=5XGoYes4lZe9LymJ1b7BEtzp9oEzXqkK69ZXVdNGO28=; b=OVKSdhjL7Q2K/b3jn5nnLXBT/qoju04TybaFFAfbITHJzrPrTTY+VB8P0M3pUcOFWF sze95ib+5apKFqyLVIib+IZcHxOa7nywAizqAOkivLFZqZOnB3NzD3SbiVFDBxyoFCRf Y6lWm/3NDwuWJLrMM7xIZmCPfTCrBtEOjpRWxL6SBJqQ3MGvf/XbiDAXOwzVargVmxuC +nLqgYCXdubFas25nUgMeJyWgWUnEuSMP8h8G2a9D14xfErVcQG5xCD0hIr1/Dit+++Z AglN5kU6chY2Kyemjxwg6XkruDz5D48oiISyxhaXuwbksHKoTxWNB5DQMbcIhT+dL0I5 oSiw== X-Gm-Message-State: AC+VfDxyqSgKxmo9toTiobZsInnFUZoXiixkzSPeMf6EMLL1UGCcXSMw T/mKnhLCm1tILD7oJavME+0= X-Received: by 2002:a5d:620d:0:b0:2f5:3fa1:6226 with SMTP id y13-20020a5d620d000000b002f53fa16226mr2407368wru.14.1682638956261; Thu, 27 Apr 2023 16:42:36 -0700 (PDT) Received: from lucifer.home (host86-156-84-164.range86-156.btcentralplus.com. [86.156.84.164]) by smtp.googlemail.com with ESMTPSA id 25-20020a05600c025900b003ed2c0a0f37sm22520047wmj.35.2023.04.27.16.42.34 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Apr 2023 16:42:35 -0700 (PDT) From: Lorenzo Stoakes <lstoakes@gmail.com> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca>, Jens Axboe <axboe@kernel.dk>, Matthew Wilcox <willy@infradead.org>, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>, Leon Romanovsky <leon@kernel.org>, Christian Benvenuti <benve@cisco.com>, Nelson Escobar <neescoba@cisco.com>, Bernard Metzler <bmt@zurich.ibm.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, Bjorn Topel <bjorn@kernel.org>, Magnus Karlsson <magnus.karlsson@intel.com>, Maciej Fijalkowski <maciej.fijalkowski@intel.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, "David S . Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Christian Brauner <brauner@kernel.org>, Richard Cochran <richardcochran@gmail.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Jesper Dangaard Brouer <hawk@kernel.org>, John Fastabend <john.fastabend@gmail.com>, linux-fsdevel@vger.kernel.org, linux-perf-users@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, Oleg Nesterov <oleg@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>, John Hubbard <jhubbard@nvidia.com>, Jan Kara <jack@suse.cz>, "Kirill A . Shutemov" <kirill@shutemov.name>, Pavel Begunkov <asml.silence@gmail.com>, Mika Penttila <mpenttil@redhat.com>, Lorenzo Stoakes <lstoakes@gmail.com> Subject: [PATCH v5] mm/gup: disallow GUP writing to file-backed mappings by default Date: Fri, 28 Apr 2023 00:42:32 +0100 Message-Id: <6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com> X-Mailer: git-send-email 2.40.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764376115831313744?= X-GMAIL-MSGID: =?utf-8?q?1764376115831313744?= |
Series |
[v5] mm/gup: disallow GUP writing to file-backed mappings by default
|
|
Commit Message
Lorenzo Stoakes
April 27, 2023, 11:42 p.m. UTC
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This results in both data being written to a folio without writenotify, and
the folio being dirtied unexpectedly (if the caller decides to do so).
This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.
This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not, such
as shmem or hugetlb, are not at risk and therefore can be written to
without issue.
Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.
However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.
These mappings are considerably more likely to be written to after
folios are cleaned and thus simply must not be permitted to do so.
As part of this change we separate out vma_needs_dirty_tracking() as a
helper function to determine this which is distinct from
vma_wants_writenotify() which is specific to determining which PTE flags to
set.
[1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
include/linux/mm.h | 1 +
mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++-
mm/mmap.c | 36 +++++++++++++++++++++++++++---------
3 files changed, 68 insertions(+), 10 deletions(-)
--
2.40.0
Comments
On 4/27/23 16:42, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) > Looks good. Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks,
On 28.4.2023 2.42, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) > Thanks Lorenzo for the added patch descriptions! Reviewed-by: Mika Penttilä <mpenttil@redhat.com> > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..d36a5db9feb1 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation, as kernel write access to GUP mappings > + * do not adhere to the semantics expected by a file system. > + * > + * Consider the following scenario:- > + * > + * 1. A folio is written to via GUP which write-faults the memory, notifying > + * the file system and dirtying the folio. > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > + * the PTE being marked read-only. > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > + * direct mapping. > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > + * (though it does not have to). > + * > + * This results in both data being written to a folio without writenotify, and > + * the folio being dirtied unexpectedly (if the caller decides to do so). > + */ > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > + unsigned long gup_flags) > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && > + !writeable_file_mapping_allowed(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..7b6344d1832a 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > } > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) > +{ > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > +} > + > +/* > + * Does this VMA require the underlying folios to have their dirty state > + * tracked? > + */ > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > +{ > + /* Does the filesystem need to be notified? */ > + if (vm_ops_needs_writenotify(vma->vm_ops)) > + return true; > + > + /* Specialty mapping? */ > + if (vma->vm_flags & VM_PFNMAP) > + return false; > + > + /* Can the mapping track the dirty pages? */ > + return vma->vm_file && vma->vm_file->f_mapping && > + mapping_can_writeback(vma->vm_file->f_mapping); > +} > + > /* > * Some shared mappings will want the pages marked read-only > * to track write events. If so, we'll downgrade vm_page_prot > @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > { > vm_flags_t vm_flags = vma->vm_flags; > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > /* The backer wishes to know when pages are first written to? */ > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > + if (vm_ops_needs_writenotify(vma->vm_ops)) > return 1; > > /* The open routine did something to the protections that pgprot_modify > @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > if (userfaultfd_wp(vma)) > return 1; > > - /* Specialty mapping? */ > - if (vm_flags & VM_PFNMAP) > - return 0; > - > - /* Can the mapping track the dirty pages? */ > - return vma->vm_file && vma->vm_file->f_mapping && > - mapping_can_writeback(vma->vm_file->f_mapping); > + return vma_needs_dirty_tracking(vma); > } > > /* > -- > 2.40.0 >
On Fri 28-04-23 00:42:32, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> I'm for trying this out and let's see who complains ;) The patch looks good to me from the implementation point of view. Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..d36a5db9feb1 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation, as kernel write access to GUP mappings > + * do not adhere to the semantics expected by a file system. > + * > + * Consider the following scenario:- > + * > + * 1. A folio is written to via GUP which write-faults the memory, notifying > + * the file system and dirtying the folio. > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > + * the PTE being marked read-only. > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > + * direct mapping. > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > + * (though it does not have to). > + * > + * This results in both data being written to a folio without writenotify, and > + * the folio being dirtied unexpectedly (if the caller decides to do so). > + */ > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > + unsigned long gup_flags) > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && > + !writeable_file_mapping_allowed(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..7b6344d1832a 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > } > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) > +{ > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > +} > + > +/* > + * Does this VMA require the underlying folios to have their dirty state > + * tracked? > + */ > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > +{ > + /* Does the filesystem need to be notified? */ > + if (vm_ops_needs_writenotify(vma->vm_ops)) > + return true; > + > + /* Specialty mapping? */ > + if (vma->vm_flags & VM_PFNMAP) > + return false; > + > + /* Can the mapping track the dirty pages? */ > + return vma->vm_file && vma->vm_file->f_mapping && > + mapping_can_writeback(vma->vm_file->f_mapping); > +} > + > /* > * Some shared mappings will want the pages marked read-only > * to track write events. If so, we'll downgrade vm_page_prot > @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > { > vm_flags_t vm_flags = vma->vm_flags; > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > /* The backer wishes to know when pages are first written to? */ > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > + if (vm_ops_needs_writenotify(vma->vm_ops)) > return 1; > > /* The open routine did something to the protections that pgprot_modify > @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > if (userfaultfd_wp(vma)) > return 1; > > - /* Specialty mapping? */ > - if (vm_flags & VM_PFNMAP) > - return 0; > - > - /* Can the mapping track the dirty pages? */ > - return vma->vm_file && vma->vm_file->f_mapping && > - mapping_can_writeback(vma->vm_file->f_mapping); > + return vma_needs_dirty_tracking(vma); > } > > /* > -- > 2.40.0
On Fri, Apr 28, 2023 at 12:42:32AM +0100, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..d36a5db9feb1 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation, as kernel write access to GUP mappings > + * do not adhere to the semantics expected by a file system. > + * > + * Consider the following scenario:- > + * > + * 1. A folio is written to via GUP which write-faults the memory, notifying > + * the file system and dirtying the folio. > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > + * the PTE being marked read-only. > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > + * direct mapping. > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > + * (though it does not have to). > + * > + * This results in both data being written to a folio without writenotify, and > + * the folio being dirtied unexpectedly (if the caller decides to do so). > + */ > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > + unsigned long gup_flags) > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && > + !writeable_file_mapping_allowed(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..7b6344d1832a 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > } > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) > +{ > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > +} > + > +/* > + * Does this VMA require the underlying folios to have their dirty state > + * tracked? > + */ > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > +{ > + /* Does the filesystem need to be notified? */ > + if (vm_ops_needs_writenotify(vma->vm_ops)) > + return true; > + > + /* Specialty mapping? */ > + if (vma->vm_flags & VM_PFNMAP) > + return false; > + > + /* Can the mapping track the dirty pages? */ > + return vma->vm_file && vma->vm_file->f_mapping && > + mapping_can_writeback(vma->vm_file->f_mapping); > +} > + > /* > * Some shared mappings will want the pages marked read-only > * to track write events. If so, we'll downgrade vm_page_prot > @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > { > vm_flags_t vm_flags = vma->vm_flags; > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > /* The backer wishes to know when pages are first written to? */ > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > + if (vm_ops_needs_writenotify(vma->vm_ops)) > return 1; > > /* The open routine did something to the protections that pgprot_modify > @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > if (userfaultfd_wp(vma)) > return 1; > > - /* Specialty mapping? */ > - if (vm_flags & VM_PFNMAP) > - return 0; > - > - /* Can the mapping track the dirty pages? */ > - return vma->vm_file && vma->vm_file->f_mapping && > - mapping_can_writeback(vma->vm_file->f_mapping); > + return vma_needs_dirty_tracking(vma); > } > > /* > -- > 2.40.0 Apologies, for some reason I forgot to include the revision list in that patch, enclosed below:- v5: - Rebased on latest mm-unstable as of 25th April 2023. - Some small refactorings suggested by John. - Added an extended description of the problem in the comment around writeable_file_mapping_allowed() for clarity. - Updated commit message as suggested by Mika and John. v4: - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce duplication and update to use this in the GUP check. Note that both separately check vm_ops_needs_writenotify() as the latter needs to test this before the vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this twice, however it is such a small check this should not be egregious. https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/ v3: - Rebased on latest mm-unstable as of 24th April 2023. - Explicitly check whether file system requires folio dirtying. Note that vma_wants_writenotify() could not be used directly as it is very much focused on determining if the PTE r/w should be set (e.g. assuming private mapping does not require it as already set, soft dirty considerations). - Tested code against shmem and hugetlb mappings - confirmed that these are not disallowed by the check. - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only for FOLL_LONGTERM pins. - As a result, limit check to internal GUP code. https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ v2: - Add accidentally excluded ptrace_access_vm() use of FOLL_ALLOW_BROKEN_FILE_MAPPING. - Tweak commit message. https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ v1: https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/
On Fri, Apr 28, 2023 at 12:42:32AM +0100, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason
Sorry for jumping in late, I'm on vacation :) On 28.04.23 01:42, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. How should we enforce it? It would be a BUG in the GUP user. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. Right, in mprotect() code we only allow upgrading write permissions in this case if the pte is dirty, so we always go via the pagefault path. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. How would that be triggered? Would that writeback triggered by e.g., fsync that Jan tried to tackle recently? > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > This change has the potential to break existing setups. Simple example: libvirt domains configured for file-backed VM memory that also has a vfio device configured. It can easily be configured by users (evolving VM configuration, copy-paste etc.). And it works from a VM perspective, because the guest memory is essentially stale once the VM is shutdown and the pages were unpinned. At least we're not concerned about stale data on disk. With your changes, such VMs would no longer start, breaking existing user setups with a kernel update. I don't really see a lot of reasons to perform this change now. It's been known to be problematic for a long time. People are working on a fix (I see Jan is already CCed, CCing Dave and Christop). FOLL_LONGTERM check is only handling some of the problematic cases, so it's not even a complete blocker. I know, Jason und John will disagree, but I don't think we want to be very careful with changing the default. Sure, we could warn, or convert individual users using a flag (io_uring). But maybe we should invest more energy on a fix? > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..d36a5db9feb1 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation, as kernel write access to GUP mappings > + * do not adhere to the semantics expected by a file system. > + * > + * Consider the following scenario:- > + * > + * 1. A folio is written to via GUP which write-faults the memory, notifying > + * the file system and dirtying the folio. > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > + * the PTE being marked read-only. > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > + * direct mapping. > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > + * (though it does not have to). > + * > + * This results in both data being written to a folio without writenotify, and > + * the folio being dirtied unexpectedly (if the caller decides to do so). > + */ > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > + unsigned long gup_flags) > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped. > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && > + !writeable_file_mapping_allowed(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..7b6344d1832a 100644 > --- a/mm/mmap.c I'm probably missing something, why don't we have to handle GUP-fast (having said that, it's hard to handle ;) )? The sequence you describe above should apply to GUP-fast as well, no? 1) Pin writable mapped page using GUP-fast 2) Trigger writeback 3) Write to page via pin 4) Unpin and set dirty
On Fri, Apr 28, 2023 at 04:20:46PM +0200, David Hildenbrand wrote: > Sorry for jumping in late, I'm on vacation :) > > On 28.04.23 01:42, Lorenzo Stoakes wrote: > > Writing to file-backed mappings which require folio dirty tracking using > > GUP is a fundamentally broken operation, as kernel write access to GUP > > mappings do not adhere to the semantics expected by a file system. > > > > A GUP caller uses the direct mapping to access the folio, which does not > > cause write notify to trigger, nor does it enforce that the caller marks > > the folio dirty. > > How should we enforce it? It would be a BUG in the GUP user. I hope we don't have these kinds of mistakes.. hard to enforce by code. > This change has the potential to break existing setups. Simple example: > libvirt domains configured for file-backed VM memory that also has a vfio > device configured. It can easily be configured by users (evolving VM > configuration, copy-paste etc.). And it works from a VM perspective, because > the guest memory is essentially stale once the VM is shutdown and the pages > were unpinned. At least we're not concerned about stale data on > disk. I think this is broken today and we should block it. We know from experiments with RDMA that doing exactly this triggers kernel oop's. Run your qemu config once, all the pages in the file become dirty. Run your qmeu config again and now all the dirty pages are longterm pinned. Something eventually does writeback and FS cleans the page. Now close your VM and the page is dirtied without make write. FS is inconsistent with the MM, kernel will eventually oops. I'm skeptical that anyone can actually do this combination of things successfully without getting kernel crashes or file data corruption - ie there is no real user to break. > With your changes, such VMs would no longer start, breaking existing user > setups with a kernel update. Yes, as a matter of security we should break it. Earlier I suggested making this contingent on kernel lockdown >= integrity, if actual users come and complain we should go to that option. > Sure, we could warn, or convert individual users using a flag (io_uring). > But maybe we should invest more energy on a fix? It has been years now, I think we need to admit a fix is still years away. Blocking the security problem may even motivate more people to work on a fix. Security is the primary case where we have historically closed uAPI items. Jason
On Fri, Apr 28, 2023 at 04:20:46PM +0200, David Hildenbrand wrote: > Sorry for jumping in late, I'm on vacation :) > > On 28.04.23 01:42, Lorenzo Stoakes wrote: > > Writing to file-backed mappings which require folio dirty tracking using > > GUP is a fundamentally broken operation, as kernel write access to GUP > > mappings do not adhere to the semantics expected by a file system. > > > > A GUP caller uses the direct mapping to access the folio, which does not > > cause write notify to trigger, nor does it enforce that the caller marks > > the folio dirty. > > How should we enforce it? It would be a BUG in the GUP user. There was discussion about holding locks when passing back pages but it's a sticky question no doubt, but a bit out of scope here. This is more documenting that this is a thing. > > > > > The problem arises when, after an initial write to the folio, writeback > > results in the folio being cleaned and then the caller, via the GUP > > interface, writes to the folio again. > > > > As a result of the use of this secondary, direct, mapping to the folio no > > write notify will occur, and if the caller does mark the folio dirty, this > > will be done so unexpectedly. > > Right, in mprotect() code we only allow upgrading write permissions in this > case if the pte is dirty, so we always go via the pagefault path. > In my ideal world we'd somehow do remote accesses via a kthread_use_mm() style approach and page fault in every time. That's again a bit out of scope though... > > > > For example, consider the following scenario:- > > > > 1. A folio is written to via GUP which write-faults the memory, notifying > > the file system and dirtying the folio. > > 2. Later, writeback is triggered, resulting in the folio being cleaned and > > the PTE being marked read-only. > > > How would that be triggered? Would that writeback triggered by e.g., fsync > that Jan tried to tackle recently? Yes or just perioditic writeback threads. The folio is dirty and needs writeback, so at some point that'll happen. Obviously fsync/msync could cause it too (I may be missing something here :) > > > > 3. The GUP caller writes to the folio, as it is mapped read/write via the > > direct mapping. > > 4. The GUP caller, now done with the page, unpins it and sets it dirty > > (though it does not have to). > > > > This results in both data being written to a folio without writenotify, and > > the folio being dirtied unexpectedly (if the caller decides to do so). > > > > This issue was first reported by Jan Kara [1] in 2018, where the problem > > resulted in file system crashes. > > > > This is only relevant when the mappings are file-backed and the underlying > > file system requires folio dirty tracking. File systems which do not, such > > as shmem or hugetlb, are not at risk and therefore can be written to > > without issue. > > > > Unfortunately this limitation of GUP has been present for some time and > > requires future rework of the GUP API in order to provide correct write > > access to such mappings. > > > > However, for the time being we introduce this check to prevent the most > > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > > > These mappings are considerably more likely to be written to after > > folios are cleaned and thus simply must not be permitted to do so. > > > > As part of this change we separate out vma_needs_dirty_tracking() as a > > helper function to determine this which is distinct from > > vma_wants_writenotify() which is specific to determining which PTE flags to > > set. > > > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > > > > This change has the potential to break existing setups. Simple example: > libvirt domains configured for file-backed VM memory that also has a vfio > device configured. It can easily be configured by users (evolving VM > configuration, copy-paste etc.). And it works from a VM perspective, because > the guest memory is essentially stale once the VM is shutdown and the pages > were unpinned. At least we're not concerned about stale data on disk. > > With your changes, such VMs would no longer start, breaking existing user > setups with a kernel update. Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary file system in the guest? I may well be missing context on this so forgive me if I'm being a little dumb here, but it'd be good to get a specific example. > > I don't really see a lot of reasons to perform this change now. It's been > known to be problematic for a long time. People are working on a fix (I see > Jan is already CCed, CCing Dave and Christop). FOLL_LONGTERM check is only > handling some of the problematic cases, so it's not even a complete blocker. I am not a huge fan of the commentary along the lines of 'I don't really see a lot of reasons to make this change' when people have gone to lengths to posit reasons as to why this change is being made, by all means disagree but it's more helpful if it's of the form 'you say we should do this for reasons X, Y, Z I disagree for reasons a, b, c' (which you've done elsewhere). Also those people working on a longer term fix are supporting this change ;) I think to think that I myself may also contribute to this fix longer term, perhaps. > > I know, Jason und John will disagree, but I don't think we want to be very > careful with changing the default. > > Sure, we could warn, or convert individual users using a flag (io_uring). > But maybe we should invest more energy on a fix? This is proactively blocking a cleanup (eliminating vmas) that I believe will be useful in moving things forward. I am not against an opt-in option (I have been responding to community feedback in adapting my approach), which is the way I implemented it all the way back then :) But given we know this is both entirely broken and a potential security issue, and FOLL_LONGTERM is about as egregious as you can get (user explicitly saying they'll hold write access indefinitely) I feel it is an important improvement and makes clear that this is not an acceptable usage. I see Jason has said more on this also :) > > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > --- > > include/linux/mm.h | 1 + > > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > > 3 files changed, 68 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 37554b08bb28..f7da02fc89c6 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > MM_CP_UFFD_WP_RESOLVE) > > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > > { > > diff --git a/mm/gup.c b/mm/gup.c > > index 1f72a717232b..d36a5db9feb1 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > > return 0; > > } > > > > +/* > > + * Writing to file-backed mappings which require folio dirty tracking using GUP > > + * is a fundamentally broken operation, as kernel write access to GUP mappings > > + * do not adhere to the semantics expected by a file system. > > + * > > + * Consider the following scenario:- > > + * > > + * 1. A folio is written to via GUP which write-faults the memory, notifying > > + * the file system and dirtying the folio. > > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > > + * the PTE being marked read-only. > > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > > + * direct mapping. > > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > > + * (though it does not have to). > > + * > > + * This results in both data being written to a folio without writenotify, and > > + * the folio being dirtied unexpectedly (if the caller decides to do so). > > + */ > > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > > + unsigned long gup_flags) > > +{ > > + /* If we aren't pinning then no problematic write can occur. */ > > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > > + return true; > > FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped. I understand that of course (well maybe not of course, but I mean I do, I have oodles of diagrams referencing this int he book :) This is intended to document the fact that the check isn't relevant if we don't pin at all, e.g. reading this you see:- - (implicit) if not writing or anon we're good - if not pin we're good - ok we are only currently checking one especially egregious case - finally, perform the dirty tracking check. So this is intentional. > > > + > > + /* We limit this check to the most egregious case - a long term pin. */ > > + if (!(gup_flags & FOLL_LONGTERM)) > > + return true; > > + > > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > > + return vma_needs_dirty_tracking(vma); > > +} > > + > > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > { > > vm_flags_t vm_flags = vma->vm_flags; > > int write = (gup_flags & FOLL_WRITE); > > int foreign = (gup_flags & FOLL_REMOTE); > > + bool vma_anon = vma_is_anonymous(vma); > > > > if (vm_flags & (VM_IO | VM_PFNMAP)) > > return -EFAULT; > > > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > > + if ((gup_flags & FOLL_ANON) && !vma_anon) > > return -EFAULT; > > > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > return -EFAULT; > > > > if (write) { > > + if (!vma_anon && > > + !writeable_file_mapping_allowed(vma, gup_flags)) > > + return -EFAULT; > > + > > if (!(vm_flags & VM_WRITE)) { > > if (!(gup_flags & FOLL_FORCE)) > > return -EFAULT; > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 536bbb8fa0ae..7b6344d1832a 100644 > > --- a/mm/mmap.c > > > I'm probably missing something, why don't we have to handle GUP-fast (having > said that, it's hard to handle ;) )? The sequence you describe above should > apply to GUP-fast as well, no? > > 1) Pin writable mapped page using GUP-fast > 2) Trigger writeback > 3) Write to page via pin > 4) Unpin and set dirty You're right, and this is an excellent point. I worry about other GUP use cases too, but we're a bit out of luck there because we don't get to check the VMA _at all_ (which opens yet another Pandora's box about how safe it is to do unlocked pinning :) But again, this comes down to the fact we're trying to make things _incrementally__ better rather than throwing our hands up and saying one day my ship will come in... > > > -- > Thanks, > > David / dhildenb > >
On 28.04.23 16:35, Jason Gunthorpe wrote: > On Fri, Apr 28, 2023 at 04:20:46PM +0200, David Hildenbrand wrote: >> Sorry for jumping in late, I'm on vacation :) >> >> On 28.04.23 01:42, Lorenzo Stoakes wrote: >>> Writing to file-backed mappings which require folio dirty tracking using >>> GUP is a fundamentally broken operation, as kernel write access to GUP >>> mappings do not adhere to the semantics expected by a file system. >>> >>> A GUP caller uses the direct mapping to access the folio, which does not >>> cause write notify to trigger, nor does it enforce that the caller marks >>> the folio dirty. >> >> How should we enforce it? It would be a BUG in the GUP user. > > I hope we don't have these kinds of mistakes.. hard to enforce by > code. > I briefly played with the idea of only allowing to write-pin fs pages that are dirty (or the pte is dirty). If we adjust writeback code to leave such (maybe pinned) pages dirty, there would be no need to reset the pages dirty I guess. Just an idea, getting the writebackcode fixed (and race-free with GUP-fast) is the harder problem. >> This change has the potential to break existing setups. Simple example: >> libvirt domains configured for file-backed VM memory that also has a vfio >> device configured. It can easily be configured by users (evolving VM >> configuration, copy-paste etc.). And it works from a VM perspective, because >> the guest memory is essentially stale once the VM is shutdown and the pages >> were unpinned. At least we're not concerned about stale data on >> disk. > > I think this is broken today and we should block it. We know from > experiments with RDMA that doing exactly this triggers kernel oop's. > I never saw similar reports in the wild (especially targeted at RHEL), so is this still a current issue that has not been mitigated? Or is it just so hard to actually trigger? > Run your qemu config once, all the pages in the file become dirty. > > Run your qmeu config again and now all the dirty pages are longterm > pinned. > > Something eventually does writeback and FS cleans the page. At least vmscan does not touch pages that have additional references -- pageout() quits early. So that other thing doesn't happen that often I guess (manual fsync doesn't usually happen on VM memory if I am not wrong ...). > > Now close your VM and the page is dirtied without make write. FS is > inconsistent with the MM, kernel will eventually oops. > > I'm skeptical that anyone can actually do this combination of things > successfully without getting kernel crashes or file data corruption - > ie there is no real user to break. I am pretty sure that there are such VM users, because on the libvirt level it's completely unclear which features trigger what behavior :/ I remember (but did not check) that VM memory might usually get deleted whenever we usually shutdown a VM, another reason why we wouldn't see this in the wild. libvirt has the "discard" option exactly for that purpose, to be used with file-based guest memory. [1] Users tend to copy-paste domain XMLs + edit because it's just so hard to get right. Changing the VM backing to be backed from a file can be done with a one-line change in the libvirt XML. > >> With your changes, such VMs would no longer start, breaking existing user >> setups with a kernel update. > > Yes, as a matter of security we should break it. > > Earlier I suggested making this contingent on kernel lockdown >= > integrity, if actual users come and complain we should go to that > option. > >> Sure, we could warn, or convert individual users using a flag (io_uring). >> But maybe we should invest more energy on a fix? > > It has been years now, I think we need to admit a fix is still years > away. Blocking the security problem may even motivate more people to > work on a fix. Maybe we should make this a topic this year at LSF/MM (again?). At least we learned a lot about GUP, what might work, what might not work, and got a depper understanding (+ motivation to fix? :) ) the issue at hand. > > Security is the primary case where we have historically closed uAPI > items. As this patch 1) Does not tackle GUP-fast 2) Does not take care of !FOLL_LONGTERM I am not convinced by the security argument in regard to this patch. If we want to sells this as a security thing, we have to block it *completely* and then CC stable. Everything else sounds like band-aids to me, is insufficient, and might cause more harm than actually help IMHO. Especially the gup-fast case is extremely easy to work-around in malicious user space. [1] https://listman.redhat.com/archives/libvir-list/2018-May/msg00885.html
[...] >> This change has the potential to break existing setups. Simple example: >> libvirt domains configured for file-backed VM memory that also has a vfio >> device configured. It can easily be configured by users (evolving VM >> configuration, copy-paste etc.). And it works from a VM perspective, because >> the guest memory is essentially stale once the VM is shutdown and the pages >> were unpinned. At least we're not concerned about stale data on disk. >> >> With your changes, such VMs would no longer start, breaking existing user >> setups with a kernel update. > > Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example > doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary > file system in the guest? Sorry, you define a VM to have its memory backed by VM memory and, at the same time, define a vfio-pci device for your VM, which will end up long-term pinning the VM memory. > > I may well be missing context on this so forgive me if I'm being a little > dumb here, but it'd be good to get a specific example. I was giving to little details ;) [...] >> >> I know, Jason und John will disagree, but I don't think we want to be very >> careful with changing the default. >> >> Sure, we could warn, or convert individual users using a flag (io_uring). >> But maybe we should invest more energy on a fix? > > This is proactively blocking a cleanup (eliminating vmas) that I believe > will be useful in moving things forward. I am not against an opt-in option > (I have been responding to community feedback in adapting my approach), > which is the way I implemented it all the way back then :) There are alternatives: just use a flag as Jason initially suggested and use that in io_uring code. Then, you can also bail out on the GUP-fast path as "cannot support it right now, never do GUP-fast". IMHO, this patch is not a prereq. > > But given we know this is both entirely broken and a potential security > issue, and FOLL_LONGTERM is about as egregious as you can get (user > explicitly saying they'll hold write access indefinitely) I feel it is an > important improvement and makes clear that this is not an acceptable usage. > > I see Jason has said more on this also :) > >> >> >> >> >>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> >>> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> >>> --- >>> include/linux/mm.h | 1 + >>> mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- >>> mm/mmap.c | 36 +++++++++++++++++++++++++++--------- >>> 3 files changed, 68 insertions(+), 10 deletions(-) >>> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>> index 37554b08bb28..f7da02fc89c6 100644 >>> --- a/include/linux/mm.h >>> +++ b/include/linux/mm.h >>> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, >>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ >>> MM_CP_UFFD_WP_RESOLVE) >>> >>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); >>> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); >>> static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) >>> { >>> diff --git a/mm/gup.c b/mm/gup.c >>> index 1f72a717232b..d36a5db9feb1 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, >>> return 0; >>> } >>> >>> +/* >>> + * Writing to file-backed mappings which require folio dirty tracking using GUP >>> + * is a fundamentally broken operation, as kernel write access to GUP mappings >>> + * do not adhere to the semantics expected by a file system. >>> + * >>> + * Consider the following scenario:- >>> + * >>> + * 1. A folio is written to via GUP which write-faults the memory, notifying >>> + * the file system and dirtying the folio. >>> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and >>> + * the PTE being marked read-only. >>> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the >>> + * direct mapping. >>> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty >>> + * (though it does not have to). >>> + * >>> + * This results in both data being written to a folio without writenotify, and >>> + * the folio being dirtied unexpectedly (if the caller decides to do so). >>> + */ >>> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, >>> + unsigned long gup_flags) >>> +{ >>> + /* If we aren't pinning then no problematic write can occur. */ >>> + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) >>> + return true; >> >> FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped. > > I understand that of course (well maybe not of course, but I mean I do, I > have oodles of diagrams referencing this int he book :) This is intended to > document the fact that the check isn't relevant if we don't pin at all, > e.g. reading this you see:- > > - (implicit) if not writing or anon we're good > - if not pin we're good > - ok we are only currently checking one especially egregious case > - finally, perform the dirty tracking check. > > So this is intentional. > >> >>> + >>> + /* We limit this check to the most egregious case - a long term pin. */ >>> + if (!(gup_flags & FOLL_LONGTERM)) >>> + return true; >>> + >>> + /* If the VMA requires dirty tracking then GUP will be problematic. */ >>> + return vma_needs_dirty_tracking(vma); >>> +} >>> + >>> static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>> { >>> vm_flags_t vm_flags = vma->vm_flags; >>> int write = (gup_flags & FOLL_WRITE); >>> int foreign = (gup_flags & FOLL_REMOTE); >>> + bool vma_anon = vma_is_anonymous(vma); >>> >>> if (vm_flags & (VM_IO | VM_PFNMAP)) >>> return -EFAULT; >>> >>> - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) >>> + if ((gup_flags & FOLL_ANON) && !vma_anon) >>> return -EFAULT; >>> >>> if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) >>> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>> return -EFAULT; >>> >>> if (write) { >>> + if (!vma_anon && >>> + !writeable_file_mapping_allowed(vma, gup_flags)) >>> + return -EFAULT; >>> + >>> if (!(vm_flags & VM_WRITE)) { >>> if (!(gup_flags & FOLL_FORCE)) >>> return -EFAULT; >>> diff --git a/mm/mmap.c b/mm/mmap.c >>> index 536bbb8fa0ae..7b6344d1832a 100644 >>> --- a/mm/mmap.c >> >> >> I'm probably missing something, why don't we have to handle GUP-fast (having >> said that, it's hard to handle ;) )? The sequence you describe above should >> apply to GUP-fast as well, no? >> >> 1) Pin writable mapped page using GUP-fast >> 2) Trigger writeback >> 3) Write to page via pin >> 4) Unpin and set dirty > > You're right, and this is an excellent point. I worry about other GUP use > cases too, but we're a bit out of luck there because we don't get to check > the VMA _at all_ (which opens yet another Pandora's box about how safe it > is to do unlocked pinning :) > > But again, this comes down to the fact we're trying to make things > _incrementally__ better rather than throwing our hands up and saying one > day my ship will come in... That's not how security fixes are supposed to work IMHO, sorry.
On 4/28/23 9:13?AM, David Hildenbrand wrote: >>> I know, Jason und John will disagree, but I don't think we want to be very >>> careful with changing the default. >>> >>> Sure, we could warn, or convert individual users using a flag (io_uring). >>> But maybe we should invest more energy on a fix? >> >> This is proactively blocking a cleanup (eliminating vmas) that I believe >> will be useful in moving things forward. I am not against an opt-in option >> (I have been responding to community feedback in adapting my approach), >> which is the way I implemented it all the way back then :) > > There are alternatives: just use a flag as Jason initially suggested > and use that in io_uring code. Then, you can also bail out on the > GUP-fast path as "cannot support it right now, never do GUP-fast". Since I've seen this brougth up a few times, what's the issue on the io_uring side? We already dropped the special vma checking, it's in -git right. Hence I don't believe there are any special cases left for io_uring at all, and we certainly don't allow real file backings either, never have done.
>> >> Security is the primary case where we have historically closed uAPI >> items. > > As this patch > > 1) Does not tackle GUP-fast > 2) Does not take care of !FOLL_LONGTERM > > I am not convinced by the security argument in regard to this patch. > > > If we want to sells this as a security thing, we have to block it > *completely* and then CC stable. Regarding GUP-fast, to fix the issue there as well, I guess we could do something similar as I did in gup_must_unshare(): If we're in GUP-fast (no VMA), and want to pin a !anon page writable, fallback to ordinary GUP. IOW, if we don't know, better be safe. Of course, this would prevent hugetlb/shmem from getting pinned writable during gup-fast. Unless we're able to whitelist them somehow in there. For FOLL_LONGTERM it might fairly uncontroversial. For everything else I'm not sure if there could be undesired side-effects.
On Fri, Apr 28, 2023 at 05:13:07PM +0200, David Hildenbrand wrote: > [...] > > > > This change has the potential to break existing setups. Simple example: > > > libvirt domains configured for file-backed VM memory that also has a vfio > > > device configured. It can easily be configured by users (evolving VM > > > configuration, copy-paste etc.). And it works from a VM perspective, because > > > the guest memory is essentially stale once the VM is shutdown and the pages > > > were unpinned. At least we're not concerned about stale data on disk. > > > > > > With your changes, such VMs would no longer start, breaking existing user > > > setups with a kernel update. > > > > Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example > > doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary > > file system in the guest? > > Sorry, you define a VM to have its memory backed by VM memory and, at the > same time, define a vfio-pci device for your VM, which will end up long-term > pinning the VM memory. Ah ack. Jason seemed concerned that this was already a broken case, I guess that's one for you two to hash out... > > > > > I may well be missing context on this so forgive me if I'm being a little > > dumb here, but it'd be good to get a specific example. > > I was giving to little details ;) > > [...] > > > > > > > I know, Jason und John will disagree, but I don't think we want to be very > > > careful with changing the default. > > > > > > Sure, we could warn, or convert individual users using a flag (io_uring). > > > But maybe we should invest more energy on a fix? > > > > This is proactively blocking a cleanup (eliminating vmas) that I believe > > will be useful in moving things forward. I am not against an opt-in option > > (I have been responding to community feedback in adapting my approach), > > which is the way I implemented it all the way back then :) > > There are alternatives: just use a flag as Jason initially suggested and use > that in io_uring code. Then, you can also bail out on the GUP-fast path as > "cannot support it right now, never do GUP-fast". I already implemented the alternatives (look back through revisions to see :) and there were objections for various reasons. Personally my preference is to provide a FOLL_SAFE_FILE_WRITE flag or such and replace the FOLL_LONGTERM check with this (that'll automatically get rejected for GUP-fast so the GUP-fast conundrum wouldn't be a thing either). GUP-fast is a problem as you say,, but it feels like a fundamental issue with GUP-fast as a whole since you don't get to look at a VMA since you can't take the mmap_lock. You could just look at the folio->mapping once you've walked the page tables and say 'I'm out' if FOLL_WRITE and it's non-anon if that's what you're suggesting? I'm not against that change but this being incremental, I think that would be a further increment. > > IMHO, this patch is not a prereq. > > > > > But given we know this is both entirely broken and a potential security > > issue, and FOLL_LONGTERM is about as egregious as you can get (user > > explicitly saying they'll hold write access indefinitely) I feel it is an > > important improvement and makes clear that this is not an acceptable usage. > > > > I see Jason has said more on this also :) > > > > > > > > > > > > > > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > > > --- > > > > include/linux/mm.h | 1 + > > > > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > > > > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > > > > 3 files changed, 68 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 37554b08bb28..f7da02fc89c6 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > > > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > > > MM_CP_UFFD_WP_RESOLVE) > > > > > > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > > > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > > > > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > > > > { > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > index 1f72a717232b..d36a5db9feb1 100644 > > > > --- a/mm/gup.c > > > > +++ b/mm/gup.c > > > > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > > > > return 0; > > > > } > > > > > > > > +/* > > > > + * Writing to file-backed mappings which require folio dirty tracking using GUP > > > > + * is a fundamentally broken operation, as kernel write access to GUP mappings > > > > + * do not adhere to the semantics expected by a file system. > > > > + * > > > > + * Consider the following scenario:- > > > > + * > > > > + * 1. A folio is written to via GUP which write-faults the memory, notifying > > > > + * the file system and dirtying the folio. > > > > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > > > > + * the PTE being marked read-only. > > > > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > > > > + * direct mapping. > > > > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > > > > + * (though it does not have to). > > > > + * > > > > + * This results in both data being written to a folio without writenotify, and > > > > + * the folio being dirtied unexpectedly (if the caller decides to do so). > > > > + */ > > > > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > > > > + unsigned long gup_flags) > > > > +{ > > > > + /* If we aren't pinning then no problematic write can occur. */ > > > > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > > > > + return true; > > > > > > FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped. > > > > I understand that of course (well maybe not of course, but I mean I do, I > > have oodles of diagrams referencing this int he book :) This is intended to > > document the fact that the check isn't relevant if we don't pin at all, > > e.g. reading this you see:- > > > > - (implicit) if not writing or anon we're good > > - if not pin we're good > > - ok we are only currently checking one especially egregious case > > - finally, perform the dirty tracking check. > > > > So this is intentional. > > > > > > > > > + > > > > + /* We limit this check to the most egregious case - a long term pin. */ > > > > + if (!(gup_flags & FOLL_LONGTERM)) > > > > + return true; > > > > + > > > > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > > > > + return vma_needs_dirty_tracking(vma); > > > > +} > > > > + > > > > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > > > { > > > > vm_flags_t vm_flags = vma->vm_flags; > > > > int write = (gup_flags & FOLL_WRITE); > > > > int foreign = (gup_flags & FOLL_REMOTE); > > > > + bool vma_anon = vma_is_anonymous(vma); > > > > > > > > if (vm_flags & (VM_IO | VM_PFNMAP)) > > > > return -EFAULT; > > > > > > > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > > > > + if ((gup_flags & FOLL_ANON) && !vma_anon) > > > > return -EFAULT; > > > > > > > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > > > > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > > > return -EFAULT; > > > > > > > > if (write) { > > > > + if (!vma_anon && > > > > + !writeable_file_mapping_allowed(vma, gup_flags)) > > > > + return -EFAULT; > > > > + > > > > if (!(vm_flags & VM_WRITE)) { > > > > if (!(gup_flags & FOLL_FORCE)) > > > > return -EFAULT; > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 536bbb8fa0ae..7b6344d1832a 100644 > > > > --- a/mm/mmap.c > > > > > > > > > I'm probably missing something, why don't we have to handle GUP-fast (having > > > said that, it's hard to handle ;) )? The sequence you describe above should > > > apply to GUP-fast as well, no? > > > > > > 1) Pin writable mapped page using GUP-fast > > > 2) Trigger writeback > > > 3) Write to page via pin > > > 4) Unpin and set dirty > > > > You're right, and this is an excellent point. I worry about other GUP use > > cases too, but we're a bit out of luck there because we don't get to check > > the VMA _at all_ (which opens yet another Pandora's box about how safe it > > is to do unlocked pinning :) > > > > But again, this comes down to the fact we're trying to make things > > _incrementally__ better rather than throwing our hands up and saying one > > day my ship will come in... > > That's not how security fixes are supposed to work IMHO, sorry. Sure, but I don't think the 'let things continue to be terribly broken for X more years' is also a great approach. Personally I come at this from the 'I just want my vmas patch series' unblocked perspective :) and feel there's a functional aspect here too. > > -- > Thanks, > > David / dhildenb > >
On Fri, Apr 28, 2023 at 09:15:08AM -0600, Jens Axboe wrote: > On 4/28/23 9:13?AM, David Hildenbrand wrote: > >>> I know, Jason und John will disagree, but I don't think we want to be very > >>> careful with changing the default. > >>> > >>> Sure, we could warn, or convert individual users using a flag (io_uring). > >>> But maybe we should invest more energy on a fix? > >> > >> This is proactively blocking a cleanup (eliminating vmas) that I believe > >> will be useful in moving things forward. I am not against an opt-in option > >> (I have been responding to community feedback in adapting my approach), > >> which is the way I implemented it all the way back then :) > > > > There are alternatives: just use a flag as Jason initially suggested > > and use that in io_uring code. Then, you can also bail out on the > > GUP-fast path as "cannot support it right now, never do GUP-fast". > > Since I've seen this brougth up a few times, what's the issue on the > io_uring side? We already dropped the special vma checking, it's in -git > right. Hence I don't believe there are any special cases left for > io_uring at all, and we certainly don't allow real file backings either, > never have done. The purpose from my perspective is being able to have GUP perform the 'is the file-backed mapping sane to GUP' check rather than you having to open code it. There is nothing special beyond that. Personally I think the best solution is an opt-in FOLL_SAFE_WRITE_FILE flag or such that you call and drop the vma check you have. That way we don't risk breaking anything, the vmas patch series can unblock, and you don't have to have raw mm code in your bit :) > > -- > Jens Axboe >
On Fri, Apr 28, 2023 at 05:08:27PM +0200, David Hildenbrand wrote: > > I think this is broken today and we should block it. We know from > > experiments with RDMA that doing exactly this triggers kernel oop's. > > I never saw similar reports in the wild (especially targeted at RHEL), so is > this still a current issue that has not been mitigated? Or is it just so > hard to actually trigger? People send RDMA related bug reports to us, and we tell them not to do this stuff :) > > I'm skeptical that anyone can actually do this combination of things > > successfully without getting kernel crashes or file data corruption - > > ie there is no real user to break. > > I am pretty sure that there are such VM users, because on the libvirt level > it's completely unclear which features trigger what behavior :/ IDK, why on earth would anyone want to do this? Using VFIO forces all the memory to become resident so what was the point of making it file backed in the first place? I'm skeptical there are real users even if it now requires special steps to be crashy/corrupty. > > > Sure, we could warn, or convert individual users using a flag (io_uring). > > > But maybe we should invest more energy on a fix? > > > > It has been years now, I think we need to admit a fix is still years > > away. Blocking the security problem may even motivate more people to > > work on a fix. > > Maybe we should make this a topic this year at LSF/MM (again?). At least we > learned a lot about GUP, what might work, what might not work, and got a > depper understanding (+ motivation to fix? :) ) the issue at hand. We keep having the topic.. This is the old argument that the FS people say the MM isn't following its inode and dirty lifetime rules and the MM people say the FS isn't following its refcounting rules <shrug> > > Security is the primary case where we have historically closed uAPI > > items. > > As this patch > > 1) Does not tackle GUP-fast > 2) Does not take care of !FOLL_LONGTERM > > I am not convinced by the security argument in regard to this patch. It is incremental and a temperature check to see what kind of real users exist. We have no idea right now, just speculation. Like I said, if there is feedback we can weaken it even further. > Everything else sounds like band-aids to me, is insufficient, and might > cause more harm than actually help IMHO. Especially the gup-fast case is > extremely easy to work-around in malicious user space. It is true this patch should probably block gup_fast when using FOLL_LONGTERM as well, just like we used to do for the DAX check. Jason
On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > Security is the primary case where we have historically closed uAPI > > > items. > > > > As this patch > > > > 1) Does not tackle GUP-fast > > 2) Does not take care of !FOLL_LONGTERM > > > > I am not convinced by the security argument in regard to this patch. > > > > > > If we want to sells this as a security thing, we have to block it > > *completely* and then CC stable. > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > something similar as I did in gup_must_unshare(): > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > fallback to ordinary GUP. IOW, if we don't know, better be safe. How do we determine it's non-anon in the first place? The check is on the VMA. We could do it by following page tables down to folio and checking folio->mapping for PAGE_MAPPING_ANON I suppose? > > Of course, this would prevent hugetlb/shmem from getting pinned writable > during gup-fast. Unless we're able to whitelist them somehow in there. We could degrade those to non-fast assuming not FOLL_FAST_ONLY. But it'd be a pity. > > > For FOLL_LONGTERM it might fairly uncontroversial. For everything else I'm > not sure if there could be undesired side-effects. Yeah this is why I pared the patch down to this alone :) there are definitely concerns and issues with other cases, notably ptrace + friends but obviously not only. FOLL_LONGTERM is just the most egregious case. > > -- > Thanks, > > David / dhildenb >
On 28.04.23 17:24, Lorenzo Stoakes wrote: > On Fri, Apr 28, 2023 at 05:13:07PM +0200, David Hildenbrand wrote: >> [...] >> >>>> This change has the potential to break existing setups. Simple example: >>>> libvirt domains configured for file-backed VM memory that also has a vfio >>>> device configured. It can easily be configured by users (evolving VM >>>> configuration, copy-paste etc.). And it works from a VM perspective, because >>>> the guest memory is essentially stale once the VM is shutdown and the pages >>>> were unpinned. At least we're not concerned about stale data on disk. >>>> >>>> With your changes, such VMs would no longer start, breaking existing user >>>> setups with a kernel update. >>> >>> Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example >>> doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary >>> file system in the guest? >> >> Sorry, you define a VM to have its memory backed by VM memory and, at the >> same time, define a vfio-pci device for your VM, which will end up long-term >> pinning the VM memory. > "memory backed by file memory", I guess you figured that out :) > Ah ack. Jason seemed concerned that this was already a broken case, I guess > that's one for you two to hash out... > >> >>> >>> I may well be missing context on this so forgive me if I'm being a little >>> dumb here, but it'd be good to get a specific example. >> >> I was giving to little details ;) >> >> [...] >> >>>> >>>> I know, Jason und John will disagree, but I don't think we want to be very >>>> careful with changing the default. >>>> >>>> Sure, we could warn, or convert individual users using a flag (io_uring). >>>> But maybe we should invest more energy on a fix? >>> >>> This is proactively blocking a cleanup (eliminating vmas) that I believe >>> will be useful in moving things forward. I am not against an opt-in option >>> (I have been responding to community feedback in adapting my approach), >>> which is the way I implemented it all the way back then :) >> >> There are alternatives: just use a flag as Jason initially suggested and use >> that in io_uring code. Then, you can also bail out on the GUP-fast path as >> "cannot support it right now, never do GUP-fast". > > I already implemented the alternatives (look back through revisions to see :) > and there were objections for various reasons. > > Personally my preference is to provide a FOLL_SAFE_FILE_WRITE flag or such and > replace the FOLL_LONGTERM check with this (that'll automatically get rejected > for GUP-fast so the GUP-fast conundrum wouldn't be a thing either). > > GUP-fast is a problem as you say,, but it feels like a fundamental issue with > GUP-fast as a whole since you don't get to look at a VMA since you can't take > the mmap_lock. You could just look at the folio->mapping once you've walked the > page tables and say 'I'm out' if FOLL_WRITE and it's non-anon if that's what > you're suggesting? See my other reply, kind-of yes. Like we do with gup_must_unshare(). I'm only concerned about how to keep GUP-fast working on hugetlb and shmem. > > I'm not against that change but this being incremental, I think that would be a > further increment. If we want to fix a security issue, as Jason said, incremental is IMHO the wrong approach. It's often too tempting to ignore the hard part and fix the easy part, making the hard part an increment for the future that nobody will really implement ... because it's hard. [...] >>> >>> But given we know this is both entirely broken and a potential security >>> issue, and FOLL_LONGTERM is about as egregious as you can get (user >>> explicitly saying they'll hold write access indefinitely) I feel it is an >>> important improvement and makes clear that this is not an acceptable usage. >>> >>> I see Jason has said more on this also :) >>> >>>> >>>> >>>> >>>> >>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> >>>>> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> >>>>> --- >>>>> include/linux/mm.h | 1 + >>>>> mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- >>>>> mm/mmap.c | 36 +++++++++++++++++++++++++++--------- >>>>> 3 files changed, 68 insertions(+), 10 deletions(-) >>>>> >>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>>>> index 37554b08bb28..f7da02fc89c6 100644 >>>>> --- a/include/linux/mm.h >>>>> +++ b/include/linux/mm.h >>>>> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, >>>>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ >>>>> MM_CP_UFFD_WP_RESOLVE) >>>>> >>>>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); >>>>> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); >>>>> static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) >>>>> { >>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>> index 1f72a717232b..d36a5db9feb1 100644 >>>>> --- a/mm/gup.c >>>>> +++ b/mm/gup.c >>>>> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, >>>>> return 0; >>>>> } >>>>> >>>>> +/* >>>>> + * Writing to file-backed mappings which require folio dirty tracking using GUP >>>>> + * is a fundamentally broken operation, as kernel write access to GUP mappings >>>>> + * do not adhere to the semantics expected by a file system. >>>>> + * >>>>> + * Consider the following scenario:- >>>>> + * >>>>> + * 1. A folio is written to via GUP which write-faults the memory, notifying >>>>> + * the file system and dirtying the folio. >>>>> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and >>>>> + * the PTE being marked read-only. >>>>> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the >>>>> + * direct mapping. >>>>> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty >>>>> + * (though it does not have to). >>>>> + * >>>>> + * This results in both data being written to a folio without writenotify, and >>>>> + * the folio being dirtied unexpectedly (if the caller decides to do so). >>>>> + */ >>>>> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, >>>>> + unsigned long gup_flags) >>>>> +{ >>>>> + /* If we aren't pinning then no problematic write can occur. */ >>>>> + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) >>>>> + return true; >>>> >>>> FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped. >>> >>> I understand that of course (well maybe not of course, but I mean I do, I >>> have oodles of diagrams referencing this int he book :) This is intended to >>> document the fact that the check isn't relevant if we don't pin at all, >>> e.g. reading this you see:- >>> >>> - (implicit) if not writing or anon we're good >>> - if not pin we're good >>> - ok we are only currently checking one especially egregious case >>> - finally, perform the dirty tracking check. >>> >>> So this is intentional. >>> >>>> >>>>> + >>>>> + /* We limit this check to the most egregious case - a long term pin. */ >>>>> + if (!(gup_flags & FOLL_LONGTERM)) >>>>> + return true; >>>>> + >>>>> + /* If the VMA requires dirty tracking then GUP will be problematic. */ >>>>> + return vma_needs_dirty_tracking(vma); >>>>> +} >>>>> + >>>>> static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>>>> { >>>>> vm_flags_t vm_flags = vma->vm_flags; >>>>> int write = (gup_flags & FOLL_WRITE); >>>>> int foreign = (gup_flags & FOLL_REMOTE); >>>>> + bool vma_anon = vma_is_anonymous(vma); >>>>> >>>>> if (vm_flags & (VM_IO | VM_PFNMAP)) >>>>> return -EFAULT; >>>>> >>>>> - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) >>>>> + if ((gup_flags & FOLL_ANON) && !vma_anon) >>>>> return -EFAULT; >>>>> >>>>> if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) >>>>> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>>>> return -EFAULT; >>>>> >>>>> if (write) { >>>>> + if (!vma_anon && >>>>> + !writeable_file_mapping_allowed(vma, gup_flags)) >>>>> + return -EFAULT; >>>>> + >>>>> if (!(vm_flags & VM_WRITE)) { >>>>> if (!(gup_flags & FOLL_FORCE)) >>>>> return -EFAULT; >>>>> diff --git a/mm/mmap.c b/mm/mmap.c >>>>> index 536bbb8fa0ae..7b6344d1832a 100644 >>>>> --- a/mm/mmap.c >>>> >>>> >>>> I'm probably missing something, why don't we have to handle GUP-fast (having >>>> said that, it's hard to handle ;) )? The sequence you describe above should >>>> apply to GUP-fast as well, no? >>>> >>>> 1) Pin writable mapped page using GUP-fast >>>> 2) Trigger writeback >>>> 3) Write to page via pin >>>> 4) Unpin and set dirty >>> >>> You're right, and this is an excellent point. I worry about other GUP use >>> cases too, but we're a bit out of luck there because we don't get to check >>> the VMA _at all_ (which opens yet another Pandora's box about how safe it >>> is to do unlocked pinning :) >>> >>> But again, this comes down to the fact we're trying to make things >>> _incrementally__ better rather than throwing our hands up and saying one >>> day my ship will come in... >> >> That's not how security fixes are supposed to work IMHO, sorry. > > Sure, but I don't think the 'let things continue to be terribly broken for X > more years' is also a great approach. Not at all, people (including me) were simply not aware that it is that much of an (security) issue because I never saw any real bug reports (or CVE numbers) and only herd John talk about possible fixes a year ago :) So I'm saying we either try to block it completely or finally look into fixing it for good. I'm not a friend of anything in between. You don't gain a lot of security by locking the front door but knowingly leaving the back door unlocked. > > Personally I come at this from the 'I just want my vmas patch series' unblocked > perspective :) and feel there's a functional aspect here too. I know, it always gets messy when touching such sensible topics :P
On 28.04.23 17:33, Lorenzo Stoakes wrote: > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>> >>>> Security is the primary case where we have historically closed uAPI >>>> items. >>> >>> As this patch >>> >>> 1) Does not tackle GUP-fast >>> 2) Does not take care of !FOLL_LONGTERM >>> >>> I am not convinced by the security argument in regard to this patch. >>> >>> >>> If we want to sells this as a security thing, we have to block it >>> *completely* and then CC stable. >> >> Regarding GUP-fast, to fix the issue there as well, I guess we could do >> something similar as I did in gup_must_unshare(): >> >> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >> fallback to ordinary GUP. IOW, if we don't know, better be safe. > > How do we determine it's non-anon in the first place? The check is on the > VMA. We could do it by following page tables down to folio and checking > folio->mapping for PAGE_MAPPING_ANON I suppose? PageAnon(page) can be called from GUP-fast after grabbing a reference. See gup_must_unshare(). > >> >> Of course, this would prevent hugetlb/shmem from getting pinned writable >> during gup-fast. Unless we're able to whitelist them somehow in there. > > We could degrade those to non-fast assuming not FOLL_FAST_ONLY. But it'd be > a pity.
On 28.04.23 17:27, Jason Gunthorpe wrote: > On Fri, Apr 28, 2023 at 05:08:27PM +0200, David Hildenbrand wrote: > >>> I think this is broken today and we should block it. We know from >>> experiments with RDMA that doing exactly this triggers kernel oop's. >> >> I never saw similar reports in the wild (especially targeted at RHEL), so is >> this still a current issue that has not been mitigated? Or is it just so >> hard to actually trigger? > > People send RDMA related bug reports to us, and we tell them not to do > this stuff :) > >>> I'm skeptical that anyone can actually do this combination of things >>> successfully without getting kernel crashes or file data corruption - >>> ie there is no real user to break. >> >> I am pretty sure that there are such VM users, because on the libvirt level >> it's completely unclear which features trigger what behavior :/ > > IDK, why on earth would anyone want to do this? Using VFIO forces all > the memory to become resident so what was the point of making it file > backed in the first place? As I said, copy-and paste, incremental changes to domain XMLs. I've seen some crazy domain XMLs in bug reports. > > I'm skeptical there are real users even if it now requires special > steps to be crashy/corrupty. In any case, I think we should document the possible implications of this patch. I gave one use case that could be broken. > >>>> Sure, we could warn, or convert individual users using a flag (io_uring). >>>> But maybe we should invest more energy on a fix? >>> >>> It has been years now, I think we need to admit a fix is still years >>> away. Blocking the security problem may even motivate more people to >>> work on a fix. >> >> Maybe we should make this a topic this year at LSF/MM (again?). At least we >> learned a lot about GUP, what might work, what might not work, and got a >> depper understanding (+ motivation to fix? :) ) the issue at hand. > > We keep having the topic.. This is the old argument that the FS people > say the MM isn't following its inode and dirty lifetime rules and the > MM people say the FS isn't following its refcounting rules <shrug> :/ so we have to discuss it ... again I guess. > >>> Security is the primary case where we have historically closed uAPI >>> items. >> >> As this patch >> >> 1) Does not tackle GUP-fast >> 2) Does not take care of !FOLL_LONGTERM >> >> I am not convinced by the security argument in regard to this patch. > > It is incremental and a temperature check to see what kind of real > users exist. We have no idea right now, just speculation. Right, but again, if we start talking about security it's a different thing IMHO. >> Everything else sounds like band-aids to me, is insufficient, and might >> cause more harm than actually help IMHO. Especially the gup-fast case is >> extremely easy to work-around in malicious user space. > > It is true this patch should probably block gup_fast when using > FOLL_LONGTERM as well, just like we used to do for the DAX check. Then we'd at least fix the security issue for all FOLL_LONGTERM completely.
On 28.04.23 17:34, David Hildenbrand wrote: > On 28.04.23 17:33, Lorenzo Stoakes wrote: >> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>> >>>>> Security is the primary case where we have historically closed uAPI >>>>> items. >>>> >>>> As this patch >>>> >>>> 1) Does not tackle GUP-fast >>>> 2) Does not take care of !FOLL_LONGTERM >>>> >>>> I am not convinced by the security argument in regard to this patch. >>>> >>>> >>>> If we want to sells this as a security thing, we have to block it >>>> *completely* and then CC stable. >>> >>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>> something similar as I did in gup_must_unshare(): >>> >>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >> >> How do we determine it's non-anon in the first place? The check is on the >> VMA. We could do it by following page tables down to folio and checking >> folio->mapping for PAGE_MAPPING_ANON I suppose? > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > See gup_must_unshare(). IIRC, PageHuge() can also be called from GUP-fast and could special-case hugetlb eventually, as it's table while we hold a (temporary) reference. Shmem might be not so easy ...
On Fri, Apr 28, 2023 at 05:33:17PM +0200, David Hildenbrand wrote: > On 28.04.23 17:24, Lorenzo Stoakes wrote: > > On Fri, Apr 28, 2023 at 05:13:07PM +0200, David Hildenbrand wrote: > > > [...] > > > > > > > > This change has the potential to break existing setups. Simple example: > > > > > libvirt domains configured for file-backed VM memory that also has a vfio > > > > > device configured. It can easily be configured by users (evolving VM > > > > > configuration, copy-paste etc.). And it works from a VM perspective, because > > > > > the guest memory is essentially stale once the VM is shutdown and the pages > > > > > were unpinned. At least we're not concerned about stale data on disk. > > > > > > > > > > With your changes, such VMs would no longer start, breaking existing user > > > > > setups with a kernel update. > > > > > > > > Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example > > > > doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary > > > > file system in the guest? > > > > > > Sorry, you define a VM to have its memory backed by VM memory and, at the > > > same time, define a vfio-pci device for your VM, which will end up long-term > > > pinning the VM memory. > > > > "memory backed by file memory", I guess you figured that out :) Ack yeah I vaguely assumed this was what you meant :) as in a virtualised 'file' system that ultimately actually is in reality memory backed but not from guest's persective. > > > Ah ack. Jason seemed concerned that this was already a broken case, I guess > > that's one for you two to hash out... > > > > > > > > > > > > > I may well be missing context on this so forgive me if I'm being a little > > > > dumb here, but it'd be good to get a specific example. > > > > > > I was giving to little details ;) > > > > > > [...] > > > > > > > > > > > > > I know, Jason und John will disagree, but I don't think we want to be very > > > > > careful with changing the default. > > > > > > > > > > Sure, we could warn, or convert individual users using a flag (io_uring). > > > > > But maybe we should invest more energy on a fix? > > > > > > > > This is proactively blocking a cleanup (eliminating vmas) that I believe > > > > will be useful in moving things forward. I am not against an opt-in option > > > > (I have been responding to community feedback in adapting my approach), > > > > which is the way I implemented it all the way back then :) > > > > > > There are alternatives: just use a flag as Jason initially suggested and use > > > that in io_uring code. Then, you can also bail out on the GUP-fast path as > > > "cannot support it right now, never do GUP-fast". > > > > I already implemented the alternatives (look back through revisions to see :) > > and there were objections for various reasons. > > > > Personally my preference is to provide a FOLL_SAFE_FILE_WRITE flag or such and > > replace the FOLL_LONGTERM check with this (that'll automatically get rejected > > for GUP-fast so the GUP-fast conundrum wouldn't be a thing either). > > > > GUP-fast is a problem as you say,, but it feels like a fundamental issue with > > GUP-fast as a whole since you don't get to look at a VMA since you can't take > > the mmap_lock. You could just look at the folio->mapping once you've walked the > > page tables and say 'I'm out' if FOLL_WRITE and it's non-anon if that's what > > you're suggesting? > > See my other reply, kind-of yes. Like we do with gup_must_unshare(). I'm > only concerned about how to keep GUP-fast working on hugetlb and shmem. > > > > > I'm not against that change but this being incremental, I think that would be a > > further increment. > > If we want to fix a security issue, as Jason said, incremental is IMHO the > wrong approach. > > It's often too tempting to ignore the hard part and fix the easy part, > making the hard part an increment for the future that nobody will really > implement ... because it's hard. > [...] > > > > > > > > > But given we know this is both entirely broken and a potential security > > > > issue, and FOLL_LONGTERM is about as egregious as you can get (user > > > > explicitly saying they'll hold write access indefinitely) I feel it is an > > > > important improvement and makes clear that this is not an acceptable usage. > > > > > > > > I see Jason has said more on this also :) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > > > > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > > > > > --- > > > > > > include/linux/mm.h | 1 + > > > > > > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > > > > > > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > > > > > > 3 files changed, 68 insertions(+), 10 deletions(-) > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > index 37554b08bb28..f7da02fc89c6 100644 > > > > > > --- a/include/linux/mm.h > > > > > > +++ b/include/linux/mm.h > > > > > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > > > > > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > > > > > MM_CP_UFFD_WP_RESOLVE) > > > > > > > > > > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > > > > > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > > > > > > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > > > > > > { > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > > > index 1f72a717232b..d36a5db9feb1 100644 > > > > > > --- a/mm/gup.c > > > > > > +++ b/mm/gup.c > > > > > > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > > > > > > return 0; > > > > > > } > > > > > > > > > > > > +/* > > > > > > + * Writing to file-backed mappings which require folio dirty tracking using GUP > > > > > > + * is a fundamentally broken operation, as kernel write access to GUP mappings > > > > > > + * do not adhere to the semantics expected by a file system. > > > > > > + * > > > > > > + * Consider the following scenario:- > > > > > > + * > > > > > > + * 1. A folio is written to via GUP which write-faults the memory, notifying > > > > > > + * the file system and dirtying the folio. > > > > > > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > > > > > > + * the PTE being marked read-only. > > > > > > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > > > > > > + * direct mapping. > > > > > > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > > > > > > + * (though it does not have to). > > > > > > + * > > > > > > + * This results in both data being written to a folio without writenotify, and > > > > > > + * the folio being dirtied unexpectedly (if the caller decides to do so). > > > > > > + */ > > > > > > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > > > > > > + unsigned long gup_flags) > > > > > > +{ > > > > > > + /* If we aren't pinning then no problematic write can occur. */ > > > > > > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > > > > > > + return true; > > > > > > > > > > FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped. > > > > > > > > I understand that of course (well maybe not of course, but I mean I do, I > > > > have oodles of diagrams referencing this int he book :) This is intended to > > > > document the fact that the check isn't relevant if we don't pin at all, > > > > e.g. reading this you see:- > > > > > > > > - (implicit) if not writing or anon we're good > > > > - if not pin we're good > > > > - ok we are only currently checking one especially egregious case > > > > - finally, perform the dirty tracking check. > > > > > > > > So this is intentional. > > > > > > > > > > > > > > > + > > > > > > + /* We limit this check to the most egregious case - a long term pin. */ > > > > > > + if (!(gup_flags & FOLL_LONGTERM)) > > > > > > + return true; > > > > > > + > > > > > > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > > > > > > + return vma_needs_dirty_tracking(vma); > > > > > > +} > > > > > > + > > > > > > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > > > > > { > > > > > > vm_flags_t vm_flags = vma->vm_flags; > > > > > > int write = (gup_flags & FOLL_WRITE); > > > > > > int foreign = (gup_flags & FOLL_REMOTE); > > > > > > + bool vma_anon = vma_is_anonymous(vma); > > > > > > > > > > > > if (vm_flags & (VM_IO | VM_PFNMAP)) > > > > > > return -EFAULT; > > > > > > > > > > > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > > > > > > + if ((gup_flags & FOLL_ANON) && !vma_anon) > > > > > > return -EFAULT; > > > > > > > > > > > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > > > > > > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > > > > > return -EFAULT; > > > > > > > > > > > > if (write) { > > > > > > + if (!vma_anon && > > > > > > + !writeable_file_mapping_allowed(vma, gup_flags)) > > > > > > + return -EFAULT; > > > > > > + > > > > > > if (!(vm_flags & VM_WRITE)) { > > > > > > if (!(gup_flags & FOLL_FORCE)) > > > > > > return -EFAULT; > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > > > index 536bbb8fa0ae..7b6344d1832a 100644 > > > > > > --- a/mm/mmap.c > > > > > > > > > > > > > > > I'm probably missing something, why don't we have to handle GUP-fast (having > > > > > said that, it's hard to handle ;) )? The sequence you describe above should > > > > > apply to GUP-fast as well, no? > > > > > > > > > > 1) Pin writable mapped page using GUP-fast > > > > > 2) Trigger writeback > > > > > 3) Write to page via pin > > > > > 4) Unpin and set dirty > > > > > > > > You're right, and this is an excellent point. I worry about other GUP use > > > > cases too, but we're a bit out of luck there because we don't get to check > > > > the VMA _at all_ (which opens yet another Pandora's box about how safe it > > > > is to do unlocked pinning :) > > > > > > > > But again, this comes down to the fact we're trying to make things > > > > _incrementally__ better rather than throwing our hands up and saying one > > > > day my ship will come in... > > > > > > That's not how security fixes are supposed to work IMHO, sorry. > > > > Sure, but I don't think the 'let things continue to be terribly broken for X > > more years' is also a great approach. > > Not at all, people (including me) were simply not aware that it is that much > of an (security) issue because I never saw any real bug reports (or CVE > numbers) and only herd John talk about possible fixes a year ago :) > > So I'm saying we either try to block it completely or finally look into > fixing it for good. I'm not a friend of anything in between. > > You don't gain a lot of security by locking the front door but knowingly > leaving the back door unlocked. > > > > > Personally I come at this from the 'I just want my vmas patch series' unblocked > > perspective :) and feel there's a functional aspect here too. > > I know, it always gets messy when touching such sensible topics :P I feel that several people owe me drinks at LSF/MM :P To cut a long story short to your other points, I'm _really_ leaning towards an opt-in variant of this change that we just hand to io_uring to make everything simple with minimum risk (if Jens was also open to this idea, it'd simply be deleting the open coded vma checks there and adding FOLL_SAFE_FILE_WRITE). That way we can save the delightful back and forth for another time while adding a useful feature and documenting the issue. Altneratively I could try to adapt this to also do the GUP-fast check, hoping that no FOLL_FAST_ONLY users would get nixed (I'd have to check who uses that). The others should just get degraded to a standard GUP right? I feel these various series have really helped beat out some details about GUP, so as to your point on another thread (trying to reduce noise here :P), I think discussion at LSF/MM is also a sensible idea, also you know, if beers were bought too it could all work out nicely :] > > -- > Thanks, > > David / dhildenb >
On Fri, Apr 28, 2023 at 05:34:35PM +0200, David Hildenbrand wrote: > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > items. > > > > > > > > As this patch > > > > > > > > 1) Does not tackle GUP-fast > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > *completely* and then CC stable. > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > something similar as I did in gup_must_unshare(): > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > How do we determine it's non-anon in the first place? The check is on the > > VMA. We could do it by following page tables down to folio and checking > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > PageAnon(page) can be called from GUP-fast after grabbing a reference. See > gup_must_unshare(). Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this? People will silently got degrade even on legal pins on shmem/hugetlb, I think, which seems to be still a very major use case.
[...] >>> >>> Personally I come at this from the 'I just want my vmas patch series' unblocked >>> perspective :) and feel there's a functional aspect here too. >> >> I know, it always gets messy when touching such sensible topics :P > > I feel that several people owe me drinks at LSF/MM :P > > To cut a long story short to your other points, I'm _really_ leaning > towards an opt-in variant of this change that we just hand to io_uring to > make everything simple with minimum risk (if Jens was also open to this > idea, it'd simply be deleting the open coded vma checks there and adding > FOLL_SAFE_FILE_WRITE). > > That way we can save the delightful back and forth for another time while > adding a useful feature and documenting the issue. Just for the records: I'm not opposed to disabling it system-wide, especially once this is an actual security issue and can bring down the machine easily (thanks to Jason for raising the security aspect). I just wanted to raise awareness that there might be users affected ... Sure, we could glue this to some system knob like Jason said, if we want to play safe. > > Altneratively I could try to adapt this to also do the GUP-fast check, > hoping that no FOLL_FAST_ONLY users would get nixed (I'd have to check who > uses that). The others should just get degraded to a standard GUP right? Yes. When you need the VMA to make a decision, fallback to standard GUP. The only problematic part is something like get_user_pages_fast_only(), that would observe a change. But KVM never passes FOLL_LONGTERM, so at least in that context the change should be fine I guess. The performance concern is the most problematic thing (how to identify shmem pages). > > I feel these various series have really helped beat out some details about > GUP, so as to your point on another thread (trying to reduce noise here > :P), I think discussion at LSF/MM is also a sensible idea, also you know, > if beers were bought too it could all work out nicely :] The issue is, that GUP is so complicated, that each and every MM developer familiar with GUP has something to add :P What stood out to me is that we disallow something for ordinary GUP but disallow it for GUP-fast, which looks very odd. So sorry again for jumping in late ...
On 28.04.23 17:56, Peter Xu wrote: > On Fri, Apr 28, 2023 at 05:34:35PM +0200, David Hildenbrand wrote: >> On 28.04.23 17:33, Lorenzo Stoakes wrote: >>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>>> >>>>>> Security is the primary case where we have historically closed uAPI >>>>>> items. >>>>> >>>>> As this patch >>>>> >>>>> 1) Does not tackle GUP-fast >>>>> 2) Does not take care of !FOLL_LONGTERM >>>>> >>>>> I am not convinced by the security argument in regard to this patch. >>>>> >>>>> >>>>> If we want to sells this as a security thing, we have to block it >>>>> *completely* and then CC stable. >>>> >>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>>> something similar as I did in gup_must_unshare(): >>>> >>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >>> >>> How do we determine it's non-anon in the first place? The check is on the >>> VMA. We could do it by following page tables down to folio and checking >>> folio->mapping for PAGE_MAPPING_ANON I suppose? >> >> PageAnon(page) can be called from GUP-fast after grabbing a reference. See >> gup_must_unshare(). > > Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this? > People will silently got degrade even on legal pins on shmem/hugetlb, I > think, which seems to be still a very major use case. > Right. Optimizing for hugetlb should be easy. Shmem is problematic. I once raised to John that PageAnonExclusive is essentially a "anon page is pinnable" flag. Too bad we don't have spare page flags ;)
On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > On 28.04.23 17:34, David Hildenbrand wrote: > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > items. > > > > > > > > > > As this patch > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > *completely* and then CC stable. > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > something similar as I did in gup_must_unshare(): > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > VMA. We could do it by following page tables down to folio and checking > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > See gup_must_unshare(). > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > hugetlb eventually, as it's table while we hold a (temporary) reference. > Shmem might be not so easy ... page->mapping->a_ops should be enough to whitelist whatever fs you want.
On 28.04.23 18:09, Kirill A . Shutemov wrote: > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: >> On 28.04.23 17:34, David Hildenbrand wrote: >>> On 28.04.23 17:33, Lorenzo Stoakes wrote: >>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>>>> >>>>>>> Security is the primary case where we have historically closed uAPI >>>>>>> items. >>>>>> >>>>>> As this patch >>>>>> >>>>>> 1) Does not tackle GUP-fast >>>>>> 2) Does not take care of !FOLL_LONGTERM >>>>>> >>>>>> I am not convinced by the security argument in regard to this patch. >>>>>> >>>>>> >>>>>> If we want to sells this as a security thing, we have to block it >>>>>> *completely* and then CC stable. >>>>> >>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>>>> something similar as I did in gup_must_unshare(): >>>>> >>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >>>> >>>> How do we determine it's non-anon in the first place? The check is on the >>>> VMA. We could do it by following page tables down to folio and checking >>>> folio->mapping for PAGE_MAPPING_ANON I suppose? >>> >>> PageAnon(page) can be called from GUP-fast after grabbing a reference. >>> See gup_must_unshare(). >> >> IIRC, PageHuge() can also be called from GUP-fast and could special-case >> hugetlb eventually, as it's table while we hold a (temporary) reference. >> Shmem might be not so easy ... > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > The issue is how to stabilize that from GUP-fast, such that we can safely dereference the mapping. Any idea? At least for anon page I know that page->mapping only gets cleared when freeing the page, and we don't dereference the mapping but only check a single flag stored alongside the mapping. Therefore, PageAnon() is fine in GUP-fast context.
On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > items. > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > See gup_must_unshare(). > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > Shmem might be not so easy ... > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > dereference the mapping. Any idea? > > At least for anon page I know that page->mapping only gets cleared when > freeing the page, and we don't dereference the mapping but only check a > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > GUP-fast context. What codepath you are worry about that clears ->mapping on pages with non-zero refcount? I can only think of truncate (and punch hole). READ_ONCE(page->mapping) and fail GUP_fast if it is NULL should be fine, no? I guess we should consider if the inode can be freed from under us and the mapping pointer becomes dangling. But I think we should be fine here too: VMA pins inode and VMA cannot go away from under GUP. Hm? (I didn't look close at GUP for a while and my reasoning might be off.)
On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > > items. > > > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > > See gup_must_unshare(). > > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > > Shmem might be not so easy ... > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > > dereference the mapping. Any idea? > > > > At least for anon page I know that page->mapping only gets cleared when > > freeing the page, and we don't dereference the mapping but only check a > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > > GUP-fast context. > > What codepath you are worry about that clears ->mapping on pages with > non-zero refcount? > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping) > and fail GUP_fast if it is NULL should be fine, no? > > I guess we should consider if the inode can be freed from under us and the > mapping pointer becomes dangling. But I think we should be fine here too: > VMA pins inode and VMA cannot go away from under GUP. Can vma still go away if during a fast-gup? > > Hm? > > (I didn't look close at GUP for a while and my reasoning might be off.) Thanks,
On 28.04.23 18:39, Peter Xu wrote: > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: >> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: >>> On 28.04.23 18:09, Kirill A . Shutemov wrote: >>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: >>>>> On 28.04.23 17:34, David Hildenbrand wrote: >>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote: >>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> Security is the primary case where we have historically closed uAPI >>>>>>>>>> items. >>>>>>>>> >>>>>>>>> As this patch >>>>>>>>> >>>>>>>>> 1) Does not tackle GUP-fast >>>>>>>>> 2) Does not take care of !FOLL_LONGTERM >>>>>>>>> >>>>>>>>> I am not convinced by the security argument in regard to this patch. >>>>>>>>> >>>>>>>>> >>>>>>>>> If we want to sells this as a security thing, we have to block it >>>>>>>>> *completely* and then CC stable. >>>>>>>> >>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>>>>>>> something similar as I did in gup_must_unshare(): >>>>>>>> >>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >>>>>>> >>>>>>> How do we determine it's non-anon in the first place? The check is on the >>>>>>> VMA. We could do it by following page tables down to folio and checking >>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose? >>>>>> >>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference. >>>>>> See gup_must_unshare(). >>>>> >>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case >>>>> hugetlb eventually, as it's table while we hold a (temporary) reference. >>>>> Shmem might be not so easy ... >>>> >>>> page->mapping->a_ops should be enough to whitelist whatever fs you want. >>>> >>> >>> The issue is how to stabilize that from GUP-fast, such that we can safely >>> dereference the mapping. Any idea? >>> >>> At least for anon page I know that page->mapping only gets cleared when >>> freeing the page, and we don't dereference the mapping but only check a >>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in >>> GUP-fast context. >> >> What codepath you are worry about that clears ->mapping on pages with >> non-zero refcount? >> >> I can only think of truncate (and punch hole). READ_ONCE(page->mapping) >> and fail GUP_fast if it is NULL should be fine, no? >> >> I guess we should consider if the inode can be freed from under us and the >> mapping pointer becomes dangling. But I think we should be fine here too: >> VMA pins inode and VMA cannot go away from under GUP. > > Can vma still go away if during a fast-gup? > So, after we grabbed the page and made sure the the PTE didn't change (IOW, the PTE was stable while we processed it), the page can get unmapped (but not freed, because we hold a reference) and the VMA can theoretically go away (and as far as I understand, nothing stops the file from getting deleted, truncated etc). So we might be looking at folio->mapping and the VMA is no longer there. Maybe even the file is no longer there.
On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: > On 28.04.23 18:39, Peter Xu wrote: > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > > > > items. > > > > > > > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > > > > See gup_must_unshare(). > > > > > > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > > > > Shmem might be not so easy ... > > > > > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > > > > > > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > > > > dereference the mapping. Any idea? > > > > > > > > At least for anon page I know that page->mapping only gets cleared when > > > > freeing the page, and we don't dereference the mapping but only check a > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > > > > GUP-fast context. > > > > > > What codepath you are worry about that clears ->mapping on pages with > > > non-zero refcount? > > > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping) > > > and fail GUP_fast if it is NULL should be fine, no? > > > > > > I guess we should consider if the inode can be freed from under us and the > > > mapping pointer becomes dangling. But I think we should be fine here too: > > > VMA pins inode and VMA cannot go away from under GUP. > > > > Can vma still go away if during a fast-gup? > > > > So, after we grabbed the page and made sure the the PTE didn't change (IOW, > the PTE was stable while we processed it), the page can get unmapped (but > not freed, because we hold a reference) and the VMA can theoretically go > away (and as far as I understand, nothing stops the file from getting > deleted, truncated etc). > > So we might be looking at folio->mapping and the VMA is no longer there. > Maybe even the file is no longer there. No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And TLB flushing is serialized against GUP_fast().
On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: > On 28.04.23 18:39, Peter Xu wrote: > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > > > > items. > > > > > > > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > > > > See gup_must_unshare(). > > > > > > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > > > > Shmem might be not so easy ... > > > > > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > > > > > > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > > > > dereference the mapping. Any idea? > > > > > > > > At least for anon page I know that page->mapping only gets cleared when > > > > freeing the page, and we don't dereference the mapping but only check a > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > > > > GUP-fast context. > > > > > > What codepath you are worry about that clears ->mapping on pages with > > > non-zero refcount? > > > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping) > > > and fail GUP_fast if it is NULL should be fine, no? > > > > > > I guess we should consider if the inode can be freed from under us and the > > > mapping pointer becomes dangling. But I think we should be fine here too: > > > VMA pins inode and VMA cannot go away from under GUP. > > > > Can vma still go away if during a fast-gup? > > > > So, after we grabbed the page and made sure the the PTE didn't change (IOW, > the PTE was stable while we processed it), the page can get unmapped (but > not freed, because we hold a reference) and the VMA can theoretically go > away (and as far as I understand, nothing stops the file from getting > deleted, truncated etc). > > So we might be looking at folio->mapping and the VMA is no longer there. > Maybe even the file is no longer there. > This shouldn't be an issue though right? Because after a pup call unlocks the mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee the mapping remains valid, only pinning the underlying folio. I'm thinking of respinning with a gup_fast component then, if a_ops is sufficient to identify file systems. We'll just revert to slow path for non-FOLL_FAST_ONLY cases. This would at least cover both FOLL_LONGTERM angles and could provoke some further interesting discussion :) > -- > Thanks, > > David / dhildenb >
On Fri, Apr 28, 2023 at 07:56:23PM +0300, Kirill A . Shutemov wrote: > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: > > On 28.04.23 18:39, Peter Xu wrote: > > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: > > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > > > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > > > > > items. > > > > > > > > > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > > > > > See gup_must_unshare(). > > > > > > > > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > > > > > Shmem might be not so easy ... > > > > > > > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > > > > > > > > > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > > > > > dereference the mapping. Any idea? > > > > > > > > > > At least for anon page I know that page->mapping only gets cleared when > > > > > freeing the page, and we don't dereference the mapping but only check a > > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > > > > > GUP-fast context. > > > > > > > > What codepath you are worry about that clears ->mapping on pages with > > > > non-zero refcount? > > > > > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping) > > > > and fail GUP_fast if it is NULL should be fine, no? > > > > > > > > I guess we should consider if the inode can be freed from under us and the > > > > mapping pointer becomes dangling. But I think we should be fine here too: > > > > VMA pins inode and VMA cannot go away from under GUP. > > > > > > Can vma still go away if during a fast-gup? > > > > > > > So, after we grabbed the page and made sure the the PTE didn't change (IOW, > > the PTE was stable while we processed it), the page can get unmapped (but > > not freed, because we hold a reference) and the VMA can theoretically go > > away (and as far as I understand, nothing stops the file from getting > > deleted, truncated etc). > > > > So we might be looking at folio->mapping and the VMA is no longer there. > > Maybe even the file is no longer there. > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And > TLB flushing is serialized against GUP_fast(). > But the lockless path doesn't hold the PTL? So PTE can disappear at anytime, since we use ptep_get_lockless(). But it won't matter because then the fast path will just fail anyway and can revert back to slow one. > -- > Kiryl Shutsemau / Kirill A. Shutemov
On 28.04.23 18:56, Kirill A . Shutemov wrote: > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: >> On 28.04.23 18:39, Peter Xu wrote: >>> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: >>>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: >>>>> On 28.04.23 18:09, Kirill A . Shutemov wrote: >>>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: >>>>>>> On 28.04.23 17:34, David Hildenbrand wrote: >>>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote: >>>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>>>>>>>>> >>>>>>>>>>>> Security is the primary case where we have historically closed uAPI >>>>>>>>>>>> items. >>>>>>>>>>> >>>>>>>>>>> As this patch >>>>>>>>>>> >>>>>>>>>>> 1) Does not tackle GUP-fast >>>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM >>>>>>>>>>> >>>>>>>>>>> I am not convinced by the security argument in regard to this patch. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> If we want to sells this as a security thing, we have to block it >>>>>>>>>>> *completely* and then CC stable. >>>>>>>>>> >>>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>>>>>>>>> something similar as I did in gup_must_unshare(): >>>>>>>>>> >>>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >>>>>>>>> >>>>>>>>> How do we determine it's non-anon in the first place? The check is on the >>>>>>>>> VMA. We could do it by following page tables down to folio and checking >>>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose? >>>>>>>> >>>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference. >>>>>>>> See gup_must_unshare(). >>>>>>> >>>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case >>>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference. >>>>>>> Shmem might be not so easy ... >>>>>> >>>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want. >>>>>> >>>>> >>>>> The issue is how to stabilize that from GUP-fast, such that we can safely >>>>> dereference the mapping. Any idea? >>>>> >>>>> At least for anon page I know that page->mapping only gets cleared when >>>>> freeing the page, and we don't dereference the mapping but only check a >>>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in >>>>> GUP-fast context. >>>> >>>> What codepath you are worry about that clears ->mapping on pages with >>>> non-zero refcount? >>>> >>>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping) >>>> and fail GUP_fast if it is NULL should be fine, no? >>>> >>>> I guess we should consider if the inode can be freed from under us and the >>>> mapping pointer becomes dangling. But I think we should be fine here too: >>>> VMA pins inode and VMA cannot go away from under GUP. >>> >>> Can vma still go away if during a fast-gup? >>> >> >> So, after we grabbed the page and made sure the the PTE didn't change (IOW, >> the PTE was stable while we processed it), the page can get unmapped (but >> not freed, because we hold a reference) and the VMA can theoretically go >> away (and as far as I understand, nothing stops the file from getting >> deleted, truncated etc). >> >> So we might be looking at folio->mapping and the VMA is no longer there. >> Maybe even the file is no longer there. > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And > TLB flushing is serialized against GUP_fast(). The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more complicated.
On 28.04.23 19:01, Lorenzo Stoakes wrote: > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: >> On 28.04.23 18:39, Peter Xu wrote: >>> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: >>>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: >>>>> On 28.04.23 18:09, Kirill A . Shutemov wrote: >>>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: >>>>>>> On 28.04.23 17:34, David Hildenbrand wrote: >>>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote: >>>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>>>>>>>>> >>>>>>>>>>>> Security is the primary case where we have historically closed uAPI >>>>>>>>>>>> items. >>>>>>>>>>> >>>>>>>>>>> As this patch >>>>>>>>>>> >>>>>>>>>>> 1) Does not tackle GUP-fast >>>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM >>>>>>>>>>> >>>>>>>>>>> I am not convinced by the security argument in regard to this patch. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> If we want to sells this as a security thing, we have to block it >>>>>>>>>>> *completely* and then CC stable. >>>>>>>>>> >>>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>>>>>>>>> something similar as I did in gup_must_unshare(): >>>>>>>>>> >>>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >>>>>>>>> >>>>>>>>> How do we determine it's non-anon in the first place? The check is on the >>>>>>>>> VMA. We could do it by following page tables down to folio and checking >>>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose? >>>>>>>> >>>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference. >>>>>>>> See gup_must_unshare(). >>>>>>> >>>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case >>>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference. >>>>>>> Shmem might be not so easy ... >>>>>> >>>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want. >>>>>> >>>>> >>>>> The issue is how to stabilize that from GUP-fast, such that we can safely >>>>> dereference the mapping. Any idea? >>>>> >>>>> At least for anon page I know that page->mapping only gets cleared when >>>>> freeing the page, and we don't dereference the mapping but only check a >>>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in >>>>> GUP-fast context. >>>> >>>> What codepath you are worry about that clears ->mapping on pages with >>>> non-zero refcount? >>>> >>>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping) >>>> and fail GUP_fast if it is NULL should be fine, no? >>>> >>>> I guess we should consider if the inode can be freed from under us and the >>>> mapping pointer becomes dangling. But I think we should be fine here too: >>>> VMA pins inode and VMA cannot go away from under GUP. >>> >>> Can vma still go away if during a fast-gup? >>> >> >> So, after we grabbed the page and made sure the the PTE didn't change (IOW, >> the PTE was stable while we processed it), the page can get unmapped (but >> not freed, because we hold a reference) and the VMA can theoretically go >> away (and as far as I understand, nothing stops the file from getting >> deleted, truncated etc). >> >> So we might be looking at folio->mapping and the VMA is no longer there. >> Maybe even the file is no longer there. >> > > This shouldn't be an issue though right? Because after a pup call unlocks the > mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee > the mapping remains valid, only pinning the underlying folio. Yes. But the issue here is rather dereferencing something that has already been freed, eventually leading to undefined behavior. Maybe de-referencing folio->mapping is fine ... but yes, we could handle that optimization in a separate patch.
On Fri, Apr 28, 2023 at 07:05:38PM +0200, David Hildenbrand wrote: > On 28.04.23 19:01, Lorenzo Stoakes wrote: > > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: > > > On 28.04.23 18:39, Peter Xu wrote: > > > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: > > > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > > > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > > > > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > > > > > > items. > > > > > > > > > > > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > > > > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > > > > > > See gup_must_unshare(). > > > > > > > > > > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > > > > > > Shmem might be not so easy ... > > > > > > > > > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > > > > > > > > > > > > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > > > > > > dereference the mapping. Any idea? > > > > > > > > > > > > At least for anon page I know that page->mapping only gets cleared when > > > > > > freeing the page, and we don't dereference the mapping but only check a > > > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > > > > > > GUP-fast context. > > > > > > > > > > What codepath you are worry about that clears ->mapping on pages with > > > > > non-zero refcount? > > > > > > > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping) > > > > > and fail GUP_fast if it is NULL should be fine, no? > > > > > > > > > > I guess we should consider if the inode can be freed from under us and the > > > > > mapping pointer becomes dangling. But I think we should be fine here too: > > > > > VMA pins inode and VMA cannot go away from under GUP. > > > > > > > > Can vma still go away if during a fast-gup? > > > > > > > > > > So, after we grabbed the page and made sure the the PTE didn't change (IOW, > > > the PTE was stable while we processed it), the page can get unmapped (but > > > not freed, because we hold a reference) and the VMA can theoretically go > > > away (and as far as I understand, nothing stops the file from getting > > > deleted, truncated etc). > > > > > > So we might be looking at folio->mapping and the VMA is no longer there. > > > Maybe even the file is no longer there. > > > > > > > This shouldn't be an issue though right? Because after a pup call unlocks the > > mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee > > the mapping remains valid, only pinning the underlying folio. > > Yes. But the issue here is rather dereferencing something that has already > been freed, eventually leading to undefined behavior. > Is that an issue with interrupts disabled though? Will block page tables being removed and as Kirill says (sorry I maybe misinterpreted you) we should be ok. > Maybe de-referencing folio->mapping is fine ... but yes, we could handle > that optimization in a separate patch. > > -- > Thanks, > > David / dhildenb >
On 28.04.23 19:13, Lorenzo Stoakes wrote: > On Fri, Apr 28, 2023 at 07:05:38PM +0200, David Hildenbrand wrote: >> On 28.04.23 19:01, Lorenzo Stoakes wrote: >>> On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: >>>> On 28.04.23 18:39, Peter Xu wrote: >>>>> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: >>>>>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: >>>>>>> On 28.04.23 18:09, Kirill A . Shutemov wrote: >>>>>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: >>>>>>>>> On 28.04.23 17:34, David Hildenbrand wrote: >>>>>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote: >>>>>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Security is the primary case where we have historically closed uAPI >>>>>>>>>>>>>> items. >>>>>>>>>>>>> >>>>>>>>>>>>> As this patch >>>>>>>>>>>>> >>>>>>>>>>>>> 1) Does not tackle GUP-fast >>>>>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM >>>>>>>>>>>>> >>>>>>>>>>>>> I am not convinced by the security argument in regard to this patch. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> If we want to sells this as a security thing, we have to block it >>>>>>>>>>>>> *completely* and then CC stable. >>>>>>>>>>>> >>>>>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do >>>>>>>>>>>> something similar as I did in gup_must_unshare(): >>>>>>>>>>>> >>>>>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable, >>>>>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe. >>>>>>>>>>> >>>>>>>>>>> How do we determine it's non-anon in the first place? The check is on the >>>>>>>>>>> VMA. We could do it by following page tables down to folio and checking >>>>>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose? >>>>>>>>>> >>>>>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference. >>>>>>>>>> See gup_must_unshare(). >>>>>>>>> >>>>>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case >>>>>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference. >>>>>>>>> Shmem might be not so easy ... >>>>>>>> >>>>>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want. >>>>>>>> >>>>>>> >>>>>>> The issue is how to stabilize that from GUP-fast, such that we can safely >>>>>>> dereference the mapping. Any idea? >>>>>>> >>>>>>> At least for anon page I know that page->mapping only gets cleared when >>>>>>> freeing the page, and we don't dereference the mapping but only check a >>>>>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in >>>>>>> GUP-fast context. >>>>>> >>>>>> What codepath you are worry about that clears ->mapping on pages with >>>>>> non-zero refcount? >>>>>> >>>>>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping) >>>>>> and fail GUP_fast if it is NULL should be fine, no? >>>>>> >>>>>> I guess we should consider if the inode can be freed from under us and the >>>>>> mapping pointer becomes dangling. But I think we should be fine here too: >>>>>> VMA pins inode and VMA cannot go away from under GUP. >>>>> >>>>> Can vma still go away if during a fast-gup? >>>>> >>>> >>>> So, after we grabbed the page and made sure the the PTE didn't change (IOW, >>>> the PTE was stable while we processed it), the page can get unmapped (but >>>> not freed, because we hold a reference) and the VMA can theoretically go >>>> away (and as far as I understand, nothing stops the file from getting >>>> deleted, truncated etc). >>>> >>>> So we might be looking at folio->mapping and the VMA is no longer there. >>>> Maybe even the file is no longer there. >>>> >>> >>> This shouldn't be an issue though right? Because after a pup call unlocks the >>> mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee >>> the mapping remains valid, only pinning the underlying folio. >> >> Yes. But the issue here is rather dereferencing something that has already >> been freed, eventually leading to undefined behavior. >> > > Is that an issue with interrupts disabled though? Will block page tables being > removed and as Kirill says (sorry I maybe misinterpreted you) we should be ok. Let's rule out page table freeing. If our VMA only spans a single page and falls into the same PMD as another VMA, an munmap() would not even free a single page table. However, if unmapping a page (flushing the TLB) would imply an IPI as Kirill said, we'd be fine. I recall that that's not the case for all architectures, but I might be just wrong. ... and now I'll stop reading mails until Tuesday :)
On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote: > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And > > TLB flushing is serialized against GUP_fast(). > > The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more > complicated. Yeah, you have to think of gup_fast as RCU with a hacky pre-RCU implementation on most architectures. We could make page->mapping safe under RCU, for instance. Jason
On Fri, Apr 28, 2023 at 11:56:55AM -0400, Peter Xu wrote: > > PageAnon(page) can be called from GUP-fast after grabbing a reference. See > > gup_must_unshare(). > > Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this? > People will silently got degrade even on legal pins on shmem/hugetlb, I > think, which seems to be still a very major use case. Remember gup fast was like this until quite recently - DAX wrecked it. I fixed it when I changed DAX to not post-scan the VMA list.. I'm not sure longterm and really fast need to go together. Jason
On Fri, Apr 28, 2023 at 02:31:38PM -0300, Jason Gunthorpe wrote: > On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote: > > > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And > > > TLB flushing is serialized against GUP_fast(). > > > > The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more > > complicated. > > Yeah, you have to think of gup_fast as RCU with a hacky pre-RCU implementation > on most architectures. > > We could make page->mapping safe under RCU, for instance. > > Jason Does it really require a change though? I might be missing some details, but afaict with interrupts disabled we should be ok to deref page->mapping to check PageAnon and a_ops before handing back a page right?
On Fri, Apr 28, 2023 at 06:42:41PM +0100, Lorenzo Stoakes wrote: > On Fri, Apr 28, 2023 at 02:31:38PM -0300, Jason Gunthorpe wrote: > > On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote: > > > > > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And > > > > TLB flushing is serialized against GUP_fast(). > > > > > > The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more > > > complicated. > > > > Yeah, you have to think of gup_fast as RCU with a hacky pre-RCU implementation > > on most architectures. > > > > We could make page->mapping safe under RCU, for instance. > > > > Jason > > Does it really require a change though? I might be missing some details, > but afaict with interrupts disabled we should be ok to deref page->mapping > to check PageAnon and a_ops before handing back a page right? AFAIK not on all architectures Jason
On Fri, Apr 28, 2023 at 11:35:32AM -0300, Jason Gunthorpe wrote: > > It has been years now, I think we need to admit a fix is still years > away. Blocking the security problem may even motivate more people to > work on a fix. Do we think we can still trigger a kernel crash, or maybe even some more exciting like an arbitrary buffer overrun, via the process_vm_writev(2) system call into a file-backed mmap'ed region? Maybe if someone can come up with an easy-to-expliot security proof of aconcept, that doesn't require special RDMA hardware or some special libvirt setup, we could finally get motivation to get it fixed, or at least blocked? :-) We've only been talking about it for years, after all... - Ted > Security is the primary case where we have historically closed uAPI > items. > > Jason
On 4/28/23 10:33, Jason Gunthorpe wrote: > On Fri, Apr 28, 2023 at 11:56:55AM -0400, Peter Xu wrote: > >>> PageAnon(page) can be called from GUP-fast after grabbing a reference. See >>> gup_must_unshare(). >> >> Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this? >> People will silently got degrade even on legal pins on shmem/hugetlb, I >> think, which seems to be still a very major use case. > > Remember gup fast was like this until quite recently - DAX wrecked it. > > I fixed it when I changed DAX to not post-scan the VMA list.. > > I'm not sure longterm and really fast need to go together. > Exactly: gup_fast + FOLL_LONGTERM is asking too much of this system, it's really excessive. I like the idea of simply: "if (FOLL_LONGTERM) jump straight to slow gup". thanks,
On 4/28/23 10:29, David Hildenbrand wrote: ... >> Is that an issue with interrupts disabled though? Will block page tables being >> removed and as Kirill says (sorry I maybe misinterpreted you) we should be ok. > > Let's rule out page table freeing. If our VMA only spans a single page > and falls into the same PMD as another VMA, an munmap() would not even > free a single page table. > > However, if unmapping a page (flushing the TLB) would imply an IPI as > Kirill said, we'd be fine. I recall that that's not the case for all > architectures, but I might be just wrong. > Some architectures use IPIs, and others use an RCU-like approach instead. thanks,
On Fri, Apr 28, 2023 at 02:25:53PM -0400, Theodore Ts'o wrote: > On Fri, Apr 28, 2023 at 11:35:32AM -0300, Jason Gunthorpe wrote: > > > > It has been years now, I think we need to admit a fix is still years > > away. Blocking the security problem may even motivate more people to > > work on a fix. > > Do we think we can still trigger a kernel crash, or maybe even some > more exciting like an arbitrary buffer overrun, via the > process_vm_writev(2) system call into a file-backed mmap'ed region? Jens? You blocked it from io_uring, did you have a specific attack in mind? Jason
On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote: > On 28.04.23 18:56, Kirill A . Shutemov wrote: > > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote: > > > On 28.04.23 18:39, Peter Xu wrote: > > > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote: > > > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote: > > > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote: > > > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote: > > > > > > > > On 28.04.23 17:34, David Hildenbrand wrote: > > > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote: > > > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI > > > > > > > > > > > > > items. > > > > > > > > > > > > > > > > > > > > > > > > As this patch > > > > > > > > > > > > > > > > > > > > > > > > 1) Does not tackle GUP-fast > > > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM > > > > > > > > > > > > > > > > > > > > > > > > I am not convinced by the security argument in regard to this patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If we want to sells this as a security thing, we have to block it > > > > > > > > > > > > *completely* and then CC stable. > > > > > > > > > > > > > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do > > > > > > > > > > > something similar as I did in gup_must_unshare(): > > > > > > > > > > > > > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable, > > > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe. > > > > > > > > > > > > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the > > > > > > > > > > VMA. We could do it by following page tables down to folio and checking > > > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose? > > > > > > > > > > > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference. > > > > > > > > > See gup_must_unshare(). > > > > > > > > > > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case > > > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference. > > > > > > > > Shmem might be not so easy ... > > > > > > > > > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want. > > > > > > > > > > > > > > > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely > > > > > > dereference the mapping. Any idea? > > > > > > > > > > > > At least for anon page I know that page->mapping only gets cleared when > > > > > > freeing the page, and we don't dereference the mapping but only check a > > > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in > > > > > > GUP-fast context. > > > > > > > > > > What codepath you are worry about that clears ->mapping on pages with > > > > > non-zero refcount? > > > > > > > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping) > > > > > and fail GUP_fast if it is NULL should be fine, no? > > > > > > > > > > I guess we should consider if the inode can be freed from under us and the > > > > > mapping pointer becomes dangling. But I think we should be fine here too: > > > > > VMA pins inode and VMA cannot go away from under GUP. > > > > > > > > Can vma still go away if during a fast-gup? > > > > > > > > > > So, after we grabbed the page and made sure the the PTE didn't change (IOW, > > > the PTE was stable while we processed it), the page can get unmapped (but > > > not freed, because we hold a reference) and the VMA can theoretically go > > > away (and as far as I understand, nothing stops the file from getting > > > deleted, truncated etc). > > > > > > So we might be looking at folio->mapping and the VMA is no longer there. > > > Maybe even the file is no longer there. > > > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And > > TLB flushing is serialized against GUP_fast(). > > > The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more > complicated. I think I found relevant snippet of code that solves similar issue. get_futex_key() uses RCU to stabilize page->mapping after GUP_fast: /* * The associated futex object in this case is the inode and * the page->mapping must be traversed. Ordinarily this should * be stabilised under page lock but it's not strictly * necessary in this case as we just want to pin the inode, not * update the radix tree or anything like that. * * The RCU read lock is taken as the inode is finally freed * under RCU. If the mapping still matches expectations then the * mapping->host can be safely accessed as being a valid inode. */ rcu_read_lock(); if (READ_ONCE(page->mapping) != mapping) { rcu_read_unlock(); put_page(page); goto again; } inode = READ_ONCE(mapping->host); if (!inode) { rcu_read_unlock(); put_page(page); goto again; } I think something similar can be used inside GUP_fast too.
On Fri, Apr 28, 2023 at 03:50:20PM -0300, Jason Gunthorpe wrote: > > Do we think we can still trigger a kernel crash, or maybe even some > > more exciting like an arbitrary buffer overrun, via the > > process_vm_writev(2) system call into a file-backed mmap'ed region? I paged back into my memory the details, and (un)fortunately(?) it probably can't be turned into high severity security exploit; it's "just" a silent case of data loss. (Which is *so* much better.... :-) There was a reliable reproducer which was found by Syzkaller, that didn't require any kind of exotic hardware or setup[1], and we ultimately kluged a workaround in commit cc5095747edf ("ext4: don't BUG if someone dirty pages without asking ext4 first"). [1] https://lore.kernel.org/all/Yg0m6IjcNmfaSokM@google.com/ Commit cc5095747edf had the (un)fortunate(?) side effect that GUP writes to ext4 file-backed mappings no longer would cause random low-probability crashes on large installations using RDMA, which has apparently removed some of the motivation of really fixing the problem instead of papering over it. The good news is that I'm no longer getting complaints from syzbot for this issue, and *I* don't have to support anyone trying to use RDMA into file-backed mappings. :-) In any case, the file system maintainers' position (mine and I doubt Dave Chinner's position has changed) is that if you write to file-backed mappings via GUP/RDMA/process_vm_writev, and it causes silent data corruption, you get to keep both pieces, and don't go looking for us for anything other than sympathy... - Ted
On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote: > In any case, the file system maintainers' position (mine and I doubt > Dave Chinner's position has changed) is that if you write to > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes > silent data corruption, you get to keep both pieces, and don't go > looking for us for anything other than sympathy... This alone is enough reason to block it. I'm tired of this round and round and I think we should just say enough, the mm will work to enforce this view point. Files can only be written through PTEs. If this upsets people they can work on fixing it, but at least we don't have these kernel problems and inconsistencies to deal with. Jason
On Sat, Apr 29, 2023 at 08:01:11PM -0300, Jason Gunthorpe wrote: > On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote: > > > In any case, the file system maintainers' position (mine and I doubt > > Dave Chinner's position has changed) is that if you write to > > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes > > silent data corruption, you get to keep both pieces, and don't go > > looking for us for anything other than sympathy... > > This alone is enough reason to block it. I'm tired of this round and > round and I think we should just say enough, the mm will work to > enforce this view point. Files can only be written through PTEs. > > If this upsets people they can work on fixing it, but at least we > don't have these kernel problems and inconsistencies to deal with. > Indeed, I think there is a broad consensus that FOLL_LONGTERM is such an egregious case that this is the way forward. I will respin with the discussed GUP-fast changes relatively soon and then we can see where that takes us :) > Jason
On Mon, May 01, 2023 at 05:27:18PM +1000, Dave Chinner wrote: > On Sat, Apr 29, 2023 at 08:01:11PM -0300, Jason Gunthorpe wrote: > > On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote: > > > > > In any case, the file system maintainers' position (mine and I doubt > > > Dave Chinner's position has changed) is that if you write to > > > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes > > > silent data corruption, you get to keep both pieces, and don't go > > > looking for us for anything other than sympathy... > > > > This alone is enough reason to block it. I'm tired of this round and > > round and I think we should just say enough, the mm will work to > > enforce this view point. Files can only be written through PTEs. > > It has to be at least 5 years ago now that we were told that the > next-gen RDMA hardware would be able to trigger hardware page faults > when remote systems dirtied local pages. This would enable > ->page-mkwrite to be run on file backed pages mapped pages just like > local CPU write faults and everything would be fine. Things are progressing, but I'm not as optimistic as I once was.. - Today mlx5 has ODP which allows this to work using hmm_range_fault() techniques. I know of at least one deployment using this with a DAX configuration. This is now at least 5 years old stuff. The downside is that HMM approaches yield poor wost case performance, and have weird caching corner cases. This is still only one vendor, in the past 5 years nobody else stepped up to implement it. - Intel Sapphire Rapids chips have ATS/PRI support and we are doing research on integrating mlx5 with that. In Linux this is called "IOMMU SVA". However, performance is wonky - in the best case it is worse than ODP but it removes ODP's worst case corners. It also makes the entire MM notably slower for processes that turn it on. Who knows when or what this will turn out to be useful for. - Full cache coherence with CXL. CXL has taken a long time to really reach the mainstream market - maybe next gen of server CPUs. I'm not aware of anyone doing work here in the RDMA space, it is difficult to see the benefit. This seems likely to be very popular in the GPU space, I already see some products announced. This is a big topic on its own for FSs.. So, basically, you can make it work on the most popular HW, but at the cost of top performance. Which makes it unpopular. I don't expect anything on the horizon to subtantially change this calculus, the latency cost of doing ATS like things is an inherent performance penalty that can't be overcome Jason
On Sat 29-04-23 02:43:32, Kirill A . Shutemov wrote: > I think I found relevant snippet of code that solves similar issue. > get_futex_key() uses RCU to stabilize page->mapping after GUP_fast: > > > /* > * The associated futex object in this case is the inode and > * the page->mapping must be traversed. Ordinarily this should > * be stabilised under page lock but it's not strictly > * necessary in this case as we just want to pin the inode, not > * update the radix tree or anything like that. > * > * The RCU read lock is taken as the inode is finally freed > * under RCU. If the mapping still matches expectations then the > * mapping->host can be safely accessed as being a valid inode. > */ > rcu_read_lock(); > > if (READ_ONCE(page->mapping) != mapping) { > rcu_read_unlock(); > put_page(page); > > goto again; > } > > inode = READ_ONCE(mapping->host); > if (!inode) { > rcu_read_unlock(); > put_page(page); > > goto again; > } > > I think something similar can be used inside GUP_fast too. Yeah, inodes (and thus struct address_space) is RCU protected these days so grabbing RCU lock in gup_fast() will get you enough protection for checking aops if you are careful (like the futex code is). Honza
On Tue, May 02, 2023 at 10:00:16AM +0200, Jan Kara wrote: > On Sat 29-04-23 02:43:32, Kirill A . Shutemov wrote: > > I think I found relevant snippet of code that solves similar issue. > > get_futex_key() uses RCU to stabilize page->mapping after GUP_fast: > > > > > > /* > > * The associated futex object in this case is the inode and > > * the page->mapping must be traversed. Ordinarily this should > > * be stabilised under page lock but it's not strictly > > * necessary in this case as we just want to pin the inode, not > > * update the radix tree or anything like that. > > * > > * The RCU read lock is taken as the inode is finally freed > > * under RCU. If the mapping still matches expectations then the > > * mapping->host can be safely accessed as being a valid inode. > > */ > > rcu_read_lock(); > > > > if (READ_ONCE(page->mapping) != mapping) { > > rcu_read_unlock(); > > put_page(page); > > > > goto again; > > } > > > > inode = READ_ONCE(mapping->host); > > if (!inode) { > > rcu_read_unlock(); > > put_page(page); > > > > goto again; > > } > > > > I think something similar can be used inside GUP_fast too. > > Yeah, inodes (and thus struct address_space) is RCU protected these days so > grabbing RCU lock in gup_fast() will get you enough protection for checking > aops if you are careful (like the futex code is). GUP_fast has IRQs disabled per definition, there is no need to then also use rcu_read_lock().
diff --git a/include/linux/mm.h b/include/linux/mm.h index 37554b08bb28..f7da02fc89c6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ MM_CP_UFFD_WP_RESOLVE) +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) { diff --git a/mm/gup.c b/mm/gup.c index 1f72a717232b..d36a5db9feb1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, return 0; } +/* + * Writing to file-backed mappings which require folio dirty tracking using GUP + * is a fundamentally broken operation, as kernel write access to GUP mappings + * do not adhere to the semantics expected by a file system. + * + * Consider the following scenario:- + * + * 1. A folio is written to via GUP which write-faults the memory, notifying + * the file system and dirtying the folio. + * 2. Later, writeback is triggered, resulting in the folio being cleaned and + * the PTE being marked read-only. + * 3. The GUP caller writes to the folio, as it is mapped read/write via the + * direct mapping. + * 4. The GUP caller, now done with the page, unpins it and sets it dirty + * (though it does not have to). + * + * This results in both data being written to a folio without writenotify, and + * the folio being dirtied unexpectedly (if the caller decides to do so). + */ +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, + unsigned long gup_flags) +{ + /* If we aren't pinning then no problematic write can occur. */ + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) + return true; + + /* We limit this check to the most egregious case - a long term pin. */ + if (!(gup_flags & FOLL_LONGTERM)) + return true; + + /* If the VMA requires dirty tracking then GUP will be problematic. */ + return vma_needs_dirty_tracking(vma); +} + static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) { vm_flags_t vm_flags = vma->vm_flags; int write = (gup_flags & FOLL_WRITE); int foreign = (gup_flags & FOLL_REMOTE); + bool vma_anon = vma_is_anonymous(vma); if (vm_flags & (VM_IO | VM_PFNMAP)) return -EFAULT; - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) + if ((gup_flags & FOLL_ANON) && !vma_anon) return -EFAULT; if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) return -EFAULT; if (write) { + if (!vma_anon && + !writeable_file_mapping_allowed(vma, gup_flags)) + return -EFAULT; + if (!(vm_flags & VM_WRITE)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; diff --git a/mm/mmap.c b/mm/mmap.c index 536bbb8fa0ae..7b6344d1832a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) } #endif /* __ARCH_WANT_SYS_OLD_MMAP */ +/* Do VMA operations imply write notify is required? */ +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) +{ + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); +} + +/* + * Does this VMA require the underlying folios to have their dirty state + * tracked? + */ +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) +{ + /* Does the filesystem need to be notified? */ + if (vm_ops_needs_writenotify(vma->vm_ops)) + return true; + + /* Specialty mapping? */ + if (vma->vm_flags & VM_PFNMAP) + return false; + + /* Can the mapping track the dirty pages? */ + return vma->vm_file && vma->vm_file->f_mapping && + mapping_can_writeback(vma->vm_file->f_mapping); +} + /* * Some shared mappings will want the pages marked read-only * to track write events. If so, we'll downgrade vm_page_prot @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) { vm_flags_t vm_flags = vma->vm_flags; - const struct vm_operations_struct *vm_ops = vma->vm_ops; /* If it was private or non-writable, the write bit is already clear */ if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) return 0; /* The backer wishes to know when pages are first written to? */ - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) + if (vm_ops_needs_writenotify(vma->vm_ops)) return 1; /* The open routine did something to the protections that pgprot_modify @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) if (userfaultfd_wp(vma)) return 1; - /* Specialty mapping? */ - if (vm_flags & VM_PFNMAP) - return 0; - - /* Can the mapping track the dirty pages? */ - return vma->vm_file && vma->vm_file->f_mapping && - mapping_can_writeback(vma->vm_file->f_mapping); + return vma_needs_dirty_tracking(vma); } /*