Message ID | 20230525223953.225496-2-dhowells@redhat.com |
---|---|
State | New |
From | David Howells <dhowells@redhat.com> |
Subject | [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() |
Date | Thu, 25 May 2023 23:39:51 +0100 |
Series | block: Make old dio use iov_iter_extract_pages() and page pinning |
Commit Message
David Howells
May 25, 2023, 10:39 p.m. UTC
Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.
Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---
Notes:
ver #2)
- Fix use of ZERO_PAGE().
- Add is_zero_page() and is_zero_folio() wrappers.
- Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
include/linux/pgtable.h | 10 ++++++++++
mm/gup.c | 25 ++++++++++++++++++++++++-
2 files changed, 34 insertions(+), 1 deletion(-)
Comments
On Thu, May 25, 2023 at 11:39:51PM +0100, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.

I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
time?

> [commit message tags and diffstat snipped]
>
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +

This will capture huge page cases too which have folio->_pincount and thus
don't suffer the GUP_PIN_COUNTING_BIAS issue, however it is equally logical
to simply skip these when pinning.

This does make me think that we should just skip pinning for FOLL_GET cases
too - there's literally no sane reason we should be pinning zero pages in
any case (unless I'm missing something!)

> [remaining hunks snipped]
>
> + * Note that if the zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>  */
> long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> 			     struct page **pages, unsigned int gup_flags)

Shouldn't this comment be added to pup() and pup_remote() also?

Also I think it's well worth updating
Documentation/core-api/pin_user_pages.rst to reflect this, as that is
explicitly referenced by a few comments and it's a worthwhile corner case
to cover.

Another nitty thing that I noticed is, in is_longterm_pinnable_page():-

	/* The zero page may always be pinned */
	if (is_zero_pfn(page_to_pfn(page)))
		return true;

Which, strictly speaking, I suppose we are 'pinning' - or rather allowing
the pin to succeed without actually pinning - but to be super pedantic
perhaps it's worth updating this comment too.

Other than the pedantic nits, this patch looks good and makes a lot of
sense, so:-

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
On Fri, May 26, 2023 at 09:10:33AM +0100, Lorenzo Stoakes wrote:
> I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
> be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
> time?

I think FOLL_GET would be just as useful. But given that we have a few
places that release pins while gets just do a put_page, it would be a lot
more effort to audit all of them. Maybe it's better to only do this once
we've converted all the places that should do pin and have very few
FOLL_GET users left.
Shouldn't unpin_user_pages and bio_release_page also be updated to skip a zero page here?
On Fri, May 26, 2023 at 01:22:54AM -0700, Christoph Hellwig wrote:
> Shouldn't unpin_user_pages and bio_release_page also be updated to
> skip a zero page here?

unpin_user_pages*() all call gup_put_folio(), which already skips the zero
page, so we should be covered on that front.
Christoph Hellwig <hch@infradead.org> wrote:
> Shouldn't unpin_user_pages

It calls gup_put_folio(), which is where the skip is.

> and bio_release_page also be updated to skip a zero page here?

Probably best to leave it to unpin_user_page() there. I've tried to hide
the behaviour entirely within gup.c.

David
Lorenzo Stoakes <lstoakes@gmail.com> wrote:
> I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
> be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
> time?

It's not that - it's that iov_iter_get_pages*() is a lot more commonly used
at the moment, and we'd have to find *all* the places that things using
that hand refs around.

iov_iter_extract_pages(), on the other hand, is only used in two places
with these patches and the pins are always released with
unpin_user_page*(), so it's a lot easier to audit.

I could modify put_page(), folio_put(), etc. to ignore the zero pages, but
that might have a larger performance impact.

> > +		if (is_zero_page(page))
> > +			return page_folio(page);
> > +
>
> This will capture huge page cases too which have folio->_pincount and thus
> don't suffer the GUP_PIN_COUNTING_BIAS issue, however it is equally logical
> to simply skip these when pinning.

I'm not sure I understand. The zero page(s) is/are single-page folios?

> This does make me think that we should just skip pinning for FOLL_GET cases
> too - there's literally no sane reason we should be pinning zero pages in
> any case (unless I'm missing something!)

As mentioned above, there's a code auditing issue and a potential
performance issue, depending on how it's done.

> Another nitty thing that I noticed is, in is_longterm_pinnable_page():-
>
> 	/* The zero page may always be pinned */
> 	if (is_zero_pfn(page_to_pfn(page)))
> 		return true;
>
> Which, strictly speaking I suppose we are 'pinning' it or rather allowing
> the pin to succeed without actually pinning, but to be super pedantic
> perhaps it's worth updating this comment too.

Yeah. It is "pinnable" but no pin will actually be added.

David
On Fri, May 26, 2023 at 09:43:35AM +0100, David Howells wrote:
> It's not that - it's that iov_iter_get_pages*() is a lot more commonly
> used at the moment, and we'd have to find *all* the places that things
> using that hand refs around.
>
> iov_iter_extract_pages(), on the other hand, is only used in two places
> with these patches and the pins are always released with
> unpin_user_page*(), so it's a lot easier to audit.

Thanks for the clarification. I guess these are the cases where you're
likely to see zero page usage, but since this is changing all PUP*()
callers, don't you need to audit all of those too?

> I'm not sure I understand. The zero page(s) is/are single-page folios?

I'm actually a little unsure of how huge zero pages are handled (not an
area I'm hugely familiar with), so this might just be mistaken; the point
was more that hugetlb calls into this, but it seems that's not an issue.

> As mentioned above, there's a code auditing issue and a potential
> performance issue, depending on how it's done.

Ack, makes sense. It'd be good to have this documented somewhere though,
in the commit message or docs, so this trade-off is clear.

> Yeah. It is "pinnable" but no pin will actually be added.

The comment strikes me as misleading; previously it literally meant that
you could pin the zero page. Now it means that we just don't. I do think,
for the sake of avoiding confusion, this should be tweaked. Obviously
something of a nit, however!

I did dig into this change a fair bit and kept adding then deleting
comments, since you cover all the bases well, so overall this is nice and
I can but nit it :) Nice to see further improvements to GUP, which is
crying out for that.
Lorenzo Stoakes <lstoakes@gmail.com> wrote:
> Thanks for the clarification. I guess these are the cases where you're
> likely to see zero page usage, but since this is changing all PUP*()
> callers don't you need to audit all of those too?

I don't think it should be necessary. This only affects pages obtained
from gup with FOLL_PIN - and, so far as I know, those always have to be
released with unpin_user_page*(), which is part of the gup API, and thus it
should be transparent to the users.

Pages obtained FOLL_GET, on the other hand, aren't freed through the gup
API - and there are a bunch of ways of releasing them - and getting
additional refs too.

David
On Fri, May 26, 2023 at 10:15:26AM +0100, David Howells wrote:
> I don't think it should be necessary. This only affects pages obtained
> from gup with FOLL_PIN - and, so far as I know, those always have to be
> released with unpin_user_page*(), which is part of the gup API, and thus
> it should be transparent to the users.

Right, I was only saying so in relation to you stating the need to audit;
for precisely this reason I wondered why you felt the need to :)

> Pages obtained FOLL_GET, on the other hand, aren't freed through the gup
> API - and there are a bunch of ways of releasing them - and getting
> additional refs too.

Yes, that's a very good point! Sorry, in my enthusiasm for GUP reform this
thorny aspect slipped my mind... As Christoph said, though, hopefully over
time we can limit the use of FOLL_GET so this becomes easier.

Larger discussion on this area in [0] :)

[0]: https://lore.kernel.org/all/ZGWnq%2FdAYELyKpTy@infradead.org/
On 26.05.23 11:15, David Howells wrote:
> I don't think it should be necessary. This only affects pages obtained
> from gup with FOLL_PIN - and, so far as I know, those always have to be
> released with unpin_user_page*(), which is part of the gup API, and thus
> it should be transparent to the users.

Right, and even code like 873aefb376bb ("vfio/type1: Unpin zero pages")
would handle it transparently, because it also calls unpin_user_page().

[we can remove 873aefb376bb even without this change, because it uses
FOLL_LONGTERM, which shouldn't return the shared zeropage anymore]
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..2b0431a11de2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1245,6 +1245,16 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
 }
 #endif /* CONFIG_MMU */
 
+static inline bool is_zero_page(const struct page *page)
+{
+	return is_zero_pfn(page_to_pfn(page));
+}
+
+static inline bool is_zero_folio(const struct folio *folio)
+{
+	return is_zero_page(&folio->page);
+}
+
 #ifdef CONFIG_MMU
 
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..69b002628f5d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 		struct page *page = *pages;
 		struct folio *folio = page_folio(page);
 
-		if (!folio_test_anon(folio))
+		if (is_zero_page(page) ||
+		    !folio_test_anon(folio))
 			continue;
 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	else if (flags & FOLL_PIN) {
 		struct folio *folio;
 
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return page_folio(page);
+
 		/*
 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 		 * right zone, so fail and let the caller fall back to the slow
@@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		if (is_zero_folio(folio))
+			return;
 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
 		if (folio_test_large(folio))
 			atomic_sub(refs, &folio->_pincount);
@@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return 0;
+
 		/*
 		 * Similar to try_grab_folio(): be sure to *also*
 		 * increment the normal page refcount field at least once,
@@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if the zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
@@ -3161,6 +3181,9 @@ EXPORT_SYMBOL(pin_user_pages);
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
  * FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if the zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     struct page **pages, unsigned int gup_flags)