Message ID | 20230619231044.112894-6-peterx@redhat.com |
---|---|
State | New |
Headers |
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>, Mike Rapoport <rppt@kernel.org>, David Hildenbrand <david@redhat.com>, Matthew Wilcox <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, John Hubbard <jhubbard@nvidia.com>, "Kirill A . Shutemov" <kirill@shutemov.name>, James Houghton <jthoughton@google.com>, Andrew Morton <akpm@linux-foundation.org>, Lorenzo Stoakes <lstoakes@gmail.com>, Hugh Dickins <hughd@google.com>, Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com, Jason Gunthorpe <jgg@nvidia.com>
Subject: [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
Date: Mon, 19 Jun 2023 19:10:41 -0400
Message-Id: <20230619231044.112894-6-peterx@redhat.com>
In-Reply-To: <20230619231044.112894-1-peterx@redhat.com>
References: <20230619231044.112894-1-peterx@redhat.com> |
Series |
mm/gup: Unify hugetlb, speed up thp
|
|
Commit Message
Peter Xu
June 19, 2023, 11:10 p.m. UTC
The acceleration of THP was done with ctx.page_mask, however it'll be
ignored if **pages is non-NULL.
The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages"). It didn't explain why
we can't optimize the **pages non-NULL case. It's possible that at that
time the major goal was for mm_populate() which should be enough back then.
Optimize thp for all cases, by properly looping over each subpage, doing
cache flushes, and boost refcounts / pincounts where needed in one go.
This can be verified using gup_test below:
# chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
Before: 13992.50 ( +-8.75%)
After: 378.50 (+-69.62%)
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 44 insertions(+), 7 deletions(-)
Comments
On 20.06.23 01:10, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
>
> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages"). It didn't explain why
> we can't optimize the **pages non-NULL case. It's possible that at that
> time the major goal was for mm_populate() which should be enough back then.

In the past we had these sub-page refcounts for THP. My best guess (and I
didn't check if that was still the case in 2013) would be that it was
simpler regarding refcount handling to do it one subpage at a time.

But I might be just wrong.

> Optimize thp for all cases, by properly looping over each subpage, doing
> cache flushes, and boost refcounts / pincounts where needed in one go.
>
> This can be verified using gup_test below:
>
> # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>
> Before: 13992.50 ( +-8.75%)
> After: 378.50 (+-69.62%)
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 4a00d609033e..b50272012e49 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
>                          goto out;
>                  }
>  next_page:
> -                if (pages) {
> -                        pages[i] = page;
> -                        flush_anon_page(vma, page, start);
> -                        flush_dcache_page(page);
> -                        ctx.page_mask = 0;
> -                }
> -
>                  page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>                  if (page_increm > nr_pages)
>                          page_increm = nr_pages;
> +
> +                if (pages) {
> +                        struct page *subpage;
> +                        unsigned int j;
> +
> +                        /*
> +                         * This must be a large folio (and doesn't need to
> +                         * be the whole folio; it can be part of it), do
> +                         * the refcount work for all the subpages too.
> +                         *
> +                         * NOTE: here the page may not be the head page
> +                         * e.g. when start addr is not thp-size aligned.
> +                         * try_grab_folio() should have taken care of tail
> +                         * pages.
> +                         */
> +                        if (page_increm > 1) {
> +                                struct folio *folio;
> +
> +                                /*
> +                                 * Since we already hold refcount on the
> +                                 * large folio, this should never fail.
> +                                 */
> +                                folio = try_grab_folio(page, page_increm - 1,
> +                                        foll_flags);
> +                                if (WARN_ON_ONCE(!folio)) {
> +                                        /*
> +                                         * Release the 1st page ref if the
> +                                         * folio is problematic, fail hard.
> +                                         */
> +                                        gup_put_folio(page_folio(page), 1,
> +                                                foll_flags);
> +                                        ret = -EFAULT;
> +                                        goto out;
> +                                }
> +                        }
> +
> +                        for (j = 0; j < page_increm; j++) {
> +                                subpage = nth_page(page, j);
> +                                pages[i+j] = subpage;

Does checkpatch like pages[i+j]? I'd have used spaces around the +.

> +                                flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> +                                flush_dcache_page(subpage);
> +                        }
> +                }
> +
>                  i += page_increm;
>                  start += page_increm * PAGE_SIZE;
>                  nr_pages -= page_increm;

So, we did the first try_grab_folio() while our page was PMD-mapped under
the PT lock and we had sufficient permissions (e.g., mapped writable, no
unsharing required). With FOLL_PIN, we incremented the pincount.

I was wondering if something could have happened ever since we unlocked
the PT table lock and possibly PTE-mapped the THP. ... but as it's already
pinned, it cannot get shared during fork() [will stay exclusive].

So we can just take additional pins on that folio.

LGTM, although I do like the GUP-fast way of recording+ref'ing it at a
central place (see gup_huge_pmd() with record_subpages() and friends), not
after the effects.
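(For context on the GUP-fast pattern mentioned above: gup_huge_pmd() first records a page pointer for every subpage in the covered range and then takes all the references in one central try_grab_folio() call. The sketch below approximates that recording helper from mm/gup.c of this kernel era; it is a from-memory illustration, not a verbatim copy of the source.)

    /* Approximation of the GUP-fast helper record_subpages() in mm/gup.c.
     * It only records the subpage pointers; the caller (e.g. gup_huge_pmd())
     * then grabs all nr references at once via try_grab_folio().
     */
    static int record_subpages(struct page *page, unsigned long addr,
                               unsigned long end, struct page **pages)
    {
            int nr;

            /* One subpage pointer per PAGE_SIZE step of [addr, end). */
            for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
                    pages[nr] = nth_page(page, nr);

            return nr;
    }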
On Tue, Jun 20, 2023 at 05:43:35PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > The acceleration of THP was done with ctx.page_mask, however it'll be
> > ignored if **pages is non-NULL.
> >
> > The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> > accelerate mm_populate() treatment of THP pages"). It didn't explain why
> > we can't optimize the **pages non-NULL case. It's possible that at that
> > time the major goal was for mm_populate() which should be enough back then.
>
> In the past we had these sub-page refcounts for THP. My best guess (and I
> didn't check if that was still the case in 2013) would be that it was
> simpler regarding refcount handling to do it one subpage at a time.
>
> But I might be just wrong.
>
> > Optimize thp for all cases, by properly looping over each subpage, doing
> > cache flushes, and boost refcounts / pincounts where needed in one go.
> >
> > This can be verified using gup_test below:
> >
> > # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
> >
> > Before: 13992.50 ( +-8.75%)
> > After: 378.50 (+-69.62%)
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 44 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 4a00d609033e..b50272012e49 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
> >                          goto out;
> >                  }
> >  next_page:
> > -                if (pages) {
> > -                        pages[i] = page;
> > -                        flush_anon_page(vma, page, start);
> > -                        flush_dcache_page(page);
> > -                        ctx.page_mask = 0;
> > -                }
> > -
> >                  page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> >                  if (page_increm > nr_pages)
> >                          page_increm = nr_pages;
> > +
> > +                if (pages) {
> > +                        struct page *subpage;
> > +                        unsigned int j;
> > +
> > +                        /*
> > +                         * This must be a large folio (and doesn't need to
> > +                         * be the whole folio; it can be part of it), do
> > +                         * the refcount work for all the subpages too.
> > +                         *
> > +                         * NOTE: here the page may not be the head page
> > +                         * e.g. when start addr is not thp-size aligned.
> > +                         * try_grab_folio() should have taken care of tail
> > +                         * pages.
> > +                         */
> > +                        if (page_increm > 1) {
> > +                                struct folio *folio;
> > +
> > +                                /*
> > +                                 * Since we already hold refcount on the
> > +                                 * large folio, this should never fail.
> > +                                 */
> > +                                folio = try_grab_folio(page, page_increm - 1,
> > +                                        foll_flags);
> > +                                if (WARN_ON_ONCE(!folio)) {
> > +                                        /*
> > +                                         * Release the 1st page ref if the
> > +                                         * folio is problematic, fail hard.
> > +                                         */
> > +                                        gup_put_folio(page_folio(page), 1,
> > +                                                foll_flags);
> > +                                        ret = -EFAULT;
> > +                                        goto out;
> > +                                }
> > +                        }
> > +
> > +                        for (j = 0; j < page_increm; j++) {
> > +                                subpage = nth_page(page, j);
> > +                                pages[i+j] = subpage;
>
> Does checkpatch like pages[i+j]? I'd have used spaces around the +.

Can do.

> > +                                flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > +                                flush_dcache_page(subpage);
> > +                        }
> > +                }
> > +
> >                  i += page_increm;
> >                  start += page_increm * PAGE_SIZE;
> >                  nr_pages -= page_increm;
>
> So, we did the first try_grab_folio() while our page was PMD-mapped under
> the PT lock and we had sufficient permissions (e.g., mapped writable, no
> unsharing required). With FOLL_PIN, we incremented the pincount.
>
> I was wondering if something could have happened ever since we unlocked
> the PT table lock and possibly PTE-mapped the THP. ... but as it's already
> pinned, it cannot get shared during fork() [will stay exclusive].
>
> So we can just take additional pins on that folio.
>
> LGTM, although I do like the GUP-fast way of recording+ref'ing it at a
> central place (see gup_huge_pmd() with record_subpages() and friends), not
> after the effects.

My read on this is follow_page_mask() is also used by follow_page(), which
does not need page*.

No strong opinion here. Maybe we leave this as a follow-up even if it can
be justified? This patch is probably still the smallest (and still clean)
change to speed this whole thing up over either thp or hugetlb.
On 20.06.23 18:23, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 05:43:35PM +0200, David Hildenbrand wrote:
>> On 20.06.23 01:10, Peter Xu wrote:
>>> The acceleration of THP was done with ctx.page_mask, however it'll be
>>> ignored if **pages is non-NULL.
>>>
>>> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
>>> accelerate mm_populate() treatment of THP pages"). It didn't explain why
>>> we can't optimize the **pages non-NULL case. It's possible that at that
>>> time the major goal was for mm_populate() which should be enough back then.
>>
>> In the past we had these sub-page refcounts for THP. My best guess (and I
>> didn't check if that was still the case in 2013) would be that it was
>> simpler regarding refcount handling to do it one subpage at a time.
>>
>> But I might be just wrong.
>>
>>> Optimize thp for all cases, by properly looping over each subpage, doing
>>> cache flushes, and boost refcounts / pincounts where needed in one go.
>>>
>>> This can be verified using gup_test below:
>>>
>>> # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>>>
>>> Before: 13992.50 ( +-8.75%)
>>> After: 378.50 (+-69.62%)
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>  mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
>>>  1 file changed, 44 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 4a00d609033e..b50272012e49 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
>>>                          goto out;
>>>                  }
>>>  next_page:
>>> -                if (pages) {
>>> -                        pages[i] = page;
>>> -                        flush_anon_page(vma, page, start);
>>> -                        flush_dcache_page(page);
>>> -                        ctx.page_mask = 0;
>>> -                }
>>> -
>>>                  page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>>>                  if (page_increm > nr_pages)
>>>                          page_increm = nr_pages;
>>> +
>>> +                if (pages) {
>>> +                        struct page *subpage;
>>> +                        unsigned int j;
>>> +
>>> +                        /*
>>> +                         * This must be a large folio (and doesn't need to
>>> +                         * be the whole folio; it can be part of it), do
>>> +                         * the refcount work for all the subpages too.
>>> +                         *
>>> +                         * NOTE: here the page may not be the head page
>>> +                         * e.g. when start addr is not thp-size aligned.
>>> +                         * try_grab_folio() should have taken care of tail
>>> +                         * pages.
>>> +                         */
>>> +                        if (page_increm > 1) {
>>> +                                struct folio *folio;
>>> +
>>> +                                /*
>>> +                                 * Since we already hold refcount on the
>>> +                                 * large folio, this should never fail.
>>> +                                 */
>>> +                                folio = try_grab_folio(page, page_increm - 1,
>>> +                                        foll_flags);
>>> +                                if (WARN_ON_ONCE(!folio)) {
>>> +                                        /*
>>> +                                         * Release the 1st page ref if the
>>> +                                         * folio is problematic, fail hard.
>>> +                                         */
>>> +                                        gup_put_folio(page_folio(page), 1,
>>> +                                                foll_flags);
>>> +                                        ret = -EFAULT;
>>> +                                        goto out;
>>> +                                }
>>> +                        }
>>> +
>>> +                        for (j = 0; j < page_increm; j++) {
>>> +                                subpage = nth_page(page, j);
>>> +                                pages[i+j] = subpage;
>>
>> Does checkpatch like pages[i+j]? I'd have used spaces around the +.
>
> Can do.
>
>>> +                                flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
>>> +                                flush_dcache_page(subpage);
>>> +                        }
>>> +                }
>>> +
>>>                  i += page_increm;
>>>                  start += page_increm * PAGE_SIZE;
>>>                  nr_pages -= page_increm;
>>
>> So, we did the first try_grab_folio() while our page was PMD-mapped under
>> the PT lock and we had sufficient permissions (e.g., mapped writable, no
>> unsharing required). With FOLL_PIN, we incremented the pincount.
>>
>> I was wondering if something could have happened ever since we unlocked
>> the PT table lock and possibly PTE-mapped the THP. ... but as it's already
>> pinned, it cannot get shared during fork() [will stay exclusive].
>>
>> So we can just take additional pins on that folio.
>>
>> LGTM, although I do like the GUP-fast way of recording+ref'ing it at a
>> central place (see gup_huge_pmd() with record_subpages() and friends), not
>> after the effects.
>
> My read on this is follow_page_mask() is also used by follow_page(), which
> does not need page*.

Right ... maybe one day we can do that "better".

>
> No strong opinion here. Maybe we leave this as a follow-up even if it can
> be justified? This patch is probably still the smallest (and still clean)
> change to speed this whole thing up over either thp or hugetlb.

Sure, we can leave that as a follow-up.

Thinking about why we have the flush_anon_page/flush_dcache_page stuff
here and not in GUP-fast ... I suspect that all GUP-fast archs don't need
that stuff.

I was wondering if there are some possible races with the flush_anon_page()
/ flush_dcache_page() on a page that might have been unmapped in the
meantime (as we dropped the PT lock ...).

Some flush_dcache_page() implementations do some IMHO confusing
page_mapcount() things (like in arch/arc/mm/cache.c). But maybe the unmap
code handles that as well ... and most likely these archs don't support THP.

Anyhow, just a note that the flush_anon_page/flush_dcache_page left me
confused.
On Tue, Jun 20, 2023 at 08:02:41PM +0200, David Hildenbrand wrote:
> Thinking about why we have the flush_anon_page/flush_dcache_page stuff
> here and not in GUP-fast ... I suspect that all GUP-fast archs don't need
> that stuff.

Yeah, that's a bit confusing, and I sincerely don't know the answer.

Though here I had the other side of the feeling - I feel like gup-fast
should also do it.. but maybe it just got missed. AFAIU the idea was that
the data can be mis-aligned between user / kernel, and if it's needed on
slow gup I don't see why it's not needed in fast..

There are still a few archs that implement flush_dcache_page() but
meanwhile have HAVE_FAST_GUP selected, like arm/arm64/powerpc. It's just
getting out of scope of what this series wanted to achieve.

> I was wondering if there are some possible races with the flush_anon_page()
> / flush_dcache_page() on a page that might have been unmapped in the
> meantime (as we dropped the PT lock ...).
>
> Some flush_dcache_page() implementations do some IMHO confusing
> page_mapcount() things (like in arch/arc/mm/cache.c). But maybe the unmap
> code handles that as well ... and most likely these archs don't support THP.

Maybe true. It seems that page_mapcount() was mostly used to identify
whether a page is mapped in the userspace address space; if so, I'd worry
less, because the only race possible here, IIUC, is when the user unmaps
the page concurrently (and since we got it from gup it must have been
mapped once). Then I would assume the caller should be prepared for that,
and the flush_dcache_page() won't matter too much in this case, if the
userspace dropped all the data anyway - the whole page* can already be
invalid for that VA after a completed unmap.

> Anyhow, just a note that the flush_anon_page/flush_dcache_page left me
> confused.

I share the same confusion. Hopefully, what this series did here was not
changing that - at least not making it worse.
On Mon, Jun 19, 2023 at 07:10:41PM -0400, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
>
> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages"). It didn't explain why
> we can't optimize the **pages non-NULL case. It's possible that at that
> time the major goal was for mm_populate() which should be enough back then.
>
> Optimize thp for all cases, by properly looping over each subpage, doing
> cache flushes, and boost refcounts / pincounts where needed in one go.
>
> This can be verified using gup_test below:
>
> # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>
> Before: 13992.50 ( +-8.75%)
> After: 378.50 (+-69.62%)
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 4a00d609033e..b50272012e49 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
>                          goto out;
>                  }
>  next_page:
> -                if (pages) {
> -                        pages[i] = page;
> -                        flush_anon_page(vma, page, start);
> -                        flush_dcache_page(page);
> -                        ctx.page_mask = 0;
> -                }
> -
>                  page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>                  if (page_increm > nr_pages)
>                          page_increm = nr_pages;
> +
> +                if (pages) {
> +                        struct page *subpage;
> +                        unsigned int j;
> +
> +                        /*
> +                         * This must be a large folio (and doesn't need to
> +                         * be the whole folio; it can be part of it), do
> +                         * the refcount work for all the subpages too.
> +                         *
> +                         * NOTE: here the page may not be the head page
> +                         * e.g. when start addr is not thp-size aligned.
> +                         * try_grab_folio() should have taken care of tail
> +                         * pages.
> +                         */
> +                        if (page_increm > 1) {
> +                                struct folio *folio;
> +
> +                                /*
> +                                 * Since we already hold refcount on the
> +                                 * large folio, this should never fail.
> +                                 */
> +                                folio = try_grab_folio(page, page_increm - 1,
> +                                        foll_flags);
> +                                if (WARN_ON_ONCE(!folio)) {
> +                                        /*
> +                                         * Release the 1st page ref if the
> +                                         * folio is problematic, fail hard.
> +                                         */
> +                                        gup_put_folio(page_folio(page), 1,
> +                                                foll_flags);
> +                                        ret = -EFAULT;
> +                                        goto out;
> +                                }

Thanks, this looks good to me. I agree it'd be quite surprising for us not
to retrieve the folio here, and probably something has gone wrong if so,
so it's not actually unreasonable to warn, as long as we error out.

> +                        }
> +
> +                        for (j = 0; j < page_increm; j++) {
> +                                subpage = nth_page(page, j);
> +                                pages[i+j] = subpage;
> +                                flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> +                                flush_dcache_page(subpage);
> +                        }
> +                }
> +
>                  i += page_increm;
>                  start += page_increm * PAGE_SIZE;
>                  nr_pages -= page_increm;
> --
> 2.40.1
>

Looks good to me overall,

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
diff --git a/mm/gup.c b/mm/gup.c
index 4a00d609033e..b50272012e49 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
                         goto out;
                 }
 next_page:
-                if (pages) {
-                        pages[i] = page;
-                        flush_anon_page(vma, page, start);
-                        flush_dcache_page(page);
-                        ctx.page_mask = 0;
-                }
-
                 page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
                 if (page_increm > nr_pages)
                         page_increm = nr_pages;
+
+                if (pages) {
+                        struct page *subpage;
+                        unsigned int j;
+
+                        /*
+                         * This must be a large folio (and doesn't need to
+                         * be the whole folio; it can be part of it), do
+                         * the refcount work for all the subpages too.
+                         *
+                         * NOTE: here the page may not be the head page
+                         * e.g. when start addr is not thp-size aligned.
+                         * try_grab_folio() should have taken care of tail
+                         * pages.
+                         */
+                        if (page_increm > 1) {
+                                struct folio *folio;
+
+                                /*
+                                 * Since we already hold refcount on the
+                                 * large folio, this should never fail.
+                                 */
+                                folio = try_grab_folio(page, page_increm - 1,
+                                        foll_flags);
+                                if (WARN_ON_ONCE(!folio)) {
+                                        /*
+                                         * Release the 1st page ref if the
+                                         * folio is problematic, fail hard.
+                                         */
+                                        gup_put_folio(page_folio(page), 1,
+                                                foll_flags);
+                                        ret = -EFAULT;
+                                        goto out;
+                                }
+                        }
+
+                        for (j = 0; j < page_increm; j++) {
+                                subpage = nth_page(page, j);
+                                pages[i+j] = subpage;
+                                flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
+                                flush_dcache_page(subpage);
+                        }
+                }
+
                 i += page_increm;
                 start += page_increm * PAGE_SIZE;
                 nr_pages -= page_increm;
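A side note on the page_increm arithmetic that this patch keeps: on the PMD
THP path, ctx.page_mask is set to HPAGE_PMD_NR - 1, so the expression yields
the number of subpages from the current address up to the end of the huge
page. The standalone snippet below is an illustration written for this page
(not kernel code); it assumes a 2MB THP on x86-64, i.e. 512 subpages:

    #include <stdio.h>

    int main(void)
    {
            /* Assumed values: 2MB THP on x86-64 => 512 subpages. */
            unsigned long page_mask = 511;  /* ctx.page_mask == HPAGE_PMD_NR - 1 */
            unsigned long page_index = 2;   /* low bits of (start >> PAGE_SHIFT):
                                             * start points at the 3rd subpage */

            /* Same formula as in __get_user_pages() above. */
            unsigned long page_increm = 1 + (~page_index & page_mask);

            /* Prints 510: the subpages left up to the end of the THP, i.e. how
             * far the new loop records, flushes, and advances in one go. */
            printf("page_increm = %lu\n", page_increm);
            return 0;
    }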