Message ID | 20230405155120.3608140-1-peterx@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp409946vqo; Wed, 5 Apr 2023 08:54:29 -0700 (PDT) X-Google-Smtp-Source: AKy350bi/K0VHBMH0h1QI02xbAmBT8J+TvlxB9MDAyc2gFohMdUImMytdC90nr112jNBJ48o3IAa X-Received: by 2002:a17:90b:1e45:b0:23f:7ab1:2899 with SMTP id pi5-20020a17090b1e4500b0023f7ab12899mr7251801pjb.34.1680710069396; Wed, 05 Apr 2023 08:54:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680710069; cv=none; d=google.com; s=arc-20160816; b=I++5v5MT0uJZ7oEhWC6p4LZxUDqL5w3AkW36usfwOxQchJdje+/gc7RXrrcz1Lr0Cy uyt1NS4jscVRHF1O5IKattmWjDAqCVew39tdoRuJzQYdmTJTmEzQKnQXmysC0ji5Pvbg 4D+iEgXSRU7bXZLoBavE3OEVGYSBgAv29W8Ot7q7RUmFAhv812KKtvocYCjgZPzP5m4U p57ORw0w5idj9IUplQZmn095jdnRRJZ+rxZtSwGBckh8iw878UI3dG3SWaFvudLTIqWV SrrwTmxO/4ByIXnLp4PPl1pzD5/SUNn1F6zFXmuCDaBj6+eC56dlscMm9crwvAsQ7QIs tGdA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=dX5bBmfEsfxuoyfhYLC4nRt+vkdDVDFdF9t4fgBK6nc=; b=v/SVrmSTaOvY73SE1DziDO2VAFu4OPQCjAWl3mV6ydaIRPhSyP+rzQZ7KuU4KKwRud tLGAFspVYAkjvVkGgehesh+VKRPOnVCOVfZIsqSaU5uoi3iqBuJ5S+HTAPJnO+j6rsRm KHN0mOp2qKdEhvszBV1Y58sF4Jmi38JjIRn7E62vETFFvnBxtR/Hv8kUsZmBCsze+7iC 1MH23/zrpaNIOMVnuCazmwSk0Kg9uNtjEt1bNzScH5ehUHwpvrvI9zTjn9S+5soSinn6 ko8a5fFEjPe3iKpkji00pW8DeSMPYTIFTJRVJtLFvtDBNUtLMdhTZpsgXt0SquhVD+PC Avwg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=PRPjKAVx; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y9-20020a17090ad70900b0023b01e1b914si1550153pju.46.2023.04.05.08.54.15; Wed, 05 Apr 2023 08:54:29 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=PRPjKAVx; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239153AbjDEPw1 (ORCPT <rfc822;lkml4gm@gmail.com> + 99 others); Wed, 5 Apr 2023 11:52:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38964 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239133AbjDEPwP (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 5 Apr 2023 11:52:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4999965AB for <linux-kernel@vger.kernel.org>; Wed, 5 Apr 2023 08:51:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1680709884; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=dX5bBmfEsfxuoyfhYLC4nRt+vkdDVDFdF9t4fgBK6nc=; b=PRPjKAVx90wyO8giVwWrjThIbjjPz/PTx1A+ATSVS9SetOEfoXgBHJEoTb/XuoYZnTCuEj +WpFuKwTo178qoyGGidG3Nlc9K4MwhaMhzD3KCX2IRk8nF97olydd9OEVhPBReVw/cXWEd oMxLWwy2vlPVt5OkuPlWfvxFsLacEh8= Received: from mail-qt1-f197.google.com (mail-qt1-f197.google.com [209.85.160.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-564-yIScEAaOPuieDHHVXPlBqg-1; Wed, 05 Apr 2023 11:51:23 -0400 X-MC-Unique: yIScEAaOPuieDHHVXPlBqg-1 Received: by mail-qt1-f197.google.com with SMTP id d75a77b69052e-3e2daffa0d4so12288731cf.0 for <linux-kernel@vger.kernel.org>; Wed, 05 Apr 2023 08:51:22 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680709882; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=dX5bBmfEsfxuoyfhYLC4nRt+vkdDVDFdF9t4fgBK6nc=; b=7L6dmqYXgknIQ3k4GhKECULUj6SRLvcb7rwvMO+Gwba3F9zAQRvTipNnEyHxYGUZ+H 7RO/G9G4Y5yP+niXF/8bNZlYdk0oNrbrGgtIgScOMQmlDePiEBUT59a52Psr5KUYWTwt cqOZ1jtEaaLJ3UG1stHMpM8sobaHDw3Gc6V8J0n80xOb7GDh+dE101quIcrFu4rNpRBq DPMGHpRi5TvthTxIoNBFuL2DnSJFfBrgkBifXkKzin1lgemDBMvsi3G7sS5OAfslHhSA v3iGHbMYHn0F0FXSgqJOhh0VGGW9Ii2ZWgzerTm8Q1dIOPBgZN4Wz2dz8ur/gqM/uDUS H70g== X-Gm-Message-State: AAQBX9c4LDI+xg80Z0xjwglp4r8LCCFvk/NUL5m/14H5XU08swOKlU/E sFtqsqUsCZ/W1ZZs73vFb3CFdKsqMAd/lNorhUK6yB/27DGUaT2GHBBRSsdg4O9PK/T2B6zfDNw 38zdoOmO1k2l6IPMIT78/Stvv X-Received: by 2002:a05:622a:1a24:b0:3e6:707e:d3c2 with SMTP id f36-20020a05622a1a2400b003e6707ed3c2mr6579972qtb.0.1680709882538; Wed, 05 Apr 2023 08:51:22 -0700 (PDT) X-Received: by 2002:a05:622a:1a24:b0:3e6:707e:d3c2 with SMTP id f36-20020a05622a1a2400b003e6707ed3c2mr6579938qtb.0.1680709882192; Wed, 05 Apr 2023 08:51:22 -0700 (PDT) Received: from x1n.redhat.com (bras-base-aurron9127w-grc-40-70-52-229-124.dsl.bell.ca. [70.52.229.124]) by smtp.gmail.com with ESMTPSA id 21-20020a370415000000b0074683c45f6csm4538557qke.1.2023.04.05.08.51.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 05 Apr 2023 08:51:21 -0700 (PDT) From: Peter Xu <peterx@redhat.com> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Axel Rasmussen <axelrasmussen@google.com>, Nadav Amit <nadav.amit@gmail.com>, David Hildenbrand <david@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, peterx@redhat.com, Andrea Arcangeli <aarcange@redhat.com>, Mike Rapoport <rppt@linux.vnet.ibm.com>, Yang Shi <shy828301@gmail.com>, linux-stable <stable@vger.kernel.org> Subject: [PATCH] mm/khugepaged: Check again on anon uffd-wp during isolation Date: Wed, 5 Apr 2023 11:51:20 -0400 Message-Id: <20230405155120.3608140-1-peterx@redhat.com> X-Mailer: git-send-email 2.39.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.2 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1762352242056712590?= X-GMAIL-MSGID: =?utf-8?q?1762352242056712590?= |
Series |
mm/khugepaged: Check again on anon uffd-wp during isolation
|
|
Commit Message
Peter Xu
April 5, 2023, 3:51 p.m. UTC
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round
done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(),
during which all the locks will be released temporarily. It means the
pgtable can change during this phase before 2nd round starts.
It's logically possible some ptes got wr-protected during this phase, and
we can errornously collapse a thp without noticing some ptes are
wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did
that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page
swapins, we don't need to worry on any !present ptes - if it existed
khugepaged will already bail out. So we only need to check present ptes
with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it was
one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns
out it's not the cause of that but an userspace bug. However this seems to
still be a real bug even with a very small race window, still worth to have
it fixed and copy stable.
Cc: linux-stable <stable@vger.kernel.org>
Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/khugepaged.c | 4 ++++
1 file changed, 4 insertions(+)
Comments
On 05.04.23 17:51, Peter Xu wrote: > Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round > done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), > during which all the locks will be released temporarily. It means the > pgtable can change during this phase before 2nd round starts. > > It's logically possible some ptes got wr-protected during this phase, and > we can errornously collapse a thp without noticing some ptes are > wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did > that for the 1st phase, not the 2nd phase. > > Since __collapse_huge_page_isolate() happens after a round of small page > swapins, we don't need to worry on any !present ptes - if it existed > khugepaged will already bail out. So we only need to check present ptes > with uffd-wp bit set there. > > This is something I found only but never had a reproducer, I thought it was > one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns > out it's not the cause of that but an userspace bug. However this seems to > still be a real bug even with a very small race window, still worth to have > it fixed and copy stable. > > Cc: linux-stable <stable@vger.kernel.org> > Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/khugepaged.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index a19aa140fd52..42ac93b4bd87 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -575,6 +575,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, > result = SCAN_PTE_NON_PRESENT; > goto out; > } > + if (pte_uffd_wp(pteval)) { > + result = SCAN_PTE_UFFD_WP; > + goto out; > + } > page = vm_normal_page(vma, address, pteval); > if (unlikely(!page) || unlikely(is_zone_device_page(page))) { > result = SCAN_PAGE_NULL; Yes, I agree that would be a small race where it could happen. Reviewed-by: David Hildenbrand <david@redhat.com>
On Wed, Apr 5, 2023 at 8:51 AM Peter Xu <peterx@redhat.com> wrote: > > Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round > done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), > during which all the locks will be released temporarily. It means the > pgtable can change during this phase before 2nd round starts. > > It's logically possible some ptes got wr-protected during this phase, and > we can errornously collapse a thp without noticing some ptes are > wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did > that for the 1st phase, not the 2nd phase. > > Since __collapse_huge_page_isolate() happens after a round of small page > swapins, we don't need to worry on any !present ptes - if it existed > khugepaged will already bail out. So we only need to check present ptes > with uffd-wp bit set there. > > This is something I found only but never had a reproducer, I thought it was > one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns > out it's not the cause of that but an userspace bug. However this seems to > still be a real bug even with a very small race window, still worth to have > it fixed and copy stable. Yeah, I agree. But I got confused by userfaultfd_wp(vma) and pte_uffd_wp(pte). If a vma is armed with uffd wp, shall we skip the whole vma? If so, whether it is better to just check vma? We do revalidate vma once reacquiring mmap_lock, so we should be able to bail out earlier. > > Cc: linux-stable <stable@vger.kernel.org> > Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/khugepaged.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index a19aa140fd52..42ac93b4bd87 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -575,6 +575,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, > result = SCAN_PTE_NON_PRESENT; > goto out; > } > + if (pte_uffd_wp(pteval)) { > + result = SCAN_PTE_UFFD_WP; > + goto out; > + } > page = vm_normal_page(vma, address, pteval); > if (unlikely(!page) || unlikely(is_zone_device_page(page))) { > result = SCAN_PAGE_NULL; > -- > 2.39.1 >
On Wed, Apr 05, 2023 at 09:59:15AM -0700, Yang Shi wrote: > On Wed, Apr 5, 2023 at 8:51 AM Peter Xu <peterx@redhat.com> wrote: > > > > Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round > > done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), > > during which all the locks will be released temporarily. It means the > > pgtable can change during this phase before 2nd round starts. > > > > It's logically possible some ptes got wr-protected during this phase, and > > we can errornously collapse a thp without noticing some ptes are > > wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did > > that for the 1st phase, not the 2nd phase. > > > > Since __collapse_huge_page_isolate() happens after a round of small page > > swapins, we don't need to worry on any !present ptes - if it existed > > khugepaged will already bail out. So we only need to check present ptes > > with uffd-wp bit set there. > > > > This is something I found only but never had a reproducer, I thought it was > > one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns > > out it's not the cause of that but an userspace bug. However this seems to > > still be a real bug even with a very small race window, still worth to have > > it fixed and copy stable. > > Yeah, I agree. But I got confused by userfaultfd_wp(vma) and > pte_uffd_wp(pte). If a vma is armed with uffd wp, shall we skip the > whole vma? If so, whether it is better to just check vma? We do > revalidate vma once reacquiring mmap_lock, so we should be able to > bail out earlier. Checking against VMA is safe too, the difference is current code still allows thp to be collapsed as long as none of the page is explicitly protected over the thp range, even if the range is registered with userfault-wp. That's also what e1e267c7928f does. Here we have slightly different handling between anon / file thps (file thps checks against the vma flags), IMHO mostly because we don't scan pgtables when making decisions to collapse a shmem thp, so we made it simple by checking against vma flags. We can make it the same as anon but it might be an overkill just to scan the entries for uffd-wp purpose. For anon we always scans the pgtable anyway so it's easier to make a more accurate decision. Thanks,
On Wed, Apr 5, 2023 at 11:10 AM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Apr 05, 2023 at 09:59:15AM -0700, Yang Shi wrote: > > On Wed, Apr 5, 2023 at 8:51 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round > > > done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), > > > during which all the locks will be released temporarily. It means the > > > pgtable can change during this phase before 2nd round starts. > > > > > > It's logically possible some ptes got wr-protected during this phase, and > > > we can errornously collapse a thp without noticing some ptes are > > > wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did > > > that for the 1st phase, not the 2nd phase. > > > > > > Since __collapse_huge_page_isolate() happens after a round of small page > > > swapins, we don't need to worry on any !present ptes - if it existed > > > khugepaged will already bail out. So we only need to check present ptes > > > with uffd-wp bit set there. > > > > > > This is something I found only but never had a reproducer, I thought it was > > > one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns > > > out it's not the cause of that but an userspace bug. However this seems to > > > still be a real bug even with a very small race window, still worth to have > > > it fixed and copy stable. > > > > Yeah, I agree. But I got confused by userfaultfd_wp(vma) and > > pte_uffd_wp(pte). If a vma is armed with uffd wp, shall we skip the > > whole vma? If so, whether it is better to just check vma? We do > > revalidate vma once reacquiring mmap_lock, so we should be able to > > bail out earlier. > > Checking against VMA is safe too, the difference is current code still > allows thp to be collapsed as long as none of the page is explicitly > protected over the thp range, even if the range is registered with > userfault-wp. That's also what e1e267c7928f does. > > Here we have slightly different handling between anon / file thps (file > thps checks against the vma flags), IMHO mostly because we don't scan > pgtables when making decisions to collapse a shmem thp, so we made it > simple by checking against vma flags. We can make it the same as anon but > it might be an overkill just to scan the entries for uffd-wp purpose. > > For anon we always scans the pgtable anyway so it's easier to make a more > accurate decision. Aha, I see. It does resolve my confusion. Thanks for elaborating. The patch looks good to me. Reviewed-by: Yang Shi <shy828301@gmail.com> > > Thanks, > > -- > Peter Xu >
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a19aa140fd52..42ac93b4bd87 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -575,6 +575,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, result = SCAN_PTE_NON_PRESENT; goto out; } + if (pte_uffd_wp(pteval)) { + result = SCAN_PTE_UFFD_WP; + goto out; + } page = vm_normal_page(vma, address, pteval); if (unlikely(!page) || unlikely(is_zone_device_page(page))) { result = SCAN_PAGE_NULL;