Message ID | 20240117171058.2192286-1-leitao@debian.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-29260-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:30f:b0:101:a8e8:374 with SMTP id ia15csp52710dyb; Wed, 17 Jan 2024 09:11:55 -0800 (PST) X-Google-Smtp-Source: AGHT+IFnGZ4FqhhEwtgMvO3crahj8IKntEKVgIIZiSoVEShREhDubHAt63mrIUZT9HjakPD+ptEx X-Received: by 2002:a05:6512:3084:b0:50e:6cce:f83 with SMTP id z4-20020a056512308400b0050e6cce0f83mr5544410lfd.41.1705511514932; Wed, 17 Jan 2024 09:11:54 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1705511514; cv=pass; d=google.com; s=arc-20160816; b=cvB02D8De9MDtMQV/PbKHHRzNGzzYNTj9YUfXpX/pZ38ZMgVM7EgjeMdDmAPtONBLC BzeXjO449xa1whETUKi0pA1mzkpvGXfVVGp1ej79hpIPtbZO4Tc1cYkFcbSReFzYeBlx iWubPC5qrDG5u9BkE8hw43ZatuCWKkP0W8qh8z5aP8a3kNJL/ho6RBkN9o27khjjqiwa 1COfMWnX3xkoOecl9z5M70M+BKYcarmzA8awLAF7VVH+Mb9J+qj4ilw0cqii7bK9gktH GASn7cwPy+x3D8puvoWWbB2CwZSYaCqG885utqUDvxz9Mvl9s/PSDxe45fFAtuRcv3pi +/aQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:message-id:date:subject:cc:to :from; bh=zJ2+XhKbom46b/fZIunrod77ylLhDHU0KUj29SPzLh4=; fh=Q54OXt/qRxrTtd0mSEE2XGJd4z99RHMb/29q6WtXyZg=; b=mN+Po3bt6QviE4mcJm6OMrG6uAmJau9LYq1boFPlWGMw3t+FcI+mPNGWlmGgl+K16I 1eoryFzM6Ts9jNmib6FO++9ZHubo3aSw7HqFDR1onE9NmoON/icIqmmLn2kpw9oXwPcX ywC6bNKBZtuSJ6GbWL1Tlnm4Sb2IVSlDyYSQ6X3LyJjO+5tVhdsYs6RqQnOIbz0l4g9F HPV6EWAQL6SOj8sM+sVT0kkaboQaAAodOoICES9+l2EfZxO4gbr4G7RcU2vA1sc7bsGv 7AfcixleGFlQCjWKbe5UNU4Ok0JnOeT4ePKNxfnObpOg4DbGhPHsA+QgvKr4uXdcgIwy MrOw== ARC-Authentication-Results: i=2; mx.google.com; arc=pass (i=1 spf=pass spfdomain=gmail.com); spf=pass (google.com: domain of linux-kernel+bounces-29260-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-29260-ouuuleilei=gmail.com@vger.kernel.org" Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id y14-20020a1709063a8e00b00a26add7df8bsi5827395ejd.109.2024.01.17.09.11.54 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Jan 2024 09:11:54 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-29260-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; arc=pass (i=1 spf=pass spfdomain=gmail.com); spf=pass (google.com: domain of linux-kernel+bounces-29260-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-29260-ouuuleilei=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 864AB1F23496 for <ouuuleilei@gmail.com>; Wed, 17 Jan 2024 17:11:54 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 2DEA222F19; Wed, 17 Jan 2024 17:11:19 +0000 (UTC) Received: from mail-lj1-f182.google.com (mail-lj1-f182.google.com [209.85.208.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 16C45219F2 for <linux-kernel@vger.kernel.org>; Wed, 17 Jan 2024 17:11:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1705511477; cv=none; b=Pt5V2rBrG//JYjkV96bShddfe4xXD14/4KSDZcwVlbJhwFrnOGElv7wNU/4nTG0W5Z4kNPhJgpQAwZpz107l4u6Tb36ZZ8/0SCafuASZr9ZnCeltw1noIVDdFE4v5gyC3Aog0nl6KmrVFHqFbvz5syNpL8yKP8rtYv+GZl+Ecow= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1705511477; c=relaxed/simple; bh=yRA1Nvwd9G6y3dgBa4YVeXvCTThqtHdsWY4phzrl/lU=; h=Received:X-Google-DKIM-Signature:X-Gm-Message-State: X-Google-Smtp-Source:X-Received:Received:From:To:Cc:Subject:Date: Message-Id:X-Mailer:MIME-Version:Content-Transfer-Encoding; b=ETRoc3nAzSX6BsotQ4rv7t2KIggpG/5P6nTC1Fdx+0+KtltJbCbrgjjWfqebaxGDWJ39MlpfEbvi/EQl+5xx4tbqziEFkuAqhATFtiXmCln7zwuneqxiTlDYkCsrFyVYWrPkL0YgThC/GJ2MuW6znBK5DQi8WA+juHGz+jwC2NU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=debian.org; spf=pass smtp.mailfrom=gmail.com; arc=none smtp.client-ip=209.85.208.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=debian.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lj1-f182.google.com with SMTP id 38308e7fff4ca-2cdb50d8982so45619051fa.2 for <linux-kernel@vger.kernel.org>; Wed, 17 Jan 2024 09:11:11 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1705511470; x=1706116270; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=zJ2+XhKbom46b/fZIunrod77ylLhDHU0KUj29SPzLh4=; b=vwdWkgtHd4seFqFnUmJlABO4C3L8YNkXaReU8LVSlF31/uivSN/5sphvM4gSSjhG1e tq4S4g6cUcha00b7nAJEv7ME44guUvYva3vg63mKcbUAVrS1ZWXSNk64vSzQC3Bfeto8 YG56LzyaxOjpaYaXuCKQ5V0H2crcq31gbIm90zkALlMusTvBs/mWtV+EMFqKu0pZ0J40 wG+tXGbp4xi7awBiUx+p74rk+J7LET3iwnBPXh9n9tdK8u1e/A364I7bwSQIZse4f5v5 PVU+00DHqYV9Oz+n85ibrMKPoP+R7JXL0wCIlqqvqygi8XqI2gBH9nKZ+wq1AoRVcOtx G4Gw== X-Gm-Message-State: AOJu0YxduqKDx6P+1npHO5/GrU6Z/HPpwRvJ07kE15XKqa+3uYf2sPzV V/mp4lEWfV5tjleDB81a/0Q/yTuNzTc= X-Received: by 2002:a2e:7d12:0:b0:2cd:98e3:42b3 with SMTP id y18-20020a2e7d12000000b002cd98e342b3mr3866977ljc.24.1705511469837; Wed, 17 Jan 2024 09:11:09 -0800 (PST) Received: from localhost (fwdproxy-cln-025.fbsv.net. [2a03:2880:31ff:19::face:b00c]) by smtp.gmail.com with ESMTPSA id i13-20020a05640200cd00b00559c71e70e3sm1547951edu.79.2024.01.17.09.11.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Jan 2024 09:11:09 -0800 (PST) From: Breno Leitao <leitao@debian.org> To: Muchun Song <muchun.song@linux.dev>, Andrew Morton <akpm@linux-foundation.org> Cc: lstoakes@gmail.com, willy@infradead.org, mike.kravetz@oracle.com, Rik van Riel <riel@surriel.com>, linux-mm@kvack.org (open list:HUGETLB SUBSYSTEM), linux-kernel@vger.kernel.org (open list) Subject: [PATCH 1/2] mm/hugetlb: Restore the reservation if needed Date: Wed, 17 Jan 2024 09:10:57 -0800 Message-Id: <20240117171058.2192286-1-leitao@debian.org> X-Mailer: git-send-email 2.34.1 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1788358442355505237 X-GMAIL-MSGID: 1788358442355505237 |
Series |
[1/2] mm/hugetlb: Restore the reservation if needed
|
|
Commit Message
Breno Leitao
Jan. 17, 2024, 5:10 p.m. UTC
Currently there is a bug that a huge page could be stolen, and when the
original owner tries to fault in it, it causes a page fault.
You can achieve that by:
1) Creating a single page
echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
* This will mark the page as reserved
3) touch the page, which causes a page fault and allocates the page
* This will move the page out of the free list.
* It will also unreserved the page, since there is no more free
page
4) madvise(MADV_DONTNEED) the page
* This will free the page, but not mark it as reserved.
5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2).
* it should fail, but, since there is no more available page.
* But, since the page above is not reserved, this mmap() succeed.
6) Faulting at ptr1 will cause a SIGBUS
* it will try to allocate a huge page, but there is none
available
A full reproducer is in selftest. See
https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/
Fix this by restoring the reserved page if necessary. If the page being
unmapped has HPAGE_RESV_OWNER set, and needs a reservation, set the
restore_reserve flag, which will move the page from free to reserved.
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/hugetlb.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Comments
On 17/01/2024 17:10, Breno Leitao wrote: > Currently there is a bug that a huge page could be stolen, and when the > original owner tries to fault in it, it causes a page fault. > > You can achieve that by: > 1) Creating a single page > echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > 2) mmap() the page above with MAP_HUGETLB into (void *ptr1). > * This will mark the page as reserved > 3) touch the page, which causes a page fault and allocates the page > * This will move the page out of the free list. > * It will also unreserved the page, since there is no more free > page > 4) madvise(MADV_DONTNEED) the page > * This will free the page, but not mark it as reserved. > 5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2). > * it should fail, but, since there is no more available page. > * But, since the page above is not reserved, this mmap() succeed. > 6) Faulting at ptr1 will cause a SIGBUS > * it will try to allocate a huge page, but there is none > available > > A full reproducer is in selftest. See > https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/ > > Fix this by restoring the reserved page if necessary. If the page being > unmapped has HPAGE_RESV_OWNER set, and needs a reservation, set the > restore_reserve flag, which will move the page from free to reserved. > > Suggested-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Breno Leitao <leitao@debian.org> > --- > mm/hugetlb.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index ed1581b670d4..fa2c17767e44 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -5677,6 +5677,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, > hugetlb_count_sub(pages_per_huge_page(h), mm); > hugetlb_remove_rmap(page_folio(page)); > > + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && > + vma_needs_reservation(h, vma, start)) { > + /* > + * Restore the reservation if needed, otherwise the > + * backing page could be stolen by someone. > + */ > + folio_set_hugetlb_restore_reserve(page_folio(page)); > + vma_add_reservation(h, vma, address); > + } > + > spin_unlock(ptl); > tlb_remove_page_size(tlb, page, huge_page_size(h)); > /* Hi Breno, I'm seeing a kernel bug fire when running the "map_hugetlb" mm selftest against latest mm-unstable. Bisect tells me this patch is culprit. I'm running on arm64 with defconfig plus the following: # Woraround mm-unstable build errors. /scripts/config --file ${param:builddir}/.config --enable ANON_VMA_NAME # Squashfs for snaps, xfs for large file folios. /scripts/config --enable CONFIG_SQUASHFS_LZ4 /scripts/config --enable CONFIG_SQUASHFS_LZO /scripts/config --enable CONFIG_SQUASHFS_XZ /scripts/config --enable CONFIG_SQUASHFS_ZSTD /scripts/config --enable CONFIG_XFS_FS # Useful trace features (on for Ubuntu configs). /scripts/config --enable CONFIG_FTRACE /scripts/config --enable CONFIG_FUNCTION_TRACER /scripts/config --enable CONFIG_KPROBES /scripts/config --enable CONFIG_HIST_TRIGGERS /scripts/config --enable CONFIG_FTRACE_SYSCALLS # For general mm debug. /scripts/config --enable CONFIG_DEBUG_VM /scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE /scripts/config --enable CONFIG_DEBUG_VM_RB /scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS /scripts/config --enable CONFIG_DEBUG_VM_PGTABLE /scripts/config --enable CONFIG_PAGE_TABLE_CHECK # For mm selftests. /scripts/config --enable CONFIG_USERFAULTFD /scripts/config --enable CONFIG_TEST_VMALLOC /scripts/config --enable CONFIG_GUP_TEST Steps to reproduce: $ sudo ./map_hugetlb git bisect log: git bisect start # good: [6613476e225e090cc9aad49be7fa504e290dd33d] Linux 6.8-rc1 git bisect good 6613476e225e090cc9aad49be7fa504e290dd33d # bad: [ba34798948c09778ff707415fd52daa591bd361a] mm: add swappiness= arg to memory.reclaim git bisect bad ba34798948c09778ff707415fd52daa591bd361a # bad: [69d0e75874638fe852ed7004666e068cf9e2148d] mm: list_lru: disable memcg_aware when cgroup.memory is set to "nokmem" git bisect bad 69d0e75874638fe852ed7004666e068cf9e2148d # good: [38c61fca93058635b533ad927c1d6529757424d3] mm: huge_memory: don't force huge page alignment on 32 bit git bisect good 38c61fca93058635b533ad927c1d6529757424d3 # bad: [552225e31fdc31315d06c5c7f59b8dbc78e45d18] mm: vmalloc: rename adjust_va_to_fit_type() function git bisect bad 552225e31fdc31315d06c5c7f59b8dbc78e45d18 # bad: [184c8db9112ee2370c2d17efab76a6c2aad061bc] s390/sclp: remove unhandled memory notifier type git bisect bad 184c8db9112ee2370c2d17efab76a6c2aad061bc # bad: [7ccaa21733f6a0fefdf978da0ee08a459c8441ca] mm/cma: fix placement of trace_cma_alloc_start/finish git bisect bad 7ccaa21733f6a0fefdf978da0ee08a459c8441ca # bad: [be7a8289b10f82365b9a209d9db9e4aaab1a86d4] selftests/mm: run_vmtests.sh: add hugetlb_madv_vs_map git bisect bad be7a8289b10f82365b9a209d9db9e4aaab1a86d4 # bad: [947b031634e7af3d265275c26ec17e2f96fdb5a1] mm/hugetlb: restore the reservation if needed git bisect bad 947b031634e7af3d265275c26ec17e2f96fdb5a1 # first bad commit: [947b031634e7af3d265275c26ec17e2f96fdb5a1] mm/hugetlb: restore the reservation if needed BUG: [ 11.959447] kernel BUG at mm/hugetlb.c:1129! [ 11.959628] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 11.959870] Modules linked in: [ 11.959975] CPU: 0 PID: 1211 Comm: map_hugetlb Not tainted 6.8.0-rc1-00029-g947b031634e7 #11 [ 11.960272] Hardware name: linux,dummy-virt (DT) [ 11.960434] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 11.960680] pc : resv_map_release+0xa8/0xb8 [ 11.960895] lr : resv_map_release+0x6c/0xb8 [ 11.961041] sp : ffff80008863bb20 [ 11.961159] x29: ffff80008863bb20 x28: 0000000000000000 x27: 0000000000000001 [ 11.961500] x26: ffff77b80e91a704 x25: 0000000000000001 x24: ffff77b80e91a708 [ 11.961740] x23: ffff77b80e91a700 x22: 0000000000000080 x21: 0000000000000000 [ 11.961992] x20: ffff77b80e91a720 x19: ffff77b80e91a720 x18: 0000fffff7fecfff [ 11.962239] x17: 0000000000000000 x16: 1fffeef701697601 x15: 0000000000000001 [ 11.962484] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffc8c6bc3b3cd0 [ 11.962729] x11: 0000000000000001 x10: 0000000000000000 x9 : ffffc8c6bad5245c [ 11.962971] x8 : ffff80008863bae0 x7 : 0000000000000000 x6 : ffff77b801966900 [ 11.963285] x5 : ffffc8c6bad6b7e4 x4 : fffffddee1613ea0 x3 : 0000000000400010 [ 11.963629] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000007f [ 11.963945] Call trace: [ 11.964070] resv_map_release+0xa8/0xb8 [ 11.964255] hugetlb_vm_op_close+0x1a0/0x1f0 [ 11.964431] remove_vma+0x34/0x88 [ 11.964604] do_vmi_align_munmap+0x4b8/0x5f0 [ 11.964792] do_vmi_munmap+0xb4/0x138 [ 11.964936] __vm_munmap+0xa8/0x188 [ 11.965074] __arm64_sys_munmap+0x28/0x40 [ 11.965237] invoke_syscall+0x50/0x128 [ 11.965388] el0_svc_common.constprop.0+0x48/0xf0 [ 11.965573] do_el0_svc+0x24/0x38 [ 11.965705] el0_svc+0x34/0xb8 [ 11.965862] el0t_64_sync_handler+0x100/0x130 [ 11.966032] el0t_64_sync+0x190/0x198 [ 11.966174] Code: a8c47bfd d50323bf d65f03c0 a9025bf5 (d4210000) [ 11.966410] ---[ end trace 0000000000000000 ]--- [ 11.966594] note: map_hugetlb[1211] exited with irqs disabled [ 11.966874] note: map_hugetlb[1211] exited with preempt_count 1 [ 11.967915] ------------[ cut here ]------------ [ 11.968165] WARNING: CPU: 0 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120 [ 11.968560] Modules linked in: [ 11.968651] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 6.8.0-rc1-00029-g947b031634e7 #11 [ 11.969235] Hardware name: linux,dummy-virt (DT) [ 11.970045] pstate: 214003c5 (nzCv DAIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 11.970941] pc : ct_kernel_exit.constprop.0+0x108/0x120 [ 11.971241] lr : ct_idle_enter+0x10/0x20 [ 11.971397] sp : ffffc8c6bd7f3d40 [ 11.971527] x29: ffffc8c6bd7f3d40 x28: 000000004224d0ac x27: 0000000000000000 [ 11.971811] x26: 0000000000000000 x25: ffffc8c6bd804980 x24: 0000000000000000 [ 11.972279] x23: 0000000000000000 x22: ffffc8c6bd804980 x21: ffffc8c6bd7f9ae0 [ 11.973023] x20: ffffc8c6bd7f9a20 x19: ffff77b83ff8dce0 x18: 0000000000000001 [ 11.973856] x17: 0000000000000000 x16: 1fffeef701697721 x15: 0000fffff7fed000 [ 11.975799] x14: ffff80008863b688 x13: 0000000000000000 x12: ffffc8c6bc3b3cd0 [ 11.976202] x11: 0000000000000001 x10: fc206bc54f751bff x9 : ffffc8c6bab7f564 [ 11.976461] x8 : ffffc8c6bd805a58 x7 : 00000000000001e5 x6 : 00000000119b8986 [ 11.976722] x5 : 4000000000000002 x4 : ffffaef18328b000 x3 : ffffc8c6bd7f3d40 [ 11.976985] x2 : ffffc8c6bcd02ce0 x1 : 4000000000000000 x0 : ffffc8c6bcd02ce0 [ 11.977245] Call trace: [ 11.977338] ct_kernel_exit.constprop.0+0x108/0x120 [ 11.977520] ct_idle_enter+0x10/0x20 [ 11.977654] default_idle_call+0x3c/0x170 [ 11.977804] do_idle+0x218/0x278 [ 11.977942] cpu_startup_entry+0x40/0x50 [ 11.978087] rest_init+0xf0/0xf8 [ 11.978209] arch_call_rest_init+0x18/0x20 [ 11.978391] start_kernel+0x524/0x670 [ 11.978528] __primary_switched+0xb8/0xc8 [ 11.978681] ---[ end trace 0000000000000000 ]--- Thanks, Ryan
On Wed, Jan 24, 2024 at 09:22:03AM +0000, Ryan Roberts wrote: > On 17/01/2024 17:10, Breno Leitao wrote: > > Currently there is a bug that a huge page could be stolen, and when the > > original owner tries to fault in it, it causes a page fault. > > > > You can achieve that by: > > 1) Creating a single page > > echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > > > 2) mmap() the page above with MAP_HUGETLB into (void *ptr1). > > * This will mark the page as reserved > > 3) touch the page, which causes a page fault and allocates the page > > * This will move the page out of the free list. > > * It will also unreserved the page, since there is no more free > > page > > 4) madvise(MADV_DONTNEED) the page > > * This will free the page, but not mark it as reserved. > > 5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2). > > * it should fail, but, since there is no more available page. > > * But, since the page above is not reserved, this mmap() succeed. > > 6) Faulting at ptr1 will cause a SIGBUS > > * it will try to allocate a huge page, but there is none > > available > > > > A full reproducer is in selftest. See > > https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/ > > > > Fix this by restoring the reserved page if necessary. If the page being > > unmapped has HPAGE_RESV_OWNER set, and needs a reservation, set the > > restore_reserve flag, which will move the page from free to reserved. > > > > Suggested-by: Rik van Riel <riel@surriel.com> > > Signed-off-by: Breno Leitao <leitao@debian.org> > > --- > > mm/hugetlb.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index ed1581b670d4..fa2c17767e44 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -5677,6 +5677,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, > > hugetlb_count_sub(pages_per_huge_page(h), mm); > > hugetlb_remove_rmap(page_folio(page)); > > > > + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && > > + vma_needs_reservation(h, vma, start)) { > > + /* > > + * Restore the reservation if needed, otherwise the > > + * backing page could be stolen by someone. > > + */ > > + folio_set_hugetlb_restore_reserve(page_folio(page)); > > + vma_add_reservation(h, vma, address); > > + } > > + > > spin_unlock(ptl); > > tlb_remove_page_size(tlb, page, huge_page_size(h)); > > /* > > Hi Breno, > > I'm seeing a kernel bug fire when running the "map_hugetlb" mm selftest against latest mm-unstable. Bisect tells me this patch is culprit. I'm running on arm64 with defconfig plus the following: Hello Ryan, Thanks for the heads-up. I was able to reproduce the problem, and I am working on a solution. Thanks,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ed1581b670d4..fa2c17767e44 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5677,6 +5677,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, hugetlb_count_sub(pages_per_huge_page(h), mm); hugetlb_remove_rmap(page_folio(page)); + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && + vma_needs_reservation(h, vma, start)) { + /* + * Restore the reservation if needed, otherwise the + * backing page could be stolen by someone. + */ + folio_set_hugetlb_restore_reserve(page_folio(page)); + vma_add_reservation(h, vma, address); + } + spin_unlock(ptl); tlb_remove_page_size(tlb, page, huge_page_size(h)); /*