Message ID | 20221129193526.3588187-1-peterx@redhat.com
---|---
State | New

From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: James Houghton <jthoughton@google.com>, Jann Horn <jannh@google.com>, peterx@redhat.com, Andrew Morton <akpm@linux-foundation.org>, Andrea Arcangeli <aarcange@redhat.com>, Rik van Riel <riel@surriel.com>, Nadav Amit <nadav.amit@gmail.com>, Miaohe Lin <linmiaohe@huawei.com>, Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>, David Hildenbrand <david@redhat.com>
Subject: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
Date: Tue, 29 Nov 2022 14:35:16 -0500
Message-Id: <20221129193526.3588187-1-peterx@redhat.com>
Commit Message
Peter Xu
Nov. 29, 2022, 7:35 p.m. UTC
Based on latest mm-unstable (9ed079378408).

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, but it mostly covers safe use of huge_pte_offset().

Compared to the previous rfcv2, the major change is that I dropped the new
pgtable lock and only use the vma lock for locking.  The major reason is
that I overlooked that the pgtable lock is not protected by RCU:
__pmd_free_tlb() frees the pgtable lock before e.g. call_rcu() for
RCU_TABLE_FREE archs.  OTOH, many of the huge_pte_offset() call sites do
need to take the pgtable lock, so the valid users for the new RCU lock
would be very limited.

It's possible that in the future we can rework the pgtable free to only
free the pgtable lock after the RCU grace period (move
pgtable_pmd_page_dtor() to be within tlb_remove_table_rcu()); then the RCU
lock will make more sense.  For now, make it simple by fixing the races
first.

Since this version attaches a reproducer (below) and also drops the RCU
(especially, the fallback irqoff) solution, the RFC tag is removed.

Old versions:

rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@redhat.com
rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@redhat.com

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since it is needed even
before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held for either read
or write.  It was assumed to be safe, but it's actually not.  One race
condition can easily be triggered by: (1) first, trigger pmd sharing on a
memory range, (2) do huge_pte_offset() on the range; then, in the meantime,
(3) another thread unshares the pmd range, and the pgtable page is prone
to being lost if the other sharing process frees it completely (by either
munmap or exiting the mm).

The recent work from Mike on the vma lock can resolve most of this already.
It's achieved by forbidding pmd unsharing while the lock is taken, so
there is no further risk of the pgtable page being freed.  It means that
if we can take the vma lock around all huge_pte_offset() callers, it'll
be safe.

A bunch of callers already do so as of the latest mm-unstable, but quite
a few others don't yet, for various reasons, especially around
huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock)
from being freed from under us.  IOW, huge_pte_offset() callers need to
hold either the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of the problem, based on hugetlb GUP (NOTE: since the race
is very hard to trigger, one needs to apply another kernel delay patch
too, see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define  MSIZE  (1UL << 30)     /* 1GB */
#define  PSIZE  (2UL << 20)     /* 2MB */

#define  HOLD_SEC  (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
    unsigned char *tmpbuf, *p;
    int ret;

    ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
    if (ret) {
        perror("posix_memalign() failed");
        return NULL;
    }

    tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_FIXED, fd, 0);
    if (tmpbuf == MAP_FAILED) {
        perror("mmap() failed");
        return NULL;
    }
    printf("mmap() -> %p\n", tmpbuf);

    for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
        *p = 1;
    }

    return tmpbuf;
}

void do_unmap(void *buf)
{
    munmap(buf, MSIZE);
}

void proc2(int fd)
{
    unsigned char c;

    buf = do_map(fd);
    if (!buf)
        return;

    read(pipefd[0], &c, 1);
    /*
     * This frees the shared pgtable page, causing use-after-free in
     * proc1_thread1 when soft walking hugetlb pgtable.
     */
    do_unmap(buf);
    printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
    /*
     * Trigger follow-page on 1st 2m page.  Kernel hack patch needed to
     * withhold this procedure for easier reproduce.
     */
    madvise(buf, PSIZE, MADV_POPULATE_WRITE);
    printf("Proc1-thread1 quitting\n");
    return NULL;
}

void *proc1_thread2(void *data)
{
    unsigned char c;

    /* Wait a while until proc1_thread1() start to wait */
    usleep(500 * 1000);
    /* Trigger pmd unshare */
    madvise(buf, PSIZE, MADV_DONTNEED);
    /* Kick off proc2 to release the pgtable */
    write(pipefd[1], &c, 1);
    printf("Proc1-thread2 quitting\n");
    return NULL;
}

void proc1(int fd)
{
    pthread_t tid1, tid2;
    int ret;

    buf = do_map(fd);
    if (!buf)
        return;

    ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
    assert(ret == 0);
    ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
    assert(ret == 0);

    /* Kick the child to share the PUD entry */
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    do_unmap(buf);
}

int main(void)
{
    int fd, ret;

    fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
    if (fd < 0) {
        perror("open failed");
        return -1;
    }

    ret = ftruncate(fd, MSIZE);
    if (ret) {
        perror("ftruncate() failed");
        return -1;
    }

    ret = pipe(pipefd);
    if (ret) {
        perror("pipe() failed");
        return -1;
    }

    if (fork()) {
        proc1(fd);
    } else {
        proc2(fd);
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(fd);
    return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[   16.959907] follow_hugetlb_page: withhold 1 sec...
[   17.960315] follow_hugetlb_page: withhold 1 sec...done
[   17.960550] ------------[ cut here ]------------
[   17.960742] DEBUG_LOCKS_WARN_ON(1)
[   17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[   17.961264] Modules linked in:
[   17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[   17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[   17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[   17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[   17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[   17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[   17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[   17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[   17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[   17.965123] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[   17.965443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[   17.965956] PKRU: 55555554
[   17.966068] Call Trace:
[   17.966172]  <TASK>
[   17.966268]  ? tick_nohz_tick_stopped+0x12/0x30
[   17.966455]  lock_acquire+0xbf/0x2b0
[   17.966603]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.966799]  ? _printk+0x48/0x4e
[   17.966934]  _raw_spin_lock+0x2f/0x40
[   17.967087]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.967285]  follow_hugetlb_page.cold+0x75/0x5c4
[   17.967473]  __get_user_pages+0xbb/0x620
[   17.967635]  faultin_vma_page_range+0x9a/0x100
[   17.967817]  madvise_vma_behavior+0x3c0/0xbd0
[   17.967998]  ? mas_prev+0x11/0x290
[   17.968141]  ? find_vma_prev+0x5e/0xa0
[   17.968304]  ? madvise_vma_anon_name+0x70/0x70
[   17.968486]  madvise_walk_vmas+0xa9/0x120
[   17.968650]  do_madvise.part.0+0xfa/0x270
[   17.968813]  __x64_sys_madvise+0x5a/0x70
[   17.968974]  do_syscall_64+0x37/0x90
[   17.969123]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   17.969329] RIP: 0033:0x7f1840f0efdb
[   17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[   17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[   17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[   17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[   17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[   17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[   17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[   17.972083]  </TASK>
[   17.972199] irq event stamp: 2353
[   17.972372] hardirqs last  enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[   17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[   17.973365] softirqs last  enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[   17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[   17.974341] ---[ end trace 0000000000000000 ]---
[   17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[   17.975012] #PF: supervisor read access in kernel mode
[   17.975314] #PF: error_code(0x0000) - not-present page
[   17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[   17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G        W          6.1.0-rc4-peterx+ #46
[   17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[   17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[   17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[   17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[   17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[   17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[   17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[   17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[   17.984815] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[   17.985924] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[   17.986674] PKRU: 55555554
[   17.986832] Call Trace:
[   17.987012]  <TASK>
[   17.987266]  ? tick_nohz_tick_stopped+0x12/0x30
[   17.987770]  lock_acquire+0xbf/0x2b0
[   17.988118]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.988575]  ? _printk+0x48/0x4e
[   17.988889]  _raw_spin_lock+0x2f/0x40
[   17.989243]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.989687]  follow_hugetlb_page.cold+0x75/0x5c4
[   17.990119]  __get_user_pages+0xbb/0x620
[   17.990500]  faultin_vma_page_range+0x9a/0x100
[   17.990928]  madvise_vma_behavior+0x3c0/0xbd0
[   17.991354]  ? mas_prev+0x11/0x290
[   17.991678]  ? find_vma_prev+0x5e/0xa0
[   17.992024]  ? madvise_vma_anon_name+0x70/0x70
[   17.992421]  madvise_walk_vmas+0xa9/0x120
[   17.992793]  do_madvise.part.0+0xfa/0x270
[   17.993166]  __x64_sys_madvise+0x5a/0x70
[   17.993539]  do_syscall_64+0x37/0x90
[   17.993879]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.
Patch Layout
============

Patch 1-2:  cleanup, or dependency of the follow up patches
Patch 3:    before fixing, document huge_pte_offset() on lock required
Patch 4-9:  each patch resolves one possible race condition
Patch 10:   introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore.  It also passes all hugetlb kselftests.

Comments welcomed, thanks.

Peter Xu (10):
  mm/hugetlb: Let vma_offset_start() to return start
  mm/hugetlb: Don't wait for migration entry during follow page
  mm/hugetlb: Document huge_pte_offset usage
  mm/hugetlb: Move swap entry handling into vma lock when faulted
  mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
  mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare
  mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare
  mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
  mm/hugetlb: Make page_vma_mapped_walk() safe to pmd unshare
  mm/hugetlb: Introduce hugetlb_walk()

 fs/hugetlbfs/inode.c    | 28 ++++++-------
 fs/userfaultfd.c        | 24 +++++++----
 include/linux/hugetlb.h | 69 ++++++++++++++++++++++++++++++++
 include/linux/rmap.h    |  4 ++
 include/linux/swapops.h |  6 ++-
 mm/hugetlb.c            | 89 ++++++++++++++++++-----------------------
 mm/migrate.c            | 25 ++++++++++--
 mm/page_vma_mapped.c    |  7 +++-
 mm/pagewalk.c           |  6 +--
 9 files changed, 175 insertions(+), 83 deletions(-)
Comments
On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:

> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, but majorly covering safe use of huge_pte_offset.

We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
settle down and stabilize things...

>
> ...
>
> huge_pte_offset() is always called with mmap lock held with either read or
> write.  It was assumed to be safe but it's actually not.  One race
> condition can easily trigger by: (1) firstly trigger pmd share on a memory
> range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> another thread unshare the pmd range, and the pgtable page is prone to lost
> if the other shared process wants to free it completely (by either munmap
> or exit mm).

That sounds like a hard-to-hit memory leak, but what we have here is a
user-triggerable use-after-free and an oops.  Ugh.

Could people please prioritize the review and testing of this patchset?
On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
> [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
Do we know which kernel versions are affected here?
Hi, Andrew,

On Tue, Nov 29, 2022 at 12:49:44PM -0800, Andrew Morton wrote:
> On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
>
> > Based on latest mm-unstable (9ed079378408).
> >
> > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > series for pmd unsharing, but majorly covering safe use of huge_pte_offset.
>
> We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
> settle down and stabilize things...

I targeted this series for the next release, not the current one, because
there's no known report of the bug, to my knowledge.

The reproducer needs explicit kernel delays to trigger, as mentioned in
the cover letter.  So far I haven't tried to reproduce it on a generic
kernel; I only verified the existence of the problem.

>
> >
> > ...
> >
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write.  It was assumed to be safe but it's actually not.  One race
> > condition can easily trigger by: (1) firstly trigger pmd share on a memory
> > range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> > another thread unshare the pmd range, and the pgtable page is prone to lost
> > if the other shared process wants to free it completely (by either munmap
> > or exit mm).
>
> That sounds like a hard-to-hit memory leak, but what we have here is a
> user-triggerable use-after-free and an oops.  Ugh.

IIUC it's not a leak; it's that huge_pte_offset() can walk the
(pmd-shared) pgtable page, and also try to take the pgtable lock, even
though the page can already have been freed in parallel, hence accessing
either the page or the pgtable lock after the pgtable page got freed.
E.g., the 1st warning was triggered by:

static inline struct lock_class *hlock_class(struct held_lock *hlock)
{
	unsigned int class_idx = hlock->class_idx;

	/* Don't re-read hlock->class_idx, can't use READ_ONCE() on bitfield */
	barrier();

	if (!test_bit(class_idx, lock_classes_in_use)) {
		/*
		 * Someone passed in garbage, we give up.
		 */
		DEBUG_LOCKS_WARN_ON(1);  <----------------------------
		return NULL;
	}
	...
}

I think it's because the spinlock got freed along with the pgtable page,
so when we try to take the pgtable lock we see a weird lock state in
dep_map, as the lock pointer is not valid at all.
On Tue, 29 Nov 2022 16:19:52 -0500 Peter Xu <peterx@redhat.com> wrote:

> On Tue, Nov 29, 2022 at 12:49:44PM -0800, Andrew Morton wrote:
> > On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
> >
> > > Based on latest mm-unstable (9ed079378408).
> > >
> > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > > series for pmd unsharing, but majorly covering safe use of huge_pte_offset.
> >
> > We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
> > settle down and stabilize things...
>
> I targeted this series for the next release not current, because there's no
> known report for it per my knowledge.
>
> The reproducer needs explicit kernel delays to trigger as mentioned in the
> cover letter.  So far I didn't try to reproduce with a generic kernel yet
> but just to verify the existance of the problem.

OK, thanks, I missed that.

I'll give the series a run in -next for a couple of days then I'll pull it
out until after the next Linus merge window, so it can't invalidate
testing heading into that merge window.
On Tue, Nov 29, 2022 at 12:51:17PM -0800, Andrew Morton wrote:
> On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <peterx@redhat.com> wrote:
>
> > [   17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
>
> Do we know which kernel versions are affected here?

The lockless walk in huge_pte_offset() has existed since the initial git
commit, so the problem dates back to when pmd sharing was introduced for
hugetlb.  That means any kernel after commit 39dde65c9940 ("[PATCH] shared
page table for hugetlb page", 2006-12-07) should be prone to the issue, at
least on x86 (as pmd sharing was initially proposed only for x86).
On 29.11.22 20:35, Peter Xu wrote:
> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, but majorly covering safe use of huge_pte_offset.
>
> Comparing to previous rfcv2, the major change is I dropped the new pgtable
> lock but only use vma lock for locking.  The major reason is I overlooked
> that the pgtable lock was not protected by RCU: __pmd_free_tlb() frees the
> pgtable lock before e.g. call_rcu() for RCU_TABLE_FREE archs.  OTOH many of
> the huge_pte_offset() call sites do need to take pgtable lock.  It means
> the valid users for the new RCU lock will be very limited.

Thanks.

>
> It's possible that in the future we can rework the pgtable free to only
> free the pgtable lock after RCU grace period (move pgtable_pmd_page_dtor()
> to be within tlb_remove_table_rcu()), then the RCU lock will make more
> sense.  For now, make it simple by fixing the races first.

Good.

>
> Since this version attached a reproducer (below) and also removed the RCU
> (especially, the fallback irqoff) solution, removing RFC tag.

Very nice, thanks.

>
> Old versions:
>
> rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@redhat.com
> rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@redhat.com
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable.  It's used mostly everywhere since that's needed even
> before taking the pgtable lock.
>
> huge_pte_offset() is always called with mmap lock held with either read or
> write.  It was assumed to be safe but it's actually not.  One race
> condition can easily trigger by: (1) firstly trigger pmd share on a memory
> range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> another thread unshare the pmd range, and the pgtable page is prone to lost
> if the other shared process wants to free it completely (by either munmap
> or exit mm).

So just that I understand correctly:

Two processes, #A and #B, share a page table.  Process #A runs two
threads, #A1 and #A2.

#A1 walks that shared page table (using huge_pte_offset()), for example,
to resolve a page fault.  Concurrently, #A2 triggers unsharing of that
page table (replacing it by a private page table), for example, using
munmap().

So #A1 will eventually read/write the shared page table while we're
placing a private page table.  Which would be fine (assuming no unsharing
would be required by #A1), however, if #B also concurrently drops the
reference to the shared page table (), the shared page table could
essentially get freed while #A1 is still walking it.

I suspect, looking at the reproducer, that the page table deconstructor
was called.  Will the page table also actually get freed already?  IOW,
could #A1 be reading/writing a freed page?

> The recent work from Mike on vma lock can resolve most of this already.
> It's achieved by forbidden pmd unsharing during the lock being taken, so no
> further risk of the pgtable page being freed.  It means if we can take the
> vma lock around all huge_pte_offset() callers it'll be safe.

Agreed.

>
> There're already a bunch of them that we did as per the latest mm-unstable,
> but also quite a few others that we didn't for various reasons especially
> on huge_pte_offset() usage.
>
> One more thing to mention is that besides the vma lock, i_mmap_rwsem can
> also be used to protect the pgtable page (along with its pgtable lock) from
> being freed from under us.  IOW, huge_pte_offset() callers need to either
> hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
>
> A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
> very hard to trigger, one needs to apply another kernel delay patch too,
> see below):

Thanks,

David / dhildenb
On Wed, Nov 30, 2022 at 10:46:24AM +0100, David Hildenbrand wrote:
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write.  It was assumed to be safe but it's actually not.  One race
> > condition can easily trigger by: (1) firstly trigger pmd share on a memory
> > range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> > another thread unshare the pmd range, and the pgtable page is prone to lost
> > if the other shared process wants to free it completely (by either munmap
> > or exit mm).
>
> So just that I understand correctly:
>
> Two processes, #A and #B, share a page table.  Process #A runs two threads,
> #A1 and #A2.
>
> #A1 walks that shared page table (using huge_pte_offset()), for example, to
> resolve a page fault.  Concurrently, #A2 triggers unsharing of that page
> table (replacing it by a private page table),

Not yet replacing it, just unsharing.  If the replacement happened we
shouldn't trigger a bug either, because huge_pte_offset() would return the
private pgtable page instead.

> for example, using munmap().

munmap() may not work because it needs the mmap lock, so it'll wait until
#A1 completes the huge_pte_offset() walk and releases the mmap read lock.
Many other things can trigger an unshare, though; in the reproducer I used
MADV_DONTNEED.

>
> So #A1 will eventually read/write the shared page table while we're placing
> a private page table.  Which would be fine (assuming no unsharing would be
> required by #A1), however, if #B also concurrently drops the reference to
> the shared page table (), the shared page table could essentially get freed
> while #A1 is still walking it.
>
> I suspect, looking at the reproducer, that the page table deconstructor was
> called.  Will the page table also actually get freed already?  IOW, could
> #A1 be reading/writing a freed page?

With the existing code base, I think it could.

With the RCU lock, it couldn't; but since the pgtable lock is freed even
when the page is not, we'd still hit weird issues when accessing the lock.

With the vma lock, it should be all safe.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d97c9a2a15d..f8d99dad5004 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -38,6 +38,7 @@
 #include <asm/page.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
+#include <asm/delay.h>
 
 #include <linux/io.h>
 #include <linux/hugetlb.h>
@@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool unshare = false;
 	int absent;
 	struct page *page;
+	unsigned long c = 0;
 
 	/*
 	 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
 				      huge_page_size(h));
+
+		pr_info("%s: withhold 1 sec...\n", __func__);
+		for (c = 0; c < 100; c++) {
+			udelay(10000);
+		}
+		pr_info("%s: withhold 1 sec...done\n", __func__);
+
 		if (pte)
 			ptl = huge_pte_lock(h, mm, pte);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));