Message ID | 20231001005659.2185316-1-riel@surriel.com |
---|---|
Headers |
From: riel@surriel.com To: linux-kernel@vger.kernel.org Cc: kernel-team@meta.com, linux-mm@kvack.org, akpm@linux-foundation.org, muchun.song@linux.dev, mike.kravetz@oracle.com, leit@meta.com, willy@infradead.org Subject: [PATCH v5 0/3] hugetlbfs: close race between MADV_DONTNEED and page fault Date: Sat, 30 Sep 2023 20:55:47 -0400 Message-ID: <20231001005659.2185316-1-riel@surriel.com> List-ID: <linux-kernel.vger.kernel.org> |
Series |
hugetlbfs: close race between MADV_DONTNEED and page fault
|
|
Message
Rik van Riel
Oct. 1, 2023, 12:55 a.m. UTC
v5: somehow a __vma_private_lock(vma) test failed to make it from my tree into the v4 series, fix that
v4: fix unmap_vmas locking issue pointed out by Mike Kravetz, and resulting lockdep fallout
v3: fix compile error w/ lockdep and test case errors with patch 3
v2: fix the locking bug found with the libhugetlbfs tests.

Malloc libraries, like jemalloc and tcmalloc, take decisions on when
to call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages
sitting around to satisfy a page fault. However, with hugetlbfs,
systems often allocate only the exact number of huge pages that
the application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

       CPU 1                            CPU 2

       MADV_DONTNEED
       unmap page
       shoot down TLB entry
                                        page fault
                                        fail to allocate a huge page
                                        killed with SIGBUS
       free page

Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_final_range into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of
the MADV_DONTNEED VMA until the huge pages have actually been freed.

The third patch in the series is more of an RFC. Using the
invalidate_lock instead of the hugetlb_vma_lock greatly simplifies
the code, but at the cost of turning a per-VMA lock into a lock
per backing hugetlbfs file, which could slow things down when
multiple processes are mapping the same hugetlbfs file.
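The interleaving described in the cover letter can be replayed in a small userspace model. This is only a sketch under stated assumptions, not kernel code and not the patch itself: `struct pool`, `fault_in_page()` and `replay_race()` are hypothetical names invented for illustration, the pool holds exactly one huge page, and the lock is modelled simply by the order in which the two CPUs' steps run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): a hugetlb pool sized to
 * exactly the number of huge pages the application uses. */
struct pool {
	int free_pages;
};

enum fault_result { FAULT_OK, FAULT_SIGBUS };

/* Model of the page fault path: take a huge page from the pool, or
 * report SIGBUS when the pool is empty. */
static enum fault_result fault_in_page(struct pool *p)
{
	assert(p->free_pages >= 0);
	if (p->free_pages == 0)
		return FAULT_SIGBUS;
	p->free_pages--;
	return FAULT_OK;
}

/*
 * Replay the CPU 1 / CPU 2 interleaving from the cover letter.
 *
 * lock_covers_free == false: the fault on CPU 2 runs after the unmap
 * and TLB shootdown but before CPU 1 frees the page.
 * lock_covers_free == true: the fault is held off until CPU 1 has
 * freed the page, which is the ordering the series aims to enforce.
 */
enum fault_result replay_race(bool lock_covers_free)
{
	/* The only huge page is currently mapped, so the pool is empty. */
	struct pool p = { .free_pages = 0 };
	enum fault_result res;

	/* CPU 1: MADV_DONTNEED unmaps the page and shoots down the TLB
	 * entry. The page has not been returned to the pool yet. */

	if (!lock_covers_free) {
		/* CPU 2: page fault slips in before the free. */
		res = fault_in_page(&p);
		/* CPU 1: free page, too late for the faulting task. */
		p.free_pages++;
		return res;
	}

	/* CPU 2 waits on the lock; CPU 1 frees the page first. */
	p.free_pages++;
	/* Lock dropped: CPU 2's fault now finds a free huge page. */
	return fault_in_page(&p);
}
```

In this model `replay_race(false)` ends in `FAULT_SIGBUS` and `replay_race(true)` ends in `FAULT_OK`, which mirrors the claim that keeping page faults locked out until the huge pages have actually been freed closes the race.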
Comments
On Sat, 30 Sep 2023 20:55:47 -0400 riel@surriel.com wrote:

> v5: somehow a __vma_private_lock(vma) test failed to make it from my tree into the v4 series, fix that
> v4: fix unmap_vmas locking issue pointed out by Mike Kravetz, and resulting lockdep fallout
> v3: fix compile error w/ lockdep and test case errors with patch 3
> v2: fix the locking bug found with the libhugetlbfs tests.
>
> Malloc libraries, like jemalloc and tcmalloc, take decisions on when
> to call madvise independently from the code in the main application.
>
> This sometimes results in the application page faulting on an address,
> right after the malloc library has shot down the backing memory with
> MADV_DONTNEED.
>
> Usually this is harmless, because we always have some 4kB pages
> sitting around to satisfy a page fault. However, with hugetlbfs
> systems often allocate only the exact number of huge pages that
> the application wants.
>
> Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
> any lock taken on the page fault path, which can open up the following
> race condition:
>
>        CPU 1                            CPU 2
>
>        MADV_DONTNEED
>        unmap page
>        shoot down TLB entry
>                                         page fault
>                                         fail to allocate a huge page
>                                         killed with SIGBUS
>        free page
>
> Fix that race by extending the hugetlb_vma_lock locking scheme to also
> cover private hugetlb mappings (with resv_map), and pulling the locking
> from __unmap_hugepage_final_range into helper functions called from
> zap_page_range_single. This ensures page faults stay locked out of
> the MADV_DONTNEED VMA until the huge pages have actually been freed.

Didn't we decide that [1/3] and [2/3] should be cc:stable?

> The third patch in the series is more of an RFC. Using the
> invalidate_lock instead of the hugetlb_vma_lock greatly simplifies
> the code, but at the cost of turning a per-VMA lock into a lock
> per backing hugetlbfs file, which could slow things down when
> multiple processes are mapping the same hugetlbfs file.

"could slow things down" is testable-for?

This third one I'd queue up for testing for a 6.7-rc1 merge, so I'll
split the series apart. Not a problem, but it would be a little better
if things were originally packaged that way.