From patchwork Wed Oct 4 03:25:02 2023
X-Patchwork-Submitter: Rik van Riel
X-Patchwork-Id: 14894
From: riel@surriel.com
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, akpm@linux-foundation.org,
    muchun.song@linux.dev, mike.kravetz@oracle.com, leit@meta.com,
    willy@infradead.org
Subject: [PATCH v6 0/3] hugetlbfs: close race between MADV_DONTNEED and page fault
Date: Tue, 3 Oct 2023 23:25:02 -0400
Message-ID: <20231004032814.3108383-1-riel@surriel.com>
X-Mailer: git-send-email 2.41.0
X-Mailing-List: linux-kernel@vger.kernel.org

v6: move a fix from patch 3 to patch 2, more locking fixes
v5: somehow a __vma_private_lock(vma) test failed to make it from
    my tree into the v4 series, fix that
v4: fix unmap_vmas locking issue pointed out by Mike Kravetz, and
    resulting lockdep fallout
v3: fix compile error w/ lockdep and test case errors with patch 3
v2: fix the locking bug found with the libhugetlbfs tests.

Malloc libraries, like jemalloc and tcmalloc, make decisions on when
to call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages
sitting around to satisfy a page fault. However, with hugetlbfs,
systems often allocate only the exact number of huge pages that the
application wants.
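For illustration only, here is a minimal userspace sketch of the
harmless case described above, using ordinary anonymous memory rather
than hugetlbfs. The helper name touch_after_dontneed() is hypothetical
and not part of this series. It assumes Linux semantics for
MADV_DONTNEED on a private anonymous mapping, where the next access
faults in a fresh zero-filled page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch series.
 * Returns the first byte read back after MADV_DONTNEED, or -1 on error.
 */
int touch_after_dontneed(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int value;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Fault the page in and dirty it, as the application would. */
	memset(p, 0xaa, (size_t)page);

	/* What the malloc library does behind the application's back. */
	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* This access page faults; the kernel supplies a new 4kB page. */
	value = p[0];

	munmap(p, (size_t)page);
	return value;
}
```

With regular pages the fault after madvise is simply satisfied from
free memory. The hugetlbfs case differs because the pool may hold no
spare huge page at the moment of the fault.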

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

       CPU 1                            CPU 2

       MADV_DONTNEED
       unmap page
       shoot down TLB entry
                                        page fault
                                        fail to allocate a huge page
                                        killed with SIGBUS
       free page

Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_range_final into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of
the MADV_DONTNEED VMA until the huge pages have actually been freed.

The third patch in the series is more of an RFC. Using the
invalidate_lock instead of the hugetlb_vma_lock greatly simplifies
the code, but at the cost of turning a per-VMA lock into a lock per
backing hugetlbfs file, which could slow things down when multiple
processes are mapping the same hugetlbfs file.
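
To make the race and the effect of the lock concrete, below is a toy,
single-threaded model. It is not kernel code: every name in it
(toy_state, zap_one_step, toy_fault, count_failed_faults) is
hypothetical. It replays a page fault at each possible point in the
MADV_DONTNEED sequence, with a pool that holds exactly as many huge
pages as the application uses. With locked set, a fault that arrives
after the zap has started is assumed to wait until the page has been
freed, which is the behaviour the lock is meant to guarantee.

```c
#include <stdbool.h>

/* Steps of the MADV_DONTNEED path on "CPU 1", in order. */
enum zap_step { ZAP_UNMAP, ZAP_TLB_FLUSH, ZAP_FREE, ZAP_NSTEPS };

struct toy_state {
	bool mapped;	/* is the huge page mapped in the VMA? */
	int free_pages;	/* huge pages left in the preallocated pool */
};

static void zap_one_step(struct toy_state *s, enum zap_step step)
{
	switch (step) {
	case ZAP_UNMAP:
		s->mapped = false;
		break;
	case ZAP_FREE:
		s->free_pages++;
		break;
	default:
		break;	/* the TLB flush changes no state in this model */
	}
}

/* "CPU 2": returns false where the real task would be killed by SIGBUS. */
static bool toy_fault(struct toy_state *s)
{
	if (s->mapped)
		return true;
	if (s->free_pages == 0)
		return false;
	s->free_pages--;
	s->mapped = true;
	return true;
}

/*
 * Try the fault after 0, 1, 2 and 3 zap steps and count the failures.
 * When locked is true, a fault arriving mid-zap is deferred until the
 * whole zap sequence, including the free, has completed.
 */
int count_failed_faults(bool locked)
{
	int failures = 0;

	for (int fault_at = 0; fault_at <= ZAP_NSTEPS; fault_at++) {
		struct toy_state s = { .mapped = true, .free_pages = 0 };
		int run_at = fault_at;

		if (locked && fault_at > 0)
			run_at = ZAP_NSTEPS;

		for (int step = 0; step <= ZAP_NSTEPS; step++) {
			if (step == run_at && !toy_fault(&s))
				failures++;
			if (step < ZAP_NSTEPS)
				zap_one_step(&s, (enum zap_step)step);
		}
	}
	return failures;
}
```

In this model count_failed_faults(false) finds two failing
interleavings (a fault after the unmap or after the TLB flush, but
before the free), and count_failed_faults(true) finds none.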