From: Jiaqi Yan
Date: Fri, 28 Apr 2023 00:41:32 +0000
Message-ID: <20230428004139.2899856-1-jiaqiyan@google.com>
Subject: [RFC PATCH v1 0/7] PAGE_SIZE Unmapping in Memory Failure Recovery for HugeTLB Pages
To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com
Cc: songmuchun@bytedance.com, duenwen@google.com, axelrasmussen@google.com,
 jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com,
 shy828301@gmail.com, baolin.wang@linux.alibaba.com,
 wangkefeng.wang@huawei.com, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Jiaqi Yan
X-Mailer: git-send-email 2.40.1.495.gc816e09b53d-goog
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 8741
X-Mailing-List: linux-kernel@vger.kernel.org

Goal
====
Currently, once a byte in a HugeTLB hugepage becomes HWPOISON, the whole
hugepage is unmapped from the page table because that is the finest
granularity of the mapping. High granularity mapping (HGM) [1], the
functionality to map memory addresses at finer granularities (the extreme
case is PAGE_SIZE), was recently proposed upstream, and provides the
opportunity to handle memory errors more efficiently: instead of unmapping
the whole hugepage, only the raw subpage in the hugepage needs to be thrown
away, and all the healthy subpages can still be kept available for users.

Idea
====
Today memory failure recovery for HugeTLB pages (hugepage) is different
from raw and THP pages. We are only interested in in-use hugepages, which
are handled in these simplified steps:
1. Increment the refcount on the compound head of the hugepage.
2. Insert the raw HWPOISON page into the compound head’s raw_hwp_list
   (_hugetlb_hwpoison) if it is not already in the list.
3. Unmap the entire hugepage from HugeTLB’s page table.
4. Kill the processes that are accessing the poisoned hugepage.

HGM can greatly improve this recovery mechanism. Step #3 (unmapping the
entire hugepage) can be replaced by
3.1 Map the entire hugepage at finer granularity, so that the exact
    HWPOISON address is mapped by a PAGE_SIZE PTE, and the rest of the
    address space is optimally mapped by either smaller P*Ds or PTEs. In
    other words, the original HugeTLB PTE is split into smaller P*Ds and
    PTEs.
3.2 Only unmap the newly mapped PTE that maps the HWPOISON address.

For shared mappings, the current HGM patches are already a solid basis for
the splitting functionality in step #3.1.
This RFC drafts a complete solution for shared mappings. The
splitting-based idea can be applied to private mappings as well, but
additional subtle complexity needs to be dealt with. We defer the private
mapping case as future work.

Splitting HugeTLB PTEs (Step #3.1)
==================================
The general process of splitting a present leaf HugeTLB PTE is
1. Get and clear the original HugeTLB PTE old_pte.
2. Initialize curr with the start of the address range corresponding to
   old_pte.
3. Find the optimal level we should map curr at.
4. Perform an HGM walk on curr with the optimal level found in step 3,
   potentially allocating a new PTE at the optimal level.
5. Populate the newly allocated PTE with bits from old_pte, including
   dirty, write, and UFFD_WP.
6. Update curr += the newly created PTE size, and repeat from step 3 until
   the entire VMA is covered.

Splitting a hugepage mapping is mostly not meaningful for none PTEs. We
handle none or userfaultfd write protect (UFFD_WP) marker HugeTLB PTEs at
the time of page faulting. Migration and HWPOISON PTEs are better left
untouched.

Memory Failure Recovery and Unmapping (Step #3.2)
=================================================
A few changes are made in memory_failure and rmap to only unmap raw
HWPOISON pages:
1. As long as HGM is turned on in CONFIG, memory_failure attempts to enable
   HGM on the VMA containing the poisoned hugepage.
2. memory_failure attempts to split the HugeTLB PTE so that the poisoned
   address is mapped by a PAGE_SIZE PTE, for all the VMAs containing the
   poisoned hugepage.
3. get_huge_page_for_hwpoison only returns -EHWPOISON if the raw page is
   already in the compound head’s raw_hwp_list. This makes unmapping work
   correctly when multiple raw pages in the same hugepage become HWPOISON.
4. rmap utilizes the compound head’s raw_hwp_list to 1) avoid unmapping raw
   pages not in the list, and 2) keep track of whether the raw pages in the
   list are already unmapped.
5. The page refcount check in me_huge_page is skipped.

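
For illustration only, the splitting walk (steps 2, 3 and 6 under
"Splitting HugeTLB PTEs") can be modeled in userspace C. This is a sketch
under stated assumptions, not code from the patchset: it assumes x86-64
sizes (4K PTE, 2M PMD), and the names optimal_size, split_hugepage and
struct split_stats are hypothetical. It only counts the mappings a split
would create; it does not touch page tables or copy PTE bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)  /* assumed PAGE_SIZE (x86-64) */
#define SZ_2M (1ULL << 21)  /* assumed PMD-level mapping size (x86-64) */

/* Hypothetical bookkeeping: how many mappings the split creates. */
struct split_stats {
    uint64_t nr_pmd;  /* mappings created at 2M */
    uint64_t nr_pte;  /* mappings created at 4K */
};

/*
 * Hypothetical helper: pick the largest size that is aligned at curr,
 * fits before end, and (unless it is PAGE_SIZE) does not cover the
 * poisoned address.
 */
static uint64_t optimal_size(uint64_t curr, uint64_t end, uint64_t poison)
{
    static const uint64_t sizes[] = { SZ_2M, SZ_4K };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        uint64_t sz = sizes[i];

        if (curr % sz != 0 || end - curr < sz)
            continue;  /* misaligned, or would run past the hugepage */
        if (sz > SZ_4K && poison >= curr && poison - curr < sz)
            continue;  /* would map the poisoned address too coarsely */
        return sz;
    }
    return SZ_4K;
}

/* Walk [start, start + hugepage_size) the way steps 2, 3 and 6 describe. */
static struct split_stats split_hugepage(uint64_t start,
                                         uint64_t hugepage_size,
                                         uint64_t poison)
{
    struct split_stats stats = { 0, 0 };
    uint64_t end = start + hugepage_size;

    /* The poisoned address must lie inside the hugepage being split. */
    assert(poison >= start && poison < end);

    for (uint64_t curr = start; curr < end; ) {
        uint64_t sz = optimal_size(curr, end, poison);

        if (sz == SZ_2M)
            stats.nr_pmd++;
        else
            stats.nr_pte++;
        curr += sz;
    }
    return stats;
}
```

Under these assumptions, splitting a 1G hugepage yields 511 mappings at 2M
plus 512 PTEs, and only the one PTE covering the poisoned address would be
unmapped in step #3.2. A 2M hugepage degenerates to 512 PTEs.
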
Between mmap() and Page Fault
=============================
A memory error can occur between the time when userspace maps a hugepage
and the time when userspace faults in the mapped hugepage. The general idea
is to not create any raw-page-size page table entry for HWPOISON memory,
and to keep memory in healthy raw pages available to userspace (via normal
fault handling). At the time of hugetlb_no_page:
- If the entire hugepage doesn’t contain any HWPOISON page, the normal page
  fault handler continues.
- If the memory address being faulted is within a HWPOISON raw page,
  hugetlb_no_page returns VM_FAULT_HWPOISON_LARGE (so that the page fault
  handler sends a BUS_MCEERR_AR SIGBUS to the faulting process).
- If the memory address being faulted is within a healthy raw page,
  hugetlb_no_page utilizes HGM to create a new HugeTLB PTE whose
  hugetlb_pte_size is as large as possible while not mapping any HWPOISON
  address. Then the normal page fault handler continues.

Failure Handling
================
- If the kernel still fails to allocate a new raw_hwp_page after a retry,
  memory_failure returns MF_IGNORED with MF_MSG_UNKNOWN.
- For each VMA that maps the HWPOISON hugepage
  - If the VMA is not eligible for HGM, the old behavior is taken: unmap
    the entire hugepage from that VMA.
  - If memory_failure fails to enable HGM on the VMA, or if memory_failure
    fails to split any VMA that mapped the HWPOISON page, the recovery
    returns MF_IGNORED with MF_MSG_UNMAP_FAILED.
  - For a particular VMA, if splitting the HugeTLB PTE fails, the original
    PTE will be restored to the page table.

Code Changes
============
The code patches in this RFC are based on HGM patchset V2 [1] and are
composed of two parts. The first part implements the idea laid out in the
cover letter; the second part tests two major scenarios: HWPOISON on
already faulted pages and HWPOISON between mapped and faulted.

Future Changes
==============
There is a pending improvement to hugetlbfs_read_iter.
If a hugepage is found in the page cache and it contains HWPOISON subpages,
today the kernel returns -EIO immediately. With the new
splitting-then-unmap behavior, the kernel can return to userspace every
byte up to the first raw HWPOISON byte. If userspace wants the read to
start within a raw HWPOISON page, the kernel will have to return -EIO. This
improvement and its selftest will be done in a future patch series.

[1] https://lore.kernel.org/all/20230218002819.1486479-1-jthoughton@google.com/

Jiaqi Yan (7):
  hugetlb: add HugeTLB splitting functionality
  hugetlb: create PTE level mapping when possible
  mm: publish raw_hwp_page in mm.h
  mm/memory_failure: unmap raw HWPoison PTEs when possible
  hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
  selftest/mm: test PAGESIZE unmapping HWPOISON pages
  selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

 include/linux/hugetlb.h                  |  14 +
 include/linux/mm.h                       |  36 ++
 mm/hugetlb.c                             | 405 ++++++++++++++++++++++-
 mm/memory-failure.c                      | 206 ++++++++++--
 mm/rmap.c                                |  38 ++-
 tools/testing/selftests/mm/hugetlb-hgm.c | 364 ++++++++++++++++++--
 6 files changed, 1004 insertions(+), 59 deletions(-)