From patchwork Fri Apr 28 00:41:33 2023
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 88421
Date: Fri, 28 Apr 2023 00:41:33 +0000
In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com>
References: <20230428004139.2899856-1-jiaqiyan@google.com>
Message-ID: <20230428004139.2899856-2-jiaqiyan@google.com>
Subject: [RFC PATCH v1 1/7] hugetlb: add HugeTLB splitting functionality
From: Jiaqi Yan
To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com
Cc: songmuchun@bytedance.com, duenwen@google.com, axelrasmussen@google.com,
 jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com,
 shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com,
 akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Jiaqi Yan

The new function, hugetlb_split_to_shift, optimally splits the page table to map a particular address at a particular granularity. This is useful for punching a hole in the mapping and for mapping (and unmapping) small sections of a HugeTLB page. Splitting is for present leaf HugeTLB PTEs only. None (empty) HugeTLB PTEs and other non-present HugeTLB PTE types are not supported, as they are better left untouched: * None PTEs * Migration PTEs * HWPOISON PTEs * UFFD writeprotect PTEs
Signed-off-by: Jiaqi Yan --- include/linux/hugetlb.h | 9 ++ mm/hugetlb.c | 249 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 258 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 742e7f2cb170..d44bf6a794e5 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -1266,6 +1266,9 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm, unsigned long end); int hugetlb_collapse(struct mm_struct *mm, unsigned long start, unsigned long end); +int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, + struct hugetlb_pte *hpte, unsigned long addr, + unsigned int desired_shift); #else static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma) { @@ -1292,6 +1295,12 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start, { return -EINVAL; } +int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, + const struct hugetlb_pte *hpte, unsigned long addr, + unsigned int desired_shift) +{ + return -EINVAL; +} #endif static inline
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index df4c17164abb..d3f3f1c2d293 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -8203,6 +8203,255 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start, return ret; } +/* + * Find the optimal HugeTLB PTE shift that @desired_addr could be mapped at. + */ +static int hugetlb_find_shift(struct vm_area_struct *vma, + unsigned long curr, + unsigned long end, + unsigned long desired_addr, + unsigned long desired_shift, + unsigned int *shift_found) +{ + struct hstate *h = hstate_vma(vma); + struct hstate *tmp_h; + unsigned int shift; + unsigned long sz; + + for_each_hgm_shift(h, tmp_h, shift) { + sz = 1UL << shift; + /* This sz is not aligned or too large. */ + if (!IS_ALIGNED(curr, sz) || curr + sz > end) + continue; + /* + * When desired_addr is in [curr, curr + sz), + * we want shift to be as close to desired_shift + * as possible. + */ + if (curr <= desired_addr && desired_addr < curr + sz + && shift > desired_shift) + continue; + + *shift_found = shift; + return 0; + } + + return -EINVAL; +} + +/* + * Given a particular address @addr that is mapped by a present leaf HugeTLB + * PTE, split it so that the PTE that maps @addr is at @desired_shift.
+ */ +static int hugetlb_split_to_shift_present_leaf(struct mm_struct *mm, + struct vm_area_struct *vma, + pte_t old_entry, + unsigned long start, + unsigned long end, + unsigned long addr, + unsigned int orig_shift, + unsigned int desired_shift) +{ + bool old_entry_dirty; + bool old_entry_write; + bool old_entry_uffd_wp; + pte_t new_entry; + unsigned long curr; + unsigned long sz; + unsigned int shift; + int ret = 0; + struct hugetlb_pte new_hpte; + struct page *subpage = NULL; + struct folio *folio = page_folio(compound_head(pte_page(old_entry))); + struct hstate *h = hstate_vma(vma); + spinlock_t *ptl; + + /* Unmap original unsplit hugepage per huge_ptep_get_and_clear. */ + hugetlb_remove_rmap(folio_page(folio, 0), orig_shift, h, vma); + + old_entry_dirty = huge_pte_dirty(old_entry); + old_entry_write = huge_pte_write(old_entry); + old_entry_uffd_wp = huge_pte_uffd_wp(old_entry); + + for (curr = start; curr < end; curr += sz) { + ret = hugetlb_find_shift(vma, curr, end, addr, + desired_shift, &shift); + + /* Unable to find a shift that works */ + if (WARN_ON(ret)) + goto abort; + + /* + * Do HGM full walk and allocate new page table structures + * to continue to walk to the level we want. + */ + sz = 1UL << shift; + ret = hugetlb_full_walk_alloc(&new_hpte, vma, curr, sz); + if (WARN_ON(ret)) + goto abort; + + BUG_ON(hugetlb_pte_size(&new_hpte) > sz); + /* + * When hugetlb_pte_size(new_hpte) is than sz, increment + * curr by hugetlb_pte_size(new_hpte) to avoid skip over + * some PTEs. + */ + if (hugetlb_pte_size(&new_hpte) < sz) + sz = hugetlb_pte_size(&new_hpte); + + subpage = hugetlb_find_subpage(h, folio, curr); + /* + * Creating a new (finer granularity) PT entry and + * populate it with old_entry's bits. + */ + new_entry = make_huge_pte(vma, subpage, + huge_pte_write(old_entry), shift); + if (old_entry_dirty) + new_entry = huge_pte_mkdirty(new_entry); + if (old_entry_write) + new_entry = huge_pte_mkwrite(new_entry); + if (old_entry_uffd_wp) + new_entry = huge_pte_mkuffd_wp(new_entry); + ptl = hugetlb_pte_lock(&new_hpte); + set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry); + spin_unlock(ptl); + /* Increment ref/mapcount per set_huge_pte_at(). */ + hugetlb_add_file_rmap(subpage, shift, h, vma); + folio_get(folio); + } + /* + * This refcount decrement is for the huge_ptep_get_and_clear + * on the hpte BEFORE splitting, for the same reason as + * hugetlb_remove_rmap(), but we cannot do it at that time. + * Now that splitting succeeded, the refcount can be decremented. + */ + folio_put(folio); + return 0; +abort: + /* + * Restore mapcount on unsplitted hugepage. No need to restore + * refcount as we won't folio_put() until splitting succeeded. + */ + hugetlb_add_file_rmap(folio_page(folio, 0), orig_shift, h, vma); + return ret; +} + +/* + * Given a particular address @addr, split the HugeTLB PTE that currently + * maps it so that, for the given @addr, the PTE that maps it is @desired_shift. + * The splitting is always done optimally. + * + * Example: given a HugeTLB 1G page mapped from VA 0 to 1G, if caller calls + * this API with addr=0 and desired_shift=PAGE_SHIFT, we will change the page + * table as follows: + * 1. The original PUD will be split into 512 2M PMDs first + * 2. The 1st PMD will further be split into 512 4K PTEs + * + * Callers are required to hold locks on the file mapping within vma. 
+ */ +int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, + struct hugetlb_pte *hpte, unsigned long addr, + unsigned int desired_shift) +{ + unsigned long start, end; + unsigned long desired_sz = 1UL << desired_shift; + int ret; + pte_t old_entry; + struct mmu_gather tlb; + struct mmu_notifier_range range; + spinlock_t *ptl; + + BUG_ON(!hpte->ptep); + + start = addr & hugetlb_pte_mask(hpte); + end = start + hugetlb_pte_size(hpte); + BUG_ON(!IS_ALIGNED(start, desired_sz)); + BUG_ON(!IS_ALIGNED(end, desired_sz)); + BUG_ON(addr < start || end <= addr); + + if (hpte->shift == desired_shift) + return 0; + + /* + * Non none-mostly hugetlb PTEs must be present leaf-level PTE, + * i.e. not split before. + */ + ptl = hugetlb_pte_lock(hpte); + BUG_ON(!huge_pte_none_mostly(huge_ptep_get(hpte->ptep)) && + !hugetlb_pte_present_leaf(hpte, huge_ptep_get(hpte->ptep))); + + i_mmap_assert_write_locked(vma->vm_file->f_mapping); + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start, end); + mmu_notifier_invalidate_range_start(&range); + + /* + * Get and clear the PTE. We will allocate new page table structures + * when walking the page table. + */ + old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep); + spin_unlock(ptl); + + /* + * From now on, any failure exit needs to go through "skip" to + * put old_entry back. If any form of hugetlb_split_to_shift_xxx + * is invoked, it also needs to go through "abort" to get rid of + * the allocated PTEs created before splitting fails. + */ + + if (unlikely(huge_pte_none_mostly(old_entry))) { + ret = -EAGAIN; + goto skip; + } + if (unlikely(!pte_present(old_entry))) { + if (is_hugetlb_entry_migration(old_entry)) + ret = -EBUSY; + else if (is_hugetlb_entry_hwpoisoned(old_entry)) + ret = -EHWPOISON; + else { + WARN_ONCE(1, "Unexpected case of non-present HugeTLB PTE\n"); + ret = -EINVAL; + } + goto skip; + } + + if (!hugetlb_pte_present_leaf(hpte, old_entry)) { + WARN_ONCE(1, "HugeTLB present PTE is not leaf\n"); + ret = -EAGAIN; + goto skip; + } + /* From now on old_entry is present leaf entry. */ + ret = hugetlb_split_to_shift_present_leaf(mm, vma, old_entry, + start, end, addr, + hpte->shift, + desired_shift); + if (ret) + goto abort; + + /* Splitting done, new page table entries successfully setup. */ + mmu_notifier_invalidate_range_end(&range); + return 0; +abort: + /* Splitting failed, restoring to the original page table state. */ + tlb_gather_mmu(&tlb, mm); + /* Decrement mapcount for all the split PTEs. */ + __unmap_hugepage_range(&tlb, vma, start, end, NULL, ZAP_FLAG_DROP_MARKER); + /* + * Free any newly allocated page table entries. + * Ok if no new entries allocated at all. + */ + hugetlb_free_pgd_range(&tlb, start, end, start, end); + /* Decrement refcount for all the split PTEs. */ + tlb_finish_mmu(&tlb); +skip: + /* Restore the old entry. 
*/ + ptl = hugetlb_pte_lock(hpte); + set_huge_pte_at(mm, start, hpte->ptep, old_entry); + spin_unlock(ptl); + mmu_notifier_invalidate_range_end(&range); + return ret; +} + #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ /*
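
For reference, a minimal sketch of the call pattern hugetlb_split_to_shift() is designed for. This is illustrative only and not part of the patch: the helper name split_addr_to_base_pages is made up, and it assumes the HGM helpers (struct hugetlb_pte, hugetlb_full_walk()) from the HGM series, HGM already enabled on the VMA, and the file mapping lock held in write mode as the comment above requires.

/*
 * Illustrative sketch, not part of this patch: split whatever leaf PTE
 * currently maps @addr down to PAGE_SIZE granularity. Assumes HGM is
 * already enabled on @vma and that the caller holds
 * vma->vm_file->f_mapping in write mode.
 */
static int split_addr_to_base_pages(struct vm_area_struct *vma,
				    unsigned long addr)
{
	struct hugetlb_pte hpte;
	int ret;

	i_mmap_assert_write_locked(vma->vm_file->f_mapping);

	/* Find the hugetlb PTE (PUD, PMD or PTE level) mapping @addr now. */
	ret = hugetlb_full_walk(&hpte, vma, addr);
	if (ret)
		return ret;

	/* Returns 0 immediately if @addr is already mapped at PAGE_SHIFT. */
	return hugetlb_split_to_shift(vma->vm_mm, vma, &hpte, addr,
				      PAGE_SHIFT);
}

On the 1G example in the comment above, such a call would first split the PUD into 512 2M PMDs and then split only the PMD covering @addr into 512 4K PTEs, leaving the rest of the hugepage mapped at 2M.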
From patchwork Fri Apr 28 00:41:34 2023
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 88428
Date: Fri, 28 Apr 2023 00:41:34 +0000
In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com>
References: <20230428004139.2899856-1-jiaqiyan@google.com>
Message-ID: <20230428004139.2899856-3-jiaqiyan@google.com>
Subject: [RFC PATCH v1 2/7] hugetlb: create PTE level mapping when possible
From: Jiaqi Yan
To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com
Cc: songmuchun@bytedance.com, duenwen@google.com, axelrasmussen@google.com,
 jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com,
 shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com,
 akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Jiaqi Yan

In memory_failure handling, for each VMA that the HWPOISON HugeTLB page is mapped into, enable HGM if eligible, then split the P*D-mapped hugepage into smaller PTEs. try_to_unmap still unmaps the entire hugetlb page, one PTE at a time, at levels smaller than the original P*D. For example, if a hugepage was originally mapped at PUD size, it will be split into PMDs and PTEs, and all of these PMDs and PTEs will be unmapped. The next commit will only unmap the raw HWPOISON PTE. For VMAs that are not HGM eligible, or where enabling HGM failed, or where splitting the hugepage mapping failed, the hugepage is still mapped by its original P*D and unmapped at that P*D.
Signed-off-by: Jiaqi Yan --- include/linux/hugetlb.h | 5 +++ mm/hugetlb.c | 27 ++++++++++++++++ mm/memory-failure.c | 68 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 100 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index d44bf6a794e5..03074b23c396 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -1266,6 +1266,7 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm, unsigned long end); int hugetlb_collapse(struct mm_struct *mm, unsigned long start, unsigned long end); +int hugetlb_enable_hgm_vma(struct vm_area_struct *vma); int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, struct hugetlb_pte *hpte, unsigned long addr, unsigned int desired_shift); @@ -1295,6 +1296,10 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start, { return -EINVAL; } +int hugetlb_enable_hgm_vma(struct vm_area_struct *vma) +{ + return -EINVAL; +} int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, const struct hugetlb_pte *hpte, unsigned long addr, unsigned int desired_shift)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d3f3f1c2d293..1419176b7e51 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -8203,6 +8203,33 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start, return ret; } +int hugetlb_enable_hgm_vma(struct vm_area_struct *vma) +{ + if (hugetlb_hgm_enabled(vma)) + return 0; + + if (!is_vm_hugetlb_page(vma)) { + pr_warn("VMA=[%#lx, %#lx) is not HugeTLB\n", + vma->vm_start, vma->vm_end); + return -EINVAL; + } + + if (!hugetlb_hgm_eligible(vma)) { + pr_warn("VMA=[%#lx, %#lx) is not HGM eligible\n", + vma->vm_start, vma->vm_end); + return -EINVAL; + } + + hugetlb_unshare_all_pmds(vma); + + /* + * TODO: add the ability to tell if HGM is enabled by kernel + * (for HWPOISON unmapping) or by userspace (via MADV_SPLIT). + */ + vm_flags_set(vma, VM_HUGETLB_HGM); + return 0; +} + /* * Find the optimal HugeTLB PTE shift that @desired_addr could be mapped at.
*/ diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 0b37cbc6e8ae..eb5579b6787e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1479,6 +1479,73 @@ static int get_hwpoison_page(struct page *p, unsigned long flags) return ret; } +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING +/* + * For each HGM-eligible VMA that the poisoned page mapped to, create new + * HGM mapping for hugepage @folio and make sure @poisoned_page is mapped + * by a PAGESIZE level PTE. Caller (hwpoison_user_mappings) must ensure + * 1. folio's address space (mapping) is locked in write mode. + * 2. folio is locked. + */ +static void try_to_split_huge_mapping(struct folio *folio, + struct page *poisoned_page) +{ + struct address_space *mapping = folio_mapping(folio); + pgoff_t pgoff_start; + pgoff_t pgoff_end; + struct vm_area_struct *vma; + unsigned long poisoned_addr; + unsigned long head_addr; + struct hugetlb_pte hpte; + + if (WARN_ON(!mapping)) + return; + + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + + pgoff_start = folio_pgoff(folio); + pgoff_end = pgoff_start + folio_nr_pages(folio) - 1; + + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) { + /* Enable HGM on HGM-eligible VMAs. */ + if (!hugetlb_hgm_eligible(vma)) + continue; + + i_mmap_assert_locked(vma->vm_file->f_mapping); + if (hugetlb_enable_hgm_vma(vma)) { + pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n", + vma->vm_start, vma->vm_end); + continue; + } + + poisoned_addr = vma_address(poisoned_page, vma); + head_addr = vma_address(folio_page(folio, 0), vma); + /* + * Get the hugetlb_pte of the PUD-mapped hugepage first, + * then split the PUD entry into PMD + PTE entries. + * + * Both getting original huge PTE and splitting requires write + * lock on vma->vm_file->f_mapping, which caller + * (e.g. hwpoison_user_mappings) should already acquired. + */ + if (hugetlb_full_walk(&hpte, vma, head_addr)) + continue; + + if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte, + poisoned_addr, PAGE_SHIFT)) { + pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n", + page_to_pfn(poisoned_page), poisoned_addr, + vma->vm_start, vma->vm_end); + } + } +} +#else +static void try_to_split_huge_mapping(struct folio *folio, + struct page *poisoned_page) +{ +} +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ + /* * Do all that is necessary to remove user space mappings. Unmap * the pages and send SIGBUS to the processes if the data was dirty. 
@@ -1555,6 +1622,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, */ mapping = hugetlb_page_mapping_lock_write(hpage); if (mapping) { + try_to_split_huge_mapping(folio, p); try_to_unmap(folio, ttu|TTU_RMAP_LOCKED); i_mmap_unlock_write(mapping); } else
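
To make the intended effect of this split-before-unmap step concrete, here is a hedged userspace illustration. It is not part of the series: it assumes the whole series is applied, that a free 2MB hugetlb page is available (vm.nr_hugepages), that HGM eligibility covers shared hugetlb mappings as in the HGM series, and that the caller has CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE so MADV_HWPOISON works. The expectation is that only the poisoned 4KiB base page becomes inaccessible while the rest of the hugepage stays mapped.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;	/* one 2MB hugetlb page */
	/* Shared mapping: HGM is only expected on MAP_SHARED hugetlb VMAs. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);		/* fault the hugepage in */

	/* Inject a memory error into the second 4KiB base page. */
	if (madvise(p + 4096, 4096, MADV_HWPOISON))
		return 1;

	/*
	 * Expected with this series: the 2MB mapping is split and only the
	 * poisoned base page is unmapped, so p[0] stays readable, while
	 * touching p[4096] would raise SIGBUS.
	 */
	printf("healthy byte: %d\n", p[0]);
	return 0;
}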
From patchwork Fri Apr 28 00:41:35 2023
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 88425
Date: Fri, 28 Apr 2023 00:41:35 +0000
In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com>
References: <20230428004139.2899856-1-jiaqiyan@google.com>
Message-ID: <20230428004139.2899856-4-jiaqiyan@google.com>
Subject: [RFC PATCH v1 3/7] mm: publish raw_hwp_page in mm.h
From: Jiaqi Yan
To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com
Cc: songmuchun@bytedance.com, duenwen@google.com, axelrasmussen@google.com,
 jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com,
 shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com,
 akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Jiaqi Yan

raw_hwp_page will be needed by HugeTLB to determine whether a raw subpage in a hugepage is poisoned, so that the subpage can either be unmapped or not be faulted in at the PAGE_SIZE PTE level.
Signed-off-by: Jiaqi Yan --- include/linux/mm.h | 16 ++++++++++++++++ mm/memory-failure.c | 13 ------------- 2 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 9d3216b4284a..4496d7bdd3ea 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3522,6 +3522,22 @@ enum mf_action_page_type { */ extern const struct attribute_group memory_failure_attr_group; +#ifdef CONFIG_HUGETLB_PAGE +/* + * Struct raw_hwp_page represents information about "raw error page", + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. + */ +struct raw_hwp_page { + struct llist_node node; + struct page *page; +}; + +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) +{ + return (struct llist_head *)&folio->_hugetlb_hwpoison; +} +#endif + #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) extern void clear_huge_page(struct page *page, unsigned long addr_hint,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index eb5579b6787e..48e62d04af17 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1826,19 +1826,6 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); #endif /* CONFIG_FS_DAX */ #ifdef CONFIG_HUGETLB_PAGE -/* - * Struct raw_hwp_page represents information about "raw error page", - * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
- */ -struct raw_hwp_page { - struct llist_node node; - struct page *page; -}; - -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) -{ - return (struct llist_head *)&folio->_hugetlb_hwpoison; -} static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) {
From patchwork Fri Apr 28 00:41:36 2023
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 88427
Date: Fri, 28 Apr 2023 00:41:36 +0000
In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com>
References: <20230428004139.2899856-1-jiaqiyan@google.com>
Message-ID: <20230428004139.2899856-5-jiaqiyan@google.com>
Subject: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible
From: Jiaqi Yan
To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com
Cc: songmuchun@bytedance.com, duenwen@google.com,
axelrasmussen@google.com, jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jiaqi Yan X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764379141655050197?= X-GMAIL-MSGID: =?utf-8?q?1764379141655050197?= When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps the raw HWPOISON page (previously split and mapped at PTE size). If HGM failed to be enabled on eligible VMA or splitting failed, try_to_unmap_one fails. For VMS that is not HGM eligible, try_to_unmap_one still unmaps the whole P*D. When only the raw HWPOISON subpage is unmapped but others keep mapped, the old way in memory_failure to check if unmapping successful doesn't work. So introduce is_unmapping_successful() to cover both existing and new unmapping behavior. For the new unmapping behavior, store how many times a raw HWPOISON page is expected to be unmapped, and how many times it is actually unmapped in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped from a VMA if splitting succeeded in try_to_split_huge_mapping(), so unmap_success = (nr_expected_unamps == nr_actual_unmaps). Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt recovery actions again because recovery used to be done on the entire hugepage. With the new unmapping behavior, this doesn't hold. More subpages in the hugepage can become corrupted, and needs to be recovered (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns 0 after adding a new raw subpage to raw_hwp_list. Unmapping raw HWPOISON page requires allocating raw_hwp_page successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb now may fail due to OOM. Signed-off-by: Jiaqi Yan --- include/linux/mm.h | 20 ++++++- mm/memory-failure.c | 140 ++++++++++++++++++++++++++++++++++++++------ mm/rmap.c | 38 +++++++++++- 3 files changed, 175 insertions(+), 23 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 4496d7bdd3ea..dc192f98cb1d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3522,20 +3522,38 @@ enum mf_action_page_type { */ extern const struct attribute_group memory_failure_attr_group; -#ifdef CONFIG_HUGETLB_PAGE /* * Struct raw_hwp_page represents information about "raw error page", * constructing singly linked list from ->_hugetlb_hwpoison field of folio. + * @node: the node in folio->_hugetlb_hwpoison list. + * @page: the raw HWPOISON page struct. + * @nr_vmas_mapped: the number of VMAs that map @page when detected. + * @nr_expected_unmaps: if a VMA that maps @page when detected is eligible + * for high granularity mapping, @page is expected to be unmapped. + * @nr_actual_unmaps: how many times the raw page is actually unmapped. 
*/ struct raw_hwp_page { struct llist_node node; struct page *page; + int nr_vmas_mapped; + int nr_expected_unmaps; + int nr_actual_unmaps; }; +#ifdef CONFIG_HUGETLB_PAGE static inline struct llist_head *raw_hwp_list_head(struct folio *folio) { return (struct llist_head *)&folio->_hugetlb_hwpoison; } + +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio, + struct page *subpage); +#else +static inline struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio, + struct page *subpage) +{ + return NULL; +} #endif #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 48e62d04af17..47b935918ceb 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1120,10 +1120,10 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p) } /* - * Huge pages. Needs work. - * Issues: - * - Error on hugepage is contained in hugepage unit (not in raw page unit.) - * To narrow down kill region to one page, we need to break up pmd. + * Huge pages. + * - Without HGM: Error on hugepage is contained in hugepage unit (not in + * raw page unit). + * - With HGM: Kill region is narrowed down to just one page. */ static int me_huge_page(struct page_state *ps, struct page *p) { @@ -1131,6 +1131,7 @@ static int me_huge_page(struct page_state *ps, struct page *p) struct page *hpage = compound_head(p); struct address_space *mapping; bool extra_pins = false; + struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(page_folio(p), p); if (!PageHuge(hpage)) return MF_DELAYED; @@ -1157,7 +1158,8 @@ static int me_huge_page(struct page_state *ps, struct page *p) } } - if (has_extra_refcount(ps, p, extra_pins)) + if (hwp_page->nr_expected_unmaps == 0 && + has_extra_refcount(ps, p, extra_pins)) res = MF_FAILED; return res; @@ -1497,24 +1499,30 @@ static void try_to_split_huge_mapping(struct folio *folio, unsigned long poisoned_addr; unsigned long head_addr; struct hugetlb_pte hpte; + struct raw_hwp_page *hwp_page = NULL; if (WARN_ON(!mapping)) return; VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + hwp_page = find_in_raw_hwp_list(folio, poisoned_page); + VM_BUG_ON_PAGE(!hwp_page, poisoned_page); + pgoff_start = folio_pgoff(folio); pgoff_end = pgoff_start + folio_nr_pages(folio) - 1; vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) { + ++hwp_page->nr_vmas_mapped; + /* Enable HGM on HGM-eligible VMAs. */ if (!hugetlb_hgm_eligible(vma)) continue; i_mmap_assert_locked(vma->vm_file->f_mapping); if (hugetlb_enable_hgm_vma(vma)) { - pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n", - vma->vm_start, vma->vm_end); + pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n", + page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end); continue; } @@ -1528,15 +1536,21 @@ static void try_to_split_huge_mapping(struct folio *folio, * lock on vma->vm_file->f_mapping, which caller * (e.g. hwpoison_user_mappings) should already acquired. 
*/ - if (hugetlb_full_walk(&hpte, vma, head_addr)) + if (hugetlb_full_walk(&hpte, vma, head_addr)) { + pr_err("%#lx: failed to PT-walk with HGM on eligible VMA=[%#lx, %#lx)\n", + page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end); continue; + } if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte, poisoned_addr, PAGE_SHIFT)) { - pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n", + pr_err("%#lx: Failed to split huge mapping: vaddr=%#lx in VMA=[%#lx, %#lx)\n", page_to_pfn(poisoned_page), poisoned_addr, vma->vm_start, vma->vm_end); + continue; } + + ++hwp_page->nr_expected_unmaps; } } #else @@ -1546,6 +1560,47 @@ static void try_to_split_huge_mapping(struct folio *folio, } #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ +static bool is_unmapping_successful(struct folio *folio, + struct page *poisoned_page) +{ + bool unmap_success = false; + struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(folio, poisoned_page); + + if (!folio_test_hugetlb(folio) || + folio_test_anon(folio) || + !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) { + unmap_success = folio_mapped(folio); + if (!unmap_success) + pr_err("%#lx: failed to unmap page (mapcount=%d)\n", + page_to_pfn(poisoned_page), + page_mapcount(folio_page(folio, 0))); + + return unmap_success; + } + + VM_BUG_ON_PAGE(!hwp_page, poisoned_page); + + /* + * Unmapping may not happen for some VMA: + * - HGM-eligible VMA but @poisoned_page is not faulted yet: nothing + * needs to be done at this point yet until page fault handling. + * - HGM-non-eliggible VMA: mapcount decreases by nr_subpages for each VMA, + * but not tracked so cannot tell if successfully unmapped from such VMA. + */ + if (hwp_page->nr_vmas_mapped != hwp_page->nr_expected_unmaps) + pr_info("%#lx: mapped by %d VMAs but %d unmappings are expected\n", + page_to_pfn(poisoned_page), hwp_page->nr_vmas_mapped, + hwp_page->nr_expected_unmaps); + + unmap_success = hwp_page->nr_expected_unmaps == hwp_page->nr_actual_unmaps; + + if (!unmap_success) + pr_err("%#lx: failed to unmap page (folio_mapcount=%d)\n", + page_to_pfn(poisoned_page), folio_mapcount(folio)); + + return unmap_success; +} + /* * Do all that is necessary to remove user space mappings. Unmap * the pages and send SIGBUS to the processes if the data was dirty. @@ -1631,10 +1686,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, try_to_unmap(folio, ttu); } - unmap_success = !page_mapped(hpage); - if (!unmap_success) - pr_err("%#lx: failed to unmap page (mapcount=%d)\n", - pfn, page_mapcount(hpage)); + unmap_success = is_unmapping_successful(folio, p); /* * try_to_unmap() might put mlocked page in lru cache, so call @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); #ifdef CONFIG_HUGETLB_PAGE +/* + * Given a HWPOISON @subpage as raw page, find its location in @folio's + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list. 
+ */ +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio, + struct page *subpage) +{ + struct llist_node *t, *tnode; + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); + struct raw_hwp_page *hwp_page = NULL; + struct raw_hwp_page *p; + + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage); + + llist_for_each_safe(tnode, t, raw_hwp_head->first) { + p = container_of(tnode, struct raw_hwp_page, node); + if (subpage == p->page) { + hwp_page = p; + break; + } + } + + return hwp_page; +} + static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) { struct llist_head *head; @@ -1837,6 +1914,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) llist_for_each_safe(tnode, t, head->first) { struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); + /* Ideally raw HWPoison pages are fully unmapped if possible. */ + WARN_ON(p->nr_expected_unmaps != p->nr_actual_unmaps); + if (move_flag) SetPageHWPoison(p->page); else @@ -1853,7 +1933,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page) struct llist_head *head; struct raw_hwp_page *raw_hwp; struct llist_node *t, *tnode; - int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0; + bool has_hwpoison = folio_test_set_hwpoison(folio); + bool hgm_enabled = IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING); /* * Once the hwpoison hugepage has lost reliable raw error info, @@ -1873,9 +1954,20 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page) raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC); if (raw_hwp) { raw_hwp->page = page; + raw_hwp->nr_vmas_mapped = 0; + raw_hwp->nr_expected_unmaps = 0; + raw_hwp->nr_actual_unmaps = 0; llist_add(&raw_hwp->node, head); + if (hgm_enabled) + /* + * A new raw poisoned page. Don't return + * HWPOISON. Error event will be counted + * in action_result(). + */ + return 0; + /* the first error event will be counted in action_result(). */ - if (ret) + if (has_hwpoison) num_poisoned_pages_inc(page_to_pfn(page)); } else { /* @@ -1889,8 +1981,16 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page) * used any more, so free it. */ __folio_free_raw_hwp(folio, false); + + /* + * HGM relies on raw_hwp allocated and inserted to raw_hwp_list. + */ + if (hgm_enabled) + return -ENOMEM; } - return ret; + + BUG_ON(hgm_enabled); + return has_hwpoison ? 
-EHWPOISON : 0; } static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag) @@ -1936,6 +2036,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, struct page *page = pfn_to_page(pfn); struct folio *folio = page_folio(page); int ret = 2; /* fallback to normal page handling */ + int set_page_hwpoison = 0; bool count_increased = false; if (!folio_test_hugetlb(folio)) @@ -1956,8 +2057,9 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, goto out; } - if (folio_set_hugetlb_hwpoison(folio, page)) { - ret = -EHWPOISON; + set_page_hwpoison = folio_set_hugetlb_hwpoison(folio, page); + if (set_page_hwpoison) { + ret = set_page_hwpoison; goto out; } @@ -2004,7 +2106,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb res = kill_accessing_process(current, folio_pfn(folio), flags); } return res; - } else if (res == -EBUSY) { + } else if (res == -EBUSY || res == -ENOMEM) { if (!(flags & MF_NO_RETRY)) { flags |= MF_NO_RETRY; goto retry; diff --git a/mm/rmap.c b/mm/rmap.c index d3bc81466902..4cfaa34b001e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1453,6 +1453,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; bool page_poisoned; + bool hgm_eligible = hugetlb_hgm_eligible(vma); + struct raw_hwp_page *hwp_page; /* * When racing against e.g. zap_pte_range() on another cpu, @@ -1525,6 +1527,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * in the case where the hugetlb page is poisoned. */ VM_BUG_ON_FOLIO(!page_poisoned, folio); + + /* + * When VMA is not HGM eligible, unmap at hugepage's + * original P*D. + * + * When HGM is eligible: + * - if original P*D is split to smaller P*Ds and + * PTEs, we skip subpage if it is not raw HWPoison + * page, or it was but was already unmapped. + * - if original P*D is not split, skip unmapping + * and memory_failure result will be MF_IGNORED. + */ + if (hgm_eligible) { + if (pvmw.pte_order > 0) + continue; + hwp_page = find_in_raw_hwp_list(folio, subpage); + if (hwp_page == NULL) + continue; + if (hwp_page->nr_expected_unmaps == + hwp_page->nr_actual_unmaps) + continue; + } + /* * huge_pmd_unshare may unmap an entire PMD page. 
* There is no way of knowing exactly which PMDs may @@ -1760,12 +1785,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * * See Documentation/mm/mmu_notifier.rst */ - if (folio_test_hugetlb(folio)) + if (!folio_test_hugetlb(folio)) + page_remove_rmap(subpage, vma, false); + else { hugetlb_remove_rmap(subpage, pvmw.pte_order + PAGE_SHIFT, hstate_vma(vma), vma); - else - page_remove_rmap(subpage, vma, false); + if (hgm_eligible) { + VM_BUG_ON_FOLIO(pvmw.pte_order > 0, folio); + VM_BUG_ON_FOLIO(!hwp_page, folio); + VM_BUG_ON_FOLIO(subpage != hwp_page->page, folio); + ++hwp_page->nr_actual_unmaps; + } + } if (vma->vm_flags & VM_LOCKED) mlock_drain_local();
From patchwork Fri Apr 28 00:41:37 2023
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 88426
Date: Fri, 28 Apr 2023 00:41:37 +0000
In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com>
References: <20230428004139.2899856-1-jiaqiyan@google.com>
Message-ID: <20230428004139.2899856-6-jiaqiyan@google.com>
Subject: [RFC PATCH v1 5/7] hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
From: Jiaqi Yan
To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com
Cc: songmuchun@bytedance.com, duenwen@google.com, axelrasmussen@google.com,
jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jiaqi Yan X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764379133722394828?= X-GMAIL-MSGID: =?utf-8?q?1764379133722394828?= Memory raw pages can become HWPOISON between when userspace maps a hugepage and when userspace faults in the hugepage. Today when hugetlb faults somewhere in a hugepage containing HWPOISON raw pages, the result is a VM_FAULT_HWPOISON_LARGE. This commit teaches hugetlb page fault handler to only VM_FAULT_HWPOISON_LARGE if the faulting address is within HWPOISON raw page; otherwise, fault handler can continue to fault in healthy raw pages. Signed-off-by: Jiaqi Yan --- include/linux/mm.h | 2 + mm/hugetlb.c | 129 ++++++++++++++++++++++++++++++++++++++++++-- mm/memory-failure.c | 1 + 3 files changed, 127 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index dc192f98cb1d..7caa4530953f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3531,6 +3531,7 @@ extern const struct attribute_group memory_failure_attr_group; * @nr_expected_unmaps: if a VMA that maps @page when detected is eligible * for high granularity mapping, @page is expected to be unmapped. * @nr_actual_unmaps: how many times the raw page is actually unmapped. + * @index: index of the poisoned subpage in the folio. */ struct raw_hwp_page { struct llist_node node; @@ -3538,6 +3539,7 @@ struct raw_hwp_page { int nr_vmas_mapped; int nr_expected_unmaps; int nr_actual_unmaps; + unsigned long index; }; #ifdef CONFIG_HUGETLB_PAGE diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1419176b7e51..f8ddf04ae0c4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6158,6 +6158,30 @@ static struct folio *hugetlb_try_find_lock_folio(struct address_space *mapping, return folio; } +static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm, + struct vm_area_struct *vma, + struct folio *folio, + unsigned long address, + struct hugetlb_pte *hpte, + unsigned int flags); + +#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING +static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm, + struct vm_area_struct *vma, + struct folio *folio, + unsigned long address, + struct hugetlb_pte *hpte, + unsigned int flags) +{ + if (unlikely(folio_test_hwpoison(folio))) { + return VM_FAULT_HWPOISON_LARGE | + VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma))); + } + + return 0; +} +#endif + static vm_fault_t hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, struct address_space *mapping, pgoff_t idx, @@ -6287,13 +6311,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, /* * If memory error occurs between mmap() and fault, some process * don't have hwpoisoned swap entry for errored virtual address. - * So we need to block hugepage fault by PG_hwpoison bit check. 
+ * So we need to block hugepage fault by hwpoison check: + * - without HGM, the check is based on PG_hwpoison + * - with HGM, check if the raw page for address is poisoned */ - if (unlikely(folio_test_hwpoison(folio))) { - ret = VM_FAULT_HWPOISON_LARGE | - VM_FAULT_SET_HINDEX(hstate_index(h)); + ret = hugetlb_no_page_hwpoison(mm, vma, folio, address, hpte, flags); + if (unlikely(ret)) goto backout_unlocked; - } /* Check for page in userfault range. */ if (userfaultfd_minor(vma)) { @@ -8426,6 +8450,11 @@ int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, * the allocated PTEs created before splitting fails. */ + /* + * For none and UFFD_WP marker PTEs, given try_to_unmap_one doesn't + * unmap them, delay the splitting until page fault happens. See the + * hugetlb_no_page_hwpoison check in hugetlb_no_page. + */ if (unlikely(huge_pte_none_mostly(old_entry))) { ret = -EAGAIN; goto skip; @@ -8479,6 +8508,96 @@ int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma, return ret; } +/* + * Given a hugetlb PTE, if we want to split it into its next smaller level + * PTE, return what size we should use to do HGM walk with allocations. + * If given hugetlb PTE is already at smallest PAGESIZE, returns -EINVAL. + */ +static int hgm_next_size(struct vm_area_struct *vma, struct hugetlb_pte *hpte) +{ + struct hstate *h = hstate_vma(vma), *tmp_h; + unsigned int shift; + unsigned long curr_size = hugetlb_pte_size(hpte); + unsigned long next_size; + + for_each_hgm_shift(h, tmp_h, shift) { + next_size = 1UL << shift; + if (next_size < curr_size) + return next_size; + } + + return -EINVAL; +} + +/* + * Check if address is in the range of a HWPOISON raw page. + * During checking hugetlb PTE may be split into smaller hguetlb PTEs. + */ +static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm, + struct vm_area_struct *vma, + struct folio *folio, + unsigned long address, + struct hugetlb_pte *hpte, + unsigned int flags) +{ + unsigned long range_start, range_end; + unsigned long start_index, end_index; + unsigned long folio_start = vma_address(folio_page(folio, 0), vma); + struct llist_node *t, *tnode; + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); + struct raw_hwp_page *p = NULL; + bool contain_hwpoison = false; + int hgm_size; + int hgm_ret = 0; + + if (likely(!folio_test_hwpoison(folio))) + return 0; + + if (hugetlb_enable_hgm_vma(vma)) + return VM_FAULT_HWPOISON_LARGE | + VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma))); + +recheck: + range_start = address & hugetlb_pte_mask(hpte); + range_end = range_start + hugetlb_pte_size(hpte); + start_index = (range_start - folio_start) / PAGE_SIZE; + end_index = start_index + hugetlb_pte_size(hpte) / PAGE_SIZE; + + contain_hwpoison = false; + llist_for_each_safe(tnode, t, raw_hwp_head->first) { + p = container_of(tnode, struct raw_hwp_page, node); + if (start_index <= p->index && p->index < end_index) { + contain_hwpoison = true; + break; + } + } + + if (!contain_hwpoison) + return 0; + + if (hugetlb_pte_size(hpte) == PAGE_SIZE) + return VM_FAULT_HWPOISON; + + /* + * hugetlb_fault already ensured hugetlb_vma_lock_read. + * We also checked hugetlb_pte_size(hpte) != PAGE_SIZE, + * so hgm_size must be something meaningful to HGM. + */ + hgm_size = hgm_next_size(vma, hpte); + VM_BUG_ON(hgm_size == -EINVAL); + hgm_ret = hugetlb_full_walk_alloc(hpte, vma, address, hgm_size); + if (hgm_ret) { + WARN_ON_ONCE(hgm_ret); + /* + * When splitting using HGM fails, return like + * HGM is not eligible or enabled. 
+ */ + return VM_FAULT_HWPOISON_LARGE | + VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma))); + } + goto recheck; +} + #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */ /* diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 47b935918ceb..9093ba53feed 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1957,6 +1957,7 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page) raw_hwp->nr_vmas_mapped = 0; raw_hwp->nr_expected_unmaps = 0; raw_hwp->nr_actual_unmaps = 0; + raw_hwp->index = folio_page_idx(folio, page); llist_add(&raw_hwp->node, head); if (hgm_enabled) /*
From patchwork Fri Apr 28 00:41:38 2023 X-Patchwork-Submitter: Jiaqi Yan X-Patchwork-Id: 88422
Date: Fri, 28 Apr 2023 00:41:38 +0000 In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com> References: <20230428004139.2899856-1-jiaqiyan@google.com> Message-ID: <20230428004139.2899856-7-jiaqiyan@google.com> Subject: [RFC PATCH v1 6/7] selftest/mm: test PAGESIZE unmapping HWPOISON pages From: Jiaqi Yan To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com Cc: songmuchun@bytedance.com, duenwen@google.com, axelrasmussen@google.com,
jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jiaqi Yan X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764378759165068722?= X-GMAIL-MSGID: =?utf-8?q?1764378759165068722?= After injecting memory errors to byte addresses inside HugeTLB page, the updated test checks 1. only a raw page is unmapped, and userspace gets correct SIGBUS from kernel. 2. other subpages in the same hugepage are still mapped and data not corrupted. Signed-off-by: Jiaqi Yan --- tools/testing/selftests/mm/hugetlb-hgm.c | 194 +++++++++++++++++++---- 1 file changed, 167 insertions(+), 27 deletions(-) diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c index c0ba6ad44005..bc9529986b66 100644 --- a/tools/testing/selftests/mm/hugetlb-hgm.c +++ b/tools/testing/selftests/mm/hugetlb-hgm.c @@ -39,6 +39,10 @@ #define MADV_SPLIT 26 #endif +#ifndef NUM_HWPOISON_PAGES +#define NUM_HWPOISON_PAGES 3UL +#endif + #define PREFIX " ... " #define ERROR_PREFIX " !!! " @@ -241,6 +245,9 @@ static int test_sigbus(char *addr, bool poison) sigbus_addr, addr); else if (poison && !was_mceerr) printf(ERROR_PREFIX "didn't get an MCEERR?\n"); + else if (!poison && was_mceerr) + printf(ERROR_PREFIX "got BUS_MCEERR_AR sigbus on expected healthy address: %p\n", + sigbus_addr); else ret = 0; out: @@ -272,43 +279,176 @@ static int read_event_from_uffd(int *uffd, pthread_t *pthread) return 0; } -static int test_sigbus_range(char *primary_map, size_t len, bool hwpoison) +struct range_exclude_pages { + /* Starting address of the buffer. */ + char *mapping; + /* Length of the buffer in bytes. */ + size_t length; + /* The value that each byte in buffer should equal to. */ + char value; + /* + * PAGESIZE aligned addresses excluded from the checking, + * e.g. if PAGE_SIZE=4k, for each addr in excludes, + * skips checking on [addr, addr + 4096). 
+ */ + unsigned long excluded[NUM_HWPOISON_PAGES]; +}; + +static int check_range_exclude_pages(struct range_exclude_pages *range) +{ + const unsigned long pagesize = getpagesize(); + unsigned long excluded_index; + unsigned long page_index; + bool should_skip; + size_t i = 0; + size_t j = 0; + + while (i < range->length) { + page_index = ((unsigned long)(range->mapping + i)) / pagesize; + should_skip = false; + for (j = 0; j < NUM_HWPOISON_PAGES; ++j) { + excluded_index = range->excluded[j] / pagesize; + if (page_index == excluded_index) { + should_skip = true; + break; + } + } + if (should_skip) { + printf(PREFIX "skip excluded addr range [%#lx, %#lx)\n", + (unsigned long)(range->mapping + i), + (unsigned long)(range->mapping + i + pagesize)); + i += pagesize; + continue; + } + if (range->mapping[i] != range->value) { + printf(ERROR_PREFIX "mismatch at %p (%d != %d)\n", + &range->mapping[i], range->mapping[i], range->value); + return -1; + } + ++i; + } + + return 0; +} + +enum test_status verify_raw_pages(char *map, size_t len, + unsigned long excluded[NUM_HWPOISON_PAGES]) { const unsigned long pagesize = getpagesize(); - const int num_checks = 512; - unsigned long bytes_per_check = len/num_checks; - int i; + unsigned long size, offset, value; + size_t j = 0; + + for (size = len / 2, offset = 0, value = 1; size > pagesize; + offset += size, size /= 2, ++value) { + struct range_exclude_pages range = { + .mapping = map + offset, + .length = size, + .value = value, + }; + for (j = 0; j < NUM_HWPOISON_PAGES; ++j) + range.excluded[j] = excluded[j]; + + printf(PREFIX "checking non-poisoned range [%p, %p) " + "(len=%#lx) per-byte value=%lu\n", + range.mapping, range.mapping + range.length, + range.length, value); + if (check_range_exclude_pages(&range)) + return TEST_FAILED; + + printf(PREFIX PREFIX "good\n"); + } - printf(PREFIX "checking that we can't access " - "(%d addresses within %p -> %p)\n", - num_checks, primary_map, primary_map + len); + return TEST_PASSED; +} - if (pagesize > bytes_per_check) - bytes_per_check = pagesize; +static int read_hwpoison_pages(unsigned long *nr_hwp_pages) +{ + const unsigned long pagesize = getpagesize(); + char buffer[256] = {0}; + char *cmd = "cat /proc/meminfo | grep -i HardwareCorrupted | grep -o '[0-9]*'"; + FILE *cmdfile = popen(cmd, "r"); - for (i = 0; i < len; i += bytes_per_check) - if (test_sigbus(primary_map + i, hwpoison) < 0) - return 1; - /* check very last byte, because we left it unmapped */ - if (test_sigbus(primary_map + len - 1, hwpoison)) - return 1; + if (!(fgets(buffer, sizeof(buffer), cmdfile))) { + perror("failed to read HardwareCorrupted from /proc/meminfo\n"); + return -1; + } + pclose(cmdfile); + *nr_hwp_pages = atoll(buffer) * 1024 / pagesize; return 0; } -static enum test_status test_hwpoison(char *primary_map, size_t len) +static enum test_status test_hwpoison_one_raw_page(char *hwpoison_addr) { - printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len); - if (madvise(primary_map, len, MADV_HWPOISON) < 0) { + const unsigned long pagesize = getpagesize(); + + printf(PREFIX "poisoning [%p, %p) (len=%#lx)\n", + hwpoison_addr, hwpoison_addr + pagesize, pagesize); + if (madvise(hwpoison_addr, pagesize, MADV_HWPOISON) < 0) { perror(ERROR_PREFIX "MADV_HWPOISON failed"); return TEST_SKIPPED; } - return test_sigbus_range(primary_map, len, true) - ? 
TEST_FAILED : TEST_PASSED; + printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx)\n", + hwpoison_addr, hwpoison_addr + pagesize, pagesize); + if (test_sigbus(hwpoison_addr, true) < 0) + return TEST_FAILED; + + return TEST_PASSED; } -static int test_fork(int uffd, char *primary_map, size_t len) +static enum test_status test_hwpoison_present(char *map, size_t len, + bool already_injected) +{ + const unsigned long pagesize = getpagesize(); + const unsigned long hwpoison_next = 128; + unsigned long nr_hwpoison_pages_before, nr_hwpoison_pages_after; + enum test_status ret; + size_t i; + char *hwpoison_addr = map; + unsigned long hwpoison_addrs[NUM_HWPOISON_PAGES]; + + if (hwpoison_next * (NUM_HWPOISON_PAGES - 1) >= (len / pagesize)) { + printf(ERROR_PREFIX "max hwpoison_addr out of range"); + return TEST_SKIPPED; + } + + for (i = 0; i < NUM_HWPOISON_PAGES; ++i) { + hwpoison_addrs[i] = (unsigned long)hwpoison_addr; + hwpoison_addr += hwpoison_next * pagesize; + } + + if (already_injected) + return verify_raw_pages(map, len, hwpoison_addrs); + + if (read_hwpoison_pages(&nr_hwpoison_pages_before)) { + printf(ERROR_PREFIX "check #HWPOISON pages\n"); + return TEST_SKIPPED; + } + printf(PREFIX "Before injections, #HWPOISON pages = %ld\n", nr_hwpoison_pages_before); + + for (i = 0; i < NUM_HWPOISON_PAGES; ++i) { + ret = test_hwpoison_one_raw_page((char *)hwpoison_addrs[i]); + if (ret != TEST_PASSED) + return ret; + } + + if (read_hwpoison_pages(&nr_hwpoison_pages_after)) { + printf(ERROR_PREFIX "check #HWPOISON pages\n"); + return TEST_SKIPPED; + } + printf(PREFIX "After injections, #HWPOISON pages = %ld\n", nr_hwpoison_pages_after); + + if (nr_hwpoison_pages_after - nr_hwpoison_pages_before != NUM_HWPOISON_PAGES) { + printf(ERROR_PREFIX "delta #HWPOISON pages != %ld", + NUM_HWPOISON_PAGES); + return TEST_FAILED; + } + + return verify_raw_pages(map, len, hwpoison_addrs); +} + +int test_fork(int uffd, char *primary_map, size_t len) { int status; int ret = 0; @@ -360,7 +500,6 @@ static int test_fork(int uffd, char *primary_map, size_t len) pthread_join(uffd_thd, NULL); return ret; - } static int uffd_register(int uffd, char *primary_map, unsigned long len, @@ -394,6 +533,7 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type) bool uffd_wp = type == TEST_UFFDWP; bool verify = type == TEST_DEFAULT; int register_args; + enum test_status hwp_status = TEST_SKIPPED; if (ftruncate(fd, len) < 0) { perror(ERROR_PREFIX "ftruncate failed"); @@ -489,10 +629,10 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type) * mapping. */ if (hwpoison) { - enum test_status new_status = test_hwpoison(primary_map, len); - - if (new_status != TEST_PASSED) { - status = new_status; + /* test_hwpoison can fail with TEST_SKIPPED. */ + hwp_status = test_hwpoison_present(primary_map, len, false); + if (hwp_status != TEST_PASSED) { + status = hwp_status; goto done; } } @@ -539,7 +679,7 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type) /* * Verify that memory is still poisoned. 
*/ - if (hwpoison && test_sigbus_range(primary_map, len, true)) + if (hwpoison && test_hwpoison_present(primary_map, len, true)) goto done; status = TEST_PASSED;
From patchwork Fri Apr 28 00:41:39 2023 X-Patchwork-Submitter: Jiaqi Yan X-Patchwork-Id: 88424
Date: Fri, 28 Apr 2023 00:41:39 +0000 In-Reply-To: <20230428004139.2899856-1-jiaqiyan@google.com> References: <20230428004139.2899856-1-jiaqiyan@google.com> Message-ID: <20230428004139.2899856-8-jiaqiyan@google.com> Subject: [RFC PATCH v1 7/7] selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages From: Jiaqi Yan To: mike.kravetz@oracle.com, peterx@redhat.com, naoya.horiguchi@nec.com Cc: songmuchun@bytedance.com, duenwen@google.com,
axelrasmussen@google.com, jthoughton@google.com, rientjes@google.com, linmiaohe@huawei.com, shy828301@gmail.com, baolin.wang@linux.alibaba.com, wangkefeng.wang@huawei.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jiaqi Yan X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764378999251829887?= X-GMAIL-MSGID: =?utf-8?q?1764378999251829887?= For not-yet-faulted hugepage containing HWPOISON raw page, test 1. only HWPOISON raw page will not be faulted, and a BUS_MCEERR_AR SIGBUS will be sent to userspace. 2. healthy raw pages are faulted in as normal. Since the hugepage has been writeprotect by UFFD, non BUS_MCEERR_AR SIGBUS will be sent to userspace. Signed-off-by: Jiaqi Yan --- tools/testing/selftests/mm/hugetlb-hgm.c | 170 +++++++++++++++++++++++ 1 file changed, 170 insertions(+) diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c index bc9529986b66..81ee2d99fea8 100644 --- a/tools/testing/selftests/mm/hugetlb-hgm.c +++ b/tools/testing/selftests/mm/hugetlb-hgm.c @@ -515,6 +515,169 @@ static int uffd_register(int uffd, char *primary_map, unsigned long len, return ioctl(uffd, UFFDIO_REGISTER, ®); } +static int setup_present_map(char *present_map, size_t len) +{ + size_t offset = 0; + unsigned char iter = 0; + unsigned long pagesize = getpagesize(); + uint64_t size; + + for (size = len/2; size >= pagesize; + offset += size, size /= 2) { + iter++; + memset(present_map + offset, iter, size); + } + return 0; +} + +static enum test_status test_hwpoison_absent_uffd_wp(int fd, size_t hugepagesize, size_t len) +{ + int uffd; + char *absent_map, *present_map; + struct uffdio_api api; + int register_args; + struct sigaction new, old; + enum test_status status = TEST_SKIPPED; + const unsigned long pagesize = getpagesize(); + const unsigned long hwpoison_index = 128; + char *hwpoison_addr; + + if (hwpoison_index >= (len / pagesize)) { + printf(ERROR_PREFIX "hwpoison_index out of range"); + return TEST_FAILED; + } + + if (ftruncate(fd, len) < 0) { + perror(ERROR_PREFIX "ftruncate failed"); + return TEST_FAILED; + } + + uffd = userfaultfd(O_CLOEXEC); + if (uffd < 0) { + perror(ERROR_PREFIX "uffd not created"); + return TEST_FAILED; + } + + absent_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (absent_map == MAP_FAILED) { + perror(ERROR_PREFIX "mmap for ABSENT mapping failed"); + goto close_uffd; + } + printf(PREFIX "ABSENT mapping: %p\n", absent_map); + + api.api = UFFD_API; + api.features = UFFD_FEATURE_SIGBUS | UFFD_FEATURE_EXACT_ADDRESS | + UFFD_FEATURE_EVENT_FORK; + if (ioctl(uffd, UFFDIO_API, &api) == -1) { + perror(ERROR_PREFIX "UFFDIO_API failed"); + goto unmap_absent; + } + + /* + * Register with UFFDIO_REGISTER_MODE_WP to have UFFD WP bit on + * the HugeTLB page table entry. 
+ */ + register_args = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP; + if (uffd_register(uffd, absent_map, len, register_args)) { + perror(ERROR_PREFIX "UFFDIO_REGISTER failed"); + goto unmap_absent; + } + + new.sa_sigaction = &sigbus_handler; + new.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &new, &old) < 0) { + perror(ERROR_PREFIX "could not setup SIGBUS handler"); + goto unmap_absent; + } + + /* + * Set WP markers to the absent huge mapping. With HGM enabled in + * kernel CONFIG, memory_failure will enabled HGM in kernel, + * so no need to enable HGM from userspace. + */ + if (userfaultfd_writeprotect(uffd, absent_map, len, true) < 0) { + status = TEST_FAILED; + goto unmap_absent; + } + + status = TEST_PASSED; + + /* + * With MAP_SHARED hugetlb memory, we cna inject memory error to + * not-yet-faulted mapping (absent_map) by injecting memory error + * to a already faulted mapping (present_map). + */ + present_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (present_map == MAP_FAILED) { + perror(ERROR_PREFIX "mmap for non present mapping failed"); + goto close_uffd; + } + printf(PREFIX "PRESENT mapping: %p\n", present_map); + setup_present_map(present_map, len); + + hwpoison_addr = present_map + hwpoison_index * pagesize; + if (madvise(hwpoison_addr, pagesize, MADV_HWPOISON)) { + perror(PREFIX "MADV_HWPOISON a page in PRESENT mapping failed"); + status = TEST_FAILED; + goto unmap_present; + } + + printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx) in PRESENT mapping\n", + hwpoison_addr, hwpoison_addr + pagesize, pagesize); + if (test_sigbus(hwpoison_addr, true) < 0) { + status = TEST_FAILED; + goto done; + } + printf(PREFIX "checking healthy pages in PRESENT mapping\n"); + unsigned long hwpoison_addrs[] = { + (unsigned long)hwpoison_addr, + (unsigned long)hwpoison_addr, + (unsigned long)hwpoison_addr + }; + status = verify_raw_pages(present_map, len, hwpoison_addrs); + if (status != TEST_PASSED) { + printf(ERROR_PREFIX "checking healthy pages failed\n"); + goto done; + } + + for (int i = 0; i < len; i += pagesize) { + if (i == hwpoison_index * pagesize) { + printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx) in ABSENT mapping\n", + absent_map + i, absent_map + i + pagesize, pagesize); + if (test_sigbus(absent_map + i, true) < 0) { + status = TEST_FAILED; + break; + } + } else { + /* + * With UFFD_FEATURE_SIGBUS, we should get a SIGBUS for + * every not faulted (non present) page/byte. + */ + if (test_sigbus(absent_map + i, false) < 0) { + printf(PREFIX "checking healthy range [%p, %p) (len=%#lx) in ABSENT mapping failed\n", + absent_map + i, absent_map + i + pagesize, pagesize); + status = TEST_FAILED; + break; + } + } + } +done: + if (ftruncate(fd, 0) < 0) { + perror(ERROR_PREFIX "ftruncate back to 0 failed"); + status = TEST_FAILED; + } +unmap_present: + printf(PREFIX "Unmap PRESENT mapping=%p\n", absent_map); + munmap(present_map, len); +unmap_absent: + printf(PREFIX "Unmap ABSENT mapping=%p\n", absent_map); + munmap(absent_map, len); +close_uffd: + printf(PREFIX "Close UFFD\n"); + close(uffd); + return status; +} + enum test_type { TEST_DEFAULT, TEST_UFFDWP, @@ -744,6 +907,13 @@ int main(void) printf("HGM hwpoison test: %s\n", status_to_str(status)); if (status == TEST_FAILED) ret = -1; + + printf("HGM hwpoison UFFD-WP marker test...\n"); + status = test_hwpoison_absent_uffd_wp(fd, hugepagesize, len); + printf("HGM hwpoison UFFD-WP marker test: %s\n", + status_to_str(status)); + if (status == TEST_FAILED) + ret = -1; close: close(fd);
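For reference, both selftests above hinge on telling an action-required memory-error SIGBUS (si_code == BUS_MCEERR_AR, raised when a poisoned raw page is touched) apart from an ordinary SIGBUS (for example the one UFFD_FEATURE_SIGBUS raises on a not-yet-faulted page). The following is a minimal, self-contained sketch of that probing pattern; it is not part of the patch series, and the probe_address() helper name is illustrative only.

#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

static sigjmp_buf recover_env;
static void *volatile fault_addr;
static volatile bool fault_was_mceerr;

/* SIGBUS handler: record where the fault hit and whether it was a memory error. */
static void sigbus_probe_handler(int signo, siginfo_t *info, void *ucontext)
{
	fault_addr = info->si_addr;
	/* BUS_MCEERR_AR: action-required machine-check error on this address. */
	fault_was_mceerr = (info->si_code == BUS_MCEERR_AR);
	siglongjmp(recover_env, 1);
}

/*
 * Touch one byte at @addr. Returns 0 when the observed behavior matches
 * @expect_poison: a BUS_MCEERR_AR SIGBUS on a poisoned page, and either no
 * signal or a non-MCEERR SIGBUS (e.g. from UFFD_FEATURE_SIGBUS) otherwise.
 */
static int probe_address(char *addr, bool expect_poison)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_probe_handler,
		.sa_flags = SA_SIGINFO,
	};

	if (sigaction(SIGBUS, &sa, NULL) < 0)
		return -1;

	if (sigsetjmp(recover_env, 1) == 0) {
		*(volatile char *)addr;		/* may raise SIGBUS */
		return expect_poison ? -1 : 0;	/* no signal was delivered */
	}

	if (fault_addr != addr)
		return -1;			/* faulted at an unexpected address */
	return (fault_was_mceerr == expect_poison) ? 0 : -1;
}

A caller would probe the injected page with expect_poison=true and the surrounding healthy pages with expect_poison=false, mirroring what test_sigbus() and test_hwpoison_present() do in the selftest.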