Message ID | 20221019170835.155381-1-tony.luck@intel.com
---|---
State | New |
Series | [v2] mm, hwpoison: Try to recover from copy-on write faults
Commit Message
Luck, Tony
Oct. 19, 2022, 5:08 p.m. UTC
If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.
It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.
I wrapped that neatly into a test at:
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
just enable ACPI error injection and run:
# ./einj_mem-uc -f copy-on-write
Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to the caller of wp_page_copy(). This propagates up the call stack. Both x86
and powerpc have code in their fault handlers to handle this return code by
sending a SIGBUS to the application.
Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that, a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it holds mmap_lock. Perhaps the architecture fault handlers
can deal with this loose end in a subsequent patch?
On Intel/x86 this loose end will often be handled automatically, because
the memory controller provides an additional notification of the h/w
poison in memory and the handler for that notification will call
memory_failure(). This isn't a 100% solution: if there are multiple
errors, not all may be logged in this way.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes in V2:
Naoya Horiguchi:
1) Use -EHWPOISON error code instead of minus one.
2) Poison path needs also to deal with old_page
Tony Luck:
Rewrote commit message
Added some powerpc folks to Cc: list
---
include/linux/highmem.h | 19 +++++++++++++++++++
mm/memory.c | 28 +++++++++++++++++++---------
2 files changed, 38 insertions(+), 9 deletions(-)
Comments
Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? > > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Signed-off-by: Tony Luck <tony.luck@intel.com> Just some minor comments below, but you can add: Reviewed-by: Dan Williams <dan.j.williams@intel.com> > > --- > Changes in V2: > Naoya Horiguchi: > 1) Use -EHWPOISON error code instead of minus one. 
> 2) Poison path needs also to deal with old_page > Tony Luck: > Rewrote commit message > Added some powerpc folks to Cc: list > --- > include/linux/highmem.h | 19 +++++++++++++++++++ > mm/memory.c | 28 +++++++++++++++++++--------- > 2 files changed, 38 insertions(+), 9 deletions(-) > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h > index e9912da5441b..5967541fbf0e 100644 > --- a/include/linux/highmem.h > +++ b/include/linux/highmem.h > @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, > > #endif > > +static inline int copy_user_highpage_mc(struct page *to, struct page *from, > + unsigned long vaddr, struct vm_area_struct *vma) > +{ > + unsigned long ret = 0; > +#ifdef copy_mc_to_kernel > + char *vfrom, *vto; > + > + vfrom = kmap_local_page(from); > + vto = kmap_local_page(to); > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); > + kunmap_local(vto); > + kunmap_local(vfrom); > +#else > + copy_user_highpage(to, from, vaddr, vma); > +#endif > + > + return ret; > +} > + There is likely some small benefit of doing this the idiomatic way and let grep see that there are multiple definitions of copy_user_highpage_mc() with an organization like: #ifdef copy_mc_to_kernel static inline int copy_user_highpage_mc(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) { unsigned long ret = 0; char *vfrom, *vto; vfrom = kmap_local_page(from); vto = kmap_local_page(to); ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); kunmap_local(vto); kunmap_local(vfrom); return ret; } #else static inline int copy_user_highpage_mc(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) { copy_user_highpage(to, from, vaddr, vma); return 0; } #endif Per the copy_mc* discussion with Linus I would have called this function copy_mc_to_user_highpage() to clarify that hwpoison is handled from the source buffer of the copy. 
> #ifndef __HAVE_ARCH_COPY_HIGHPAGE > > static inline void copy_highpage(struct page *to, struct page *from) > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..a32556c9b689 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > - struct vm_fault *vmf) > +/* > + * Return: > + * -EHWPOISON: copy failed due to hwpoison in source page > + * 0: copied failed (some other reason) > + * 1: copied succeeded > + */ > +static inline int __wp_page_copy_user(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > bool ret; > void *kaddr; > @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - copy_user_highpage(dst, src, addr, vma); > - return true; > + if (copy_user_highpage_mc(dst, src, addr, vma)) > + return -EHWPOISON; Given there is no use case for the residue value returned by copy_mc_to_kernel() perhaps just return EHWPOISON directly from copyuser_highpage_mc() in the short-copy case? > + return 1; > } > > /* > @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > * and update local tlb only > */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; What do you think about just making these 'false' cases also return a negative errno? (rationale below...) 
> goto pte_unlock; > } > > @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { > /* The PTE changed under us, update local tlb */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > - ret = true; > + ret = 1; > > pte_unlock: > if (locked) > @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > pte_t entry; > int page_copied = 0; > struct mmu_notifier_range range; > + int ret; > > delayacct_wpcopy_start(); > > @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (!new_page) > goto oom; > > - if (!__wp_page_copy_user(new_page, old_page, vmf)) { > + ret = __wp_page_copy_user(new_page, old_page, vmf); > + if (ret <= 0) { ...this would become a typical '0 == success' and 'negative errno == failure', where all but EHWPOISON are retried. > /* > * COW failed, if the fault was solved by other, > * it's fine. If not, userspace would re-fault on > * the same address and we will handle the fault > * from the second attempt. > + * The -EHWPOISON case will not be retried. > */ > put_page(new_page); > if (old_page) > put_page(old_page); > > delayacct_wpcopy_end(); > - return 0; > + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
> Given there is no use case for the residue value returned by
> copy_mc_to_kernel() perhaps just return EHWPOISON directly from
> copyuser_highpage_mc() in the short-copy case?

I don't think it hurts to keep the return value as a residue count. It isn't making that code any more complex, and could be useful someday.

Other feedback looks good and I have applied it, ready for the next version. Thanks for the review.

-Tony
在 2022/10/20 AM1:08, Tony Luck 写道: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. Does it send SIGBUS to only child process or both parent and child process? > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. If the error page is not poisoned, should the return value of wp_page_copy be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. Thanks. Best Regards, Shuai > But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? 
> > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > > --- > Changes in V2: > Naoya Horiguchi: > 1) Use -EHWPOISON error code instead of minus one. > 2) Poison path needs also to deal with old_page > Tony Luck: > Rewrote commit message > Added some powerpc folks to Cc: list > --- > include/linux/highmem.h | 19 +++++++++++++++++++ > mm/memory.c | 28 +++++++++++++++++++--------- > 2 files changed, 38 insertions(+), 9 deletions(-) > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h > index e9912da5441b..5967541fbf0e 100644 > --- a/include/linux/highmem.h > +++ b/include/linux/highmem.h > @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, > > #endif > > +static inline int copy_user_highpage_mc(struct page *to, struct page *from, > + unsigned long vaddr, struct vm_area_struct *vma) > +{ > + unsigned long ret = 0; > +#ifdef copy_mc_to_kernel > + char *vfrom, *vto; > + > + vfrom = kmap_local_page(from); > + vto = kmap_local_page(to); > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); > + kunmap_local(vto); > + kunmap_local(vfrom); > +#else > + copy_user_highpage(to, from, vaddr, vma); > +#endif > + > + return ret; > +} > + > #ifndef __HAVE_ARCH_COPY_HIGHPAGE > > static inline void copy_highpage(struct page *to, struct page *from) > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..a32556c9b689 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > - struct vm_fault *vmf) > +/* > + * Return: > + * -EHWPOISON: copy failed due 
to hwpoison in source page > + * 0: copied failed (some other reason) > + * 1: copied succeeded > + */ > +static inline int __wp_page_copy_user(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > bool ret; > void *kaddr; > @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - copy_user_highpage(dst, src, addr, vma); > - return true; > + if (copy_user_highpage_mc(dst, src, addr, vma)) > + return -EHWPOISON; > + return 1; > } > > /* > @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > * and update local tlb only > */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { > /* The PTE changed under us, update local tlb */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > - ret = true; > + ret = 1; > > pte_unlock: > if (locked) > @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > pte_t entry; > int page_copied = 0; > struct mmu_notifier_range range; > + int ret; > > delayacct_wpcopy_start(); > > @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (!new_page) > goto oom; > > - if (!__wp_page_copy_user(new_page, old_page, vmf)) { > + ret = __wp_page_copy_user(new_page, old_page, vmf); > + if (ret <= 0) { > /* > * COW failed, if the fault was solved by other, > * it's fine. If not, userspace would re-fault on > * the same address and we will handle the fault > * from the second attempt. > + * The -EHWPOISON case will not be retried. 
> */ > put_page(new_page); > if (old_page) > put_page(old_page); > > delayacct_wpcopy_end(); > - return 0; > + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0; > } > kmsan_copy_page_meta(new_page, old_page); > }
On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: > > > 在 2022/10/20 AM1:08, Tony Luck 写道: > > If the kernel is copying a page as the result of a copy-on-write > > fault and runs into an uncorrectable error, Linux will crash because > > it does not have recovery code for this case where poison is consumed > > by the kernel. > > > > It is easy to set up a test case. Just inject an error into a private > > page, fork(2), and have the child process write to the page. > > > > I wrapped that neatly into a test at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > > > just enable ACPI error injection and run: > > > > # ./einj_mem-uc -f copy-on-write > > > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > > on architectures where that is available (currently x86 and powerpc). > > When an error is detected during the page copy, return VM_FAULT_HWPOISON > > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > > and powerpc have code in their fault handler to deal with this code by > > sending a SIGBUS to the application. > > Does it send SIGBUS to only child process or both parent and child process? This only sends a SIGBUS to the process that wrote the page (typically the child, but also possible that the parent is the one that does the write that causes the COW). > > > > Note that this patch avoids a system crash and signals the process that > > triggered the copy-on-write action. It does not take any action for the > > memory error that is still in the shared page. To handle that a call to > > memory_failure() is needed. > > If the error page is not poisoned, should the return value of wp_page_copy > be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or > PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. > And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. 
The page has uncorrected data in it, but this patch doesn't mark it
as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
also skip the:

"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"

console message. So it might result in confusion and attempts to debug a
s/w problem with the application instead of blaming the death on a bad
DIMM.

> > But this cannot be done from wp_page_copy()
> > because it holds mmap_lock(). Perhaps the architecture fault handlers
> > can deal with this loose end in a subsequent patch?

I started looking at this for x86 ... but I have changed my mind
about this being a good place for a fix. When control returns back
to the architecture fault handler it no longer has easy access to
the physical page frame number. It has the virtual address, so it
could descend back into some new mm/memory.c function to get the
physical address ... but that seems silly.

I'm experimenting with using schedule_work() to handle the call to
memory_failure() (echoing what the machine check handler does using
task_work_add() to avoid the same problem of not being able to directly
call memory_failure()).

So far it seems to be working. Patch below (goes on top of the original
patch ... well, on top of the internal version with mods based on
feedback from Dan Williams ... but it should show the general idea).

With this patch applied the page does get unmapped from all users.
Other tasks that shared the page will get a SIGBUS if they attempt
to access it later (from the page fault handler, because of
is_hwpoison_entry() as you mention above).

-Tony

From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 20 Oct 2022 09:57:28 -0700
Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
 failure

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.
It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use schedule_work() to queue a request to call memory_failure() for the
page with the error.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b6056eef2f72..4a1304cf1f4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }

+#ifdef CONFIG_MEMORY_FAILURE
+struct pfn_work {
+	struct work_struct work;
+	unsigned long pfn;
+};
+
+static void do_sched_memory_failure(struct work_struct *w)
+{
+	struct pfn_work *p = container_of(w, struct pfn_work, work);
+
+	memory_failure(p->pfn, 0);
+	kfree(p);
+}
+
+static void sched_memory_failure(unsigned long pfn)
+{
+	struct pfn_work *p;
+
+	p = kmalloc(sizeof *p, GFP_KERNEL);
+	if (!p)
+		return;
+	INIT_WORK(&p->work, do_sched_memory_failure);
+	p->pfn = pfn;
+	schedule_work(&p->work);
+}
+#else
+static void sched_memory_failure(unsigned long pfn)
+{
+}
+#endif
+
 /*
  * Return:
  *	0:		copied succeeded
@@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 	unsigned long addr = vmf->address;

 	if (likely(src)) {
-		if (copy_mc_user_highpage(dst, src, addr, vma))
+		if (copy_mc_user_highpage(dst, src, addr, vma)) {
+			sched_memory_failure(page_to_pfn(src));
 			return -EHWPOISON;
+		}
 		return 0;
 	}
在 2022/10/21 AM4:05, Tony Luck 写道: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> 在 2022/10/20 AM1:08, Tony Luck 写道: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does not have recovery code for this case where poison is consumed >>> by the kernel. >>> >>> It is easy to set up a test case. Just inject an error into a private >>> page, fork(2), and have the child process write to the page. >>> >>> I wrapped that neatly into a test at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git >>> >>> just enable ACPI error injection and run: >>> >>> # ./einj_mem-uc -f copy-on-write >>> >>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() >>> on architectures where that is available (currently x86 and powerpc). >>> When an error is detected during the page copy, return VM_FAULT_HWPOISON >>> to caller of wp_page_copy(). This propagates up the call stack. Both x86 >>> and powerpc have code in their fault handler to deal with this code by >>> sending a SIGBUS to the application. >> >> Does it send SIGBUS to only child process or both parent and child process? > > This only sends a SIGBUS to the process that wrote the page (typically > the child, but also possible that the parent is the one that does the > write that causes the COW). Thanks for your explanation. > >>> >>> Note that this patch avoids a system crash and signals the process that >>> triggered the copy-on-write action. It does not take any action for the >>> memory error that is still in the shared page. To handle that a call to >>> memory_failure() is needed. >> >> If the error page is not poisoned, should the return value of wp_page_copy >> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or >> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. 
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. > > The page has uncorrected data in it, but this patch doesn't mark it > as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS > that doesn't include the BUS_MCEERR_AR and "lsb" information. It would > also skip the: > > "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n" > > console message. So might result in confusion and attepmts to debug a > s/w problem with the application instead of blaming the death on a bad > DIMM. I see your point. Thank you. > >>> But this cannot be done from wp_page_copy() >>> because it holds mmap_lock(). Perhaps the architecture fault handlers >>> can deal with this loose end in a subsequent patch? > > I started looking at this for x86 ... but I have changed my mind > about this being a good place for a fix. When control returns back > to the architecture fault handler it no longer has easy access to > the physical page frame number. It has the virtual address, so it > could descend back into somee new mm/memory.c function to get the > physical address ... but that seems silly. > > I'm experimenting with using sched_work() to handle the call to > memory_failure() (echoing what the machine check handler does using > task_work)_add() to avoid the same problem of not being able to directly > call memory_failure()). Work queues permit work to be deferred outside of the interrupt context into the kernel process context. If we return to user-space before the queued memory_failure() work is processed, we will take the fault again, as we discussed recently. commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak So, in my opinion, we should add memory failure as a task work, like do_machine_check does, e.g. queue_task_work(&m, msg, kill_me_maybe); > > So far it seems to be working. 
Patch below (goes on top of original > patch ... well on top of the internal version with mods based on > feedback from Dan Williams ... but should show the general idea) > > With this patch applied the page does get unmapped from all users. > Other tasks that shared the page will get a SIGBUS if they attempt > to access it later (from the page fault handler because of > is_hwpoison_entry() as you mention above. > > -Tony > > From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001 > From: Tony Luck <tony.luck@intel.com> > Date: Thu, 20 Oct 2022 09:57:28 -0700 > Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW > failure > > Cannot call memory_failure() directly from the fault handler because > mmap_lock (and others) are held. > > It is important, but not urgent, to mark the source page as h/w poisoned > and unmap it from other tasks. > > Use schedule_work() to queue a request to call memory_failure() for the > page with the error. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > mm/memory.c | 35 ++++++++++++++++++++++++++++++++++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index b6056eef2f72..4a1304cf1f4e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > +#ifdef CONFIG_MEMORY_FAILURE > +struct pfn_work { > + struct work_struct work; > + unsigned long pfn; > +}; > + > +static void do_sched_memory_failure(struct work_struct *w) > +{ > + struct pfn_work *p = container_of(w, struct pfn_work, work); > + > + memory_failure(p->pfn, 0); > + kfree(p); > +} > + > +static void sched_memory_failure(unsigned long pfn) > +{ > + struct pfn_work *p; > + > + p = kmalloc(sizeof *p, GFP_KERNEL); > + if (!p) > + return; > + INIT_WORK(&p->work, do_sched_memory_failure); > + p->pfn = pfn; > + schedule_work(&p->work); > +} I think there is already a function to do such work in 
mm/memory-failure.c. void memory_failure_queue(unsigned long pfn, int flags) Best Regards, Shuai > +#else > +static void sched_memory_failure(unsigned long pfn) > +{ > +} > +#endif > + > /* > * Return: > * 0: copied succeeded > @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - if (copy_mc_user_highpage(dst, src, addr, vma)) > + if (copy_mc_user_highpage(dst, src, addr, vma)) { > + sched_memory_failure(page_to_pfn(src)); > return -EHWPOISON; > + } > return 0; > } >
>> + INIT_WORK(&p->work, do_sched_memory_failure); >> + p->pfn = pfn; >> + schedule_work(&p->work); > > There is already memory_failure_queue() that can do this. Can we use it directly? Miaohe Lin, Yes, can use that. A thousand thanks for pointing it out. I just tried it, and it works perfectly. I think I'll need to add an empty stub version for the CONFIG_MEMORY_FAILURE=n build. But that's trivial. -Tony
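The INIT_WORK/schedule_work pattern being discussed — embed a work item in an allocation, recover the containing struct in the callback, and act on it there — can be sketched in plain userspace C, with a pthread standing in for the kworker thread and a recorded pfn standing in for the memory_failure() call. All names here (struct work, queue_and_wait, the seen out-parameter) are illustrative stand-ins, not kernel API:

```c
#include <pthread.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's work_struct + container_of pattern. */
struct work {
    void (*fn)(struct work *);
};

struct pfn_work {
    struct work work;          /* embedded work item */
    unsigned long pfn;
    unsigned long *seen;       /* records which pfn the handler processed */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Analog of do_sched_memory_failure(): recover the container, "handle" it. */
static void handle_pfn(struct work *w)
{
    struct pfn_work *p = container_of(w, struct pfn_work, work);

    *p->seen = p->pfn;         /* stand-in for memory_failure(p->pfn, 0) */
}

/* The worker thread plays the role of the kworker draining the queue. */
static void *worker(void *arg)
{
    struct work *w = arg;

    w->fn(w);
    return NULL;
}

/* Analog of sched_memory_failure(), plus waiting for the work to run. */
static unsigned long queue_and_wait(unsigned long pfn)
{
    unsigned long seen = 0;
    struct pfn_work p = {
        .work = { .fn = handle_pfn },
        .pfn = pfn,
        .seen = &seen,
    };
    pthread_t t;

    pthread_create(&t, NULL, worker, &p.work);
    pthread_join(t, NULL);
    return seen;
}
```

The kernel's memory_failure_queue() that Miaohe and Shuai point to wraps essentially this bookkeeping (plus a per-CPU ring of pending entries), which is why it is preferable to open-coding a one-off pfn_work.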
On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: > > > 在 2022/10/21 AM4:05, Tony Luck 写道: > > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: > >> > >> > >> 在 2022/10/20 AM1:08, Tony Luck 写道: > > I'm experimenting with using sched_work() to handle the call to > > memory_failure() (echoing what the machine check handler does using > > task_work)_add() to avoid the same problem of not being able to directly > > call memory_failure()). > > Work queues permit work to be deferred outside of the interrupt context > into the kernel process context. If we return to user-space before the > queued memory_failure() work is processed, we will take the fault again, > as we discussed recently. > > commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors > commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak > > So, in my opinion, we should add memory failure as a task work, like > do_machine_check does, e.g. > > queue_task_work(&m, msg, kill_me_maybe); Maybe ... but this case isn't pending back to a user instruction that is trying to READ the poison memory address. The task is just trying to WRITE to any address within the page. So this is much more like a patrol scrub error found asynchronously by the memory controller (in this case found asynchronously by the Linux page copy function). So I don't feel that it's really the responsibility of the current task. When we do return to user mode the task is going to be busy servicing a SIGBUS ... so shouldn't try to touch the poison page before the memory_failure() called by the worker thread cleans things up. > > + INIT_WORK(&p->work, do_sched_memory_failure); > > + p->pfn = pfn; > > + schedule_work(&p->work); > > +} > > I think there is already a function to do such work in mm/memory-failure.c. > > void memory_failure_queue(unsigned long pfn, int flags) Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... 
this does exactly what I want, and is working well in tests so far. So
perhaps this is a cleaner solution than making the kill_me_maybe()
function globally visible.

-Tony
From: Tony Luck > Sent: 21 October 2022 05:08 .... > When we do return to user mode the task is going to be busy servicing > a SIGBUS ... so shouldn't try to touch the poison page before the > memory_failure() called by the worker thread cleans things up. What about an RT process on a busy system? The worker threads are pretty low priority. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
>> When we do return to user mode the task is going to be busy servicing
>> a SIGBUS ... so shouldn't try to touch the poison page before the
>> memory_failure() called by the worker thread cleans things up.
>
> What about an RT process on a busy system?
> The worker threads are pretty low priority.

Most tasks don't have a SIGBUS handler ... so they just die without the
possibility of accessing the poison.

If this task DOES have a SIGBUS handler, and that handler for some bizarre
reason just does a "return" so the task jumps back to the instruction that
caused the COW, then there is a 63/64 likelihood that it is touching a
different cache line from the poisoned one.

In the 1/64 case ... it's probably a simple store (since there was a COW,
we know it was trying to modify the page) ... so it won't generate another
machine check (those only happen for reads).

But maybe it is some RMW instruction ... then, if all the above options
didn't happen ... we could get another machine check from the same address.
But then we just follow the usual recovery path.

-Tony
在 2022/10/21 PM12:08, Tony Luck 写道: > On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: >> >> >> 在 2022/10/21 AM4:05, Tony Luck 写道: >>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >>>> >>>> >>>> 在 2022/10/20 AM1:08, Tony Luck 写道: > >>> I'm experimenting with using sched_work() to handle the call to >>> memory_failure() (echoing what the machine check handler does using >>> task_work)_add() to avoid the same problem of not being able to directly >>> call memory_failure()). >> >> Work queues permit work to be deferred outside of the interrupt context >> into the kernel process context. If we return to user-space before the >> queued memory_failure() work is processed, we will take the fault again, >> as we discussed recently. >> >> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors >> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak >> >> So, in my opinion, we should add memory failure as a task work, like >> do_machine_check does, e.g. >> >> queue_task_work(&m, msg, kill_me_maybe); > > Maybe ... but this case isn't pending back to a user instruction > that is trying to READ the poison memory address. The task is just > trying to WRITE to any address within the page. Aha, I see the difference. Thank you. But I still have a question on this. Let us discuss in your reply to David Laight. Best Regards, Shuai > > So this is much more like a patrol scrub error found asynchronously > by the memory controller (in this case found asynchronously by the > Linux page copy function). So I don't feel that it's really the > responsibility of the current task. > > When we do return to user mode the task is going to be busy servicing > a SIGBUS ... so shouldn't try to touch the poison page before the > memory_failure() called by the worker thread cleans things up. 
> >>> + INIT_WORK(&p->work, do_sched_memory_failure); >>> + p->pfn = pfn; >>> + schedule_work(&p->work); >>> +} >> >> I think there is already a function to do such work in mm/memory-failure.c. >> >> void memory_failure_queue(unsigned long pfn, int flags) > > Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does > exacly what I want, and is working well in tests so far. So perhaps > a cleaner solution than making the kill_me_maybe() function globally > visible. > > -Tony
On 2022/10/21 12:41 PM, Luck, Tony wrote:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
>> The worker threads are pretty low priority.
>
> Most tasks don't have a SIGBUS handler ... so they just die without the
> possibility of accessing the poison.
>
> If this task DOES have a SIGBUS handler, and that handler for some bizarre
> reason just does a "return" so the task jumps back to the instruction that
> caused the COW, then there is a 63/64 likelihood that it is touching a
> different cache line from the poisoned one.
>
> In the 1/64 case ... it's probably a simple store (since there was a COW,
> we know it was trying to modify the page) ... so it won't generate another
> machine check (those only happen for reads).
>
> But maybe it is some RMW instruction ... then, if all the above options
> didn't happen ... we could get another machine check from the same
> address. But then we just follow the usual recovery path.
>
> -Tony

Let's assume the instruction that caused the COW is in the 63/64 case, aka
it is writing a different cache line from the poisoned one. But the new_page
allocated in COW is dropped, right? So it might page fault again?

Best Regards,
Shuai
>> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we >> could get another machine check from the same address. But then we just follow the usual >> recovery path. > Let assume the instruction that cause the COW is in the 63/64 case, aka, > it is writing a different cache line from the poisoned one. But the new_page > allocated in COW is dropped right? So might page fault again? It can, but this should be no surprise to a user that has a signal handler for a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the problem, but simply returns to re-execute the same instruction that caused the original trap. There may be badly written signal handlers that do this. But they just cause pain for themselves. Linux can keep taking the traps and fixing things up and sending a new signal over and over. In this case that loop may involve taking the machine check again, so some extra pain for the kernel, but recoverable machine checks on Intel/x86 switched from broadcast to delivery to just the logical CPU that tried to consume the poison a few generations back. So only a bit more painful than a repeated page fault. -Tony
在 2022/10/22 AM12:30, Luck, Tony 写道: >>> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we >>> could get another machine check from the same address. But then we just follow the usual >>> recovery path. > > >> Let assume the instruction that cause the COW is in the 63/64 case, aka, >> it is writing a different cache line from the poisoned one. But the new_page >> allocated in COW is dropped right? So might page fault again? > > It can, but this should be no surprise to a user that has a signal handler for > a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the > problem, but simply returns to re-execute the same instruction that caused > the original trap. > > There may be badly written signal handlers that do this. But they just cause > pain for themselves. Linux can keep taking the traps and fixing things up and > sending a new signal over and over. > > In this case that loop may involve taking the machine check again, so some > extra pain for the kernel, but recoverable machine checks on Intel/x86 switched > from broadcast to delivery to just the logical CPU that tried to consume the poison > a few generations back. So only a bit more painful than a repeated page fault. > > -Tony > > I see, thanks for your patient explanation :) Best Regards, Shuai
On 2022/10/22 4:01 AM, Tony Luck wrote:
> Part 1 deals with the process that triggered the copy on write
> fault with a store to a shared read-only page. That process is
> sent a SIGBUS with the usual machine check decoration to specify
> the virtual address of the lost page, together with the scope.
>
> Part 2 sets up to asynchronously take the page with the uncorrected
> error offline to prevent additional machine check faults. H/t to
> Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com>
> for pointing me to the existing function to queue a call to
> memory_failure().
>
> On x86 there is some duplicate reporting (because the error is
> also signalled by the memory controller as well as by the core
> that triggered the machine check). Console logs look like this:
>
> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
>   Machine check from kernel copy routine
>
> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
>   x86 fault handler sends SIGBUS to child process
>
> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
>   Async call to memory_failure() from copy on write path

The recovery action might also be handled asynchronously by the CMCI
uc_decode_notifier handler signaled by the memory controller, right?

I have one more memory failure log than yours.
[ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption fault at 7fc4bf7cf400 [ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: Recovered uc_decode_notifier() processes memory controller report [ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned Workqueue: events memory_failure_work_func // queued by ghes_do_memory_failure [ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned Workqueue: events memory_failure_work_func // queued by __wp_page_copy_user [ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption fault at 7fc4bf7cf400 Best Regards, Shuai > > [ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned > uc_decode_notifier() processes memory controller report > > [ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400 > Parent process tries to read poisoned page. Page has been unmapped, so > #PF handler sends SIGBUS > > > Tony Luck (2): > mm, hwpoison: Try to recover from copy-on write faults > mm, hwpoison: When copy-on-write hits poison, take page offline > > include/linux/highmem.h | 24 ++++++++++++++++++++++++ > include/linux/mm.h | 5 ++++- > mm/memory.c | 32 ++++++++++++++++++++++---------- > 3 files changed, 50 insertions(+), 11 deletions(-) >
在 2022/10/23 PM11:52, Shuai Xue 写道: > > > 在 2022/10/22 AM4:01, Tony Luck 写道: >> Part 1 deals with the process that triggered the copy on write >> fault with a store to a shared read-only page. That process is >> send a SIGBUS with the usual machine check decoration to specify >> the virtual address of the lost page, together with the scope. >> >> Part 2 sets up to asynchronously take the page with the uncorrected >> error offline to prevent additional machine check faults. H/t to >> Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com> >> for pointing me to the existing function to queue a call to >> memory_failure(). >> >> On x86 there is some duplicate reporting (because the error is >> also signalled by the memory controller as well as by the core >> that triggered the machine check). Console logs look like this: >> >> [ 1647.723403] mce: [Hardware Error]: Machine check events logged >> Machine check from kernel copy routine >> >> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400 >> x86 fault handler sends SIGBUS to child process >> >> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered >> Async call to memory_failure() from copy on write path > > The recovery action might also be handled asynchronously in CMCI uc_decode_notifier > handler signaled by memory controller, right? > > I have a one more memory failure log than yours. 
> > [ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption fault at 7fc4bf7cf400 > [ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: Recovered > uc_decode_notifier() processes memory controller report > > [ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned > Workqueue: events memory_failure_work_func // queued by ghes_do_memory_failure > > [ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned > Workqueue: events memory_failure_work_func // queued by __wp_page_copy_user > > [ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption fault at 7fc4bf7cf400 > > Best Regards, > Shuai Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Thank you. Shuai > >> >> [ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned >> uc_decode_notifier() processes memory controller report >> >> [ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400 >> Parent process tries to read poisoned page. Page has been unmapped, so >> #PF handler sends SIGBUS >> >> >> Tony Luck (2): >> mm, hwpoison: Try to recover from copy-on write faults >> mm, hwpoison: When copy-on-write hits poison, take page offline >> >> include/linux/highmem.h | 24 ++++++++++++++++++++++++ >> include/linux/mm.h | 5 ++++- >> mm/memory.c | 32 ++++++++++++++++++++++---------- >> 3 files changed, 50 insertions(+), 11 deletions(-) >>
diff --git a/include/linux/highmem.h b/include/linux/highmem.h index e9912da5441b..5967541fbf0e 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, #endif +static inline int copy_user_highpage_mc(struct page *to, struct page *from, + unsigned long vaddr, struct vm_area_struct *vma) +{ + unsigned long ret = 0; +#ifdef copy_mc_to_kernel + char *vfrom, *vto; + + vfrom = kmap_local_page(from); + vto = kmap_local_page(to); + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); + kunmap_local(vto); + kunmap_local(vfrom); +#else + copy_user_highpage(to, from, vaddr, vma); +#endif + + return ret; +} + #ifndef __HAVE_ARCH_COPY_HIGHPAGE static inline void copy_highpage(struct page *to, struct page *from) diff --git a/mm/memory.c b/mm/memory.c index f88c351aecd4..a32556c9b689 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) return same; } -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, - struct vm_fault *vmf) +/* + * Return: + * -EHWPOISON: copy failed due to hwpoison in source page + * 0: copied failed (some other reason) + * 1: copied succeeded + */ +static inline int __wp_page_copy_user(struct page *dst, struct page *src, + struct vm_fault *vmf) { bool ret; void *kaddr; @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, unsigned long addr = vmf->address; if (likely(src)) { - copy_user_highpage(dst, src, addr, vma); - return true; + if (copy_user_highpage_mc(dst, src, addr, vma)) + return -EHWPOISON; + return 1; } /* @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, * and update local tlb only */ update_mmu_tlb(vma, addr, vmf->pte); - ret = false; + ret = 0; goto pte_unlock; } @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, if 
(!likely(pte_same(*vmf->pte, vmf->orig_pte))) { /* The PTE changed under us, update local tlb */ update_mmu_tlb(vma, addr, vmf->pte); - ret = false; + ret = 0; goto pte_unlock; } @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, } } - ret = true; + ret = 1; pte_unlock: if (locked) @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) pte_t entry; int page_copied = 0; struct mmu_notifier_range range; + int ret; delayacct_wpcopy_start(); @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (!new_page) goto oom; - if (!__wp_page_copy_user(new_page, old_page, vmf)) { + ret = __wp_page_copy_user(new_page, old_page, vmf); + if (ret <= 0) { /* * COW failed, if the fault was solved by other, * it's fine. If not, userspace would re-fault on * the same address and we will handle the fault * from the second attempt. + * The -EHWPOISON case will not be retried. */ put_page(new_page); if (old_page) put_page(old_page); delayacct_wpcopy_end(); - return 0; + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0; } kmsan_copy_page_meta(new_page, old_page); }