Message ID: <20240111135548.3207437-3-tongtiangen@huawei.com>
State: New
Headers:
From: Tong Tiangen <tongtiangen@huawei.com>
To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, wangkefeng.wang@huawei.com, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Tony Luck <tony.luck@intel.com>, Andy Lutomirski <luto@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org, linux-mm@kvack.org, Tong Tiangen <tongtiangen@huawei.com>, Guohanjun <guohanjun@huawei.com>
Subject: [PATCH -next v4 2/3] x86/mce: rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC
Date: Thu, 11 Jan 2024 21:55:47 +0800
Message-ID: <20240111135548.3207437-3-tongtiangen@huawei.com>
In-Reply-To: <20240111135548.3207437-1-tongtiangen@huawei.com>
References: <20240111135548.3207437-1-tongtiangen@huawei.com>
Series: [-next,v4,1/3] x86/mce: remove redundant fixup type EX_TYPE_COPY
Commit Message
Tong Tiangen
Jan. 11, 2024, 1:55 p.m. UTC
In x86 MCE processing, the macro MCE_IN_KERNEL_COPYIN is used to identify
an MCE that happened while copying data from user space.
do_machine_check() uses this flag to isolate the poisoned page via
memory_failure(). There is nothing wrong with that, but the use of this
macro can be expanded.

Currently, some kernel memory copy scenarios are also MC-safe, using
copy_mc_to_kernel() or copy_mc_user_highpage(). In these scenarios,
poisoned pages need to be isolated too, so a macro similar to
MCE_IN_KERNEL_COPYIN is required. For this reason, rename
MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC; the new macro can be
applied to both user-to-kernel and kernel-to-kernel MC-safe copies.
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
---
arch/x86/include/asm/mce.h | 8 ++++----
arch/x86/kernel/cpu/mce/core.c | 2 +-
arch/x86/kernel/cpu/mce/severity.c | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
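The rename in this patch is mechanical: the flag keeps bit position 7, and only its name and documented scope change. As a rough illustration of the flag check that do_machine_check() performs after the rename, here is a minimal user-space sketch (the `mce_sim` struct and `needs_copy_mc_recovery()` helper are invented for illustration and are not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))
/* Same bit as the old MCE_IN_KERNEL_COPYIN; only the name broadens. */
#define MCE_IN_KERNEL_COPY_MC BIT_ULL(7)

/* Simplified stand-in for struct mce's kflags field. */
struct mce_sim {
	uint64_t kflags;
};

/* Mirrors the check in do_machine_check(): recovery work is queued
 * only when the fault happened inside an MC-safe copy routine. */
static bool needs_copy_mc_recovery(const struct mce_sim *m)
{
	return (m->kflags & MCE_IN_KERNEL_COPY_MC) != 0;
}
```

Because the bit value is unchanged, the rename cannot alter behavior by itself; it only makes the next patch's broader use of the flag read naturally.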
Comments
On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
> Currently, there are some kernel memory copy scenarios is also mc safe
> which use copy_mc_to_kernel() or copy_mc_user_highpage().

Both of those end up in copy_mc_enhanced_fast_string() which does
EX_TYPE_DEFAULT_MCE_SAFE.
On 2024/1/31 15:02, Borislav Petkov wrote:
> On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
>> Currently, there are some kernel memory copy scenarios is also mc safe
>> which use copy_mc_to_kernel() or copy_mc_user_highpage().
>
> Both of those end up in copy_mc_enhanced_fast_string() which does
> EX_TYPE_DEFAULT_MCE_SAFE.

OK, how about this commit msg change? :)

  Currently, there are some kernel memory copy scenarios is also mc safe
  which use copy_mc_to_kernel() or copy_mc_user_highpage(), **both of those
  end up in copy_mc_enhanced_fast_string() or copy_mc_fragile() which does
  EX_TYPE_DEFAULT_MCE_SAFE.**

  In these scenarios, posion pages need to be isolated too. Therefore, a
  macro similar to MCE_IN_KERNEL_COPYIN is required. For this reason, we
  can rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC, the new macro
  can be applied to both user-to-kernel mc safe copy and kernel-to-kernel
  mc safe copy.

Thanks.
Tong.
On Thu, Feb 01, 2024 at 07:37:25PM +0800, Tong Tiangen wrote:
> On 2024/1/31 15:02, Borislav Petkov wrote:
> > On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
> > > Currently, there are some kernel memory copy scenarios is also mc safe
> > > which use copy_mc_to_kernel() or copy_mc_user_highpage().
> >
> > Both of those end up in copy_mc_enhanced_fast_string() which does
> > EX_TYPE_DEFAULT_MCE_SAFE.
>
> OK, how about this commit msg change? :)
>
> Currently, there are some kernel memory copy scenarios is also mc safe
> which use copy_mc_to_kernel() or copy_mc_user_highpage(), **both of those
> end up in copy_mc_enhanced_fast_string() or copy_mc_fragile() which does
> EX_TYPE_DEFAULT_MCE_SAFE.**
>
> In these scenarios, posion pages need to be isolated too. Therefore, a
> macro similar to MCE_IN_KERNEL_COPYIN is required. For this reason, we
> can rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC, the new macro
> can be applied to both user-to-kernel mc safe copy and kernel-to-kernel
> mc safe copy.

Maybe my question wasn't clear: why is that renaming churn needed at
all? What are you "fixing" here?

What is the problem that you're encountering which needs fixing?

Thx.
On 2024/2/1 22:20, Borislav Petkov wrote:
> On Thu, Feb 01, 2024 at 07:37:25PM +0800, Tong Tiangen wrote:
>> On 2024/1/31 15:02, Borislav Petkov wrote:
>>> On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
>>>> Currently, there are some kernel memory copy scenarios is also mc safe
>>>> which use copy_mc_to_kernel() or copy_mc_user_highpage().
>>>
>>> Both of those end up in copy_mc_enhanced_fast_string() which does
>>> EX_TYPE_DEFAULT_MCE_SAFE.
>>
>> OK, how about this commit msg change? :)
>>
>> Currently, there are some kernel memory copy scenarios is also mc safe
>> which use copy_mc_to_kernel() or copy_mc_user_highpage(), **both of those
>> end up in copy_mc_enhanced_fast_string() or copy_mc_fragile() which does
>> EX_TYPE_DEFAULT_MCE_SAFE.**
>>
>> In these scenarios, posion pages need to be isolated too. Therefore, a
>> macro similar to MCE_IN_KERNEL_COPYIN is required. For this reason, we
>> can rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC, the new macro
>> can be applied to both user-to-kernel mc safe copy and kernel-to-kernel
>> mc safe copy.
>
> Maybe my question wasn't clear: why is that renaming churn needed at
> all? What are you "fixing" here?
>
> What is the problem that you're encountering which needs fixing?

This patch is a preparation patch and the next patch is a fix patch; the
complete logic of the two patches is as follows.

The problem I'm encountering:
-----------------------------
In x86 MCE processing, error_context() sets the macro MCE_IN_KERNEL_COPYIN
to identify a copy from user (user-to-kernel copy) for fixup type
EX_TYPE_UACCESS. do_machine_check() then uses MCE_IN_KERNEL_COPYIN to
isolate the poisoned page in memory_failure().

Currently, some kernel memory copy scenarios are also MC-safe, using
copy_mc_to_kernel() or copy_mc_user_highpage(); these kernel-to-kernel
copies use fixup type EX_TYPE_DEFAULT_MCE_SAFE. In these scenarios,
poisoned pages need to be isolated too, and the current implementation is
to actively call memory_failure_queue() when the copy fails. Calling
memory_failure_queue() separately is not a good implementation; calling
it uniformly in do_machine_check() is more reasonable.

Solution:
---------
A macro similar to MCE_IN_KERNEL_COPYIN is required, so rename
MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC. The new macro can be
applied to both user-to-kernel and kernel-to-kernel MC-safe copies: in
error_context() we can set MCE_IN_KERNEL_COPY_MC for both fixup types
EX_TYPE_UACCESS and EX_TYPE_DEFAULT_MCE_SAFE.

Do you think it's clear to say so, and should we then merge the two
patches to make the complete logic clearer in the commit message?

Many thanks.
Tong.
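The proposed error_context() change can be sketched as a toy user-space model. This is a sketch under stated assumptions, not kernel code: the real function takes `struct mce` and `pt_regs`, handles more fixup types, and returns an error-context enum; `classify_kflags()` and its simplified signature are invented here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))
#define MCE_IN_KERNEL_COPY_MC BIT_ULL(7)

/* Simplified subset of the kernel's extable fixup types. */
enum ex_type {
	EX_TYPE_NONE,
	EX_TYPE_UACCESS,		/* user-to-kernel copy */
	EX_TYPE_DEFAULT_MCE_SAFE,	/* kernel-to-kernel MC-safe copy */
};

/* Toy model of the proposed dispatch: set the flag for both
 * user-to-kernel and kernel-to-kernel MC-safe copies, so that
 * do_machine_check() can isolate the poisoned page uniformly. */
static uint64_t classify_kflags(enum ex_type t, bool copy_user)
{
	switch (t) {
	case EX_TYPE_UACCESS:
		if (!copy_user)
			return 0;	/* plain in-kernel fault, no flag */
		/* fall through */
	case EX_TYPE_DEFAULT_MCE_SAFE:
		return MCE_IN_KERNEL_COPY_MC;
	default:
		return 0;
	}
}
```

The fall-through mirrors the structure of the real switch in severity.c: once a fault is known to come from an MC-safe copy routine, both paths mark the MCE the same way.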
On Fri, Feb 02, 2024 at 03:51:12PM +0800, Tong Tiangen wrote:
> Currently, there are some kernel memory copy scenarios is also mc safe
> which use copy_mc_to_kernel() or copy_mc_user_highpage(), these kernel-
> to-kernel copy use fixup_type EX_TYPE_DEFAULT_MCE_SAFE. In these
> scenarios, posion pages need to be isolated too and the current

So you have, for example:

  unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigned len)

Now imagine you get a MCE for *dst which is some kernel page which
cannot be poisoned: direct map, kernel text, and so on. Attempting to
poison such a page would not work, to put it mildly.

So, again, what *exactly* are you "fixing" here?

When I read "Currently, there are some kernel memory copy scenarios"
and there's nothing more explaining what those scenarios are, I'm
tempted to ignore this completely until you give a detailed and
concrete example what the problem is:

What exactly are you doing, what goes wrong, why does this need to be
fixed and so on...

If there isn't such a real-life use case you're encountering, then this
all is waste of time.

Thx.
> So you have, for example:
>
>   unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigned len)
>
> Now imagine you get a MCE for *dst which is some kernel page which
> cannot be poisoned: direct map, kernel text, and so on.

At least on Intel you can only get a machine check for operation on
poison data LOAD. Not for a STORE. I believe that is generally true -
other arches to confirm. So there can't be a machine check on "*dst".

-Tony
On Fri, Feb 02, 2024 at 06:44:32PM +0000, Luck, Tony wrote:
> At least on Intel you can only get a machine check for operation on poison data LOAD.
> Not for a STORE. I believe that is generally true - other arches to confirm.

So what happens if you store to a poisoned cacheline on Intel? It'll
raise a poison consumption error when that cacheline is loaded in the
cache? Because you need to load that line into the cache for writing,
I'd presume...

What happens if you have bits flipped in the cacheline you want to write
to?

That's fine because you're overwriting them anyway?

I'd presume ECC check gets performed on cacheline load and then you'll
have to raise an #MC...
> > At least on Intel you can only get a machine check for operation on poison data LOAD.
> > Not for a STORE. I believe that is generally true - other arches to confirm.
>
> So what happens if you store to a poisoned cacheline on Intel? It'll
> raise a poison consumption error when that cacheline is loaded in the
> cache? Because you need to load that line into the cache for writing,
> I'd presume...

There are two places in the pipeline where poison is significant.

1) When the memory controller gets a request to fetch some data. If the
   ECC check on the bits returned from the DIMMs fails, the memory
   controller will log a "UCNA" signature error to a machine check bank
   for the memory channel where the DIMMs live. If CMCI is enabled for
   that bank, then a CMCI is sent to all logical CPUs that are in the
   scope of that bank (generally a CPU socket). The data is marked with
   a POISON signature and passed to the entity that requested it. Caches
   support this POISON signature and preserve it as data is moved
   between caches, or written back to memory. This may have been a
   prefetch or a speculative read. In these cases there won't be a
   machine check. Linux uc_decode_notifier() will try to offline pages
   when it sees UCNA signatures.

2) When a CPU core tries to retire an instruction that consumes poison
   data, or needs to retire a poisoned instruction. These log an SRAR
   signature into a core scoped bank (on most Xeons to date bank 0 for
   poisoned instructions, bank 1 for poisoned data consumption). Then
   they signal a machine check.

> What happens if you have bits flipped in the cacheline you want to write
> to?
>
> That's fine because you're overwriting them anyway?
>
> I'd presume ECC check gets performed on cacheline load and then you'll
> have to raise an #MC...

Partial cacheline stores to data marked as POISON in the cache maintain
the poison status. Full cacheline writes (certainly with MOVDIR64B
instruction, possibly with some AVX512 instructions) can clear the
POISON status (since you have all new data). A sequence of partial cache
line stores that overwrite all data in a cache line will NOT clear the
POISON status.

Nothing is logged or signaled when updating data in the cache.

-Tony
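The poison-propagation rules Tony describes can be captured in a small user-space model. This is a sketch of the described semantics only, not of real hardware structures; the `cacheline` struct and helper names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE 64

/* Toy model: a cache line carries a POISON mark alongside its data. */
struct cacheline {
	unsigned char data[LINE_SIZE];
	bool poisoned;
};

/* Partial store: updates some bytes, but the POISON status is
 * preserved - even a sequence of partial stores covering the whole
 * line does not clear it. */
static void partial_store(struct cacheline *cl, size_t off, size_t len,
			  const unsigned char *src)
{
	memcpy(cl->data + off, src, len);
	/* cl->poisoned intentionally untouched */
}

/* Full-line store (e.g. MOVDIR64B): all data is new, so the POISON
 * status can be cleared. */
static void full_line_store(struct cacheline *cl, const unsigned char *src)
{
	memcpy(cl->data, src, LINE_SIZE);
	cl->poisoned = false;
}
```

The asymmetry this models is the point of the exchange above: stores never raise an #MC, and only a genuine full-line write can retire the poison mark.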
On Fri, Feb 02, 2024 at 09:36:27PM +0000, Luck, Tony wrote:
> There are two places in the pipeline where poison is significant.
>
> 1) When the memory controller gets a request to fetch some data. If the ECC
> check on the bits returned from the DIMMs the memory controller will log
> a "UCNA" signature error to a machine check bank for the memory channel
> where the DIMMs live. If CMCI is enabled for that bank, then a CMCI is
> sent to all logical CPUs that are in the scope of that bank (generally a
> CPU socket). The data is marked with a POISON signature and passed
> to the entity that requested it. Caches support this POISON signature
> and preserve it as data is moved between caches, or written back to
> memory. This may have been a prefetch or a speculative read. In these
> cases there won't be a machine check. Linux uc_decode_notifier() will
> try to offline pages when it sees UCNA signatures.

Yap, deferred errors.

> 2) When a CPU core tries to retire an instruction that consumes poison
> data, or needs to retire a poisoned instruction. These log an SRAR signature
> into a core scoped bank (on most Xeons to date bank 0 for poisoned instructions,
> bank 1 for poisoned data consumption). Then they signal a machine check.

And that is the #MC on a poison data load thing. FWIW, the other vendor
does it very similarly.

> Partial cacheline stores to data marked as POISON in the cache maintain
> the poison status. Full cacheline writes (certainly with MOVDIR64B instruction,
> possibly with some AVX512 instructions) can clear the POISON status (since
> you have all new data). A sequence of partial cache line stores that overwrite
> all data in a cache line will NOT clear the POISON status.

That's interesting - partial stores don't clear poison data.

> Nothing is logged or signaled when updating data in the cache.

Ok, no #MC on writing to poisoned cachelines.

Ok, so long story short, #MC only on loads. Good.

Now, since you're explaining things today :) pls explain to me what this
patchset is all about? You having reviewed patch 3 and all?

Why is this pattern:

	if (copy_mc_user_highpage(dst, src, addr, vma)) {
		memory_failure_queue(page_to_pfn(src), 0);

not good anymore?

Or is the goal here to poison straight from the #MC handler and not
waste time and potentially get another #MC while memory_failure_queue()
on the source address is done?

Or something completely different?

Thx.
> Now, since you're explaining things today :) pls explain to me what this
> patchset is all about? You having reviewed patch 3 and all?
>
> Why is this pattern:
>
>	if (copy_mc_user_highpage(dst, src, addr, vma)) {
>		memory_failure_queue(page_to_pfn(src), 0);
>
> not good anymore?
>
> Or is the goal here to poison straight from the #MC handler and not
> waste time and potentially get another #MC while memory_failure_queue()
> on the source address is done?
>
> Or something completely different?

See the comment above memory_failure_queue():

	* The function is primarily of use for corruptions that
	* happen outside the current execution context (e.g. when
	* detected by a background scrubber)

In the copy_mc_user_highpage() case the fault happens in the current
execution context. So scheduling someone else to handle it at some
future point is risky. Just deal with it right away.

-Tony
On 2024/2/3 6:46, Luck, Tony wrote:
>> Now, since you're explaining things today :) pls explain to me what this
>> patchset is all about? You having reviewed patch 3 and all?
>>
>> Why is this pattern:
>>
>>	if (copy_mc_user_highpage(dst, src, addr, vma)) {
>>		memory_failure_queue(page_to_pfn(src), 0);
>>
>> not good anymore?
>>
>> Or is the goal here to poison straight from the #MC handler and not
>> waste time and potentially get another #MC while memory_failure_queue()
>> on the source address is done?
>>
>> Or something completely different?
>
> See the comment above memory_failure_queue()
>
> * The function is primarily of use for corruptions that
> * happen outside the current execution context (e.g. when
> * detected by a background scrubber)
>
> In the copy_mc_user_highpage() case the fault happens in
> the current execution context. So scheduling someone else
> to handle it at some future point is risky. Just deal with it
> right away.
>
> -Tony

The goal of this patch: when #MC is triggered by copy_mc_user_highpage(),
the #MC is processed directly in the synchronously triggered
do_machine_check() -> kill_me_never() -> memory_failure() path.

The current handling is to call memory_failure_queue() ->
schedule_work_on() in the execution context; I think that's what
"scheduling someone else to handle it at some future point is risky"
refers to.

Thanks.
Tong.
On Sat, Feb 03, 2024 at 03:56:04PM +0800, Tong Tiangen wrote:
> The goal of this patch:
> When #MC is triggered by copy_mc_user_highpage(), #MC is directly
> processed in the synchronously triggered do_machine_check() ->
> kill_me_never() -> memory_failure().
>
> And the current handling is to call memory_failure_queue() ->
> schedule_work_on() in the execution context, I think that's what
> "scheduling someone else to handle it at some future point is risky."

Ok, now take everything that was discussed on the thread and use it to
rewrite all your commit messages to explain *why* you're doing this, not
*what* you're doing - that is visible from the diff.

A possible way to structure them is:

1. Prepare the context for the explanation briefly.
2. Explain the problem at hand.
3. "It happens because of <...>"
4. "Fix it by doing X"
5. "(Potentially do Y)."

And some of those above are optional depending on the issue being
explained.

For more detailed info, see Documentation/process/submitting-patches.rst,
Section "2) Describe your changes".

Also, to the tone, from Documentation/process/submitting-patches.rst:

  "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

Also, do not talk about what your patch does - that should (hopefully) be
visible from the diff itself. Rather, talk about *why* you're doing what
you're doing.

Thx.
On 2024/2/3 17:43, Borislav Petkov wrote:
> On Sat, Feb 03, 2024 at 03:56:04PM +0800, Tong Tiangen wrote:
>> The goal of this patch:
>> When #MC is triggered by copy_mc_user_highpage(), #MC is directly
>> processed in the synchronously triggered do_machine_check() ->
>> kill_me_never() -> memory_failure().
>>
>> And the current handling is to call memory_failure_queue() ->
>> schedule_work_on() in the execution context, I think that's what
>> "scheduling someone else to handle it at some future point is risky."
>
> Ok, now take everything that was discussed on the thread and use it to
> rewrite all your commit messages to explain *why* you're doing this, not
> *what* you're doing - that is visible from the diff.
>
> A possible way to structure them is:
>
> 1. Prepare the context for the explanation briefly.
> 2. Explain the problem at hand.
> 3. "It happens because of <...>"
> 4. "Fix it by doing X"
> 5. "(Potentially do Y)."
>
> And some of those above are optional depending on the issue being
> explained.
>
> For more detailed info, see Documentation/process/submitting-patches.rst,
> Section "2) Describe your changes".
>
> Also, to the tone, from Documentation/process/submitting-patches.rst:
>
> "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> to do frotz", as if you are giving orders to the codebase to change
> its behaviour."
>
> Also, do not talk about what your patch does - that should (hopefully) be
> visible from the diff itself. Rather, talk about *why* you're doing what
> you're doing.

OK, will improve in the next version. Thanks.
Tong.
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index de3118305838..cb628ab2f32f 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -151,11 +151,11 @@

 /*
  * Indicates an MCE that happened in kernel space while copying data
- * from user. In this case fixup_exception() gets the kernel to the
- * error exit for the copy function. Machine check handler can then
- * treat it like a fault taken in user mode.
+ * from user or kernel. In this case fixup_exception() gets the kernel
+ * to the error exit for the copy function. Machine check handler can
+ * then treat it like a fault taken in user or kernel mode.
  */
-#define MCE_IN_KERNEL_COPYIN	BIT_ULL(7)
+#define MCE_IN_KERNEL_COPY_MC	BIT_ULL(7)

 /*
  * This structure contains all data related to the MCE log. Also
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2bfd0e3c62e4..bd1a31ed148b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1607,7 +1607,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		mce_panic("Failed kernel mode recovery", &m, msg);
 	}

-	if (m.kflags & MCE_IN_KERNEL_COPYIN)
+	if (m.kflags & MCE_IN_KERNEL_COPY_MC)
 		queue_task_work(&m, msg, kill_me_never);
 }

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index bca780fa5e57..df67a7a13034 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -292,7 +292,7 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	case EX_TYPE_UACCESS:
 		if (!copy_user)
 			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
+		m->kflags |= MCE_IN_KERNEL_COPY_MC;
 		fallthrough;
 	case EX_TYPE_FAULT_MCE_SAFE: