Message ID | cover.1674070170.git.alison.schofield@intel.com |
---|---|
Headers | From: alison.schofield@intel.com; To: Dan Williams <dan.j.williams@intel.com>, Ira Weiny <ira.weiny@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Dave Jiang <dave.jiang@intel.com>, Ben Widawsky <bwidawsk@kernel.org>, Steven Rostedt <rostedt@goodmis.org>; Cc: Alison Schofield <alison.schofield@intel.com>, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org; Subject: [PATCH v5 0/5] CXL Poison List Retrieval & Tracing; Date: Wed, 18 Jan 2023 12:59:45 -0800; Message-Id: <cover.1674070170.git.alison.schofield@intel.com> |
Series | CXL Poison List Retrieval & Tracing |
Message
Alison Schofield
Jan. 18, 2023, 8:59 p.m. UTC
From: Alison Schofield <alison.schofield@intel.com>
**RESENDING this cover letter, which was previously mis-threaded.**
Changes in v5:
- Rebase on cxl/next
- Use struct_size() to calc mbox cmd payload .min_out
- s/INTERNAL/INJECTED mocked poison record source
- Added Jonathan's Reviewed-by tag to Patch 3
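The `struct_size()` change concerns sizing a structure that ends in a flexible array. As a rough user-space illustration (not the driver's code), the sketch below assumes the Get Poison List output payload as I read CXL 3.0 Section 8.2.9.8.4.1: a 32-byte header (flags, a reserved byte, an 8-byte overflow timestamp, a 2-byte record count, 20 reserved bytes) followed by 16-byte media error records. The names `poison_out`, `poison_record` and `poison_out_size()` are hypothetical; `poison_out_size()` mimics the kernel helper's arithmetic, including saturating to `SIZE_MAX` on overflow. Called with zero records it yields the smallest valid response, which is the kind of value a `.min_out` check wants.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed CXL 3.0 media error record layout (16 bytes); hypothetical name. */
struct poison_record {
	uint64_t address;   /* assumed: DPA in bits [63:6], source in bits [2:0] */
	uint32_t length;    /* assumed: length in 64-byte units */
	uint32_t rsvd;
} __attribute__((packed));

/* Assumed Get Poison List output payload: 32-byte header + records. */
struct poison_out {
	uint8_t flags;
	uint8_t rsvd1;
	uint64_t overflow_ts;
	uint16_t count;
	uint8_t rsvd2[20];
	struct poison_record record[];  /* flexible array, 'count' entries */
} __attribute__((packed));

/*
 * Hypothetical user-space analogue of struct_size(): header plus n
 * records, saturating to SIZE_MAX instead of wrapping on overflow.
 */
static size_t poison_out_size(size_t n)
{
	const size_t hdr = sizeof(struct poison_out);
	const size_t elem = sizeof(struct poison_record);

	if (n > (SIZE_MAX - hdr) / elem)
		return SIZE_MAX;
	return hdr + n * elem;
}
```

Under these assumptions a response carrying `count` records is `32 + 16 * count` bytes, and the zero-record case gives the 32-byte minimum.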
Link to v4:
https://lore.kernel.org/linux-cxl/cover.1671135967.git.alison.schofield@intel.com/
Add support for retrieving device poison lists and storing the returned
error records as kernel trace events.
The handling of the poison list is guided by the CXL 3.0 Specification
Section 8.2.9.8.4.1. [1]
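For each record, the trace fields `dpa=`, `length=` and `source=` have to be unpacked from the raw payload. A minimal decoding sketch follows. It assumes, from my reading of the spec section cited above (worth double-checking against [1]), that the 64-bit little-endian address carries the error source in bits [2:0] and the 64-byte-aligned DPA in bits [63:6], and that the length is counted in 64-byte units; under that reading, `length=0x40` in the examples corresponds to one unit. `poison_rec`, `get_le()` and `decode_poison_record()` are hypothetical names, and the source is left as a raw code rather than mapped to names.

```c
#include <stdint.h>

#define POISON_SOURCE_MASK 0x7ULL     /* assumed: error source in bits [2:0] */
#define POISON_DPA_MASK    (~0x3FULL) /* assumed: DPA in bits [63:6] */
#define POISON_LEN_MULT    64ULL      /* assumed: length unit in bytes */

/* Hypothetical decoded form of one media error record. */
struct poison_rec {
	uint64_t dpa;     /* device physical address of the poisoned range */
	uint64_t length;  /* length of the poisoned range in bytes */
	unsigned source;  /* raw 3-bit error source code */
};

/* Read an n-byte little-endian integer, independent of host byte order. */
static uint64_t get_le(const uint8_t *p, int n)
{
	uint64_t v = 0;

	for (int i = n - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Hypothetical helper: decode one 16-byte media error record. */
static struct poison_rec decode_poison_record(const uint8_t raw[16])
{
	uint64_t addr = get_le(raw, 8);
	uint32_t len = (uint32_t)get_le(raw + 8, 4);
	struct poison_rec r = {
		.dpa = addr & POISON_DPA_MASK,
		.length = (uint64_t)len * POISON_LEN_MULT,
		.source = (unsigned)(addr & POISON_SOURCE_MASK),
	};

	return r;
}
```

Decoding byte by byte keeps the sketch correct on big-endian hosts too, which is the same concern the kernel addresses with `__le64`/`__le32` fields and `le64_to_cpu()`/`le32_to_cpu()`.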
Example, triggered by memdev:
$ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
Example, triggered by region:
$ echo 1 > /sys/bus/cxl/devices/region5/trigger_poison_list
cxl_poison: memdev=mem0 pcidev=cxl_mem.0 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
cxl_poison: memdev=mem1 pcidev=cxl_mem.1 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
[1]: https://www.computeexpresslink.org/download-the-specification
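The trace lines in the examples are flat `key=value` text, so a quick consumer can pull numeric fields out of a captured line and derive the poisoned byte range `[dpa, dpa + length)`. A minimal sketch, assuming the space-separated format exactly as printed above; `trace_field_u64()` is a hypothetical helper, and tooling that reads the binary trace event fields would avoid text parsing altogether. Keys are matched only at token boundaries, so asking for `uuid` does not match inside `region_uuid=`, and an empty value such as `flags=` is reported as missing.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find "key=" at a token boundary in a space-separated
 * trace line and parse its value as an unsigned integer (0x prefix accepted).
 * Returns false if the key is absent or its value is empty or not numeric.
 */
static bool trace_field_u64(const char *line, const char *key, uint64_t *out)
{
	size_t klen = strlen(key);

	for (const char *p = line; (p = strstr(p, key)) != NULL; p += klen) {
		/* Require start of line or a preceding space, and a trailing '='. */
		if ((p != line && p[-1] != ' ') || p[klen] != '=')
			continue;
		char *end;
		unsigned long long v = strtoull(p + klen + 1, &end, 0);
		if (end == p + klen + 1)
			return false;  /* no digits after '=' */
		*out = v;
		return true;
	}
	return false;
}
```

For the memdev example above this yields `dpa = 0x0` and `length = 0x40`, that is, bytes 0x0 through 0x3f.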
Alison Schofield (5):
cxl/mbox: Add GET_POISON_LIST mailbox command
cxl/trace: Add TRACE support for CXL media-error records
cxl/memdev: Add trigger_poison_list sysfs attribute
cxl/region: Add trigger_poison_list sysfs attribute
tools/testing/cxl: Mock support for Get Poison List
Documentation/ABI/testing/sysfs-bus-cxl | 28 +++++++++
drivers/cxl/core/mbox.c | 78 +++++++++++++++++++++++
drivers/cxl/core/memdev.c | 45 ++++++++++++++
drivers/cxl/core/region.c | 33 ++++++++++
drivers/cxl/core/trace.h | 83 +++++++++++++++++++++++++
drivers/cxl/cxlmem.h | 69 +++++++++++++++++++-
drivers/cxl/pci.c | 4 ++
tools/testing/cxl/test/mem.c | 42 +++++++++++++
8 files changed, 381 insertions(+), 1 deletion(-)
base-commit: 589c3357370a596ef7c99c00baca8ac799fce531
Comments
On Thu, Jan 26, 2023 at 05:59:03PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> > [...]
> > Example, triggered by memdev:
> > $ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
> > cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
>
> I think the pcidev= field wants to be called something like "host" or
> "parent", because there is no strict requirement that a 'struct
> cxl_memdev' is related to a 'struct pci_dev'. In fact in that example
> "cxl_mem.3" is a 'struct platform_device'. Now that I think about it, I
> think all CXL device events should be emitting the PCIe serial number
> for the memdev.

Will do, 'host' and add PCIe serial no.

>
> I will look in the implementation, but do region= and region_uuid= get
> populated when mem3 is a member of the region?

Not always.
In the case above, where the trigger was by memdev, no.
Region= and region_uuid= (and in the follow-on patch, hpa=) only get
populated if the poison was triggered by region, like the case below.

It could be looked up for the by memdev cases. Is that wanted?

Thanks for the reviews Dan!

> > Example, triggered by region:
> > $ echo 1 > /sys/bus/cxl/devices/region5/trigger_poison_list
> > cxl_poison: memdev=mem0 pcidev=cxl_mem.0 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > cxl_poison: memdev=mem1 pcidev=cxl_mem.1 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > [...]
Alison Schofield wrote:
> On Thu, Jan 26, 2023 at 05:59:03PM -0800, Dan Williams wrote:
> > [...]
> > I will look in the implementation, but do region= and region_uuid= get
> > populated when mem3 is a member of the region?
>
> Not always.
> In the case above, where the trigger was by memdev, no.
> Region= and region_uuid= (and in the follow-on patch, hpa=) only get
> populated if the poison was triggered by region, like the case below.
>
> It could be looked up for the by memdev cases. Is that wanted?

Just trying to understand the semantics. However, I do think it makes sense
for a memdev trigger to lookup information on all impacted regions
across all of the device's DPA and the region trigger makes sense to
lookup all memdevs, but bounded by the DPA that contributes to that
region. I just want to avoid someone having to trigger the region to get
extra information that was readily available from a memdev listing.

> [...]
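Bounding a region trigger by the DPA that contributes to that region is, at bottom, a range intersection: a poison record covering `[dpa, dpa + length)` matters to a region only where it overlaps the device's contributed DPA span. A minimal sketch with half-open ranges; `dpa_range` and `clip_poison_to_region()` are hypothetical names, not the driver's types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open DPA interval [start, end). */
struct dpa_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: clip a poison range to the DPA span a device
 * contributes to a region. Returns true and stores the overlap in *out
 * if the two ranges intersect; leaves *out untouched otherwise.
 */
static bool clip_poison_to_region(struct dpa_range poison,
				  struct dpa_range region_span,
				  struct dpa_range *out)
{
	uint64_t start = poison.start > region_span.start ? poison.start
							  : region_span.start;
	uint64_t end = poison.end < region_span.end ? poison.end
						    : region_span.end;

	if (start >= end)
		return false;  /* no overlap with this region's DPA span */
	out->start = start;
	out->end = end;
	return true;
}
```

A record that straddles the end of the span is reported only for the part inside it, which is what keeps a by-region listing from claiming poison the region does not map.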
On Fri, Jan 27, 2023 at 11:16:49AM -0800, Dan Williams wrote:
> Alison Schofield wrote:
> > [...]
> > It could be looked up for the by memdev cases. Is that wanted?
>
> Just trying to understand the semantics. However, I do think it makes sense
> for a memdev trigger to lookup information on all impacted regions
> across all of the device's DPA and the region trigger makes sense to
> lookup all memdevs, but bounded by the DPA that contributes to that
> region. I just want to avoid someone having to trigger the region to get
> extra information that was readily available from a memdev listing.
>

Dan -

Confirming my take-away from this email, and our chat:

Remove the by-region trigger_poison_list option entirely. User space
needs to trigger by-memdev the memdevs participating in the region and
filter those events by region.

Add the region info (region name, uuid) to the TRACE_EVENTs when the
poisoned DPA is part of any region.

Alison

> [...]
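The take-away above moves region filtering to user space: trigger each participating memdev, then keep only the events whose `region=` field names the region of interest. A minimal sketch of that filter over captured trace lines, assuming the text format from the cover letter examples; `event_in_region()` is a hypothetical helper. An empty `region=`, meaning the DPA is not mapped by any region, never matches.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if a cxl_poison trace line carries
 * "region=<name>" as a whole token. "region_uuid=" is a different key
 * and is not matched; an empty "region=" value never matches.
 */
static bool event_in_region(const char *line, const char *name)
{
	static const char key[] = "region=";
	size_t klen = sizeof(key) - 1;
	size_t nlen = strlen(name);

	for (const char *p = line; (p = strstr(p, key)) != NULL; p += klen) {
		if (p != line && p[-1] != ' ')
			continue;  /* suffix of a longer key, keep searching */
		const char *v = p + klen;
		/* Exact value match: name followed by a separator or end of line. */
		return nlen > 0 && strncmp(v, name, nlen) == 0 &&
		       (v[nlen] == ' ' || v[nlen] == '\0');
	}
	return false;
}
```

Requiring a separator after the value keeps `region5` from matching `region50`, which a plain substring search would get wrong.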
Alison Schofield wrote:
> On Fri, Jan 27, 2023 at 11:16:49AM -0800, Dan Williams wrote:
> > [...]
>
> Dan -
>
> Confirming my take-away from this email, and our chat:
>
> Remove the by-region trigger_poison_list option entirely. User space
> needs to trigger by-memdev the memdevs participating in the region and
> filter those events by region.
>
> Add the region info (region name, uuid) to the TRACE_EVENTs when the
> poisoned DPA is part of any region.

That's what I was thinking, yes. So the internals of cxl_mem_get_poison()
will take the cxl_region_rwsem for read and compare the device's endpoint
decoder settings against the media error records to do the region (and
later HPA) lookup.