Message ID | 20221110185758.879472-3-ira.weiny@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp323397wru; Thu, 10 Nov 2022 11:07:39 -0800 (PST) X-Google-Smtp-Source: AMsMyM6+LDO2deaTCUGGKm67xq2xnpFRK1ix/UFbD7UvxpHQspLXOTbyaml+g8pevzUjz+EWf3vt X-Received: by 2002:a17:906:69d2:b0:7a4:c236:906 with SMTP id g18-20020a17090669d200b007a4c2360906mr3492666ejs.318.1668107259388; Thu, 10 Nov 2022 11:07:39 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1668107259; cv=none; d=google.com; s=arc-20160816; b=HMdqhOI4I05CZO8qHZ8L6HureDt3ZFxticiwUM9Cz/KqMWuwbQaZVS4nhCqGKCxLoo i6GCs0Umc6YhXEcC6NSeLmra0H9IeERccxKC2BYAvkLRhgPWvoUOfdCn+ReFQcYlV2g8 BAxLo2ZSV/ujYhT2pMpwdma0fVLfjfb9zS7GnMeqoprhSf/SoE1CPxw9jDR6ZiHj35eY VV8ag1lmhr/MhrIJ3Vf+6QEGjZNeYkcMtidcl28Zhj3pkhqoRkaNoDij++TmFNHkJEO5 U6Qczzs8l/CDOrCiiUo1q7zGuepMe/9YQWu0NXXBt+VC0yZ3+xAaWK9Ljhdv0/E/ynrl Ps9w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=eva+rUNk/HB9Y3pdx7+dUqrLYNoyOf82Beh5Gdsn91U=; b=KZgf6K3bcIHIjG7bMjByYSKP2HmfGN6Lj8pqOJgEtT3iZwPbfkVJ82xIDwdu5Cm5fV FudTu39lmyxexXXZr1Fu0y5jMUUkxS/ODNT/V5e+8seb9JDr8irA4Z7bI+rEHngzyuoA jLpFCkqhfl5JQa3pkTr8WIC0JK0XyX2FW2/TPNWHIO7BBCI0MtNnEO14VGX27N72dqdf x/p70yd58QGO40V8fcobJlUYV3BBZ4imGqimBsmw/KsOotuhKWhowB0oZA58amREjKxA SBnfrjgDvqW7IVYLwyvWuCp+66K9KGpCqrhurRpay6Pqip2B80DRxs6E7ME4fZpogS8V 3PfQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=Apt6dHGi; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id dn5-20020a17090794c500b0078dcd448f99si22027ejc.855.2022.11.10.11.07.14; Thu, 10 Nov 2022 11:07:39 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=Apt6dHGi; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231575AbiKJTF6 (ORCPT <rfc822;winker.wchi@gmail.com> + 99 others); Thu, 10 Nov 2022 14:05:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231499AbiKJTFd (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 10 Nov 2022 14:05:33 -0500 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1035A57B58; Thu, 10 Nov 2022 11:05:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1668107111; x=1699643111; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=i9QLVcvjrWi2bU5ScvvMmPNidfmPBzVUn1G+fNtFVRU=; b=Apt6dHGiKkJJLj147v3I9OqtXZzXMyZ/GoW/S+w7yHyLiDXMs57FnMgx vg67yHpNCuGMfR14YQfIMFAESP3CQnvWJGr8ed2/4+VcU+zXsVtb+US8Q W2aQHzszlB07JYzvc5NZPrOuaMNUd072nUfgeZA9KFsqGeBk3yEHNAT2j 34DtxSPDCan1561f8dNMtSyxS0ZvvTNQx9pPP1MBMKtJGWpVG9fefJUne e5/TiqZe8rgOdjFAfVlLJheuyNQxf5VxDX0dpgoA+UR64iWGh8hyHMRU3 Rhp7MtgBJVsj5JoPN0w1Myf0auZT/hSff4YWOuxEhysBVd8ZaqN1uFwQf A==; X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="375662059" X-IronPort-AV: E=Sophos;i="5.96,154,1665471600"; d="scan'208";a="375662059" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Nov 2022 10:58:05 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="882473403" X-IronPort-AV: E=Sophos;i="5.96,154,1665471600"; d="scan'208";a="882473403" Received: from iweiny-mobl.amr.corp.intel.com (HELO localhost) ([10.212.6.223]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Nov 2022 10:58:05 -0800 From: ira.weiny@intel.com To: Dan Williams <dan.j.williams@intel.com> Cc: Ira Weiny <ira.weiny@intel.com>, Steven Rostedt <rostedt@goodmis.org>, Alison Schofield <alison.schofield@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Ben Widawsky <bwidawsk@kernel.org>, Jonathan Cameron <Jonathan.Cameron@huawei.com>, Davidlohr Bueso <dave@stgolabs.net>, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org Subject: [PATCH 02/11] cxl/mem: Implement Get Event Records command Date: Thu, 10 Nov 2022 10:57:49 -0800 Message-Id: <20221110185758.879472-3-ira.weiny@intel.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20221110185758.879472-1-ira.weiny@intel.com> References: <20221110185758.879472-1-ira.weiny@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1749137237431094886?= X-GMAIL-MSGID: =?utf-8?q?1749137237431094886?= |
Series |
CXL: Process event logs
|
|
Commit Message
Ira Weiny
Nov. 10, 2022, 6:57 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com> CXL devices have multiple event logs which can be queried for CXL event records. Devices are required to support the storage of at least one event record in each event log type. Devices track event log overflow by incrementing a counter and tracking the time of the first and last overflow event seen. Software queries events via the Get Event Record mailbox command; CXL rev 3.0 section 8.2.9.2.2. Issue the Get Event Record mailbox command on driver load. Trace each record found with a generic record trace. Trace any overflow conditions. The device can return up to 1MB worth of event records per query. This presents complications with allocating a huge buffers to potentially capture all the records. It is not anticipated that these event logs will be very deep and reading them does not need to be performant. Process only 3 records at a time. 3 records was chosen as it fits comfortably on the stack to prevent dynamic allocation while still cutting down on extra mailbox messages. This patch traces a raw event record only and leaves the specific event record types to subsequent patches. Macros are created to use for tracing the common CXL Event header fields. Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ira Weiny <ira.weiny@intel.com> --- Change from RFC v2: Support reading 3 events at once. Reverse Jonathan's suggestion and check for positive number of records. Because the record count may have been returned as something > 3 based on what the device thinks it can send back even though the core Linux mbox processing truncates the data. Alison and Dave Jiang Change header uuid type to uuid_t for better user space processing Smita Check status reg before reading log. Steven Prefix all trace points with 'cxl_' Use static branch <trace>_enabled() calls Jonathan s/CXL_EVENT_TYPE_INFO/0 s/{first,last}/{first,last}_ts Remove Reserved field from header Fix header issue for cxl_event_log_type_str() Change from RFC: Remove redundant error message in get event records loop s/EVENT_RECORD_DATA_LENGTH/CXL_EVENT_RECORD_DATA_LENGTH Use hdr_uuid for the header UUID field Use cxl_event_log_type_str() for the trace events Create macros for the header fields and common entries of each event Add reserved buffer output dump Report error if event query fails Remove unused record_cnt variable Steven - reorder overflow record Remove NOTE about checkpatch Jonathan check for exactly 1 record s/v3.0/rev 3.0 Use 3 byte fields for 24bit fields Add 3.0 Maintenance Operation Class Add Dynamic Capacity log type Fix spelling Dave Jiang/Dan/Alison s/cxl-event/cxl trace/events/cxl-events => trace/events/cxl.h s/cxl_event_overflow/overflow s/cxl_event/generic_event --- MAINTAINERS | 1 + drivers/cxl/core/mbox.c | 70 +++++++++++++++++++ drivers/cxl/cxl.h | 8 +++ drivers/cxl/cxlmem.h | 73 ++++++++++++++++++++ include/trace/events/cxl.h | 127 +++++++++++++++++++++++++++++++++++ include/uapi/linux/cxl_mem.h | 1 + 6 files changed, 280 insertions(+) create mode 100644 include/trace/events/cxl.h
Comments
On 11/10/2022 10:57 AM, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > CXL devices have multiple event logs which can be queried for CXL event > records. Devices are required to support the storage of at least one > event record in each event log type. > > Devices track event log overflow by incrementing a counter and tracking > the time of the first and last overflow event seen. > > Software queries events via the Get Event Record mailbox command; CXL > rev 3.0 section 8.2.9.2.2. > > Issue the Get Event Record mailbox command on driver load. Trace each > record found with a generic record trace. Trace any overflow > conditions. > > The device can return up to 1MB worth of event records per query. This > presents complications with allocating a huge buffers to potentially > capture all the records. It is not anticipated that these event logs > will be very deep and reading them does not need to be performant. > Process only 3 records at a time. 3 records was chosen as it fits > comfortably on the stack to prevent dynamic allocation while still > cutting down on extra mailbox messages. > > This patch traces a raw event record only and leaves the specific event > record types to subsequent patches. > > Macros are created to use for tracing the common CXL Event header > fields. > > Cc: Steven Rostedt <rostedt@goodmis.org> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> Would it be cleaner to split out the include/trace/events/cxl.h changes to its own patch? Reviewed-by: Dave Jiang <dave.jiang@intel.com> > > --- > Change from RFC v2: > Support reading 3 events at once. > Reverse Jonathan's suggestion and check for positive number of > records. Because the record count may have been > returned as something > 3 based on what the device > thinks it can send back even though the core Linux mbox > processing truncates the data. > Alison and Dave Jiang > Change header uuid type to uuid_t for better user space > processing > Smita > Check status reg before reading log. > Steven > Prefix all trace points with 'cxl_' > Use static branch <trace>_enabled() calls > Jonathan > s/CXL_EVENT_TYPE_INFO/0 > s/{first,last}/{first,last}_ts > Remove Reserved field from header > Fix header issue for cxl_event_log_type_str() > > Change from RFC: > Remove redundant error message in get event records loop > s/EVENT_RECORD_DATA_LENGTH/CXL_EVENT_RECORD_DATA_LENGTH > Use hdr_uuid for the header UUID field > Use cxl_event_log_type_str() for the trace events > Create macros for the header fields and common entries of each event > Add reserved buffer output dump > Report error if event query fails > Remove unused record_cnt variable > Steven - reorder overflow record > Remove NOTE about checkpatch > Jonathan > check for exactly 1 record > s/v3.0/rev 3.0 > Use 3 byte fields for 24bit fields > Add 3.0 Maintenance Operation Class > Add Dynamic Capacity log type > Fix spelling > Dave Jiang/Dan/Alison > s/cxl-event/cxl > trace/events/cxl-events => trace/events/cxl.h > s/cxl_event_overflow/overflow > s/cxl_event/generic_event > --- > MAINTAINERS | 1 + > drivers/cxl/core/mbox.c | 70 +++++++++++++++++++ > drivers/cxl/cxl.h | 8 +++ > drivers/cxl/cxlmem.h | 73 ++++++++++++++++++++ > include/trace/events/cxl.h | 127 +++++++++++++++++++++++++++++++++++ > include/uapi/linux/cxl_mem.h | 1 + > 6 files changed, 280 insertions(+) > create mode 100644 include/trace/events/cxl.h > > diff --git a/MAINTAINERS b/MAINTAINERS > index ca063a504026..4b7c6e3055c6 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5223,6 +5223,7 @@ M: Dan Williams <dan.j.williams@intel.com> > L: linux-cxl@vger.kernel.org > S: Maintained > F: drivers/cxl/ > +F: include/trace/events/cxl.h > F: include/uapi/linux/cxl_mem.h > > CONEXANT ACCESSRUNNER USB DRIVER > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 16176b9278b4..a908b95a7de4 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -7,6 +7,9 @@ > #include <cxlmem.h> > #include <cxl.h> > > +#define CREATE_TRACE_POINTS > +#include <trace/events/cxl.h> > + > #include "core.h" > > static bool cxl_raw_allow_all; > @@ -48,6 +51,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { > CXL_CMD(RAW, CXL_VARIABLE_PAYLOAD, CXL_VARIABLE_PAYLOAD, 0), > #endif > CXL_CMD(GET_SUPPORTED_LOGS, 0, CXL_VARIABLE_PAYLOAD, CXL_CMD_FLAG_FORCE_ENABLE), > + CXL_CMD(GET_EVENT_RECORD, 1, CXL_VARIABLE_PAYLOAD, 0), > CXL_CMD(GET_FW_INFO, 0, 0x50, 0), > CXL_CMD(GET_PARTITION_INFO, 0, 0x20, 0), > CXL_CMD(GET_LSA, 0x8, CXL_VARIABLE_PAYLOAD, 0), > @@ -704,6 +708,72 @@ int cxl_enumerate_cmds(struct cxl_dev_state *cxlds) > } > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds, > + enum cxl_event_log_type type) > +{ > + struct cxl_get_event_payload payload; > + u16 pl_nr; > + > + do { > + u8 log_type = type; > + int rc; > + > + rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD, > + &log_type, sizeof(log_type), > + &payload, sizeof(payload)); > + if (rc) { > + dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d", > + cxl_event_log_type_str(type), rc); > + return; > + } > + > + pl_nr = le16_to_cpu(payload.record_count); > + if (trace_cxl_generic_event_enabled()) { > + u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS); > + int i; > + > + for (i = 0; i < nr_rec; i++) > + trace_cxl_generic_event(dev_name(cxlds->dev), > + type, > + &payload.record[i]); > + } > + > + if (trace_cxl_overflow_enabled() && > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > + > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > +} > + > +/** > + * cxl_mem_get_event_records - Get Event Records from the device > + * @cxlds: The device data for the operation > + * > + * Retrieve all event records available on the device and report them as trace > + * events. > + * > + * See CXL rev 3.0 @8.2.9.2.2 Get Event Records > + */ > +void cxl_mem_get_event_records(struct cxl_dev_state *cxlds) > +{ > + u32 status = readl(cxlds->regs.status + CXLDEV_DEV_EVENT_STATUS_OFFSET); > + > + dev_dbg(cxlds->dev, "Reading event logs: %x\n", status); > + > + if (status & CXLDEV_EVENT_STATUS_INFO) > + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_INFO); > + if (status & CXLDEV_EVENT_STATUS_WARN) > + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_WARN); > + if (status & CXLDEV_EVENT_STATUS_FAIL) > + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FAIL); > + if (status & CXLDEV_EVENT_STATUS_FATAL) > + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FATAL); > + if (status & CXLDEV_EVENT_STATUS_DYNAMIC_CAP) > + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_DYNAMIC_CAP); > +} > +EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL); > + > /** > * cxl_mem_get_partition_info - Get partition info > * @cxlds: The device data for the operation > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index f680450f0b16..492cff1bea6d 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -132,6 +132,14 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw) > #define CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX 0x3 > #define CXLDEV_CAP_CAP_ID_MEMDEV 0x4000 > > +/* CXL 3.0 8.2.8.3.1 Event Status Register */ > +#define CXLDEV_DEV_EVENT_STATUS_OFFSET 0x00 > +#define CXLDEV_EVENT_STATUS_INFO BIT(0) > +#define CXLDEV_EVENT_STATUS_WARN BIT(1) > +#define CXLDEV_EVENT_STATUS_FAIL BIT(2) > +#define CXLDEV_EVENT_STATUS_FATAL BIT(3) > +#define CXLDEV_EVENT_STATUS_DYNAMIC_CAP BIT(4) > + > /* CXL 2.0 8.2.8.4 Mailbox Registers */ > #define CXLDEV_MBOX_CAPS_OFFSET 0x00 > #define CXLDEV_MBOX_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0) > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h > index b7b955ded3ac..da64ba0f156b 100644 > --- a/drivers/cxl/cxlmem.h > +++ b/drivers/cxl/cxlmem.h > @@ -4,6 +4,7 @@ > #define __CXL_MEM_H__ > #include <uapi/linux/cxl_mem.h> > #include <linux/cdev.h> > +#include <linux/uuid.h> > #include "cxl.h" > > /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */ > @@ -256,6 +257,7 @@ struct cxl_dev_state { > enum cxl_opcode { > CXL_MBOX_OP_INVALID = 0x0000, > CXL_MBOX_OP_RAW = CXL_MBOX_OP_INVALID, > + CXL_MBOX_OP_GET_EVENT_RECORD = 0x0100, > CXL_MBOX_OP_GET_FW_INFO = 0x0200, > CXL_MBOX_OP_ACTIVATE_FW = 0x0202, > CXL_MBOX_OP_GET_SUPPORTED_LOGS = 0x0400, > @@ -325,6 +327,76 @@ struct cxl_mbox_identify { > u8 qos_telemetry_caps; > } __packed; > > +/* > + * Common Event Record Format > + * CXL rev 3.0 section 8.2.9.2.1; Table 8-42 > + */ > +struct cxl_event_record_hdr { > + uuid_t id; > + u8 length; > + u8 flags[3]; > + __le16 handle; > + __le16 related_handle; > + __le64 timestamp; > + u8 maint_op_class; > + u8 reserved[0xf]; > +} __packed; > + > +#define CXL_EVENT_RECORD_DATA_LENGTH 0x50 > +struct cxl_event_record_raw { > + struct cxl_event_record_hdr hdr; > + u8 data[CXL_EVENT_RECORD_DATA_LENGTH]; > +} __packed; > + > +/* > + * Get Event Records output payload > + * CXL rev 3.0 section 8.2.9.2.2; Table 8-50 > + */ > +#define CXL_GET_EVENT_FLAG_OVERFLOW BIT(0) > +#define CXL_GET_EVENT_FLAG_MORE_RECORDS BIT(1) > +#define CXL_GET_EVENT_NR_RECORDS 3 > +struct cxl_get_event_payload { > + u8 flags; > + u8 reserved1; > + __le16 overflow_err_count; > + __le64 first_overflow_timestamp; > + __le64 last_overflow_timestamp; > + __le16 record_count; > + u8 reserved2[0xa]; > + struct cxl_event_record_raw record[CXL_GET_EVENT_NR_RECORDS]; > +} __packed; > + > +/* > + * CXL rev 3.0 section 8.2.9.2.2; Table 8-49 > + */ > +enum cxl_event_log_type { > + CXL_EVENT_TYPE_INFO = 0x00, > + CXL_EVENT_TYPE_WARN, > + CXL_EVENT_TYPE_FAIL, > + CXL_EVENT_TYPE_FATAL, > + CXL_EVENT_TYPE_DYNAMIC_CAP, > + CXL_EVENT_TYPE_MAX > +}; > + > +static inline const char *cxl_event_log_type_str(enum cxl_event_log_type type) > +{ > + switch (type) { > + case CXL_EVENT_TYPE_INFO: > + return "Informational"; > + case CXL_EVENT_TYPE_WARN: > + return "Warning"; > + case CXL_EVENT_TYPE_FAIL: > + return "Failure"; > + case CXL_EVENT_TYPE_FATAL: > + return "Fatal"; > + case CXL_EVENT_TYPE_DYNAMIC_CAP: > + return "Dynamic Capacity"; > + default: > + break; > + } > + return "<unknown>"; > +} > + > struct cxl_mbox_get_partition_info { > __le64 active_volatile_cap; > __le64 active_persistent_cap; > @@ -384,6 +456,7 @@ int cxl_mem_create_range_info(struct cxl_dev_state *cxlds); > struct cxl_dev_state *cxl_dev_state_create(struct device *dev); > void set_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds); > void clear_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds); > +void cxl_mem_get_event_records(struct cxl_dev_state *cxlds); > #ifdef CONFIG_CXL_SUSPEND > void cxl_mem_active_inc(void); > void cxl_mem_active_dec(void); > diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h > new file mode 100644 > index 000000000000..60dec9a84918 > --- /dev/null > +++ b/include/trace/events/cxl.h > @@ -0,0 +1,127 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM cxl > + > +#if !defined(_CXL_TRACE_EVENTS_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _CXL_TRACE_EVENTS_H > + > +#include <asm-generic/unaligned.h> > +#include <linux/tracepoint.h> > +#include <cxlmem.h> > + > +TRACE_EVENT(cxl_overflow, > + > + TP_PROTO(const char *dev_name, enum cxl_event_log_type log, > + struct cxl_get_event_payload *payload), > + > + TP_ARGS(dev_name, log, payload), > + > + TP_STRUCT__entry( > + __string(dev_name, dev_name) > + __field(int, log) > + __field(u64, first_ts) > + __field(u64, last_ts) > + __field(u16, count) > + ), > + > + TP_fast_assign( > + __assign_str(dev_name, dev_name); > + __entry->log = log; > + __entry->count = le16_to_cpu(payload->overflow_err_count); > + __entry->first_ts = le64_to_cpu(payload->first_overflow_timestamp); > + __entry->last_ts = le64_to_cpu(payload->last_overflow_timestamp); > + ), > + > + TP_printk("%s: EVENT LOG OVERFLOW log=%s : %u records from %llu to %llu", > + __get_str(dev_name), cxl_event_log_type_str(__entry->log), > + __entry->count, __entry->first_ts, __entry->last_ts) > + > +); > + > +/* > + * Common Event Record Format > + * CXL 3.0 section 8.2.9.2.1; Table 8-42 > + */ > +#define CXL_EVENT_RECORD_FLAG_PERMANENT BIT(2) > +#define CXL_EVENT_RECORD_FLAG_MAINT_NEEDED BIT(3) > +#define CXL_EVENT_RECORD_FLAG_PERF_DEGRADED BIT(4) > +#define CXL_EVENT_RECORD_FLAG_HW_REPLACE BIT(5) > +#define show_hdr_flags(flags) __print_flags(flags, " | ", \ > + { CXL_EVENT_RECORD_FLAG_PERMANENT, "Permanent Condition" }, \ > + { CXL_EVENT_RECORD_FLAG_MAINT_NEEDED, "Maintenance Needed" }, \ > + { CXL_EVENT_RECORD_FLAG_PERF_DEGRADED, "Performance Degraded" }, \ > + { CXL_EVENT_RECORD_FLAG_HW_REPLACE, "Hardware Replacement Needed" } \ > +) > + > +/* > + * Define macros for the common header of each CXL event. > + * > + * Tracepoints using these macros must do 3 things: > + * > + * 1) Add CXL_EVT_TP_entry to TP_STRUCT__entry > + * 2) Use CXL_EVT_TP_fast_assign within TP_fast_assign; > + * pass the dev_name, log, and CXL event header > + * 3) Use CXL_EVT_TP_printk() instead of TP_printk() > + * > + * See the generic_event tracepoint as an example. > + */ > +#define CXL_EVT_TP_entry \ > + __string(dev_name, dev_name) \ > + __field(int, log) \ > + __field_struct(uuid_t, hdr_uuid) \ > + __field(u32, hdr_flags) \ > + __field(u16, hdr_handle) \ > + __field(u16, hdr_related_handle) \ > + __field(u64, hdr_timestamp) \ > + __field(u8, hdr_length) \ > + __field(u8, hdr_maint_op_class) > + > +#define CXL_EVT_TP_fast_assign(dname, l, hdr) \ > + __assign_str(dev_name, (dname)); \ > + __entry->log = (l); \ > + memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t)); \ > + __entry->hdr_length = (hdr).length; \ > + __entry->hdr_flags = get_unaligned_le24((hdr).flags); \ > + __entry->hdr_handle = le16_to_cpu((hdr).handle); \ > + __entry->hdr_related_handle = le16_to_cpu((hdr).related_handle); \ > + __entry->hdr_timestamp = le64_to_cpu((hdr).timestamp); \ > + __entry->hdr_maint_op_class = (hdr).maint_op_class > + > + > +#define CXL_EVT_TP_printk(fmt, ...) \ > + TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' " \ > + "handle=%x related_handle=%x maint_op_class=%u" \ > + " : " fmt, \ > + __get_str(dev_name), cxl_event_log_type_str(__entry->log), \ > + __entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\ > + show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle, \ > + __entry->hdr_related_handle, __entry->hdr_maint_op_class, \ > + ##__VA_ARGS__) > + > +TRACE_EVENT(cxl_generic_event, > + > + TP_PROTO(const char *dev_name, enum cxl_event_log_type log, > + struct cxl_event_record_raw *rec), > + > + TP_ARGS(dev_name, log, rec), > + > + TP_STRUCT__entry( > + CXL_EVT_TP_entry > + __array(u8, data, CXL_EVENT_RECORD_DATA_LENGTH) > + ), > + > + TP_fast_assign( > + CXL_EVT_TP_fast_assign(dev_name, log, rec->hdr); > + memcpy(__entry->data, &rec->data, CXL_EVENT_RECORD_DATA_LENGTH); > + ), > + > + CXL_EVT_TP_printk("%s", > + __print_hex(__entry->data, CXL_EVENT_RECORD_DATA_LENGTH)) > +); > + > +#endif /* _CXL_TRACE_EVENTS_H */ > + > +/* This part must be outside protection */ > +#undef TRACE_INCLUDE_FILE > +#define TRACE_INCLUDE_FILE cxl > +#include <trace/define_trace.h> > diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h > index c71021a2a9ed..70459be5bdd4 100644 > --- a/include/uapi/linux/cxl_mem.h > +++ b/include/uapi/linux/cxl_mem.h > @@ -24,6 +24,7 @@ > ___C(IDENTIFY, "Identify Command"), \ > ___C(RAW, "Raw device command"), \ > ___C(GET_SUPPORTED_LOGS, "Get Supported Logs"), \ > + ___C(GET_EVENT_RECORD, "Get Event Record"), \ > ___C(GET_FW_INFO, "Get FW Info"), \ > ___C(GET_PARTITION_INFO, "Get Partition Information"), \ > ___C(GET_LSA, "Get Label Storage Area"), \
On Thu, 10 Nov 2022 10:57:49 -0800 ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > CXL devices have multiple event logs which can be queried for CXL event > records. Devices are required to support the storage of at least one > event record in each event log type. > > Devices track event log overflow by incrementing a counter and tracking > the time of the first and last overflow event seen. > > Software queries events via the Get Event Record mailbox command; CXL > rev 3.0 section 8.2.9.2.2. > > Issue the Get Event Record mailbox command on driver load. Trace each > record found with a generic record trace. Trace any overflow > conditions. > > The device can return up to 1MB worth of event records per query. This > presents complications with allocating a huge buffers to potentially > capture all the records. It is not anticipated that these event logs > will be very deep and reading them does not need to be performant. > Process only 3 records at a time. 3 records was chosen as it fits > comfortably on the stack to prevent dynamic allocation while still > cutting down on extra mailbox messages. > > This patch traces a raw event record only and leaves the specific event > record types to subsequent patches. > > Macros are created to use for tracing the common CXL Event header > fields. > > Cc: Steven Rostedt <rostedt@goodmis.org> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> Hi Ira, A question inline about whether some of the conditions you are checking for can actually happen. Otherwise looks good to me. Jonathan > > --- > Change from RFC v2: > Support reading 3 events at once. > Reverse Jonathan's suggestion and check for positive number of > records. Because the record count may have been > returned as something > 3 based on what the device > thinks it can send back even though the core Linux mbox > processing truncates the data. > Alison and Dave Jiang > Change header uuid type to uuid_t for better user space > processing > Smita > Check status reg before reading log. > Steven > Prefix all trace points with 'cxl_' > Use static branch <trace>_enabled() calls > Jonathan > s/CXL_EVENT_TYPE_INFO/0 > s/{first,last}/{first,last}_ts > Remove Reserved field from header > Fix header issue for cxl_event_log_type_str() > > Change from RFC: > Remove redundant error message in get event records loop > s/EVENT_RECORD_DATA_LENGTH/CXL_EVENT_RECORD_DATA_LENGTH > Use hdr_uuid for the header UUID field > Use cxl_event_log_type_str() for the trace events > Create macros for the header fields and common entries of each event > Add reserved buffer output dump > Report error if event query fails > Remove unused record_cnt variable > Steven - reorder overflow record > Remove NOTE about checkpatch > Jonathan > check for exactly 1 record > s/v3.0/rev 3.0 > Use 3 byte fields for 24bit fields > Add 3.0 Maintenance Operation Class > Add Dynamic Capacity log type > Fix spelling > Dave Jiang/Dan/Alison > s/cxl-event/cxl > trace/events/cxl-events => trace/events/cxl.h > s/cxl_event_overflow/overflow > s/cxl_event/generic_event > --- > MAINTAINERS | 1 + > drivers/cxl/core/mbox.c | 70 +++++++++++++++++++ > drivers/cxl/cxl.h | 8 +++ > drivers/cxl/cxlmem.h | 73 ++++++++++++++++++++ > include/trace/events/cxl.h | 127 +++++++++++++++++++++++++++++++++++ > include/uapi/linux/cxl_mem.h | 1 + > 6 files changed, 280 insertions(+) > create mode 100644 include/trace/events/cxl.h > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 16176b9278b4..a908b95a7de4 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds, > + enum cxl_event_log_type type) > +{ > + struct cxl_get_event_payload payload; > + u16 pl_nr; > + > + do { > + u8 log_type = type; > + int rc; > + > + rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD, > + &log_type, sizeof(log_type), > + &payload, sizeof(payload)); > + if (rc) { > + dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d", > + cxl_event_log_type_str(type), rc); > + return; > + } > + > + pl_nr = le16_to_cpu(payload.record_count); > + if (trace_cxl_generic_event_enabled()) { > + u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS); Either I'm misreading the spec, or it can't be greater than NR_RECORDS. "The number of event records in the Event Records list...." Event Records being the field inside this payload which is not big enough to take more than CXL_GET_EVENT_NR_RECORDS and the intro to Get Event Records refers to the number being restricted by the mailbox output payload provided. I'm in favor of defense against broken hardware, but don't paper over any such error - scream about it. > + int i; > + > + for (i = 0; i < nr_rec; i++) > + trace_cxl_generic_event(dev_name(cxlds->dev), > + type, > + &payload.record[i]); > + } > + > + if (trace_cxl_overflow_enabled() && > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > + > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned payload not the total number. > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > +} > diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h > new file mode 100644 > index 000000000000..60dec9a84918 > --- /dev/null > +++ b/include/trace/events/cxl.h > @@ -0,0 +1,127 @@ > +#define CXL_EVT_TP_fast_assign(dname, l, hdr) \ > + __assign_str(dev_name, (dname)); \ > + __entry->log = (l); \ > + memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t)); \ > + __entry->hdr_length = (hdr).length; \ > + __entry->hdr_flags = get_unaligned_le24((hdr).flags); \ > + __entry->hdr_handle = le16_to_cpu((hdr).handle); \ > + __entry->hdr_related_handle = le16_to_cpu((hdr).related_handle); \ > + __entry->hdr_timestamp = le64_to_cpu((hdr).timestamp); \ > + __entry->hdr_maint_op_class = (hdr).maint_op_class > + Trivial: Maybe one blank line is enough? > + > +#define CXL_EVT_TP_printk(fmt, ...) \ > + TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' " \ > + "handle=%x related_handle=%x maint_op_class=%u" \ > + " : " fmt, \ > + __get_str(dev_name), cxl_event_log_type_str(__entry->log), \ > + __entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\ > + show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle, \ > + __entry->hdr_related_handle, __entry->hdr_maint_op_class, \ > + ##__VA_ARGS__)
On Wed, Nov 16, 2022 at 03:19:36PM +0000, Jonathan Cameron wrote: > On Thu, 10 Nov 2022 10:57:49 -0800 > ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > CXL devices have multiple event logs which can be queried for CXL event > > records. Devices are required to support the storage of at least one > > event record in each event log type. > > > > Devices track event log overflow by incrementing a counter and tracking > > the time of the first and last overflow event seen. > > > > Software queries events via the Get Event Record mailbox command; CXL > > rev 3.0 section 8.2.9.2.2. > > > > Issue the Get Event Record mailbox command on driver load. Trace each > > record found with a generic record trace. Trace any overflow > > conditions. > > > > The device can return up to 1MB worth of event records per query. This > > presents complications with allocating a huge buffers to potentially > > capture all the records. It is not anticipated that these event logs > > will be very deep and reading them does not need to be performant. > > Process only 3 records at a time. 3 records was chosen as it fits > > comfortably on the stack to prevent dynamic allocation while still > > cutting down on extra mailbox messages. > > > > This patch traces a raw event record only and leaves the specific event > > record types to subsequent patches. > > > > Macros are created to use for tracing the common CXL Event header > > fields. > > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > Hi Ira, > > A question inline about whether some of the conditions you are checking > for can actually happen. Otherwise looks good to me. > > Jonathan > [snip] > > +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds, > > + enum cxl_event_log_type type) > > +{ > > + struct cxl_get_event_payload payload; > > + u16 pl_nr; > > + > > + do { > > + u8 log_type = type; > > + int rc; > > + > > + rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD, > > + &log_type, sizeof(log_type), > > + &payload, sizeof(payload)); > > + if (rc) { > > + dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d", > > + cxl_event_log_type_str(type), rc); > > + return; > > + } > > + > > + pl_nr = le16_to_cpu(payload.record_count); > > + if (trace_cxl_generic_event_enabled()) { > > + u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS); > > Either I'm misreading the spec, or it can't be greater than NR_RECORDS. Well... I could have read the spec wrong as well. But after reading very carefully I think this is actually correct. > "The number of event records in the Event Records list...." Where is this quote from? I don't see that in the spec. > Event Records being the field inside this payload which is not big enough to > take more than CXL_GET_EVENT_NR_RECORDS and the intro to Get Event Records > refers to the number being restricted by the mailbox output payload provided. My understanding is that the output payload is only limited by the Payload Size reported in the Mailbox Capability Register.Payload Size. (Section 8.2.8.4.3) This can be up to 1MB. So the device could fill up to 1MB's worth of Event Records while still being in compliance. The generic mailbox code in the driver caps the data based on the size passed into cxl_mbox_send_cmd() however, the number of records reported is not changed. > > I'm in favor of defense against broken hardware, but don't paper over any > such error - scream about it. I don't think this is out of spec unless the device is trying to write more than 1MB and I think the core mailbox code will scream about that. > > > + int i; > > + > > + for (i = 0; i < nr_rec; i++) > > + trace_cxl_generic_event(dev_name(cxlds->dev), > > + type, > > + &payload.record[i]); > > + } > > + > > + if (trace_cxl_overflow_enabled() && > > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > > + > > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned > payload not the total number. I don't think so. The only value passed to the device is the _input_ payload size. The output payload size is not passed to the device and is not included in the Get Event Records Input Payload. (Table 8-49) So my previous code was wrong. Here is an example I think which is within the spec but would result in the more records flag not being set. Device log depth == 10 nr log entries == 7 nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000 Device sets Output Payload.Event Record Count == 7 (which is < 8000). Common mailbox code truncates that to 3. More Event Records == 0 because it sent all 7 that it had. This code will clear 3 and read again 2 more times. Am I reading that wrong? > > > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > > +} > > > > diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h > > new file mode 100644 > > index 000000000000..60dec9a84918 > > --- /dev/null > > +++ b/include/trace/events/cxl.h > > @@ -0,0 +1,127 @@ > > > > +#define CXL_EVT_TP_fast_assign(dname, l, hdr) \ > > + __assign_str(dev_name, (dname)); \ > > + __entry->log = (l); \ > > + memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t)); \ > > + __entry->hdr_length = (hdr).length; \ > > + __entry->hdr_flags = get_unaligned_le24((hdr).flags); \ > > + __entry->hdr_handle = le16_to_cpu((hdr).handle); \ > > + __entry->hdr_related_handle = le16_to_cpu((hdr).related_handle); \ > > + __entry->hdr_timestamp = le64_to_cpu((hdr).timestamp); \ > > + __entry->hdr_maint_op_class = (hdr).maint_op_class > > + > Trivial: Maybe one blank line is enough? Yea I'll adjust, Ira > > + > > +#define CXL_EVT_TP_printk(fmt, ...) \ > > + TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' " \ > > + "handle=%x related_handle=%x maint_op_class=%u" \ > > + " : " fmt, \ > > + __get_str(dev_name), cxl_event_log_type_str(__entry->log), \ > > + __entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\ > > + show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle, \ > > + __entry->hdr_related_handle, __entry->hdr_maint_op_class, \ > > + ##__VA_ARGS__) >
On Wed, 16 Nov 2022 16:47:20 -0800 Ira Weiny <ira.weiny@intel.com> wrote: > On Wed, Nov 16, 2022 at 03:19:36PM +0000, Jonathan Cameron wrote: > > On Thu, 10 Nov 2022 10:57:49 -0800 > > ira.weiny@intel.com wrote: > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > CXL devices have multiple event logs which can be queried for CXL event > > > records. Devices are required to support the storage of at least one > > > event record in each event log type. > > > > > > Devices track event log overflow by incrementing a counter and tracking > > > the time of the first and last overflow event seen. > > > > > > Software queries events via the Get Event Record mailbox command; CXL > > > rev 3.0 section 8.2.9.2.2. > > > > > > Issue the Get Event Record mailbox command on driver load. Trace each > > > record found with a generic record trace. Trace any overflow > > > conditions. > > > > > > The device can return up to 1MB worth of event records per query. This > > > presents complications with allocating a huge buffers to potentially > > > capture all the records. It is not anticipated that these event logs > > > will be very deep and reading them does not need to be performant. > > > Process only 3 records at a time. 3 records was chosen as it fits > > > comfortably on the stack to prevent dynamic allocation while still > > > cutting down on extra mailbox messages. > > > > > > This patch traces a raw event record only and leaves the specific event > > > record types to subsequent patches. > > > > > > Macros are created to use for tracing the common CXL Event header > > > fields. > > > > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > Hi Ira, > > > > A question inline about whether some of the conditions you are checking > > for can actually happen. Otherwise looks good to me. > > > > Jonathan > > > > [snip] > > > > +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds, > > > + enum cxl_event_log_type type) > > > +{ > > > + struct cxl_get_event_payload payload; > > > + u16 pl_nr; > > > + > > > + do { > > > + u8 log_type = type; > > > + int rc; > > > + > > > + rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD, > > > + &log_type, sizeof(log_type), > > > + &payload, sizeof(payload)); > > > + if (rc) { > > > + dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d", > > > + cxl_event_log_type_str(type), rc); > > > + return; > > > + } > > > + > > > + pl_nr = le16_to_cpu(payload.record_count); > > > + if (trace_cxl_generic_event_enabled()) { > > > + u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS); > > > > Either I'm misreading the spec, or it can't be greater than NR_RECORDS. > > Well... I could have read the spec wrong as well. But after reading very > carefully I think this is actually correct. > > > "The number of event records in the Event Records list...." > > Where is this quote from? I don't see that in the spec. Table 8-50 Event Record Count (the field we are reading here). > > > Event Records being the field inside this payload which is not big enough to > > take more than CXL_GET_EVENT_NR_RECORDS and the intro to Get Event Records > > refers to the number being restricted by the mailbox output payload provided. > > My understanding is that the output payload is only limited by the Payload Size > reported in the Mailbox Capability Register.Payload Size. (Section 8.2.8.4.3) > > This can be up to 1MB. So the device could fill up to 1MB's worth of Event > Records while still being in compliance. The generic mailbox code in the > driver caps the data based on the size passed into cxl_mbox_send_cmd() however, > the number of records reported is not changed. Indeed I had that wrong. I thought we passed in an output payload length whereas we only provide "payload length" which is defined as being the input length in 8.2.8.4.5 > > > > > I'm in favor of defense against broken hardware, but don't paper over any > > such error - scream about it. > > I don't think this is out of spec unless the device is trying to write more > than 1MB and I think the core mailbox code will scream about that. > > > > > > + int i; > > > + > > > + for (i = 0; i < nr_rec; i++) > > > + trace_cxl_generic_event(dev_name(cxlds->dev), > > > + type, > > > + &payload.record[i]); > > > + } > > > + > > > + if (trace_cxl_overflow_enabled() && > > > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > > > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > > > + > > > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned > > payload not the total number. > > I don't think so. The only value passed to the device is the _input_ payload > size. The output payload size is not passed to the device and is not included > in the Get Event Records Input Payload. (Table 8-49) > > So my previous code was wrong. Here is an example I think which is within the > spec but would result in the more records flag not being set. > > Device log depth == 10 > nr log entries == 7 > nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000 > > Device sets Output Payload.Event Record Count == 7 (which is < 8000). Common > mailbox code truncates that to 3. More Event Records == 0 because it sent all > 7 that it had. > > This code will clear 3 and read again 2 more times. > > Am I reading that wrong? I think this is still wrong, but for a different reason. :) If we don't clear the records and more records is set, that means it didn't fit in the mailbox payload (potentially 1MB) then the next read will return the next set of records from there. Taking this patch only, let's say the mailbox takes 4 records. Read 1: Records 0, 1, 2, 3 More set. We handle 0, 1, 2 Read 2: Records 4, 5, 6 More not set. We handle 4, 5, 6 Record 3 is never handled. If we add in clearing as happens later in the series, the current assumption is that if we clear some records a subsequent read will start again. I'm not sure that is true. If it is spec reference needed. So assumption is Read 1: Records 0, 1, 2, 3 More set Clear 0, 1, 2 Read 2: Records 3, 4, 5, 6 Clear 3, 4, 5 More not set, but catch it with the condition above. Read 3: 6 only Clear 6 However, I think a valid implementation could do the following (imagine a ring buffer with a pointer to the 'next' record to read out and each record has a 'valid' flag to deal with corner cases around sequences such as read log once, start reading again and some clears occur using handles obtained from first read - not that case isn't ruled out by the spec as far as I can see). Read 1: Records 0, 1, 2, 3 More set. 'next' pointer points to record 4. Clear 0, 1, 2 Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7. Clear 4, 5, 6 Skipping record 3. So I think we have to absorb the full mailbox payload each time to guarantee we don't skip events or process them out of order (which is what would happen if we relied on a retry loop - we aren't allowed to clear them out of order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device shall verify the event record handles specified in the input payload are in temporal order. ... "). Obviously that temporal order thing is only relevant if we get my second example occurring on real hardware. I think the spec is vague enough to allow that implementation. Would have been easy to specify this originally but it probably won't go in as errata so we need to cope with all the flexibility that is present. What fun and oh for a parameter to control how many records are returned! Jonathan > > > > > > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > > > +} > > >
On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote: > On Wed, 16 Nov 2022 16:47:20 -0800 > Ira Weiny <ira.weiny@intel.com> wrote: > > [snip] > > > > > > > > > + int i; > > > > + > > > > + for (i = 0; i < nr_rec; i++) > > > > + trace_cxl_generic_event(dev_name(cxlds->dev), > > > > + type, > > > > + &payload.record[i]); > > > > + } > > > > + > > > > + if (trace_cxl_overflow_enabled() && > > > > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > > > > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > > > > + > > > > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > > > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned > > > payload not the total number. > > > > I don't think so. The only value passed to the device is the _input_ payload > > size. The output payload size is not passed to the device and is not included > > in the Get Event Records Input Payload. (Table 8-49) > > > > So my previous code was wrong. Here is an example I think which is within the > > spec but would result in the more records flag not being set. > > > > Device log depth == 10 > > nr log entries == 7 > > nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000 > > > > Device sets Output Payload.Event Record Count == 7 (which is < 8000). Common > > mailbox code truncates that to 3. More Event Records == 0 because it sent all > > 7 that it had. > > > > This code will clear 3 and read again 2 more times. > > > > Am I reading that wrong? > > I think this is still wrong, but for a different reason. :) I hope not... :-/ > If we don't clear the records and more records is set, that means it didn't > fit in the mailbox payload (potentially 1MB) then the next read > will return the next set of records from there. That is not how I read the Get Event Records command: From 8.2.9.2.2 Get Event Records ... "Devices shall return event records to the host in the temporal order the device detected the events in. The event occurring the earliest in time, in the specific event log, shall be returned first." If item 3 below is earlier than 4 then it must be returned if we have not cleared it. At least that is how I read the above. :-/ > > Taking this patch only, let's say the mailbox takes 4 records. > Read 1: Records 0, 1, 2, 3 More set. > We handle 0, 1, 2 > Read 2: Records 4, 5, 6 More not set. > We handle 4, 5, 6 > > Record 3 is never handled. > > If we add in clearing as happens later in the series, I suppose I should squash the patches as this may not work without the clearing. :-/ > the current > assumption is that if we clear some records a subsequent read will > start again. I'm not sure that is true. If it is spec reference needed. > > So assumption is > Read 1: Records 0, 1, 2, 3 More set > Clear 0, 1, 2 > Read 2: Records 3, 4, 5, 6 > Clear 3, 4, 5 More not set, but catch it with the condition above. > Read 3: 6 only > Clear 6 > > However, I think a valid implementation could do the following > (imagine a ring buffer with a pointer to the 'next' record to read out and > each record has a 'valid' flag to deal with corner cases around > sequences such as read log once, start reading again and some > clears occur using handles obtained from first read - not that > case isn't ruled out by the spec as far as I can see). I believe this is a violation because the next pointer can't be advanced until the record is cleared. Otherwise the device is not returning items in temporal order based on what is in the log. > > Read 1: Records 0, 1, 2, 3 More set. 'next' pointer points to record 4. > Clear 0, 1, 2 > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7. > Clear 4, 5, 6 > > Skipping record 3. > > So I think we have to absorb the full mailbox payload each time to guarantee > we don't skip events or process them out of order (which is what would happen > if we relied on a retry loop - we aren't allowed to clear them out of > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device > shall verify the event record handles specified in the input payload are in > temporal order. ... "). > Obviously that temporal order thing is only relevant if we get my second > example occurring on real hardware. I think the spec is vague enough > to allow that implementation. Would have been easy to specify this originally > but it probably won't go in as errata so we need to cope with all the > flexibility that is present. :-( Yea coulda, woulda, shoulda... ;-) > > What fun and oh for a parameter to control how many records are returned! Yea. But I really don't think there is a problem unless someone really take liberty with the spec. I think it boils down to how one interprets _when_ a record is removed from the log. If the record is removed when it is returned (as in your 'next' pointer example) then why have a clear at all? If my interpretation is correct then the next available entry is the one which has not been cleared. Therefore in your example 'next' is not incremented until clear has been called. I think that implementation is also supported by the idea that records must be cleared in temporal order. Otherwise I think devices would get confused. FWIW the qemu implementation is based on my interpretation ATM. Ira > > Jonathan > > > > > > > > > > > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > > > > +} > > > > > > >
On Fri, 18 Nov 2022 15:26:17 -0800 Ira Weiny <ira.weiny@intel.com> wrote: > On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote: > > On Wed, 16 Nov 2022 16:47:20 -0800 > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > [snip] > > > > > > > > > > > > > + int i; > > > > > + > > > > > + for (i = 0; i < nr_rec; i++) > > > > > + trace_cxl_generic_event(dev_name(cxlds->dev), > > > > > + type, > > > > > + &payload.record[i]); > > > > > + } > > > > > + > > > > > + if (trace_cxl_overflow_enabled() && > > > > > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > > > > > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > > > > > + > > > > > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > > > > > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned > > > > payload not the total number. > > > > > > I don't think so. The only value passed to the device is the _input_ payload > > > size. The output payload size is not passed to the device and is not included > > > in the Get Event Records Input Payload. (Table 8-49) > > > > > > So my previous code was wrong. Here is an example I think which is within the > > > spec but would result in the more records flag not being set. > > > > > > Device log depth == 10 > > > nr log entries == 7 > > > nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000 > > > > > > Device sets Output Payload.Event Record Count == 7 (which is < 8000). Common > > > mailbox code truncates that to 3. More Event Records == 0 because it sent all > > > 7 that it had. > > > > > > This code will clear 3 and read again 2 more times. > > > > > > Am I reading that wrong? > > > > I think this is still wrong, but for a different reason. :) > > I hope not... :-/ > > > If we don't clear the records and more records is set, that means it didn't > > fit in the mailbox payload (potentially 1MB) then the next read > > will return the next set of records from there. > > That is not how I read the Get Event Records command: > > From 8.2.9.2.2 Get Event Records > > ... "Devices shall return event records to the host in the temporal order the > device detected the events in. The event occurring the earliest in time, in the > specific event log, shall be returned first." > > If item 3 below is earlier than 4 then it must be returned if we have not > cleared it. At least that is how I read the above. :-/ In general that doesn't work. Imagine we cleared no records. In that case we'd return 4 despite there being earlier records. There is no language to cover this particular case of clearing part of what was returned. The device did return the records in temporal order, we just didn't notice some of them. The wonders of slightly loose spec wording. Far as I can tell we are stuck with having to come with all things that could be read as being valid implementations. > > > > > Taking this patch only, let's say the mailbox takes 4 records. > > Read 1: Records 0, 1, 2, 3 More set. > > We handle 0, 1, 2 > > Read 2: Records 4, 5, 6 More not set. > > We handle 4, 5, 6 > > > > Record 3 is never handled. > > > > If we add in clearing as happens later in the series, > > I suppose I should squash the patches as this may not work without the > clearing. :-/ > > > the current > > assumption is that if we clear some records a subsequent read will > > start again. I'm not sure that is true. If it is spec reference needed. > > > > So assumption is > > Read 1: Records 0, 1, 2, 3 More set > > Clear 0, 1, 2 > > Read 2: Records 3, 4, 5, 6 > > Clear 3, 4, 5 More not set, but catch it with the condition above. > > Read 3: 6 only > > Clear 6 > > > > However, I think a valid implementation could do the following > > (imagine a ring buffer with a pointer to the 'next' record to read out and > > each record has a 'valid' flag to deal with corner cases around > > sequences such as read log once, start reading again and some > > clears occur using handles obtained from first read - not that > > case isn't ruled out by the spec as far as I can see). > > I believe this is a violation because the next pointer can't be advanced until > the record is cleared. Otherwise the device is not returning items in temporal > order based on what is in the log. Ah. This is where we disagree. The temporal order is (potentially?) unconnected from the clearing. The device did return them in temporal order, we just didn't take any novice of record 3 being returned. A valid reading of that temporal order comment is actually the other way around that the device must not reset it's idea of temporal order until all records have been read (reading 3 twice is not in temporal order - imagine we had read 5 each time and it becomes more obvious as the read order becomes 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal reading of the term. The more I read this, the more I think the current implementation is not compliant with the specification at all. I'm not seeing a spec mention of 'reseting' the ordering on clearing records (which might have been a good thing in the first place but too late now). > > > > > Read 1: Records 0, 1, 2, 3 More set. 'next' pointer points to record 4. > > Clear 0, 1, 2 > > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7. > > Clear 4, 5, 6 > > > > Skipping record 3. > > > > So I think we have to absorb the full mailbox payload each time to guarantee > > we don't skip events or process them out of order (which is what would happen > > if we relied on a retry loop - we aren't allowed to clear them out of > > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device > > shall verify the event record handles specified in the input payload are in > > temporal order. ... "). > > Obviously that temporal order thing is only relevant if we get my second > > example occurring on real hardware. I think the spec is vague enough > > to allow that implementation. Would have been easy to specify this originally > > but it probably won't go in as errata so we need to cope with all the > > flexibility that is present. > > :-( Yea coulda, woulda, shoulda... ;-) > > > > > What fun and oh for a parameter to control how many records are returned! > > Yea. But I really don't think there is a problem unless someone really take > liberty with the spec. I think it boils down to how one interprets _when_ a > record is removed from the log. This is nothing to do with removal. The wording we have is just about reading and I think a strict reading of the spec would say your assumption of a reset of the read pointer on clear is NOT a valid implementation. There is separate wording about clears being in temporal order, but that doesn't effect the Get Event Records handling. > > If the record is removed when it is returned (as in your 'next' pointer > example) then why have a clear at all? Because if your software crashes, you don't have a handshake to reestablish state. If that happens you read the whole log until MORE is not set and then read it again to get a clean list. It's messy situation that has been discussed before for GET POISON LIST which has the same nasty handing of MORE. (look in appropriate forum for resolution to that one that we can't yet discuss here!) Also, allows for non destructive readback (debugging tools might take a look having paused the normal handling). > If my interpretation is correct then > the next available entry is the one which has not been cleared. If that is the case the language in "More Event Records" doesn't work "The host should continue to retrieve records using this command, until this indicator is no longer set by the device" With your reading of the spec, if we clear nothing, we'd keep getting the first set of records and only be able to read more by clearing them... > Therefore in > your example 'next' is not incremented until clear has been called. I think > that implementation is also supported by the idea that records must be cleared > in temporal order. Otherwise I think devices would get confused. Not hard for device to do this (how I now read the spec) properly. Two pointers: 1) Next to clear: CLEAR 2) Next to read: READ Advance the the READ pointer on Get Event Records For CLEAR, check that the requested clears are handled in order and that they are before the READ pointer. Maybe we should just take it to appropriate spec forum to seek a clarification? Jonathan > > FWIW the qemu implementation is based on my interpretation ATM. > > Ira > > > > > Jonathan > > > > > > > > > > > > > > > > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > > > > > +} > > > > > > > > > > >
On Mon, Nov 21, 2022 at 10:47:14AM +0000, Jonathan Cameron wrote: > On Fri, 18 Nov 2022 15:26:17 -0800 > Ira Weiny <ira.weiny@intel.com> wrote: > > > On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote: > > > On Wed, 16 Nov 2022 16:47:20 -0800 > > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > > > > > [snip] > > > > > > > > > > > > > > > > > + int i; > > > > > > + > > > > > > + for (i = 0; i < nr_rec; i++) > > > > > > + trace_cxl_generic_event(dev_name(cxlds->dev), > > > > > > + type, > > > > > > + &payload.record[i]); > > > > > > + } > > > > > > + > > > > > > + if (trace_cxl_overflow_enabled() && > > > > > > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > > > > > > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > > > > > > + > > > > > > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > > > > > > > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned > > > > > payload not the total number. > > > > > > > > I don't think so. The only value passed to the device is the _input_ payload > > > > size. The output payload size is not passed to the device and is not included > > > > in the Get Event Records Input Payload. (Table 8-49) > > > > > > > > So my previous code was wrong. Here is an example I think which is within the > > > > spec but would result in the more records flag not being set. > > > > > > > > Device log depth == 10 > > > > nr log entries == 7 > > > > nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000 > > > > > > > > Device sets Output Payload.Event Record Count == 7 (which is < 8000). Common > > > > mailbox code truncates that to 3. More Event Records == 0 because it sent all > > > > 7 that it had. > > > > > > > > This code will clear 3 and read again 2 more times. > > > > > > > > Am I reading that wrong? > > > > > > I think this is still wrong, but for a different reason. :) > > > > I hope not... :-/ > > > > > If we don't clear the records and more records is set, that means it didn't > > > fit in the mailbox payload (potentially 1MB) then the next read > > > will return the next set of records from there. > > > > That is not how I read the Get Event Records command: > > > > From 8.2.9.2.2 Get Event Records > > > > ... "Devices shall return event records to the host in the temporal order the > > device detected the events in. The event occurring the earliest in time, in the > > specific event log, shall be returned first." > > > > If item 3 below is earlier than 4 then it must be returned if we have not > > cleared it. At least that is how I read the above. :-/ > > In general that doesn't work. Imagine we cleared no records. > In that case we'd return 4 despite there being earlier records. > There is no language to cover this particular case of clearing > part of what was returned. The device did return the records > in temporal order, we just didn't notice some of them. > > The wonders of slightly loose spec wording. Far as I can tell > we are stuck with having to come with all things that could be > read as being valid implementations. So I've been thinking about this for a while. Lets take this example: > > > > > > Taking this patch only, let's say the mailbox takes 4 records. > > > Read 1: Records 0, 1, 2, 3 More set. > > > We handle 0, 1, 2 > > > Read 2: Records 4, 5, 6 More not set. > > > We handle 4, 5, 6 > > > In this case what happens if you do a 3rd read? Does the device return nothing? Or does it return 0, 1, 2, 3 again? It must start from the beginning right? But that is no longer in temporal order by your definition either. And if it returns nothing then there is no way to recover them except on device reset? FWIW I'm altering the patch set to do what you say and allocate a buffer large enough to get all the records. Because I am thinking you are correct. However, considering the buffer may be large, I fear we may run afoul of memory allocation failures. And that will require some more tricky error recovery to continue reading the log because the irq settings state: "... Settings: Specifies the settings for the interrupt when the <event> event log transitions from having no entries to having one or more entries." ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This means that no more interrupts will happen until the log is empty and additional events occur. So if an allocation failure happens I'll have to put a task on a work queue to wake up and continue to try. Otherwise the log will stall. Or we could just put a WARN_ON_ONCE() in and hope this never happens... I still believe that with a clear operation defined my method makes more sense. But I agree with you that the language is not strong. :-( > > > Record 3 is never handled. > > > > > > If we add in clearing as happens later in the series, > > > > I suppose I should squash the patches as this may not work without the > > clearing. :-/ > > > > > the current > > > assumption is that if we clear some records a subsequent read will > > > start again. I'm not sure that is true. If it is spec reference needed. > > > > > > So assumption is > > > Read 1: Records 0, 1, 2, 3 More set > > > Clear 0, 1, 2 > > > Read 2: Records 3, 4, 5, 6 > > > Clear 3, 4, 5 More not set, but catch it with the condition above. > > > Read 3: 6 only > > > Clear 6 > > > > > > However, I think a valid implementation could do the following > > > (imagine a ring buffer with a pointer to the 'next' record to read out and > > > each record has a 'valid' flag to deal with corner cases around > > > sequences such as read log once, start reading again and some > > > clears occur using handles obtained from first read - not that > > > case isn't ruled out by the spec as far as I can see). > > > > I believe this is a violation because the next pointer can't be advanced until > > the record is cleared. Otherwise the device is not returning items in temporal > > order based on what is in the log. > > Ah. This is where we disagree. The temporal order is (potentially?) unconnected > from the clearing. The device did return them in temporal order, we just didn't > take any novice of record 3 being returned. :-/ > A valid reading of that temporal order comment is actually the other way around > that the device must not reset it's idea of temporal order until all records > have been read (reading 3 twice is not in temporal order - imagine we had > read 5 each time and it becomes more obvious as the read order becomes > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal > reading of the term. Well I guess. My reading was that it must return the first element temporally within the list at the time of the Get operation. So in this example since 3 is still in the list it must return it first. Each read is considered atomic from the others. Yes as long as 0 is in the queue it will be returned. But I can see it your way too... > > The more I read this, the more I think the current implementation > is not compliant with the specification at all. > > I'm not seeing a spec mention of 'reseting' the ordering on clearing records > (which might have been a good thing in the first place but too late now). There is no resetting of order. Only that the device does not consider the previous reads on determining which events to return on any individual Get call. > > > > > > > > > Read 1: Records 0, 1, 2, 3 More set. 'next' pointer points to record 4. > > > Clear 0, 1, 2 > > > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7. > > > Clear 4, 5, 6 > > > > > > Skipping record 3. > > > > > > So I think we have to absorb the full mailbox payload each time to guarantee > > > we don't skip events or process them out of order (which is what would happen > > > if we relied on a retry loop - we aren't allowed to clear them out of > > > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device > > > shall verify the event record handles specified in the input payload are in > > > temporal order. ... "). > > > Obviously that temporal order thing is only relevant if we get my second > > > example occurring on real hardware. I think the spec is vague enough > > > to allow that implementation. Would have been easy to specify this originally > > > but it probably won't go in as errata so we need to cope with all the > > > flexibility that is present. > > > > :-( Yea coulda, woulda, shoulda... ;-) > > > > > > > > What fun and oh for a parameter to control how many records are returned! > > > > Yea. But I really don't think there is a problem unless someone really take > > liberty with the spec. I think it boils down to how one interprets _when_ a > > record is removed from the log. > > This is nothing to do with removal. The wording we have is just about reading > and I think a strict reading of the spec would say your assumption of a reset of the > read pointer on clear is NOT a valid implementation. There is separate wording > about clears being in temporal order, but that doesn't effect the Get Event > Records handling. > > > > > If the record is removed when it is returned (as in your 'next' pointer > > example) then why have a clear at all? > > Because if your software crashes, you don't have a handshake to reestablish > state. If that happens you read the whole log until MORE is not set and > then read it again to get a clean list. It's messy situation that has > been discussed before for GET POISON LIST which has the same nasty handing > of MORE. (look in appropriate forum for resolution to that one that we can't > yet discuss here!) I can see the similarities but I think events are a more ephemeral item which makes sense to clear once they are consumed. The idea that they should be left for others to consume does not make sense to me. Where Poison is something which could be a permanent marker which should be left in a list. > > Also, allows for non destructive readback (debugging tools might take a look > having paused the normal handling). That is true. > > > If my interpretation is correct then > > the next available entry is the one which has not been cleared. > > If that is the case the language in "More Event Records" doesn't work > "The host should continue to retrieve records using this command, until > this indicator is no longer set by the device" > > With your reading of the spec, if we clear nothing, we'd keep getting the > first set of records and only be able to read more by clearing them... > Yea. > > > Therefore in > > your example 'next' is not incremented until clear has been called. I think > > that implementation is also supported by the idea that records must be cleared > > in temporal order. Otherwise I think devices would get confused. > > Not hard for device to do this (how I now read the spec) properly. > > Two pointers: > 1) Next to clear: CLEAR > 2) Next to read: READ > > Advance the the READ pointer on Get Event Records And loop back to the start on a further read... I'm looking at changing the code for this but I think making it fully robust under a memory allocation failure is going to be more tedious or we punt. > For CLEAR, check that the requested clears are handled in order and that > they are before the READ pointer. > > Maybe we should just take it to appropriate spec forum to seek a clarification? Probably. I've not paid attention lately. I've sent a separate email with you cc'ed. Perhaps we can get some clarification before I completely rework this. Ira > > Jonathan > > > > > FWIW the qemu implementation is based on my interpretation ATM. > > > > Ira > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > > > > > > +} > > > > > > > > > > > > > > > >
On Mon, 28 Nov 2022 15:30:12 -0800 Ira Weiny <ira.weiny@intel.com> wrote: > On Mon, Nov 21, 2022 at 10:47:14AM +0000, Jonathan Cameron wrote: > > On Fri, 18 Nov 2022 15:26:17 -0800 > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > > On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote: > > > > On Wed, 16 Nov 2022 16:47:20 -0800 > > > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > > > > > > > > > [snip] > > > > > > > > > > > > > > > > > > > > > + int i; > > > > > > > + > > > > > > > + for (i = 0; i < nr_rec; i++) > > > > > > > + trace_cxl_generic_event(dev_name(cxlds->dev), > > > > > > > + type, > > > > > > > + &payload.record[i]); > > > > > > > + } > > > > > > > + > > > > > > > + if (trace_cxl_overflow_enabled() && > > > > > > > + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) > > > > > > > + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); > > > > > > > + > > > > > > > + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || > > > > > > > > > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned > > > > > > payload not the total number. > > > > > > > > > > I don't think so. The only value passed to the device is the _input_ payload > > > > > size. The output payload size is not passed to the device and is not included > > > > > in the Get Event Records Input Payload. (Table 8-49) > > > > > > > > > > So my previous code was wrong. Here is an example I think which is within the > > > > > spec but would result in the more records flag not being set. > > > > > > > > > > Device log depth == 10 > > > > > nr log entries == 7 > > > > > nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000 > > > > > > > > > > Device sets Output Payload.Event Record Count == 7 (which is < 8000). Common > > > > > mailbox code truncates that to 3. More Event Records == 0 because it sent all > > > > > 7 that it had. > > > > > > > > > > This code will clear 3 and read again 2 more times. > > > > > > > > > > Am I reading that wrong? > > > > > > > > I think this is still wrong, but for a different reason. :) > > > > > > I hope not... :-/ > > > > > > > If we don't clear the records and more records is set, that means it didn't > > > > fit in the mailbox payload (potentially 1MB) then the next read > > > > will return the next set of records from there. > > > > > > That is not how I read the Get Event Records command: > > > > > > From 8.2.9.2.2 Get Event Records > > > > > > ... "Devices shall return event records to the host in the temporal order the > > > device detected the events in. The event occurring the earliest in time, in the > > > specific event log, shall be returned first." > > > > > > If item 3 below is earlier than 4 then it must be returned if we have not > > > cleared it. At least that is how I read the above. :-/ > > > > In general that doesn't work. Imagine we cleared no records. > > In that case we'd return 4 despite there being earlier records. > > There is no language to cover this particular case of clearing > > part of what was returned. The device did return the records > > in temporal order, we just didn't notice some of them. > > > > The wonders of slightly loose spec wording. Far as I can tell > > we are stuck with having to come with all things that could be > > read as being valid implementations. > > So I've been thinking about this for a while. > > Lets take this example: > > > > > > > > > Taking this patch only, let's say the mailbox takes 4 records. > > > > Read 1: Records 0, 1, 2, 3 More set. > > > > We handle 0, 1, 2 > > > > Read 2: Records 4, 5, 6 More not set. > > > > We handle 4, 5, 6 > > > > > > In this case what happens if you do a 3rd read? Does the device return > nothing? Or does it return 0, 1, 2, 3 again? > > It must start from the beginning right? But that is no longer in temporal > order by your definition either. Agreed that is not clearly specified either. I assume it works the same way as poison where we raised the question and conclusion was it starts again at the beginning. In fact we have to loop twice to guarantee that we have all the records (as other software may have crashed half way through reading the poison list so we don't know if we have the first record or not).. > > And if it returns nothing then there is no way to recover them except on device > reset? > > FWIW I'm altering the patch set to do what you say and allocate a buffer large > enough to get all the records. Because I am thinking you are correct. Horrible, but maybe the best we can do (subject to suggested hack below ;) > > However, considering the buffer may be large, I fear we may run afoul of memory > allocation failures. And that will require some more tricky error recovery to > continue reading the log because the irq settings state: > We could implement cleverer mailbox handling to avoid the large allocation requirement. Would be messy though as we'd effectively have to lock the mailbox whilst we did multiple reads of the content into a smaller buffer. > "... Settings: Specifies the settings for the interrupt when the <event> event > log transitions from having no entries to having one or more entries." > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > This means that no more interrupts will happen until the log is empty and > additional events occur. So if an allocation failure happens I'll have to put > a task on a work queue to wake up and continue to try. Otherwise the log will > stall. Or we could just put a WARN_ON_ONCE() in and hope this never happens... I think the WARN_ON_ONCE() is probably fine. If we are paranoid vmalloc one when we initially connect device as failure less likely... As a side note, seems like we should maybe take a request to SSWG for devices to optionally be told to use a smaller mailbox than they support - in order to allow for corners like this. There is such a command but it's prohibited on primary and secondary mailboxes (Set Response Message Limit). That is allowed on switch CCIs, I guess because it is assumed they may be connected to a BMC without much memory. > > I still believe that with a clear operation defined my method makes more sense. > But I agree with you that the language is not strong. Absolutely agree! Your method would be the one I'd push for if we were starting from scratch (or another similar method looking like what I can't talk about for a similar case...) > > :-( > > > > > Record 3 is never handled. > > > > > > > > If we add in clearing as happens later in the series, > > > > > > I suppose I should squash the patches as this may not work without the > > > clearing. :-/ > > > > > > > the current > > > > assumption is that if we clear some records a subsequent read will > > > > start again. I'm not sure that is true. If it is spec reference needed. > > > > > > > > So assumption is > > > > Read 1: Records 0, 1, 2, 3 More set > > > > Clear 0, 1, 2 > > > > Read 2: Records 3, 4, 5, 6 > > > > Clear 3, 4, 5 More not set, but catch it with the condition above. > > > > Read 3: 6 only > > > > Clear 6 > > > > > > > > However, I think a valid implementation could do the following > > > > (imagine a ring buffer with a pointer to the 'next' record to read out and > > > > each record has a 'valid' flag to deal with corner cases around > > > > sequences such as read log once, start reading again and some > > > > clears occur using handles obtained from first read - not that > > > > case isn't ruled out by the spec as far as I can see). > > > > > > I believe this is a violation because the next pointer can't be advanced until > > > the record is cleared. Otherwise the device is not returning items in temporal > > > order based on what is in the log. > > > > Ah. This is where we disagree. The temporal order is (potentially?) unconnected > > from the clearing. The device did return them in temporal order, we just didn't > > take any novice of record 3 being returned. > > :-/ > > > A valid reading of that temporal order comment is actually the other way around > > that the device must not reset it's idea of temporal order until all records > > have been read (reading 3 twice is not in temporal order - imagine we had > > read 5 each time and it becomes more obvious as the read order becomes > > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal > > reading of the term. > > Well I guess. My reading was that it must return the first element temporally > within the list at the time of the Get operation. > > So in this example since 3 is still in the list it must return it first. Each > read is considered atomic from the others. Yes as long as 0 is in the queue it > will be returned. > > But I can see it your way too... That pesky text under More Event Records flag doesn't mention clearing when it says "The host should continue to retrieve records using this command, until this indicator is no longer set by the device." I wish it did :( > > > > > The more I read this, the more I think the current implementation > > is not compliant with the specification at all. > > > > I'm not seeing a spec mention of 'reseting' the ordering on clearing records > > (which might have been a good thing in the first place but too late now). > > There is no resetting of order. Only that the device does not consider the > previous reads on determining which events to return on any individual Get > call. Sure, see above quote though. > > > > > > > > > > > > > > Read 1: Records 0, 1, 2, 3 More set. 'next' pointer points to record 4. > > > > Clear 0, 1, 2 > > > > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7. > > > > Clear 4, 5, 6 > > > > > > > > Skipping record 3. > > > > > > > > So I think we have to absorb the full mailbox payload each time to guarantee > > > > we don't skip events or process them out of order (which is what would happen > > > > if we relied on a retry loop - we aren't allowed to clear them out of > > > > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device > > > > shall verify the event record handles specified in the input payload are in > > > > temporal order. ... "). > > > > Obviously that temporal order thing is only relevant if we get my second > > > > example occurring on real hardware. I think the spec is vague enough > > > > to allow that implementation. Would have been easy to specify this originally > > > > but it probably won't go in as errata so we need to cope with all the > > > > flexibility that is present. > > > > > > :-( Yea coulda, woulda, shoulda... ;-) > > > > > > > > > > > What fun and oh for a parameter to control how many records are returned! > > > > > > Yea. But I really don't think there is a problem unless someone really take > > > liberty with the spec. I think it boils down to how one interprets _when_ a > > > record is removed from the log. > > > > This is nothing to do with removal. The wording we have is just about reading > > and I think a strict reading of the spec would say your assumption of a reset of the > > read pointer on clear is NOT a valid implementation. There is separate wording > > about clears being in temporal order, but that doesn't effect the Get Event > > Records handling. > > > > > > > > If the record is removed when it is returned (as in your 'next' pointer > > > example) then why have a clear at all? > > > > Because if your software crashes, you don't have a handshake to reestablish > > state. If that happens you read the whole log until MORE is not set and > > then read it again to get a clean list. It's messy situation that has > > been discussed before for GET POISON LIST which has the same nasty handing > > of MORE. (look in appropriate forum for resolution to that one that we can't > > yet discuss here!) > > I can see the similarities but I think events are a more ephemeral item which > makes sense to clear once they are consumed. The idea that they should be left > for others to consume does not make sense to me. Where Poison is something > which could be a permanent marker which should be left in a list. Agreed - but sections use same wording for the More flag.. So we need to interpret the same. > > > > > Also, allows for non destructive readback (debugging tools might take a look > > having paused the normal handling). > > That is true. > > > > > > If my interpretation is correct then > > > the next available entry is the one which has not been cleared. > > > > If that is the case the language in "More Event Records" doesn't work > > "The host should continue to retrieve records using this command, until > > this indicator is no longer set by the device" > > > > With your reading of the spec, if we clear nothing, we'd keep getting the > > first set of records and only be able to read more by clearing them... > > > > Yea. > > > > > > Therefore in > > > your example 'next' is not incremented until clear has been called. I think > > > that implementation is also supported by the idea that records must be cleared > > > in temporal order. Otherwise I think devices would get confused. > > > > Not hard for device to do this (how I now read the spec) properly. > > > > Two pointers: > > 1) Next to clear: CLEAR > > 2) Next to read: READ > > > > Advance the the READ pointer on Get Event Records > > And loop back to the start on a further read... I'm looking at changing the > code for this but I think making it fully robust under a memory allocation > failure is going to be more tedious or we punt. If we get a memory allocation failure, perhaps we could do the follow horrible hack. 1 Allocate a small buffer. 2 Read once. 3 Hopefully we get the full record - in which case success. 4 Clear those records. 5 If not dealt with all records - read again until More Event Records not set (may already not be if it fitted in the buffer) 6 Go back to 2. If we think a valid implementation might reset the read pointer on clear then there is a variant where we make use of the fact the handles are constant - read 3 records, clear 2 and then use the handle of remaining one to identify if we have the next 3 to clear or not... > > > For CLEAR, check that the requested clears are handled in order and that > > they are before the READ pointer. > > > > Maybe we should just take it to appropriate spec forum to seek a clarification? > > Probably. I've not paid attention lately. > > I've sent a separate email with you cc'ed. Perhaps we can get some > clarification before I completely rework this. Fingers crossed. Thanks, Jonathan > > Ira > > > > > Jonathan > > > > > > > > FWIW the qemu implementation is based on my interpretation ATM. > > > > > > Ira > > > > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > > > > > > > + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); > > > > > > > +} > > > > > > > > > > > > > > > > > > > > >
On Tue, Nov 29, 2022 at 12:26:20PM +0000, Jonathan Cameron wrote: > On Mon, 28 Nov 2022 15:30:12 -0800 > Ira Weiny <ira.weiny@intel.com> wrote: > [snip] > > > A valid reading of that temporal order comment is actually the other way around > > > that the device must not reset it's idea of temporal order until all records > > > have been read (reading 3 twice is not in temporal order - imagine we had > > > read 5 each time and it becomes more obvious as the read order becomes > > > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal > > > reading of the term. > > > > Well I guess. My reading was that it must return the first element temporally > > within the list at the time of the Get operation. > > > > So in this example since 3 is still in the list it must return it first. Each > > read is considered atomic from the others. Yes as long as 0 is in the queue it > > will be returned. > > > > But I can see it your way too... > > That pesky text under More Event Records flag doesn't mention clearing when it > says "The host should continue to retrieve > records using this command, until this indicator is no longer set by the > device." > > I wish it did :( > As I have reviewed these in my head again I have come to the conclusion that the More Event Records flags is useless. Let me explain: The Clear all Records flag is useless because if an event which occurs between the Get and Clear all operation will be dropped without the host having seen it. However, while clearing records based on the handles read, additional events could come in. Because of the way the interrupts are specified the host can't be sure that those new events will cause a zero to non-zero transition. This is because there is no way to guarantee all the events were cleared at the moment the events came in. I believe this is what you mentioned in another email about needing an 'extra read' at the end to ensure there was nothing more to be read. But based on that logic the only thing that matters is the Get Event.Record Count. If it is not 0 keep on reading because while the host is clearing the records another event could come in. In other words, the only way to be sure that all records are seen is to do a Get and see the number of records equal to 0. Thus any further events will trigger an interrupt and we can safely exit the loop. Ira Basically the loop looks like: int nr_rec; do { ... <Get Events> ... nr_rec = le16_to_cpu(payload->record_count); ... <for each record trace> ... ... <for each record clear> ... } while (nr_rec);
On Tue, 29 Nov 2022 21:09:58 -0800 Ira Weiny <ira.weiny@intel.com> wrote: > On Tue, Nov 29, 2022 at 12:26:20PM +0000, Jonathan Cameron wrote: > > On Mon, 28 Nov 2022 15:30:12 -0800 > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > [snip] > > > > > A valid reading of that temporal order comment is actually the other way around > > > > that the device must not reset it's idea of temporal order until all records > > > > have been read (reading 3 twice is not in temporal order - imagine we had > > > > read 5 each time and it becomes more obvious as the read order becomes > > > > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal > > > > reading of the term. > > > > > > Well I guess. My reading was that it must return the first element temporally > > > within the list at the time of the Get operation. > > > > > > So in this example since 3 is still in the list it must return it first. Each > > > read is considered atomic from the others. Yes as long as 0 is in the queue it > > > will be returned. > > > > > > But I can see it your way too... > > > > That pesky text under More Event Records flag doesn't mention clearing when it > > says "The host should continue to retrieve > > records using this command, until this indicator is no longer set by the > > device." > > > > I wish it did :( > > > > As I have reviewed these in my head again I have come to the conclusion that > the More Event Records flags is useless. Let me explain: > > The Clear all Records flag is useless because if an event which occurs between the > Get and Clear all operation will be dropped without the host having seen it. Can still be used to get a known clean sheet if you don't care about a bunch of records on initial boot because no data in flight yet etc. Agreed it is no use if you care about content of the records. Make sure interrupts are enabled before re-checking if there are new records to close that race. > > However, while clearing records based on the handles read, additional events > could come in. Because of the way the interrupts are specified the host can't > be sure that those new events will cause a zero to non-zero transition. This > is because there is no way to guarantee all the events were cleared at the > moment the events came in. > > I believe this is what you mentioned in another email about needing an 'extra > read' at the end to ensure there was nothing more to be read. But based on > that logic the only thing that matters is the Get Event.Record > Count. If it is not 0 keep on reading because while the host is clearing the > records another event could come in. > > In other words, the only way to be sure that all records are seen is to do a > Get and see the number of records equal to 0. Thus any further events will > trigger an interrupt and we can safely exit the loop. Agreed - standard race to close when ever we have a FIFO with edge interrupts on how full it is. More records is useful for a different potential pattern of non destructive read and later clear. Or for a debug non destructive read. int nr_rec; <list> round_we_go: do { ... <for each record trace and add to list...> ... ... } while (!MORE); for_each_list_entry() { clear records one at a time. } nr_rec = le16_to_cpu(payload->record_count); if (nr_rec) goto round_we_go; ... > > Ira > > Basically the loop looks like: > > int nr_rec; > > do { > ... <Get Events> ... > > nr_rec = le16_to_cpu(payload->record_count); > > ... <for each record trace> ... > ... <for each record clear> ... > > } while (nr_rec); >
diff --git a/MAINTAINERS b/MAINTAINERS index ca063a504026..4b7c6e3055c6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5223,6 +5223,7 @@ M: Dan Williams <dan.j.williams@intel.com> L: linux-cxl@vger.kernel.org S: Maintained F: drivers/cxl/ +F: include/trace/events/cxl.h F: include/uapi/linux/cxl_mem.h CONEXANT ACCESSRUNNER USB DRIVER diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c index 16176b9278b4..a908b95a7de4 100644 --- a/drivers/cxl/core/mbox.c +++ b/drivers/cxl/core/mbox.c @@ -7,6 +7,9 @@ #include <cxlmem.h> #include <cxl.h> +#define CREATE_TRACE_POINTS +#include <trace/events/cxl.h> + #include "core.h" static bool cxl_raw_allow_all; @@ -48,6 +51,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { CXL_CMD(RAW, CXL_VARIABLE_PAYLOAD, CXL_VARIABLE_PAYLOAD, 0), #endif CXL_CMD(GET_SUPPORTED_LOGS, 0, CXL_VARIABLE_PAYLOAD, CXL_CMD_FLAG_FORCE_ENABLE), + CXL_CMD(GET_EVENT_RECORD, 1, CXL_VARIABLE_PAYLOAD, 0), CXL_CMD(GET_FW_INFO, 0, 0x50, 0), CXL_CMD(GET_PARTITION_INFO, 0, 0x20, 0), CXL_CMD(GET_LSA, 0x8, CXL_VARIABLE_PAYLOAD, 0), @@ -704,6 +708,72 @@ int cxl_enumerate_cmds(struct cxl_dev_state *cxlds) } EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds, + enum cxl_event_log_type type) +{ + struct cxl_get_event_payload payload; + u16 pl_nr; + + do { + u8 log_type = type; + int rc; + + rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD, + &log_type, sizeof(log_type), + &payload, sizeof(payload)); + if (rc) { + dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d", + cxl_event_log_type_str(type), rc); + return; + } + + pl_nr = le16_to_cpu(payload.record_count); + if (trace_cxl_generic_event_enabled()) { + u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS); + int i; + + for (i = 0; i < nr_rec; i++) + trace_cxl_generic_event(dev_name(cxlds->dev), + type, + &payload.record[i]); + } + + if (trace_cxl_overflow_enabled() && + (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW)) + trace_cxl_overflow(dev_name(cxlds->dev), type, &payload); + + } while (pl_nr > CXL_GET_EVENT_NR_RECORDS || + payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS); +} + +/** + * cxl_mem_get_event_records - Get Event Records from the device + * @cxlds: The device data for the operation + * + * Retrieve all event records available on the device and report them as trace + * events. + * + * See CXL rev 3.0 @8.2.9.2.2 Get Event Records + */ +void cxl_mem_get_event_records(struct cxl_dev_state *cxlds) +{ + u32 status = readl(cxlds->regs.status + CXLDEV_DEV_EVENT_STATUS_OFFSET); + + dev_dbg(cxlds->dev, "Reading event logs: %x\n", status); + + if (status & CXLDEV_EVENT_STATUS_INFO) + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_INFO); + if (status & CXLDEV_EVENT_STATUS_WARN) + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_WARN); + if (status & CXLDEV_EVENT_STATUS_FAIL) + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FAIL); + if (status & CXLDEV_EVENT_STATUS_FATAL) + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FATAL); + if (status & CXLDEV_EVENT_STATUS_DYNAMIC_CAP) + cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_DYNAMIC_CAP); +} +EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL); + /** * cxl_mem_get_partition_info - Get partition info * @cxlds: The device data for the operation diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index f680450f0b16..492cff1bea6d 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -132,6 +132,14 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw) #define CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX 0x3 #define CXLDEV_CAP_CAP_ID_MEMDEV 0x4000 +/* CXL 3.0 8.2.8.3.1 Event Status Register */ +#define CXLDEV_DEV_EVENT_STATUS_OFFSET 0x00 +#define CXLDEV_EVENT_STATUS_INFO BIT(0) +#define CXLDEV_EVENT_STATUS_WARN BIT(1) +#define CXLDEV_EVENT_STATUS_FAIL BIT(2) +#define CXLDEV_EVENT_STATUS_FATAL BIT(3) +#define CXLDEV_EVENT_STATUS_DYNAMIC_CAP BIT(4) + /* CXL 2.0 8.2.8.4 Mailbox Registers */ #define CXLDEV_MBOX_CAPS_OFFSET 0x00 #define CXLDEV_MBOX_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0) diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h index b7b955ded3ac..da64ba0f156b 100644 --- a/drivers/cxl/cxlmem.h +++ b/drivers/cxl/cxlmem.h @@ -4,6 +4,7 @@ #define __CXL_MEM_H__ #include <uapi/linux/cxl_mem.h> #include <linux/cdev.h> +#include <linux/uuid.h> #include "cxl.h" /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */ @@ -256,6 +257,7 @@ struct cxl_dev_state { enum cxl_opcode { CXL_MBOX_OP_INVALID = 0x0000, CXL_MBOX_OP_RAW = CXL_MBOX_OP_INVALID, + CXL_MBOX_OP_GET_EVENT_RECORD = 0x0100, CXL_MBOX_OP_GET_FW_INFO = 0x0200, CXL_MBOX_OP_ACTIVATE_FW = 0x0202, CXL_MBOX_OP_GET_SUPPORTED_LOGS = 0x0400, @@ -325,6 +327,76 @@ struct cxl_mbox_identify { u8 qos_telemetry_caps; } __packed; +/* + * Common Event Record Format + * CXL rev 3.0 section 8.2.9.2.1; Table 8-42 + */ +struct cxl_event_record_hdr { + uuid_t id; + u8 length; + u8 flags[3]; + __le16 handle; + __le16 related_handle; + __le64 timestamp; + u8 maint_op_class; + u8 reserved[0xf]; +} __packed; + +#define CXL_EVENT_RECORD_DATA_LENGTH 0x50 +struct cxl_event_record_raw { + struct cxl_event_record_hdr hdr; + u8 data[CXL_EVENT_RECORD_DATA_LENGTH]; +} __packed; + +/* + * Get Event Records output payload + * CXL rev 3.0 section 8.2.9.2.2; Table 8-50 + */ +#define CXL_GET_EVENT_FLAG_OVERFLOW BIT(0) +#define CXL_GET_EVENT_FLAG_MORE_RECORDS BIT(1) +#define CXL_GET_EVENT_NR_RECORDS 3 +struct cxl_get_event_payload { + u8 flags; + u8 reserved1; + __le16 overflow_err_count; + __le64 first_overflow_timestamp; + __le64 last_overflow_timestamp; + __le16 record_count; + u8 reserved2[0xa]; + struct cxl_event_record_raw record[CXL_GET_EVENT_NR_RECORDS]; +} __packed; + +/* + * CXL rev 3.0 section 8.2.9.2.2; Table 8-49 + */ +enum cxl_event_log_type { + CXL_EVENT_TYPE_INFO = 0x00, + CXL_EVENT_TYPE_WARN, + CXL_EVENT_TYPE_FAIL, + CXL_EVENT_TYPE_FATAL, + CXL_EVENT_TYPE_DYNAMIC_CAP, + CXL_EVENT_TYPE_MAX +}; + +static inline const char *cxl_event_log_type_str(enum cxl_event_log_type type) +{ + switch (type) { + case CXL_EVENT_TYPE_INFO: + return "Informational"; + case CXL_EVENT_TYPE_WARN: + return "Warning"; + case CXL_EVENT_TYPE_FAIL: + return "Failure"; + case CXL_EVENT_TYPE_FATAL: + return "Fatal"; + case CXL_EVENT_TYPE_DYNAMIC_CAP: + return "Dynamic Capacity"; + default: + break; + } + return "<unknown>"; +} + struct cxl_mbox_get_partition_info { __le64 active_volatile_cap; __le64 active_persistent_cap; @@ -384,6 +456,7 @@ int cxl_mem_create_range_info(struct cxl_dev_state *cxlds); struct cxl_dev_state *cxl_dev_state_create(struct device *dev); void set_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds); void clear_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds); +void cxl_mem_get_event_records(struct cxl_dev_state *cxlds); #ifdef CONFIG_CXL_SUSPEND void cxl_mem_active_inc(void); void cxl_mem_active_dec(void); diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h new file mode 100644 index 000000000000..60dec9a84918 --- /dev/null +++ b/include/trace/events/cxl.h @@ -0,0 +1,127 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM cxl + +#if !defined(_CXL_TRACE_EVENTS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _CXL_TRACE_EVENTS_H + +#include <asm-generic/unaligned.h> +#include <linux/tracepoint.h> +#include <cxlmem.h> + +TRACE_EVENT(cxl_overflow, + + TP_PROTO(const char *dev_name, enum cxl_event_log_type log, + struct cxl_get_event_payload *payload), + + TP_ARGS(dev_name, log, payload), + + TP_STRUCT__entry( + __string(dev_name, dev_name) + __field(int, log) + __field(u64, first_ts) + __field(u64, last_ts) + __field(u16, count) + ), + + TP_fast_assign( + __assign_str(dev_name, dev_name); + __entry->log = log; + __entry->count = le16_to_cpu(payload->overflow_err_count); + __entry->first_ts = le64_to_cpu(payload->first_overflow_timestamp); + __entry->last_ts = le64_to_cpu(payload->last_overflow_timestamp); + ), + + TP_printk("%s: EVENT LOG OVERFLOW log=%s : %u records from %llu to %llu", + __get_str(dev_name), cxl_event_log_type_str(__entry->log), + __entry->count, __entry->first_ts, __entry->last_ts) + +); + +/* + * Common Event Record Format + * CXL 3.0 section 8.2.9.2.1; Table 8-42 + */ +#define CXL_EVENT_RECORD_FLAG_PERMANENT BIT(2) +#define CXL_EVENT_RECORD_FLAG_MAINT_NEEDED BIT(3) +#define CXL_EVENT_RECORD_FLAG_PERF_DEGRADED BIT(4) +#define CXL_EVENT_RECORD_FLAG_HW_REPLACE BIT(5) +#define show_hdr_flags(flags) __print_flags(flags, " | ", \ + { CXL_EVENT_RECORD_FLAG_PERMANENT, "Permanent Condition" }, \ + { CXL_EVENT_RECORD_FLAG_MAINT_NEEDED, "Maintenance Needed" }, \ + { CXL_EVENT_RECORD_FLAG_PERF_DEGRADED, "Performance Degraded" }, \ + { CXL_EVENT_RECORD_FLAG_HW_REPLACE, "Hardware Replacement Needed" } \ +) + +/* + * Define macros for the common header of each CXL event. + * + * Tracepoints using these macros must do 3 things: + * + * 1) Add CXL_EVT_TP_entry to TP_STRUCT__entry + * 2) Use CXL_EVT_TP_fast_assign within TP_fast_assign; + * pass the dev_name, log, and CXL event header + * 3) Use CXL_EVT_TP_printk() instead of TP_printk() + * + * See the generic_event tracepoint as an example. + */ +#define CXL_EVT_TP_entry \ + __string(dev_name, dev_name) \ + __field(int, log) \ + __field_struct(uuid_t, hdr_uuid) \ + __field(u32, hdr_flags) \ + __field(u16, hdr_handle) \ + __field(u16, hdr_related_handle) \ + __field(u64, hdr_timestamp) \ + __field(u8, hdr_length) \ + __field(u8, hdr_maint_op_class) + +#define CXL_EVT_TP_fast_assign(dname, l, hdr) \ + __assign_str(dev_name, (dname)); \ + __entry->log = (l); \ + memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t)); \ + __entry->hdr_length = (hdr).length; \ + __entry->hdr_flags = get_unaligned_le24((hdr).flags); \ + __entry->hdr_handle = le16_to_cpu((hdr).handle); \ + __entry->hdr_related_handle = le16_to_cpu((hdr).related_handle); \ + __entry->hdr_timestamp = le64_to_cpu((hdr).timestamp); \ + __entry->hdr_maint_op_class = (hdr).maint_op_class + + +#define CXL_EVT_TP_printk(fmt, ...) \ + TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' " \ + "handle=%x related_handle=%x maint_op_class=%u" \ + " : " fmt, \ + __get_str(dev_name), cxl_event_log_type_str(__entry->log), \ + __entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\ + show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle, \ + __entry->hdr_related_handle, __entry->hdr_maint_op_class, \ + ##__VA_ARGS__) + +TRACE_EVENT(cxl_generic_event, + + TP_PROTO(const char *dev_name, enum cxl_event_log_type log, + struct cxl_event_record_raw *rec), + + TP_ARGS(dev_name, log, rec), + + TP_STRUCT__entry( + CXL_EVT_TP_entry + __array(u8, data, CXL_EVENT_RECORD_DATA_LENGTH) + ), + + TP_fast_assign( + CXL_EVT_TP_fast_assign(dev_name, log, rec->hdr); + memcpy(__entry->data, &rec->data, CXL_EVENT_RECORD_DATA_LENGTH); + ), + + CXL_EVT_TP_printk("%s", + __print_hex(__entry->data, CXL_EVENT_RECORD_DATA_LENGTH)) +); + +#endif /* _CXL_TRACE_EVENTS_H */ + +/* This part must be outside protection */ +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE cxl +#include <trace/define_trace.h> diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h index c71021a2a9ed..70459be5bdd4 100644 --- a/include/uapi/linux/cxl_mem.h +++ b/include/uapi/linux/cxl_mem.h @@ -24,6 +24,7 @@ ___C(IDENTIFY, "Identify Command"), \ ___C(RAW, "Raw device command"), \ ___C(GET_SUPPORTED_LOGS, "Get Supported Logs"), \ + ___C(GET_EVENT_RECORD, "Get Event Record"), \ ___C(GET_FW_INFO, "Get FW Info"), \ ___C(GET_PARTITION_INFO, "Get Partition Information"), \ ___C(GET_LSA, "Get Label Storage Area"), \