From patchwork Tue Oct 25 12:39:29 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "yekai (A)"
X-Patchwork-Id: 10760
From: Kai Ye
Subject: [PATCH v9 1/3] uacce: supports device isolation feature
Date: Tue, 25 Oct 2022 12:39:29 +0000
Message-ID: <20221025123931.42161-2-yekai13@huawei.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20221025123931.42161-1-yekai13@huawei.com>
References: <20221025123931.42161-1-yekai13@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Add a hardware error isolation API to UACCE. Users configure the
isolation threshold through the isolate_strategy sysfs node, and UACCE
reports the device isolation state to user space through the isolate
node. If the number of AER errors within one hour exceeds the configured
threshold, the device is isolated.
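To make the one-hour policy easier to reason about, here is a minimal user-space sketch of the same counting rule. It assumes the behaviour shown in uacce_hw_err_isolate(): entries older than one hour are dropped, and the device is isolated once the number of earlier errors still inside the window reaches the threshold. The names err_window, err_window_record and MAX_TRACKED are hypothetical, timestamps are plain seconds rather than jiffies, and the fixed-size array is a simplification, so thresholds above MAX_TRACKED can never trip in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SECONDS 3600u	/* one hour, as in the patch */
#define MAX_TRACKED    64u	/* hypothetical cap for this sketch */

/* Hypothetical user-space stand-in for struct uacce_err_isolate. */
struct err_window {
	uint64_t stamps[MAX_TRACKED];	/* timestamps (seconds) of recent errors */
	size_t   n;			/* number of valid entries in stamps[] */
	uint32_t threshold;		/* 0 disables the check */
	bool     isolated;
};

/*
 * Record one error at time `now` (seconds, assumed monotonic). Entries
 * older than the window are dropped; the device is marked isolated once
 * the number of earlier errors still inside the window reaches the
 * threshold. Returns the isolation state after recording.
 */
static bool err_window_record(struct err_window *w, uint64_t now)
{
	size_t kept = 0;

	assert(w != NULL);
	if (w->isolated || w->threshold == 0)
		return w->isolated;

	/* Compact the array in place, keeping only entries inside the window. */
	for (size_t i = 0; i < w->n; i++) {
		if (now - w->stamps[i] <= WINDOW_SECONDS)
			w->stamps[kept++] = w->stamps[i];
	}
	w->n = kept;

	if (kept >= w->threshold)
		w->isolated = true;

	/* Store the new error if there is room; the sketch drops it otherwise. */
	if (w->n < MAX_TRACKED)
		w->stamps[w->n++] = now;

	return w->isolated;
}
```

With a threshold of 2, the third error inside one hour isolates the device, while errors spaced more than an hour apart never do.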
Signed-off-by: Kai Ye
---
 drivers/misc/uacce/uacce.c | 145 +++++++++++++++++++++++++++++++++++++
 include/linux/uacce.h      |  43 ++++++++++-
 2 files changed, 187 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index b70a013139c7..f293fcdcf44f 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -7,10 +7,100 @@
 #include
 #include

+#define MAX_ERR_ISOLATE_COUNT 65535
+
 static struct class *uacce_class;
 static dev_t uacce_devt;
 static DEFINE_XARRAY_ALLOC(uacce_xa);

+static int cdev_get(struct device *dev, void *data)
+{
+	struct uacce_device *uacce;
+	struct device **t_dev = data;
+
+	uacce = container_of(dev, struct uacce_device, dev);
+	if (uacce->parent == *t_dev) {
+		*t_dev = dev;
+		return 1;
+	}
+
+	return 0;
+}
+
+/**
+ * dev_to_uacce - Get structure uacce device from its parent device
+ * @dev: the device
+ */
+struct uacce_device *dev_to_uacce(struct device *dev)
+{
+	struct device **tdev = &dev;
+	int ret;
+
+	ret = class_for_each_device(uacce_class, NULL, tdev, cdev_get);
+	if (ret) {
+		dev = *tdev;
+		return container_of(dev, struct uacce_device, dev);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(dev_to_uacce);
+
+/**
+ * uacce_hw_err_isolate - Try to set the isolation status of the uacce device
+ * according to user's configuration of isolation strategy.
+ * @uacce: the uacce device
+ */
+int uacce_hw_err_isolate(struct uacce_device *uacce)
+{
+	struct uacce_hw_err *err, *tmp, *hw_err;
+	struct uacce_err_isolate *isolate_ctx;
+	u32 count = 0;
+
+	if (!uacce)
+		return -EINVAL;
+
+	isolate_ctx = uacce->isolate_ctx;
+
+#define SECONDS_PER_HOUR	3600
+
+	/* All the hw errs are processed by PF driver */
+	if (uacce->is_vf || isolate_ctx->is_isolate ||
+	    !isolate_ctx->hw_err_isolate_hz)
+		return 0;
+
+	hw_err = kzalloc(sizeof(*hw_err), GFP_KERNEL);
+	if (!hw_err)
+		return -ENOMEM;
+
+	hw_err->timestamp = jiffies;
+	list_for_each_entry_safe(err, tmp, &isolate_ctx->hw_errs, list) {
+		if ((hw_err->timestamp - err->timestamp) / HZ >
+		    SECONDS_PER_HOUR) {
+			list_del(&err->list);
+			kfree(err);
+		} else {
+			count++;
+		}
+	}
+	list_add(&hw_err->list, &isolate_ctx->hw_errs);
+
+	if (count >= isolate_ctx->hw_err_isolate_hz)
+		isolate_ctx->is_isolate = true;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(uacce_hw_err_isolate);
+
+static void uacce_hw_err_destroy(struct uacce_device *uacce)
+{
+	struct uacce_hw_err *err, *tmp;
+
+	list_for_each_entry_safe(err, tmp, &uacce->isolate_data.hw_errs, list) {
+		list_del(&err->list);
+		kfree(err);
+	}
+}
+
 /*
  * If the parent driver or the device disappears, the queue state is invalid and
  * ops are not usable anymore.
@@ -363,12 +453,59 @@ static ssize_t region_dus_size_show(struct device *dev,
 		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
 }

+static ssize_t isolate_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+	int ret = UACCE_DEV_NORMAL;
+
+	if (uacce->isolate_ctx->is_isolate)
+		ret = UACCE_DEV_ISOLATE;
+
+	return sysfs_emit(buf, "%d\n", ret);
+}
+
+static ssize_t isolate_strategy_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+
+	return sysfs_emit(buf, "%u\n", uacce->isolate_ctx->hw_err_isolate_hz);
+}
+
+static ssize_t isolate_strategy_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+	unsigned long val;
+
+	/* must be set by PF */
+	if (uacce->is_vf)
+		return -EPERM;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val > MAX_ERR_ISOLATE_COUNT)
+		return -EINVAL;
+
+	uacce->isolate_ctx->hw_err_isolate_hz = val;
+
+	/* After the policy is updated, need to reset the hardware err list */
+	uacce_hw_err_destroy(uacce);
+
+	return count;
+}
+
 static DEVICE_ATTR_RO(api);
 static DEVICE_ATTR_RO(flags);
 static DEVICE_ATTR_RO(available_instances);
 static DEVICE_ATTR_RO(algorithms);
 static DEVICE_ATTR_RO(region_mmio_size);
 static DEVICE_ATTR_RO(region_dus_size);
+static DEVICE_ATTR_RO(isolate);
+static DEVICE_ATTR_RW(isolate_strategy);

 static struct attribute *uacce_dev_attrs[] = {
 	&dev_attr_api.attr,
@@ -377,6 +514,8 @@ static struct attribute *uacce_dev_attrs[] = {
 	&dev_attr_algorithms.attr,
 	&dev_attr_region_mmio_size.attr,
 	&dev_attr_region_dus_size.attr,
+	&dev_attr_isolate.attr,
+	&dev_attr_isolate_strategy.attr,
 	NULL,
 };

@@ -392,6 +531,9 @@ static umode_t uacce_dev_is_visible(struct kobject *kobj,
 	    (!uacce->qf_pg_num[UACCE_QFRT_DUS])))
 		return 0;

+	if (attr == &dev_attr_isolate_strategy.attr && !uacce->isolate_ctx)
+		return 0;
+
 	return attr->mode;
 }

@@ -474,6 +616,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
 		goto err_with_uacce;

 	INIT_LIST_HEAD(&uacce->queues);
+	INIT_LIST_HEAD(&uacce->isolate_data.hw_errs);
 	mutex_init(&uacce->mutex);
 	device_initialize(&uacce->dev);
 	uacce->dev.devt = MKDEV(MAJOR(uacce_devt), uacce->dev_id);
@@ -555,6 +698,8 @@ void uacce_remove(struct uacce_device *uacce)
 	if (uacce->cdev)
 		cdev_device_del(uacce->cdev, &uacce->dev);
 	xa_erase(&uacce_xa, uacce->dev_id);
+
+	uacce_hw_err_destroy(uacce);
 	/*
 	 * uacce exists as long as there are open fds, but ops will be freed
 	 * now. Ensure that bugs cause NULL deref rather than use-after-free.
diff --git a/include/linux/uacce.h b/include/linux/uacce.h
index 9ce88c28b0a8..c8eecaf7b16d 100644
--- a/include/linux/uacce.h
+++ b/include/linux/uacce.h
@@ -12,6 +12,28 @@
 struct uacce_queue;
 struct uacce_device;

+/**
+ * struct uacce_hw_err - Structure describing the device errors
+ * @list: hardware error log node
+ * @timestamp: timestamp when the error occurred
+ */
+struct uacce_hw_err {
+	struct list_head list;
+	unsigned long long timestamp;
+};
+
+/**
+ * struct uacce_err_isolate - Structure describing the isolation data
+ * @hw_err_isolate_hz: user cfg freq which triggers isolation
+ * @is_isolate: device isolate state
+ * @hw_errs: uacce hardware error list
+ */
+struct uacce_err_isolate {
+	u32 hw_err_isolate_hz;
+	bool is_isolate;
+	struct list_head hw_errs;
+};
+
 /**
  * struct uacce_qfile_region - structure of queue file region
  * @type: type of the region
@@ -57,6 +79,11 @@ struct uacce_interface {
 	const struct uacce_ops *ops;
 };

+enum uacce_dev_state {
+	UACCE_DEV_NORMAL,
+	UACCE_DEV_ISOLATE,
+};
+
 enum uacce_q_state {
 	UACCE_Q_ZOMBIE = 0,
 	UACCE_Q_INIT,
@@ -101,6 +128,8 @@ struct uacce_queue {
  * @dev: dev of the uacce
  * @mutex: protects uacce operation
  * @priv: private pointer of the uacce
+ * @isolate_data: device isolation data about pf and vf device
+ * @isolate_ctx: isolation ctx about current char device
  * @queues: list of queues
  * @inode: core vfs
  */
@@ -117,6 +146,8 @@ struct uacce_device {
 	struct device dev;
 	struct mutex mutex;
 	void *priv;
+	struct uacce_err_isolate isolate_data;
+	struct uacce_err_isolate *isolate_ctx;
 	struct list_head queues;
 	struct inode *inode;
 };
@@ -127,7 +158,8 @@ struct uacce_device *uacce_alloc(struct device *parent,
 				 struct uacce_interface *interface);
 int uacce_register(struct uacce_device *uacce);
 void uacce_remove(struct uacce_device *uacce);
-
+struct uacce_device *dev_to_uacce(struct device *dev);
+int uacce_hw_err_isolate(struct uacce_device *uacce);
 #else /* CONFIG_UACCE */

 static inline
@@ -144,6 +176,15 @@ static inline int uacce_register(struct uacce_device *uacce)

 static inline void uacce_remove(struct uacce_device *uacce) {}

+static inline struct uacce_device *dev_to_uacce(struct device *dev)
+{
+	return NULL;
+}
+
+static inline int uacce_hw_err_isolate(struct uacce_device *uacce)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_UACCE */

 #endif /* _LINUX_UACCE_H */

From patchwork Tue Oct 25 12:39:30 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "yekai (A)"
X-Patchwork-Id: 10753
From: Kai Ye
Subject: [PATCH v9 2/3] Documentation: add a isolation strategy sysfs node for uacce
Date: Tue, 25 Oct 2022 12:39:30 +0000
Message-ID: <20221025123931.42161-3-yekai13@huawei.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20221025123931.42161-1-yekai13@huawei.com>
References: <20221025123931.42161-1-yekai13@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Document the sysfs node that lets user space configure the isolation
strategy, and the sysfs node that reports the device isolation state.
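A user-space tool may want to validate a threshold before writing it to the isolate_strategy node. The sketch below assumes the limits visible in isolate_strategy_store() from patch 1/3 (an unsigned number, base auto-detected, at most 65535). The helper name parse_isolate_threshold is hypothetical, and the parsing uses the C library's strtoul() rather than the kernel's kstrtoul(), so corner cases may differ slightly.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_ERR_ISOLATE_COUNT 65535UL	/* same limit as in patch 1/3 */

/*
 * Parse a candidate isolate_strategy value. Returns 0 and stores the
 * result in *out on success, or -EINVAL if the text is not a number in
 * [0, MAX_ERR_ISOLATE_COUNT]. A single trailing newline is accepted, as
 * produced by echoing a value into a sysfs file.
 */
static int parse_isolate_threshold(const char *text, unsigned long *out)
{
	char *end;
	unsigned long val;

	assert(text != NULL && out != NULL);
	if (*text < '0' || *text > '9')
		return -EINVAL;	/* reject empty input, signs and whitespace */

	errno = 0;
	val = strtoul(text, &end, 0);	/* base 0: decimal, octal or hex */
	if (errno != 0 || end == text)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (val > MAX_ERR_ISOLATE_COUNT)
		return -EINVAL;

	*out = val;
	return 0;
}
```

Checking the value up front lets a tool report a clear error instead of relying on the -EINVAL returned by the write.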
Signed-off-by: Kai Ye
---
 Documentation/ABI/testing/sysfs-driver-uacce | 27 ++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
index 08f2591138af..50737c897ba3 100644
--- a/Documentation/ABI/testing/sysfs-driver-uacce
+++ b/Documentation/ABI/testing/sysfs-driver-uacce
@@ -19,6 +19,33 @@ Contact:        linux-accelerators@lists.ozlabs.org
 Description:    Available instances left of the device
                 Return -ENODEV if uacce_ops get_available_instances is not provided

+What:           /sys/class/uacce//isolate_strategy
+Date:           Oct 2022
+KernelVersion:  6.1
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    (RW) Configure the threshold for the hardware error
+                isolation strategy. The value is a count: the number of
+                errors tolerated within a time window. If the number of
+                PCI AER errors on the device exceeds the threshold within
+                the time window, the device is isolated. The default is 0.
+                The maximum value is 65535.
+
+                In the HiSilicon accelerator engine, every slot AER error
+                is time-stamped. The AER error log is checked each time a
+                device AER error occurs. If the slot AER error count of the
+                device exceeds the configured threshold within one hour,
+                the isolated state is set to true and the device is
+                isolated. AER error log entries older than one hour are
+                cleared.
+
+What:           /sys/class/uacce//isolate
+Date:           Oct 2022
+KernelVersion:  6.1
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    (R) Read the device isolation state. The value 1 means
+                the device is isolated and unavailable. The value 0 means
+                the device is available.
+
 What:           /sys/class/uacce//algorithms
 Date:           Feb 2020
 KernelVersion:  5.7

From patchwork Tue Oct 25 12:39:31 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "yekai (A)"
X-Patchwork-Id: 10754
From: Kai Ye
Subject: [PATCH v9 3/3] crypto: hisilicon/qm - add the device isolation feature for acc
Date: Tue, 25 Oct 2022 12:39:31 +0000
Message-ID: <20221025123931.42161-4-yekai13@huawei.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20221025123931.42161-1-yekai13@huawei.com>
References: <20221025123931.42161-1-yekai13@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Record every AER error through the UACCE API, and isolate the device
directly when the controller reset fails. A VF device uses the isolation
strategy of its PF device: once the PF device is isolated, its VF
devices are isolated as well.
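The PF/VF relationship can be illustrated with a small user-space sketch. It assumes the arrangement used by qm_uacce_isolate_init() in this patch: a PF points its context at its own isolation data, while a VF points at the data of its PF, so the state is shared. The types and functions here (isolate_state, acc_dev, acc_isolate_init, acc_isolate_show) are hypothetical, simplified stand-ins, not the kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures in patch 1/3. */
struct isolate_state {
	bool is_isolate;
};

struct acc_dev {
	bool is_vf;
	struct isolate_state  isolate_data;	/* owned by this device */
	struct isolate_state *isolate_ctx;	/* state this device reports */
};

/* Point a device at its own state (PF) or at the state of its PF (VF). */
static int acc_isolate_init(struct acc_dev *dev, struct acc_dev *pf)
{
	assert(dev != NULL);
	if (!dev->is_vf) {
		dev->isolate_ctx = &dev->isolate_data;
		return 0;
	}
	if (pf == NULL)
		return -1;	/* the patch returns -ENODEV in this case */
	dev->isolate_ctx = &pf->isolate_data;
	return 0;
}

/* What the isolate attribute would report: 1 if isolated, 0 otherwise. */
static int acc_isolate_show(const struct acc_dev *dev)
{
	return dev->isolate_ctx->is_isolate ? 1 : 0;
}
```

Because the VF only holds a pointer, marking the PF as isolated is immediately visible through every VF without any extra bookkeeping.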
Signed-off-by: Kai Ye
---
 drivers/crypto/hisilicon/qm.c | 66 ++++++++++++++++++++++++++++++-----
 1 file changed, 57 insertions(+), 9 deletions(-)

diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
index 363a02810a16..aa953ce86f70 100644
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -3397,6 +3397,29 @@ static void qm_set_sqctype(struct uacce_queue *q, u16 type)
 	up_write(&qm->qps_lock);
 }

+static int qm_uacce_isolate_init(struct hisi_qm *qm)
+{
+	struct pci_dev *pdev = qm->pdev;
+	struct uacce_device *pf_uacce, *uacce;
+	struct device *pf_dev = &(pci_physfn(pdev)->dev);
+
+	uacce = qm->uacce;
+	if (uacce->is_vf) {
+		/* VF uses PF's isolate data */
+		pf_uacce = dev_to_uacce(pf_dev);
+		if (!pf_uacce) {
+			pci_err(pdev, "failed to get PF device!\n");
+			return -ENODEV;
+		}
+
+		uacce->isolate_ctx = &pf_uacce->isolate_data;
+	} else {
+		uacce->isolate_ctx = &uacce->isolate_data;
+	}
+
+	return 0;
+}
+
 static long hisi_qm_uacce_ioctl(struct uacce_queue *q, unsigned int cmd,
 				unsigned long arg)
 {
@@ -3450,6 +3473,14 @@ static const struct uacce_ops uacce_qm_ops = {
 	.is_q_updated = hisi_qm_is_q_updated,
 };

+static void qm_remove_uacce(struct hisi_qm *qm)
+{
+	if (qm->use_sva) {
+		uacce_remove(qm->uacce);
+		qm->uacce = NULL;
+	}
+}
+
 static int qm_alloc_uacce(struct hisi_qm *qm)
 {
 	struct pci_dev *pdev = qm->pdev;
@@ -3511,7 +3542,14 @@ static int qm_alloc_uacce(struct hisi_qm *qm)

 	qm->uacce = uacce;

+	ret = qm_uacce_isolate_init(qm);
+	if (ret)
+		goto err_rm_uacce;
+
 	return 0;
+err_rm_uacce:
+	qm_remove_uacce(qm);
+	return ret;
 }

 /**
@@ -5133,6 +5171,12 @@ static int qm_controller_reset_prepare(struct hisi_qm *qm)
 		return ret;
 	}

+	if (qm->use_sva) {
+		ret = uacce_hw_err_isolate(qm->uacce);
+		if (ret)
+			pci_err(pdev, "failed to isolate hw err!\n");
+	}
+
 	ret = qm_wait_vf_prepare_finish(qm);
 	if (ret)
 		pci_err(pdev, "failed to stop by vfs in soft reset!\n");
@@ -5458,21 +5502,25 @@ static int qm_controller_reset(struct hisi_qm *qm)
 		qm->err_ini->show_last_dfx_regs(qm);

 	ret = qm_soft_reset(qm);
-	if (ret) {
-		pci_err(pdev, "Controller reset failed (%d)\n", ret);
-		qm_reset_bit_clear(qm);
-		return ret;
-	}
+	if (ret)
+		goto err_reset;

 	ret = qm_controller_reset_done(qm);
-	if (ret) {
-		qm_reset_bit_clear(qm);
-		return ret;
-	}
+	if (ret)
+		goto err_reset;

 	pci_info(pdev, "Controller reset complete\n");

 	return 0;
+err_reset:
+	pci_err(pdev, "Controller reset failed (%d)\n", ret);
+	qm_reset_bit_clear(qm);
+
+	/* if resetting fails, isolate the device */
+	if (qm->use_sva && !qm->uacce->is_vf)
+		qm->uacce->isolate_ctx->is_isolate = true;
+
+	return ret;
 }

 /**