Message ID | 20230424014324.218531-1-andrealmeid@igalia.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp2465761vqo; Sun, 23 Apr 2023 19:30:37 -0700 (PDT) X-Google-Smtp-Source: AKy350bePGvhQ31tQk4qWKCPIEIV5Kll7UO9u7yBEO4Cdet1S5JlVZB6gwktEF/pwKgy5x3gtJG4 X-Received: by 2002:a17:90b:1d08:b0:23f:d487:1bc8 with SMTP id on8-20020a17090b1d0800b0023fd4871bc8mr19562172pjb.13.1682303437352; Sun, 23 Apr 2023 19:30:37 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682303437; cv=none; d=google.com; s=arc-20160816; b=qW4yFVB66WqkRcZdPlHi0yzU1bt8fuTtMVhQzOPDLql8yx3mYKD3nwe3jS3pQa46sL zps13YDAUw4mCHEr+XT50WoGNIJ/SRcPcCmCM2O1SjUul+M+a9WfWxYS/8+rSqWYRICB zuAhbKwSyWo11HlTOMTrP5QPt/wnprWEsmgje5Vsv4/+nzCbojPd4mp64HCAAQTtRSb2 jPSgY/uwZR57o+HNihDLBW5HsZBMUed7Ot416z3PYkYxP6siDDGuk1Cyyg1imp8XxMwi W+uEB4MwKIFi8LknlEpIhS/xseyfqt/FtVmetNc+KDduUVwRK7oKAjKQWuuPd2b+LEnb WImg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=0Sf6LJiinXD0+Z04+CoBCjhP6xRPg9WnwCH4CvJn44g=; b=Q/5OOLUhclcx/R5oNxA+NWp7L5gyt6gREcm+98ALCcp9ehNyIcu69VW6owzIdpROq3 qVIOxZRGQXLHyvaoo4/xpl6JGTxQqDbYZ/v96+jdXCV7Hl1jLEeYGWjlV9Wsya92P83K D/CbL5WmW/V3JqNK0UJ7MTzjrLL63FuA+ooFJPgYbpEfXwdq+YTSRtDhptPliotkhAkm aQmQ38b46fh0NrKMLAQhV4RhCqVFbWtVd4ZstUV3ro6dRUJHeMUAaymY/sGe8Q3EKLQj rsojF6r7FXNvdpozuzp5aCcR7lB1zr8CdPJplC2ez+DCxZjoge070YZfh4jRC72o+1dG t92g== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@igalia.com header.s=20170329 header.b=JIcVwiky; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id pf14-20020a17090b1d8e00b0023f14c9a5fesi11109699pjb.100.2023.04.23.19.30.23; Sun, 23 Apr 2023 19:30:37 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=fail header.i=@igalia.com header.s=20170329 header.b=JIcVwiky; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230385AbjDXBoT (ORCPT <rfc822;fengqi706@gmail.com> + 99 others); Sun, 23 Apr 2023 21:44:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38176 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230369AbjDXBoS (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sun, 23 Apr 2023 21:44:18 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BD41E10CC for <linux-kernel@vger.kernel.org>; Sun, 23 Apr 2023 18:44:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:Message-Id: Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: In-Reply-To:References:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=0Sf6LJiinXD0+Z04+CoBCjhP6xRPg9WnwCH4CvJn44g=; b=JIcVwiky7ofsROMjxVeNT+AQEL gSU2A0P4XNRrWSyZwXgZi/+FOiKwNjUmXLgtMSC3hz+vOL4g/EYOnY170TdlZqVgSFfImBxA5+sFo iPTnAFTQhlifb5rJHjqGpyfhf49jz7pt0yN/UHMkYNQhAgwyslyIB6BdsBlZIenn4GiL/Fv98NYDA gDjFSgCcyQTeR6ls08+hXDWYDf3aGeA9zCejqkMaiEnzaEcH089aZc9Mdl5B++z2groTyLjc1dsvd uEwnjUDOTomCjEfvzlCCPNtq5LZI0vv+SN0/hE1Ll2yuZ85HzJY2PD2SVyNZWUA3DjgMnj0HnB+It xX5CGfEQ==; Received: from [152.249.146.45] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1pqlFV-00AeE7-4w; Mon, 24 Apr 2023 03:44:13 +0200 From: =?utf-8?q?Andr=C3=A9_Almeida?= <andrealmeid@igalia.com> To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?utf-8?b?J01hcmVrIE9sxaHDoWsn?= <maraeo@gmail.com>, =?utf-8?q?Andr=C3=A9_A?= =?utf-8?q?lmeida?= <andrealmeid@igalia.com> Subject: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type Date: Sun, 23 Apr 2023 22:43:24 -0300 Message-Id: <20230424014324.218531-1-andrealmeid@igalia.com> X-Mailer: git-send-email 2.40.0 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764023009232767589?= X-GMAIL-MSGID: =?utf-8?q?1764023009232767589?= |
Series |
drm/amdgpu: Mark contexts guilty for any reset type
|
|
Commit Message
André Almeida
April 24, 2023, 1:43 a.m. UTC
When a DRM job timeout, the GPU is probably hang and amdgpu have some
ways to deal with that, ranging from soft recoveries to full device
reset. Anyway, when userspace ask the kernel the state of the context
(via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
reset, regardless if a full reset happened or not.
However, amdgpu only marks a context guilty in the ASIC reset path. This
makes the userspace report incomplete, given that on soft recovery path
the guilty context is not told that it's the guilty one.
Fix this by marking the context guilty for every type of reset when a
job timeouts.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 +++++++-
2 files changed, 7 insertions(+), 4 deletions(-)
Comments
Hi André,
kernel test robot noticed the following build warnings:
[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm/drm-next drm-exynos/exynos-drm-next drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.3 next-20230421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
base: git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link: https://lore.kernel.org/r/20230424014324.218531-1-andrealmeid%40igalia.com
patch subject: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
config: s390-allyesconfig (https://download.01.org/0day-ci/archive/20230424/202304241259.Qq0Dmlud-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
git checkout ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/gpu/drm/amd/amdgpu/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304241259.Qq0Dmlud-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_pre_asic_reset':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4738:28: warning: variable 'job' set but not used [-Wunused-but-set-variable]
4738 | struct amdgpu_job *job = NULL;
| ^~~
vim +/job +4738 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
5740682e66cef5 Monk Liu 2017-10-25 4733
e3c1b0712fdb03 shaoyunl 2021-02-16 4734 int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
04442bf70debb1 Lijo Lazar 2021-03-16 4735 struct amdgpu_reset_context *reset_context)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4736 {
5c1e6fa49e8d8d Huang Rui 2021-12-16 4737 int i, r = 0;
04442bf70debb1 Lijo Lazar 2021-03-16 @4738 struct amdgpu_job *job = NULL;
04442bf70debb1 Lijo Lazar 2021-03-16 4739 bool need_full_reset =
04442bf70debb1 Lijo Lazar 2021-03-16 4740 test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar 2021-03-16 4741
04442bf70debb1 Lijo Lazar 2021-03-16 4742 if (reset_context->reset_req_dev == adev)
04442bf70debb1 Lijo Lazar 2021-03-16 4743 job = reset_context->job;
711826656bebb0 Monk Liu 2017-12-25 4744
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4745 if (amdgpu_sriov_vf(adev)) {
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4746 /* stop the data exchange thread */
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4747 amdgpu_virt_fini_data_exchange(adev);
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4748 }
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4749
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4750 amdgpu_fence_driver_isr_toggle(adev, true);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4751
711826656bebb0 Monk Liu 2017-12-25 4752 /* block all schedulers and reset given job's ring */
0875dc9e80eb3b Chunming Zhou 2016-06-12 4753 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
0875dc9e80eb3b Chunming Zhou 2016-06-12 4754 struct amdgpu_ring *ring = adev->rings[i];
0875dc9e80eb3b Chunming Zhou 2016-06-12 4755
51687759be93fb Chunming Zhou 2017-04-24 4756 if (!ring || !ring->sched.thread)
0875dc9e80eb3b Chunming Zhou 2016-06-12 4757 continue;
5740682e66cef5 Monk Liu 2017-10-25 4758
c530b02f39850a Jack Zhang 2021-05-12 4759 /*clear job fence from fence drv to avoid force_completion
c530b02f39850a Jack Zhang 2021-05-12 4760 *leave NULL and vm flush fence in fence drv */
5c1e6fa49e8d8d Huang Rui 2021-12-16 4761 amdgpu_fence_driver_clear_job_fences(ring);
c530b02f39850a Jack Zhang 2021-05-12 4762
2200edac745a65 Chunming Zhou 2016-06-30 4763 /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
2f9d4084cac96a Monk Liu 2017-10-16 4764 amdgpu_fence_driver_force_completion(ring);
2f9d4084cac96a Monk Liu 2017-10-16 4765 }
d38ceaf99ed015 Alex Deucher 2015-04-20 4766
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4767 amdgpu_fence_driver_isr_toggle(adev, false);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4768
04442bf70debb1 Lijo Lazar 2021-03-16 4769 r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
404b277bbe4945 Lijo Lazar 2021-03-26 4770 /* If reset handler not implemented, continue; otherwise return */
404b277bbe4945 Lijo Lazar 2021-03-26 4771 if (r == -ENOSYS)
404b277bbe4945 Lijo Lazar 2021-03-26 4772 r = 0;
404b277bbe4945 Lijo Lazar 2021-03-26 4773 else
04442bf70debb1 Lijo Lazar 2021-03-16 4774 return r;
04442bf70debb1 Lijo Lazar 2021-03-16 4775
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4776 /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4777 if (!amdgpu_sriov_vf(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4778
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4779 if (!need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4780 need_full_reset = amdgpu_device_ip_need_full_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4781
360cd08196cabc Likun Gao 2022-12-21 4782 if (!need_full_reset && amdgpu_gpu_recovery &&
360cd08196cabc Likun Gao 2022-12-21 4783 amdgpu_device_ip_check_soft_reset(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4784 amdgpu_device_ip_pre_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4785 r = amdgpu_device_ip_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4786 amdgpu_device_ip_post_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4787 if (r || amdgpu_device_ip_check_soft_reset(adev)) {
aac891685da661 Dennis Li 2020-08-20 4788 dev_info(adev->dev, "soft reset failed, will fallback to full reset!\n");
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4789 need_full_reset = true;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4790 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4791 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4792
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4793 if (need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4794 r = amdgpu_device_ip_suspend(adev);
04442bf70debb1 Lijo Lazar 2021-03-16 4795 if (need_full_reset)
04442bf70debb1 Lijo Lazar 2021-03-16 4796 set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar 2021-03-16 4797 else
04442bf70debb1 Lijo Lazar 2021-03-16 4798 clear_bit(AMDGPU_NEED_FULL_RESET,
04442bf70debb1 Lijo Lazar 2021-03-16 4799 &reset_context->flags);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4800 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4801
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4802 return r;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4803 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4804
Am 24.04.23 um 03:43 schrieb André Almeida: > When a DRM job timeout, the GPU is probably hang and amdgpu have some > ways to deal with that, ranging from soft recoveries to full device > reset. Anyway, when userspace ask the kernel the state of the context > (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was > reset, regardless if a full reset happened or not. > > However, amdgpu only marks a context guilty in the ASIC reset path. This > makes the userspace report incomplete, given that on soft recovery path > the guilty context is not told that it's the guilty one. > > Fix this by marking the context guilty for every type of reset when a > job timeouts. The guilty handling is pretty much broken by design and only works because we go through multiple hops of validating the entity after the job has already been pushed to the hw. I think we should probably just remove that completely and use an approach where we check the in flight submissions in the query state IOCTL. See my other patch on the mailing list regarding that. Additional to that I currently didn't considered soft-recovered submissions as fatal and continue accepting submissions from that context, but already wanted to talk with Marek about that behavior. Regards, Christian. > > Signed-off-by: André Almeida <andrealmeid@igalia.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 --- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 +++++++- > 2 files changed, 7 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index ac78caa7cba8..ea169d1689e2 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, > > amdgpu_fence_driver_isr_toggle(adev, false); > > - if (job && job->vm) > - drm_sched_increase_karma(&job->base); > - > r = amdgpu_reset_prepare_hwcontext(adev, reset_context); > /* If reset handler not implemented, continue; otherwise return */ > if (r == -ENOSYS) > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > index c3d9d75143f4..097ed8f06865 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > @@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) > memset(&ti, 0, sizeof(struct amdgpu_task_info)); > adev->job_hang = true; > > + amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); > + > + if (job && job->vm) { > + DRM_INFO("marking %s context as guilty", ti.process_name); > + drm_sched_increase_karma(&job->base); > + } > + > if (amdgpu_gpu_recovery && > amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { > DRM_ERROR("ring %s timeout, but soft recovered\n", > @@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) > goto exit; > } > > - amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); > DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", > job->base.sched->name, atomic_read(&ring->fence_drv.last_seq), > ring->fence_drv.sync_seq);
Hi Christian, thank you for your comments. Em 24/04/2023 04:03, Christian König escreveu: > Am 24.04.23 um 03:43 schrieb André Almeida: >> When a DRM job timeout, the GPU is probably hang and amdgpu have some >> ways to deal with that, ranging from soft recoveries to full device >> reset. Anyway, when userspace ask the kernel the state of the context >> (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was >> reset, regardless if a full reset happened or not. >> >> However, amdgpu only marks a context guilty in the ASIC reset path. This >> makes the userspace report incomplete, given that on soft recovery path >> the guilty context is not told that it's the guilty one. >> >> Fix this by marking the context guilty for every type of reset when a >> job timeouts. > > The guilty handling is pretty much broken by design and only works > because we go through multiple hops of validating the entity after the > job has already been pushed to the hw. I see, thanks. > > I think we should probably just remove that completely and use an > approach where we check the in flight submissions in the query state > IOCTL. Like the DRM_IOCTL_I915_GET_RESET_STATS approach? > See my other patch on the mailing list regarding that. Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence errors" series? > > Additional to that I currently didn't considered soft-recovered > submissions as fatal and continue accepting submissions from that > context, but already wanted to talk with Marek about that behavior. > Interesting. I will try to test and validate this approach to see if the contexts keep working as expected on soft-resets. > Regards, > Christian.
On 4/24/23 15:26, André Almeida wrote: >> >> Additional to that I currently didn't considered soft-recovered submissions as fatal and continue accepting submissions from that context, but already wanted to talk with Marek about that behavior. >> > > Interesting. I will try to test and validate this approach to see if the contexts keep working as expected on soft-resets. FWIW, on this Thinkpad E595 with a Picasso APU, I've hit soft-resets (usually either in Firefox or gnome-shell) a number of times, and usually continued using the GNOME session for a few days without any issues. (Interestingly, Firefox reacts to the soft-resets by falling back to software rendering, even when it's not guilty itself)
On 4/24/23 18:45, Marek Olšák wrote:
> Soft resets are fatal just as hard resets, but no reset is "always fatal". There are cases when apps keep working depending on which features are being used. It's still unsafe.
Agreed, in theory.
In practice, from a user PoV, right now there's pretty much 0 chance of the user session surviving if the GPU context in certain critical processes (e.g. the Wayland compositor or Xwayland) hits a fatal reset. There's a > 0 chance of it surviving after a soft reset. There's ongoing work towards making user-space components more robust against fatal resets, but it's taking time. Meanwhile, I suspect most users would take the > 0 chance.
On 4/25/23 14:08, Christian König wrote: > Well signaling that something happened is not the question. We do this for both soft as well as hard resets. > > The question is if errors result in blocking further submissions with the same context or not. > > In case of a hard reset and potential loss of state we have to kill the context, otherwise a follow up submission would just lockup the hardware once more. > > In case of a soft reset I think we can keep the context alive, this way even applications without robustness handling can keep work. > > You potentially still get some corruption, but at least not your compositor killed. Right, and if there is corruption, the user can restart the session. Maybe a possible compromise could be making soft resets fatal if user space enabled robustness for the context, and non-fatal if not. > Am 25.04.23 um 13:07 schrieb Marek Olšák: >> That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck) that handle resets very well, and those would like to be properly notified of all resets because that's how they get the best outcome, e.g. no corruption. A soft reset that is unhandled by userspace may result in persistent corruption.
Am 25.04.23 um 14:14 schrieb Michel Dänzer: > On 4/25/23 14:08, Christian König wrote: >> Well signaling that something happened is not the question. We do this for both soft as well as hard resets. >> >> The question is if errors result in blocking further submissions with the same context or not. >> >> In case of a hard reset and potential loss of state we have to kill the context, otherwise a follow up submission would just lockup the hardware once more. >> >> In case of a soft reset I think we can keep the context alive, this way even applications without robustness handling can keep work. >> >> You potentially still get some corruption, but at least not your compositor killed. > Right, and if there is corruption, the user can restart the session. > > > Maybe a possible compromise could be making soft resets fatal if user space enabled robustness for the context, and non-fatal if not. Well that should already be mostly the case. If an application has enabled robustness it should notice that something went wrong and act appropriately. The only thing we need to handle is for applications without robustness in case of a hard reset or otherwise it will trigger an reset over and over again. Christian. > > >> Am 25.04.23 um 13:07 schrieb Marek Olšák: >>> That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck) that handle resets very well, and those would like to be properly notified of all resets because that's how they get the best outcome, e.g. no corruption. A soft reset that is unhandled by userspace may result in persistent corruption. >
On 4/25/23 21:11, Marek Olšák wrote:
> The last 3 comments in this thread contain arguments that are false and were specifically pointed out as false 6 comments ago: Soft resets are just as fatal as hard resets. There is nothing better about soft resets. If the VRAM is lost completely, that's a different story, and if the hard reset is 100% unreliable, that's also a different story, but other than those two outliers, there is no difference between the two from the user point view. Both can repeatedly hang if you don't prevent the app that caused the hang from using the GPU even if the app is not robust. The robustness context type doesn't matter here. By definition, no guilty app can continue after a reset, and no innocent apps affected by a reset can continue either because those can now hang too. That's how destructive all resets are. Personal anecdotes that the soft reset is better are just that, anecdotes.
You're trying to frame the situation as black or white, but reality is shades of grey.
There's a similar situation with kernel Oopsen: In principle it's not safe to continue executing the kernel after it hits an Oops, since it might be in an inconsistent state, which could result in any kind of misbehaviour. Still, the default behaviour is to continue executing, and in most cases it turns out fine. Users which cannot accept the residual risk can choose to make the kernel panic when it hits an Oops (either via CONFIG_PANIC_ON_OOPS at build time, or via oops=panic on the kernel command line). A kernel panic means that the machine basically freezes from a user PoV, which would be worse as the default behaviour for most users (because it would e.g. incur a higher risk of losing filesystem data).
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index ac78caa7cba8..ea169d1689e2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, amdgpu_fence_driver_isr_toggle(adev, false); - if (job && job->vm) - drm_sched_increase_karma(&job->base); - r = amdgpu_reset_prepare_hwcontext(adev, reset_context); /* If reset handler not implemented, continue; otherwise return */ if (r == -ENOSYS) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index c3d9d75143f4..097ed8f06865 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) memset(&ti, 0, sizeof(struct amdgpu_task_info)); adev->job_hang = true; + amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); + + if (job && job->vm) { + DRM_INFO("marking %s context as guilty", ti.process_name); + drm_sched_increase_karma(&job->base); + } + if (amdgpu_gpu_recovery && amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { DRM_ERROR("ring %s timeout, but soft recovered\n", @@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) goto exit; } - amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", job->base.sched->name, atomic_read(&ring->fence_drv.last_seq), ring->fence_drv.sync_seq);