Message ID | 20230521100330.22478-1-quic_kriskura@quicinc.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp801619vqo; Sun, 21 May 2023 03:26:20 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6fdqKQNye270ho1SVzAOy4rFZ5lLB7KEVYfhcHhkUhfJ/3mmoTSbh7Bm/mImUNcj1yOB6T X-Received: by 2002:a17:903:22c2:b0:1ae:ce9:f391 with SMTP id y2-20020a17090322c200b001ae0ce9f391mr12392655plg.4.1684664780160; Sun, 21 May 2023 03:26:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684664780; cv=none; d=google.com; s=arc-20160816; b=OFWbfWQHLeWbeLM9MLv9mi7AuUR+TGo9EXErmkgQGeB4kYekfCC46aRJCdg6EwDYQD cJWItRLPVYuCLrhl4lu3yT3g2GgssR+vPj+WBW6eqgp0PXYAfGkxSFALpycH/xl4Rh/3 GIl7oc29/hiIxzYPCtE2nP7grT5YI8UUzHA1AD8GhSJIodaP9QlQF1k40fL44sRCMvW/ s9Ue1ZBlfn29f3tFWwYLwcxsJZFCU7kGxYOMrIiQGZf1nvqAdHvdbA8tHNaIAfYVIqIk +sEOgI91E11Jfoee/FQ1TZ9AqiehQQD5sKCK1/2KAL71ptcmGVA2MzBlUUr6cGKlxMZG Mrlw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=68kLvzwuNwZkgyfDs3LEwKDRLCkN60rJMtgAlw5OSlI=; b=DIeHmjPrqPJFxCLsG7PNFY9WuOCU5sy4m2ckDguzQxtXUkGIczR6ghz3C2Ri+h1jfZ eteOJHvlv2e0D/YVncprBDFBRytRfm321GuoyUdy7PBA7cfo1ByKu1tJeAQ8glynTlvD Gazfslyv7iQMnuGl1SMzVpbFKgm9wtasecK6aphAr1aPSZo1KJK0ez4CZvOcRO0ByS56 Kya0gHrRQoXXH0eK6dvO/WjVFlGJnhrrZB4BaIETdg3sD8hIAFSWmuEbc1mNaGLl+F5R 7tJJ+w/IxqjrdKUk4fzSSU9h+A8w2E8nnO3yaEofIhRtU6eWT0YYh3bZIBoS33Z9JxhT trWQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@quicinc.com header.s=qcppdkim1 header.b=MdlldXfZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=quicinc.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x17-20020a170902ea9100b001a043389a7asi2284576plb.310.2023.05.21.03.26.04; Sun, 21 May 2023 03:26:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@quicinc.com header.s=qcppdkim1 header.b=MdlldXfZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=quicinc.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230496AbjEUKV0 (ORCPT <rfc822;cscallsign@gmail.com> + 99 others); Sun, 21 May 2023 06:21:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34256 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230518AbjEUKHR (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sun, 21 May 2023 06:07:17 -0400 Received: from mx0a-0031df01.pphosted.com (mx0a-0031df01.pphosted.com [205.220.168.131]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E1DEC184; Sun, 21 May 2023 03:03:45 -0700 (PDT) Received: from pps.filterd (m0279866.ppops.net [127.0.0.1]) by mx0a-0031df01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 34LA1aSD004786; Sun, 21 May 2023 10:03:43 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=quicinc.com; h=from : to : cc : subject : date : message-id : mime-version : content-transfer-encoding : content-type; s=qcppdkim1; bh=68kLvzwuNwZkgyfDs3LEwKDRLCkN60rJMtgAlw5OSlI=; b=MdlldXfZtKMmfKA03uFPYnd4AGE2kwuRGHMGIkA0lnlDvqqxfMG5DlL3TRo+3nxkqw85 CnklEhLhfJUYZp3qWRIgBzzo5dnxotI5g/6VwSsiIDjk97mnAFILJIThmW+RZ8Un+qDj iL0W5UiDRpb2L7NHt0Epn58SlDV8VOq/1kamc/pDxlkKpYQhsRKjlyDHMXB9uKl7Qa6a htEQF3kt6dNDEEoVck9nqp7IFEVcNRONisWfj1W021hi7laN9b8WSwmcZ0FboQJQkr/I f4iyQRC9N9FdrVaI9DyI8TW2i9+5mn660hbS5DjxNs53MDRkNB6UiB/lTc6Dw52cStNh Kg== Received: from nalasppmta02.qualcomm.com (Global_NAT1.qualcomm.com [129.46.96.20]) by mx0a-0031df01.pphosted.com (PPS) with ESMTPS id 3qpqgf1fdv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 21 May 2023 10:03:43 +0000 Received: from nalasex01a.na.qualcomm.com (nalasex01a.na.qualcomm.com [10.47.209.196]) by NALASPPMTA02.qualcomm.com (8.17.1.5/8.17.1.5) with ESMTPS id 34LA3gsL003800 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 21 May 2023 10:03:42 GMT Received: from hu-kriskura-hyd.qualcomm.com (10.80.80.8) by nalasex01a.na.qualcomm.com (10.47.209.196) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.42; Sun, 21 May 2023 03:03:39 -0700 From: Krishna Kurapati <quic_kriskura@quicinc.com> To: Thinh Nguyen <Thinh.Nguyen@synopsys.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: <linux-usb@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <quic_ppratap@quicinc.com>, <quic_wcheng@quicinc.com>, <quic_jackp@quicinc.com>, <quic_ugoswami@quicinc.com>, Krishna Kurapati <quic_kriskura@quicinc.com> Subject: [RFC] usb: dwc3: gadget: Fix amount of data copied from event buf to cache Date: Sun, 21 May 2023 15:33:30 +0530 Message-ID: <20230521100330.22478-1-quic_kriskura@quicinc.com> X-Mailer: git-send-email 2.40.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-Originating-IP: [10.80.80.8] X-ClientProxiedBy: nasanex01a.na.qualcomm.com (10.52.223.231) To nalasex01a.na.qualcomm.com (10.47.209.196) X-QCInternal: smtphost X-Proofpoint-Virus-Version: vendor=nai engine=6200 definitions=5800 signatures=585085 X-Proofpoint-ORIG-GUID: tgts-eL5NTbPitm1W5NwuNnnMcUEg0zC X-Proofpoint-GUID: tgts-eL5NTbPitm1W5NwuNnnMcUEg0zC X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.254,Aquarius:18.0.957,Hydra:6.0.573,FMLib:17.11.170.22 definitions=2023-05-21_06,2023-05-17_02,2023-02-09_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 clxscore=1015 suspectscore=0 malwarescore=0 priorityscore=1501 impostorscore=0 phishscore=0 mlxscore=0 bulkscore=0 mlxlogscore=759 lowpriorityscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2304280000 definitions=main-2305210086 X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1766499056378269074?= X-GMAIL-MSGID: =?utf-8?q?1766499056378269074?= |
Series |
[RFC] usb: dwc3: gadget: Fix amount of data copied from event buf to cache
|
|
Commit Message
Krishna Kurapati
May 21, 2023, 10:03 a.m. UTC
In the current implementation, the check_event_buf call reads the
GEVNTCOUNT register to determine the amount of event data generated
and copies it from ev->buf to ev->cache after masking interrupt.
During copy if the amount of data to be copied is more than
(length - lpos), we fill the ev->cache till the end of 4096 byte
buffer allocated and then start filling from the top (lpos = 0).
In one instance of SMMU crash it is observed that GEVNTCOUNT register
reads more than 4096 bytes:
dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488
(offset = 50188 -> 0xC40C) -> reads 63488 bytes
As per crash dump:
dwc->lpos = 2056
and evt->buf is at 0xFFFFFFC009185000 and the crash is at
0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.
We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).
And then we copy the rest of the data (64388 - 2040) from beginning
of dwc->ev_buf. While doing so we go beyond bounds as we are trying
to memcpy 62348 bytes into a 4K buffer resulting in crash.
Fix this by limiting the total data being copied to ev->length to
avoid copying data beyond bounds. Moreover this is logical because if
the controller generated events more than the size of ring buffer,
some of them might have been overwritten by the controller.
Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
---
Only one instance of this crash was observed so far. As per the
databook:
"The controller always leaves one entry free in each Event Buffer.
When the Event Buffer is almost full, hardware writes the Event
Buffer Overflow event and the USB eventually gets stalled when
endpoints start responding NRDY or the link layer stops returning
credits (in SuperSpeed). This event is an indication to software that
it is not processing events quickly enough. During this time, events
are queued up internally. When software frees up Event Buffer space,
the queued up events are written out and the USB returns to normal
operation"
I didn't see any overflow event in the event buffer after parsing
crash dump. Although this could be some HW issue, I thought we can
include this fix in software as well to avoid such scenario.
drivers/usb/dwc3/gadget.c | 3 +++
1 file changed, 3 insertions(+)
Comments
On Sun, May 21, 2023, Krishna Kurapati wrote: > In the current implementation, the check_event_buf call reads the > GEVNTCOUNT register to determine the amount of event data generated > and copies it from ev->buf to ev->cache after masking interrupt. > > During copy if the amount of data to be copied is more than > (length - lpos), we fill the ev->cache till the end of 4096 byte > buffer allocated and then start filling from the top (lpos = 0). > > In one instance of SMMU crash it is observed that GEVNTCOUNT register > reads more than 4096 bytes: > > dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488 > > (offset = 50188 -> 0xC40C) -> reads 63488 bytes > > As per crash dump: > dwc->lpos = 2056 > > and evt->buf is at 0xFFFFFFC009185000 and the crash is at > 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes. > > We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000). > > And then we copy the rest of the data (64388 - 2040) from beginning > of dwc->ev_buf. While doing so we go beyond bounds as we are trying > to memcpy 62348 bytes into a 4K buffer resulting in crash. > > Fix this by limiting the total data being copied to ev->length to > avoid copying data beyond bounds. Moreover this is logical because if > the controller generated events more than the size of ring buffer, > some of them might have been overwritten by the controller. > > Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> > --- > Only one instance of this crash was observed so far. As per the > databook: > > "The controller always leaves one entry free in each Event Buffer. > When the Event Buffer is almost full, hardware writes the Event > Buffer Overflow event and the USB eventually gets stalled when > endpoints start responding NRDY or the link layer stops returning > credits (in SuperSpeed). This event is an indication to software that > it is not processing events quickly enough. During this time, events > are queued up internally. When software frees up Event Buffer space, > the queued up events are written out and the USB returns to normal > operation" > > I didn't see any overflow event in the event buffer after parsing > crash dump. Although this could be some HW issue, I thought we can > include this fix in software as well to avoid such scenario. > What's the GEVNTSIZ at the point of the crash? That's where the driver tells the controller how much it allocated for the event buffer. Check to make sure that it wasn't reset during operation (not cleanup). BR, Thinh
On 5/23/2023 2:47 AM, Thinh Nguyen wrote: > On Sun, May 21, 2023, Krishna Kurapati wrote: >> In the current implementation, the check_event_buf call reads the >> GEVNTCOUNT register to determine the amount of event data generated >> and copies it from ev->buf to ev->cache after masking interrupt. >> >> During copy if the amount of data to be copied is more than >> (length - lpos), we fill the ev->cache till the end of 4096 byte >> buffer allocated and then start filling from the top (lpos = 0). >> >> In one instance of SMMU crash it is observed that GEVNTCOUNT register >> reads more than 4096 bytes: >> >> dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488 >> >> (offset = 50188 -> 0xC40C) -> reads 63488 bytes >> >> As per crash dump: >> dwc->lpos = 2056 >> >> and evt->buf is at 0xFFFFFFC009185000 and the crash is at >> 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes. >> >> We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000). >> >> And then we copy the rest of the data (64388 - 2040) from beginning >> of dwc->ev_buf. While doing so we go beyond bounds as we are trying >> to memcpy 62348 bytes into a 4K buffer resulting in crash. >> >> Fix this by limiting the total data being copied to ev->length to >> avoid copying data beyond bounds. Moreover this is logical because if >> the controller generated events more than the size of ring buffer, >> some of them might have been overwritten by the controller. >> >> Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> >> --- >> Only one instance of this crash was observed so far. As per the >> databook: >> >> "The controller always leaves one entry free in each Event Buffer. >> When the Event Buffer is almost full, hardware writes the Event >> Buffer Overflow event and the USB eventually gets stalled when >> endpoints start responding NRDY or the link layer stops returning >> credits (in SuperSpeed). This event is an indication to software that >> it is not processing events quickly enough. During this time, events >> are queued up internally. When software frees up Event Buffer space, >> the queued up events are written out and the USB returns to normal >> operation" >> >> I didn't see any overflow event in the event buffer after parsing >> crash dump. Although this could be some HW issue, I thought we can >> include this fix in software as well to avoid such scenario. >> > > What's the GEVNTSIZ at the point of the crash? That's where the driver > tells the controller how much it allocated for the event buffer. > > Check to make sure that it wasn't reset during operation (not cleanup). Hi Thinh, The last 3 RW traces were as follows: <idle>-0 [001] 5834.471640: dwc3_writel base=0xffffffc0091dc000 offset=50184 value=4096 <idle>-0 [001] 5834.471799: dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488 <idle>-0 [001] 5834.471803: dwc3_writel base=0xffffffc0091dc000 offset=50184 value=2147487744 The first one was at the end of previous process_event_entry call where we unmasked the interrupt. The second one was the read of GEVTCOUNT. The third one was where we wrote 31st bit of the GEVTSIZ register to mask interrupt along with 4096 bytes to [15:0] bits. There is only 100-150us gap between each of these read writes and if you are referring to whether the GEVTSIZ got modified in between, I am not sure of that. Regards, Krishna,
On Tue, May 23, 2023, Krishna Kurapati PSSNV wrote: > > > On 5/23/2023 2:47 AM, Thinh Nguyen wrote: > > On Sun, May 21, 2023, Krishna Kurapati wrote: > > > In the current implementation, the check_event_buf call reads the > > > GEVNTCOUNT register to determine the amount of event data generated > > > and copies it from ev->buf to ev->cache after masking interrupt. > > > > > > During copy if the amount of data to be copied is more than > > > (length - lpos), we fill the ev->cache till the end of 4096 byte > > > buffer allocated and then start filling from the top (lpos = 0). > > > > > > In one instance of SMMU crash it is observed that GEVNTCOUNT register > > > reads more than 4096 bytes: > > > > > > dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488 > > > > > > (offset = 50188 -> 0xC40C) -> reads 63488 bytes > > > > > > As per crash dump: > > > dwc->lpos = 2056 > > > > > > and evt->buf is at 0xFFFFFFC009185000 and the crash is at > > > 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes. > > > > > > We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000). > > > > > > And then we copy the rest of the data (64388 - 2040) from beginning > > > of dwc->ev_buf. While doing so we go beyond bounds as we are trying > > > to memcpy 62348 bytes into a 4K buffer resulting in crash. > > > > > > Fix this by limiting the total data being copied to ev->length to > > > avoid copying data beyond bounds. Moreover this is logical because if > > > the controller generated events more than the size of ring buffer, > > > some of them might have been overwritten by the controller. > > > > > > Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> > > > --- > > > Only one instance of this crash was observed so far. As per the > > > databook: > > > > > > "The controller always leaves one entry free in each Event Buffer. > > > When the Event Buffer is almost full, hardware writes the Event > > > Buffer Overflow event and the USB eventually gets stalled when > > > endpoints start responding NRDY or the link layer stops returning > > > credits (in SuperSpeed). This event is an indication to software that > > > it is not processing events quickly enough. During this time, events > > > are queued up internally. When software frees up Event Buffer space, > > > the queued up events are written out and the USB returns to normal > > > operation" > > > > > > I didn't see any overflow event in the event buffer after parsing > > > crash dump. Although this could be some HW issue, I thought we can > > > include this fix in software as well to avoid such scenario. > > > > > > > What's the GEVNTSIZ at the point of the crash? That's where the driver > > tells the controller how much it allocated for the event buffer. > > > > Check to make sure that it wasn't reset during operation (not cleanup). > > > Hi Thinh, > > The last 3 RW traces were as follows: > > <idle>-0 [001] 5834.471640: dwc3_writel base=0xffffffc0091dc000 > offset=50184 value=4096 > <idle>-0 [001] 5834.471799: dwc3_readl base=0xffffffc0091dc000 > offset=50188 value=63488 > <idle>-0 [001] 5834.471803: dwc3_writel base=0xffffffc0091dc000 > offset=50184 value=2147487744 > > The first one was at the end of previous process_event_entry call where we > unmasked the interrupt. > > The second one was the read of GEVTCOUNT. > > The third one was where we wrote 31st bit of the GEVTSIZ register to mask > interrupt along with 4096 bytes to [15:0] bits. > > There is only 100-150us gap between each of these read writes and if you are > referring to whether the GEVTSIZ got modified in between, I am not sure of > that. > The read from GEVNTCOUNT looks corrupted. When was the last time (delta time) an event was handled prior to the failure? If it's a short delta, it further indicates that whatever the value returned is invalid. So, setting count = evt->length if count > evt->length is no good. The event count is probably much lower and you may end up reading stale events. It's odd that we are only able to see this issue now. For this type of failure, it should've been exposed much early on. Is there potentially something different for your setup/configuration that may cause this issue. Also, how easily are you able to reproduce this. Did you try to re-read the GEVNTCOUNT multiple times when this happens to see if the value changes dramatically? Thanks, Thinh
On 5/24/2023 2:52 AM, Thinh Nguyen wrote: > On Tue, May 23, 2023, Krishna Kurapati PSSNV wrote: >> >> >> On 5/23/2023 2:47 AM, Thinh Nguyen wrote: >>> On Sun, May 21, 2023, Krishna Kurapati wrote: >>>> In the current implementation, the check_event_buf call reads the >>>> GEVNTCOUNT register to determine the amount of event data generated >>>> and copies it from ev->buf to ev->cache after masking interrupt. >>>> >>>> During copy if the amount of data to be copied is more than >>>> (length - lpos), we fill the ev->cache till the end of 4096 byte >>>> buffer allocated and then start filling from the top (lpos = 0). >>>> >>>> In one instance of SMMU crash it is observed that GEVNTCOUNT register >>>> reads more than 4096 bytes: >>>> >>>> dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488 >>>> >>>> (offset = 50188 -> 0xC40C) -> reads 63488 bytes >>>> >>>> As per crash dump: >>>> dwc->lpos = 2056 >>>> >>>> and evt->buf is at 0xFFFFFFC009185000 and the crash is at >>>> 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes. >>>> >>>> We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000). >>>> >>>> And then we copy the rest of the data (64388 - 2040) from beginning >>>> of dwc->ev_buf. While doing so we go beyond bounds as we are trying >>>> to memcpy 62348 bytes into a 4K buffer resulting in crash. >>>> >>>> Fix this by limiting the total data being copied to ev->length to >>>> avoid copying data beyond bounds. Moreover this is logical because if >>>> the controller generated events more than the size of ring buffer, >>>> some of them might have been overwritten by the controller. >>>> >>>> Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> >>>> --- >>>> Only one instance of this crash was observed so far. As per the >>>> databook: >>>> >>>> "The controller always leaves one entry free in each Event Buffer. >>>> When the Event Buffer is almost full, hardware writes the Event >>>> Buffer Overflow event and the USB eventually gets stalled when >>>> endpoints start responding NRDY or the link layer stops returning >>>> credits (in SuperSpeed). This event is an indication to software that >>>> it is not processing events quickly enough. During this time, events >>>> are queued up internally. When software frees up Event Buffer space, >>>> the queued up events are written out and the USB returns to normal >>>> operation" >>>> >>>> I didn't see any overflow event in the event buffer after parsing >>>> crash dump. Although this could be some HW issue, I thought we can >>>> include this fix in software as well to avoid such scenario. >>>> >>> >>> What's the GEVNTSIZ at the point of the crash? That's where the driver >>> tells the controller how much it allocated for the event buffer. >>> >>> Check to make sure that it wasn't reset during operation (not cleanup). >> >> >> Hi Thinh, >> >> The last 3 RW traces were as follows: >> >> <idle>-0 [001] 5834.471640: dwc3_writel base=0xffffffc0091dc000 >> offset=50184 value=4096 >> <idle>-0 [001] 5834.471799: dwc3_readl base=0xffffffc0091dc000 >> offset=50188 value=63488 >> <idle>-0 [001] 5834.471803: dwc3_writel base=0xffffffc0091dc000 >> offset=50184 value=2147487744 >> >> The first one was at the end of previous process_event_entry call where we >> unmasked the interrupt. >> >> The second one was the read of GEVTCOUNT. >> >> The third one was where we wrote 31st bit of the GEVTSIZ register to mask >> interrupt along with 4096 bytes to [15:0] bits. >> >> There is only 100-150us gap between each of these read writes and if you are >> referring to whether the GEVTSIZ got modified in between, I am not sure of >> that. >> > > The read from GEVNTCOUNT looks corrupted. When was the last time (delta > time) an event was handled prior to the failure? If it's a short delta, > it further indicates that whatever the value returned is invalid. > > So, setting count = evt->length if count > evt->length is no good. The > event count is probably much lower and you may end up reading stale > events. > > It's odd that we are only able to see this issue now. For this type of > failure, it should've been exposed much early on. Is there potentially > something different for your setup/configuration that may cause this > issue. Also, how easily are you able to reproduce this. Did you try to > re-read the GEVNTCOUNT multiple times when this happens to see if the > value changes dramatically? > The previous event processed was just before this interrupt came in which too was having a huge GEVTCOUNT but not big to crash the device (in memcpy). And very soon, after it was processed, this current interrupt came in. (That is why in the comment section of the patch, I mentioned that this could be a hw issue too that came up randomly. But wanted to check if this case is something that can be considered for fixing.) And there has been only one instance reported so far on a target which has very less computational capability and has a hug sw stack (apart from linux) running on top of it. Regards, Krishna,
On Wed, May 24, 2023, Krishna Kurapati PSSNV wrote: > > > On 5/24/2023 2:52 AM, Thinh Nguyen wrote: > > On Tue, May 23, 2023, Krishna Kurapati PSSNV wrote: > > > > > > > > > On 5/23/2023 2:47 AM, Thinh Nguyen wrote: > > > > On Sun, May 21, 2023, Krishna Kurapati wrote: > > > > > In the current implementation, the check_event_buf call reads the > > > > > GEVNTCOUNT register to determine the amount of event data generated > > > > > and copies it from ev->buf to ev->cache after masking interrupt. > > > > > > > > > > During copy if the amount of data to be copied is more than > > > > > (length - lpos), we fill the ev->cache till the end of 4096 byte > > > > > buffer allocated and then start filling from the top (lpos = 0). > > > > > > > > > > In one instance of SMMU crash it is observed that GEVNTCOUNT register > > > > > reads more than 4096 bytes: > > > > > > > > > > dwc3_readl base=0xffffffc0091dc000 offset=50188 value=63488 > > > > > > > > > > (offset = 50188 -> 0xC40C) -> reads 63488 bytes > > > > > > > > > > As per crash dump: > > > > > dwc->lpos = 2056 > > > > > > > > > > and evt->buf is at 0xFFFFFFC009185000 and the crash is at > > > > > 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes. > > > > > > > > > > We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000). > > > > > > > > > > And then we copy the rest of the data (64388 - 2040) from beginning > > > > > of dwc->ev_buf. While doing so we go beyond bounds as we are trying > > > > > to memcpy 62348 bytes into a 4K buffer resulting in crash. > > > > > > > > > > Fix this by limiting the total data being copied to ev->length to > > > > > avoid copying data beyond bounds. Moreover this is logical because if > > > > > the controller generated events more than the size of ring buffer, > > > > > some of them might have been overwritten by the controller. > > > > > > > > > > Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> > > > > > --- > > > > > Only one instance of this crash was observed so far. As per the > > > > > databook: > > > > > > > > > > "The controller always leaves one entry free in each Event Buffer. > > > > > When the Event Buffer is almost full, hardware writes the Event > > > > > Buffer Overflow event and the USB eventually gets stalled when > > > > > endpoints start responding NRDY or the link layer stops returning > > > > > credits (in SuperSpeed). This event is an indication to software that > > > > > it is not processing events quickly enough. During this time, events > > > > > are queued up internally. When software frees up Event Buffer space, > > > > > the queued up events are written out and the USB returns to normal > > > > > operation" > > > > > > > > > > I didn't see any overflow event in the event buffer after parsing > > > > > crash dump. Although this could be some HW issue, I thought we can > > > > > include this fix in software as well to avoid such scenario. > > > > > > > > > > > > > What's the GEVNTSIZ at the point of the crash? That's where the driver > > > > tells the controller how much it allocated for the event buffer. > > > > > > > > Check to make sure that it wasn't reset during operation (not cleanup). > > > > > > > > > Hi Thinh, > > > > > > The last 3 RW traces were as follows: > > > > > > <idle>-0 [001] 5834.471640: dwc3_writel base=0xffffffc0091dc000 > > > offset=50184 value=4096 > > > <idle>-0 [001] 5834.471799: dwc3_readl base=0xffffffc0091dc000 > > > offset=50188 value=63488 > > > <idle>-0 [001] 5834.471803: dwc3_writel base=0xffffffc0091dc000 > > > offset=50184 value=2147487744 > > > > > > The first one was at the end of previous process_event_entry call where we > > > unmasked the interrupt. > > > > > > The second one was the read of GEVTCOUNT. > > > > > > The third one was where we wrote 31st bit of the GEVTSIZ register to mask > > > interrupt along with 4096 bytes to [15:0] bits. > > > > > > There is only 100-150us gap between each of these read writes and if you are > > > referring to whether the GEVTSIZ got modified in between, I am not sure of > > > that. > > > > > > > The read from GEVNTCOUNT looks corrupted. When was the last time (delta > > time) an event was handled prior to the failure? If it's a short delta, > > it further indicates that whatever the value returned is invalid. > > > > So, setting count = evt->length if count > evt->length is no good. The > > event count is probably much lower and you may end up reading stale > > events. > > > > It's odd that we are only able to see this issue now. For this type of > > failure, it should've been exposed much early on. Is there potentially > > something different for your setup/configuration that may cause this > > issue. Also, how easily are you able to reproduce this. Did you try to > > re-read the GEVNTCOUNT multiple times when this happens to see if the > > value changes dramatically? > > > > The previous event processed was just before this interrupt came in which > too was having a huge GEVTCOUNT but not big to crash the device (in memcpy). > And very soon, after it was processed, this current interrupt came in. > > (That is why in the comment section of the patch, I mentioned that this > could be a hw issue too that came up randomly. But wanted to check if this > case is something that can be considered for fixing.) > > And there has been only one instance reported so far on a target which has > very less computational capability and has a hug sw stack (apart from linux) > running on top of it. > This seems to be a hw issue. We can add a check if the event count is beyond evt->length, then ignore the interrupt and print an error to prevent the invalid memory access. We're not sure the events are valid at this point. As for the recovery from this issue, we need root cause the problem to provide an appropriate fix. Thanks, Thinh
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 5965796bc5d5..cff043ab9017 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -4457,6 +4457,9 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt) if (!count) return IRQ_NONE; + if (count > evt->length) + count = evt->length; + evt->count = count; evt->flags |= DWC3_EVENT_PENDING;