Message ID | 20230731130755.2674029-1-quic_ipkumar@quicinc.com |
---|---|
State | New |
Series | soc: qcom: qmi: Signal the txn completion after releasing the mutex |
Commit Message
Praveenkumar I
July 31, 2023, 1:07 p.m. UTC
txn is on #1's stack.

     Worker #1                          Worker #2
     *********                          *********

qmi_txn_wait(txn)                   qmi_handle_message
        |                                   |
        |                                   |
wait_for_complete(txn->complete)           ....
        |                           mutex_lock(txn->lock)
        |                                   |
mutex_lock(txn->lock)                       |
       .....                        complete(txn->completion)
        |                           mutex_unlock(txn->lock)
        |
mutex_unlock(txn->lock)

In the case above, while #2 is doing the mutex_unlock(txn->lock), it gets
scheduled out between releasing the lock and doing the rest of the
lock-related wakeup. As a result #1 acquires the lock, unlocks it, and
also frees the txn (where the lock resides).

Now #2 gets scheduled again and tries to do the rest of the lock-related
wakeup, but the lock itself is invalid because the txn is gone.

Fix this by doing the mutex_unlock(txn->lock) first and then
complete(txn->completion) in #2.

Fixes: 3830d0771ef6 ("soc: qcom: Introduce QMI helpers")
Cc: stable@vger.kernel.org
Signed-off-by: Sricharan Ramabadhran <quic_srichara@quicinc.com>
Signed-off-by: Praveenkumar I <quic_ipkumar@quicinc.com>
---
 drivers/soc/qcom/qmi_interface.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
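For readers without the QMI source in front of them, the following is a condensed, annotated sketch of the two paths in the diagram, written as kernel-style C. It is simplified from (not copied out of) drivers/soc/qcom/qmi_interface.c; the helper names handler_side() and waiter_side() are made up for illustration.

```c
#include <linux/completion.h>
#include <linux/mutex.h>
#include <linux/soc/qcom/qmi.h>

/* Worker #2: receive path, pre-patch ordering (signal, then unlock). */
static void handler_side(struct qmi_txn *txn, int ret)
{
	mutex_lock(&txn->lock);
	txn->result = ret;
	complete(&txn->completion);	/* wakes Worker #1 */
	mutex_unlock(&txn->lock);	/* can be preempted part-way through:
					 * ownership is already released, but
					 * the waiter-wakeup bookkeeping on
					 * txn->lock is still pending */
}

/* Worker #1: caller path. */
static int waiter_side(struct qmi_txn *txn)
{
	wait_for_completion(&txn->completion);
	mutex_lock(&txn->lock);		/* succeeds once #2 releases ownership */
	mutex_unlock(&txn->lock);
	return txn->result;		/* the caller then releases the stack
					 * txn, so #2's pending wakeup touches
					 * freed memory */
}
```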
Comments
On Mon, Jul 31, 2023 at 06:37:55PM +0530, Praveenkumar I wrote:
> [...]
>  		mutex_unlock(&txn->lock);
> +		if (txn->dest && txn->ei)
> +			complete(&txn->completion);
>  	} else {
> [...]

What happens in a remote scenario where the waiter gets timed out at the
very same time you are releasing the mutex but before calling
complete()? The caller might end up freeing the txn structure and it
results in the same issue you are currently facing.

Thanks,
Pavan
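Pavan's concern is the waiter's timeout path. A simplified sketch of that path (illustrative only; the real qmi_txn_wait() differs in detail) shows why moving complete() after the unlock does not by itself close the window: once the wait times out, nothing stops the caller from returning and freeing the txn, wherever the handler's complete() call sits.

```c
#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/soc/qcom/qmi.h>

/* Simplified waiter with a timeout; not the literal qmi_txn_wait(). */
static int txn_wait_sketch(struct qmi_txn *txn, unsigned long timeout)
{
	unsigned long left;

	left = wait_for_completion_timeout(&txn->completion, timeout);

	mutex_lock(&txn->lock);
	/* ...the txn would also be removed from the handle's lookup table
	 * around here so that no new message can find it (simplified)... */
	mutex_unlock(&txn->lock);

	if (!left)
		return -ETIMEDOUT;

	return txn->result;
	/*
	 * Either way the caller is now free to release the (typically
	 * stack-allocated) txn.  If the handler has already dropped
	 * txn->lock but has not yet called complete(), that complete()
	 * then operates on freed memory: the same class of bug.
	 */
}
```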
On 7/31/2023 8:19 AM, Pavan Kondeti wrote:
> What happens in a remote scenario where the waiter gets timed out at the
> very same time you are releasing the mutex but before calling
> complete()? The caller might end up freeing the txn structure and it
> results in the same issue you are currently facing.

I think downstream we had various attempts of moving the signal around
trying to avoid this, but hit scenarios like the one Pavan described.

We eventually settled on removing the txn->lock and treating the
qmi->txn_lock as a big lock. This remedied the issue where the txn->lock
goes out of scope since qmi->txn_lock is tied to the qmi handle.
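A rough idea of what "treating qmi->txn_lock as a big lock" could look like on the receive path is sketched below. This is illustrative only, not the actual downstream patch: decode_into() is a placeholder, the function name is invented, and the lookup details are assumptions.

```c
#include <linux/completion.h>
#include <linux/idr.h>
#include <linux/mutex.h>
#include <linux/qrtr.h>
#include <linux/soc/qcom/qmi.h>

/* Stand-in for the message decode step; not a real QMI helper. */
static int decode_into(struct qmi_txn *txn, const void *buf, size_t len)
{
	return 0;
}

static void qmi_handle_message_biglock(struct qmi_handle *qmi,
				       struct sockaddr_qrtr *sq,
				       const void *buf, size_t len,
				       unsigned int txn_id)
{
	struct qmi_txn *txn;

	mutex_lock(&qmi->txn_lock);
	txn = idr_find(&qmi->txns, txn_id);
	if (txn) {
		if (txn->dest && txn->ei) {
			txn->result = decode_into(txn, buf, len);
			complete(&txn->completion);
		} else {
			qmi_invoke_handler(qmi, sq, txn, buf, len);
		}
	}
	mutex_unlock(&qmi->txn_lock);

	/*
	 * The waiter removes the txn from qmi->txns under the same
	 * qmi->txn_lock before freeing it, so this path either sees a
	 * still-valid txn or no txn at all, and it stops touching
	 * txn-owned memory before dropping the handle-owned lock.
	 */
}
```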
Hi,

On 8/1/2023 6:06 AM, Chris Lew wrote:
> We eventually settled on removing the txn->lock and treating the
> qmi->txn_lock as a big lock. This remedied the issue where the txn->lock
> goes out of scope since qmi->txn_lock is tied to the qmi handle.

ok agree. Using qmi->txn_lock looks a better approach.
That said, this race between mutex lock/unlock looks odd though.
If i remember we saw the issue only with CONFIG_DEBUG_LOCK_ALLOC.
Was that the same case for you guys as well ?

Otherwise, ideally handling all members of the object inside the lock
should be the right solution (i.e. moving the wait_for_complete(txn)
inside the mutex_lock in qmi_txn_wait). That should take care of the
scenario that Pavan described too.

Regards,
Sricharan
On 8/1/2023 4:13 AM, Sricharan Ramabadhran wrote:
> ok agree. Using qmi->txn_lock looks a better approach.
> That said, this race between mutex lock/unlock looks odd though.
> If i remember we saw the issue only with CONFIG_DEBUG_LOCK_ALLOC.
> Was that the same case for you guys as well ?

No, we saw the issue even without CONFIG_DEBUG_LOCK_ALLOC. The
callstacks always ended up showing that the mutex could be acquired
before mutex_unlock() completely finished.

It didn't seem wise to poke at the mutex implementation so we went with
the txn_lock.
On 8/2/2023 5:11 AM, Chris Lew wrote:
> No, we saw the issue even without CONFIG_DEBUG_LOCK_ALLOC. The
> callstacks always ended up showing that the mutex could be acquired
> before mutex_unlock() completely finished.
>
> It didn't seem wise to poke at the mutex implementation so we went with
> the txn_lock.

ok, that's strange. That effectively means mutex_lock/unlock are not
working/protecting the critical section? Then qmi->txn_lock also would
result in a similar issue? I guess, in this case, during the issue, the
txn (which holds the lock) was going out of scope while the txn was
still in use in the other thread. That effectively shows up as a mutex
issue maybe. While the downstream change to use qmi->txn_lock would fix
the mutex issue, we will have to check if the txn object itself is
protected correctly.

Regards,
Sricharan
On 8/2/2023 1:07 AM, Sricharan Ramabadhran wrote:
> ok, that's strange. That effectively means mutex_lock/unlock are not
> working/protecting the critical section? Then qmi->txn_lock also would
> result in a similar issue? I guess, in this case, during the issue, the
> txn (which holds the lock) was going out of scope while the txn was
> still in use in the other thread. That effectively shows up as a mutex
> issue maybe. While the downstream change to use qmi->txn_lock would fix
> the mutex issue, we will have to check if the txn object itself is
> protected correctly.

Looked into this a bit more, I think the mutex was going into
__mutex_unlock_slowpath() because there is a waiter on the txn->lock. In
the slow path there are two sections of the code, one where we release
the owner and another where we notify waiters. It doesn't look like
there is anything to prevent preemption between the two sections.

kernel/locking/mutex.c:

static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock,
						      unsigned long ip)
{
	...
	/*
	 * Release the lock before (potentially) taking the spinlock such that
	 * other contenders can get on with things ASAP.
	 *
	 * Except when HANDOFF, in that case we must not clear the owner field,
	 * but instead set it to the top waiter.
	 */

A mutex is able to guarantee mutual exclusion on the critical sections
that we enclose in locks. It is not able to guarantee the lifetime of an
object; that would have to be done through a kref-like mechanism or code
organization. In this case, relying on qmi->txn_lock would be relying on
code organization to guarantee the lock won't go out of scope/be freed.

Thanks,
Chris
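Chris's distinction between mutual exclusion and object lifetime is the key point. A generic kref pattern for the lifetime half might look like the sketch below; all names are invented and this is not proposed QMI code (QMI transactions are usually stack-allocated, which is exactly why their lifetime cannot be extended this way without a larger rework).

```c
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct example_txn {
	struct kref refcount;
	struct mutex lock;
	struct completion done;
	int result;
};

static void example_txn_release(struct kref *kref)
{
	struct example_txn *txn = container_of(kref, struct example_txn, refcount);

	kfree(txn);
}

static struct example_txn *example_txn_alloc(void)
{
	struct example_txn *txn = kzalloc(sizeof(*txn), GFP_KERNEL);

	if (!txn)
		return NULL;

	kref_init(&txn->refcount);	/* initial reference owned by the waiter */
	mutex_init(&txn->lock);
	init_completion(&txn->done);
	return txn;
}

/* Receive path: pin the txn before touching it, drop the pin when done. */
static void example_handler(struct example_txn *txn, int result)
{
	kref_get(&txn->refcount);

	mutex_lock(&txn->lock);
	txn->result = result;
	mutex_unlock(&txn->lock);
	complete(&txn->done);

	/* Even if the waiter has already returned and dropped its
	 * reference, the txn stays alive until this put. */
	kref_put(&txn->refcount, example_txn_release);
}

/* Waiter: consumes its reference once it is done with the txn. */
static int example_wait(struct example_txn *txn)
{
	int ret;

	wait_for_completion(&txn->done);
	ret = txn->result;
	kref_put(&txn->refcount, example_txn_release);
	return ret;
}
```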
On 8/8/2023 1:56 AM, Chris Lew wrote:
> A mutex is able to guarantee mutual exclusion on the critical sections
> that we enclose in locks. It is not able to guarantee the lifetime of an
> object; that would have to be done through a kref-like mechanism or code
> organization. In this case, relying on qmi->txn_lock would be relying on
> code organization to guarantee the lock won't go out of scope/be freed.

Sorry for the delayed response, I missed it. Yup, exactly. That is why I
earlier suspected that CONFIG_DEBUG_LOCK_ALLOC might have been enabled in
your case as well; it was the same for us. Now it is correct for both
cases.

Regards,
Sricharan
diff --git a/drivers/soc/qcom/qmi_interface.c b/drivers/soc/qcom/qmi_interface.c
index 78d7361fdcf2..92e29db97359 100644
--- a/drivers/soc/qcom/qmi_interface.c
+++ b/drivers/soc/qcom/qmi_interface.c
@@ -505,12 +505,13 @@ static void qmi_handle_message(struct qmi_handle *qmi,
 				pr_err("failed to decode incoming message\n");
 
 			txn->result = ret;
-			complete(&txn->completion);
 		} else {
 			qmi_invoke_handler(qmi, sq, txn, buf, len);
 		}
 
 		mutex_unlock(&txn->lock);
+		if (txn->dest && txn->ei)
+			complete(&txn->completion);
 	} else {
 		/* Create a txn based on the txn_id of the incoming message */
 		memset(&tmp_txn, 0, sizeof(tmp_txn));