Message ID | 20230522143105.8617-1-LinoSanfilippo@gmx.de |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp1505729vqo; Mon, 22 May 2023 07:51:32 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6Mzno/o1EW7StnoYpBn27uiAU4lzaDhk+MFOeZxaXtqVq2HvpkjwbjeWDqWqDM+MXx4Knc X-Received: by 2002:a17:902:6bc2:b0:1ae:6882:5bc4 with SMTP id m2-20020a1709026bc200b001ae68825bc4mr9679559plt.64.1684767091741; Mon, 22 May 2023 07:51:31 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684767091; cv=none; d=google.com; s=arc-20160816; b=GNmG5fh0AR9MwGij+DEU37N5g0cc2uOXuNWZSnwJvcPqkqMXfQEGQLLX3o0ToCsMEy tj0WuzY2kzly52Pr3ALb52s1BQ1VXTHQ33S5SXZYOaw18UgQFsNbIP+QI5lbrfYPCjs1 Wv7teLo+Ov96QX3qtjulT0SXLl9zWnH7VoOEf5/thjC6rHrs/QlBHfd7npYHaUJdDlSC 1aTqDxD4h54zX+sRyux+5UDnsQh4jaLP8NLEIYSZ5KhTHwD72coNF+sIo5m+1fyALUOd sUkuMqy9NB9ed11W9fmiZqXyAXZtyDLjCTCVDgJPMMqN8Fg1ZnZjHlc6WW4v2nPD0YVl 1Xmw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:ui-outboundreport:content-transfer-encoding :mime-version:message-id:date:subject:cc:to:from:dkim-signature; bh=lwQ+t7RCqn9sjjERhaQcuu0ceFAVx7n0LyVEjPw+TWs=; b=vow3Y0NvLbPkSJE2++psnwH03782DArUfnuxDTq01Vyuh7fHsxloD4kGxhw9Wa3ZAw bhQQVmWMgxXnfoJRnL+6naKNcBzYSIxzuisPIGl/iRJqDUsv+hzVxO4jc1dObDsy5GV8 oYNNsG2ZZZTyP/eOeePzDt/g5jcJbEiTD3ChiVHMVqLXnVQm3AC4jxVL7P+GD6cZJvwb tYbaaKTVqvWEoQ02Qr051yPhLyb7WbkIXqr0U+Z3XH1vZqivN3vJRVyQsqxG671lTkDf ZnD0ImjFwZDY11LbYGKxIU2KYiyUzZXw2moxcRsKoj0NEkMHRi/WXAhaIqWBxkfI8VVB 691A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmx.de header.s=s31663417 header.b=TDxYqCU9; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gmx.de Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id m6-20020a170902db8600b001ae67aab6a6si658040pld.172.2023.05.22.07.51.18; Mon, 22 May 2023 07:51:31 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmx.de header.s=s31663417 header.b=TDxYqCU9; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gmx.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233697AbjEVOca (ORCPT <rfc822;wlfightup@gmail.com> + 99 others); Mon, 22 May 2023 10:32:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56462 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233574AbjEVOc2 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 22 May 2023 10:32:28 -0400 Received: from mout.gmx.net (mout.gmx.net [212.227.15.15]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3132BA9; Mon, 22 May 2023 07:32:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=gmx.de; s=s31663417; t=1684765901; i=linosanfilippo@gmx.de; bh=lwQ+t7RCqn9sjjERhaQcuu0ceFAVx7n0LyVEjPw+TWs=; h=X-UI-Sender-Class:From:To:Cc:Subject:Date; b=TDxYqCU9CtVWVrlfU1RaliEa09utKq9fw1wfmUq7CqDOeK5xPtbL9lAaqGO7GWqzM z3/o+MylPzTBZskzESGi2YoWElc7sljBW3ieIPVoE0lem41Erauj+m8OJFki8aXlGQ 6XTPfVG4LHofz1Bx4XunH/cvqMOLATg6cBQfvjZQhEdR7b9GJvJiBpyqqCX0KwG8iw svAnco1yeeGU5ZlgmKiKrZAxwNqDLtYVuJbhIPXyZvZaBRL80uXboOH7JSl28cbnbh qgfDfnxzee8UkSoaVlwHX53qQkxFw87r8Cowab78xUqOffHmKanxP2AxVbKyUbkjmC z8nmVh2nlCB5Q== X-UI-Sender-Class: 724b4f7f-cbec-4199-ad4e-598c01a50d3a Received: from Venus.speedport.ip ([84.162.2.106]) by mail.gmx.net (mrgmx005 [212.227.17.190]) with ESMTPSA (Nemesis) id 1N4hvR-1qADD027vD-011gMN; Mon, 22 May 2023 16:31:41 +0200 From: Lino Sanfilippo <LinoSanfilippo@gmx.de> To: peterhuewe@gmx.de, jarkko@kernel.org, jgg@ziepe.ca Cc: jsnitsel@redhat.com, hdegoede@redhat.com, oe-lkp@lists.linux.dev, lkp@intel.com, peter.ujfalusi@linux.intel.com, peterz@infradead.org, linux@mniewoehner.de, linux-integrity@vger.kernel.org, linux-kernel@vger.kernel.org, l.sanfilippo@kunbus.com, LinoSanfilippo@gmx.de, lukas@wunner.de, p.rosenberger@kunbus.com Subject: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm Date: Mon, 22 May 2023 16:31:04 +0200 Message-Id: <20230522143105.8617-1-LinoSanfilippo@gmx.de> X-Mailer: git-send-email 2.40.1 MIME-Version: 1.0 Content-Transfer-Encoding: base64 X-Provags-ID: V03:K1:yIe+WGyH1ASzJ7ENp03LTV4IXOeonm6dRKoNYtjMLPwcfYAi2iq bsV98q//ftrbGfjns3dZRRrI8a0UPsJzqcfUV3ea8Pg/tF5VZGL1bX7WEsGetOCdoX/WwFA GQvGPY7rnu6kc3KPfzmEEfx0MQXnyZweOUAuaSmk8Bnsc6ZrWh4uo7+yfvYQbFJxrcs/EIF f6xWuxrA2CptkCSbSY0/w== UI-OutboundReport: notjunk:1;M01:P0:btJ3+woI3As=;es3Q1meS12RE+lnRJ4+NzVouEpm /8b8hj2thexAdcK5YFJ/GTvX3FAw+yLNuCKhp7Vrpuc52EHwzUHYEY6tu/6oKPWDfzApqveMl eHUSqnH5lU6Gg6HdOuNcURfE9QxCFqWdc6V9rrXSlOZSH/isqt2ieKoOTdU2MBFhm7Ejl66vX RcmXz2mgx56Byh9wHBTRBhgvziYs4AVdGWOA5XznZh0YCFCHaSXJWfEefpRZwHSDfL4ipyYIB GYBEyKz/SblbSy6mMN2VlC+1EAQBoKX4fkb1A4yoOyxpaIDpRqwEvSx/vABUvTX13SMMuu5TR PNxhDLV569Ms5UGlgta5qe79fHaDfplaqZufVBmcaeVaeYlULtGCAlGIp7DGaZroy9sEt2j48 E6Khx86YLAjBmcjkASEnA26+bYAruJkCHkNyknY6CC6XN6O2irhv33t5p5wOuLANJGfCxWMSQ SfdVKY2hAdtXoPnf6MX6i1/h5fWEXgHIXQn4HiKhGjIwjeMbeMbeKQNelv+H/FycwV0eCQIMa VgwcJzgKiJ0t9Hi26GXBz+8f8x3wYCgRVcRUmanaPZrmGCeIWaVJZgCF1JQEMJyik9bNyw8jj cD/WBZ6J0YS1SEa7+uBeIicABcJdWU9AYcOkWOesG0VOVCSJLD0S2/l88rZFMSFCHTxeYdvwD Y19hgBdHJZi1iaLyWB9fVQhNSSr2tjGCksO6EBkeNhekDaIR8ilet8LesGUl9yeNRbLJBt8O3 Di0uWzMxEBEECIKMT+CIrJeztQMHknzQMYM3zxx5d9b3dqMkjD1Ie/5pwZ92vXI14pSpmp/7W XaNecdyWSR9QKvqRgjMb6o4thr0ateiXCcAFd3tkZ/JE7ChO959fVXP3TUUJm6ZEZ/XQ+3ju5 DIJlCJVnXEAmyo2eYg2htxMTvFXP8wbE62XI88N0sqHmQhY2QCFobKxhHYm63wIbgefUJle9C kALmVPKqb2KoWjEhFO2CM8KUmv0= X-Spam-Status: No, score=-1.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,MIME_BASE64_TEXT, RCVD_IN_DNSWL_LOW,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1766606338435968010?= X-GMAIL-MSGID: =?utf-8?q?1766606338435968010?= |
Series |
[1/2] tpm, tpm_tis: Handle interrupt storm
|
|
Commit Message
Lino Sanfilippo
May 22, 2023, 2:31 p.m. UTC
From: Lino Sanfilippo <l.sanfilippo@kunbus.com> Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled interrupts instead of polling on all capable TPMs. Unfortunately, on some products the interrupt line is either never asserted or never deasserted. The former causes interrupt timeouts and is detected by tpm_tis_core_init(). The latter results in interrupt storms. Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad L490 and Inspur NF5180M6: https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ The current approach to avoid those storms is to disable interrupts by adding a DMI quirk for the concerned device. However this is a maintenance burden in the long run, so use a generic approach: Detect an interrupt storm by counting the number of unhandled interrupts within a 10 ms time interval. In case that more than 1000 were unhandled deactivate interrupts, deregister the handler and fall back to polling. This equals the implementation that handles interrupt storms in note_interrupt() by means of timestamps and counters in struct irq_desc. However the function to access this structure is private so the logic has to be reimplemented in the TPM TIS core. Since handler deregistration would deadlock from within the interrupt routine trigger a worker thread that executes the unregistration. Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> --- drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- drivers/char/tpm/tpm_tis_core.h | 6 +++ 2 files changed, 74 insertions(+), 3 deletions(-) base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
Comments
On Mon, May 22, 2023 at 04:31:04PM +0200, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > interrupts instead of polling on all capable TPMs. Unfortunately, on some > products the interrupt line is either never asserted or never deasserted. > > The former causes interrupt timeouts and is detected by > tpm_tis_core_init(). The latter results in interrupt storms. > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > L490 and Inspur NF5180M6: > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > The current approach to avoid those storms is to disable interrupts by > adding a DMI quirk for the concerned device. > > However this is a maintenance burden in the long run, so use a generic > approach: > > Detect an interrupt storm by counting the number of unhandled interrupts > within a 10 ms time interval. In case that more than 1000 were unhandled > deactivate interrupts, deregister the handler and fall back to polling. > > This equals the implementation that handles interrupt storms in > note_interrupt() by means of timestamps and counters in struct irq_desc. > However the function to access this structure is private so the logic has > to be reimplemented in the TPM TIS core. > > Since handler deregistration would deadlock from within the interrupt > routine trigger a worker thread that executes the unregistration. > > Suggested-by: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> > --- > drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- > drivers/char/tpm/tpm_tis_core.h | 6 +++ > 2 files changed, 74 insertions(+), 3 deletions(-) > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > index 558144fa707a..458ebf8c2f16 100644 > --- a/drivers/char/tpm/tpm_tis_core.c > +++ b/drivers/char/tpm/tpm_tis_core.c > @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) > return status == TPM_STS_COMMAND_READY; > } > > +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) > +{ > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + int intmask = 0; > + > + dev_err(&chip->dev, HW_ERR > + "TPM interrupt storm detected, polling instead\n"); > + > + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); > + > + intmask &= ~TPM_GLOBAL_INT_ENABLE; > + > + tpm_tis_request_locality(chip, 0); > + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); > + tpm_tis_relinquish_locality(chip, 0); > + > + chip->flags &= ~TPM_CHIP_FLAG_IRQ; > + > + /* > + * We must not call devm_free_irq() from within the interrupt handler, > + * since this function waits for running interrupt handlers to finish > + * and thus it would deadlock. Instead trigger a worker that does the > + * unregistration. > + */ > + schedule_work(&priv->free_irq_work); > +} > + > +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) > +{ > + const unsigned int MAX_UNHANDLED_IRQS = 1000; > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + /* > + * The worker to free the TPM interrupt (free_irq_work) may already > + * be scheduled, so make sure it is not scheduled again. > + */ > + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) > + return; > + > + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) > + priv->unhandled_irqs = 1; > + else > + priv->unhandled_irqs++; > + > + priv->last_unhandled_irq = jiffies; > + > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > + tpm_tis_handle_irq_storm(chip); > +} > + > static irqreturn_t tis_int_handler(int dummy, void *dev_id) > { > struct tpm_chip *chip = dev_id; > @@ -761,10 +810,10 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > if (interrupt == 0) > - return IRQ_NONE; > + goto unhandled; > > set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); > if (interrupt & TPM_INTF_DATA_AVAIL_INT) > @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > tpm_tis_relinquish_locality(chip, 0); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > return IRQ_HANDLED; > + > +unhandled: > + tpm_tis_process_unhandled_interrupt(chip); > + return IRQ_HANDLED; > } > > static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > chip->flags &= ~TPM_CHIP_FLAG_IRQ; > } > > +static void tpm_tis_free_irq_func(struct work_struct *work) > +{ > + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); > + struct tpm_chip *chip = priv->chip; > + > + devm_free_irq(chip->dev.parent, priv->irq, chip); > + priv->irq = 0; > +} > + > /* Register the IRQ and issue a command that will cause an interrupt. If an > * irq is seen then leave the chip setup for IRQ operation, otherwise reverse > * everything and leave in polling mode. Returns 0 on success. > @@ -816,6 +878,7 @@ static int tpm_tis_probe_irq_single(struct tpm_chip *chip, u32 intmask, > int rc; > u32 int_status; > > + INIT_WORK(&priv->free_irq_work, tpm_tis_free_irq_func); > > rc = devm_request_threaded_irq(chip->dev.parent, irq, NULL, > tis_int_handler, IRQF_ONESHOT | flags, > @@ -918,6 +981,7 @@ void tpm_tis_remove(struct tpm_chip *chip) > interrupt = 0; > > tpm_tis_write32(priv, reg, ~TPM_GLOBAL_INT_ENABLE & interrupt); > + flush_work(&priv->free_irq_work); > > tpm_tis_clkrun_enable(chip, false); > > @@ -1021,6 +1085,7 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, > chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); > chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); > chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); > + priv->chip = chip; > priv->timeout_min = TPM_TIMEOUT_USECS_MIN; > priv->timeout_max = TPM_TIMEOUT_USECS_MAX; > priv->phy_ops = phy_ops; > diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h > index e978f457fd4d..6fc86baa4398 100644 > --- a/drivers/char/tpm/tpm_tis_core.h > +++ b/drivers/char/tpm/tpm_tis_core.h > @@ -91,12 +91,18 @@ enum tpm_tis_flags { > }; > > struct tpm_tis_data { > + struct tpm_chip *chip; > u16 manufacturer_id; > struct mutex locality_count_mutex; > unsigned int locality_count; > int locality; > + /* Interrupts */ > int irq; > + struct work_struct free_irq_work; > + unsigned long last_unhandled_irq; > + unsigned int unhandled_irqs; > unsigned int int_mask; > + > unsigned long flags; > void __iomem *ilb_base_addr; > u16 clkrun_enabled; > > base-commit: 44c026a73be8038f03dbdeef028b642880cf1511 > -- > 2.40.1 >
On 22/05/2023 17:31, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > interrupts instead of polling on all capable TPMs. Unfortunately, on some > products the interrupt line is either never asserted or never deasserted. > > The former causes interrupt timeouts and is detected by > tpm_tis_core_init(). The latter results in interrupt storms. > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > L490 and Inspur NF5180M6: > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > The current approach to avoid those storms is to disable interrupts by > adding a DMI quirk for the concerned device. This looked promising, however it looks like the UPX-i11 needs the DMI quirk. > However this is a maintenance burden in the long run, so use a generic > approach: > > Detect an interrupt storm by counting the number of unhandled interrupts > within a 10 ms time interval. In case that more than 1000 were unhandled > deactivate interrupts, deregister the handler and fall back to polling. > > This equals the implementation that handles interrupt storms in > note_interrupt() by means of timestamps and counters in struct irq_desc. > However the function to access this structure is private so the logic has > to be reimplemented in the TPM TIS core. > > Since handler deregistration would deadlock from within the interrupt > routine trigger a worker thread that executes the unregistration. > > Suggested-by: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > --- > drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- > drivers/char/tpm/tpm_tis_core.h | 6 +++ > 2 files changed, 74 insertions(+), 3 deletions(-) > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > index 558144fa707a..458ebf8c2f16 100644 > --- a/drivers/char/tpm/tpm_tis_core.c > +++ b/drivers/char/tpm/tpm_tis_core.c > @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) > return status == TPM_STS_COMMAND_READY; > } > > +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) > +{ > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + int intmask = 0; > + > + dev_err(&chip->dev, HW_ERR > + "TPM interrupt storm detected, polling instead\n"); Should this be dev_warn or even dev_info level? It is done delibaretly and it is handled as planned, so it is not really an error? > + > + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); > + > + intmask &= ~TPM_GLOBAL_INT_ENABLE; > + > + tpm_tis_request_locality(chip, 0); > + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); > + tpm_tis_relinquish_locality(chip, 0); > + > + chip->flags &= ~TPM_CHIP_FLAG_IRQ; > + > + /* > + * We must not call devm_free_irq() from within the interrupt handler, > + * since this function waits for running interrupt handlers to finish > + * and thus it would deadlock. Instead trigger a worker that does the > + * unregistration. > + */ > + schedule_work(&priv->free_irq_work); > +} > + > +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) > +{ > + const unsigned int MAX_UNHANDLED_IRQS = 1000; > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + /* > + * The worker to free the TPM interrupt (free_irq_work) may already > + * be scheduled, so make sure it is not scheduled again. > + */ > + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) > + return; > + > + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) unsigned long storm_window; .. storm_window = priv->last_unhandled_irq + msecs_to_jiffies(10); if (time_after(jiffies, storm_window)) priv->unhandled_irqs = 0; priv->unhandled_irqs++; > + priv->unhandled_irqs = 1; > + else > + priv->unhandled_irqs++; > + > + priv->last_unhandled_irq = jiffies; > + > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > + tpm_tis_handle_irq_storm(chip); Will the kernel step in and disbale the IRQ before we would have detected the storm? I don't know top of my head the trigger in core to stop an interrupt storm... > +} > + > static irqreturn_t tis_int_handler(int dummy, void *dev_id) > { > struct tpm_chip *chip = dev_id; > @@ -761,10 +810,10 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > if (interrupt == 0) > - return IRQ_NONE; > + goto unhandled; > > set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); > if (interrupt & TPM_INTF_DATA_AVAIL_INT) > @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > tpm_tis_relinquish_locality(chip, 0); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; This is more like an error than just unhandled IRQ. Yes, it was ignored, probably because it is common? > > tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > return IRQ_HANDLED; > + > +unhandled: > + tpm_tis_process_unhandled_interrupt(chip); > + return IRQ_HANDLED; > } > > static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > chip->flags &= ~TPM_CHIP_FLAG_IRQ; > } > > +static void tpm_tis_free_irq_func(struct work_struct *work) > +{ > + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); > + struct tpm_chip *chip = priv->chip; > + > + devm_free_irq(chip->dev.parent, priv->irq, chip); > + priv->irq = 0; Should disable_interrupts() be called instead (with the locality request/relinquish)? Is there a chance of a race or is a race matters? > +} > + > /* Register the IRQ and issue a command that will cause an interrupt. If an > * irq is seen then leave the chip setup for IRQ operation, otherwise reverse > * everything and leave in polling mode. Returns 0 on success. > @@ -816,6 +878,7 @@ static int tpm_tis_probe_irq_single(struct tpm_chip *chip, u32 intmask, > int rc; > u32 int_status; > > + INIT_WORK(&priv->free_irq_work, tpm_tis_free_irq_func); > > rc = devm_request_threaded_irq(chip->dev.parent, irq, NULL, > tis_int_handler, IRQF_ONESHOT | flags, > @@ -918,6 +981,7 @@ void tpm_tis_remove(struct tpm_chip *chip) > interrupt = 0; > > tpm_tis_write32(priv, reg, ~TPM_GLOBAL_INT_ENABLE & interrupt); > + flush_work(&priv->free_irq_work); > > tpm_tis_clkrun_enable(chip, false); > > @@ -1021,6 +1085,7 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, > chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); > chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); > chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); > + priv->chip = chip; > priv->timeout_min = TPM_TIMEOUT_USECS_MIN; > priv->timeout_max = TPM_TIMEOUT_USECS_MAX; > priv->phy_ops = phy_ops; > diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h > index e978f457fd4d..6fc86baa4398 100644 > --- a/drivers/char/tpm/tpm_tis_core.h > +++ b/drivers/char/tpm/tpm_tis_core.h > @@ -91,12 +91,18 @@ enum tpm_tis_flags { > }; > > struct tpm_tis_data { > + struct tpm_chip *chip; > u16 manufacturer_id; > struct mutex locality_count_mutex; > unsigned int locality_count; > int locality; > + /* Interrupts */ > int irq; > + struct work_struct free_irq_work; > + unsigned long last_unhandled_irq; > + unsigned int unhandled_irqs; > unsigned int int_mask; > + > unsigned long flags; > void __iomem *ilb_base_addr; > u16 clkrun_enabled; > > base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
On 23/05/2023 09:48, Péter Ujfalusi wrote: >> static void tpm_tis_gen_interrupt(struct tpm_chip *chip) >> @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) >> chip->flags &= ~TPM_CHIP_FLAG_IRQ; >> } >> >> +static void tpm_tis_free_irq_func(struct work_struct *work) >> +{ >> + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); >> + struct tpm_chip *chip = priv->chip; >> + >> + devm_free_irq(chip->dev.parent, priv->irq, chip); >> + priv->irq = 0; > > Should disable_interrupts() be called instead (with the locality > request/relinquish)? > > Is there a chance of a race or is a race matters? Nevermind this comment, it is good.
On Tue, May 23, 2023 at 09:48:23AM +0300, Péter Ujfalusi wrote: > On 22/05/2023 17:31, Lino Sanfilippo wrote: [...] > This looked promising, however it looks like the UPX-i11 needs the DMI > quirk. Why is that? Is there a fundamental problem with the patch or is it a specific issue with that device? > > --- a/drivers/char/tpm/tpm_tis_core.c > > +++ b/drivers/char/tpm/tpm_tis_core.c > > @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) > > return status == TPM_STS_COMMAND_READY; > > } > > > > +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) > > +{ > > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > > + int intmask = 0; > > + > > + dev_err(&chip->dev, HW_ERR > > + "TPM interrupt storm detected, polling instead\n"); > > Should this be dev_warn or even dev_info level? The corresponding message emitted in tpm_tis_core_init() for an interrupt that's *never* asserted uses dev_err(), so using dev_err() here as well serves consistency: dev_err(&chip->dev, FW_BUG "TPM interrupt not working, polling instead\n"); That way the same severity is used both for the never asserted and the never deasserted interrupt case. > > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > > + tpm_tis_handle_irq_storm(chip); > > Will the kernel step in and disbale the IRQ before we would have > detected the storm? No. The detection of spurious interrupts in note_interrupt() hinges on handlers returning IRQ_NONE. And this patch makes tis_int_handler() always return IRQ_HANDLED, thus pretending success to genirq code. > > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > > tpm_tis_relinquish_locality(chip, 0); > > if (rc < 0) > > - return IRQ_NONE; > > + goto unhandled; > > This is more like an error than just unhandled IRQ. Yes, it was ignored, > probably because it is common? The interrupt may be shared and then it's not an error. Thanks, Lukas
On 23/05/2023 10:44, Lukas Wunner wrote: > On Tue, May 23, 2023 at 09:48:23AM +0300, Péter Ujfalusi wrote: >> On 22/05/2023 17:31, Lino Sanfilippo wrote: > [...] >> This looked promising, however it looks like the UPX-i11 needs the DMI >> quirk. > > Why is that? Is there a fundamental problem with the patch or is it > a specific issue with that device? The flood is not detected (if there is a flood at all), interrupt stops working after about 200 interrupts - in the latest boot at 118th. I can check this later, likely tomorrow. >>> --- a/drivers/char/tpm/tpm_tis_core.c >>> +++ b/drivers/char/tpm/tpm_tis_core.c >>> @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) >>> return status == TPM_STS_COMMAND_READY; >>> } >>> >>> +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) >>> +{ >>> + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); >>> + int intmask = 0; >>> + >>> + dev_err(&chip->dev, HW_ERR >>> + "TPM interrupt storm detected, polling instead\n"); >> >> Should this be dev_warn or even dev_info level? > > The corresponding message emitted in tpm_tis_core_init() for > an interrupt that's *never* asserted uses dev_err(), so using > dev_err() here as well serves consistency: > > dev_err(&chip->dev, FW_BUG > "TPM interrupt not working, polling instead\n"); > > That way the same severity is used both for the never asserted and > the never deasserted interrupt case. Oh, OK. Is there anything the user can do to have a ERROR less boot? > >>> + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) >>> + tpm_tis_handle_irq_storm(chip); >> >> Will the kernel step in and disbale the IRQ before we would have >> detected the storm? > > No. The detection of spurious interrupts in note_interrupt() > hinges on handlers returning IRQ_NONE. And this patch makes > tis_int_handler() always return IRQ_HANDLED, thus pretending > success to genirq code. True, thanks! > >>> rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); >>> tpm_tis_relinquish_locality(chip, 0); >>> if (rc < 0) >>> - return IRQ_NONE; >>> + goto unhandled; >> >> This is more like an error than just unhandled IRQ. Yes, it was ignored, >> probably because it is common? > > The interrupt may be shared and then it's not an error. but this is tpm_tis_write32() failing, no? If it is shared interrupt and we return IRQ_HANDLED unconditionally then I think the core will think that the interrupt was for this device and it was handled.
Hi, On 5/23/23 11:14, Péter Ujfalusi wrote: > > > On 23/05/2023 10:44, Lukas Wunner wrote: >> On Tue, May 23, 2023 at 09:48:23AM +0300, Péter Ujfalusi wrote: >>> On 22/05/2023 17:31, Lino Sanfilippo wrote: >> [...] >>> This looked promising, however it looks like the UPX-i11 needs the DMI >>> quirk. >> >> Why is that? Is there a fundamental problem with the patch or is it >> a specific issue with that device? > > The flood is not detected (if there is a flood at all), interrupt stops > working after about 200 interrupts - in the latest boot at 118th. > I can check this later, likely tomorrow. > >>>> --- a/drivers/char/tpm/tpm_tis_core.c >>>> +++ b/drivers/char/tpm/tpm_tis_core.c >>>> @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) >>>> return status == TPM_STS_COMMAND_READY; >>>> } >>>> >>>> +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) >>>> +{ >>>> + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); >>>> + int intmask = 0; >>>> + >>>> + dev_err(&chip->dev, HW_ERR >>>> + "TPM interrupt storm detected, polling instead\n"); >>> >>> Should this be dev_warn or even dev_info level? >> >> The corresponding message emitted in tpm_tis_core_init() for >> an interrupt that's *never* asserted uses dev_err(), so using >> dev_err() here as well serves consistency: >> >> dev_err(&chip->dev, FW_BUG >> "TPM interrupt not working, polling instead\n"); >> >> That way the same severity is used both for the never asserted and >> the never deasserted interrupt case. > > Oh, OK. > Is there anything the user can do to have a ERROR less boot? That is a good point. Even though the typical dmesg has at least some false-positive error messages I believe we should still strive to not log errors in cases where there is not e.g. an actual error which the user needs to care about (e.g. disk IO errors). Usually in similar cases like this, where we are basically correcting for firmware bugs (1) we use: dev_warn(dev, FW_BUG "...", ...); maybe we should switch both messages here to this ? FW_BUG is: defined in linux/printk as: #define FW_BUG "[Firmware Bug]: " And we are trying to use this in places like this both for uniformity of reporting these kinda bugs and to allow grepping for it. Regards, Hans 1) providing wrong / non working ACPI IRQ resources in this case
Hi, sorry for the unwrapped lines... On 23/05/2023 12:14, Péter Ujfalusi wrote: > > > On 23/05/2023 10:44, Lukas Wunner wrote: >> On Tue, May 23, 2023 at 09:48:23AM +0300, Péter Ujfalusi wrote: >>> On 22/05/2023 17:31, Lino Sanfilippo wrote: >> [...] >>> This looked promising, however it looks like the UPX-i11 needs the DMI >>> quirk. >> >> Why is that? Is there a fundamental problem with the patch or is it >> a specific issue with that device? > > The flood is not detected (if there is a flood at all), interrupt stops > working after about 200 interrupts - in the latest boot at 118th. > I can check this later, likely tomorrow. With the patches and this diff: diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c index 8f4f2cb5520f..6a910d3277d5 100644 --- a/drivers/char/tpm/tpm_tis_core.c +++ b/drivers/char/tpm/tpm_tis_core.c @@ -795,6 +795,7 @@ static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) priv->last_unhandled_irq = jiffies; + pr_warn("[PETER] %s: unhandled_irqs: %d\n", __func__, priv->unhandled_irqs); if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) tpm_tis_handle_irq_storm(chip); } In some boot I don't get a print at all and reboot takes 2 minutes (tpm timeout), or as it happened now: # dmesg | grep tpm [ 4.306999] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id 22) [ 4.325868] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 1 [ 4.325908] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 2 ... [ 4.329579] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 91 [ 5.129056] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 1 ... [ 5.129561] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 10 # cat /proc/interrupts | grep tpm 28: 0 0 0 133 IR-IO-APIC 28-fasteoi tpm0 Reboot takes in all cases 2 minutes to pass the timeout for the TPM2_CC_SHUTDOWN
Hi, On 23.05.23 11:35, Péter Ujfalusi wrote: > ATTENTION: This e-mail is from an external sender. Please check attachments and links before opening e.g. with mouseover. > > > Hi, > > sorry for the unwrapped lines... > > On 23/05/2023 12:14, Péter Ujfalusi wrote: >> >> >> On 23/05/2023 10:44, Lukas Wunner wrote: >>> On Tue, May 23, 2023 at 09:48:23AM +0300, Péter Ujfalusi wrote: >>>> On 22/05/2023 17:31, Lino Sanfilippo wrote: >>> [...] >>>> This looked promising, however it looks like the UPX-i11 needs the DMI >>>> quirk. >>> >>> Why is that? Is there a fundamental problem with the patch or is it >>> a specific issue with that device? >> >> The flood is not detected (if there is a flood at all), interrupt stops >> working after about 200 interrupts - in the latest boot at 118th. >> I can check this later, likely tomorrow. > > With the patches and this diff: > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > index 8f4f2cb5520f..6a910d3277d5 100644 > --- a/drivers/char/tpm/tpm_tis_core.c > +++ b/drivers/char/tpm/tpm_tis_core.c > @@ -795,6 +795,7 @@ static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) > > priv->last_unhandled_irq = jiffies; > > + pr_warn("[PETER] %s: unhandled_irqs: %d\n", __func__, priv->unhandled_irqs); > if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > tpm_tis_handle_irq_storm(chip); > } > > > In some boot I don't get a print at all and reboot takes > 2 minutes (tpm timeout), or as it happened now: > > # dmesg | grep tpm > [ 4.306999] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id 22) > [ 4.325868] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 1 > [ 4.325908] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 2 > ... > [ 4.329579] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 91 > [ 5.129056] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 1 > ... > [ 5.129561] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 10 > > # cat /proc/interrupts | grep tpm > 28: 0 0 0 133 IR-IO-APIC 28-fasteoi tpm0 > > Reboot takes in all cases 2 minutes to pass the timeout for the TPM2_CC_SHUTDOWN > Thanks a lot for testing. The patch was originally created to fix a reported interrupt storm: https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ My hope was that it could also fix the issues seen with ThinkPad L490 and ThinkStation P360. But in some of your test cases there seems to be no storm at all, so this patch wont help. So for now I am afraid we have to keep the DMI quirks to handle these devices. I will adjust the commit message accordingly. Thanks, Lino > -- > Péter
Hi, On 23.05.23 11:14, Péter Ujfalusi wrote: >> >>>> rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); >>>> tpm_tis_relinquish_locality(chip, 0); >>>> if (rc < 0) >>>> - return IRQ_NONE; >>>> + goto unhandled; >>> >>> This is more like an error than just unhandled IRQ. Yes, it was ignored, >>> probably because it is common? >> >> The interrupt may be shared and then it's not an error. > > but this is tpm_tis_write32() failing, no? If it is shared interrupt and > we return IRQ_HANDLED unconditionally then I think the core will think > that the interrupt was for this device and it was handled. > At this point we already know the interrupt was for our device. Otherwise we would have already bailed out at if (interrupt == 0) Regards, Lino > -- > Péter
On Tue, May 23, 2023 at 12:14:28PM +0300, Péter Ujfalusi wrote: > On 23/05/2023 10:44, Lukas Wunner wrote: > > On Tue, May 23, 2023 at 09:48:23AM +0300, Péter Ujfalusi wrote: > >> On 22/05/2023 17:31, Lino Sanfilippo wrote: > > [...] > >> This looked promising, however it looks like the UPX-i11 needs the DMI > >> quirk. > > > > Why is that? Is there a fundamental problem with the patch or is it > > a specific issue with that device? > > The flood is not detected (if there is a flood at all), interrupt stops > working after about 200 interrupts - in the latest boot at 118th. You've got a variant of the "never asserted interrupt". That condition is currently tested only once on probe in tpm_tis_core_init(). The solution would be to disable interrupts whenever they're not (or no longer asserted). However, that's a distinct issue from the one addressed by the present patch, which deals with a "never *de*asserted interrupt". > >>> + dev_err(&chip->dev, HW_ERR > >>> + "TPM interrupt storm detected, polling instead\n"); > >> > >> Should this be dev_warn or even dev_info level? > > > > The corresponding message emitted in tpm_tis_core_init() for > > an interrupt that's *never* asserted uses dev_err(), so using > > dev_err() here as well serves consistency: > > > > dev_err(&chip->dev, FW_BUG > > "TPM interrupt not working, polling instead\n"); > > > > That way the same severity is used both for the never asserted and > > the never deasserted interrupt case. > > Oh, OK. > Is there anything the user can do to have a ERROR less boot? You're right that the user can't do anything about it and that toning the message down to KERN_WARN or even KERN_NOTICE severity may be appropriate. However the above-quoted message for the never asserted interrupt in tpm_tis_core_init() should then likewise be toned down to the same severity. I'm wondering why that message uses FW_BUG. That doesn't make any sense to me. It's typically not a firmware bug, but a hardware issue, e.g. an interrupt pin may erroneously not be connected or may be connected to ground. Lino used HW_ERR, which seems more appropriate to me. > >>> rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > >>> tpm_tis_relinquish_locality(chip, 0); > >>> if (rc < 0) > >>> - return IRQ_NONE; > >>> + goto unhandled; > >> > >> This is more like an error than just unhandled IRQ. Yes, it was ignored, > >> probably because it is common? > > > > The interrupt may be shared and then it's not an error. > > but this is tpm_tis_write32() failing, no? If it is shared interrupt and > we return IRQ_HANDLED unconditionally then I think the core will think > that the interrupt was for this device and it was handled. No. The IRQ_HANDLED versus IRQ_NONE return values are merely used for book-keeping of spurious interrupts. If IRQ_HANDLED is returned, the other handlers will still be invoked. It is not discernible whether a shared interrupt was asserted by a single device or by multiple devices, so all handlers need to be called. Thanks, Lukas
On Tue, May 23, 2023 at 12:35:09PM +0300, Péter Ujfalusi wrote: > In some boot I don't get a print at all and reboot takes > 2 minutes (tpm timeout), or as it happened now: > > # dmesg | grep tpm > [ 4.306999] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id 22) > [ 4.325868] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 1 > [ 4.325908] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 2 > ... > [ 4.329579] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 91 > [ 5.129056] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 1 > ... > [ 5.129561] [PETER] tpm_tis_process_unhandled_interrupt: unhandled_irqs: 10 This looks like the interrupt line may be floating and any interrupts you get are probably caused by induced static or something. Thanks, Lukas
On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > interrupts instead of polling on all capable TPMs. Unfortunately, on some > products the interrupt line is either never asserted or never deasserted. > > The former causes interrupt timeouts and is detected by > tpm_tis_core_init(). The latter results in interrupt storms. > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > L490 and Inspur NF5180M6: > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > The current approach to avoid those storms is to disable interrupts by > adding a DMI quirk for the concerned device. > > However this is a maintenance burden in the long run, so use a generic > approach: I'm trying to comprehend how you evaluate, how big maintenance burden this would be. Adding even a few dozen table entries is not a maintenance burden. On the other hand any new functionality is objectively a maintanance burden of some measure (applies to any functionality). So how do we know that taking this change is less of a maintenance burden than just add new table entries, as they come up? > Detect an interrupt storm by counting the number of unhandled interrupts > within a 10 ms time interval. In case that more than 1000 were unhandled > deactivate interrupts, deregister the handler and fall back to polling. I know it can be sometimes hard to evaluate but can you try to explain how you came up to the 10 ms sampling period and 1000 interrupt threshold? I just don't like abritrary numbers. > This equals the implementation that handles interrupt storms in > note_interrupt() by means of timestamps and counters in struct irq_desc. > However the function to access this structure is private so the logic has > to be reimplemented in the TPM TIS core. > > Since handler deregistration would deadlock from within the interrupt > routine trigger a worker thread that executes the unregistration. > > Suggested-by: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > --- > drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- > drivers/char/tpm/tpm_tis_core.h | 6 +++ > 2 files changed, 74 insertions(+), 3 deletions(-) > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > index 558144fa707a..458ebf8c2f16 100644 > --- a/drivers/char/tpm/tpm_tis_core.c > +++ b/drivers/char/tpm/tpm_tis_core.c > @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) > return status == TPM_STS_COMMAND_READY; > } > > +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) > +{ > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + int intmask = 0; > + > + dev_err(&chip->dev, HW_ERR > + "TPM interrupt storm detected, polling instead\n"); > + > + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); > + > + intmask &= ~TPM_GLOBAL_INT_ENABLE; > + > + tpm_tis_request_locality(chip, 0); > + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); > + tpm_tis_relinquish_locality(chip, 0); > + > + chip->flags &= ~TPM_CHIP_FLAG_IRQ; > + > + /* > + * We must not call devm_free_irq() from within the interrupt handler, > + * since this function waits for running interrupt handlers to finish > + * and thus it would deadlock. Instead trigger a worker that does the > + * unregistration. > + */ > + schedule_work(&priv->free_irq_work); > +} > + > +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) > +{ > + const unsigned int MAX_UNHANDLED_IRQS = 1000; > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + /* > + * The worker to free the TPM interrupt (free_irq_work) may already > + * be scheduled, so make sure it is not scheduled again. > + */ > + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) > + return; > + > + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) > + priv->unhandled_irqs = 1; > + else > + priv->unhandled_irqs++; > + > + priv->last_unhandled_irq = jiffies; > + > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > + tpm_tis_handle_irq_storm(chip); > +} > + > static irqreturn_t tis_int_handler(int dummy, void *dev_id) > { > struct tpm_chip *chip = dev_id; > @@ -761,10 +810,10 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > if (interrupt == 0) > - return IRQ_NONE; > + goto unhandled; > > set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); > if (interrupt & TPM_INTF_DATA_AVAIL_INT) > @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > tpm_tis_relinquish_locality(chip, 0); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > return IRQ_HANDLED; > + > +unhandled: > + tpm_tis_process_unhandled_interrupt(chip); > + return IRQ_HANDLED; Shouldn't the return value be IRQ_NONE? > } > > static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > chip->flags &= ~TPM_CHIP_FLAG_IRQ; > } > > +static void tpm_tis_free_irq_func(struct work_struct *work) > +{ > + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); > + struct tpm_chip *chip = priv->chip; > + > + devm_free_irq(chip->dev.parent, priv->irq, chip); > + priv->irq = 0; > +} > + > /* Register the IRQ and issue a command that will cause an interrupt. If an > * irq is seen then leave the chip setup for IRQ operation, otherwise reverse > * everything and leave in polling mode. Returns 0 on success. > @@ -816,6 +878,7 @@ static int tpm_tis_probe_irq_single(struct tpm_chip *chip, u32 intmask, > int rc; > u32 int_status; > > + INIT_WORK(&priv->free_irq_work, tpm_tis_free_irq_func); > > rc = devm_request_threaded_irq(chip->dev.parent, irq, NULL, > tis_int_handler, IRQF_ONESHOT | flags, > @@ -918,6 +981,7 @@ void tpm_tis_remove(struct tpm_chip *chip) > interrupt = 0; > > tpm_tis_write32(priv, reg, ~TPM_GLOBAL_INT_ENABLE & interrupt); > + flush_work(&priv->free_irq_work); > > tpm_tis_clkrun_enable(chip, false); > > @@ -1021,6 +1085,7 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, > chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); > chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); > chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); > + priv->chip = chip; > priv->timeout_min = TPM_TIMEOUT_USECS_MIN; > priv->timeout_max = TPM_TIMEOUT_USECS_MAX; > priv->phy_ops = phy_ops; > diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h > index e978f457fd4d..6fc86baa4398 100644 > --- a/drivers/char/tpm/tpm_tis_core.h > +++ b/drivers/char/tpm/tpm_tis_core.h > @@ -91,12 +91,18 @@ enum tpm_tis_flags { > }; > > struct tpm_tis_data { > + struct tpm_chip *chip; > u16 manufacturer_id; > struct mutex locality_count_mutex; > unsigned int locality_count; > int locality; > + /* Interrupts */ Not relevant change for a bug fix. > int irq; > + struct work_struct free_irq_work; > + unsigned long last_unhandled_irq; > + unsigned int unhandled_irqs; > unsigned int int_mask; > + Ditto (for the empty line). > unsigned long flags; > void __iomem *ilb_base_addr; > u16 clkrun_enabled; > > base-commit: 44c026a73be8038f03dbdeef028b642880cf1511 > -- > 2.40.1 BR, Jarkko
On Tue May 23, 2023 at 9:53 PM EEST, Jarkko Sakkinen wrote: > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > > interrupts instead of polling on all capable TPMs. Unfortunately, on some > > products the interrupt line is either never asserted or never deasserted. > > > > The former causes interrupt timeouts and is detected by > > tpm_tis_core_init(). The latter results in interrupt storms. > > > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > > L490 and Inspur NF5180M6: > > > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > > > The current approach to avoid those storms is to disable interrupts by > > adding a DMI quirk for the concerned device. > > > > However this is a maintenance burden in the long run, so use a generic > > approach: > > I'm trying to comprehend how you evaluate, how big maintenance burden > this would be. Adding even a few dozen table entries is not a > maintenance burden. > > On the other hand any new functionality is objectively a maintanance > burden of some measure (applies to any functionality). So how do we know > that taking this change is less of a maintenance burden than just add > new table entries, as they come up? I feel also a bit resistant because leaf driver framework is really a wrong location in the kernel tree for IRQ storm detection. It would be better to have it signaled above the TPM driver, and the driver would then just act on it. BR, Jarkko
On Tue May 23, 2023 at 9:53 PM EEST, Jarkko Sakkinen wrote: > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > > interrupts instead of polling on all capable TPMs. Unfortunately, on some > > products the interrupt line is either never asserted or never deasserted. > > > > The former causes interrupt timeouts and is detected by > > tpm_tis_core_init(). The latter results in interrupt storms. > > > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > > L490 and Inspur NF5180M6: > > > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > > > The current approach to avoid those storms is to disable interrupts by > > adding a DMI quirk for the concerned device. > > > > However this is a maintenance burden in the long run, so use a generic > > approach: > > I'm trying to comprehend how you evaluate, how big maintenance burden > this would be. Adding even a few dozen table entries is not a > maintenance burden. > > On the other hand any new functionality is objectively a maintanance > burden of some measure (applies to any functionality). So how do we know > that taking this change is less of a maintenance burden than just add > new table entries, as they come up? > > > Detect an interrupt storm by counting the number of unhandled interrupts > > within a 10 ms time interval. In case that more than 1000 were unhandled > > deactivate interrupts, deregister the handler and fall back to polling. > > I know it can be sometimes hard to evaluate but can you try to explain > how you came up to the 10 ms sampling period and 1000 interrupt > threshold? I just don't like abritrary numbers. Also here I wonder how you came up with this computational model. This is not same as saying it is wrong. There's just whole stack of options. Out of top of my head you could e.g. window average the duration between IRQs. When the average goes beyond threshold, then you shutdown interrupts. The pro I would see in this that it is much easier intuitively discuss how much there should be time in-between interrupts that the kernel handles it, than how many IRQs you can stack into time interval, which blows my head tbh. BR, Jarkko
On Tue May 23, 2023 at 9:48 AM EEST, Péter Ujfalusi wrote: > > > On 22/05/2023 17:31, Lino Sanfilippo wrote: > > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > > interrupts instead of polling on all capable TPMs. Unfortunately, on some > > products the interrupt line is either never asserted or never deasserted. > > > > The former causes interrupt timeouts and is detected by > > tpm_tis_core_init(). The latter results in interrupt storms. > > > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > > L490 and Inspur NF5180M6: > > > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > > > The current approach to avoid those storms is to disable interrupts by > > adding a DMI quirk for the concerned device. > > This looked promising, however it looks like the UPX-i11 needs the DMI > quirk. My take is this: 1. Keep calmd and add DMI quirks (for some time). 2. Let's reconsider if this becomes a too pressuring issue. 3. If there is need for IRQ detection, let's pick a parameter that would be also *intuitive* tuning parameter [1]. [1] https://lore.kernel.org/linux-integrity/CSTW9UX4ERDZ.VBD1QIWLBM75@suppilovahvero/ BR, Jarkko
On Tue, May 23, 2023 at 09:53:58PM +0300, Jarkko Sakkinen wrote: > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > > interrupts instead of polling on all capable TPMs. Unfortunately, on some > > products the interrupt line is either never asserted or never deasserted. > > > > The former causes interrupt timeouts and is detected by > > tpm_tis_core_init(). The latter results in interrupt storms. [...] > > The current approach to avoid those storms is to disable interrupts by > > adding a DMI quirk for the concerned device. > > > > However this is a maintenance burden in the long run, so use a generic > > approach: > > I'm trying to comprehend how you evaluate, how big maintenance burden > this would be. Adding even a few dozen table entries is not a > maintenance burden. > > On the other hand any new functionality is objectively a maintanance > burden of some measure (applies to any functionality). So how do we know > that taking this change is less of a maintenance burden than just add > new table entries, as they come up? That approach seems irresponsible to me as you force users to report the issue, you force developers to add a quirk. Users expect things to just work. And that's precisely what this patch will achieve. Lino introduced the issues by universally enabling interrupts. He acted responsibly by making the issue disappear most likely for everyone. I find it bewildering that you'd prefer the opposite approach and rather inflict on every affected user to open a bug report and hope that someone writes a patch for them. So far interrupts are only enabled in an rc release and linux-next, yet issue reports have popped up quickly. My expectation is we'll see a lot more reports once v6.4-final is released, doubly so when the next stable kernel with that feature is released. > > Detect an interrupt storm by counting the number of unhandled interrupts > > within a 10 ms time interval. In case that more than 1000 were unhandled > > deactivate interrupts, deregister the handler and fall back to polling. > > I know it can be sometimes hard to evaluate but can you try to explain > how you came up to the 10 ms sampling period and 1000 interrupt > threshold? I just don't like abritrary numbers. If you choose a too low number, you'll get false positives due to regular interrupt assertion either by the TPM or by another device sharing the interrupt. Those numbers may have to be fine-tuned over time to eliminate false positives and false negatives. They're as good a number as any to get that fine-tuning process started. > > +unhandled: > > + tpm_tis_process_unhandled_interrupt(chip); > > + return IRQ_HANDLED; > > Shouldn't the return value be IRQ_NONE? No, absolutely not. If you return IRQ_NONE here then genirq code will increase the spurious interrupt counter. That's bad because the IRQ storm detection tpm_tis_core.c would race with the IRQ storm detection in genirq code: Note that disablement of the interrupt must happen in a work_struct here to avoid a deadlock. (The deadlock would occur because devm_free_irq() waits for the interrupt handler to finish.) Now, let's say the 1000 unhandled interrupts limit has been reached and the work_struct is scheduled. If the work_struct isn't run quickly enough, you may reach the 99900 limit in note_interrupt() (see kernel/irq/spurious.c) and then genirq code will force the interrupt off completely. To avoid that you *have* to return IRQ_HANDLED here and thus pretend towards genirq code that the interrupt was not spurious. > > struct tpm_tis_data { > > + struct tpm_chip *chip; > > u16 manufacturer_id; > > struct mutex locality_count_mutex; > > unsigned int locality_count; > > int locality; > > + /* Interrupts */ > > Not relevant change for a bug fix. But not harmful either, is it? Thanks, Lukas
On Tue, May 23, 2023 at 10:00:10PM +0300, Jarkko Sakkinen wrote: > I feel also a bit resistant because leaf driver framework is really > a wrong location in the kernel tree for IRQ storm detection. > > It would be better to have it signaled above the TPM driver, and the > driver would then just act on it. That would require changing the logic in kernel/irq/spurious.c. At this point in the cycle, such a change would definitely not be eligible as a fix for v6.4. Thanks, Lukas
Hi, On 23.05.23 20:53, Jarkko Sakkinen wrote: > ATTENTION: This e-mail is from an external sender. Please check attachments and links before opening e.g. with mouseover. > > > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: >> From: Lino Sanfilippo <l.sanfilippo@kunbus.com> >> >> Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled >> interrupts instead of polling on all capable TPMs. Unfortunately, on some >> products the interrupt line is either never asserted or never deasserted. >> >> The former causes interrupt timeouts and is detected by >> tpm_tis_core_init(). The latter results in interrupt storms. >> >> Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad >> L490 and Inspur NF5180M6: >> >> https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ >> https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ >> >> The current approach to avoid those storms is to disable interrupts by >> adding a DMI quirk for the concerned device. >> >> However this is a maintenance burden in the long run, so use a generic >> approach: > > I'm trying to comprehend how you evaluate, how big maintenance burden > this would be. Adding even a few dozen table entries is not a > maintenance burden. > > On the other hand any new functionality is objectively a maintanance > burden of some measure (applies to any functionality). So how do we know > that taking this change is less of a maintenance burden than just add > new table entries, as they come up? > Initially this set was created as a response to this 0-day bug report which you asked me to have a look at: https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once 6.4 is released. >> Detect an interrupt storm by counting the number of unhandled interrupts >> within a 10 ms time interval. In case that more than 1000 were unhandled >> deactivate interrupts, deregister the handler and fall back to polling. > > I know it can be sometimes hard to evaluate but can you try to explain > how you came up to the 10 ms sampling period and 1000 interrupt > threshold? I just don't like abritrary numbers. At least the 100 ms is not plucked out of thin air but its the same time period that the generic code in note_interrupt() uses - I assume for a good reason. Not only this number but the whole irq storm detection logic is taken from there: > >> This equals the implementation that handles interrupt storms in >> note_interrupt() by means of timestamps and counters in struct irq_desc. The number of 1000 unhandled interrupts is still far below the 99900 used in note_interrupt() but IMHO enough to indicate that there is something seriously wrong with interrupt processing and it is probably saver to fall back to polling. Regards, Lino
On 23.05.23 21:00, Jarkko Sakkinen wrote: > ATTENTION: This e-mail is from an external sender. Please check attachments and links before opening e.g. with mouseover. > > > On Tue May 23, 2023 at 9:53 PM EEST, Jarkko Sakkinen wrote: >> On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: >>> From: Lino Sanfilippo <l.sanfilippo@kunbus.com> >>> >>> Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled >>> interrupts instead of polling on all capable TPMs. Unfortunately, on some >>> products the interrupt line is either never asserted or never deasserted. >>> >>> The former causes interrupt timeouts and is detected by >>> tpm_tis_core_init(). The latter results in interrupt storms. >>> >>> Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad >>> L490 and Inspur NF5180M6: >>> >>> https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ >>> https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ >>> >>> The current approach to avoid those storms is to disable interrupts by >>> adding a DMI quirk for the concerned device. >>> >>> However this is a maintenance burden in the long run, so use a generic >>> approach: >> >> I'm trying to comprehend how you evaluate, how big maintenance burden >> this would be. Adding even a few dozen table entries is not a >> maintenance burden. >> >> On the other hand any new functionality is objectively a maintanance >> burden of some measure (applies to any functionality). So how do we know >> that taking this change is less of a maintenance burden than just add >> new table entries, as they come up? > > I feel also a bit resistant because leaf driver framework is really > a wrong location in the kernel tree for IRQ storm detection. > > It would be better to have it signaled above the TPM driver, and the > driver would then just act on it. > I agree. But currently I do not see how to achieve this as there is no way to let a driver be informed about a detected interrupt storm. So the only solution I see for now is to implement it in the driver itself. Regards, Lino
On Tue, May 23, 2023 at 10:12:49PM +0300, Jarkko Sakkinen wrote: > On Tue May 23, 2023 at 9:53 PM EEST, Jarkko Sakkinen wrote: > > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > > > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > > > > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > > > interrupts instead of polling on all capable TPMs. Unfortunately, on some > > > products the interrupt line is either never asserted or never deasserted. > > > > > > The former causes interrupt timeouts and is detected by > > > tpm_tis_core_init(). The latter results in interrupt storms. > > > > > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > > > L490 and Inspur NF5180M6: > > > > > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > > > > > The current approach to avoid those storms is to disable interrupts by > > > adding a DMI quirk for the concerned device. > > > > > > However this is a maintenance burden in the long run, so use a generic > > > approach: > > > > I'm trying to comprehend how you evaluate, how big maintenance burden > > this would be. Adding even a few dozen table entries is not a > > maintenance burden. I do worry about how many cases will be reported once 6.4 is released, and this eventually makes its way into distributions. In either case the dmi table will need to be maintained. The UPX-11i case is a different issue, and IIRC the L490 it needed a DMI entry, because trying to catch the irq storm wasn't solving the issue there. I imagine other odd cases will be popping up as well. So far we have 2 irq storm reports with peterz's P360 Tiny, and I guess that Inspur system reported by the kernel test robot. Then there is whatever is going on with Peter Ujfalusi's UPX-11i. > > > > On the other hand any new functionality is objectively a maintanance > > burden of some measure (applies to any functionality). So how do we know > > that taking this change is less of a maintenance burden than just add > > new table entries, as they come up? > > > > > Detect an interrupt storm by counting the number of unhandled interrupts > > > within a 10 ms time interval. In case that more than 1000 were unhandled > > > deactivate interrupts, deregister the handler and fall back to polling. > > > > I know it can be sometimes hard to evaluate but can you try to explain > > how you came up to the 10 ms sampling period and 1000 interrupt > > threshold? I just don't like abritrary numbers. > > Also here I wonder how you came up with this computational model. This > is not same as saying it is wrong. There's just whole stack of options. > > Out of top of my head you could e.g. window average the duration between > IRQs. When the average goes beyond threshold, then you shutdown > interrupts. Just to make sure I have it clear in my head, you mean when the average is shorter than the threshold duration between interrupts, yes? My brain wants to read 'When the average goes beyond threshold' as 'threshold < average'. Does the check need to be a rolling window like 1/2 currently has? I expect that if the problem exists it will be noticed in the first window checked. I think what I originally tried was to check over some interval from when the handler first ran, disable interrupts if needed, and then skip the check from then on when the handler ran. Regards, Jerry > > The pro I would see in this that it is much easier intuitively discuss > how much there should be time in-between interrupts that the kernel > handles it, than how many IRQs you can stack into time interval, which > blows my head tbh. > > BR, Jarkko
On Wed May 24, 2023 at 1:32 AM EEST, Jerry Snitselaar wrote: > I do worry about how many cases will be reported once 6.4 is released, > and this eventually makes its way into distributions. In either case > the dmi table will need to be maintained. The UPX-11i case is a > different issue, and IIRC the L490 it needed a DMI entry, because > trying to catch the irq storm wasn't solving the issue there. I > imagine other odd cases will be popping up as well. > > So far we have 2 irq storm reports with peterz's P360 Tiny, and I > guess that Inspur system reported by the kernel test robot. Then there > is whatever is going on with Peter Ujfalusi's UPX-11i. Yeah, I agree that we need both in order to reach stability. > > Out of top of my head you could e.g. window average the duration between > > IRQs. When the average goes beyond threshold, then you shutdown > > interrupts. > > Just to make sure I have it clear in my head, you mean when the > average is shorter than the threshold duration between interrupts, > yes? My brain wants to read 'When the average goes beyond threshold' > as 'threshold < average'. > > Does the check need to be a rolling window like 1/2 currently has? I > expect that if the problem exists it will be noticed in the first > window checked. I think what I originally tried was to check over some > interval from when the handler first ran, disable interrupts if > needed, and then skip the check from then on when the handler ran. How about just: average' = (average / (then - now)) /2 And if average' > thershold, disable interrupts. BR, Jarkko
On Tue May 23, 2023 at 10:46 PM EEST, Lukas Wunner wrote: > On Tue, May 23, 2023 at 10:00:10PM +0300, Jarkko Sakkinen wrote: > > I feel also a bit resistant because leaf driver framework is really > > a wrong location in the kernel tree for IRQ storm detection. > > > > It would be better to have it signaled above the TPM driver, and the > > driver would then just act on it. > > That would require changing the logic in kernel/irq/spurious.c. > > At this point in the cycle, such a change would definitely not be > eligible as a fix for v6.4. No disagreeing with this. Just pointed it out. BR, Jarkko
On Tue May 23, 2023 at 10:37 PM EEST, Lukas Wunner wrote: > > > +unhandled: > > > + tpm_tis_process_unhandled_interrupt(chip); > > > + return IRQ_HANDLED; > > > > Shouldn't the return value be IRQ_NONE? > > No, absolutely not. If you return IRQ_NONE here then genirq code > will increase the spurious interrupt counter. That's bad because > the IRQ storm detection tpm_tis_core.c would race with the IRQ storm > detection in genirq code: > > Note that disablement of the interrupt must happen in a work_struct > here to avoid a deadlock. (The deadlock would occur because > devm_free_irq() waits for the interrupt handler to finish.) > > Now, let's say the 1000 unhandled interrupts limit has been reached > and the work_struct is scheduled. If the work_struct isn't run > quickly enough, you may reach the 99900 limit in note_interrupt() > (see kernel/irq/spurious.c) and then genirq code will force the > interrupt off completely. > > To avoid that you *have* to return IRQ_HANDLED here and thus pretend > towards genirq code that the interrupt was not spurious. This would deserve an inline comment. > > > struct tpm_tis_data { > > > + struct tpm_chip *chip; > > > u16 manufacturer_id; > > > struct mutex locality_count_mutex; > > > unsigned int locality_count; > > > int locality; > > > + /* Interrupts */ > > > > Not relevant change for a bug fix. > > But not harmful either, is it? No but it is still spurious change in this context. > Thanks, > > Lukas BR, Jarkko
Hi, Sorry, some minor glitches. On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > interrupts instead of polling on all capable TPMs. Unfortunately, on some > products the interrupt line is either never asserted or never deasserted. Use Reported-by and Closes tag and remove this paragraph. In Closes link instead from lore the email where the bug was reported. > The former causes interrupt timeouts and is detected by > tpm_tis_core_init(). The latter results in interrupt storms. Please describe instead the system where the bug was realized. Don't worry about speculative descriptions. We only deal with ones actually realized. > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > L490 and Inspur NF5180M6: > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ Please remove all of this, as the fixes have been handled. Let's keep the commit message focused. > The current approach to avoid those storms is to disable interrupts by > adding a DMI quirk for the concerned device. > > However this is a maintenance burden in the long run, so use a generic > approach: > > Detect an interrupt storm by counting the number of unhandled interrupts > within a 10 ms time interval. In case that more than 1000 were unhandled > deactivate interrupts, deregister the handler and fall back to polling. > > This equals the implementation that handles interrupt storms in > note_interrupt() by means of timestamps and counters in struct irq_desc. > However the function to access this structure is private so the logic has > to be reimplemented in the TPM TIS core. I only now found out that this is based on kernel/irq/spurious.c code partly. Why this was unmentioned? That would make this already more legitimate because it is based on field tested metrics. Then we only have to discuss about counter. > routine trigger a worker thread that executes the unregistration. > > Suggested-by: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > --- > drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- > drivers/char/tpm/tpm_tis_core.h | 6 +++ > 2 files changed, 74 insertions(+), 3 deletions(-) > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > index 558144fa707a..458ebf8c2f16 100644 > --- a/drivers/char/tpm/tpm_tis_core.c > +++ b/drivers/char/tpm/tpm_tis_core.c > @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) > return status == TPM_STS_COMMAND_READY; > } > > +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) > +{ > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > + int intmask = 0; > + > + dev_err(&chip->dev, HW_ERR > + "TPM interrupt storm detected, polling instead\n"); Degrading this to warn is fine because it is legit behaviour in a sense. > + > + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); > + > + intmask &= ~TPM_GLOBAL_INT_ENABLE; > + > + tpm_tis_request_locality(chip, 0); > + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); > + tpm_tis_relinquish_locality(chip, 0); > + > + chip->flags &= ~TPM_CHIP_FLAG_IRQ; > + > + /* > + * We must not call devm_free_irq() from within the interrupt handler, Never use "we" form. Always use either: 1. Imperative 2. Passive I.e. to address this, you would write instead "devm_free_irq() must not be called within the interrupt handler because ...". > + * since this function waits for running interrupt handlers to finish > + * and thus it would deadlock. Instead trigger a worker that does the > + * unregistration. > + */ > + schedule_work(&priv->free_irq_work); > +} > + > +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) > +{ > + const unsigned int MAX_UNHANDLED_IRQS = 1000; > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); Reverse order and add empty line. > + /* > + * The worker to free the TPM interrupt (free_irq_work) may already > + * be scheduled, so make sure it is not scheduled again. > + */ > + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) > + return; > + > + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) > + priv->unhandled_irqs = 1; > + else > + priv->unhandled_irqs++; > + > + priv->last_unhandled_irq = jiffies; > + > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > + tpm_tis_handle_irq_storm(chip); Why wouldn't we switch to polling mode even when there is a single unhandled IRQ? > +} > + > static irqreturn_t tis_int_handler(int dummy, void *dev_id) > { > struct tpm_chip *chip = dev_id; > @@ -761,10 +810,10 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > if (interrupt == 0) > - return IRQ_NONE; > + goto unhandled; > > set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); > if (interrupt & TPM_INTF_DATA_AVAIL_INT) > @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > tpm_tis_relinquish_locality(chip, 0); > if (rc < 0) > - return IRQ_NONE; > + goto unhandled; > > tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > return IRQ_HANDLED; > + > +unhandled: > + tpm_tis_process_unhandled_interrupt(chip); > + return IRQ_HANDLED; > } > > static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > chip->flags &= ~TPM_CHIP_FLAG_IRQ; > } > > +static void tpm_tis_free_irq_func(struct work_struct *work) > +{ > + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); > + struct tpm_chip *chip = priv->chip; > + > + devm_free_irq(chip->dev.parent, priv->irq, chip); > + priv->irq = 0; > +} > + > /* Register the IRQ and issue a command that will cause an interrupt. If an > * irq is seen then leave the chip setup for IRQ operation, otherwise reverse > * everything and leave in polling mode. Returns 0 on success. > @@ -816,6 +878,7 @@ static int tpm_tis_probe_irq_single(struct tpm_chip *chip, u32 intmask, > int rc; > u32 int_status; > > + INIT_WORK(&priv->free_irq_work, tpm_tis_free_irq_func); > > rc = devm_request_threaded_irq(chip->dev.parent, irq, NULL, > tis_int_handler, IRQF_ONESHOT | flags, > @@ -918,6 +981,7 @@ void tpm_tis_remove(struct tpm_chip *chip) > interrupt = 0; > > tpm_tis_write32(priv, reg, ~TPM_GLOBAL_INT_ENABLE & interrupt); > + flush_work(&priv->free_irq_work); > > tpm_tis_clkrun_enable(chip, false); > > @@ -1021,6 +1085,7 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, > chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); > chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); > chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); > + priv->chip = chip; > priv->timeout_min = TPM_TIMEOUT_USECS_MIN; > priv->timeout_max = TPM_TIMEOUT_USECS_MAX; > priv->phy_ops = phy_ops; > diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h > index e978f457fd4d..6fc86baa4398 100644 > --- a/drivers/char/tpm/tpm_tis_core.h > +++ b/drivers/char/tpm/tpm_tis_core.h > @@ -91,12 +91,18 @@ enum tpm_tis_flags { > }; > > struct tpm_tis_data { > + struct tpm_chip *chip; > u16 manufacturer_id; > struct mutex locality_count_mutex; > unsigned int locality_count; > int locality; > + /* Interrupts */ > int irq; > + struct work_struct free_irq_work; > + unsigned long last_unhandled_irq; > + unsigned int unhandled_irqs; > unsigned int int_mask; > + > unsigned long flags; > void __iomem *ilb_base_addr; > u16 clkrun_enabled; > > base-commit: 44c026a73be8038f03dbdeef028b642880cf1511 > -- > 2.40.1 BR, Jarkko
On Wed May 24, 2023 at 6:58 AM EEST, Jarkko Sakkinen wrote: > Hi, > > Sorry, some minor glitches. > > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled > > interrupts instead of polling on all capable TPMs. Unfortunately, on some > > products the interrupt line is either never asserted or never deasserted. > > Use Reported-by and Closes tag and remove this paragraph. > > In Closes link instead from lore the email where the bug was reported. > > > The former causes interrupt timeouts and is detected by > > tpm_tis_core_init(). The latter results in interrupt storms. > > Please describe instead the system where the bug was realized. Don't > worry about speculative descriptions. We only deal with ones actually > realized. > > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad > > L490 and Inspur NF5180M6: > > > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > Please remove all of this, as the fixes have been handled. Let's keep > the commit message focused. > > > The current approach to avoid those storms is to disable interrupts by > > adding a DMI quirk for the concerned device. > > > > However this is a maintenance burden in the long run, so use a generic > > approach: > > > > Detect an interrupt storm by counting the number of unhandled interrupts > > within a 10 ms time interval. In case that more than 1000 were unhandled > > deactivate interrupts, deregister the handler and fall back to polling. > > > > This equals the implementation that handles interrupt storms in > > note_interrupt() by means of timestamps and counters in struct irq_desc. > > However the function to access this structure is private so the logic has > > to be reimplemented in the TPM TIS core. > > I only now found out that this is based on kernel/irq/spurious.c code > partly. Why this was unmentioned? > > That would make this already more legitimate because it is based > on field tested metrics. > > Then we only have to discuss about counter. > > > routine trigger a worker thread that executes the unregistration. > > > > Suggested-by: Lukas Wunner <lukas@wunner.de> > > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > --- > > drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- > > drivers/char/tpm/tpm_tis_core.h | 6 +++ > > 2 files changed, 74 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > > index 558144fa707a..458ebf8c2f16 100644 > > --- a/drivers/char/tpm/tpm_tis_core.c > > +++ b/drivers/char/tpm/tpm_tis_core.c > > @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) > > return status == TPM_STS_COMMAND_READY; > > } > > > > +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) > > +{ > > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > > + int intmask = 0; > > + > > + dev_err(&chip->dev, HW_ERR > > + "TPM interrupt storm detected, polling instead\n"); > > Degrading this to warn is fine because it is legit behaviour in a > sense. > > > + > > + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); > > + > > + intmask &= ~TPM_GLOBAL_INT_ENABLE; > > + > > + tpm_tis_request_locality(chip, 0); > > + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); > > + tpm_tis_relinquish_locality(chip, 0); > > + > > + chip->flags &= ~TPM_CHIP_FLAG_IRQ; > > + > > + /* > > + * We must not call devm_free_irq() from within the interrupt handler, > > Never use "we" form. Always use either: > > 1. Imperative > 2. Passive > > I.e. to address this, you would write instead "devm_free_irq() must not > be called within the interrupt handler because ...". > > > + * since this function waits for running interrupt handlers to finish > > + * and thus it would deadlock. Instead trigger a worker that does the > > + * unregistration. > > + */ > > + schedule_work(&priv->free_irq_work); > > +} > > + > > +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) > > +{ > > + const unsigned int MAX_UNHANDLED_IRQS = 1000; > > + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > > Reverse order and add empty line. > > > + /* > > + * The worker to free the TPM interrupt (free_irq_work) may already > > + * be scheduled, so make sure it is not scheduled again. > > + */ > > + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) > > + return; > > + > > + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) > > + priv->unhandled_irqs = 1; > > + else > > + priv->unhandled_irqs++; > > + > > + priv->last_unhandled_irq = jiffies; > > + > > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > > + tpm_tis_handle_irq_storm(chip); > > Why wouldn't we switch to polling mode even when there is a single > unhandled IRQ? > > > +} > > + > > static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > { > > struct tpm_chip *chip = dev_id; > > @@ -761,10 +810,10 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > > > rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > > if (rc < 0) > > - return IRQ_NONE; > > + goto unhandled; > > > > if (interrupt == 0) > > - return IRQ_NONE; > > + goto unhandled; > > > > set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); > > if (interrupt & TPM_INTF_DATA_AVAIL_INT) > > @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > > tpm_tis_relinquish_locality(chip, 0); > > if (rc < 0) > > - return IRQ_NONE; > > + goto unhandled; > > > > tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > > return IRQ_HANDLED; > > + > > +unhandled: > > + tpm_tis_process_unhandled_interrupt(chip); > > + return IRQ_HANDLED; > > } > > > > static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > > @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) > > chip->flags &= ~TPM_CHIP_FLAG_IRQ; > > } > > > > +static void tpm_tis_free_irq_func(struct work_struct *work) > > +{ > > + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); > > + struct tpm_chip *chip = priv->chip; > > + > > + devm_free_irq(chip->dev.parent, priv->irq, chip); > > + priv->irq = 0; > > +} > > + > > /* Register the IRQ and issue a command that will cause an interrupt. If an > > * irq is seen then leave the chip setup for IRQ operation, otherwise reverse > > * everything and leave in polling mode. Returns 0 on success. > > @@ -816,6 +878,7 @@ static int tpm_tis_probe_irq_single(struct tpm_chip *chip, u32 intmask, > > int rc; > > u32 int_status; > > > > + INIT_WORK(&priv->free_irq_work, tpm_tis_free_irq_func); > > > > rc = devm_request_threaded_irq(chip->dev.parent, irq, NULL, > > tis_int_handler, IRQF_ONESHOT | flags, > > @@ -918,6 +981,7 @@ void tpm_tis_remove(struct tpm_chip *chip) > > interrupt = 0; > > > > tpm_tis_write32(priv, reg, ~TPM_GLOBAL_INT_ENABLE & interrupt); > > + flush_work(&priv->free_irq_work); > > > > tpm_tis_clkrun_enable(chip, false); > > > > @@ -1021,6 +1085,7 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, > > chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); > > chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); > > chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); > > + priv->chip = chip; > > priv->timeout_min = TPM_TIMEOUT_USECS_MIN; > > priv->timeout_max = TPM_TIMEOUT_USECS_MAX; > > priv->phy_ops = phy_ops; > > diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h > > index e978f457fd4d..6fc86baa4398 100644 > > --- a/drivers/char/tpm/tpm_tis_core.h > > +++ b/drivers/char/tpm/tpm_tis_core.h > > @@ -91,12 +91,18 @@ enum tpm_tis_flags { > > }; > > > > struct tpm_tis_data { > > + struct tpm_chip *chip; > > u16 manufacturer_id; > > struct mutex locality_count_mutex; > > unsigned int locality_count; > > int locality; > > + /* Interrupts */ > > int irq; > > + struct work_struct free_irq_work; > > + unsigned long last_unhandled_irq; > > + unsigned int unhandled_irqs; > > unsigned int int_mask; > > + > > unsigned long flags; > > void __iomem *ilb_base_addr; > > u16 clkrun_enabled; > > > > base-commit: 44c026a73be8038f03dbdeef028b642880cf1511 > > -- > > 2.40.1 I added 'irq-storm' branch where I have the latest fixes: git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git All known DMI table updates are now afaik in the mainline. BR, Jarkko
On Wed, May 24, 2023 at 06:58:08AM +0300, Jarkko Sakkinen wrote: > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote: > > + /* > > + * The worker to free the TPM interrupt (free_irq_work) may already > > + * be scheduled, so make sure it is not scheduled again. > > + */ > > + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) > > + return; > > + > > + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) > > + priv->unhandled_irqs = 1; > > + else > > + priv->unhandled_irqs++; > > + > > + priv->last_unhandled_irq = jiffies; > > + > > + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) > > + tpm_tis_handle_irq_storm(chip); > > Why wouldn't we switch to polling mode even when there is a single > unhandled IRQ? An unhandled IRQ can be legitimate if the interrupt is shared with other devices and the IRQ was raised by one of them. So you only want to switch to polling if there's a significant amount of unhandled IRQs in a short period of time. Thanks, Lukas
Hi Lukas, On 5/23/23 17:16, Lukas Wunner wrote: > On Tue, May 23, 2023 at 12:14:28PM +0300, Péter Ujfalusi wrote: >> On 23/05/2023 10:44, Lukas Wunner wrote: <snip> >>> The corresponding message emitted in tpm_tis_core_init() for >>> an interrupt that's *never* asserted uses dev_err(), so using >>> dev_err() here as well serves consistency: >>> >>> dev_err(&chip->dev, FW_BUG >>> "TPM interrupt not working, polling instead\n"); >>> >>> That way the same severity is used both for the never asserted and >>> the never deasserted interrupt case. >> >> Oh, OK. >> Is there anything the user can do to have a ERROR less boot? > > You're right that the user can't do anything about it and that > toning the message down to KERN_WARN or even KERN_NOTICE severity > may be appropriate. > > However the above-quoted message for the never asserted interrupt > in tpm_tis_core_init() should then likewise be toned down to the > same severity. > > I'm wondering why that message uses FW_BUG. That doesn't make any > sense to me. It's typically not a firmware bug, but a hardware issue, > e.g. an interrupt pin may erroneously not be connected or may be > connected to ground. Lino used HW_ERR, which seems more appropriate > to me. This issue seems to have mostly been seen on x86/ACPI systems and the issue seems to mostly be a misconfiguration of the IOAPIC (e.g. wrong trigger type). Besides this, the IRQ info always comes from either the ACPI tables or from devicetree both of which are firmware and if the IRQ does not work on a specific board then the firmware should simply not provide any IRQ info to the driver rather the pointing to a broken IRQ. Hence my suggestion to use dev_warn(dev, FW_BUG "....", ...); Regards, Hans
On Wed May 24, 2023 at 6:58 AM EEST, Jarkko Sakkinen wrote: > > rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > > if (rc < 0) > > - return IRQ_NONE; > > + goto unhandled; > > > > if (interrupt == 0) > > - return IRQ_NONE; > > + goto unhandled; > > > > set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); > > if (interrupt & TPM_INTF_DATA_AVAIL_INT) > > @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) > > rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); > > tpm_tis_relinquish_locality(chip, 0); > > if (rc < 0) > > - return IRQ_NONE; > > + goto unhandled; > > > > tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); > > return IRQ_HANDLED; > > + > > +unhandled: > > + tpm_tis_process_unhandled_interrupt(chip); > > + return IRQ_HANDLED; > > } Some minor glitches I noticed. You could simplify the flow by making the helper to return IRQ_NONE. E.g. tpm_tis_relinquish_locality(chip, 0); if (rc < 0) return tpm_tis_process_unhandled_interrupt(chip); I'd recommend changing the function name simply tpm_tis_rollback_interrupt(). Also tpm_tis_handle_irq_storm() is a pretty bad function name because handle also can mean anything. You are resetting to the polling mode, right? So perhaps that could be e.g. tpm_tis_reenable_polling? I'm open for any other name but it really needs to give a hint what the function does. BR, Jarkko
Hi, On 24.05.23 05:58, Jarkko Sakkinen wrote: >> Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled >> interrupts instead of polling on all capable TPMs. Unfortunately, on some >> products the interrupt line is either never asserted or never deasserted. > > Use Reported-by and Closes tag and remove this paragraph. > > In Closes link instead from lore the email where the bug was reported. Ok > >> The former causes interrupt timeouts and is detected by >> tpm_tis_core_init(). The latter results in interrupt storms. > > Please describe instead the system where the bug was realized. Don't > worry about speculative descriptions. We only deal with ones actually > realized. > >> Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad >> L490 and Inspur NF5180M6: >> >> https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@redhat.com/ >> https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > Please remove all of this, as the fixes have been handled. Let's keep > the commit message focused. > Ok. >> The current approach to avoid those storms is to disable interrupts by >> adding a DMI quirk for the concerned device. >> >> However this is a maintenance burden in the long run, so use a generic >> approach: >> >> Detect an interrupt storm by counting the number of unhandled interrupts >> within a 10 ms time interval. In case that more than 1000 were unhandled >> deactivate interrupts, deregister the handler and fall back to polling. >> >> This equals the implementation that handles interrupt storms in >> note_interrupt() by means of timestamps and counters in struct irq_desc. >> However the function to access this structure is private so the logic has >> to be reimplemented in the TPM TIS core. > > I only now found out that this is based on kernel/irq/spurious.c code > partly. Why this was unmentioned? > > That would make this already more legitimate because it is based > on field tested metrics. >> Then we only have to discuss about counter. I mentioned this in one of my responses: https://lore.kernel.org/all/da435e0d-5f22-fac7-bc10-96a0fd4c6d54@kunbus.com/ > >> routine trigger a worker thread that executes the unregistration. >> >> Suggested-by: Lukas Wunner <lukas@wunner.de> >> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> >> --- >> drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++-- >> drivers/char/tpm/tpm_tis_core.h | 6 +++ >> 2 files changed, 74 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c >> index 558144fa707a..458ebf8c2f16 100644 >> --- a/drivers/char/tpm/tpm_tis_core.c >> +++ b/drivers/char/tpm/tpm_tis_core.c >> @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) >> return status == TPM_STS_COMMAND_READY; >> } >> >> +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) >> +{ >> + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); >> + int intmask = 0; >> + >> + dev_err(&chip->dev, HW_ERR >> + "TPM interrupt storm detected, polling instead\n"); > > Degrading this to warn is fine because it is legit behaviour in a > sense. > Agreed. >> + >> + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); >> + >> + intmask &= ~TPM_GLOBAL_INT_ENABLE; >> + >> + tpm_tis_request_locality(chip, 0); >> + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); >> + tpm_tis_relinquish_locality(chip, 0); >> + >> + chip->flags &= ~TPM_CHIP_FLAG_IRQ; >> + >> + /* >> + * We must not call devm_free_irq() from within the interrupt handler, > > Never use "we" form. Always use either: > > 1. Imperative > 2. Passive > > I.e. to address this, you would write instead "devm_free_irq() must not > be called within the interrupt handler because ...". > Ok. >> + * since this function waits for running interrupt handlers to finish >> + * and thus it would deadlock. Instead trigger a worker that does the >> + * unregistration. >> + */ >> + schedule_work(&priv->free_irq_work); >> +} >> + >> +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) >> +{ >> + const unsigned int MAX_UNHANDLED_IRQS = 1000; >> + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); > > Reverse order and add empty line. > Ok. >> + /* >> + * The worker to free the TPM interrupt (free_irq_work) may already >> + * be scheduled, so make sure it is not scheduled again. >> + */ >> + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) >> + return; >> + >> + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) >> + priv->unhandled_irqs = 1; >> + else >> + priv->unhandled_irqs++; >> + >> + priv->last_unhandled_irq = jiffies; >> + >> + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) >> + tpm_tis_handle_irq_storm(chip); > > Why wouldn't we switch to polling mode even when there is a single > unhandled IRQ? > As Lukas wrote, not handling an interrupt may be legit if it was raised by a device that shares the interrupt line with the TPM. Regards, Lino
On 24.05.23 17:30, Jarkko Sakkinen wrote: > ATTENTION: This e-mail is from an external sender. Please check attachments and links before opening e.g. with mouseover. > > > On Wed May 24, 2023 at 6:58 AM EEST, Jarkko Sakkinen wrote: >>> rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); >>> if (rc < 0) >>> - return IRQ_NONE; >>> + goto unhandled; >>> >>> if (interrupt == 0) >>> - return IRQ_NONE; >>> + goto unhandled; >>> >>> set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); >>> if (interrupt & TPM_INTF_DATA_AVAIL_INT) >>> @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) >>> rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); >>> tpm_tis_relinquish_locality(chip, 0); >>> if (rc < 0) >>> - return IRQ_NONE; >>> + goto unhandled; >>> >>> tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); >>> return IRQ_HANDLED; >>> + >>> +unhandled: >>> + tpm_tis_process_unhandled_interrupt(chip); >>> + return IRQ_HANDLED; >>> } > > Some minor glitches I noticed. > > You could simplify the flow by making the helper to return IRQ_NONE. > > E.g. > > tpm_tis_relinquish_locality(chip, 0); > if (rc < 0) > return tpm_tis_process_unhandled_interrupt(chip); Agreed, this way we could spare a few lines in the interrupt handler (but note that the implementation only returns IRQ_HANDLED never IRQ_NONE. This is to prevent the generic irq code from doing its own interrupt storm handling before the TPM driver had a chance to fall back to polling). > > I'd recommend changing the function name simply tpm_tis_rollback_interrupt(). > Also tpm_tis_handle_irq_storm() is a pretty bad function name > because handle also can mean anything. You are resetting to the > polling mode, right? > > So perhaps that could be e.g. tpm_tis_reenable_polling? I'm open > for any other name but it really needs to give a hint what the > function does. tpm_tis_reenable_polling() sounds good to me. Regards, Lino
Hi Lino, On 23/05/2023 23:46, Lino Sanfilippo wrote: >> On the other hand any new functionality is objectively a maintanance >> burden of some measure (applies to any functionality). So how do we know >> that taking this change is less of a maintenance burden than just add >> new table entries, as they come up? >> > > Initially this set was created as a response to this 0-day bug report which you asked me > to have a look at: > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not > (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus > needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once > 6.4 is released. I'm surprised that there is a need for a storm detection in the first place... Do we have something else on the same IRQ line on the affected devices which might have a bug or no driver at all? It is hard to believe that a TPM (Trusted Platform Module) is integrated so poorly ;) But put that aside: I think the storm detection is good given that there is no other way to know which machine have sloppy TPM integration. There are machines where this happens, so it is a know integration issue, right? My only 'nitpick' is with the printk level to be used. The ERR level is not correct as we know the issue and we handle it, so all is under control. If we want to add these machines to the quirk list then WARN is a good level to gain attention but I'm not sure if a user will know how to get the machine in the quirk (where to file a bug). If we only want the quirk to be used for machines like UPX-i11 which simply just have broken (likely floating) IRQ line then the WARN is too high level, INFO or even DBG would be appropriate as you are not going to update the quirk, it is just handled under the hood (which is a great thing, but on the other hand you will have the storm never the less and that is not a nice thing). It is a matter on how this is going to be handled in a long term. Add quirk for all the known machines with either stormy or plain broken IRQ line or handle the stormy ones and quirk the broken ones only. >>> Detect an interrupt storm by counting the number of unhandled interrupts >>> within a 10 ms time interval. In case that more than 1000 were unhandled >>> deactivate interrupts, deregister the handler and fall back to polling. >> >> I know it can be sometimes hard to evaluate but can you try to explain >> how you came up to the 10 ms sampling period and 1000 interrupt >> threshold? I just don't like abritrary numbers. > > At least the 100 ms is not plucked out of thin air but its the same time period > that the generic code in note_interrupt() uses - I assume for a good reason. > Not only this number but the whole irq storm detection logic is taken from > there: > >> >>> This equals the implementation that handles interrupt storms in >>> note_interrupt() by means of timestamps and counters in struct irq_desc. >> The number of 1000 unhandled interrupts is still far below the 99900 used in > note_interrupt() but IMHO enough to indicate that there is something seriously > wrong with interrupt processing and it is probably saver to fall back to polling. Except that if the line got the spurious designation in core, the interrupt line will be disabled while the TPM driver will think that it is still using IRQ mode and will not switch to polling. A storm of 1000 is better than a storm of 99900 for sure but quirking these would be the desired final solution. imho There are many buts around this ;)
On Tue, 2023-05-23 at 17:16 +0200, Lukas Wunner wrote: > > > > > + dev_err(&chip->dev, HW_ERR > > > > > + "TPM interrupt storm detected, polling instead\n"); > > > > > > > > Should this be dev_warn or even dev_info level? > > > > > > The corresponding message emitted in tpm_tis_core_init() for > > > an interrupt that's *never* asserted uses dev_err(), so using > > > dev_err() here as well serves consistency: > > > > > > dev_err(&chip->dev, FW_BUG > > > "TPM interrupt not working, polling instead\n"); > > > > > > That way the same severity is used both for the never asserted and > > > the never deasserted interrupt case. > > > > Oh, OK. > > Is there anything the user can do to have a ERROR less boot? > > You're right that the user can't do anything about it and that > toning the message down to KERN_WARN or even KERN_NOTICE severity > may be appropriate. > > However the above-quoted message for the never asserted interrupt > in tpm_tis_core_init() should then likewise be toned down to the > same severity. > > I'm wondering why that message uses FW_BUG. That doesn't make any > sense to me. It's typically not a firmware bug, but a hardware issue, > e.g. an interrupt pin may erroneously not be connected or may be > connected to ground. Lino used HW_ERR, which seems more appropriate > to me. Firmware is responsible for configuring gpios and interrupts correctly, independently of it being a design decision or a mistake. AIUI any interrupt storm could be prevented by firmware in any case by simply disabling that interrupt. Thus, FW_BUG is the right thing to use here IMO.
Hi Péter, On 29.05.23 at 08:46, Péter Ujfalusi wrote: >> My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not >> (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus >> needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once >> 6.4 is released. > > I'm surprised that there is a need for a storm detection in the first > place... Do we have something else on the same IRQ line on the affected > devices which might have a bug or no driver at all? > It is hard to believe that a TPM (Trusted Platform Module) is integrated > so poorly ;) > > But put that aside: I think the storm detection is good given that there > is no other way to know which machine have sloppy TPM integration. > There are machines where this happens, so it is a know integration > issue, right? >> My only 'nitpick' is with the printk level to be used. > The ERR level is not correct as we know the issue and we handle it, so > all is under control. > If we want to add these machines to the quirk list then WARN is a good > level to gain attention but I'm not sure if a user will know how to get > the machine in the quirk (where to file a bug). > If we only want the quirk to be used for machines like UPX-i11 which > simply just have broken (likely floating) IRQ line then the WARN is too > high level, INFO or even DBG would be appropriate as you are not going > to update the quirk, it is just handled under the hood (which is a great > thing, but on the other hand you will have the storm never the less and > that is not a nice thing). > > It is a matter on how this is going to be handled in a long term. Add > quirk for all the known machines with either stormy or plain broken IRQ > line or handle the stormy ones and quirk the broken ones only. > Even in the long run I would always prefer a generic solution whenever it is possible. Quirks are fine for issues that cant be solved in another way or really require individual treatment. While I agree that ERR is not a good choice for the "falling back to polling" message I do not have a strong opinion on whether WARN, NOTICE or INFO is more appropriate. Jarko seems to prefer WARN. >>>> Detect an interrupt storm by counting the number of unhandled interrupts >>>> within a 10 ms time interval. In case that more than 1000 were unhandled >>>> deactivate interrupts, deregister the handler and fall back to polling. >>> >>> I know it can be sometimes hard to evaluate but can you try to explain >>> how you came up to the 10 ms sampling period and 1000 interrupt >>> threshold? I just don't like abritrary numbers. >> >> At least the 100 ms is not plucked out of thin air but its the same time period >> that the generic code in note_interrupt() uses - I assume for a good reason. >> Not only this number but the whole irq storm detection logic is taken from >> there: >> >>> >>>> This equals the implementation that handles interrupt storms in >>>> note_interrupt() by means of timestamps and counters in struct irq_desc. >>> The number of 1000 unhandled interrupts is still far below the 99900 > used in >> note_interrupt() but IMHO enough to indicate that there is something seriously >> wrong with interrupt processing and it is probably saver to fall back to polling. > > Except that if the line got the spurious designation in core, the > interrupt line will be disabled while the TPM driver will think that it > is still using IRQ mode and will not switch to polling. In the case that an interrupt storm cant be detected (since there might not even be one) I am fine with adding a quirk. > > A storm of 1000 is better than a storm of 99900 for sure but quirking > these would be the desired final solution. imho > As I wrote above I prefer a generic solution whenever possible. > There are many buts around this ;) > Regards, Lino
On 24.05.23 at 17:30, Jarkko Sakkinen wrote: > On Wed May 24, 2023 at 6:58 AM EEST, Jarkko Sakkinen wrote: >>> rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); >>> if (rc < 0) >>> - return IRQ_NONE; >>> + goto unhandled; >>> >>> if (interrupt == 0) >>> - return IRQ_NONE; >>> + goto unhandled; >>> >>> set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); >>> if (interrupt & TPM_INTF_DATA_AVAIL_INT) >>> @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) >>> rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); >>> tpm_tis_relinquish_locality(chip, 0); >>> if (rc < 0) >>> - return IRQ_NONE; >>> + goto unhandled; >>> >>> tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); >>> return IRQ_HANDLED; >>> + >>> +unhandled: >>> + tpm_tis_process_unhandled_interrupt(chip); >>> + return IRQ_HANDLED; >>> } > > Some minor glitches I noticed. > > You could simplify the flow by making the helper to return IRQ_NONE. > > E.g. > > tpm_tis_relinquish_locality(chip, 0); > if (rc < 0) > return tpm_tis_process_unhandled_interrupt(chip); > > I'd recommend changing the function name simply tpm_tis_rollback_interrupt(). > IMHO this name is worse, since this function does actually _not_ rollback interrupts most of the times it is called. Only after an interrupt storm is detected (so currently after it has been called at least 1000 times without rollback) it actually rolls back interrupts and falls back to polling. Maybe rather tpm_tis_check_for_interrupt_storm()? Regards, Lino
On Mon, May 29, 2023 at 09:46:08AM +0300, Péter Ujfalusi wrote: > Hi Lino, > > On 23/05/2023 23:46, Lino Sanfilippo wrote: > >> On the other hand any new functionality is objectively a maintanance > >> burden of some measure (applies to any functionality). So how do we know > >> that taking this change is less of a maintenance burden than just add > >> new table entries, as they come up? > >> > > > > Initially this set was created as a response to this 0-day bug report which you asked me > > to have a look at: > > > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > > > My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not > > (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus > > needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once > > 6.4 is released. > > I'm surprised that there is a need for a storm detection in the first > place... Do we have something else on the same IRQ line on the affected > devices which might have a bug or no driver at all? > It is hard to believe that a TPM (Trusted Platform Module) is integrated > so poorly ;) > > But put that aside: I think the storm detection is good given that there > is no other way to know which machine have sloppy TPM integration. > There are machines where this happens, so it is a know integration > issue, right? > > My only 'nitpick' is with the printk level to be used. > The ERR level is not correct as we know the issue and we handle it, so > all is under control. > If we want to add these machines to the quirk list then WARN is a good > level to gain attention but I'm not sure if a user will know how to get > the machine in the quirk (where to file a bug). > If we only want the quirk to be used for machines like UPX-i11 which > simply just have broken (likely floating) IRQ line then the WARN is too > high level, INFO or even DBG would be appropriate as you are not going > to update the quirk, it is just handled under the hood (which is a great > thing, but on the other hand you will have the storm never the less and > that is not a nice thing). > > It is a matter on how this is going to be handled in a long term. Add > quirk for all the known machines with either stormy or plain broken IRQ > line or handle the stormy ones and quirk the broken ones only. > > >>> Detect an interrupt storm by counting the number of unhandled interrupts > >>> within a 10 ms time interval. In case that more than 1000 were unhandled > >>> deactivate interrupts, deregister the handler and fall back to polling. > >> > >> I know it can be sometimes hard to evaluate but can you try to explain > >> how you came up to the 10 ms sampling period and 1000 interrupt > >> threshold? I just don't like abritrary numbers. > > > > At least the 100 ms is not plucked out of thin air but its the same time period > > that the generic code in note_interrupt() uses - I assume for a good reason. > > Not only this number but the whole irq storm detection logic is taken from > > there: > > > >> > >>> This equals the implementation that handles interrupt storms in > >>> note_interrupt() by means of timestamps and counters in struct irq_desc. > >> The number of 1000 unhandled interrupts is still far below the 99900 > used in > > note_interrupt() but IMHO enough to indicate that there is something seriously > > wrong with interrupt processing and it is probably saver to fall back to polling. > > Except that if the line got the spurious designation in core, the > interrupt line will be disabled while the TPM driver will think that it > is still using IRQ mode and will not switch to polling. > > A storm of 1000 is better than a storm of 99900 for sure but quirking > these would be the desired final solution. imho If that is the case, then output could probably be sent to the console detailing the dmi info needed to update the table. Regards, Jerry > > There are many buts around this ;) > > -- > Péter
On Tue May 30, 2023 at 8:56 PM EEST, Jerry Snitselaar wrote: > On Mon, May 29, 2023 at 09:46:08AM +0300, Péter Ujfalusi wrote: > > Hi Lino, > > > > On 23/05/2023 23:46, Lino Sanfilippo wrote: > > >> On the other hand any new functionality is objectively a maintanance > > >> burden of some measure (applies to any functionality). So how do we know > > >> that taking this change is less of a maintenance burden than just add > > >> new table entries, as they come up? > > >> > > > > > > Initially this set was created as a response to this 0-day bug report which you asked me > > > to have a look at: > > > > > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/ > > > > > > My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not > > > (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus > > > needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once > > > 6.4 is released. > > > > I'm surprised that there is a need for a storm detection in the first > > place... Do we have something else on the same IRQ line on the affected > > devices which might have a bug or no driver at all? > > It is hard to believe that a TPM (Trusted Platform Module) is integrated > > so poorly ;) > > > > But put that aside: I think the storm detection is good given that there > > is no other way to know which machine have sloppy TPM integration. > > There are machines where this happens, so it is a know integration > > issue, right? > > > > My only 'nitpick' is with the printk level to be used. > > The ERR level is not correct as we know the issue and we handle it, so > > all is under control. > > If we want to add these machines to the quirk list then WARN is a good > > level to gain attention but I'm not sure if a user will know how to get > > the machine in the quirk (where to file a bug). > > If we only want the quirk to be used for machines like UPX-i11 which > > simply just have broken (likely floating) IRQ line then the WARN is too > > high level, INFO or even DBG would be appropriate as you are not going > > to update the quirk, it is just handled under the hood (which is a great > > thing, but on the other hand you will have the storm never the less and > > that is not a nice thing). > > > > It is a matter on how this is going to be handled in a long term. Add > > quirk for all the known machines with either stormy or plain broken IRQ > > line or handle the stormy ones and quirk the broken ones only. > > > > >>> Detect an interrupt storm by counting the number of unhandled interrupts > > >>> within a 10 ms time interval. In case that more than 1000 were unhandled > > >>> deactivate interrupts, deregister the handler and fall back to polling. > > >> > > >> I know it can be sometimes hard to evaluate but can you try to explain > > >> how you came up to the 10 ms sampling period and 1000 interrupt > > >> threshold? I just don't like abritrary numbers. > > > > > > At least the 100 ms is not plucked out of thin air but its the same time period > > > that the generic code in note_interrupt() uses - I assume for a good reason. > > > Not only this number but the whole irq storm detection logic is taken from > > > there: > > > > > >> > > >>> This equals the implementation that handles interrupt storms in > > >>> note_interrupt() by means of timestamps and counters in struct irq_desc. > > >> The number of 1000 unhandled interrupts is still far below the 99900 > > used in > > > note_interrupt() but IMHO enough to indicate that there is something seriously > > > wrong with interrupt processing and it is probably saver to fall back to polling. > > > > Except that if the line got the spurious designation in core, the > > interrupt line will be disabled while the TPM driver will think that it > > is still using IRQ mode and will not switch to polling. > > > > A storm of 1000 is better than a storm of 99900 for sure but quirking > > these would be the desired final solution. imho > > If that is the case, then output could probably be sent to the console > detailing the dmi info needed to update the table. +1 Good idea. BR, Jarkko
On Mon May 29, 2023 at 4:15 PM EEST, Lino Sanfilippo wrote: > > Except that if the line got the spurious designation in core, the > > interrupt line will be disabled while the TPM driver will think that it > > is still using IRQ mode and will not switch to polling. > > In the case that an interrupt storm cant be detected (since there might not even > be one) I am fine with adding a quirk. Speaking of generic vs quirk (if storm can be detected): detection should be eager as ever possible. Too eager does not cause systems failing. This way I think we can converge to stability fast as possible. Then, later on, if a system where interrupts should work but the detection disables interrupts we can either make the fixed threshold less eager, or add a tuning parameter to sysfs. BR, Jarkko
On Mon May 29, 2023 at 9:46 AM EEST, Péter Ujfalusi wrote: > I'm surprised that there is a need for a storm detection in the first > place... Do we have something else on the same IRQ line on the affected > devices which might have a bug or no driver at all? > It is hard to believe that a TPM (Trusted Platform Module) is integrated > so poorly ;) I mentioned this some emails ago (do not care to look up when does not matter) and I guess the reason for this might be that a generic solution compassing all arch's etc. might be tricky to design... BR, Jarkko
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c index 558144fa707a..458ebf8c2f16 100644 --- a/drivers/char/tpm/tpm_tis_core.c +++ b/drivers/char/tpm/tpm_tis_core.c @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status) return status == TPM_STS_COMMAND_READY; } +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip) +{ + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); + int intmask = 0; + + dev_err(&chip->dev, HW_ERR + "TPM interrupt storm detected, polling instead\n"); + + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask); + + intmask &= ~TPM_GLOBAL_INT_ENABLE; + + tpm_tis_request_locality(chip, 0); + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask); + tpm_tis_relinquish_locality(chip, 0); + + chip->flags &= ~TPM_CHIP_FLAG_IRQ; + + /* + * We must not call devm_free_irq() from within the interrupt handler, + * since this function waits for running interrupt handlers to finish + * and thus it would deadlock. Instead trigger a worker that does the + * unregistration. + */ + schedule_work(&priv->free_irq_work); +} + +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip) +{ + const unsigned int MAX_UNHANDLED_IRQS = 1000; + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); + /* + * The worker to free the TPM interrupt (free_irq_work) may already + * be scheduled, so make sure it is not scheduled again. + */ + if (!(chip->flags & TPM_CHIP_FLAG_IRQ)) + return; + + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10)) + priv->unhandled_irqs = 1; + else + priv->unhandled_irqs++; + + priv->last_unhandled_irq = jiffies; + + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS) + tpm_tis_handle_irq_storm(chip); +} + static irqreturn_t tis_int_handler(int dummy, void *dev_id) { struct tpm_chip *chip = dev_id; @@ -761,10 +810,10 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) rc = tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); if (rc < 0) - return IRQ_NONE; + goto unhandled; if (interrupt == 0) - return IRQ_NONE; + goto unhandled; set_bit(TPM_TIS_IRQ_TESTED, &priv->flags); if (interrupt & TPM_INTF_DATA_AVAIL_INT) @@ -780,10 +829,14 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id) rc = tpm_tis_write32(priv, TPM_INT_STATUS(priv->locality), interrupt); tpm_tis_relinquish_locality(chip, 0); if (rc < 0) - return IRQ_NONE; + goto unhandled; tpm_tis_read32(priv, TPM_INT_STATUS(priv->locality), &interrupt); return IRQ_HANDLED; + +unhandled: + tpm_tis_process_unhandled_interrupt(chip); + return IRQ_HANDLED; } static void tpm_tis_gen_interrupt(struct tpm_chip *chip) @@ -804,6 +857,15 @@ static void tpm_tis_gen_interrupt(struct tpm_chip *chip) chip->flags &= ~TPM_CHIP_FLAG_IRQ; } +static void tpm_tis_free_irq_func(struct work_struct *work) +{ + struct tpm_tis_data *priv = container_of(work, typeof(*priv), free_irq_work); + struct tpm_chip *chip = priv->chip; + + devm_free_irq(chip->dev.parent, priv->irq, chip); + priv->irq = 0; +} + /* Register the IRQ and issue a command that will cause an interrupt. If an * irq is seen then leave the chip setup for IRQ operation, otherwise reverse * everything and leave in polling mode. Returns 0 on success. @@ -816,6 +878,7 @@ static int tpm_tis_probe_irq_single(struct tpm_chip *chip, u32 intmask, int rc; u32 int_status; + INIT_WORK(&priv->free_irq_work, tpm_tis_free_irq_func); rc = devm_request_threaded_irq(chip->dev.parent, irq, NULL, tis_int_handler, IRQF_ONESHOT | flags, @@ -918,6 +981,7 @@ void tpm_tis_remove(struct tpm_chip *chip) interrupt = 0; tpm_tis_write32(priv, reg, ~TPM_GLOBAL_INT_ENABLE & interrupt); + flush_work(&priv->free_irq_work); tpm_tis_clkrun_enable(chip, false); @@ -1021,6 +1085,7 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); + priv->chip = chip; priv->timeout_min = TPM_TIMEOUT_USECS_MIN; priv->timeout_max = TPM_TIMEOUT_USECS_MAX; priv->phy_ops = phy_ops; diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h index e978f457fd4d..6fc86baa4398 100644 --- a/drivers/char/tpm/tpm_tis_core.h +++ b/drivers/char/tpm/tpm_tis_core.h @@ -91,12 +91,18 @@ enum tpm_tis_flags { }; struct tpm_tis_data { + struct tpm_chip *chip; u16 manufacturer_id; struct mutex locality_count_mutex; unsigned int locality_count; int locality; + /* Interrupts */ int irq; + struct work_struct free_irq_work; + unsigned long last_unhandled_irq; + unsigned int unhandled_irqs; unsigned int int_mask; + unsigned long flags; void __iomem *ilb_base_addr; u16 clkrun_enabled;