tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled

Message ID 20230105144742.3219571-1-Jason@zx2c4.com
State New
Headers
Series tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled |

Commit Message

Jason A. Donenfeld Jan. 5, 2023, 2:47 p.m. UTC
  TPM 1's support for its hardware RNG is broken across system suspends,
due to races or locking issues or something else that haven't been
diagnosed or fixed yet. These issues prevent the system from actually
suspending. So disable the driver in this case. Later, when this is
fixed properly, we can remove this.

Current breakage amounts to something like:

  tpm tpm0: A TPM error (28) occurred continue selftest
  ...
  tpm tpm0: A TPM error (28) occurred attempting get random
  ...
  tpm tpm0: Error (28) sending savestate before suspend
  tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
  tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
  tpm_tis 00:08: PM: failed to suspend: error 28
  PM: Some devices failed to suspend, or early wake event detected

This issue was partially fixed by 23393c646142 ("char: tpm: Protect
tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
directly because the TPM maintainers weren't available. However, it
seems like this just addresses the most common cases of the bug, rather
than addressing it entirely. So there are more things to fix still,
apparently.

The hwrng driver appears already to be occasionally disabled due to
other conditions, so this shouldn't be too large of a surprise.

Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
Cc: stable@vger.kernel.org # 6.1+
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/char/tpm/tpm-chip.c | 8 ++++++++
 1 file changed, 8 insertions(+)
  

Comments

Jason A. Donenfeld Jan. 5, 2023, 2:53 p.m. UTC | #1
On Thu, Jan 05, 2023 at 03:47:42PM +0100, Jason A. Donenfeld wrote:
> TPM 1's support for its hardware RNG is broken across system suspends,
> due to races or locking issues or something else that haven't been
> diagnosed or fixed yet. These issues prevent the system from actually
> suspending. So disable the driver in this case. Later, when this is
> fixed properly, we can remove this.
> 
> Current breakage amounts to something like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...
>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected
> 
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> The hwrng driver appears already to be occasionally disabled due to
> other conditions, so this shouldn't be too large of a surprise.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Quoting from my previous email:

| I spent a long time working through the TPM code when this came up
| during 6.1. I set up the TPM emulator with QEMU and reproduced this and
| had a whole test setup and S3 fuzzer. It took a long time, and when I was
| done, I paged it all out of my brain. When I found that patch from Jan
| that fixed the problem most of the time (but not all the time), I wasted
| tons of time trying to get the TPM maintainers to take the patch and
| send it to Linus as part of rc7 or rc8. But they all ignored me, and
| eventually Linus just took that patch directly.
| 
| So I don't think I want to go down another rabbit hole here, having
| experienced the TPM maintainers not really caring much, and that sucking
| away the remaining energy I had before to keep looking at the issue and
| its edge cases not handled by Jan's patch.
| 
| On the contrary, it'd make a big difference if the TPM maintainers could
| actually help analyze the code that they're most familiar with, so that
| we can get to the bottom of this. That's a lot better than some random
| drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd
| never even looked at these drivers.
| 
| My plan is to therefore be available to help and analyze and test and
| maybe even write some code, if the TPM maintainers take the lead on
| getting to the bottom of this. But if this hits neglect again like last
| time, I'll just send a `depends on BROKEN if PM` patch to the TPM
| hw_random driver and see what happens... That's a really awful solution
| though, so I hope the maintainers will wake up this cycle.

Seeing as there's still no life from the TPM maintainers, here's the
patch to make the problem go away until they wake up. When they do wake
up, though, I will be available to start looking into this again in
whatever capacity I might be useful.

Jason
  
Linus Torvalds Jan. 5, 2023, 9:58 p.m. UTC | #2
On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> TPM 1's support for its hardware RNG is broken across system suspends,
> due to races or locking issues or something else that haven't been
> diagnosed or fixed yet. These issues prevent the system from actually
> suspending. So disable the driver in this case. Later, when this is
> fixed properly, we can remove this.

How about just keeping it enabled, but not making it a fatal error if
the TPM saving doesn't work? IOW, just print the warning, and then
"return 0" from the suspend function.

I doubt anybody cares, but your patch disables that TPM device just
because PM is *enabled*. That's basically "all the time".

Imagine being on a desktop with a distro kernel that enables suspend -
because that kernel obviously is expected to work on laptops too.
You're never actually going to suspend things on that machine, but
maybe you still want to register it as a source of hw random data?

          Linus
  
Jason A. Donenfeld Jan. 5, 2023, 10:29 p.m. UTC | #3
On Thu, Jan 05, 2023 at 01:58:48PM -0800, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > TPM 1's support for its hardware RNG is broken across system suspends,
> > due to races or locking issues or something else that haven't been
> > diagnosed or fixed yet. These issues prevent the system from actually
> > suspending. So disable the driver in this case. Later, when this is
> > fixed properly, we can remove this.
> 
> How about just keeping it enabled, but not making it a fatal error if
> the TPM saving doesn't work? IOW, just print the warning, and then
> "return 0" from the suspend function.

You're right that returning 0 from the pm notifier would make the
problem that users actually care about -- laptop doesn't sleep when you
close the lid -- go away.

From a random.c perspective, the RNG is already initialized when the
driver loads, which will be before suspend bricks the driver. So even if
the behavior afterwards is a buggy driver handing all zeros to random.c,
it won't really matter much; random.c can deal with that
cryptographically. I have no idea if this is actually the case with the
driver's error condition. But if it is, it's good that it doesn't
matter.

So okay, I'll roll a patch to do that when I get home. I'm writing on my
phone now, but from memory it's just changing a 'return rc;' into
'return 0;'.

Then the TPM folks can fix the underlying issue at their leisure
whenever.

Jason
  

Patch

diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 741d8f3e8fb3..eed67ea8d3a7 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -524,6 +524,14 @@  static int tpm_add_hwrng(struct tpm_chip *chip)
 	if (!IS_ENABLED(CONFIG_HW_RANDOM_TPM) || tpm_is_firmware_upgrade(chip))
 		return 0;
 
+	/*
+	 * This driver's support for using the RNG across suspend is broken on
+	 * TPM1. Until somebody fixes this, just stop registering a HWRNG in
+	 * that case.
+	 */
+	if (!(chip->flags & TPM_CHIP_FLAG_TPM2) && IS_ENABLED(CONFIG_PM_SLEEP))
+		return 0;
+
 	snprintf(chip->hwrng_name, sizeof(chip->hwrng_name),
 		 "tpm-rng-%d", chip->dev_num);
 	chip->hwrng.name = chip->hwrng_name;