[printk,v1,02/18] printk: Add NMI check to down_trylock_console_sem()

Message ID 20230302195618.156940-3-john.ogness@linutronix.de
State New
Series threaded/atomic console support

Commit Message

John Ogness March 2, 2023, 7:56 p.m. UTC
The printk path is NMI safe because it only adds content to the
buffer and then triggers the delayed output via irq_work. If the
console is flushed or unblanked (on panic) from NMI then it can
deadlock in down_trylock_console_sem() because the semaphore is not
NMI safe.

Avoid try-locking the console from NMI and assume it failed.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 kernel/printk/printk.c | 4 ++++
 1 file changed, 4 insertions(+)
  

Comments

Petr Mladek March 7, 2023, 4:05 p.m. UTC | #1
On Thu 2023-03-02 21:02:02, John Ogness wrote:
> The printk path is NMI safe because it only adds content to the
> buffer and then triggers the delayed output via irq_work. If the
> console is flushed or unblanked (on panic) from NMI then it can
> deadlock in down_trylock_console_sem() because the semaphore is not
> NMI safe.

Do you have any particular code path in mind, please?
This does not work in console_flush_on_panic(), see below.

> Avoid try-locking the console from NMI and assume it failed.
> 
> Signed-off-by: John Ogness <john.ogness@linutronix.de>
> ---
>  kernel/printk/printk.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 40c5f4170ac7..84af038292d9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -318,6 +318,10 @@ static int __down_trylock_console_sem(unsigned long ip)
>  	int lock_failed;
>  	unsigned long flags;
>  
> +	/* Semaphores are not NMI-safe. */
> +	if (in_nmi())
> +		return 1;

console_flush_on_panic() ignores the console_trylock() return value:

void console_flush_on_panic(enum con_flush_mode mode)
{
[...]
	/*
	 * If someone else is holding the console lock, trylock will fail
	 * and may_schedule may be set.  Ignore and proceed to unlock so
	 * that messages are flushed out.  As this can be called from any
	 * context and we don't want to get preempted while flushing,
	 * ensure may_schedule is cleared.
	 */
	console_trylock();
	console_may_schedule = 0;
	console_unlock();
}

So this change would cause an unpaired console_unlock().
And console_unlock() might still deadlock on console_sem->lock.
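
The unpaired unlock can be illustrated with a minimal userspace model. This is a sketch only, not kernel code: model_sem, model_in_nmi and the other model_* names are hypothetical stand-ins for the console semaphore, in_nmi() and console_flush_on_panic(), and the model leaves out the internal sem->lock spinlock on which the real up() could still deadlock.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the console semaphore, not kernel code. */
struct model_sem {
	int count;	/* 1 = free, 0 = held */
};

static struct model_sem console_sem_model = { .count = 1 };
static bool model_in_nmi;	/* stands in for in_nmi() */

/* Returns 0 on success and 1 on failure, like down_trylock(). */
static int model_down_trylock(struct model_sem *sem)
{
	/* The check added by the patch: report failure from NMI. */
	if (model_in_nmi)
		return 1;
	if (sem->count <= 0)
		return 1;
	sem->count--;
	return 0;
}

/* The kernel's up() also takes sem->lock; that part is not modelled. */
static void model_up(struct model_sem *sem)
{
	sem->count++;
}

/* Same shape as console_flush_on_panic(): the trylock result is ignored. */
static void model_flush_on_panic(void)
{
	(void)model_down_trylock(&console_sem_model);
	model_up(&console_sem_model);
}

/* Usage sketch: semaphore count after a panic flush issued from "NMI". */
static int model_count_after_nmi_flush(void)
{
	console_sem_model.count = 1;
	model_in_nmi = true;
	model_flush_on_panic();
	return console_sem_model.count;
}
```

In this model the count ends up above its initial value after a flush from "NMI", because the release runs without a matching acquire.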


OK, your change makes sense. But we still should try flushing
the messages in console_flush_on_panic() even in NMI.

One solution would be to call console_flush_all() directly in
console_flush_on_panic() without taking console_lock().
It should not be worse than the current code which ignores
the console_trylock() return value.

Note that it mostly works because console_flush_on_panic() is called
when other CPUs are supposed to be stopped.

We only would need to prevent other CPUs from flushing messages
as well if they were still running by chance. But we actually already
do this, see abandon_console_lock_in_panic(). Well, we should
make sure that the abandon_console_lock_in_panic() check is
done before flushing the first message.

All these changes together would prevent deadlock on console_sem->lock.
But the synchronization "guarantees" should stay the same.
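
A rough userspace sketch of this proposal, under stated assumptions: all model_* names are hypothetical, model_abandon_in_panic() only approximates what abandon_console_lock_in_panic() is described as doing above, and pending records are reduced to a counter. The point is only the ordering: the panic check runs before the first record is emitted, and no lock is taken or released.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_PANIC_CPU (-1)

/* Hypothetical stand-in for the CPU that owns the panic, if any. */
static int model_panic_cpu = MODEL_NO_PANIC_CPU;

/*
 * Assumed behaviour of abandon_console_lock_in_panic(): true on every
 * CPU other than the panic CPU once a panic is in progress.
 */
static bool model_abandon_in_panic(int this_cpu)
{
	return model_panic_cpu != MODEL_NO_PANIC_CPU &&
	       model_panic_cpu != this_cpu;
}

/*
 * Emit up to @pending records. The check runs before every record,
 * including the first one. Returns the number of records emitted.
 */
static size_t model_flush_all(int this_cpu, size_t pending)
{
	size_t emitted = 0;

	while (emitted < pending) {
		if (model_abandon_in_panic(this_cpu))
			break;
		emitted++;	/* a real console would write the record here */
	}
	return emitted;
}

/* Panic flush without trylock/unlock, as suggested above. */
static size_t model_flush_on_panic_direct(int this_cpu, size_t pending)
{
	return model_flush_all(this_cpu, pending);
}
```

With model_panic_cpu set to 0, a flush on CPU 0 emits all pending records while a flush attempted on CPU 1 emits none.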

> +
>  	/*
>  	 * Here and in __up_console_sem() we need to be in safe mode,
>  	 * because spindump/WARN/etc from under console ->lock will

An alternative solution would be to make the generic down_trylock() safe
in NMI or in panic(). It might do spin_trylock() when oops_in_progress
is set. I mean to do the same trick as console drivers do with
port->lock.

But I am not sure if other down_trylock() users would be happy with
this change. Yes, it might get solved by introducing down_trylock_panic()
that might be used only in console_flush_on_panic(). But it might
be more hairy than the solution proposed above.
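
For illustration, the trylock-only idea might look roughly like this in userspace C11. It is a sketch under assumptions, not the kernel's semaphore: model_psem and model_down_trylock_panic() are hypothetical, and an atomic_flag stands in for sem->lock. The internal lock is tried exactly once, so a holder that was interrupted by the NMI cannot cause a deadlock; the attempt simply fails.

```c
#include <stdatomic.h>

/* Hypothetical userspace model; the atomic_flag stands in for sem->lock. */
struct model_psem {
	atomic_flag lock;
	int count;	/* 1 = free, 0 = held */
};

/*
 * Sketch of a down_trylock_panic()-like helper.
 * Returns 0 on success and 1 on failure, like down_trylock().
 */
static int model_down_trylock_panic(struct model_psem *sem)
{
	int failed = 1;

	/* Try the internal lock once and never spin on it. */
	if (atomic_flag_test_and_set_explicit(&sem->lock,
					      memory_order_acquire))
		return 1;

	if (sem->count > 0) {
		sem->count--;
		failed = 0;
	}

	atomic_flag_clear_explicit(&sem->lock, memory_order_release);
	return failed;
}
```

A caller would treat a return value of 1 the same way whether the semaphore itself was held or only its internal lock was busy.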

Best Regards,
Petr
  
John Ogness March 17, 2023, 11:37 a.m. UTC | #2
On 2023-03-07, Petr Mladek <pmladek@suse.com> wrote:
> So this change would cause an unpaired console_unlock().
> And console_unlock() might still deadlock on console_sem->lock.

Yes, but at least it would have flushed beforehand.

> One solution would be to call console_flush_all() directly in
> console_flush_on_panic() without taking console_lock().
>
> It should not be worse than the current code which ignores
> the console_trylock() return value.

I think your suggestion is acceptable.

> Note that it mostly works because console_flush_on_panic() is called
> when other CPUs are supposed to be stopped.
>
> We only would need to prevent other CPUs from flushing messages
> as well if they were still running by chance. But we actually already
> do this, see abandon_console_lock_in_panic(). Well, we should
> make sure that the abandon_console_lock_in_panic() check is
> done before flushing the first message.
>
> All these changes together would prevent deadlock on
> console_sem->lock.  But the synchronization "guarantees" should stay
> the same.

We could also update console_trylock() and console_lock() to fail and
infinitely sleep, respectively, when abandon_console_lock_in_panic() is
true. That would prevent CPUs from newly acquiring the console lock and
interfering with the panic CPU.
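
The trylock half of that idea could be modelled in userspace as follows. This is only a sketch: the sketch_* names are hypothetical, sketch_abandon_in_panic() approximates abandon_console_lock_in_panic(), and the "sleep forever" behaviour for console_lock() is not modelled at all.

```c
#include <stdbool.h>

#define SKETCH_NO_PANIC_CPU (-1)

/* Hypothetical stand-ins for the panic CPU and the console lock state. */
static int sketch_panic_cpu = SKETCH_NO_PANIC_CPU;
static bool sketch_console_locked;

/* Assumed: true on non-panic CPUs once a panic is in progress. */
static bool sketch_abandon_in_panic(int this_cpu)
{
	return sketch_panic_cpu != SKETCH_NO_PANIC_CPU &&
	       sketch_panic_cpu != this_cpu;
}

/* Returns 1 on success and 0 on failure, like console_trylock(). */
static int sketch_console_trylock(int this_cpu)
{
	/* Non-panic CPUs may no longer newly acquire the lock. */
	if (sketch_abandon_in_panic(this_cpu))
		return 0;
	if (sketch_console_locked)
		return 0;
	sketch_console_locked = true;
	return 1;
}
```

Once sketch_panic_cpu is set, only that CPU can still win the trylock in this model.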

John
  
Petr Mladek April 13, 2023, 1:42 p.m. UTC | #3
On Fri 2023-03-17 12:43:56, John Ogness wrote:
> On 2023-03-07, Petr Mladek <pmladek@suse.com> wrote:
> > So this change would cause an unpaired console_unlock().
> > And console_unlock() might still deadlock on console_sem->lock.
> 
> Yes, but at least it would have flushed beforehand.
> 
> > One solution would be to call console_flush_all() directly in
> > console_flush_on_panic() without taking console_lock().
> >
> > It should not be worse than the current code which ignores
> > the console_trylock() return value.
> 
> I think your suggestion is acceptable.
> 
> > Note that it mostly works because console_flush_on_panic() is called
> > when other CPUs are supposed to be stopped.
> >
> > We only would need to prevent other CPUs from flushing messages
> > as well if they were still running by chance. But we actually already
> > do this, see abandon_console_lock_in_panic(). Well, we should
> > make sure that the abandon_console_lock_in_panic() check is
> > done before flushing the first message.
> >
> > All these changes together would prevent deadlock on
> > console_sem->lock.  But the synchronization "guarantees" should stay
> > the same.
> 
> We could also update console_trylock() and console_lock() to fail and
> infinitely sleep, respectively, when abandon_console_lock_in_panic() is
> true. That would prevent CPUs from newly acquiring the console lock and
> interfering with the panic CPU.

Interesting idea. It should be safe after panic() tries to
stop the CPUs. But I am slightly worried about doing this earlier.

I wonder if it might block, for example, trigger_all_cpu_backtrace(),
which is called when the PANIC_PRINT_ALL_CPU_BT bit is set in panic_print.

Best Regards,
Petr
  

Patch

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 40c5f4170ac7..84af038292d9 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -318,6 +318,10 @@  static int __down_trylock_console_sem(unsigned long ip)
 	int lock_failed;
 	unsigned long flags;
 
+	/* Semaphores are not NMI-safe. */
+	if (in_nmi())
+		return 1;
+
 	/*
 	 * Here and in __up_console_sem() we need to be in safe mode,
 	 * because spindump/WARN/etc from under console ->lock will