[printk,v2,04/11] printk: nbcon: Provide functions to mark atomic write sections

Message ID 20230919230856.661435-5-john.ogness@linutronix.de
State New
Series: wire up nbcon atomic printing

Commit Message

John Ogness Sept. 19, 2023, 11:08 p.m. UTC
  From: Thomas Gleixner <tglx@linutronix.de>

WARN/OOPS/PANIC require printing out immediately since the
regular printing method (and in the future, the printing
threads) might not be able to run.

Add per-CPU state to denote the priority/urgency of the output
and provide functions to mark the beginning and end of sections
where the urgent messages are generated.

Note that when a CPU is in a priority elevated state, flushing
only occurs when dropping back to a lower priority. This allows
the full set of printk records (WARN/OOPS/PANIC output) to be
stored in the ringbuffer before beginning to flush the backlog.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
---
 include/linux/console.h |  4 ++
 kernel/printk/nbcon.c   | 89 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)
  

Comments

Petr Mladek Sept. 22, 2023, 9:33 a.m. UTC | #1
On Wed 2023-09-20 01:14:49, John Ogness wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> WARN/OOPS/PANIC require printing out immediately since the
> regular printing method (and in the future, the printing
> threads) might not be able to run.
> 
> Add per-CPU state to denote the priority/urgency of the output
> and provide functions to mark the beginning and end of sections
> where the urgent messages are generated.
> 
> Note that when a CPU is in a priority elevated state, flushing
> only occurs when dropping back to a lower priority. This allows
> the full set of printk records (WARN/OOPS/PANIC output) to be
> stored in the ringbuffer before beginning to flush the backlog.

The above paragraph is a bit confusing. The code added by this patch
does not do any flushing. I guess that this last paragraph is supposed
to explain why the "nesting" array is needed. I would write
something like:

"The state also counts nesting of printing contexts per-priority.
It will be later used to prevent flushing in nested contexts."

That said, I am not sure if the special handling of nested contexts
is needed. But let's discuss it in the patch introducing the flush
functions.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -961,6 +961,95 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
>  	return nbcon_context_exit_unsafe(ctxt);
>  }
>  
> +/**
> + * struct nbcon_cpu_state - Per CPU printk context state
> + * @prio:	The current context priority level
> + * @nesting:	Per priority nest counter
> + */
> +struct nbcon_cpu_state {
> +	enum nbcon_prio		prio;
> +	int			nesting[NBCON_PRIO_MAX];
> +};
> +
> +static DEFINE_PER_CPU(struct nbcon_cpu_state, nbcon_pcpu_state);
> +static struct nbcon_cpu_state early_nbcon_pcpu_state __initdata;
> +
> +/**
> + * nbcon_get_cpu_state - Get the per CPU console state pointer
> + *
> + * Returns either a pointer to the per CPU state of the current CPU or to
> + * the init data state during early boot.
> + */
> +static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
> +{
> +	if (!printk_percpu_data_ready())
> +		return &early_nbcon_pcpu_state;

My first thought was that this was racy. I was afraid that
printk_percpu_data_ready() could change its value inside the
nbcon_atomic_enter()/exit() area. But it actually could not happen.
Anyway, it might be worth a comment. Something like:

	/*
	 * The value of __printk_percpu_data_ready is modified in normal
	 * context. As a result it could never change inside a nbcon
	 * atomic context.
	 */
	if (!printk_percpu_data_ready())
		return &early_nbcon_pcpu_state;

> +
> +	return this_cpu_ptr(&nbcon_pcpu_state);
> +}
> +
> +/**
> + * nbcon_atomic_exit - Exit a context that enforces atomic printing
> + * @prio:	Priority of the context to leave
> + * @prev_prio:	Priority of the previous context for restore
> + *
> + * Context:	Any context. Enables migration.
> + *
> + * @prev_prio is the priority returned by the corresponding
> + * nbcon_atomic_enter().
> + */
> +void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
> +{
> +	struct nbcon_cpu_state *cpu_state;
> +
> +	cpu_state = nbcon_get_cpu_state();

I would add a consistency check:

	WARN_ON_ONCE(cpu_state->nesting[cpu_state->prio] <= 0)

> +	/*
> +	 * Undo the nesting of nbcon_atomic_enter() at the CPU state
> +	 * priority.
> +	 */
> +	cpu_state->nesting[cpu_state->prio]--;
> +
> +	/*
> +	 * Restore the previous priority, which was returned by
> +	 * nbcon_atomic_enter().
> +	 */
> +	cpu_state->prio = prev_prio;
> +
> +	migrate_enable();
> +}

Best Regards,
Petr
  
John Ogness Sept. 25, 2023, 9:25 a.m. UTC | #2
On 2023-09-22, Petr Mladek <pmladek@suse.com> wrote:
>> Note that when a CPU is in a priority elevated state, flushing
>> only occurs when dropping back to a lower priority. This allows
>> the full set of printk records (WARN/OOPS/PANIC output) to be
>> stored in the ringbuffer before beginning to flush the backlog.
>
> The above paragraph is a bit confusing. The code added by this patch
> does not do any flushing.

You are right. I should put this patch after patch 5 "printk: nbcon:
Provide function for atomic flushing" to simplify the introduction.

> I guess that this last paragraph is supposed to explain why the
> "nesting" array is needed.

No, it is explaining how this feature works in general. The term
"priority elevated state" means the CPU is in an atomic write section.

The "nesting" array is needed in order to support a feature that is not
explained in the commit message: If nested OOPS/WARN/PANIC occur, only
the outermost OOPS/WARN/PANIC will do the flushing. I will add this
information to the commit message.
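
For illustration, a minimal sketch of the intended call pattern (the
actual flush comes with a later patch of this series, so treating the
outermost exit as the flush point is an assumption here):

/* Sketch only: an emergency (WARN/OOPS) printout section. */
static void example_emergency_section(void)
{
	enum nbcon_prio prev_prio;

	prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

	/* ... store the full WARN/OOPS output into the ringbuffer ... */

	/*
	 * A nested WARN/OOPS here would call nbcon_atomic_enter() again
	 * and only bump nesting[NBCON_PRIO_EMERGENCY]; it would not flush.
	 */

	/* Only this outermost exit drops back to @prev_prio (and flushes). */
	nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);
}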

>> +static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
>> +{
>> +	if (!printk_percpu_data_ready())
>> +		return &early_nbcon_pcpu_state;
>
> it might be worth a comment. Something like:
>
> 	/*
> 	 * The value of __printk_percpu_data_ready is modified in normal
> 	 * context. As a result it could never change inside a nbcon
> 	 * atomic context.
> 	 */
> 	if (!printk_percpu_data_ready())
> 		return &early_nbcon_pcpu_state;

OK.

>> +void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
>> +{
>> +	struct nbcon_cpu_state *cpu_state;
>> +
>> +	cpu_state = nbcon_get_cpu_state();
>
> I would add a consistency check:
>
> 	WARN_ON_ONCE(cpu_state->nesting[cpu_state->prio] <= 0)

OK.

John
  
Petr Mladek Sept. 25, 2023, 4:04 p.m. UTC | #3
On Mon 2023-09-25 11:31:54, John Ogness wrote:
> On 2023-09-22, Petr Mladek <pmladek@suse.com> wrote:
> >> Note that when a CPU is in a priority elevated state, flushing
> >> only occurs when dropping back to a lower priority. This allows
> >> the full set of printk records (WARN/OOPS/PANIC output) to be
> >> stored in the ringbuffer before beginning to flush the backlog.
> >
> > The above paragraph is a bit confusing. The code added by this patch
> > does not do any flushing.
> 
> You are right. I should put this patch after patch 5 "printk: nbcon:
> Provide function for atomic flushing" to simplify the introduction.
> 
> > I guess that this last paragraph is supposed to explain why the
> > "nesting" array is needed.
> 
> No, it is explaining how this feature works in general. The term
> "priority elevated state" means the CPU is in an atomic write section.

This did not help me much at first. But I got it later after
connecting more dots ;-)

IMHO, the problem was that the commit message introduced the terms
in the following order:

   + WARN/OOPS/PANIC require printing out immediately

   + per-CPU state to denote the priority/urgency

   + functions to mark the beginning/end where the urgent messages
     are generated

   + when CPU is in priority elevated state, flushing only occurs
     when dropping back to a lower priority

From my POV:

   + It did not mention/explain "atomic write" at all

   + It said that the urgent messages required immediate printing.
     And later, it said that they would get flushed later. These two
     statements contradict each other.

   + The handling of priorities is not only about CPU nesting.
     The same rules should apply also when other CPU is printing
     messages in a higher priority context.

My proposal:

  + Use the term "higher priority context" for all WARN/OOPS/PANIC
    messages.

  + Use "emergency priority context" or "nbcon emergency context"
    when talking about NBCON_PRIO_EMERGENCY context to avoid
    confusion with the printk log levels.

  + use "panic priority context" or "nbcon panic context" for the panic
    one if we really want to add enter/exit for the panic context.

  + We must somewhere explain the "atomic context" and "atomic_write"
    callback. One important question is why it is atomic. Is it because it

      + _can_ be called in atomic context?
      + _must_ be called in atomic context?

    It is called also from console_unlock() for boot messages
    so it need not be in atomic context.

    What about renaming it to "nbcon_write" to avoid this confusion?


> The "nesting" array is needed in order to support a feature that is not
> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
> information to the commit message.

What is the motivation for the feature, please?

1. Either we want to flush the messages in the higher priority context
   ASAP.

   The per-CPU lock has been designed exactly for this purpose. It
   would not need any extra nesting counter. And it would work
   consistently also when the lock is acquired on another CPU.

   It is simple, the context will either get the per-console lock or
   not. The (nested) context might get the lock only when it has higher
   priority. The flush would be called directly from vprintk_emit().

   I always thought that this was the planned behavior.

   IMHO, we want it. A nested panic() should be able to takeover
   the console and flush messages ASAP. It will never return.


2. Or we want to wait until all messages in the higher priority context
   are stored into the ring buffer and flush them by the caller.

   Who would actually flush the higher messages?
   WARN() caller?
   The timer callback which detected softlockup?
   Or a completely different context?
   Who would flush panic() messages when panic() interrupted
   locked CPU?


My proposal:

There are only two higher priority contexts:

  + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()

  + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
    and tracking. But it does not necessarily need to be a per-CPU
    variable.

I am thinking about adding "int printk_state" into struct task_struct.
It might be useful also for other things, e.g. for storing the last
log level of a non-finished message, or for entering a section with
an enforced minimal loglevel.


Then we could have:

#define PRINTK_STATE_EMERGENCY_MASK   0x000000ff
#define PRINTK_STATE_EMERGENCY_OFFSET 0x00000001

void nbcon_emergency_enter(int *printk_state)
{
	*printk_state = current->printk_state;

	WARN_ON_ONCE((*printk_state & PRINTK_STATE_EMERGENCY_MASK) == PRINTK_STATE_EMERGENCY_MASK);

	current->printk_state = *printk_state + PRINTK_STATE_EMERGENCY_OFFSET;
}

void nbcon_emergency_exit(int printk_state)
{
	WARN_ON_ONCE(!(current->printk_state & PRINTK_STATE_EMERGENCY_MASK));

	current->printk_state = printk_state;
}

enum nbcon_prio nbcon_get_default_prio(void)
{
	if (atomic_read(&panic_cpu) == raw_smp_processor_id())
		return NBCON_PRIO_PANIC;

	/* Does current always exist? What about fork? */
	if (current && (current->printk_state & PRINTK_STATE_EMERGENCY_MASK))
		return NBCON_PRIO_EMERGENCY;

	return NBCON_PRIO_NORMAL;
}

IMHO, it should be race free. And we could use it to initialize
struct nbcon_context. The final decision will be left to
nbcon_context_try_acquire().
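
For example, the acquire path could then look roughly like the sketch
below. The example_try_acquire_console() wrapper is hypothetical; struct
nbcon_context and nbcon_context_try_acquire() are taken from the earlier
patches of this series:

static bool example_try_acquire_console(struct console *con,
					struct nbcon_context *ctxt)
{
	ctxt->console = con;
	/* panic > emergency > normal, based on the current context. */
	ctxt->prio = nbcon_get_default_prio();

	/* The final decision is still made by the acquire logic. */
	return nbcon_context_try_acquire(ctxt);
}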

Best Regards,
Petr
  
John Ogness Oct. 5, 2023, 12:51 p.m. UTC | #4
On 2023-09-25, Petr Mladek <pmladek@suse.com> wrote:
> From my POV:
>
>    + It did not mention/explain "atomic write" at all

Agreed.

>    + It said that the urgent messages required immediate printing.
>      And later, it said that they would get flushed later. These two
>      statements contradict each other.

I agree that the wording needs to be dramatically improved. The term
"urgent message" was not meant to represent a single printk() call.

>    + The handling of priorities is not only about CPU nesting.
>      The same rules should apply also when other CPU is printing
>      messages in a higher priority context.

Sorry, I do not understand what you mean here. Each CPU is responsible
for its own state. If another CPU is in a higher priority state, that
CPU will be responsible for ensuring its own WARN/OOPS is stored and
flushed. (From v2 I see that the CPU does not try hard enough. I would
fix that for v3.)

> My proposal:
>
>   + Use the term "higher priority context" for all WARN/OOPS/PANIC
>     messages.
>
>   + Use "emergency priority context" or "nbcon emergency context"
>     when talking about NBCON_PRIO_EMERGENCY context to avoid
>     confusion with the printk log levels.
>
>   + use "panic priority context" or "nbcon panic context" for the panic
>     one if we really want to add enter/exit for the panic context.

There are 3 different types of priority:

1. The printk record priority: KERN_ERR, KERN_WARNING, etc.

2. The priority of a console owner: used for controlled takeovers.

3. The priority state of a CPU: only elevated for urgent messages, used
to store all urgent messages and then afterwards print directly by
taking ownership of consoles.

I need to choose terminology carefully to make it easy to distinguish
between these 3 types. v2 failed to name, describe, and document this
correctly.

>   + We must somewhere explain the "atomic context" and "atomic_write"
>     callback. One important question is why it is atomic. Is it because it
>
>       + _can_ be called in atomic context?
>       + _must_ be called in atomic context?

Its main reason for existing is because it can be called from atomic
(including NMI) contexts. But like atomic_t, it can be used from any
context.

>     It is called also from console_unlock() for boot messages
>     so it need not be in atomic context.
>
>     What about renaming it to "nbcon_write" to avoid this confusion?

When we introduce the threads, there will be a 2nd callback (currently
planned write_thread()). This callback is guaranteed to be called from a
printing kthread, which for console drivers like fbcon will prove to be
helpful in cleaning up its code.

I will reserve the word "atomic" _only_ when talking about which
printing callback is used. That should help to avoid associating the
callback with a certain context or priority. But I think the name
"write_atomic" is appropriate.

>> The "nesting" array is needed in order to support a feature that is not
>> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
>> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
>> information to the commit message.
>
> What is the motivation for the feature, please?

During the demo at LPC2022 we had the situation that there was a large
backlog when a WARN was hit. With current mainline the first line of the
WARN is put into the ringbuffer and then the entire backlog is flushed
before storing the rest of the WARN into the ringbuffer. At the time it
was obvious that we should finish storing the WARN message and then
start flushing the backlog.

> 1. Either we want to flush the messages in the higher priority context
>    ASAP.
>
>    The per-CPU lock has been designed exactly for this purpose. It
>    would not need any extra nesting counter. And it would work
>    consistently also when the lock is acquired on another CPU.
>
>    It is simple, the context will either get the per-console lock or
>    not. The (nested) context might get the lock only when it has higher
>    priority. The flush would be called directly from vprintk_emit().
>
>    I always thought that this was the planned behavior.

It was. But then it was suggested by Thomas and acked by Linus that we
were taking the wrong approach. Rather than a global state for all the
consoles, each console should have separate owners with states, allowing
handovers/takeovers. And CPUs should have urgency states to control how
the ringbuffer is stored and when direct printing should take place for
that CPU. This idea is similar to the cpu_sync approach, but with
timeouts, handovers/takeovers, and is per-console.

This approach gives us a lot of flexibility to enforce logical and safe
policies for storing and printing in different situations. And if
named/documented correctly, I think it will be easy to understand.

> 2. Or we want to wait until all messages in the higher priority context
>    are stored into the ring buffer and flush them by the caller.

Yes.

>    Who would actually flush the higher messages?
>    WARN() caller?

Yes.

>    The timer callback which detected softlockup?

Yes.

>    Or a completely different context?

Another CPU could do some helpful backlog printing if it sees new
messages and is able to become an owner. But it is not the CPU that is
responsible for making certain that the messages have been printed.

>    Who would flush panic() messages when panic() interrupted
>    locked CPU?

The panic CPU can takeover (as a last resort, unsafely if
necessary). The panic CPU is responsible for getting those messages out.

> My proposal:
>
> There are only two higher priority contexts:
>
>   + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()

This is the case with v2.

>   + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
>     and tracking. But it does not necessarily need to be per-CPU
>     variable.
>
> I think about adding "int printk_state" into struct task_struct.
> It might be useful also for other things, e.g. for storing the last
> log level of non-finished message. Entering section with enforced
> minimal loglevel or so.

printk() calls are quite rare. And warning states are even more
rare. IMHO adding such a field to every task is a huge price to
pay. Also, printk operates in non-task contexts (hardirq, nmi). Although
@current probably points to something existing, there could be some
tricky corner cases.

IMHO per-cpu urgency state variables and per-console owner state
variables allow a much simpler picture in all contexts.

My proposal:

You have provided excellent feedback regarding naming and
documentation. Allow me to fix these things to clarify the various
functions and their roles.

John
  
Petr Mladek Oct. 6, 2023, 12:51 p.m. UTC | #5
Adding Linus into Cc. I would like to be sure about the flushing
of atomic consoles in panic context.

> During the demo at LPC2022 we had the situation that there was a large
> backlog when a WARN was hit. With current mainline the first line of the
> WARN is put into the ringbuffer and then the entire backlog is flushed
> before storing the rest of the WARN into the ringbuffer. At the time it
> was obvious that we should finish storing the WARN message and then
> start flushing the backlog.

This talks about the "emergency" context (WARN/OOPS/watchdog).
The system might be in big troubles but it would still try to continue.

Do we really want to defer the flush also for panic() context?

I ask because I was not on LPC 2022 in person and I do not remember
all details.

Anyway, the deferred flush works relatively well for the "emergency" context:

   + flushed from nbcon_atomic_exit()
   + printk kthread might emit the messages while they are being added

But it is tricky in panic(), see 8th patch at
https://lore.kernel.org/r/20230919230856.661435-9-john.ogness@linutronix.de

   + nbcon_atomic_exit() is called only in one code path.

   + nbcon_atomic_flush_all() is used in other paths. It looks like
     a "Whack a mole" game to me.

   + messages are never emitted by printk kthread either because
     CPUs are stopped or the kthread is not allowed to get the lock[*]

I see only one positive of the explicit flush. The consoles would
not delay crash_kexec() and the crash dump might be closer to
the point where panic() was called.

Otherwise I see only negatives => IMHO, we want to flush atomic
consoles synchronously from printk() in panic().

Does anyone really want explicit flushes in panic()?

[*] Emitting messages is explicitly blocked on non-panic CPUs. It
    increases the chance that the panic CPU would be able to take
    the console lock the safe way.

Best Regards,
Petr
  
Petr Mladek Oct. 6, 2023, 12:53 p.m. UTC | #6
(2nd attempt, with Linus really in Cc).

  
Petr Mladek Oct. 6, 2023, 3:52 p.m. UTC | #7
On Thu 2023-10-05 14:57:47, John Ogness wrote:
> On 2023-09-25, Petr Mladek <pmladek@suse.com> wrote:
> > From my POV:
> >
> >    + It did not mention/explain "atomic write" at all
> 
> Agreed.
> 
> >    + It said that the urgent messages required immediate printing.
> >      And later, it said that they would get flushed later. These two
> >      statements contradict each other.
> 
> I agree that the wording needs to be dramatically improved. The term
> "urgent message" was not meant to represent a single printk() call.
> 
> >    + The handling of priorities is not only about CPU nesting.
> >      The same rules should apply also when other CPU is printing
> >      messages in a higher priority context.
> 
> Sorry, I do not understand what you mean here. Each CPU is responsible
> for its own state. If another CPU is in a higher priority state, that
> CPU will be responsible for ensuring its own WARN/OOPS is stored and
> flushed.

You are right that my comment was confusing. I had in mind that
flushing one emergency context would flush all pending messages
from other CPUs and even from other emergency contexts. Your
explanation is better.

> (From v2 I see that the CPU does not try hard enough. I would
> fix that for v3.)

Yes, this should be fixed.

> There are 3 different types of priority:
> 
> 1. The printk record priority: KERN_ERR, KERN_WARNING, etc.
> 
> 2. The priority of a console owner: used for controlled takeovers.
> 
> 3. The priority state of a CPU: only elevated for urgent messages, used
> to store all urgent messages and then afterwards print directly by
> taking ownership of consoles.

I know that you want to distinguish 2. and 3. But I think that
they should be the same. I mean that nbcon_context_try_acquire()
should use the PRIO according to what context it is in.

Is there any situation where it should be different, please?

IMHO, it might simplify some logic when they are the same.

> I need to choose terminology carefully to make it easy to distinguish
> between these 3 types. v2 failed to name, describe, and document this
> correctly.

> >   + We must somewhere explain the "atomic context" and "atomic_write"
> >     callback. One important question is why it is atomic. Is it because it
> >
> >       + _can_ be called in atomic context?
> >       + _must_ be called in atomic context?
> 
> Its main reason for existing is because it can be called from atomic
> (including NMI) contexts. But like atomic_t, it can be used from any
> context.
> 
> >     It is called also from console_unlock() for boot messages
> >     so it need not be in atomic context.
> >
> >     What about renaming it to "nbcon_write" to avoid this confusion?
> 
> When we introduce the threads, there will be a 2nd callback (currently
> planned write_thread()). This callback is guaranteed to be called from a
> printing kthread, which for console drivers like fbcon will prove to be
> helpful in cleaning up its code.

I see.

> I will reserve the word "atomic" _only_ when talking about which
> printing callback is used. That should help to avoid associating the
> callback with a certain context or priority. But I think the name
> "write_atomic" is appropriate.

Sounds good.

> >> The "nesting" array is needed in order to support a feature that is not
> >> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
> >> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
> >> information to the commit message.
> >
> > What is the motivation for the feature, please?
> 
> During the demo at LPC2022 we had the situation that there was a large
> backlog when a WARN was hit. With current mainline the first line of the
> WARN is put into the ringbuffer and then the entire backlog is flushed
> before storing the rest of the WARN into the ringbuffer. At the time it
> was obvious that we should finish storing the WARN message and then
> start flushing the backlog.

This talks about a context using NBCON_PRIO_EMERGENCY. I am pretty sure
that we want to flush the messages synchronously from printk() in
panic(). Let's discuss this in the other thread with Linus in Cc [1].

> > My proposal:
> >
> > There are only two higher priority contexts:
> >
> >   + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()
> 
> This is the case with v2.

Ah, I see, nbcon_atomic_enter(NBCON_PRIO_PANIC) is called right after
setting panic_cpu. But this is actually another reason to get rid
of the nbcon_atomic_enter() call. Why?

  + The information about entering context with NBCON_PRIO_PANIC is
    already provided by panic_cpu variable.

  + Nesting is not possible because only one one context is allowed
    to acquire panic_cpu.

  + nbcon_atomic_exit() is almost useless because it is called only
    in one code path. Explicit nbcon_atomic_flush_all() calls
    are needed in other code paths.

IMHO, getting rid of nbcon_atomic_enter(NBCON_PRIO_PANIC) would have
several advantages:

  + enter()/exit() API would be needed only for NBCON_PRIO_EMERGENCY.
    We could call it nbcon_emergency_enter()/exit() and avoid
    the confusing atomic_enter name.

  + Nesting is important only for NBCON_PRIO_EMERGENCY context =>
    the per-CPU variable might be a simple reference counter =>
    easier code and no need to remember the previous value
    and pass it to _exit() as a parameter (see the sketch below).
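
A minimal sketch of such a counter-only API, assuming the names
nbcon_emergency_enter()/exit() and leaving the actual flush call as a
placeholder (none of this is code from the series):

static DEFINE_PER_CPU(unsigned int, nbcon_emergency_nesting);

void nbcon_emergency_enter(void)
{
	/* Stay on this CPU for the whole emergency section. */
	preempt_disable();
	this_cpu_inc(nbcon_emergency_nesting);
}

void nbcon_emergency_exit(void)
{
	/* Only the outermost exit would flush the stored messages. */
	if (this_cpu_dec_return(nbcon_emergency_nesting) == 0) {
		/* ... flush the backlog stored by this section ... */
	}
	preempt_enable();
}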

> >   + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
> >     and tracking. But it does not necessarily need to be per-CPU
> >     variable.
> <
> > I think about adding "int printk_state" into struct task_struct.
> > It might be useful also for other things, e.g. for storing the last
> > log level of non-finished message. Entering section with enforced
> > minimal loglevel or so.
> 
> printk() calls are quite rare. And warning states are even more
> rare. IMHO adding such a field to every task is a huge price to
> pay. Also, printk operates in non-task contexts (hardirq, nmi). Although
> @current probably points to something existing, there could be some
> tricky corner cases.

Fair enough. Per-CPU variables are fine after all.

> My proposal:
> 
> You have provided excellent feedback regarding naming and
> documentation. Allow me to fix these things to clarify the various
> functions and their roles.

We should first agree on the approach in panic() in the other thread [1].

IMHO, the context using NBCON_PRIO_EMERGENCY should be the only one
where we need enter/exit and the deferred flush. In this case,
the per-CPU variable would be a simple nesting counter. And maybe
the explicit flush would be needed only in emergency context.
=> easier code, logic, naming.

[1] https://lore.kernel.org/r/ZSADUKp8oJ2Ws2vC@alley

Best Regards,
Petr
  
John Ogness Oct. 8, 2023, 10:13 a.m. UTC | #8
Hi Petr,

On 2023-10-06, Petr Mladek <pmladek@suse.com> wrote:
>> During the demo at LPC2022 we had the situation that there was a large
>> backlog when a WARN was hit. With current mainline the first line of the
>> WARN is put into the ringbuffer and then the entire backlog is flushed
>> before storing the rest of the WARN into the ringbuffer. At the time it
>> was obvious that we should finish storing the WARN message and then
>> start flushing the backlog.
>
> This talks about the "emergency" context (WARN/OOPS/watchdog).
> The system might be in big troubles but it would still try to continue.
>
> Do we really want to defer the flush also for panic() context?

We can start flushing right after the backtrace is in the
ringbuffer. But flushing the backlog _before_ putting the backtrace into
the ringbuffer was not desired because if there is a large backlog, the
machine may not survive to get the backtrace out. And in that case it
won't even be in the ringbuffer to be used by other debugging
tools.

> I ask because I was not on LPC 2022 in person and I do not remember
> all details.

The LPC2022 demo/talk was recorded:

https://www.youtube.com/watch?v=TVhNcKQvzxI

At 55:55 is where the situation occurred and triggered the conversation,
ultimately leading to this new feature.

You may also want to reread my summary:

https://lore.kernel.org/lkml/875yheqh6v.fsf@jogness.linutronix.de

as well as Linus' follow-up message:

https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com

> But it is tricky in panic(), see 8th patch at
> https://lore.kernel.org/r/20230919230856.661435-9-john.ogness@linutronix.de
>
>    + nbcon_atomic_exit() is called only in one code path.

Correct. When panic() is complete and the machine goes into its infinite
loop. This is also the point where it will attempt an unsafe flush, if
it could not get the messages out yet.

>    + nbcon_atomic_flush_all() is used in other paths. It looks like
>      a "Whack a mole" game to me.

Several different outputs occur during panic(). The flush is everywhere
where something significant has been put into the ringbuffer and now it
would be OK to flush it.

>    + messages are never emitted by printk kthread either because
>      CPUs are stopped or the kthread is not allowed to get the lock

Correct.

> I see only one positive of the explicit flush. The consoles would
> not delay crash_exec() and the crash dump might be closer to
> the point where panic() was called.

It's only about getting the critical messages into the ringbuffer before
flushing. And since various things can go wrong during the many actions
within panic(), it makes sense to flush in between those actions.

> Otherwise I see only negatives => IMHO, we want to flush atomic
> consoles synchronously from printk() in panic().
>
> Does anyone really want explicit flushes in panic()?

So far you are the only one speaking against it. I expect as time goes
on it will get even more complex as it becomes tunable (also something
we talked about during the demo).

John
  
Petr Mladek Oct. 9, 2023, 3:32 p.m. UTC | #9
On Sun 2023-10-08 12:19:21, John Ogness wrote:
> Hi Petr,
> 
> On 2023-10-06, Petr Mladek <pmladek@suse.com> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
> 
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.

We really have to distinguish emergency and panic context!

> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
> 
> The LPC2022 demo/talk was recorded:
> 
> https://www.youtube.com/watch?v=TVhNcKQvzxI
> 
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.

Thanks for the link. My understanding is that the following scenario
has happened:

1. System was booting and messages were being flushed using the kthread.

2. WARN() happened. It printed the 1st line, took over the per-console
   console lock and started flushing the backlog. There were still
   many pending messages from the boot.

3. NMI came and caused panic(). The panic context printed its first line,
   took over the console from the WARN context, flushed the rest
   of the backlog and continued printing/flushing more messages from
   the panic() context.


Problem:

People complained that they saw only the first line from WARN().
The related detailed info, including backtrace, was missing.

It was because panic() interrupted WARN() before it was able
to flush the backlog and print/flush all WARN() messages.


Proposed solution:

WARN()/emergency context should first store the messages and
flush them at the end.

It would increase the chance that all WARN() messages will
be stored in the ring buffer before NMI/panic() is called.

panic() would then flush all pending messages including
the stored WARN() messages.


Important:

The problem is that panic() interrupted WARN().

There is _no_ problem with interrupting panic().
Let me repeat, nested panic() is not possible!


> You may also want to reread my summary:
>
> https://lore.kernel.org/lkml/875yheqh6v.fsf@jogness.linutronix.de

Again, thanks for the pointer. Let me paste 2 paragraphs here:

<paste>
- Printing the backlog is important! If some emergency situation occurs,
  make sure the backlog gets printed.

- When an emergency occurs, put the full backtrace into the ringbuffer
  before flushing any backlog. This ensures that the backtrace makes it
  into the ringbuffer in case a panic occurs while flushing the backlog.
</paste>

My understanding is:

1st paragraph is the reason why:

   + we have three priorities: normal, emergency, panic

   + messages in normal context might be offloaded to kthread

   + emergency and panic context should try to flush the messages
     from this context.


2nd paragraph talks about that emergency context should first store
the messages and flush them later. And the important part:

     "in case a panic occurs while flushing the backlog.

     => panic might interrupt emergency

It clearly distinguishes emergency and panic context.


> as well as Linus' follow-up message:
> 
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com

IMHO, the important part is:

<paste>
Yeah, I really liked the notion of doing the oops with just filling
the back buffer but still getting it printed out if something goes
wrong in the middle.
</paste>

He was talking about oops => emergency context

Also he wanted to get it out when something goes wrong in the middle
   => panic in the middle ?


And another paragraph:

<paste>
I doubt it ends up being an issue in practice, but basically I wanted
to just pipe up and say that the exact details of how much of the back
buffer needs to be flushed first _could_ be tweaked if it ever does
come up as an issue.
</paste>

Linus had doubts that there might be problems with too long backlog
in practice. And I have the doubts as well.

And this is my view. The deferred flush is trying to solve a corner
case and we are forgetting what blocked printk kthreads for >10 years.

> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/20230919230856.661435-9-john.ogness@linutronix.de
> >
> >    + nbcon_atomic_exit() is called only in one code path.
> 
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
> 
> >    + nbcon_atomic_flush_all() is used in other paths. It looks like
> >      a "Whack a mole" game to me.
> 
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.

No, we could _never_ say that the flush is everywhere where something
important happens.

panic() might fail anywhere. The last printed message might be
an important clue. And it can be any message.


> >    + messages are never emitted by printk kthread either because
> >      CPUs are stopped or the kthread is not allowed to get the lock
> 
> Correct.
> 
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_kexec() and the crash dump might be closer to
> > the point where panic() was called.
> 
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.

I am glad that we agree that "various things can go wrong during panic".

Are we sure that the patchset added the explicit flush right after
each possible problematic place? What about crash_kexec(),
various kmsg dumpers, notifiers?


> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
> 
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

From my POV:

1. We are just two people discussing it at the moment
      => word against word.

2. Please, try to read my reply again. I agreed with delaying the
   flush in emergency context.

   But I am strongly against explicit flushes during panic at
   least in the initial implementation.


3. IMHO, there might be a lot of misunderstanding caused by
   the word "emergency" context. Some people might see
   it as OOPs/WARN/panic and other might want to handle
   panic special way.

   I see it:

      + emergency - huge danger, medical check needed, patient might
	      survive

      + panic - patient is about to die, try to get some secrets from
	      him before he dies.

   Any sentence of the secret might be important. It would be a pity
   to ignore his last A4 page just because it was not complete.


Sigh, I did my best to explain why the nesting problem is only in
the emergency context and why panic is different.

Feel free to ask if anything is still unclear.


Anyway, I am _not_ going to accept the explicit flushes in panic()
unless you show me a problem with interrupted panic() or
Linus explicitly says that the explicit flushes make sense.

Best Regards,
Petr
  
John Ogness Oct. 10, 2023, 4:02 p.m. UTC | #10
On 2023-10-09, Petr Mladek <pmladek@suse.com> wrote:
> We really have to distinguish emergency and panic context!

OK.

>> The LPC2022 demo/talk was recorded:
>> 
>> https://www.youtube.com/watch?v=TVhNcKQvzxI
>> 
>> At 55:55 is where the situation occurred and triggered the conversation,
>> ultimately leading to this new feature.
>
> Thanks for the link. My understanding is that the following scenario
> has happened:
>
> 1. System was booting and messages were being flushed using the kthread.
>
> 2. WARN() happened. It printed the 1st line, took over the per-console
>    console lock and started flushing the backlog. There were still
>    many pending messages from the boot.
>
> 3. NMI came and caused panic(). The panic context printed its first line,
>    took over the console from the WARN context, flushed the rest
>    of the backlog and continued printing/flushing more messages from
>    the panic() context.
>
>
> Problem:
>
> People complained that they saw only the first line from WARN().
> The related detailed info, including backtrace, was missing.
>
> It was because panic() interrupted WARN() before it was able
> to flush the backlog and print/flush all WARN() messages.

Thanks for taking the time to review it in detail.

> Proposed solution:
>
> WARN()/emergency context should first store the messages and
> flush them at the end.
>
> It would increase the chance that all WARN() messages will
> be stored in the ring buffer before NMI/panic() is called.
>
> panic() would then flush all pending messages including
> the stored WARN() messages.

OK.

> Important:
>
> The problem is that panic() interrupted WARN().

Ack.

>> You may also want to reread my summary:
>>
>> https://lore.kernel.org/lkml/875yheqh6v.fsf@jogness.linutronix.de
>
> Again, thanks for the pointer. Let me paste 2 paragraphs here:
>
> <paste>
> - Printing the backlog is important! If some emergency situation occurs,
>   make sure the backlog gets printed.
>
> - When an emergency occurs, put the full backtrace into the ringbuffer
>   before flushing any backlog. This ensures that the backtrace makes it
>   into the ringbuffer in case a panic occurs while flushing the backlog.
> </paste>
>
> My understanding is:
>
> 1st paragraph is the reason why:
>
>    + we have three priorities: normal, emergency, panic
>
>    + messages in normal context might be offloaded to kthread
>
>    + emergency and panic context should try to flush the messages
>      from this context.
>
>
> 2nd paragraph talks about that emergency context should first store
> the messages and flush them later. And the important part:
>
>      "in case a panic occurs while flushing the backlog.
>
>      => panic might interrupt emergency
>
> It clearly distinguishes emergency and panic context.
>
>
>> as well as Linus' follow-up message:
>> 
>> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com
>
> IMHO, the important part is:
>
> <paste>
> Yeah, I really liked the notion of doing the oops with just filling
> the back buffer but still getting it printed out if something goes
> wrong in the middle.
> </paste>
>
> He was talking about oops => emergency context
>
> Also he wanted to get it out when something goes wrong in the middle
>    => panic in the middle ?
>
>
> And another paragraph:
>
> <paste>
> I doubt it ends up being an issue in practice, but basically I wanted
> to just pipe up and say that the exact details of how much of the back
> buffer needs to be flushed first _could_ be tweaked if it ever does
> come up as an issue.
> </paste>
>
> Linus had doubts that there might be problems with too long backlog
> in practice. And I have the doubts as well.
>
> And this is my view. The deferred flush is trying to solve a corner
> case and we are forgetting what blocked printk kthreads >10 years.

OK. Thank you for the detailed analysis.

For v3 I will do something similar to what you proposed [0], except that
I will use a per-cpu variable (to track printk emergency nesting)
instead of adding a new field to the task struct.
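
Roughly, the resulting default-priority lookup could end up as small as
the sketch below (the per-CPU variable name is only a placeholder here;
panic_cpu is the existing atomic_t from kernel/panic.c):

static DEFINE_PER_CPU(unsigned int, printk_emergency_nesting);

enum nbcon_prio nbcon_get_default_prio(void)
{
	/* The panic CPU always prints with the highest priority. */
	if (atomic_read(&panic_cpu) == raw_smp_processor_id())
		return NBCON_PRIO_PANIC;

	/* Inside an emergency section on this CPU? */
	if (this_cpu_read(printk_emergency_nesting))
		return NBCON_PRIO_EMERGENCY;

	return NBCON_PRIO_NORMAL;
}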

John Ogness

[0] https://lore.kernel.org/lkml/ZRLBxsXPCym2NC5Q@alley/
  
Dave Young Oct. 16, 2023, 8:58 a.m. UTC | #11
[Added more people in cc]

On 10/08/23 at 12:19pm, John Ogness wrote:
> Hi Petr,
> 
> On 2023-10-06, Petr Mladek <pmladek@suse.com> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
> 
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.
> 
> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
> 
> The LPC2022 demo/talk was recorded:
> 
> https://www.youtube.com/watch?v=TVhNcKQvzxI
> 
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.
> 
> You may also want to reread my summary:
> 
> https://lore.kernel.org/lkml/875yheqh6v.fsf@jogness.linutronix.de
> 
> as well as Linus' follow-up message:
> 
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com
> 
> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/20230919230856.661435-9-john.ogness@linutronix.de
> >
> >    + nbcon_atomic_exit() is called only in one code path.
> 
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
> 
> >    + nbcon_atomic_flush_all() is used in other paths. It looks like
> >      a "Whack a mole" game to me.
> 
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.
> 
> >    + messages are never emitted by printk kthread either because
> >      CPUs are stopped or the kthread is not allowed to get the lock
> 
> Correct.
> 
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_kexec() and the crash dump might be closer to
> > the point where panic() was called.
> 
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.
> 
> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
> 
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

Flushing consoles in the panic kexec case does not sound good, but I have
no deep understanding of the atomic printk series, so I have added the
kexec list and reviewers in Cc.

> 
> John
> 

Thanks
Dave
  
John Ogness Oct. 16, 2023, 10:09 a.m. UTC | #12
Hi Dave,

On 2023-10-16, Dave Young <dyoung@redhat.com> wrote:
>> > Does anyone really want explicit flushes in panic()?
>> 
>> So far you are the only one speaking against it. I expect as time
>> goes on it will get even more complex as it becomes tunable (also
>> something we talked about during the demo).
>
> Flushing consoles in the panic kexec case does not sound good, but I have
> no deep understanding of the atomic printk series, so I have added the
> kexec list and reviewers in Cc.

Currently every printk() message tries to flush immediately.

This series introduced a new method of first allowing a set of printk()
messages to be stored to the ringbuffer and then flushing the full
set. That is what this discussion was about.

The issue with allowing a set of printk() messages to be stored is that
you need to explicitly mark in code where the actual flushing should
occur. Petr's argument is that we do not want to insert "flush points"
into the panic() function and instead we should do as we do now: flush
each printk() message immediately.

In the end (for my upcoming v3 series) I agreed with Petr. We will
continue to keep things as they are now: flush each printk() message
immediately.

Currently consoles try to flush unsafely before kexec. With the atomic
printk series our goal is to only perform _safe_ flushing until all
other panic operations are complete. Only at the very end of panic()
would unsafe flushing be attempted.

John
  

Patch

diff --git a/include/linux/console.h b/include/linux/console.h
index e4fc6f7c1496..25a3ddd39083 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -452,10 +452,14 @@  static inline bool console_is_registered(const struct console *con)
 	hlist_for_each_entry(con, &console_list, node)
 
 #ifdef CONFIG_PRINTK
+extern enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio);
+extern void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio);
 extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
 extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
 extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
 #else
+static inline enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio) { return NBCON_PRIO_NONE; }
+static inline void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio) { }
 static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
 static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
 static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index b96077152f49..9359906b575b 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -961,6 +961,95 @@  static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
 	return nbcon_context_exit_unsafe(ctxt);
 }
 
+/**
+ * struct nbcon_cpu_state - Per CPU printk context state
+ * @prio:	The current context priority level
+ * @nesting:	Per priority nest counter
+ */
+struct nbcon_cpu_state {
+	enum nbcon_prio		prio;
+	int			nesting[NBCON_PRIO_MAX];
+};
+
+static DEFINE_PER_CPU(struct nbcon_cpu_state, nbcon_pcpu_state);
+static struct nbcon_cpu_state early_nbcon_pcpu_state __initdata;
+
+/**
+ * nbcon_get_cpu_state - Get the per CPU console state pointer
+ *
+ * Returns either a pointer to the per CPU state of the current CPU or to
+ * the init data state during early boot.
+ */
+static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
+{
+	if (!printk_percpu_data_ready())
+		return &early_nbcon_pcpu_state;
+
+	return this_cpu_ptr(&nbcon_pcpu_state);
+}
+
+/**
+ * nbcon_atomic_enter - Enter a context that enforces atomic printing
+ * @prio:	Priority of the context
+ *
+ * Return:	The previous priority that needs to be fed into
+ *		the corresponding nbcon_atomic_exit()
+ * Context:	Any context. Disables migration.
+ */
+enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
+{
+	struct nbcon_cpu_state *cpu_state;
+	enum nbcon_prio prev_prio;
+
+	migrate_disable();
+
+	cpu_state = nbcon_get_cpu_state();
+
+	prev_prio = cpu_state->prio;
+	if (prio > prev_prio)
+		cpu_state->prio = prio;
+
+	/*
+	 * Increment the nesting on @cpu_state->prio (instead of
+	 * @prio) so that a WARN() nested within a panic printout
+	 * does not attempt to scribble state.
+	 */
+	cpu_state->nesting[cpu_state->prio]++;
+
+	return prev_prio;
+}
+
+/**
+ * nbcon_atomic_exit - Exit a context that enforces atomic printing
+ * @prio:	Priority of the context to leave
+ * @prev_prio:	Priority of the previous context for restore
+ *
+ * Context:	Any context. Enables migration.
+ *
+ * @prev_prio is the priority returned by the corresponding
+ * nbcon_atomic_enter().
+ */
+void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
+{
+	struct nbcon_cpu_state *cpu_state;
+
+	cpu_state = nbcon_get_cpu_state();
+
+	/*
+	 * Undo the nesting of nbcon_atomic_enter() at the CPU state
+	 * priority.
+	 */
+	cpu_state->nesting[cpu_state->prio]--;
+
+	/*
+	 * Restore the previous priority, which was returned by
+	 * nbcon_atomic_enter().
+	 */
+	cpu_state->prio = prev_prio;
+
+	migrate_enable();
+}
+
 /**
  * nbcon_alloc - Allocate buffers needed by the nbcon console
  * @con:	Console to allocate buffers for