[printk,v4,00/14] fix console flushing

Message ID 20240207134103.1357162-1-john.ogness@linutronix.de
Headers
Series fix console flushing |

Message

John Ogness Feb. 7, 2024, 1:40 p.m. UTC
  Hi,

While testing various flushing scenarios, I stumbled on a few
issues that cause console flushing to fail. While at LPC2023 in
Richmond, I sat down with Petr Mladek and we reviewed the
v2 [0] series. This series is the result of that offline
discussion. v3 is here [1].

This series addresses the following issues:

1. The prb_next_seq() optimization caused inconsistent return
   values. Fix prb_next_seq() to the originally intended
   behavior but keep an optimization.

2. pr_flush() might not wait until the most recently stored
   printk() message if non-finalized records precede it. Fix
   pr_flush() to wait for all records to print that are at
   least reserved at the time of the call.

3. In panic, the panic messages will not print if non-finalized
   records precede them. Add a special condition so that
   readers on the panic CPU will drop records that are not in
   a consistent state.

4. It is possible (and easy to reproduce) a scenario where the
   console on the panic CPU hands over to a waiter of a stopped
   CPU. Do not use the handover feature in panic.

5. If messages are being dropped during panic, non-panic CPUs
   are silenced. But by then it is already too late and most
   likely the panic messages have been dropped. Change the
   non-panic CPU silencing logic to _immediately_ silence
   non-panic CPUs during panic. This also leads to clean panic
   output when many CPUs are blasting the kernel log.

6. If a panic occurs in a context where printk() calls defer
   printing (NMI or printk_safe section), the printing of the
   final panic messages rely on irq_work. If that mechanism is
   not available, the final panic messages are not seen (even
   though they are finalized in the ringbuffer). Add one last
   explicit flush after all printk() calls are finished to
   ensure all available messages in the kernel log are printed.

7. When dumping the stacktrace from panic(), do not use the
   printk_cpu_sync because it can deadlock if another CPU holds
   and is unable to release the printk_cpu_sync.

This series also performs some minor cleanups to remove open
coded checks about the panic context and improve documentation
language regarding data-less records.

Because of multiple refactoring done in recent history, it
would be helpful to provide the LTS maintainers with the proper
backported patches. I am happy to do this.

The changes since v3:

- Drop patch 11 of v3 ("printk: ringbuffer: Consider committed
  as finalized in panic"). It adds complexity, may not perform
  as expected, and it is questionable as to whether it would
  help provide useful messages on panic.

- Changed several comments according to the suggestions by
  Petr. Note that I did not include all the suggestions
  because IMO they were too vague in describing the related
  memory barriers.

- Add a patch to avoid using printk_cpu_sync on panic(). This
  was recently discussed [2]. Feel free to drop the patch 
  14/14 ("dump_stack: Do not get cpu_sync for panic CPU") if
  you think it is not appropriate or will significantly delay
  this series.

John Ogness

[0] https://lore.kernel.org/lkml/20231106210730.115192-1-john.ogness@linutronix.de
[1] https://lore.kernel.org/lkml/20231214214201.499426-1-john.ogness@linutronix.de
[2] https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley

John Ogness (12):
  printk: nbcon: Relocate 32bit seq macros
  printk: Use prb_first_seq() as base for 32bit seq macros
  printk: ringbuffer: Do not skip non-finalized records with
    prb_next_seq()
  printk: ringbuffer: Clarify special lpos values
  printk: For @suppress_panic_printk check for other CPU in panic
  printk: Add this_cpu_in_panic()
  printk: ringbuffer: Cleanup reader terminology
  printk: Wait for all reserved records with pr_flush()
  printk: ringbuffer: Skip non-finalized records in panic
  printk: Avoid non-panic CPUs writing to ringbuffer
  panic: Flush kernel log buffer at the end
  dump_stack: Do not get cpu_sync for panic CPU

Petr Mladek (1):
  printk: Disable passing console lock owner completely during panic()

Sebastian Andrzej Siewior (1):
  printk: Adjust mapping for 32bit seq macros

 include/linux/printk.h            |   2 +
 kernel/panic.c                    |   8 +
 kernel/printk/nbcon.c             |  41 +---
 kernel/printk/printk.c            | 101 +++++----
 kernel/printk/printk_ringbuffer.c | 335 +++++++++++++++++++++++++-----
 kernel/printk/printk_ringbuffer.h |  54 ++++-
 lib/dump_stack.c                  |  16 +-
 7 files changed, 419 insertions(+), 138 deletions(-)


base-commit: 6c3a34e38436a2a3f7a1fa764c108ee19b05b893
  

Comments

Petr Mladek Feb. 7, 2024, 4:18 p.m. UTC | #1
On Wed 2024-02-07 14:46:49, John Ogness wrote:
> Hi,
> 
> While testing various flushing scenarios, I stumbled on a few
> issues that cause console flushing to fail. While at LPC2023 in
> Richmond, I sat down with Petr Mladek and we reviewed the
> v2 [0] series. This series is the result of that offline
> discussion. v3 is here [1].
> 
> This series addresses the following issues:
> 
> 1. The prb_next_seq() optimization caused inconsistent return
>    values. Fix prb_next_seq() to the originally intended
>    behavior but keep an optimization.
> 
> 2. pr_flush() might not wait until the most recently stored
>    printk() message if non-finalized records precede it. Fix
>    pr_flush() to wait for all records to print that are at
>    least reserved at the time of the call.
> 
> 3. In panic, the panic messages will not print if non-finalized
>    records precede them. Add a special condition so that
>    readers on the panic CPU will drop records that are not in
>    a consistent state.
> 
> 4. It is possible (and easy to reproduce) a scenario where the
>    console on the panic CPU hands over to a waiter of a stopped
>    CPU. Do not use the handover feature in panic.
> 
> 5. If messages are being dropped during panic, non-panic CPUs
>    are silenced. But by then it is already too late and most
>    likely the panic messages have been dropped. Change the
>    non-panic CPU silencing logic to _immediately_ silence
>    non-panic CPUs during panic. This also leads to clean panic
>    output when many CPUs are blasting the kernel log.
> 
> 6. If a panic occurs in a context where printk() calls defer
>    printing (NMI or printk_safe section), the printing of the
>    final panic messages rely on irq_work. If that mechanism is
>    not available, the final panic messages are not seen (even
>    though they are finalized in the ringbuffer). Add one last
>    explicit flush after all printk() calls are finished to
>    ensure all available messages in the kernel log are printed.
> 
> 7. When dumping the stacktrace from panic(), do not use the
>    printk_cpu_sync because it can deadlock if another CPU holds
>    and is unable to release the printk_cpu_sync.
> 
> This series also performs some minor cleanups to remove open
> coded checks about the panic context and improve documentation
> language regarding data-less records.
> 
> Because of multiple refactoring done in recent history, it
> would be helpful to provide the LTS maintainers with the proper
> backported patches. I am happy to do this.

The series seems to be ready linux-next from my POV. I am going
to push it there so that it gets as much testing before
the merge window as possible.

Best Regards,
Petr
  
Petr Mladek Feb. 7, 2024, 4:53 p.m. UTC | #2
On Wed 2024-02-07 17:18:38, Petr Mladek wrote:
> On Wed 2024-02-07 14:46:49, John Ogness wrote:
> > Hi,
> > 
> > While testing various flushing scenarios, I stumbled on a few
> > issues that cause console flushing to fail. While at LPC2023 in
> > Richmond, I sat down with Petr Mladek and we reviewed the
> > v2 [0] series. This series is the result of that offline
> > discussion. v3 is here [1].
> > 
> > This series addresses the following issues:
> > 
> > 1. The prb_next_seq() optimization caused inconsistent return
> >    values. Fix prb_next_seq() to the originally intended
> >    behavior but keep an optimization.
> > 
> > 2. pr_flush() might not wait until the most recently stored
> >    printk() message if non-finalized records precede it. Fix
> >    pr_flush() to wait for all records to print that are at
> >    least reserved at the time of the call.
> > 
> > 3. In panic, the panic messages will not print if non-finalized
> >    records precede them. Add a special condition so that
> >    readers on the panic CPU will drop records that are not in
> >    a consistent state.
> > 
> > 4. It is possible (and easy to reproduce) a scenario where the
> >    console on the panic CPU hands over to a waiter of a stopped
> >    CPU. Do not use the handover feature in panic.
> > 
> > 5. If messages are being dropped during panic, non-panic CPUs
> >    are silenced. But by then it is already too late and most
> >    likely the panic messages have been dropped. Change the
> >    non-panic CPU silencing logic to _immediately_ silence
> >    non-panic CPUs during panic. This also leads to clean panic
> >    output when many CPUs are blasting the kernel log.
> > 
> > 6. If a panic occurs in a context where printk() calls defer
> >    printing (NMI or printk_safe section), the printing of the
> >    final panic messages rely on irq_work. If that mechanism is
> >    not available, the final panic messages are not seen (even
> >    though they are finalized in the ringbuffer). Add one last
> >    explicit flush after all printk() calls are finished to
> >    ensure all available messages in the kernel log are printed.
> > 
> > 7. When dumping the stacktrace from panic(), do not use the
> >    printk_cpu_sync because it can deadlock if another CPU holds
> >    and is unable to release the printk_cpu_sync.
> > 
> > This series also performs some minor cleanups to remove open
> > coded checks about the panic context and improve documentation
> > language regarding data-less records.
> > 
> > Because of multiple refactoring done in recent history, it
> > would be helpful to provide the LTS maintainers with the proper
> > backported patches. I am happy to do this.
> 
> The series seems to be ready linux-next from my POV. I am going
> to push it there so that it gets as much testing before
> the merge window as possible.

JFYI, it has been committed into printk/linux.git,
branch rework/console-flushing-fixes.

Let's keep our fingers crossed.

Best Regards,
Petr