[v19,6/7] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()

Message ID 20230306162228.8277-7-eric.devolder@oracle.com
State New
Headers
Series crash: Kernel handling of CPU and memory hot un/plug |

Commit Message

Eric DeVolder March 6, 2023, 4:22 p.m. UTC
  The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the cpus and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
cpus in the system.

With respect to the cpus, the current implementation utilizes
for_each_present_cpu() which means that as cpus are added and removed,
the elfcorehdr must again be updated to reflect the new set of cpus.

The reasoning behind the change to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible cpus; that is, crash_notes are not allocated dynamically
  when cpus are plugged/unplugged. Thus the crash_notes for each
  possible cpu are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each cpu PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF cpu PT_NOTEs are utilized:

- Upon panic, each cpu is sent an IPI and shuts itself down, recording
 its state in its crash_notes. When all cpus are shutdown, the
 crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
 use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the cpu
 PT_NOTEs to craft a nr_cpus variable, which is reported in a
 header but otherwise generally unused. Makedumpfile creates the
 vmcore.

- The 'crash' dump analyzer does not appear to reference the cpu
 PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
 symbols and directly examines those structure contents from vmcore
 memory. From that information it is able to determine which cpus
 are present and online, and locate the corresponding crash_notes.
 Said differently, it appears that 'crash' analyzer does not rely
 on the ELF PT_NOTEs for cpus; rather it obtains the information
 directly via kernel symbols and the memory within the vmcore.

(There maybe other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seems to be the most
common solution.)

This change results in the benefit of having all cpus described in
the elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on cpu changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible cpus.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' cpus in the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or cpu change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible cpus will
be described, just as with kexec_file_load() syscall.

Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
---
 kernel/crash_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Sourabh Jain March 7, 2023, 8:48 a.m. UTC | #1
Hello Eric,

On 06/03/23 21:52, Eric DeVolder wrote:
> The function crash_prepare_elf64_headers() generates the elfcorehdr
> which describes the cpus and memory in the system for the crash kernel.
> In particular, it writes out ELF PT_NOTEs for memory regions and the
> cpus in the system.
>
> With respect to the cpus, the current implementation utilizes
> for_each_present_cpu() which means that as cpus are added and removed,
> the elfcorehdr must again be updated to reflect the new set of cpus.
>
> The reasoning behind the change to use for_each_possible_cpu(), is:
>
> - At kernel boot time, all percpu crash_notes are allocated for all
>    possible cpus; that is, crash_notes are not allocated dynamically
>    when cpus are plugged/unplugged. Thus the crash_notes for each
>    possible cpu are always available.
>
> - The crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu.
>    Changing to for_each_possible_cpu() is valid as the crash_notes
>    pointed to by each cpu PT_NOTE are present and always valid.
>
> Furthermore, examining a common crash processing path of:
>
>   kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
>             elfcorehdr      /proc/vmcore     vmcore
>
> reveals how the ELF cpu PT_NOTEs are utilized:
>
> - Upon panic, each cpu is sent an IPI and shuts itself down, recording
>   its state in its crash_notes. When all cpus are shutdown, the
>   crash kernel is launched with a pointer to the elfcorehdr.
>
> - The crash kernel via linux/fs/proc/vmcore.c does not examine or
>   use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.
>
> - The makedumpfile utility uses /proc/vmcore and reads the cpu
>   PT_NOTEs to craft a nr_cpus variable, which is reported in a
>   header but otherwise generally unused. Makedumpfile creates the
>   vmcore.
>
> - The 'crash' dump analyzer does not appear to reference the cpu
>   PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
>   symbols and directly examines those structure contents from vmcore
>   memory. From that information it is able to determine which cpus
>   are present and online, and locate the corresponding crash_notes.
>   Said differently, it appears that 'crash' analyzer does not rely
>   on the ELF PT_NOTEs for cpus; rather it obtains the information
>   directly via kernel symbols and the memory within the vmcore.
>
> (There maybe other vmcore generating and analysis tools that do use
> these PT_NOTEs, but 'makedumpfile' and 'crash' seems to be the most
> common solution.)
>
> This change results in the benefit of having all cpus described in
> the elfcorehdr, and therefore reducing the need to re-generate the
> elfcorehdr on cpu changes, at the small expense of an additional
> 56 bytes per PT_NOTE for not-present-but-possible cpus.
>
> On systems where kexec_file_load() syscall is utilized, all the above
> is valid. On systems where kexec_load() syscall is utilized, there
> may be the need for the elfcorehdr to be regenerated once. The reason
> being that some archs only populate the 'present' cpus in the
> /sys/devices/system/cpus entries, which the userspace 'kexec' utility
> uses to generate the userspace-supplied elfcorehdr. In this situation,
> one memory or cpu change will rewrite the elfcorehdr via the
> crash_prepare_elf64_headers() function and now all possible cpus will
> be described, just as with kexec_file_load() syscall.
>
> Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
> ---
>   kernel/crash_core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index dba4b75f7541..537b199a8774 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
>   	ehdr->e_phentsize = sizeof(Elf64_Phdr);
>   
>   	/* Prepare one phdr of type PT_NOTE for each present CPU */
We need to change this comment as well.
> -	for_each_present_cpu(cpu) {
> +	for_each_possible_cpu(cpu) {
- Sourabh Jain
  
Eric DeVolder March 17, 2023, 7:12 p.m. UTC | #2
On 3/7/23 02:48, Sourabh Jain wrote:
> Hello Eric,
> 
> On 06/03/23 21:52, Eric DeVolder wrote:
>> The function crash_prepare_elf64_headers() generates the elfcorehdr
>> which describes the cpus and memory in the system for the crash kernel.
>> In particular, it writes out ELF PT_NOTEs for memory regions and the
>> cpus in the system.
>>
>> With respect to the cpus, the current implementation utilizes
>> for_each_present_cpu() which means that as cpus are added and removed,
>> the elfcorehdr must again be updated to reflect the new set of cpus.
>>
>> The reasoning behind the change to use for_each_possible_cpu(), is:
>>
>> - At kernel boot time, all percpu crash_notes are allocated for all
>>    possible cpus; that is, crash_notes are not allocated dynamically
>>    when cpus are plugged/unplugged. Thus the crash_notes for each
>>    possible cpu are always available.
>>
>> - The crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu.
>>    Changing to for_each_possible_cpu() is valid as the crash_notes
>>    pointed to by each cpu PT_NOTE are present and always valid.
>>
>> Furthermore, examining a common crash processing path of:
>>
>>   kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
>>             elfcorehdr      /proc/vmcore     vmcore
>>
>> reveals how the ELF cpu PT_NOTEs are utilized:
>>
>> - Upon panic, each cpu is sent an IPI and shuts itself down, recording
>>   its state in its crash_notes. When all cpus are shutdown, the
>>   crash kernel is launched with a pointer to the elfcorehdr.
>>
>> - The crash kernel via linux/fs/proc/vmcore.c does not examine or
>>   use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.
>>
>> - The makedumpfile utility uses /proc/vmcore and reads the cpu
>>   PT_NOTEs to craft a nr_cpus variable, which is reported in a
>>   header but otherwise generally unused. Makedumpfile creates the
>>   vmcore.
>>
>> - The 'crash' dump analyzer does not appear to reference the cpu
>>   PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
>>   symbols and directly examines those structure contents from vmcore
>>   memory. From that information it is able to determine which cpus
>>   are present and online, and locate the corresponding crash_notes.
>>   Said differently, it appears that 'crash' analyzer does not rely
>>   on the ELF PT_NOTEs for cpus; rather it obtains the information
>>   directly via kernel symbols and the memory within the vmcore.
>>
>> (There maybe other vmcore generating and analysis tools that do use
>> these PT_NOTEs, but 'makedumpfile' and 'crash' seems to be the most
>> common solution.)
>>
>> This change results in the benefit of having all cpus described in
>> the elfcorehdr, and therefore reducing the need to re-generate the
>> elfcorehdr on cpu changes, at the small expense of an additional
>> 56 bytes per PT_NOTE for not-present-but-possible cpus.
>>
>> On systems where kexec_file_load() syscall is utilized, all the above
>> is valid. On systems where kexec_load() syscall is utilized, there
>> may be the need for the elfcorehdr to be regenerated once. The reason
>> being that some archs only populate the 'present' cpus in the
>> /sys/devices/system/cpus entries, which the userspace 'kexec' utility
>> uses to generate the userspace-supplied elfcorehdr. In this situation,
>> one memory or cpu change will rewrite the elfcorehdr via the
>> crash_prepare_elf64_headers() function and now all possible cpus will
>> be described, just as with kexec_file_load() syscall.
>>
>> Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
>> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
>> ---
>>   kernel/crash_core.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index dba4b75f7541..537b199a8774 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
>>       ehdr->e_phentsize = sizeof(Elf64_Phdr);
>>       /* Prepare one phdr of type PT_NOTE for each present CPU */
> We need to change this comment as well.
Done, thanks!
eric

>> -    for_each_present_cpu(cpu) {
>> +    for_each_possible_cpu(cpu) {
> - Sourabh Jain
  

Patch

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index dba4b75f7541..537b199a8774 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -365,7 +365,7 @@  int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
 	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;