[v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

Message ID 20240124084014.1772906-1-linmiaohe@huawei.com
State New

Commit Message

Miaohe Lin Jan. 24, 2024, 8:40 a.m. UTC
While running a soft offline stress test, a machine crashed with
the following message:

  kernel BUG at include/linux/memcontrol.h:554!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  RIP: 0010:folio_memcg+0xaf/0xd0
  Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
  RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
  RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
  RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
  RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
  R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
  R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
  FS:  00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   ? die+0x32/0x90
   ? do_trap+0xde/0x110
   ? folio_memcg+0xaf/0xd0
   ? do_error_trap+0x60/0x80
   ? folio_memcg+0xaf/0xd0
   ? exc_invalid_op+0x53/0x70
   ? folio_memcg+0xaf/0xd0
   ? asm_exc_invalid_op+0x1a/0x20
   ? folio_memcg+0xaf/0xd0
   ? folio_memcg+0xae/0xd0
   split_huge_page_to_list+0x4d/0x1380
   ? sysvec_apic_timer_interrupt+0xf/0x80
   try_to_split_thp_page+0x3a/0xf0
   soft_offline_page+0x1ea/0x8a0
   soft_offline_page_store+0x52/0x90
   kernfs_fop_write_iter+0x118/0x1b0
   vfs_write+0x30b/0x430
   ksys_write+0x5e/0xe0
   do_syscall_64+0xb0/0x1b0
   entry_SYSCALL_64_after_hwframe+0x6d/0x75
  RIP: 0033:0x7f6c60d14697
  Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
  RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
  RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
  RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
  R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
  R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00

The problem is that page->mapping now overlaps the slab_list and slabs
fields of struct slab, so a slab page can be mistaken for a non-LRU
movable page if the slabs field happens to contain PAGE_MAPPING_MOVABLE
or if slab_list->prev is set to LIST_POISON2. Such a slab page is later
treated as a THP, which leads to the crash in split_huge_page_to_list().
Fix this by checking PageSlab() first in HWPoisonHandlable().

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
---
v2:
  Check PageSlab() first and leave the rest of the code alone, per Matthew.
---
 mm/memory-failure.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Matthew Wilcox Jan. 24, 2024, 1:15 p.m. UTC | #1
On Wed, Jan 24, 2024 at 04:40:14PM +0800, Miaohe Lin wrote:
> When I did soft offline stress test, a machine was observed to crash with
> the following message:
> 
>   kernel BUG at include/linux/memcontrol.h:554!
>   invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>   CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>   RIP: 0010:folio_memcg+0xaf/0xd0
>   Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
>   RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
>   RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
>   RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
>   RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
>   R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
>   R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
>   FS:  00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
>   Call Trace:
>    <TASK>
>    ? die+0x32/0x90
>    ? do_trap+0xde/0x110
>    ? folio_memcg+0xaf/0xd0
>    ? do_error_trap+0x60/0x80
>    ? folio_memcg+0xaf/0xd0
>    ? exc_invalid_op+0x53/0x70
>    ? folio_memcg+0xaf/0xd0
>    ? asm_exc_invalid_op+0x1a/0x20
>    ? folio_memcg+0xaf/0xd0
>    ? folio_memcg+0xae/0xd0

I might trim these ? lines out of the backtrace ...

>    split_huge_page_to_list+0x4d/0x1380
>    ? sysvec_apic_timer_interrupt+0xf/0x80
>    try_to_split_thp_page+0x3a/0xf0
>    soft_offline_page+0x1ea/0x8a0
>    soft_offline_page_store+0x52/0x90
>    kernfs_fop_write_iter+0x118/0x1b0
>    vfs_write+0x30b/0x430
>    ksys_write+0x5e/0xe0
>    do_syscall_64+0xb0/0x1b0
>    entry_SYSCALL_64_after_hwframe+0x6d/0x75
>   RIP: 0033:0x7f6c60d14697
>   Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>   RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>   RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
>   RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
>   RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
>   R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>   R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
> 
> The problem is that page->mapping is overloaded with slab->slab_list or
> slabs fields now, so slab pages could be taken as non-LRU movable pages
> if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
> to LIST_POISON2. These slab pages will be treated as thp later leading
> to crash in split_huge_page_to_list().
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  
Miaohe Lin Jan. 25, 2024, 11:53 a.m. UTC | #2
On 2024/1/24 21:15, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 04:40:14PM +0800, Miaohe Lin wrote:
>> When I did soft offline stress test, a machine was observed to crash with
>> the following message:
>>
>>   kernel BUG at include/linux/memcontrol.h:554!
>>   invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>   CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>   RIP: 0010:folio_memcg+0xaf/0xd0
>>   Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
>>   RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
>>   RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
>>   RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
>>   RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
>>   R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
>>   R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
>>   FS:  00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
>>   Call Trace:
>>    <TASK>
>>    ? die+0x32/0x90
>>    ? do_trap+0xde/0x110
>>    ? folio_memcg+0xaf/0xd0
>>    ? do_error_trap+0x60/0x80
>>    ? folio_memcg+0xaf/0xd0
>>    ? exc_invalid_op+0x53/0x70
>>    ? folio_memcg+0xaf/0xd0
>>    ? asm_exc_invalid_op+0x1a/0x20
>>    ? folio_memcg+0xaf/0xd0
>>    ? folio_memcg+0xae/0xd0
> 
> I might trim these ? lines out of the backtrace ...

Do you mean the backtrace should look like the one below?

Call Trace:
 <TASK>
 split_huge_page_to_list+0x4d/0x1380
 ? sysvec_apic_timer_interrupt+0xf/0x80
 try_to_split_thp_page+0x3a/0xf0
 soft_offline_page+0x1ea/0x8a0
 soft_offline_page_store+0x52/0x90
 kernfs_fop_write_iter+0x118/0x1b0
 vfs_write+0x30b/0x430
 ksys_write+0x5e/0xe0
 do_syscall_64+0xb0/0x1b0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f6c60d14697

> 
>>    split_huge_page_to_list+0x4d/0x1380
>>    ? sysvec_apic_timer_interrupt+0xf/0x80
>>    try_to_split_thp_page+0x3a/0xf0
>>    soft_offline_page+0x1ea/0x8a0
>>    soft_offline_page_store+0x52/0x90
>>    kernfs_fop_write_iter+0x118/0x1b0
>>    vfs_write+0x30b/0x430
>>    ksys_write+0x5e/0xe0
>>    do_syscall_64+0xb0/0x1b0
>>    entry_SYSCALL_64_after_hwframe+0x6d/0x75
>>   RIP: 0033:0x7f6c60d14697
>>   Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>   RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>   RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
>>   RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
>>   RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
>>   R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>>   R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
>>
>> The problem is that page->mapping is overloaded with slab->slab_list or
>> slabs fields now, so slab pages could be taken as non-LRU movable pages
>> if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
>> to LIST_POISON2. These slab pages will be treated as thp later leading
>> to crash in split_huge_page_to_list().
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Many thanks for your review.
  
Matthew Wilcox Jan. 25, 2024, 2:22 p.m. UTC | #3
On Thu, Jan 25, 2024 at 07:53:25PM +0800, Miaohe Lin wrote:
> On 2024/1/24 21:15, Matthew Wilcox wrote:
> >>   Call Trace:
> >>    <TASK>
> >>    ? die+0x32/0x90
> >>    ? do_trap+0xde/0x110
> >>    ? folio_memcg+0xaf/0xd0
> >>    ? do_error_trap+0x60/0x80
> >>    ? folio_memcg+0xaf/0xd0
> >>    ? exc_invalid_op+0x53/0x70
> >>    ? folio_memcg+0xaf/0xd0
> >>    ? asm_exc_invalid_op+0x1a/0x20
> >>    ? folio_memcg+0xaf/0xd0
> >>    ? folio_memcg+0xae/0xd0
> > 
> > I might trim these ? lines out of the backtrace ...
> 
> Do you mean make backtrace looks like something below?
> 
> Call Trace:
>  <TASK>
>  split_huge_page_to_list+0x4d/0x1380
>  ? sysvec_apic_timer_interrupt+0xf/0x80
>  try_to_split_thp_page+0x3a/0xf0
>  soft_offline_page+0x1ea/0x8a0
>  soft_offline_page_store+0x52/0x90
>  kernfs_fop_write_iter+0x118/0x1b0
>  vfs_write+0x30b/0x430
>  ksys_write+0x5e/0xe0
>  do_syscall_64+0xb0/0x1b0
>  entry_SYSCALL_64_after_hwframe+0x6d/0x75
> RIP: 0033:0x7f6c60d14697

Yes.  I'd trim the sysvec_apic_timer_interrupt+0xf/0x80 line too.
These lines aren't actually part of the call trace.  They're addresses
that the unwinder found on the stack but that don't fit the call trace.
It includes them in case they're helpful, but marks them with a ? to
indicate that they're probably not part of the call trace.
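
The trimming described here can be sketched as a small filter. This is
only an illustration, not an existing kernel tool: the helper names
is_unreliable_frame() and trim_backtrace() are made up for this example,
and it assumes raw backtrace lines with no timestamp or "> " quote prefix
in front of the ? marker.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if a backtrace line is one of the
 * unreliable entries, i.e. its first non-blank character is '?'.
 */
static bool is_unreliable_frame(const char *line)
{
	while (*line == ' ' || *line == '\t')
		line++;
	return *line == '?';
}

/*
 * Hypothetical helper: copies the lines of `in` to `out`, skipping the
 * unreliable frames. `out` must hold at least strlen(in) + 1 bytes.
 * Returns the number of lines kept.
 */
static size_t trim_backtrace(const char *in, char *out)
{
	size_t kept = 0;

	while (*in) {
		const char *nl = strchr(in, '\n');
		/* Length of the current line, including its newline if any. */
		size_t len = nl ? (size_t)(nl - in) + 1 : strlen(in);

		if (!is_unreliable_frame(in)) {
			memcpy(out, in, len);
			out += len;
			kept++;
		}
		in += len;
	}
	*out = '\0';
	return kept;
}
```

Fed the Call Trace from this report, such a filter would keep the <TASK>
marker and the frames from split_huge_page_to_list down to
entry_SYSCALL_64_after_hwframe, and drop every line marked with a ?.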
  
Miaohe Lin Jan. 26, 2024, 1:13 a.m. UTC | #4
On 2024/1/25 22:22, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 07:53:25PM +0800, Miaohe Lin wrote:
>> On 2024/1/24 21:15, Matthew Wilcox wrote:
>>>>   Call Trace:
>>>>    <TASK>
>>>>    ? die+0x32/0x90
>>>>    ? do_trap+0xde/0x110
>>>>    ? folio_memcg+0xaf/0xd0
>>>>    ? do_error_trap+0x60/0x80
>>>>    ? folio_memcg+0xaf/0xd0
>>>>    ? exc_invalid_op+0x53/0x70
>>>>    ? folio_memcg+0xaf/0xd0
>>>>    ? asm_exc_invalid_op+0x1a/0x20
>>>>    ? folio_memcg+0xaf/0xd0
>>>>    ? folio_memcg+0xae/0xd0
>>>
>>> I might trim these ? lines out of the backtrace ...
>>
>> Do you mean make backtrace looks like something below?
>>
>> Call Trace:
>>  <TASK>
>>  split_huge_page_to_list+0x4d/0x1380
>>  ? sysvec_apic_timer_interrupt+0xf/0x80
>>  try_to_split_thp_page+0x3a/0xf0
>>  soft_offline_page+0x1ea/0x8a0
>>  soft_offline_page_store+0x52/0x90
>>  kernfs_fop_write_iter+0x118/0x1b0
>>  vfs_write+0x30b/0x430
>>  ksys_write+0x5e/0xe0
>>  do_syscall_64+0xb0/0x1b0
>>  entry_SYSCALL_64_after_hwframe+0x6d/0x75
>> RIP: 0033:0x7f6c60d14697
> 
> Yes.  I'd trim the sysvec_apic_timer_interrupt+0xf/0x80 line too.
> These lines aren't actually part of the call trace.  They're addresses
> that the unwinder found on the stack but don't actually fit the call
> trace.  It puts them in in case they're helpful, but marks them with a ?
> to indicate that they're probably not part of the call trace.

I see. Many thanks for your explanation. I will update the backtrace in the next version.

Thanks.
  

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 636280d04008..9349948f1abf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1377,6 +1377,9 @@  void ClearPageHWPoisonTakenOff(struct page *page)
  */
 static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 {
+	if (PageSlab(page))
+		return false;
+
 	/* Soft offline could migrate non-LRU movable pages */
 	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
 		return true;
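
The effect of the added PageSlab() check can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not kernel
code: struct model_page and the handlable_before()/handlable_after()
helpers are hypothetical, the constants are assumed to match
include/linux/page-flags.h and include/linux/poison.h from around v6.7,
and only the part of HWPoisonHandlable() visible in the hunk above is
modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel headers of this era; see the note above. */
#define PAGE_MAPPING_MOVABLE 0x2UL
#define PAGE_MAPPING_FLAGS   0x3UL
/* Low bits of LIST_POISON2, i.e. without POISON_POINTER_DELTA. */
#define LIST_POISON2_LOW     0x122UL

/* Hypothetical stand-in for struct page: only what the check reads. */
struct model_page {
	bool is_slab;           /* models PageSlab() */
	uintptr_t mapping_word; /* word shared by page->mapping and slab fields */
};

/* Models __PageMovable(): only the low bits of the shared word are tested. */
static bool model_page_movable(const struct model_page *p)
{
	return (p->mapping_word & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE;
}

/* Before the patch: soft offline trusts the low bits alone. */
static bool handlable_before(const struct model_page *p, bool soft_offline)
{
	return soft_offline && model_page_movable(p);
}

/* After the patch: slab pages are rejected before the low bits are read. */
static bool handlable_after(const struct model_page *p, bool soft_offline)
{
	if (p->is_slab)
		return false;
	return soft_offline && model_page_movable(p);
}
```

In this model, a slab page whose shared word holds LIST_POISON2_LOW has
low bits equal to PAGE_MAPPING_MOVABLE, so handlable_before() accepts it
while handlable_after() rejects it. A genuine non-slab movable page is
accepted by both.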