[v2,net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Message ID 8e3383d0bacd084f0e33d9158d24bd411f1bf6ba.1684796705.git.peilin.ye@bytedance.com
State New
Series net/sched: Fixes for sch_ingress and sch_clsact

Commit Message

Peilin Ye May 22, 2023, 11:55 p.m. UTC
  mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similarly
for clsact Qdiscs and miniq_egress.

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks to Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new does, and @new should not update
@old's RCU state.  Fixing this without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions).  Instead, make sure
@new's first mini_qdisc_pair_swap() call always happens after @old's last
call (in qdisc_destroy()) has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B does, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 32 ++++++++++++++++++++++++++------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 45 insertions(+), 9 deletions(-)
  

Comments

Pedro Tammela May 23, 2023, 3:51 a.m. UTC | #1
On 22/05/2023 20:55, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc [...]


Hi Peilin!

With V2 patches 5 and 6 applied I was still able to trigger an oops.

Branch is 'net' + patches 5 & 6:
145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and 
clsact Qdiscs before grafting
1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file 
allocation

Kernel config is the same as in the syzbot report.
Note that this was on a _single core_ VM.
I will double check if v1 is triggering this issue (basically run the 
repro for a long time). For multi-core my VM is running OOM even on a 
32Gb system. I will check if we have a spare server to run the repro.

[  695.782780][T12033] 
==================================================================
[  695.783617][T12033] BUG: KASAN: slab-use-after-free in 
mini_qdisc_pair_swap+0x1c2/0x1f0
[  695.784323][T12033] Write of size 8 at addr ffff888060cafb08 by task 
repro/12033
[  695.784996][T12033]
[  695.785210][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted 
6.4.0-rc2-00187-g145f639b9403 #1
[  695.785981][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  695.786883][T12033] Call Trace:
[  695.787178][T12033]  <TASK>
[  695.787444][T12033]  dump_stack_lvl+0xd9/0x1b0
[  695.787871][T12033]  print_report+0xc4/0x5f0
[  695.788283][T12033]  ? __virt_addr_valid+0x5e/0x2d0
[  695.788736][T12033]  ? __phys_addr+0xc6/0x140
[  695.789138][T12033]  ? mini_qdisc_pair_swap+0x1c2/0x1f0
[  695.789604][T12033]  kasan_report+0xc0/0xf0
[  695.789604][T12033]  ? mini_qdisc_pair_swap+0x1c2/0x1f0
[  695.789604][T12033]  mini_qdisc_pair_swap+0x1c2/0x1f0
[  695.789604][T12033]  ? ingress_init+0x1c0/0x1c0
[  695.789604][T12033]  tcf_chain0_head_change.isra.0+0xb9/0x120
[  695.789604][T12033]  tc_new_tfilter+0x1ebb/0x22b0
[  695.789604][T12033]  ? tc_del_tfilter+0x1570/0x1570
[  695.789604][T12033]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  695.789604][T12033]  ? kasan_quarantine_put+0x102/0x230
[  695.789604][T12033]  ? lockdep_hardirqs_on+0x7d/0x100
[  695.789604][T12033]  ? rtnetlink_rcv_msg+0x94a/0xd30
[  695.789604][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  695.789604][T12033]  ? bpf_lsm_capable+0x9/0x10
[  695.789604][T12033]  ? tc_del_tfilter+0x1570/0x1570
[  695.789604][T12033]  rtnetlink_rcv_msg+0x98a/0xd30
[  695.789604][T12033]  ? rtnl_getlink+0xb10/0xb10
[  695.789604][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  695.789604][T12033]  ? netdev_core_pick_tx+0x390/0x390
[  695.789604][T12033]  netlink_rcv_skb+0x166/0x440
[  695.789604][T12033]  ? rtnl_getlink+0xb10/0xb10
[  695.789604][T12033]  ? netlink_ack+0x1370/0x1370
[  695.789604][T12033]  ? kasan_set_track+0x25/0x30
[  695.789604][T12033]  ? netlink_deliver_tap+0x1b1/0xd00
[  695.789604][T12033]  netlink_unicast+0x530/0x800
[  695.789604][T12033]  ? netlink_attachskb+0x880/0x880
[  695.789604][T12033]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  695.789604][T12033]  ? __phys_addr_symbol+0x30/0x70
[  695.789604][T12033]  ? __check_object_size+0x323/0x740
[  695.789604][T12033]  netlink_sendmsg+0x90b/0xe10
[  695.789604][T12033]  ? netlink_unicast+0x800/0x800
[  695.789604][T12033]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  695.789604][T12033]  ? netlink_unicast+0x800/0x800
[  695.789604][T12033]  sock_sendmsg+0xd9/0x180
[  695.789604][T12033]  ____sys_sendmsg+0x264/0x910
[  695.789604][T12033]  ? kernel_sendmsg+0x50/0x50
[  695.789604][T12033]  ? __copy_msghdr+0x460/0x460
[  695.789604][T12033]  ___sys_sendmsg+0x11d/0x1b0
[  695.789604][T12033]  ? do_recvmmsg+0x700/0x700
[  695.789604][T12033]  ? find_held_lock+0x2d/0x110
[  695.789604][T12033]  ? __might_fault+0xe5/0x190
[  695.789604][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  695.789604][T12033]  __sys_sendmmsg+0x18e/0x430
[  695.789604][T12033]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  695.789604][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  695.789604][T12033]  ? rcu_is_watching+0x12/0xb0
[  695.789604][T12033]  ? xfd_validate_state+0x5d/0x180
[  695.789604][T12033]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  695.789604][T12033]  ? unlock_page_memcg+0x2d0/0x2d0
[  695.789604][T12033]  ? do_futex+0x350/0x350
[  695.789604][T12033]  __x64_sys_sendmmsg+0x9c/0x100
[  695.789604][T12033]  ? syscall_enter_from_user_mode+0x26/0x80
[  695.789604][T12033]  do_syscall_64+0x38/0xb0
[  695.789604][T12033]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  695.789604][T12033] RIP: 0033:0x7f4aca44a89d
[  695.789604][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 
89 01 48
[  695.789604][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203 
ORIG_RAX: 0000000000000133
[  695.789604][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX: 
00007f4aca44a89d
[  695.789604][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 
0000000000000007
[  695.789604][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09: 
0000000000000000
[  695.789604][T12033] R10: 0000000000000000 R11: 0000000000000203 R12: 
fffffffffffffeb8
[  695.789604][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15: 
00007f4aca2cf000
[  695.789604][T12033]  </TASK>
[  695.789604][T12033]
[  695.789604][T12033] Allocated by task 12031:
[  695.789604][T12033]  kasan_save_stack+0x20/0x40
[  695.789604][T12033]  kasan_set_track+0x25/0x30
[  695.789604][T12033]  __kasan_kmalloc+0xa2/0xb0
[  695.789604][T12033]  __kmalloc_node+0x60/0x100
[  695.789604][T12033]  qdisc_alloc+0xb3/0xa90
[  695.789604][T12033]  qdisc_create+0xcf/0x1020
[  695.789604][T12033]  tc_modify_qdisc+0x495/0x1ab0
[  695.789604][T12033]  rtnetlink_rcv_msg+0x439/0xd30
[  695.789604][T12033]  netlink_rcv_skb+0x166/0x440
[  695.789604][T12033]  netlink_unicast+0x530/0x800
[  695.789604][T12033]  netlink_sendmsg+0x90b/0xe10
[  695.789604][T12033]  sock_sendmsg+0xd9/0x180
[  695.789604][T12033]  ____sys_sendmsg+0x264/0x910
[  695.789604][T12033]  ___sys_sendmsg+0x11d/0x1b0
[  695.789604][T12033]  __sys_sendmmsg+0x18e/0x430
[  695.789604][T12033]  __x64_sys_sendmmsg+0x9c/0x100
[  695.789604][T12033]  do_syscall_64+0x38/0xb0
[  695.789604][T12033]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  695.789604][T12033]
[  695.789604][T12033] Freed by task 15:
[  695.789604][T12033]  kasan_save_stack+0x20/0x40
[  695.789604][T12033]  kasan_set_track+0x25/0x30
[  695.789604][T12033]  kasan_save_free_info+0x2e/0x40
[  695.789604][T12033]  ____kasan_slab_free+0x15e/0x1b0
[  695.789604][T12033]  slab_free_freelist_hook+0x10b/0x1e0
[  695.789604][T12033]  __kmem_cache_free+0xaf/0x2e0
[  695.789604][T12033]  rcu_core+0x7f7/0x1ac0
[  695.789604][T12033]  __do_softirq+0x1d8/0x8fd
[  695.789604][T12033]
[  695.789604][T12033] Last potentially related work creation:
[  695.789604][T12033]  kasan_save_stack+0x20/0x40
[  695.789604][T12033]  __kasan_record_aux_stack+0xbf/0xd0
[  695.789604][T12033]  __call_rcu_common.constprop.0+0x9a/0x790
[  695.789604][T12033]  qdisc_put_unlocked+0x74/0x90
[  695.789604][T12033]  tcf_block_release+0x90/0xa0
[  695.789604][T12033]  tc_new_tfilter+0xa5e/0x22b0
[  695.789604][T12033]  rtnetlink_rcv_msg+0x98a/0xd30
[  695.789604][T12033]  netlink_rcv_skb+0x166/0x440
[  695.789604][T12033]  netlink_unicast+0x530/0x800
[  695.789604][T12033]  netlink_sendmsg+0x90b/0xe10
[  695.789604][T12033]  sock_sendmsg+0xd9/0x180
[  695.789604][T12033]  ____sys_sendmsg+0x264/0x910
[  695.789604][T12033]  ___sys_sendmsg+0x11d/0x1b0
[  695.789604][T12033]  __sys_sendmmsg+0x18e/0x430
[  695.789604][T12033]  __x64_sys_sendmmsg+0x9c/0x100
[  695.789604][T12033]  do_syscall_64+0x38/0xb0
[  695.789604][T12033]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  695.789604][T12033]
[  695.789604][T12033] Second to last potentially related work creation:
[  695.789604][T12033]  kasan_save_stack+0x20/0x40
[  695.789604][T12033]  __kasan_record_aux_stack+0xbf/0xd0
[  695.789604][T12033]  __call_rcu_common.constprop.0+0x9a/0x790
[  695.789604][T12033]  rht_deferred_worker+0x10fd/0x2010
[  695.789604][T12033]  process_one_work+0x9f9/0x15f0
[  695.789604][T12033]  worker_thread+0x687/0x1110
[  695.789604][T12033]  kthread+0x334/0x430
[  695.789604][T12033]  ret_from_fork+0x1f/0x30
[  695.789604][T12033]
[  695.789604][T12033] The buggy address belongs to the object at 
ffff888060caf800
[  695.789604][T12033]  which belongs to the cache kmalloc-1k of size 1024
[  695.789604][T12033] The buggy address is located 776 bytes inside of
[  695.789604][T12033]  freed 1024-byte region [ffff888060caf800, 
ffff888060cafc00)
[  695.789604][T12033]
[  695.789604][T12033] The buggy address belongs to the physical page:
[  695.789604][T12033] page:ffffea0001832b00 refcount:1 mapcount:0 
mapping:0000000000000000 index:0x0 pfn:0x60cac
[  695.789604][T12033] head:ffffea0001832b00 order:2 entire_mapcount:0 
nr_pages_mapped:0 pincount:0
[  695.789604][T12033] flags: 
0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[  695.789604][T12033] page_type: 0xffffffff()
[  695.789604][T12033] raw: 00fff00000010200 ffff888012441dc0 
ffffea0000bac600 dead000000000002
[  695.789604][T12033] raw: 0000000000000000 0000000000080008 
00000001ffffffff 0000000000000000
[  695.789604][T12033] page dumped because: kasan: bad access detected
[  695.789604][T12033] page_owner tracks the page as allocated
[  695.789604][T12033] page last allocated via order 2, migratetype 
Unmovable, gfp_mask 
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), 
pid 13324, tgid 13318 (repro), ts 170262603079, free_ts 0
[  695.789604][T12033]  get_page_from_freelist+0xe71/0x2e80
[  695.789604][T12033]  __alloc_pages+0x1c8/0x4a0
[  695.789604][T12033]  alloc_pages+0x1a9/0x270
[  695.789604][T12033]  allocate_slab+0x24e/0x380
[  695.789604][T12033]  ___slab_alloc+0x89a/0x1400
[  695.789604][T12033]  __slab_alloc.constprop.0+0x56/0xa0
[  695.789604][T12033]  __kmem_cache_alloc_node+0x126/0x330
[  695.789604][T12033]  kmalloc_trace+0x25/0xe0
[  695.789604][T12033]  fl_change+0x1b3/0x51e0
[  695.789604][T12033]  tc_new_tfilter+0x992/0x22b0
[  695.789604][T12033]  rtnetlink_rcv_msg+0x98a/0xd30
[  695.789604][T12033]  netlink_rcv_skb+0x166/0x440
[  695.789604][T12033]  netlink_unicast+0x530/0x800
[  695.789604][T12033]  netlink_sendmsg+0x90b/0xe10
[  695.789604][T12033]  sock_sendmsg+0xd9/0x180
[  695.789604][T12033]  ____sys_sendmsg+0x264/0x910
[  695.789604][T12033] page_owner free stack trace missing
[  695.789604][T12033]
[  695.789604][T12033] Memory state around the buggy address:
[  695.789604][T12033]  ffff888060cafa00: fb fb fb fb fb fb fb fb fb fb 
fb fb fb fb fb fb
[  695.789604][T12033]  ffff888060cafa80: fb fb fb fb fb fb fb fb fb fb 
fb fb fb fb fb fb
[  695.789604][T12033] >ffff888060cafb00: fb fb fb fb fb fb fb fb fb fb 
fb fb fb fb fb fb
[  695.789604][T12033]                       ^
[  695.789604][T12033]  ffff888060cafb80: fb fb fb fb fb fb fb fb fb fb 
fb fb fb fb fb fb
[  695.789604][T12033]  ffff888060cafc00: fc fc fc fc fc fc fc fc fc fc 
fc fc fc fc fc fc
[  695.789604][T12033] 
==================================================================
[  695.996261][T12042] __nla_validate_parse: 32 callbacks suppressed
[  695.996271][T12042] netlink: 24 bytes leftover after parsing 
attributes in process `repro'.
[  696.473670][T12046] netlink: 24 bytes leftover after parsing 
attributes in process `repro'.
[  696.660496][T12033] Kernel panic - not syncing: KASAN: panic_on_warn 
set ...
[  696.661250][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted 
6.4.0-rc2-00187-g145f639b9403 #1
[  696.661947][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  696.662768][T12033] Call Trace:
[  696.663031][T12033]  <TASK>
[  696.663268][T12033]  dump_stack_lvl+0xd9/0x1b0
[  696.663659][T12033]  panic+0x689/0x730
[  696.663977][T12033]  ? panic_smp_self_stop+0xa0/0xa0
[  696.664396][T12033]  ? preempt_schedule_thunk+0x1a/0x20
[  696.664829][T12033]  ? preempt_schedule_common+0x45/0xc0
[  696.665263][T12033]  ? mini_qdisc_pair_swap+0x1c2/0x1f0
[  696.665698][T12033]  check_panic_on_warn+0xab/0xb0
[  696.666087][T12033]  ? mini_qdisc_pair_swap+0x1c2/0x1f0
[  696.666519][T12033]  end_report+0xe9/0x120
[  696.666861][T12033]  kasan_report+0xcd/0xf0
[  696.667209][T12033]  ? mini_qdisc_pair_swap+0x1c2/0x1f0
[  696.667639][T12033]  mini_qdisc_pair_swap+0x1c2/0x1f0
[  696.668064][T12033]  ? ingress_init+0x1c0/0x1c0
[  696.668451][T12033]  tcf_chain0_head_change.isra.0+0xb9/0x120
[  696.668964][T12033]  tc_new_tfilter+0x1ebb/0x22b0
[  696.669391][T12033]  ? tc_del_tfilter+0x1570/0x1570
[  696.669608][T12033]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  696.669608][T12033]  ? kasan_quarantine_put+0x102/0x230
[  696.669608][T12033]  ? lockdep_hardirqs_on+0x7d/0x100
[  696.669608][T12033]  ? rtnetlink_rcv_msg+0x94a/0xd30
[  696.669608][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  696.669608][T12033]  ? bpf_lsm_capable+0x9/0x10
[  696.669608][T12033]  ? tc_del_tfilter+0x1570/0x1570
[  696.669608][T12033]  rtnetlink_rcv_msg+0x98a/0xd30
[  696.669608][T12033]  ? rtnl_getlink+0xb10/0xb10
[  696.669608][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  696.669608][T12033]  ? netdev_core_pick_tx+0x390/0x390
[  696.669608][T12033]  netlink_rcv_skb+0x166/0x440
[  696.669608][T12033]  ? rtnl_getlink+0xb10/0xb10
[  696.669608][T12033]  ? netlink_ack+0x1370/0x1370
[  696.669608][T12033]  ? kasan_set_track+0x25/0x30
[  696.669608][T12033]  ? netlink_deliver_tap+0x1b1/0xd00
[  696.669608][T12033]  netlink_unicast+0x530/0x800
[  696.669608][T12033]  ? netlink_attachskb+0x880/0x880
[  696.669608][T12033]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  696.669608][T12033]  ? __phys_addr_symbol+0x30/0x70
[  696.669608][T12033]  ? __check_object_size+0x323/0x740
[  696.669608][T12033]  netlink_sendmsg+0x90b/0xe10
[  696.669608][T12033]  ? netlink_unicast+0x800/0x800
[  696.669608][T12033]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  696.669608][T12033]  ? netlink_unicast+0x800/0x800
[  696.669608][T12033]  sock_sendmsg+0xd9/0x180
[  696.669608][T12033]  ____sys_sendmsg+0x264/0x910
[  696.669608][T12033]  ? kernel_sendmsg+0x50/0x50
[  696.669608][T12033]  ? __copy_msghdr+0x460/0x460
[  696.669608][T12033]  ___sys_sendmsg+0x11d/0x1b0
[  696.669608][T12033]  ? do_recvmmsg+0x700/0x700
[  696.669608][T12033]  ? find_held_lock+0x2d/0x110
[  696.669608][T12033]  ? __might_fault+0xe5/0x190
[  696.669608][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  696.669608][T12033]  __sys_sendmmsg+0x18e/0x430
[  696.669608][T12033]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  696.669608][T12033]  ? reacquire_held_locks+0x4b0/0x4b0
[  696.669608][T12033]  ? rcu_is_watching+0x12/0xb0
[  696.669608][T12033]  ? xfd_validate_state+0x5d/0x180
[  696.669608][T12033]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  696.669608][T12033]  ? unlock_page_memcg+0x2d0/0x2d0
[  696.669608][T12033]  ? do_futex+0x350/0x350
[  696.669608][T12033]  __x64_sys_sendmmsg+0x9c/0x100
[  696.669608][T12033]  ? syscall_enter_from_user_mode+0x26/0x80
[  696.669608][T12033]  do_syscall_64+0x38/0xb0
[  696.669608][T12033]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  696.669608][T12033] RIP: 0033:0x7f4aca44a89d
[  696.669608][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 
89 01 48
[  696.669608][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203 
ORIG_RAX: 0000000000000133
[  696.669608][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX: 
00007f4aca44a89d
[  696.669608][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 
0000000000000007
[  696.669608][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09: 
0000000000000000
[  696.669608][T12033] R10: 0000000000000000 R11: 0000000000000203 R12: 
fffffffffffffeb8
[  696.669608][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15: 
00007f4aca2cf000
[  696.669608][T12033]  </TASK>
[  696.669608][T12033] Kernel Offset: disabled
[  696.669608][T12033] Rebooting in 86400 seconds..
  
Peilin Ye May 23, 2023, 4:40 a.m. UTC | #2
Hi Pedro,

On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote:
> With V2 patches 5 and 6 applied I was still able to trigger an oops.
>
> Branch is 'net' + patches 5 & 6:
> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
> clsact Qdiscs before grafting
> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
> allocation
>
> Kernel config is the same as in the syzbot report.
> Note that this was on a _single core_ VM.
> I will double check if v1 is triggering this issue (basically run the repro
> for a long time). For multi-core my VM is running OOM even on a 32Gb system.
> I will check if we have a spare server to run the repro.

Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
"!ingress" path in qdisc_graft().  I think that's why you are still seeing
the oops.  Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
after [1,2/6], and I think we'll need a different reproducer for [5,6/6].

However I just noticed that for some reason my git-send-email in my new
setup didn't auto-generate From: tags with my work email, so Author: will
be my personal email (I have to send patches from personal email to avoid
"[External] " subject prefixes) ... I will fix it in v3 soon.  Sorry in
advance for spamming.

Thanks,
Peilin Ye
  
Pedro Tammela May 23, 2023, 11:36 a.m. UTC | #3
On 23/05/2023 01:40, Peilin Ye wrote:
> Hi Pedro,
> 
> On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote:
>> With V2 patches 5 and 6 applied I was still able to trigger an oops.
>>
>> Branch is 'net' + patches 5 & 6:
>> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
>> clsact Qdiscs before grafting
>> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
>> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
>> allocation
>>
>> Kernel config is the same as in the syzbot report.
>> Note that this was on a _single core_ VM.
>> I will double check if v1 is triggering this issue (basically run the repro
>> for a long time). For multi-core my VM is running OOM even on a 32Gb system.
>> I will check if we have a spare server to run the repro.
> 
> Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
> under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
> "!ingress" path in qdisc_graft().  I think that's why you are still seeing
> the oops.  Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
> after [1,2/6], and I think we'll need a different reproducer for [5,6/6].
> 
> However I just noticed that for some reason my git-send-email in my new
> setup didn't auto-generate From: tags with my work email, so Author: will
> be my personal email (I have to send patches from personal email to avoid
> "[External] " subject prefixes) ... I will fix it in v3 soon.  Sorry in
> advance for spamming.
> 
> Thanks,
> Peilin Ye
> 

I see.
We need to make sure then, when the time comes, that all the required 
patches are backported in the same bundle so we don't end up with a 
partial fix, given that their Fixes: tags target different commits.

I was still able to trigger an oops with the full patchset:

[  104.944353][ T6588] ------------[ cut here ]------------
[  104.944896][ T6588] jump label: negative count!
[  104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at 
kernel/jump_label.c:263 static_key_slow_try_dec+0xf2/0x110
[  104.946795][ T6588] Modules linked in:
[  104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 
6.4.0-rc2-00191-g4a3f9100193d #3
[  104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
[  104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 
41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 
4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 
2e 0f 1f
[  104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
[  104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 
0000000000000000
[  104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 
0000000000000001
[  104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 
0000000000000000
[  104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 
00000000ffffffff
[  104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: 
ffffffff8e7b0680
[  104.954746][ T6588] FS:  00007f76c65d56c0(0000) 
GS:ffff8881f5a00000(0000) knlGS:0000000000000000
[  104.955430][ T6588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4: 
0000000000350ef0
[  104.956559][ T6588] Call Trace:
[  104.956829][ T6588]  <TASK>
[  104.957062][ T6588]  ? clsact_egress_block_get+0x40/0x40
[  104.957507][ T6588]  static_key_slow_dec+0x60/0xc0
[  104.957906][ T6588]  qdisc_create+0xa45/0x1090
[  104.958274][ T6588]  ? tc_get_qdisc+0xb70/0xb70
[  104.958646][ T6588]  tc_modify_qdisc+0x491/0x1b70
[  104.959031][ T6588]  ? qdisc_create+0x1090/0x1090
[  104.959420][ T6588]  ? bpf_lsm_capable+0x9/0x10
[  104.959797][ T6588]  ? qdisc_create+0x1090/0x1090
[  104.960175][ T6588]  rtnetlink_rcv_msg+0x439/0xd30
[  104.960625][ T6588]  ? rtnl_getlink+0xb10/0xb10
[  104.960995][ T6588]  ? __x64_sys_sendmmsg+0x9c/0x100
[  104.961397][ T6588]  ? do_syscall_64+0x38/0xb0
[  104.961767][ T6588]  ? netdev_core_pick_tx+0x390/0x390
[  104.962195][ T6588]  netlink_rcv_skb+0x166/0x440
[  104.962584][ T6588]  ? rtnl_getlink+0xb10/0xb10
[  104.962954][ T6588]  ? netlink_ack+0x1370/0x1370
[  104.963336][ T6588]  ? kasan_set_track+0x25/0x30
[  104.963731][ T6588]  ? netlink_deliver_tap+0x1b1/0xd00
[  104.964156][ T6588]  netlink_unicast+0x530/0x800
[  104.964537][ T6588]  ? netlink_attachskb+0x880/0x880
[  104.964951][ T6588]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  104.965405][ T6588]  ? __phys_addr_symbol+0x30/0x70
[  104.965793][ T6588]  ? __check_object_size+0x323/0x740
[  104.966205][ T6588]  netlink_sendmsg+0x90b/0xe10
[  104.966577][ T6588]  ? netlink_unicast+0x800/0x800
[  104.966965][ T6588]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  104.967374][ T6588]  ? netlink_unicast+0x800/0x800
[  104.967761][ T6588]  sock_sendmsg+0xd9/0x180
[  104.968109][ T6588]  ____sys_sendmsg+0x264/0x910
[  104.968490][ T6588]  ? kernel_sendmsg+0x50/0x50
[  104.968871][ T6588]  ? __copy_msghdr+0x460/0x460
[  104.969244][ T6588]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  104.969716][ T6588]  ? find_held_lock+0x2d/0x110
[  104.970088][ T6588]  ___sys_sendmsg+0x11d/0x1b0
[  104.970491][ T6588]  ? do_recvmmsg+0x700/0x700
[  104.970853][ T6588]  ? __fget_files+0x260/0x420
[  104.971221][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[  104.971649][ T6588]  ? __fget_files+0x282/0x420
[  104.972018][ T6588]  ? __fget_light+0xe6/0x270
[  104.972384][ T6588]  __sys_sendmmsg+0x18e/0x430
[  104.972766][ T6588]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  104.973167][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[  104.973588][ T6588]  ? rcu_is_watching+0x12/0xb0
[  104.973958][ T6588]  ? xfd_validate_state+0x5d/0x180
[  104.974355][ T6588]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  104.974796][ T6588]  ? unlock_page_memcg+0x2d0/0x2d0
[  104.975178][ T6588]  ? do_futex+0x350/0x350
[  104.975501][ T6588]  __x64_sys_sendmmsg+0x9c/0x100
[  104.975867][ T6588]  ? syscall_enter_from_user_mode+0x26/0x80
[  104.976307][ T6588]  do_syscall_64+0x38/0xb0
[  104.976649][ T6588]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  104.977088][ T6588] RIP: 0033:0x7f76c66ee89d
[  104.977417][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 
89 01 48
[  104.978835][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203 
ORIG_RAX: 0000000000000133
[  104.979468][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX: 
00007f76c66ee89d
[  104.980073][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 
0000000000000007
[  104.980739][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09: 
0000000000000000
[  104.981352][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12: fffffffffffffeb8
[  104.981968][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15: 00007f76c65b5000
[  104.982587][ T6588]  </TASK>
[  104.982833][ T6588] Kernel panic - not syncing: kernel: panic_on_warn set ...
[  104.983391][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 6.4.0-rc2-00191-g4a3f9100193d #3
[  104.984054][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  104.984854][ T6588] Call Trace:
[  104.985121][ T6588]  <TASK>
[  104.985359][ T6588]  dump_stack_lvl+0xd9/0x1b0
[  104.985733][ T6588]  panic+0x689/0x730
[  104.986053][ T6588]  ? panic_smp_self_stop+0xa0/0xa0
[  104.986467][ T6588]  ? show_trace_log_lvl+0x28e/0x400
[  104.986888][ T6588]  ? static_key_slow_try_dec+0xf2/0x110
[  104.987329][ T6588]  check_panic_on_warn+0xab/0xb0
[  104.987729][ T6588]  __warn+0xf2/0x380
[  104.988046][ T6588]  ? static_key_slow_try_dec+0xf2/0x110
[  104.988476][ T6588]  report_bug+0x3bc/0x580
[  104.988838][ T6588]  handle_bug+0x3c/0x70
[  104.989169][ T6588]  exc_invalid_op+0x17/0x40
[  104.989529][ T6588]  asm_exc_invalid_op+0x1a/0x20
[  104.989914][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
[  104.990402][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f
[  104.990702][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
[  104.990702][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 0000000000000000
[  104.990702][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 0000000000000001
[  104.990702][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 0000000000000000
[  104.990702][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff
[  104.990702][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: ffffffff8e7b0680
[  104.990702][ T6588]  ? __warn_printk+0x187/0x310
[  104.990702][ T6588]  ? clsact_egress_block_get+0x40/0x40
[  104.990702][ T6588]  static_key_slow_dec+0x60/0xc0
[  104.990702][ T6588]  qdisc_create+0xa45/0x1090
[  104.990702][ T6588]  ? tc_get_qdisc+0xb70/0xb70
[  104.990702][ T6588]  tc_modify_qdisc+0x491/0x1b70
[  104.990702][ T6588]  ? qdisc_create+0x1090/0x1090
[  104.990702][ T6588]  ? bpf_lsm_capable+0x9/0x10
[  104.990702][ T6588]  ? qdisc_create+0x1090/0x1090
[  104.990702][ T6588]  rtnetlink_rcv_msg+0x439/0xd30
[  104.990702][ T6588]  ? rtnl_getlink+0xb10/0xb10
[  104.990702][ T6588]  ? __x64_sys_sendmmsg+0x9c/0x100
[  104.990702][ T6588]  ? do_syscall_64+0x38/0xb0
[  104.990702][ T6588]  ? netdev_core_pick_tx+0x390/0x390
[  104.990702][ T6588]  netlink_rcv_skb+0x166/0x440
[  104.990702][ T6588]  ? rtnl_getlink+0xb10/0xb10
[  104.990702][ T6588]  ? netlink_ack+0x1370/0x1370
[  104.990702][ T6588]  ? kasan_set_track+0x25/0x30
[  104.990702][ T6588]  ? netlink_deliver_tap+0x1b1/0xd00
[  104.990702][ T6588]  netlink_unicast+0x530/0x800
[  104.990702][ T6588]  ? netlink_attachskb+0x880/0x880
[  104.990702][ T6588]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  104.990702][ T6588]  ? __phys_addr_symbol+0x30/0x70
[  104.990702][ T6588]  ? __check_object_size+0x323/0x740
[  104.990702][ T6588]  netlink_sendmsg+0x90b/0xe10
[  104.990702][ T6588]  ? netlink_unicast+0x800/0x800
[  104.990702][ T6588]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  104.990702][ T6588]  ? netlink_unicast+0x800/0x800
[  104.990702][ T6588]  sock_sendmsg+0xd9/0x180
[  104.990702][ T6588]  ____sys_sendmsg+0x264/0x910
[  104.990702][ T6588]  ? kernel_sendmsg+0x50/0x50
[  104.990702][ T6588]  ? __copy_msghdr+0x460/0x460
[  104.990702][ T6588]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  104.990702][ T6588]  ? find_held_lock+0x2d/0x110
[  104.990702][ T6588]  ___sys_sendmsg+0x11d/0x1b0
[  104.990702][ T6588]  ? do_recvmmsg+0x700/0x700
[  104.990702][ T6588]  ? __fget_files+0x260/0x420
[  104.990702][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[  104.990702][ T6588]  ? __fget_files+0x282/0x420
[  104.990702][ T6588]  ? __fget_light+0xe6/0x270
[  104.990702][ T6588]  __sys_sendmmsg+0x18e/0x430
[  104.990702][ T6588]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  104.990702][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[  104.990702][ T6588]  ? rcu_is_watching+0x12/0xb0
[  104.990702][ T6588]  ? xfd_validate_state+0x5d/0x180
[  104.990702][ T6588]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  104.990702][ T6588]  ? unlock_page_memcg+0x2d0/0x2d0
[  104.990702][ T6588]  ? do_futex+0x350/0x350
[  104.990702][ T6588]  __x64_sys_sendmmsg+0x9c/0x100
[  104.990702][ T6588]  ? syscall_enter_from_user_mode+0x26/0x80
[  104.990702][ T6588]  do_syscall_64+0x38/0xb0
[  104.990702][ T6588]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  104.990702][ T6588] RIP: 0033:0x7f76c66ee89d
[  104.990702][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[  104.990702][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203 ORIG_RAX: 0000000000000133
[  104.990702][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX: 00007f76c66ee89d
[  104.990702][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[  104.990702][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09: 0000000000000000
[  104.990702][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12: fffffffffffffeb8
[  104.990702][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15: 00007f76c65b5000
[  104.990702][ T6588]  </TASK>
[  104.990702][ T6588] Kernel Offset: disabled
[  104.990702][ T6588] Rebooting in 86400 seconds..
  
Vlad Buslov May 23, 2023, 2:04 p.m. UTC | #4
Hi Peilin,

Thanks again for the great analysis and for investing effort into fixing
this!

On Mon 22 May 2023 at 16:55, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> requests (thanks Hillf Danton for the hint), when replacing ingress or
> clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> causing race conditions [1] including a use-after-free bug in
> mini_qdisc_pair_swap() reported by syzbot:
>
>  BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>  Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
>  Call Trace:
>   <TASK>
>   __dump_stack lib/dump_stack.c:88 [inline]
>   dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
>   print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
>   print_report mm/kasan/report.c:430 [inline]
>   kasan_report+0x11c/0x130 mm/kasan/report.c:536
>   mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>   tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
>   tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
>   tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
>   tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
>   tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> @old and @new should not affect each other.  In other words, @old should
> never modify miniq_{in,e}gress after @new, and @new should not update
> @old's RCU state.  Fixing without changing sch_api.c turned out to be
> difficult (please refer to Closes: for discussions).  Instead, make sure
> @new's first call always happen after @old's last call, in
> qdisc_destroy(), has finished:
>
> In qdisc_graft(), return -EAGAIN and tell the caller to replay
> (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> requests, and call qdisc_destroy() for @old before grafting @new.
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> Qdiscs".
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
>   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>   adds a flower filter X to A.
>
>   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>   b2) to replace A, then adds a flower filter Y to B.
>
>  Thread 1               A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
>
>   RTM_NEWTFILTER (X, RTNL-unlocked)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                         RTM_NEWQDISC (B, RTNL-locked)
>          RCU sync                2     qdisc_graft(B)
>             |                    1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
>    qdisc_destroy(A)                    tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                |
>            ...                                 ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is only one of the possible consequences of concurrently accessing
> miniq_{in,e}gress pointers.  The point is clear though: again, A should
> never modify those per-net_device pointers after B, and B should not
> update A's RCU state.
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Cc: Hillf Danton <hdanton@sina.com>
> Cc: Vlad Buslov <vladbu@mellanox.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
> ---
> changes in v2:
>   - replay the request if the current Qdisc has any ongoing RTNL-unlocked
>     filter requests (Vlad)
>   - minor changes in code comments and commit log
>
>  include/net/sch_generic.h |  8 ++++++++
>  net/sched/sch_api.c       | 32 ++++++++++++++++++++++++++------
>  net/sched/sch_generic.c   | 14 +++++++++++---
>  3 files changed, 45 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>  	refcount_inc(&qdisc->refcnt);
>  }
>  
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>  /* Intended to be used by unlocked users, when concurrent qdisc release is
>   * possible.
>   */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>  struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>  			      struct Qdisc *qdisc);
>  void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>  void qdisc_put(struct Qdisc *qdisc);
>  void qdisc_put_unlocked(struct Qdisc *qdisc);
>  void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..b3bafa6c1b44 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		if ((q && q->flags & TCQ_F_INGRESS) ||
>  		    (new && new->flags & TCQ_F_INGRESS)) {
>  			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>  				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>  				return -ENOENT;
>  			}
> +
> +			/* Replay if the current ingress (or clsact) Qdisc has ongoing
> +			 * RTNL-unlocked filter request(s).  This is the counterpart of that
> +			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +				return -EAGAIN;
>  		}
>  
>  		if (dev->flags & IFF_UP)
> @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  				qdisc_put(old);
>  			}
>  		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>  		}
>  
>  skip:
> @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  
>  			if (new && new->ops->attach)
>  				new->ops->attach(new);
> -		} else {
> -			notify_and_destroy(net, skb, n, classid, old, new, extack);
>  		}
>  
>  		if (dev->flags & IFF_UP)
> @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  	struct Qdisc *p = NULL;
>  	int err;
>  
> +replay:

Perhaps also set q and p to NULL here? Even though on cursory look you
should get the same lookup result since the function is called under
rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case
something touches this.") and tc_new_tfilter() got some non-obvious bugs
after I introduced replay there without re-setting some of the required
variables.

>  	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
>  				     rtm_tca_policy, extack);
>  	if (err < 0)
> @@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  			return -ENOENT;
>  		}
>  		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
> -		if (err != 0)
> +		if (err != 0) {
> +			if (err == -EAGAIN)
> +				goto replay;
>  			return err;
> +		}
>  	} else {
>  		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
>  	}
> @@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  	if (err) {
>  		if (q)
>  			qdisc_put(q);
> +		if (err == -EAGAIN)
> +			goto replay;
>  		return err;
>  	}
>  
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>  	qdisc_free(q);
>  }
>  
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>  {
>  	const struct Qdisc_ops  *ops = qdisc->ops;
>  
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>  	call_rcu(&qdisc->rcu, qdisc_free_cb);
>  }
>  
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>  void qdisc_put(struct Qdisc *qdisc)
>  {
>  	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>  	    !refcount_dec_and_test(&qdisc->refcnt))
>  		return;
>  
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  }
>  EXPORT_SYMBOL(qdisc_put);
>  
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>  	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>  		return;
>  
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  	rtnl_unlock();
>  }
>  EXPORT_SYMBOL(qdisc_put_unlocked);
  
Peilin Ye May 23, 2023, 6:52 p.m. UTC | #5
Hi Vlad,

On Tue, May 23, 2023 at 05:04:40PM +0300, Vlad Buslov wrote:
> > @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
> >     struct Qdisc *p = NULL;
> >     int err;
> >
> > +replay:
>
> Perhaps also set q and p to NULL here? Even though on cursory look you
> should get the same lookup result since the function is called under
> rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case
> something touches this.") and tc_new_tfilter() got some non-obvious bugs
> after I introduced replay there without re-setting some of the required
> variables.

Sure, I'll reinitialize tcm, q and p after "replay:" in next version, just
like tc_modify_qdisc().  Thanks for the suggestion!

Peilin Ye
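
[Editorial note] The replay-with-reinit pattern agreed on above can be
sketched as a small userspace C model.  Everything here (req_state,
fake_qdisc_graft(), handle_request()) is illustrative and NOT kernel API;
the only point it demonstrates is that every per-request variable must be
re-derived on each pass through the replay label:

```c
#include <assert.h>
#include <string.h>

#define FAKE_EAGAIN 11	/* illustrative stand-in for -EAGAIN */

/* Stand-in for the per-request state (tcm, q, p) that tc_get_qdisc()
 * derives from the netlink message; purely illustrative. */
struct req_state {
	int q_set;
	int p_set;
};

static int graft_attempts;

/* Stub for qdisc_graft(): the old Qdisc still has RTNL-unlocked filter
 * requests in flight for the first two passes, so it asks for a replay. */
static int fake_qdisc_graft(struct req_state *s)
{
	s->q_set = s->p_set = 1;	/* the request touches its state */
	if (++graft_attempts < 3)
		return -FAKE_EAGAIN;
	return 0;
}

/* The replay pattern: on -EAGAIN, jump back to a point where all
 * per-request variables are reinitialized before the lookup is redone. */
static int handle_request(void)
{
	struct req_state s;
	int err;

replay:
	/* Reinit, just in case something touches this. */
	memset(&s, 0, sizeof(s));

	err = fake_qdisc_graft(&s);
	if (err == -FAKE_EAGAIN)
		goto replay;
	return err;
}
```

Under RTNL the lookup should indeed return the same Qdisc on each pass,
but skipping the reinit is exactly the kind of thing that, per Vlad's
comment, caused non-obvious bugs when replay was introduced in
tc_new_tfilter().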
  
Peilin Ye May 23, 2023, 8:06 p.m. UTC | #6
On Tue, May 23, 2023 at 08:36:35AM -0300, Pedro Tammela wrote:
> > Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
> > under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
> > "!ingress" path in qdisc_graft().  I think that's why you are still seeing
> > the oops.  Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
> > after [1,2/6], and I think we'll need a different reproducer for [5,6/6].
> 
> I was still able to trigger an oops with the full patchset:
> 
> [  104.944353][ T6588] ------------[ cut here ]------------
> [  104.944896][ T6588] jump label: negative count!
> [  104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at kernel/jump_label.c:263 static_key_slow_try_dec+0xf2/0x110
> [  104.946795][ T6588] Modules linked in:
> [  104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 6.4.0-rc2-00191-g4a3f9100193d #3
> [  104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [  104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
> [  104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f
> [  104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
> [  104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 0000000000000000
> [  104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 0000000000000001
> [  104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 0000000000000000
> [  104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff
> [  104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: ffffffff8e7b0680
> [  104.954746][ T6588] FS:  00007f76c65d56c0(0000) GS:ffff8881f5a00000(0000) knlGS:0000000000000000
> [  104.955430][ T6588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4: 0000000000350ef0
> [  104.956559][ T6588] Call Trace:
> [  104.956829][ T6588]  <TASK>
> [  104.957062][ T6588]  ? clsact_egress_block_get+0x40/0x40
> [  104.957507][ T6588]  static_key_slow_dec+0x60/0xc0
> [  104.957906][ T6588]  qdisc_create+0xa45/0x1090
> [  104.958274][ T6588]  ? tc_get_qdisc+0xb70/0xb70
> [  104.958646][ T6588]  tc_modify_qdisc+0x491/0x1b70
> [  104.959031][ T6588]  ? qdisc_create+0x1090/0x1090
> [  104.959420][ T6588]  ? bpf_lsm_capable+0x9/0x10
> [  104.959797][ T6588]  ? qdisc_create+0x1090/0x1090

Ah, qdisc_create() calls ->destroy() even "if ops->init() failed".  We
should check sch->parent in {ingress,clsact}_destroy() too.  I'll update
[1,2/6] in v5.

Thanks for reporting this!  Seems like I should've run the reproducer
anyway.  I'll run it before posting v5.

Thanks,
Peilin Ye
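
[Editorial note] The failure mode Pedro's oops points at ("jump label:
negative count!") can be modeled in userspace.  fake_init()/fake_destroy()
below are hypothetical stand-ins for {ingress,clsact}_init()/_destroy(),
and the plain counter stands in for the static key that underflowed; the
sketch only shows why ->destroy() must tolerate a rejected ->init(), since
qdisc_create() calls ->destroy() even on the init-failure path:

```c
#include <assert.h>

#define FAKE_EOPNOTSUPP   95		/* illustrative errno value */
#define FAKE_TC_H_INGRESS 0xfffffff1u	/* illustrative parent handle */

static int ingress_key;	/* stands in for the ingress static key */

struct fake_sch {
	unsigned int parent;
};

/* Hypothetical ingress_init(): with patches [1,2/6], creation under a
 * parent other than TC_H_INGRESS is rejected before any side effects. */
static int fake_init(struct fake_sch *sch)
{
	if (sch->parent != FAKE_TC_H_INGRESS)
		return -FAKE_EOPNOTSUPP;
	ingress_key++;		/* stands in for net_inc_ingress_queue() */
	return 0;
}

/* Hypothetical ingress_destroy(): mirror init()'s parent check so it
 * never undoes a side effect that init() did not apply. */
static void fake_destroy(struct fake_sch *sch)
{
	if (sch->parent != FAKE_TC_H_INGRESS)
		return;
	ingress_key--;		/* would underflow without the check above */
}

/* Mirrors qdisc_create()'s error path: ->destroy() runs even when
 * ->init() failed. */
static int fake_create(struct fake_sch *sch)
{
	int err = fake_init(sch);

	if (err)
		fake_destroy(sch);
	return err;
}
```

Without the early return in fake_destroy(), creating under the wrong
parent would drive the counter negative, which is the userspace analogue
of the static_key_slow_try_dec() warning in the log above.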
  

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@  static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@  void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..b3bafa6c1b44 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* Replay if the current ingress (or clsact) Qdisc has ongoing
+			 * RTNL-unlocked filter request(s).  This is the counterpart of that
+			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+				return -EAGAIN;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1458,6 +1472,7 @@  static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
@@ -1515,8 +1530,11 @@  static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1722,8 @@  static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@  static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@  static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@  void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@  void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
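
[Editorial note] To close, a userspace sketch of the refcount contract the
patch builds on.  fake_inc_not_zero() and fake_dec_if_one() are
simplified, NON-atomic models of refcount_inc_not_zero() and
refcount_dec_if_one() (the real kernel primitives are atomic); they show
why qdisc_graft() may only proceed, and destroy the old Qdisc, when it
holds the last reference:

```c
#include <assert.h>

struct fake_qdisc {
	int refcnt;
};

/* Counterpart of the qdisc_refcount_inc_nz() call in __tcf_qdisc_find():
 * an RTNL-unlocked filter request takes a reference, but only if the
 * Qdisc is not already on its way to destruction. */
static int fake_inc_not_zero(struct fake_qdisc *q)
{
	if (q->refcnt == 0)
		return 0;
	q->refcnt++;
	return 1;
}

/* Model of qdisc_refcount_dec_if_one(): succeeds only when the caller
 * holds the last reference; otherwise qdisc_graft() returns -EAGAIN and
 * the request is replayed. */
static int fake_dec_if_one(struct fake_qdisc *q)
{
	if (q->refcnt != 1)
		return 0;
	q->refcnt = 0;
	return 1;
}
```

In the patch's terms: while a filter request holds an extra reference,
dec_if_one fails and the graft replays; once the filter request drops its
reference, dec_if_one succeeds and qdisc_destroy(old) may run before
grafting the new Qdisc.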