[v2,net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Message ID: 8e3383d0bacd084f0e33d9158d24bd411f1bf6ba.1684796705.git.peilin.ye@bytedance.com
State: New
Series: net/sched: Fixes for sch_ingress and sch_clsact
Commit Message
Peilin Ye
May 22, 2023, 11:55 p.m. UTC
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap(). The same
applies to clsact Qdiscs and miniq_egress.
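To make the shared-slot layout concrete, here is a simplified userspace model (an illustration under assumptions: struct members are trimmed down, and the RCU assignment and grace-period bookkeeping of the real mini_qdisc_pair_swap() are omitted; `pair_init`/`pair_swap` are hypothetical stand-in names, not kernel symbols):

```c
#include <stddef.h>

/* Model: each ingress Qdisc owns two mini Qdiscs plus a double
 * pointer p_miniq aimed at the device's single miniq_ingress slot,
 * so an old and a new Qdisc write through the same slot. */
struct mini_qdisc { int dummy; };

struct net_device { struct mini_qdisc *miniq_ingress; };

struct mini_qdisc_pair {
    struct mini_qdisc miniq1, miniq2;
    struct mini_qdisc **p_miniq;   /* -> &dev->miniq_ingress */
};

/* What ingress_init() does: aim the pair at the per-device slot */
static void pair_init(struct mini_qdisc_pair *p, struct net_device *dev)
{
    p->p_miniq = &dev->miniq_ingress;
}

/* What mini_qdisc_pair_swap() does, minus RCU: publish one of the
 * pair's mini Qdiscs (or NULL on destroy) in the shared slot */
static void pair_swap(struct mini_qdisc_pair *p, struct mini_qdisc *mq)
{
    *p->p_miniq = mq;
}
```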
Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:
BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...
@old and @new should not affect each other. In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state. Fixing this without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions). Instead, make sure
@new's first call always happens after @old's last call (in
qdisc_destroy()) has finished:
In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.
Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".
[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
adds a flower filter X to A.
Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
b2) to replace A, then adds a flower filter Y to B.
Thread 1               A's refcnt   Thread 2
 RTM_NEWQDISC (A, RTNL-locked)
  qdisc_create(A)               1
  qdisc_graft(A)                9

 RTM_NEWTFILTER (X, RTNL-unlocked)
  __tcf_qdisc_find(A)          10
  tcf_chain0_head_change(A)
  mini_qdisc_pair_swap(A) (1st)
   |
   |                                RTM_NEWQDISC (B, RTNL-locked)
   RCU sync                     2    qdisc_graft(B)
   |                            1     notify_and_destroy(A)
   |
  tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                   tcf_chain0_head_change(B)
    tcf_chain0_head_change_cb_del(A)   mini_qdisc_pair_swap(B) (2nd)
    mini_qdisc_pair_swap(A) (3rd)       |
     ...                                ...
Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().
This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers. The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.
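The three swaps can be replayed in a minimal self-contained model (an illustration only: `slot`, `a_miniq`, `b_miniq`, and `swap_in` are hypothetical stand-ins for eth0->miniq_ingress, A's and B's mini Qdiscs, and mini_qdisc_pair_swap(), respectively):

```c
#include <stddef.h>

/* slot stands in for eth0->miniq_ingress; a_miniq/b_miniq for the
 * mini Qdiscs of the old (A) and new (B) ingress Qdisc. */
static int a_miniq, b_miniq;
static int *slot;                   /* eth0->miniq_ingress */

static void swap_in(int *mq) { slot = mq; }

/* Returns 1 when the buggy interleaving leaves the slot NULL */
static int replay_race(void)
{
    swap_in(&a_miniq);  /* (1st) A activates its mini Qdisc          */
    swap_in(&b_miniq);  /* (2nd) B replaces A, publishes b1          */
    swap_in(NULL);      /* (3rd) A's deferred destroy clears the
                         * slot, so filter Y on B is unreachable     */
    return slot == NULL;
}
```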
Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
changes in v2:
- replay the request if the current Qdisc has any ongoing RTNL-unlocked
filter requests (Vlad)
- minor changes in code comments and commit log
include/net/sch_generic.h | 8 ++++++++
net/sched/sch_api.c | 32 ++++++++++++++++++++++++++------
net/sched/sch_generic.c | 14 +++++++++++---
3 files changed, 45 insertions(+), 9 deletions(-)
Comments
On 22/05/2023 20:55, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc [...]
Hi Peilin!
With V2 patches 5 and 6 applied I was still able to trigger an oops.

Branch is 'net' + patches 5 & 6:
145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file allocation

Kernel config is the same as in the syzbot report.
Note that this was on a _single core_ VM.
I will double-check whether v1 triggers this issue (basically, run the
repro for a long time). For multi-core, my VM runs OOM even on a 32GB
system. I will check if we have a spare server to run the repro.
[ 695.782780][T12033]
==================================================================
[ 695.783617][T12033] BUG: KASAN: slab-use-after-free in
mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.784323][T12033] Write of size 8 at addr ffff888060cafb08 by task
repro/12033
[ 695.784996][T12033]
[ 695.785210][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted
6.4.0-rc2-00187-g145f639b9403 #1
[ 695.785981][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 695.786883][T12033] Call Trace:
[ 695.787178][T12033] <TASK>
[ 695.787444][T12033] dump_stack_lvl+0xd9/0x1b0
[ 695.787871][T12033] print_report+0xc4/0x5f0
[ 695.788283][T12033] ? __virt_addr_valid+0x5e/0x2d0
[ 695.788736][T12033] ? __phys_addr+0xc6/0x140
[ 695.789138][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] kasan_report+0xc0/0xf0
[ 695.789604][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] ? ingress_init+0x1c0/0x1c0
[ 695.789604][T12033] tcf_chain0_head_change.isra.0+0xb9/0x120
[ 695.789604][T12033] tc_new_tfilter+0x1ebb/0x22b0
[ 695.789604][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 695.789604][T12033] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 695.789604][T12033] ? kasan_quarantine_put+0x102/0x230
[ 695.789604][T12033] ? lockdep_hardirqs_on+0x7d/0x100
[ 695.789604][T12033] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? bpf_lsm_capable+0x9/0x10
[ 695.789604][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] ? rtnl_getlink+0xb10/0xb10
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? netdev_core_pick_tx+0x390/0x390
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] ? rtnl_getlink+0xb10/0xb10
[ 695.789604][T12033] ? netlink_ack+0x1370/0x1370
[ 695.789604][T12033] ? kasan_set_track+0x25/0x30
[ 695.789604][T12033] ? netlink_deliver_tap+0x1b1/0xd00
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] ? netlink_attachskb+0x880/0x880
[ 695.789604][T12033] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 695.789604][T12033] ? __phys_addr_symbol+0x30/0x70
[ 695.789604][T12033] ? __check_object_size+0x323/0x740
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] ? netlink_unicast+0x800/0x800
[ 695.789604][T12033] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 695.789604][T12033] ? netlink_unicast+0x800/0x800
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ? kernel_sendmsg+0x50/0x50
[ 695.789604][T12033] ? __copy_msghdr+0x460/0x460
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] ? do_recvmmsg+0x700/0x700
[ 695.789604][T12033] ? find_held_lock+0x2d/0x110
[ 695.789604][T12033] ? __might_fault+0xe5/0x190
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? rcu_is_watching+0x12/0xb0
[ 695.789604][T12033] ? xfd_validate_state+0x5d/0x180
[ 695.789604][T12033] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 695.789604][T12033] ? unlock_page_memcg+0x2d0/0x2d0
[ 695.789604][T12033] ? do_futex+0x350/0x350
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] ? syscall_enter_from_user_mode+0x26/0x80
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033] RIP: 0033:0x7f4aca44a89d
[ 695.789604][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 695.789604][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 695.789604][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX:
00007f4aca44a89d
[ 695.789604][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 695.789604][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09:
0000000000000000
[ 695.789604][T12033] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 695.789604][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15:
00007f4aca2cf000
[ 695.789604][T12033] </TASK>
[ 695.789604][T12033]
[ 695.789604][T12033] Allocated by task 12031:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] kasan_set_track+0x25/0x30
[ 695.789604][T12033] __kasan_kmalloc+0xa2/0xb0
[ 695.789604][T12033] __kmalloc_node+0x60/0x100
[ 695.789604][T12033] qdisc_alloc+0xb3/0xa90
[ 695.789604][T12033] qdisc_create+0xcf/0x1020
[ 695.789604][T12033] tc_modify_qdisc+0x495/0x1ab0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x439/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033]
[ 695.789604][T12033] Freed by task 15:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] kasan_set_track+0x25/0x30
[ 695.789604][T12033] kasan_save_free_info+0x2e/0x40
[ 695.789604][T12033] ____kasan_slab_free+0x15e/0x1b0
[ 695.789604][T12033] slab_free_freelist_hook+0x10b/0x1e0
[ 695.789604][T12033] __kmem_cache_free+0xaf/0x2e0
[ 695.789604][T12033] rcu_core+0x7f7/0x1ac0
[ 695.789604][T12033] __do_softirq+0x1d8/0x8fd
[ 695.789604][T12033]
[ 695.789604][T12033] Last potentially related work creation:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] __kasan_record_aux_stack+0xbf/0xd0
[ 695.789604][T12033] __call_rcu_common.constprop.0+0x9a/0x790
[ 695.789604][T12033] qdisc_put_unlocked+0x74/0x90
[ 695.789604][T12033] tcf_block_release+0x90/0xa0
[ 695.789604][T12033] tc_new_tfilter+0xa5e/0x22b0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033]
[ 695.789604][T12033] Second to last potentially related work creation:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] __kasan_record_aux_stack+0xbf/0xd0
[ 695.789604][T12033] __call_rcu_common.constprop.0+0x9a/0x790
[ 695.789604][T12033] rht_deferred_worker+0x10fd/0x2010
[ 695.789604][T12033] process_one_work+0x9f9/0x15f0
[ 695.789604][T12033] worker_thread+0x687/0x1110
[ 695.789604][T12033] kthread+0x334/0x430
[ 695.789604][T12033] ret_from_fork+0x1f/0x30
[ 695.789604][T12033]
[ 695.789604][T12033] The buggy address belongs to the object at
ffff888060caf800
[ 695.789604][T12033] which belongs to the cache kmalloc-1k of size 1024
[ 695.789604][T12033] The buggy address is located 776 bytes inside of
[ 695.789604][T12033] freed 1024-byte region [ffff888060caf800,
ffff888060cafc00)
[ 695.789604][T12033]
[ 695.789604][T12033] The buggy address belongs to the physical page:
[ 695.789604][T12033] page:ffffea0001832b00 refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x60cac
[ 695.789604][T12033] head:ffffea0001832b00 order:2 entire_mapcount:0
nr_pages_mapped:0 pincount:0
[ 695.789604][T12033] flags:
0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[ 695.789604][T12033] page_type: 0xffffffff()
[ 695.789604][T12033] raw: 00fff00000010200 ffff888012441dc0
ffffea0000bac600 dead000000000002
[ 695.789604][T12033] raw: 0000000000000000 0000000000080008
00000001ffffffff 0000000000000000
[ 695.789604][T12033] page dumped because: kasan: bad access detected
[ 695.789604][T12033] page_owner tracks the page as allocated
[ 695.789604][T12033] page last allocated via order 2, migratetype
Unmovable, gfp_mask
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC),
pid 13324, tgid 13318 (repro), ts 170262603079, free_ts 0
[ 695.789604][T12033] get_page_from_freelist+0xe71/0x2e80
[ 695.789604][T12033] __alloc_pages+0x1c8/0x4a0
[ 695.789604][T12033] alloc_pages+0x1a9/0x270
[ 695.789604][T12033] allocate_slab+0x24e/0x380
[ 695.789604][T12033] ___slab_alloc+0x89a/0x1400
[ 695.789604][T12033] __slab_alloc.constprop.0+0x56/0xa0
[ 695.789604][T12033] __kmem_cache_alloc_node+0x126/0x330
[ 695.789604][T12033] kmalloc_trace+0x25/0xe0
[ 695.789604][T12033] fl_change+0x1b3/0x51e0
[ 695.789604][T12033] tc_new_tfilter+0x992/0x22b0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] page_owner free stack trace missing
[ 695.789604][T12033]
[ 695.789604][T12033] Memory state around the buggy address:
[ 695.789604][T12033] ffff888060cafa00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ffff888060cafa80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] >ffff888060cafb00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ^
[ 695.789604][T12033] ffff888060cafb80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ffff888060cafc00: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 695.789604][T12033]
==================================================================
[ 695.996261][T12042] __nla_validate_parse: 32 callbacks suppressed
[ 695.996271][T12042] netlink: 24 bytes leftover after parsing
attributes in process `repro'.
[ 696.473670][T12046] netlink: 24 bytes leftover after parsing
attributes in process `repro'.
[ 696.660496][T12033] Kernel panic - not syncing: KASAN: panic_on_warn
set ...
[ 696.661250][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted
6.4.0-rc2-00187-g145f639b9403 #1
[ 696.661947][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 696.662768][T12033] Call Trace:
[ 696.663031][T12033] <TASK>
[ 696.663268][T12033] dump_stack_lvl+0xd9/0x1b0
[ 696.663659][T12033] panic+0x689/0x730
[ 696.663977][T12033] ? panic_smp_self_stop+0xa0/0xa0
[ 696.664396][T12033] ? preempt_schedule_thunk+0x1a/0x20
[ 696.664829][T12033] ? preempt_schedule_common+0x45/0xc0
[ 696.665263][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.665698][T12033] check_panic_on_warn+0xab/0xb0
[ 696.666087][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.666519][T12033] end_report+0xe9/0x120
[ 696.666861][T12033] kasan_report+0xcd/0xf0
[ 696.667209][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.667639][T12033] mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.668064][T12033] ? ingress_init+0x1c0/0x1c0
[ 696.668451][T12033] tcf_chain0_head_change.isra.0+0xb9/0x120
[ 696.668964][T12033] tc_new_tfilter+0x1ebb/0x22b0
[ 696.669391][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 696.669608][T12033] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 696.669608][T12033] ? kasan_quarantine_put+0x102/0x230
[ 696.669608][T12033] ? lockdep_hardirqs_on+0x7d/0x100
[ 696.669608][T12033] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? bpf_lsm_capable+0x9/0x10
[ 696.669608][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 696.669608][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 696.669608][T12033] ? rtnl_getlink+0xb10/0xb10
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? netdev_core_pick_tx+0x390/0x390
[ 696.669608][T12033] netlink_rcv_skb+0x166/0x440
[ 696.669608][T12033] ? rtnl_getlink+0xb10/0xb10
[ 696.669608][T12033] ? netlink_ack+0x1370/0x1370
[ 696.669608][T12033] ? kasan_set_track+0x25/0x30
[ 696.669608][T12033] ? netlink_deliver_tap+0x1b1/0xd00
[ 696.669608][T12033] netlink_unicast+0x530/0x800
[ 696.669608][T12033] ? netlink_attachskb+0x880/0x880
[ 696.669608][T12033] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 696.669608][T12033] ? __phys_addr_symbol+0x30/0x70
[ 696.669608][T12033] ? __check_object_size+0x323/0x740
[ 696.669608][T12033] netlink_sendmsg+0x90b/0xe10
[ 696.669608][T12033] ? netlink_unicast+0x800/0x800
[ 696.669608][T12033] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 696.669608][T12033] ? netlink_unicast+0x800/0x800
[ 696.669608][T12033] sock_sendmsg+0xd9/0x180
[ 696.669608][T12033] ____sys_sendmsg+0x264/0x910
[ 696.669608][T12033] ? kernel_sendmsg+0x50/0x50
[ 696.669608][T12033] ? __copy_msghdr+0x460/0x460
[ 696.669608][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 696.669608][T12033] ? do_recvmmsg+0x700/0x700
[ 696.669608][T12033] ? find_held_lock+0x2d/0x110
[ 696.669608][T12033] ? __might_fault+0xe5/0x190
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] __sys_sendmmsg+0x18e/0x430
[ 696.669608][T12033] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? rcu_is_watching+0x12/0xb0
[ 696.669608][T12033] ? xfd_validate_state+0x5d/0x180
[ 696.669608][T12033] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 696.669608][T12033] ? unlock_page_memcg+0x2d0/0x2d0
[ 696.669608][T12033] ? do_futex+0x350/0x350
[ 696.669608][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 696.669608][T12033] ? syscall_enter_from_user_mode+0x26/0x80
[ 696.669608][T12033] do_syscall_64+0x38/0xb0
[ 696.669608][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 696.669608][T12033] RIP: 0033:0x7f4aca44a89d
[ 696.669608][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 696.669608][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 696.669608][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX:
00007f4aca44a89d
[ 696.669608][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 696.669608][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09:
0000000000000000
[ 696.669608][T12033] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 696.669608][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15:
00007f4aca2cf000
[ 696.669608][T12033] </TASK>
[ 696.669608][T12033] Kernel Offset: disabled
[ 696.669608][T12033] Rebooting in 86400 seconds..
Hi Pedro,

On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote:
> With V2 patches 5 and 6 applied I was still able to trigger an oops.
>
> Branch is 'net' + patches 5 & 6:
> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
> clsact Qdiscs before grafting
> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
> allocation
>
> Kernel config is the same as in the syzbot report.
> Note that this was on a _single core_ VM.
> I will double check if v1 is triggering this issue (basically run the repro
> for a long time). For multi-core my VM is running OOM even on a 32Gb system.
> I will check if we have a spare server to run the repro.

Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
under TC_H_ROOT, which isn't covered by [6/6], i.e. it exercises the
"!ingress" path in qdisc_graft(). I think that's why you are still seeing
the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
after [1,2/6], and I think we'll need a different reproducer for [5,6/6].

However, I just noticed that for some reason git-send-email in my new
setup didn't auto-generate From: tags with my work email, so Author: will
be my personal email (I have to send patches from my personal email to
avoid "[External] " subject prefixes)... I will fix it in v3 soon. Sorry
in advance for spamming.

Thanks,
Peilin Ye
On 23/05/2023 01:40, Peilin Ye wrote: > Hi Pedro, > > On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote: >> With V2 patches 5 and 6 applied I was still able to trigger an oops. >> >> Branch is 'net' + patches 5 & 6: >> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and >> clsact Qdiscs before grafting >> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs >> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file >> allocation >> >> Kernel config is the same as in the syzbot report. >> Note that this was on a _single core_ VM. >> I will double check if v1 is triggering this issue (basically run the repro >> for a long time). For multi-core my VM is running OOM even on a 32Gb system. >> I will check if we have a spare server to run the repro. > > Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs > under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the > "!ingress" path in qdisc_graft(). I think that's why you are still seeing > the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible > after [1,2/6], and I think we'll need a different reproducer for [5,6/6]. > > However I just noticed that for some reason my git-send-email in my new > setup didn't auto-generate From: tags with my work email, so Author: will > be my personal email (I have to send patches from personal email to avoid > "[External] " subject prefixes) ... I will fix it in v3 soon. Sorry in > advance for spamming. > > Thanks, > Peilin Ye > I see, We need to make sure then, when the time comes, that all the required patches are back ported in the same bundle so we don't have a partial fix; given that they target different commit tags. I was still able to trigger an oops with the full patchset: [ 104.944353][ T6588] ------------[ cut here ]------------ [ 104.944896][ T6588] jump label: negative count! 
[ 104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at kernel/jump_label.c:263 static_key_slow_try_dec+0xf2/0x110
[ 104.946795][ T6588] Modules linked in:
[ 104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 6.4.0-rc2-00191-g4a3f9100193d #3
[ 104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
[ 104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f
[ 104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
[ 104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 0000000000000000
[ 104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 0000000000000001
[ 104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 0000000000000000
[ 104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff
[ 104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: ffffffff8e7b0680
[ 104.954746][ T6588] FS:  00007f76c65d56c0(0000) GS:ffff8881f5a00000(0000) knlGS:0000000000000000
[ 104.955430][ T6588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4: 0000000000350ef0
[ 104.956559][ T6588] Call Trace:
[ 104.956829][ T6588]  <TASK>
[ 104.957062][ T6588]  ? clsact_egress_block_get+0x40/0x40
[ 104.957507][ T6588]  static_key_slow_dec+0x60/0xc0
[ 104.957906][ T6588]  qdisc_create+0xa45/0x1090
[ 104.958274][ T6588]  ? tc_get_qdisc+0xb70/0xb70
[ 104.958646][ T6588]  tc_modify_qdisc+0x491/0x1b70
[ 104.959031][ T6588]  ? qdisc_create+0x1090/0x1090
[ 104.959420][ T6588]  ? bpf_lsm_capable+0x9/0x10
[ 104.959797][ T6588]  ? qdisc_create+0x1090/0x1090
[ 104.960175][ T6588]  rtnetlink_rcv_msg+0x439/0xd30
[ 104.960625][ T6588]  ? rtnl_getlink+0xb10/0xb10
[ 104.960995][ T6588]  ? __x64_sys_sendmmsg+0x9c/0x100
[ 104.961397][ T6588]  ? do_syscall_64+0x38/0xb0
[ 104.961767][ T6588]  ? netdev_core_pick_tx+0x390/0x390
[ 104.962195][ T6588]  netlink_rcv_skb+0x166/0x440
[ 104.962584][ T6588]  ? rtnl_getlink+0xb10/0xb10
[ 104.962954][ T6588]  ? netlink_ack+0x1370/0x1370
[ 104.963336][ T6588]  ? kasan_set_track+0x25/0x30
[ 104.963731][ T6588]  ? netlink_deliver_tap+0x1b1/0xd00
[ 104.964156][ T6588]  netlink_unicast+0x530/0x800
[ 104.964537][ T6588]  ? netlink_attachskb+0x880/0x880
[ 104.964951][ T6588]  ? __sanitizer_cov_trace_switch+0x54/0x90
[ 104.965405][ T6588]  ? __phys_addr_symbol+0x30/0x70
[ 104.965793][ T6588]  ? __check_object_size+0x323/0x740
[ 104.966205][ T6588]  netlink_sendmsg+0x90b/0xe10
[ 104.966577][ T6588]  ? netlink_unicast+0x800/0x800
[ 104.966965][ T6588]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 104.967374][ T6588]  ? netlink_unicast+0x800/0x800
[ 104.967761][ T6588]  sock_sendmsg+0xd9/0x180
[ 104.968109][ T6588]  ____sys_sendmsg+0x264/0x910
[ 104.968490][ T6588]  ? kernel_sendmsg+0x50/0x50
[ 104.968871][ T6588]  ? __copy_msghdr+0x460/0x460
[ 104.969244][ T6588]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 104.969716][ T6588]  ? find_held_lock+0x2d/0x110
[ 104.970088][ T6588]  ___sys_sendmsg+0x11d/0x1b0
[ 104.970491][ T6588]  ? do_recvmmsg+0x700/0x700
[ 104.970853][ T6588]  ? __fget_files+0x260/0x420
[ 104.971221][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[ 104.971649][ T6588]  ? __fget_files+0x282/0x420
[ 104.972018][ T6588]  ? __fget_light+0xe6/0x270
[ 104.972384][ T6588]  __sys_sendmmsg+0x18e/0x430
[ 104.972766][ T6588]  ? __ia32_sys_sendmsg+0xb0/0xb0
[ 104.973167][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[ 104.973588][ T6588]  ? rcu_is_watching+0x12/0xb0
[ 104.973958][ T6588]  ? xfd_validate_state+0x5d/0x180
[ 104.974355][ T6588]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 104.974796][ T6588]  ? unlock_page_memcg+0x2d0/0x2d0
[ 104.975178][ T6588]  ? do_futex+0x350/0x350
[ 104.975501][ T6588]  __x64_sys_sendmmsg+0x9c/0x100
[ 104.975867][ T6588]  ? syscall_enter_from_user_mode+0x26/0x80
[ 104.976307][ T6588]  do_syscall_64+0x38/0xb0
[ 104.976649][ T6588]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 104.977088][ T6588] RIP: 0033:0x7f76c66ee89d
[ 104.977417][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[ 104.978835][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203 ORIG_RAX: 0000000000000133
[ 104.979468][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX: 00007f76c66ee89d
[ 104.980073][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[ 104.980739][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09: 0000000000000000
[ 104.981352][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12: fffffffffffffeb8
[ 104.981968][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15: 00007f76c65b5000
[ 104.982587][ T6588]  </TASK>
[ 104.982833][ T6588] Kernel panic - not syncing: kernel: panic_on_warn set ...
[ 104.983391][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 6.4.0-rc2-00191-g4a3f9100193d #3
[ 104.984054][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 104.984854][ T6588] Call Trace:
[ 104.985121][ T6588]  <TASK>
[ 104.985359][ T6588]  dump_stack_lvl+0xd9/0x1b0
[ 104.985733][ T6588]  panic+0x689/0x730
[ 104.986053][ T6588]  ? panic_smp_self_stop+0xa0/0xa0
[ 104.986467][ T6588]  ? show_trace_log_lvl+0x28e/0x400
[ 104.986888][ T6588]  ? static_key_slow_try_dec+0xf2/0x110
[ 104.987329][ T6588]  check_panic_on_warn+0xab/0xb0
[ 104.987729][ T6588]  __warn+0xf2/0x380
[ 104.988046][ T6588]  ? static_key_slow_try_dec+0xf2/0x110
[ 104.988476][ T6588]  report_bug+0x3bc/0x580
[ 104.988838][ T6588]  handle_bug+0x3c/0x70
[ 104.989169][ T6588]  exc_invalid_op+0x17/0x40
[ 104.989529][ T6588]  asm_exc_invalid_op+0x1a/0x20
[ 104.989914][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
[ 104.990402][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f
[ 104.990702][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
[ 104.990702][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 0000000000000000
[ 104.990702][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 0000000000000001
[ 104.990702][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 0000000000000000
[ 104.990702][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff
[ 104.990702][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: ffffffff8e7b0680
[ 104.990702][ T6588]  ? __warn_printk+0x187/0x310
[ 104.990702][ T6588]  ? clsact_egress_block_get+0x40/0x40
[ 104.990702][ T6588]  static_key_slow_dec+0x60/0xc0
[ 104.990702][ T6588]  qdisc_create+0xa45/0x1090
[ 104.990702][ T6588]  ? tc_get_qdisc+0xb70/0xb70
[ 104.990702][ T6588]  tc_modify_qdisc+0x491/0x1b70
[ 104.990702][ T6588]  ? qdisc_create+0x1090/0x1090
[ 104.990702][ T6588]  ? bpf_lsm_capable+0x9/0x10
[ 104.990702][ T6588]  ? qdisc_create+0x1090/0x1090
[ 104.990702][ T6588]  rtnetlink_rcv_msg+0x439/0xd30
[ 104.990702][ T6588]  ? rtnl_getlink+0xb10/0xb10
[ 104.990702][ T6588]  ? __x64_sys_sendmmsg+0x9c/0x100
[ 104.990702][ T6588]  ? do_syscall_64+0x38/0xb0
[ 104.990702][ T6588]  ? netdev_core_pick_tx+0x390/0x390
[ 104.990702][ T6588]  netlink_rcv_skb+0x166/0x440
[ 104.990702][ T6588]  ? rtnl_getlink+0xb10/0xb10
[ 104.990702][ T6588]  ? netlink_ack+0x1370/0x1370
[ 104.990702][ T6588]  ? kasan_set_track+0x25/0x30
[ 104.990702][ T6588]  ? netlink_deliver_tap+0x1b1/0xd00
[ 104.990702][ T6588]  netlink_unicast+0x530/0x800
[ 104.990702][ T6588]  ? netlink_attachskb+0x880/0x880
[ 104.990702][ T6588]  ? __sanitizer_cov_trace_switch+0x54/0x90
[ 104.990702][ T6588]  ? __phys_addr_symbol+0x30/0x70
[ 104.990702][ T6588]  ? __check_object_size+0x323/0x740
[ 104.990702][ T6588]  netlink_sendmsg+0x90b/0xe10
[ 104.990702][ T6588]  ? netlink_unicast+0x800/0x800
[ 104.990702][ T6588]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 104.990702][ T6588]  ? netlink_unicast+0x800/0x800
[ 104.990702][ T6588]  sock_sendmsg+0xd9/0x180
[ 104.990702][ T6588]  ____sys_sendmsg+0x264/0x910
[ 104.990702][ T6588]  ? kernel_sendmsg+0x50/0x50
[ 104.990702][ T6588]  ? __copy_msghdr+0x460/0x460
[ 104.990702][ T6588]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 104.990702][ T6588]  ? find_held_lock+0x2d/0x110
[ 104.990702][ T6588]  ___sys_sendmsg+0x11d/0x1b0
[ 104.990702][ T6588]  ? do_recvmmsg+0x700/0x700
[ 104.990702][ T6588]  ? __fget_files+0x260/0x420
[ 104.990702][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[ 104.990702][ T6588]  ? __fget_files+0x282/0x420
[ 104.990702][ T6588]  ? __fget_light+0xe6/0x270
[ 104.990702][ T6588]  __sys_sendmmsg+0x18e/0x430
[ 104.990702][ T6588]  ? __ia32_sys_sendmsg+0xb0/0xb0
[ 104.990702][ T6588]  ? reacquire_held_locks+0x4b0/0x4b0
[ 104.990702][ T6588]  ? rcu_is_watching+0x12/0xb0
[ 104.990702][ T6588]  ? xfd_validate_state+0x5d/0x180
[ 104.990702][ T6588]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 104.990702][ T6588]  ? unlock_page_memcg+0x2d0/0x2d0
[ 104.990702][ T6588]  ? do_futex+0x350/0x350
[ 104.990702][ T6588]  __x64_sys_sendmmsg+0x9c/0x100
[ 104.990702][ T6588]  ? syscall_enter_from_user_mode+0x26/0x80
[ 104.990702][ T6588]  do_syscall_64+0x38/0xb0
[ 104.990702][ T6588]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 104.990702][ T6588] RIP: 0033:0x7f76c66ee89d
[ 104.990702][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[ 104.990702][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203 ORIG_RAX: 0000000000000133
[ 104.990702][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX: 00007f76c66ee89d
[ 104.990702][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[ 104.990702][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09: 0000000000000000
[ 104.990702][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12: fffffffffffffeb8
[ 104.990702][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15: 00007f76c65b5000
[ 104.990702][ T6588]  </TASK>
[ 104.990702][ T6588] Kernel Offset: disabled
[ 104.990702][ T6588] Rebooting in 86400 seconds..
Hi Peilin,

Thanks again for the great analysis and for investing the effort into
fixing this!

On Mon 22 May 2023 at 16:55, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> requests (thanks Hillf Danton for the hint), when replacing ingress or
> clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> causing race conditions [1] including a use-after-free bug in
> mini_qdisc_pair_swap() reported by syzbot:
>
>  BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>  Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
>  ...
>  Call Trace:
>   <TASK>
>   __dump_stack lib/dump_stack.c:88 [inline]
>   dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
>   print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
>   print_report mm/kasan/report.c:430 [inline]
>   kasan_report+0x11c/0x130 mm/kasan/report.c:536
>   mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>   tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
>   tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
>   tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
>   tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
>   tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
>  ...
>
> @old and @new should not affect each other. In other words, @old should
> never modify miniq_{in,e}gress after @new, and @new should not update
> @old's RCU state. Fixing without changing sch_api.c turned out to be
> difficult (please refer to Closes: for discussions). Instead, make sure
> @new's first call always happens after @old's last call, in
> qdisc_destroy(), has finished:
>
> In qdisc_graft(), return -EAGAIN and tell the caller to replay
> (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> requests, and call qdisc_destroy() for @old before grafting @new.
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> Qdiscs".
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
>  Thread 1                 A's refcnt          Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
>
>   RTM_NEWTFILTER (X, RTNL-unlocked)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>     |
>     |                                RTM_NEWQDISC (B, RTNL-locked)
>          RCU sync                2    qdisc_graft(B)
>     |                            1     notify_and_destroy(A)
>     |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
>     qdisc_destroy(A)                   tcf_chain0_head_change(B)
>      tcf_chain0_head_change_cb_del(A)   mini_qdisc_pair_swap(B) (2nd)
>       mini_qdisc_pair_swap(A) (3rd)         |
>            ...                              ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is only one of the possible consequences of concurrently accessing
> miniq_{in,e}gress pointers. The point is clear though: again, A should
> never modify those per-net_device pointers after B, and B should not
> update A's RCU state.
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Cc: Hillf Danton <hdanton@sina.com>
> Cc: Vlad Buslov <vladbu@mellanox.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
> ---
> changes in v2:
>   - replay the request if the current Qdisc has any ongoing RTNL-unlocked
>     filter requests (Vlad)
>   - minor changes in code comments and commit log
>
>  include/net/sch_generic.h |  8 ++++++++
>  net/sched/sch_api.c       | 32 ++++++++++++++++++++++++++------
>  net/sched/sch_generic.c   | 14 +++++++++++---
>  3 files changed, 45 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>  	refcount_inc(&qdisc->refcnt);
>  }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>  /* Intended to be used by unlocked users, when concurrent qdisc release is
>   * possible.
>   */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>  struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>  			      struct Qdisc *qdisc);
>  void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>  void qdisc_put(struct Qdisc *qdisc);
>  void qdisc_put_unlocked(struct Qdisc *qdisc);
>  void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..b3bafa6c1b44 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  	if ((q && q->flags & TCQ_F_INGRESS) ||
>  	    (new && new->flags & TCQ_F_INGRESS)) {
>  		ingress = 1;
> -		if (!dev_ingress_queue(dev)) {
> +		dev_queue = dev_ingress_queue(dev);
> +		if (!dev_queue) {
>  			NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>  			return -ENOENT;
>  		}
> +
> +		/* Replay if the current ingress (or clsact) Qdisc has ongoing
> +		 * RTNL-unlocked filter request(s). This is the counterpart of that
> +		 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +		 */
> +		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +			return -EAGAIN;
>  	}
>
>  	if (dev->flags & IFF_UP)
> @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  				qdisc_put(old);
>  			}
>  		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>  		}
>
>  skip:
> @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
>  		if (new && new->ops->attach)
>  			new->ops->attach(new);
> -	} else {
> -		notify_and_destroy(net, skb, n, classid, old, new, extack);
>  	}
>
>  	if (dev->flags & IFF_UP)
> @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  	struct Qdisc *p = NULL;
>  	int err;
>
> +replay:

Perhaps also set q and p to NULL here? Even though on cursory look you
should get the same lookup result since the function is called under
rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case
something touches this.") and tc_new_tfilter() got some non-obvious bugs
after I introduced replay there without re-setting some of the required
variables.

>  	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
>  				     rtm_tca_policy, extack);
>  	if (err < 0)
> @@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  			return -ENOENT;
>  		}
>  		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
> -		if (err != 0)
> +		if (err != 0) {
> +			if (err == -EAGAIN)
> +				goto replay;
>  			return err;
> +		}
>  	} else {
>  		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
>  	}
> @@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  	if (err) {
>  		if (q)
>  			qdisc_put(q);
> +		if (err == -EAGAIN)
> +			goto replay;
>  		return err;
>  	}
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>  	qdisc_free(q);
>  }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>  {
>  	const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>  	call_rcu(&qdisc->rcu, qdisc_free_cb);
>  }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>  void qdisc_put(struct Qdisc *qdisc)
>  {
>  	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>  	    !refcount_dec_and_test(&qdisc->refcnt))
>  		return;
>
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  }
>  EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>  	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>  		return;
>
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  	rtnl_unlock();
>  }
>  EXPORT_SYMBOL(qdisc_put_unlocked);
Hi Vlad,

On Tue, May 23, 2023 at 05:04:40PM +0300, Vlad Buslov wrote:
> > @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
> >  	struct Qdisc *p = NULL;
> >  	int err;
> >
> > +replay:
>
> Perhaps also set q and p to NULL here? Even though on cursory look you
> should get the same lookup result since the function is called under
> rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case
> something touches this.") and tc_new_tfilter() got some non-obvious bugs
> after I introduced replay there without re-setting some of the required
> variables.

Sure, I'll reinitialize tcm, q and p after "replay:" in the next version,
just like tc_modify_qdisc() does. Thanks for the suggestion!

Peilin Ye
On Tue, May 23, 2023 at 08:36:35AM -0300, Pedro Tammela wrote:
> > Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
> > under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
> > "!ingress" path in qdisc_graft(). I think that's why you are still seeing
> > the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
> > after [1,2/6], and I think we'll need a different reproducer for [5,6/6].
>
> I was still able to trigger an oops with the full patchset:
>
> [ 104.944353][ T6588] ------------[ cut here ]------------
> [ 104.944896][ T6588] jump label: negative count!
> [ 104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at kernel/jump_label.c:263
> static_key_slow_try_dec+0xf2/0x110
> [ 104.946795][ T6588] Modules linked in:
> [ 104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted
> 6.4.0-rc2-00191-g4a3f9100193d #3
> [ 104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
> [ 104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d
> c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c
> ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f
> [ 104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
> [ 104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX:
> 0000000000000000
> [ 104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI:
> 0000000000000001
> [ 104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09:
> 0000000000000000
> [ 104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12:
> 00000000ffffffff
> [ 104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15:
> ffffffff8e7b0680
> [ 104.954746][ T6588] FS:  00007f76c65d56c0(0000) GS:ffff8881f5a00000(0000)
> knlGS:0000000000000000
> [ 104.955430][ T6588] CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> [ 104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4:
> 0000000000350ef0
> [ 104.956559][ T6588] Call Trace:
> [ 104.956829][ T6588]  <TASK>
> [ 104.957062][ T6588]  ? clsact_egress_block_get+0x40/0x40
> [ 104.957507][ T6588]  static_key_slow_dec+0x60/0xc0
> [ 104.957906][ T6588]  qdisc_create+0xa45/0x1090
> [ 104.958274][ T6588]  ? tc_get_qdisc+0xb70/0xb70
> [ 104.958646][ T6588]  tc_modify_qdisc+0x491/0x1b70
> [ 104.959031][ T6588]  ? qdisc_create+0x1090/0x1090
> [ 104.959420][ T6588]  ? bpf_lsm_capable+0x9/0x10
> [ 104.959797][ T6588]  ? qdisc_create+0x1090/0x1090

Ah, qdisc_create() calls ->destroy() even "if ops->init() failed". We
should check sch->parent in {ingress,clsact}_destroy() too. I'll update
[1,2/6] in v5.

Thanks for reporting this! Seems like I should've run the reproducer
nevertheless; I'll run it before posting v5.

Thanks,
Peilin Ye
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..b3bafa6c1b44 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 	if ((q && q->flags & TCQ_F_INGRESS) ||
 	    (new && new->flags & TCQ_F_INGRESS)) {
 		ingress = 1;
-		if (!dev_ingress_queue(dev)) {
+		dev_queue = dev_ingress_queue(dev);
+		if (!dev_queue) {
 			NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 			return -ENOENT;
 		}
+
+		/* Replay if the current ingress (or clsact) Qdisc has ongoing
+		 * RTNL-unlocked filter request(s). This is the counterpart of that
+		 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+		 */
+		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+			return -EAGAIN;
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 		if (new && new->ops->attach)
 			new->ops->attach(new);
-	} else {
-		notify_and_destroy(net, skb, n, classid, old, new, extack);
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
@@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);