Message ID | e6c4681dd9205d702ae2e6124e20c6210520e76e.1683326865.git.peilin.ye@bytedance.com |
---|---|
State | New |
Headers | From: Peilin Ye <peilin.ye@bytedance.com>
To: "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jamal Hadi Salim <jhs@mojatatu.com>, Cong Wang <xiyou.wangcong@gmail.com>, Jiri Pirko <jiri@resnulli.us>
Subject: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Date: Fri, 5 May 2023 17:16:10 -0700
Message-Id: <e6c4681dd9205d702ae2e6124e20c6210520e76e.1683326865.git.peilin.ye@bytedance.com>
In-Reply-To: <cover.1683326865.git.peilin.ye@bytedance.com> |
Series | net/sched: Fixes for sch_ingress and sch_clsact |
Commit Message
Peilin Ye
May 6, 2023, 12:16 a.m. UTC
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
clsact Qdiscs and miniq_egress.
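The double-pointer arrangement described above can be sketched in plain C. This is a simplified illustration, not the kernel code: all toy_* names are invented here, and the kernel publishes the pointer with RCU rather than a plain store.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of mini_Qdisc_pair: two buffers plus p_miniq, a double
 * pointer aimed at a per-device slot (the stand-in for
 * net_device::miniq_ingress). */
struct toy_miniq { int filters; };

struct toy_dev { struct toy_miniq *miniq_ingress; };

struct toy_miniq_pair {
	struct toy_miniq miniq1, miniq2;
	struct toy_miniq **p_miniq;	/* -> &dev->miniq_ingress */
};

/* As in ingress_init(): aim the pair at this device's slot. */
static void toy_pair_init(struct toy_miniq_pair *pair, struct toy_dev *dev)
{
	pair->p_miniq = &dev->miniq_ingress;
}

/* As in mini_qdisc_pair_swap(): publish the active mini qdisc through
 * the per-device slot, or clear the slot on destroy. The kernel uses
 * rcu_assign_pointer()/RCU_INIT_POINTER() for this store. */
static void toy_pair_swap(struct toy_miniq_pair *pair, struct toy_miniq *active)
{
	*pair->p_miniq = active;
}
```

Because every pair created for the same device aims p_miniq at the same slot, two Qdiscs swapping concurrently write the same location, which is what makes the replacement race possible.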
Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
access the same miniq_{in,e}gress pointer(s) concurrently with the new
Qdisc, causing race conditions [1] including a use-after-free in
mini_qdisc_pair_swap() reported by syzbot:
BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...
The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
finished.
To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
qdisc_destroy() for "old" before grafting "new".
Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
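The intended pairing of the two refcount helpers can be modeled with C11 atomics. This is a hedged sketch only: the fake_* names are invented, and the kernel's refcount_t has saturation and warning semantics that this toy omits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for struct Qdisc's refcnt (refcount_t in the kernel). */
struct fake_qdisc {
	atomic_int refcnt;
};

/* Mirrors refcount_inc_not_zero(): take a reference only while the
 * object is still alive (refcnt > 0). This is what RTNL-lockless
 * filter requests do via qdisc_refcount_inc_nz(). */
static bool fake_refcount_inc_nz(struct fake_qdisc *q)
{
	int old = atomic_load(&q->refcnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&q->refcnt, &old, old + 1))
			return true;
	}
	return false;	/* already dead: caller must not touch it */
}

/* Mirrors refcount_dec_if_one(): drop the reference only if it is the
 * *sole* remaining one, i.e. no lockless filter request is in flight.
 * On success the count hits zero, so fake_refcount_inc_nz() fails
 * from then on. */
static bool fake_refcount_dec_if_one(struct fake_qdisc *q)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&q->refcnt, &expected, 0);
}
```

This is why the dec_if_one check in qdisc_graft() can safely return -EBUSY: either it wins the race to zero and no new filter request can grab the old Qdisc, or a request holds a reference and the graft backs off.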
[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
adds a flower filter X to A.
Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
b2) to replace A, then adds a flower filter Y to B.
Thread 1                             A's refcnt   Thread 2
 RTM_NEWQDISC (A, RTNL-locked)
  qdisc_create(A)                     1
  qdisc_graft(A)                      9

 RTM_NEWTFILTER (X, RTNL-lockless)
  __tcf_qdisc_find(A)                10
  tcf_chain0_head_change(A)
  mini_qdisc_pair_swap(A) (1st)
   |
   |                                              RTM_NEWQDISC (B, RTNL-locked)
  RCU                                 2            qdisc_graft(B)
   |                                  1             notify_and_destroy(A)
   |
  tcf_block_release(A)                0           RTM_NEWTFILTER (Y, RTNL-lockless)
   qdisc_destroy(A)                                tcf_chain0_head_change(B)
    tcf_chain0_head_change_cb_del(A)                mini_qdisc_pair_swap(B) (2nd)
     mini_qdisc_pair_swap(A) (3rd)                       |
          ...                                           ...
Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().
This is just one of the possible consequences of concurrently accessing
net_device::miniq_{in,e}gress pointers. The point is clear, however:
B's first call to mini_qdisc_pair_swap() should take place after A's
last call, in qdisc_destroy().
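The lost-publication outcome above can be reduced to a few lines of C. A minimal sketch with invented names, modeling neither RCU nor locking, only the ordering of stores to the shared per-device slot:

```c
#include <assert.h>
#include <stddef.h>

/* Stands in for eth0->miniq_ingress; both A and B publish through it. */
static const char *miniq_ingress;

/* Simplified mini_qdisc_pair_swap(): NULL means "being destroyed". */
static void swap_in(const char *active)
{
	miniq_ingress = active;
}
```

Replaying the broken ordering, then the fixed one, shows the difference: when A's final swap runs last it clobbers B's publication, whereas destroying A before grafting B leaves b1 in place.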
Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
include/net/sch_generic.h | 8 ++++++++
net/sched/sch_api.c | 26 +++++++++++++++++++++-----
net/sched/sch_generic.c | 14 +++++++++++---
3 files changed, 40 insertions(+), 8 deletions(-)
Comments
On Fri, May 5, 2023 at 8:16 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> [commit message snipped]
>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Thanks for the excellent analysis Peilin and for chasing this to the
end. I have no doubt it was a lot of fun! ;->

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

> ---
>  include/net/sch_generic.h |  8 ++++++++
>  net/sched/sch_api.c       | 26 +++++++++++++++++++++-----
>  net/sched/sch_generic.c   | 14 +++++++++++---
>  3 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>  	refcount_inc(&qdisc->refcnt);
>  }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>  /* Intended to be used by unlocked users, when concurrent qdisc release is
>   * possible.
>   */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>  struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>  			      struct Qdisc *qdisc);
>  void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>  void qdisc_put(struct Qdisc *qdisc);
>  void qdisc_put_unlocked(struct Qdisc *qdisc);
>  void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		if ((q && q->flags & TCQ_F_INGRESS) ||
>  		    (new && new->flags & TCQ_F_INGRESS)) {
>  			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>  				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>  				return -ENOENT;
>  			}
> +
> +			/* This is the counterpart of that qdisc_refcount_inc_nz() call in
> +			 * __tcf_qdisc_find() for RTNL-lockless filter requests.
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +				NL_SET_ERR_MSG(extack,
> +					       "Current ingress or clsact Qdisc has ongoing filter request(s)");
> +				return -EBUSY;
> +			}
>  		}
>
>  		if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  				qdisc_put(old);
>  			}
>  		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>  		}
>
>  skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
>  		if (new && new->ops->attach)
>  			new->ops->attach(new);
> -	} else {
> -		notify_and_destroy(net, skb, n, classid, old, new, extack);
>  	}
>
>  	if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>  	qdisc_free(q);
>  }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>  {
>  	const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>  	call_rcu(&qdisc->rcu, qdisc_free_cb);
>  }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>  void qdisc_put(struct Qdisc *qdisc)
>  {
>  	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>  	    !refcount_dec_and_test(&qdisc->refcnt))
>  		return;
>
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  }
>  EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>  	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>  		return;
>
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  	rtnl_unlock();
>  }
>  EXPORT_SYMBOL(qdisc_put_unlocked);
> --
> 2.20.1
On 05/05/2023 21:16, Peilin Ye wrote:
> [...]

Thanks for chasing this!

Tested-by: Pedro Tammela <pctammela@mojatatu.com>
On Mon, May 08, 2023 at 07:32:01AM -0400, Jamal Hadi Salim wrote:
> Thanks for the excellent analysis Peilin and for chasing this to the
> end. I have no doubt it was a lot of fun! ;->

Of course ;-)

> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

Thanks for reviewing this!
Peilin Ye
On Mon, May 08, 2023 at 11:12:24AM -0300, Pedro Tammela wrote:
> Thanks for chasing this!
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>

Thanks for testing, Pedro!

Thanks,
Peilin Ye
On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> [race diagram and analysis snipped]
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().

Great analysis, thanks for squashing this bug.

Have you considered creating a fix more localized to the miniq
implementation? It seems that having per-device miniq pointers is
incompatible with using reference counted objects. So miniq is
a more natural place to solve the problem. Otherwise workarounds
in the core keep piling up (here qdisc_graft()).

Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
If active qdisc is neither a1 nor a2 we should leave the dev state
alone.
On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.

Thanks, happy to help!

> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.

Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard: (3rd) is called from ->destroy(), so currently it uses
RCU_INIT_POINTER() to set dev->miniq_ingress to NULL. It will need a
logic like: I am A. Set dev->miniq_ingress to NULL, if and only if it is
a1 or a2, and do it atomically. We need more than a cmpxchg() to
implement this "set NULL iff a1 or a2".

Additionally:

On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> [race diagram snipped]

Looking at the code, I think there is no guarantee that (1st) cannot
happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
handlers get preempted?

If (1st) happens later than (2nd), we will need to make (1st) a no-op,
by detecting that we are the "old" Qdisc. I am not sure there is any
(clean) way to do it. I even thought about:

(1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
(2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
    means we are the "old" Qdisc (have been replaced), and should do
    nothing.

However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...

Eventually I created [5,6/6]. It is a workaround indeed, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue. However I think
it makes the code correct in a relatively understandable way, without
slowing down mini_qdisc_pair_swap() or sch_handle_*gress().

Thanks,
Peilin Ye
On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> > adds a flower filter X to A.
> >
> > Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> > b2) to replace A, then adds a flower filter Y to B.
> >
> >             Thread 1               A's refcnt   Thread 2
> >  RTM_NEWQDISC (A, RTNL-locked)
> >   qdisc_create(A)                      1
> >   qdisc_graft(A)                       9
> >
> >  RTM_NEWTFILTER (X, RTNL-lockless)
> >   __tcf_qdisc_find(A)                 10
> >   tcf_chain0_head_change(A)
> >   mini_qdisc_pair_swap(A) (1st)
> >           |
> >           |                                RTM_NEWQDISC (B, RTNL-locked)
> >          RCU                           2    qdisc_graft(B)
> >           |                            1     notify_and_destroy(A)
> >           |
> >   tcf_block_release(A)                 0    RTM_NEWTFILTER (Y, RTNL-lockless)
> >    qdisc_destroy(A)                          tcf_chain0_head_change(B)
> >     tcf_chain0_head_change_cb_del(A)          mini_qdisc_pair_swap(B) (2nd)
> >      mini_qdisc_pair_swap(A) (3rd)                   |
> >            ...                                      ...
>
> Looking at the code, I think there is no guarantee that (1st) cannot
> happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> handlers get preempted?

Right, we need qdisc_graft(B) to update the appropriate dev pointer
to point to b1. With that the ordering should not matter. Probably
using the ->attach() callback?

> If (1st) happens later than (2nd), we will need to make (1st) no-op, by
> detecting that we are the "old" Qdisc.  I am not sure there is any
> (clean) way to do it.  I even thought about:
>
>   (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
>   (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping".  If false, it
>       means we are the "old" Qdisc (have been replaced), and should do
>       nothing.
>
> However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
> egress one, so I can't container_of() during step (1) ...

And we can't be using multiple pieces of information to make the
decision since AFAIU mini_qdisc_pair_swap() can race with
qdisc_graft().

My thinking was to make sure that dev->miniq_* pointers always point
to one of the miniqs of the currently attached qdisc. Right now, on
a quick look, those pointers are not initialized during initial graft,
only when first filter is added, and may be cleared when filters are
removed. But I don't think that's strictly required, miniq with no
filters should be fine.

> Eventually I created [5,6/6].  It is a workaround indeed, in the sense
> that it changes sch_api.c to avoid a mini Qdisc issue.  However I think it
> makes the code correct in a relatively understandable way,

What's your benchmark for being understandable?

> without slowing down mini_qdisc_pair_swap() or sch_handle_*gress().
On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> My thinking was to make sure that dev->miniq_* pointers always point
> to one of the miniqs of the currently attached qdisc. Right now, on
> a quick look, those pointers are not initialized during initial graft,
> only when first filter is added, and may be cleared when filters are
> removed. But I don't think that's strictly required, miniq with no
> filters should be fine.

Ah, I see, thanks for explaining, I didn't think of that.  Looking at
sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
(mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
Should I keep this behavior by:

diff --git a/net/core/dev.c b/net/core/dev.c
index 735096d42c1d..9016481377e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 	 * that are not configured with an ingress qdisc will bail
 	 * out here.
 	 */
-	if (!miniq)
+	if (!miniq || !miniq->filter_list)
 		return skb;
 
 	if (*pt_prev) {

On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > >             Thread 1               A's refcnt   Thread 2
> > >  RTM_NEWQDISC (A, RTNL-locked)
> > >   qdisc_create(A)                      1
> > >   qdisc_graft(A)                       9
> > >
> > >  RTM_NEWTFILTER (X, RTNL-lockless)
> > >   __tcf_qdisc_find(A)                 10
> > >   tcf_chain0_head_change(A)
> > >   mini_qdisc_pair_swap(A) (1st)
> > >           |
> > >           |                                RTM_NEWQDISC (B, RTNL-locked)
> > >          RCU                           2    qdisc_graft(B)
> > >           |                            1     notify_and_destroy(A)
> > >           |
> > >   tcf_block_release(A)                 0    RTM_NEWTFILTER (Y, RTNL-lockless)
> > >    qdisc_destroy(A)                          tcf_chain0_head_change(B)
> > >     tcf_chain0_head_change_cb_del(A)          mini_qdisc_pair_swap(B) (2nd)
> > >      mini_qdisc_pair_swap(A) (3rd)                   |
> > >            ...                                      ...
> >
> > Looking at the code, I think there is no guarantee that (1st) cannot
> > happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> > handlers get preempted?
>
> Right, we need qdisc_graft(B) to update the appropriate dev pointer
> to point to b1. With that the ordering should not matter. Probably
> using the ->attach() callback?

->attach() is later than dev_graft_qdisc().  We should get ready for
concurrent filter requests (i.e. have dev pointer pointing to b1) before
grafting (publishing) B.  Also currently qdisc_graft() doesn't call
->attach() for ingress and clsact Qdiscs.

But I see your point, thanks for the suggestion!  I'll try ->init() and
create v2.

Thanks,
Peilin Ye
On Thu, 11 May 2023 13:40:10 -0700 Peilin Ye wrote:
> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > My thinking was to make sure that dev->miniq_* pointers always point
> > to one of the miniqs of the currently attached qdisc. Right now, on
> > a quick look, those pointers are not initialized during initial graft,
> > only when first filter is added, and may be cleared when filters are
> > removed. But I don't think that's strictly required, miniq with no
> > filters should be fine.
>
> Ah, I see, thanks for explaining, I didn't think of that.  Looking at
> sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
> (mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
> Should I keep this behavior by:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 735096d42c1d..9016481377e0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>  	 * that are not configured with an ingress qdisc will bail
>  	 * out here.
>  	 */
> -	if (!miniq)
> +	if (!miniq || !miniq->filter_list)
>  		return skb;
>  
>  	if (*pt_prev) {

Good question, maybe Jiri or Daniel can answer?

> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> > > handlers get preempted?
> >
> > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > to point to b1. With that the ordering should not matter. Probably
> > using the ->attach() callback?
>
> ->attach() is later than dev_graft_qdisc().  We should get ready for
> concurrent filter requests (i.e. have dev pointer pointing to b1) before
> grafting (publishing) B.

I thought even for "unlocked" filter operations the start of it is
under the lock, but the lock gets dropped after qdisc/block are found.
I could be misremembering, I haven't looked at the code.

> Also currently qdisc_graft() doesn't call
> ->attach() for ingress and clsact Qdiscs.
>
> But I see your point, thanks for the suggestion!  I'll try ->init() and
> create v2.

->init() may be too early, aren't there any error points which could
prevent the Qdisc from binding after ->init() was called?
On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > But I see your point, thanks for the suggestion!  I'll try ->init() and
> > create v2.
>
> ->init() may be too early, aren't there any error points which could
> prevent the Qdisc from binding after ->init() was called?

You're right, it's in qdisc_create(), argh...

On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > > happen after (2nd), although unlikely?  Can RTNL-lockless RTM_NEWTFILTER
> > > > handlers get preempted?
> > >
> > > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > > to point to b1. With that the ordering should not matter. Probably
> > > using the ->attach() callback?
> >
> > ->attach() is later than dev_graft_qdisc().  We should get ready for
> > concurrent filter requests (i.e. have dev pointer pointing to b1) before
> > grafting (publishing) B.
>
> I thought even for "unlocked" filter operations the start of it is
> under the lock, but the lock gets dropped after qdisc/block are found.
> I could be misremembering, I haven't looked at the code.

No, f.e. RTM_NEWTFILTER is registered as RTNL_FLAG_DOIT_UNLOCKED, so
tc_new_tfilter() starts and calls __tcf_qdisc_find() without RTNL mutex,
at least in latest code.

Thinking,
Peilin Ye
On Thu, May 11, 2023 at 04:46:33PM -0700, Peilin Ye wrote:
> On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > But I see your point, thanks for the suggestion!  I'll try ->init() and
> > > create v2.
> >
> > ->init() may be too early, aren't there any error points which could
> > prevent the Qdisc from binding after ->init() was called?
>
> You're right, it's in qdisc_create(), argh...

->destroy() is called for all error points between ->init() and
dev_graft_qdisc().  I'll try handling it in ->destroy().

Thanks,
Peilin Ye
On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > ->init() may be too early, aren't there any error points which could
> > > prevent the Qdisc from binding after ->init() was called?
> >
> > You're right, it's in qdisc_create(), argh...
>
> ->destroy() is called for all error points between ->init() and
> dev_graft_qdisc().  I'll try handling it in ->destroy().

Sorry for any confusion: there is no point at all undoing "setting dev
pointer to b1" in ->destroy() because the datapath has already been
affected.

To summarize, grafting B mustn't fail after setting dev pointer to b1, so
->init() is too early, because e.g. if user requested [1] to create a rate
estimator, gen_new_estimator() could fail after ->init() in
qdisc_create().

On the other hand, ->attach() is too late because it's later than
dev_graft_qdisc(), so concurrent filter requests might see uninitialized
dev pointer in theory.

Please suggest; is adding another callback (or calling ->attach()) right
before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
fix?

[1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact

Thanks,
Peilin Ye
On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > You're right, it's in qdisc_create(), argh...
> >
> > ->destroy() is called for all error points between ->init() and
> > dev_graft_qdisc().  I'll try handling it in ->destroy().
>
> Sorry for any confusion: there is no point at all undoing "setting dev
> pointer to b1" in ->destroy() because datapath has already been affected.
>
> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
> ->init() is too early, because e.g. if user requested [1] to create a rate
> estimator, gen_new_estimator() could fail after ->init() in
> qdisc_create().
>
> On the other hand, ->attach() is too late because it's later than
> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
> dev pointer in theory.
>
> Please suggest; is adding another callback (or calling ->attach()) right
> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
> fix?
>
> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact

Vlad, could you please clarify how you expect the unlocked filter
operations to work when the qdisc has global state?
On Tue 16 May 2023 at 12:22, Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
>> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
>> > > You're right, it's in qdisc_create(), argh...
>> >
>> > ->destroy() is called for all error points between ->init() and
>> > dev_graft_qdisc().  I'll try handling it in ->destroy().
>>
>> Sorry for any confusion: there is no point at all undoing "setting dev
>> pointer to b1" in ->destroy() because datapath has already been affected.
>>
>> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
>> ->init() is too early, because e.g. if user requested [1] to create a rate
>> estimator, gen_new_estimator() could fail after ->init() in
>> qdisc_create().
>>
>> On the other hand, ->attach() is too late because it's later than
>> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
>> dev pointer in theory.
>>
>> Please suggest; is adding another callback (or calling ->attach()) right
>> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
>> fix?
>>
>> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact
>
> Vlad, could you please clarify how you expect the unlocked filter
> operations to work when the qdisc has global state?

Jakub, I didn't account for per-net_device pointer usage by miniqp code,
hence this bug. I didn't comment on the fix because I was away from my
PC last week, but Peilin's approach seems reasonable to me. When Peilin
brought up the issue initially I also tried to come up with some trick
to contain the changes to miniqp code instead of changing core, but
couldn't think of anything workable due to the limitations already
discussed in this thread. I'm open to exploring alternative approaches
to solving this issue, if that is what you suggest.
On Tue, 16 May 2023 22:35:51 +0300 Vlad Buslov wrote:
> > Vlad, could you please clarify how you expect the unlocked filter
> > operations to work when the qdisc has global state?
>
> Jakub, I didn't account for per-net_device pointer usage by miniqp code
> hence this bug. I didn't comment on the fix because I was away from my
> PC last week but Peilin's approach seems reasonable to me. When Peilin
> brought up the issue initially I also tried to come up with some trick
> to contain the changes to miniqp code instead of changing core but
> couldn't think of anything workable due to the limitations already
> discussed in this thread. I'm open to explore alternative approaches to
> solving this issue, if that is what you suggest.

Given Peilin's investigation I think a fix without changing core may
indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
is elevated will be appreciated by the users; do we already have
similar behavior in other parts of TC?
On Tue, May 16, 2023 at 02:50:10PM -0700, Jakub Kicinski wrote:
> > > Vlad, could you please clarify how you expect the unlocked filter
> > > operations to work when the qdisc has global state?
> >
> > Jakub, I didn't account for per-net_device pointer usage by miniqp code
> > hence this bug. I didn't comment on the fix because I was away from my
> > PC last week but Peilin's approach seems reasonable to me. When Peilin
> > brought up the issue initially I also tried to come up with some trick
> > to contain the changes to miniqp code instead of changing core but
> > couldn't think of anything workable due to the limitations already
> > discussed in this thread. I'm open to explore alternative approaches to
> > solving this issue, if that is what you suggest.
>
> Given Peilin's investigation I think fix without changing core may
> indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> is elevated will be appreciated by the users, do we already have
> similar behavior in other parts of TC?

Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY:

net/sched/cls_u32.c:

		if (ht->refcnt == 1) {
			u32_destroy_hnode(tp, ht, extack);
		} else {
			NL_SET_ERR_MSG_MOD(extack,
					   "Can not delete in-use filter");
			return -EBUSY;
		}

Thanks,
Peilin Ye
On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> > Given Peilin's investigation I think fix without changing core may
> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> > is elevated will be appreciated by the users, do we already have
> > similar behavior in other parts of TC?
>
> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY

I meant -EBUSY due to a race (another operation being in flight).
I think that's different.
On Tue 16 May 2023 at 17:39, Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
>> > Given Peilin's investigation I think fix without changing core may
>> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
>> > is elevated will be appreciated by the users, do we already have
>> > similar behavior in other parts of TC?
>>
>> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
>
> I meant -EBUSY due to a race (another operation being in flight).
> I think that's different.

I wonder if somehow leveraging the existing tc_modify_qdisc() 'replay'
functionality instead of returning an error to the user would be a
better approach? Currently the function is replayed when qdisc_create()
returns EAGAIN. It should be trivial to do the same for the
qdisc_graft() result.
On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> On Tue 16 May 2023 at 17:39, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> >>
> >> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
> >
> > I meant -EBUSY due to a race (another operation being in flight).
> > I think that's different.
>
> I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> functionality instead of returning error to the user would be a better
> approach? Currently the function is replayed when qdisc_create() returns
> EAGAIN. It should be trivial to do the same for qdisc_graft() result.

Sounds better than returning -EBUSY to the user and expecting them
to retry, yes.
On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> 	} else {
> -		dev_queue = dev_ingress_queue(dev);
> -		old = dev_graft_qdisc(dev_queue, new);
> +		old = dev_graft_qdisc(dev_queue, NULL);
> +
> +		/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> +		 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +		 * pointer(s) in mini_qdisc_pair_swap().
> +		 */
> +		qdisc_notify(net, skb, n, classid, old, new, extack);
> +		qdisc_destroy(old);
> +
> +		dev_graft_qdisc(dev_queue, new);

BTW can't @old be NULL here?
On Wed, May 17, 2023 at 11:48:25AM -0700, Jakub Kicinski wrote:
> > 	} else {
> > -		dev_queue = dev_ingress_queue(dev);
> > -		old = dev_graft_qdisc(dev_queue, new);
> > +		old = dev_graft_qdisc(dev_queue, NULL);
> > +
> > +		/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> > +		 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> > +		 * pointer(s) in mini_qdisc_pair_swap().
> > +		 */
> > +		qdisc_notify(net, skb, n, classid, old, new, extack);
> > +		qdisc_destroy(old);
> > +
> > +		dev_graft_qdisc(dev_queue, new);
>
> BTW can't @old be NULL here?

ingress_queue->qdisc_sleeping is initialized to &noop_qdisc
(placeholder) in dev_ingress_queue_create(), and dev_graft_qdisc() also
grafts &noop_qdisc to represent "there's no Qdisc":

	/* ... and graft new one */
	if (qdisc == NULL)
		qdisc = &noop_qdisc;
	dev_queue->qdisc_sleeping = qdisc;

So @old can't be NULL here.

Thanks,
Peilin Ye
On Wed, May 17, 2023 at 11:48:05AM -0700, Jakub Kicinski wrote:
> On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> > I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> > functionality instead of returning error to the user would be a better
> > approach? Currently the function is replayed when qdisc_create() returns
> > EAGAIN. It should be trivial to do the same for qdisc_graft() result.
>
> Sounds better than returning -EBUSY to the user and expecting them
> to retry, yes.

Thanks for the suggestion, Vlad!  I'll try this in tc_modify_qdisc() and
tc_get_qdisc() (for Qdisc deletion) in v2.

Thanks,
Peilin Ye
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..a2d07bc8ded6 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 	if ((q && q->flags & TCQ_F_INGRESS) ||
 	    (new && new->flags & TCQ_F_INGRESS)) {
 		ingress = 1;
-		if (!dev_ingress_queue(dev)) {
+		dev_queue = dev_ingress_queue(dev);
+		if (!dev_queue) {
 			NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 			return -ENOENT;
 		}
+
+		/* This is the counterpart of that qdisc_refcount_inc_nz() call in
+		 * __tcf_qdisc_find() for RTNL-lockless filter requests.
+		 */
+		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+			NL_SET_ERR_MSG(extack,
+				       "Current ingress or clsact Qdisc has ongoing filter request(s)");
+			return -EBUSY;
+		}
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);