[v2] net: sched: fix memory leak in tcindex_set_parms

Message ID 20221113170507.8205-1-yin31149@gmail.com
State New
Headers
Series [v2] net: sched: fix memory leak in tcindex_set_parms |

Commit Message

Hawkins Jiawei Nov. 13, 2022, 5:05 p.m. UTC
  Syzkaller reports a memory leak as follows:
====================================
BUG: memory leak
unreferenced object 0xffff88810c287f00 (size 256):
  comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
    [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline]
    [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline]
    [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline]
    [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline]
    [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342
    [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553
    [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147
    [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082
    [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540
    [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
    [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
    [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
    [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline]
    [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734
    [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482
    [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536
    [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622
    [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline]
    [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline]
    [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648
    [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
====================================

Kernel uses tcindex_change() to change an existing
traffic-control-indices filter properties. During the
process of changing, kernel clears the old
traffic-control-indices filter result, and updates it
by RCU assigning new traffic-control-indices data.

Yet the problem is that, kernel clears the old
traffic-control-indices filter result, without destroying
its tcf_exts structure, which triggers the above
memory leak.

This patch solves it by using tcf_exts_destroy() to
destroy the tcf_exts structure in old
traffic-control-indices filter result, after the
RCU grace period.

[Thanks to the suggestion from Jakub Kicinski and Cong Wang]

Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/
Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
v2:
  - remove all 'will' in commit message according to Jakub Kicinski
  - add Fixes tag according to Jakub Kicinski
  - remove all ifdefs according to Jakub Kicinski and Cong Wang
  - add synchronize_rcu() before destorying old_e according to
Cong Wang

v1: https://lore.kernel.org/all/20221031060835.11722-1-yin31149@gmail.com/
 net/sched/cls_tcindex.c | 8 ++++++++
 1 file changed, 8 insertions(+)
  

Comments

Paolo Abeni Nov. 15, 2022, 11:36 a.m. UTC | #1
On Mon, 2022-11-14 at 01:05 +0800, Hawkins Jiawei wrote:
> Syzkaller reports a memory leak as follows:
> ====================================
> BUG: memory leak
> unreferenced object 0xffff88810c287f00 (size 256):
>   comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
>     [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline]
>     [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline]
>     [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline]
>     [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline]
>     [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342
>     [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553
>     [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147
>     [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082
>     [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540
>     [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
>     [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
>     [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
>     [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline]
>     [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734
>     [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482
>     [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536
>     [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622
>     [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline]
>     [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline]
>     [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648
>     [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>     [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>     [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> ====================================
> 
> Kernel uses tcindex_change() to change an existing
> traffic-control-indices filter properties. During the
> process of changing, kernel clears the old
> traffic-control-indices filter result, and updates it
> by RCU assigning new traffic-control-indices data.
> 
> Yet the problem is that, kernel clears the old
> traffic-control-indices filter result, without destroying
> its tcf_exts structure, which triggers the above
> memory leak.
> 
> This patch solves it by using tcf_exts_destroy() to
> destroy the tcf_exts structure in old
> traffic-control-indices filter result, after the
> RCU grace period.
> 
> [Thanks to the suggestion from Jakub Kicinski and Cong Wang]
> 
> Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
> Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/
> Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
> Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
> Cc: Cong Wang <cong.wang@bytedance.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
> ---
> v2:
>   - remove all 'will' in commit message according to Jakub Kicinski
>   - add Fixes tag according to Jakub Kicinski
>   - remove all ifdefs according to Jakub Kicinski and Cong Wang
>   - add synchronize_rcu() before destorying old_e according to
> Cong Wang
> 
> v1: https://lore.kernel.org/all/20221031060835.11722-1-yin31149@gmail.com/
>  net/sched/cls_tcindex.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
> index 1c9eeb98d826..d2fac9559d3e 100644
> --- a/net/sched/cls_tcindex.c
> +++ b/net/sched/cls_tcindex.c
> @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
>  	struct tcf_result cr = {};
>  	int err, balloc = 0;
>  	struct tcf_exts e;
> +	struct tcf_exts old_e = {};
>  
>  	err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
>  	if (err < 0)
> @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
>  	}
>  
>  	if (old_r && old_r != r) {
> +		old_e = old_r->exts;
>  		err = tcindex_filter_result_init(old_r, cp, net);
>  		if (err < 0) {
>  			kfree(f);
> @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
>  		tcf_exts_destroy(&new_filter_result.exts);
>  	}
>  
> +	/* Note: old_e should be destroyed after the RCU grace period,
> +	 * to avoid possible use-after-free by concurrent readers.
> +	 */
> +	synchronize_rcu();

this could make tc reconfiguration potentially very slow. I'm wondering
if we can delegate the tcf_exts_destroy() to some workqueue?

Thanks!

Paolo
  
Dmitry Vyukov Nov. 15, 2022, 11:39 a.m. UTC | #2
On Tue, 15 Nov 2022 at 12:36, Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-11-14 at 01:05 +0800, Hawkins Jiawei wrote:
> > Syzkaller reports a memory leak as follows:
> > ====================================
> > BUG: memory leak
> > unreferenced object 0xffff88810c287f00 (size 256):
> >   comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s)
> >   hex dump (first 32 bytes):
> >     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> >     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> >   backtrace:
> >     [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
> >     [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline]
> >     [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline]
> >     [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline]
> >     [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline]
> >     [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342
> >     [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553
> >     [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147
> >     [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082
> >     [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540
> >     [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
> >     [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
> >     [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
> >     [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline]
> >     [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734
> >     [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482
> >     [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536
> >     [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622
> >     [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline]
> >     [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline]
> >     [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648
> >     [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >     [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> >     [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > ====================================
> >
> > Kernel uses tcindex_change() to change an existing
> > traffic-control-indices filter properties. During the
> > process of changing, kernel clears the old
> > traffic-control-indices filter result, and updates it
> > by RCU assigning new traffic-control-indices data.
> >
> > Yet the problem is that, kernel clears the old
> > traffic-control-indices filter result, without destroying
> > its tcf_exts structure, which triggers the above
> > memory leak.
> >
> > This patch solves it by using tcf_exts_destroy() to
> > destroy the tcf_exts structure in old
> > traffic-control-indices filter result, after the
> > RCU grace period.
> >
> > [Thanks to the suggestion from Jakub Kicinski and Cong Wang]
> >
> > Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
> > Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/
> > Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
> > Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
> > Cc: Cong Wang <cong.wang@bytedance.com>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
> > ---
> > v2:
> >   - remove all 'will' in commit message according to Jakub Kicinski
> >   - add Fixes tag according to Jakub Kicinski
> >   - remove all ifdefs according to Jakub Kicinski and Cong Wang
> >   - add synchronize_rcu() before destorying old_e according to
> > Cong Wang
> >
> > v1: https://lore.kernel.org/all/20221031060835.11722-1-yin31149@gmail.com/
> >  net/sched/cls_tcindex.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
> > index 1c9eeb98d826..d2fac9559d3e 100644
> > --- a/net/sched/cls_tcindex.c
> > +++ b/net/sched/cls_tcindex.c
> > @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
> >       struct tcf_result cr = {};
> >       int err, balloc = 0;
> >       struct tcf_exts e;
> > +     struct tcf_exts old_e = {};
> >
> >       err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
> >       if (err < 0)
> > @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
> >       }
> >
> >       if (old_r && old_r != r) {
> > +             old_e = old_r->exts;
> >               err = tcindex_filter_result_init(old_r, cp, net);
> >               if (err < 0) {
> >                       kfree(f);
> > @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
> >               tcf_exts_destroy(&new_filter_result.exts);
> >       }
> >
> > +     /* Note: old_e should be destroyed after the RCU grace period,
> > +      * to avoid possible use-after-free by concurrent readers.
> > +      */
> > +     synchronize_rcu();
>
> this could make tc reconfiguration potentially very slow. I'm wondering
> if we can delegate the tcf_exts_destroy() to some workqueue?

call_rcu?
  
Jakub Kicinski Nov. 15, 2022, 5:02 p.m. UTC | #3
On Mon, 14 Nov 2022 01:05:08 +0800 Hawkins Jiawei wrote:
> diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
> index 1c9eeb98d826..d2fac9559d3e 100644
> --- a/net/sched/cls_tcindex.c
> +++ b/net/sched/cls_tcindex.c
> @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
>  	struct tcf_result cr = {};
>  	int err, balloc = 0;
>  	struct tcf_exts e;
> +	struct tcf_exts old_e = {};

This is not a valid way of initializing a structure.
tcf_exts_init() is supposed to be called.
If we add a list member to that structure this code will break, again.

>  	err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
>  	if (err < 0)
> @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
>  	}
>  
>  	if (old_r && old_r != r) {
> +		old_e = old_r->exts;
>  		err = tcindex_filter_result_init(old_r, cp, net);
>  		if (err < 0) {
>  			kfree(f);
> @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
>  		tcf_exts_destroy(&new_filter_result.exts);
>  	}
>  
> +	/* Note: old_e should be destroyed after the RCU grace period,
> +	 * to avoid possible use-after-free by concurrent readers.
> +	 */
> +	synchronize_rcu();
> +	tcf_exts_destroy(&old_e);

I don't think this dance is required, @cp is a copy of the original
data, and the original (@p) is destroyed in a safe manner below.

>  	if (oldp)
>  		tcf_queue_work(&oldp->rwork, tcindex_partial_destroy_work);
>  	return 0;
  
Paolo Abeni Nov. 15, 2022, 6:57 p.m. UTC | #4
On Tue, 2022-11-15 at 09:02 -0800, Jakub Kicinski wrote:
> On Mon, 14 Nov 2022 01:05:08 +0800 Hawkins Jiawei wrote:
> 
> > @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
> >  	}
> >  
> >  	if (old_r && old_r != r) {
> > +		old_e = old_r->exts;
> >  		err = tcindex_filter_result_init(old_r, cp, net);
> >  		if (err < 0) {
> >  			kfree(f);
> > @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
> >  		tcf_exts_destroy(&new_filter_result.exts);
> >  	}
> >  
> > +	/* Note: old_e should be destroyed after the RCU grace period,
> > +	 * to avoid possible use-after-free by concurrent readers.
> > +	 */
> > +	synchronize_rcu();
> > +	tcf_exts_destroy(&old_e);
> 
> I don't think this dance is required, @cp is a copy of the original
> data, and the original (@p) is destroyed in a safe manner below.

This code confuses me more than a bit, and I don't follow ?!? it looks
like that at this point:

* the data path could access 'old_r->exts' contents via 'p' just before
the previous 'tcindex_filter_result_init(old_r, cp, net);' but still
potentially within the same RCU grace period

* 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old
exts from 'p'  so that will not be freed by later
tcindex_partial_destroy_work() 

Overall it looks to me that we need some somewhat wait for the RCU
grace period, 

Somewhat side question: it looks like that the 'perfect hashing' usage
is the root cause of the issue addressed here, and very likely is
afflicted by other problems, e.g. the data curruption in 'err =
tcindex_filter_result_init(old_r, cp, net);'.

AFAICS 'perfect hashing' usage is a sort of optimization that the user-
space may trigger with some combination of the tcindex arguments. I'm
wondering if we could drop all perfect hashing related code?

Paolo
  
Jakub Kicinski Nov. 16, 2022, 2:44 a.m. UTC | #5
On Tue, 15 Nov 2022 19:57:10 +0100 Paolo Abeni wrote:
> This code confuses me more than a bit, and I don't follow ?!? 

It's very confusing :S

For starters I don't know when r != old_r. I mean now it triggers
randomly after the RCU-ification, but in the original code when
it was just a memset(). When would old_r ever not be null and yet
point to a different entry?

> it looks like that at this point:
> 
> * the data path could access 'old_r->exts' contents via 'p' just before
> the previous 'tcindex_filter_result_init(old_r, cp, net);' but still
> potentially within the same RCU grace period
> 
> * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old
> exts from 'p'  so that will not be freed by later
> tcindex_partial_destroy_work() 
> 
> Overall it looks to me that we need some somewhat wait for the RCU
> grace period, 

Isn't it better to make @cp a deeper copy of @p ?
I thought it already is but we don't seem to be cloning p->h.
Also the cloning of p->perfect looks quite lossy.

> Somewhat side question: it looks like that the 'perfect hashing' usage
> is the root cause of the issue addressed here, and very likely is
> afflicted by other problems, e.g. the data curruption in 'err =
> tcindex_filter_result_init(old_r, cp, net);'.
> 
> AFAICS 'perfect hashing' usage is a sort of optimization that the user-
> space may trigger with some combination of the tcindex arguments. I'm
> wondering if we could drop all perfect hashing related code?

The thought of "how much of this can we delete" did cross my mind :)
  
Hawkins Jiawei Nov. 16, 2022, 7:35 a.m. UTC | #6
Hi,

On Wed, 16 Nov 2022 at 01:02, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 14 Nov 2022 01:05:08 +0800 Hawkins Jiawei wrote:
> > diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
> > index 1c9eeb98d826..d2fac9559d3e 100644
> > --- a/net/sched/cls_tcindex.c
> > +++ b/net/sched/cls_tcindex.c
> > @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
> >       struct tcf_result cr = {};
> >       int err, balloc = 0;
> >       struct tcf_exts e;
> > +     struct tcf_exts old_e = {};
>
> This is not a valid way of initializing a structure.
> tcf_exts_init() is supposed to be called.
> If we add a list member to that structure this code will break, again.

Yes, you are right. But the `old_e` variable here is used only for freeing
old_r->exts resource, `old_e` will be overwritten by old_r->exts content
as follows:

    struct tcf_exts old_e = {};
    ...

    if (old_r && old_r != r) {
            old_e = old_r->exts;
            ...
    }
    ...
    synchronize_rcu();
    tcf_exts_destroy(&old_e);

So this patch uses `struct tcf_exts old_e = {}` here just for a cleared space.
  
Hawkins Jiawei Nov. 16, 2022, 12:10 p.m. UTC | #7
On Wed, 16 Nov 2022 at 10:44, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Nov 2022 19:57:10 +0100 Paolo Abeni wrote:
> > This code confuses me more than a bit, and I don't follow ?!?
>
> It's very confusing :S
>
> For starters I don't know when r != old_r. I mean now it triggers
> randomly after the RCU-ification, but in the original code when
> it was just a memset(). When would old_r ever not be null and yet
> point to a different entry?

I am also confused about the code when I tried to fix this bug.

As for when `old_r != r`, according to the simplified
code below, this should be probably true if `p->perfect` is true
or `!p->perfect && !pc->h` is true(please correct me if I am wrong)

        struct tcindex_filter_result new_filter_result, *old_r = r;
        struct tcindex_data *cp = NULL, *oldp;
        struct tcf_result cr = {};

        /* tcindex_data attributes must look atomic to classifier/lookup so
         * allocate new tcindex data and RCU assign it onto root. Keeping
         * perfect hash and hash pointers from old data.
         */
        cp = kzalloc(sizeof(*cp), GFP_KERNEL);

        if (p->perfect) {
                if (tcindex_alloc_perfect_hash(net, cp) < 0)
                        goto errout;
                ...
        }
        cp->h = p->h;

        if (!cp->perfect && !cp->h) {
                if (valid_perfect_hash(cp)) {
                        if (tcindex_alloc_perfect_hash(net, cp) < 0)
                                goto errout_alloc;

                } else {
                        struct tcindex_filter __rcu **hash;

                        hash = kcalloc(cp->hash,
                                       sizeof(struct tcindex_filter *),
                                       GFP_KERNEL);

                        if (!hash)
                                goto errout_alloc;

                        cp->h = hash;
                }
        }
        ...

        if (cp->perfect)
                r = cp->perfect + handle;
        else
                r = tcindex_lookup(cp, handle) ? : &new_filter_result;

        if (old_r && old_r != r) {
                err = tcindex_filter_result_init(old_r, cp, net);
                if (err < 0) {
                        kfree(f);
                        goto errout_alloc;
                }
        }

* If `p->perfect` is true, tcindex_alloc_perfect_hash() newly
alloctes cp->perfect.

* If `!p->perfect && !p->h` is true, cp->perfect or cp->h is
newly allocated.

In either case, r probably points to the newly allocated memory,
which should not equals to the old_r.

>
> > it looks like that at this point:
> >
> > * the data path could access 'old_r->exts' contents via 'p' just before
> > the previous 'tcindex_filter_result_init(old_r, cp, net);' but still
> > potentially within the same RCU grace period
> >
> > * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old
> > exts from 'p'  so that will not be freed by later
> > tcindex_partial_destroy_work()
> >
> > Overall it looks to me that we need some somewhat wait for the RCU
> > grace period,
>
> Isn't it better to make @cp a deeper copy of @p ?
> I thought it already is but we don't seem to be cloning p->h.
> Also the cloning of p->perfect looks quite lossy.

Yes, I also think @cp should be a deeper copy of @p.

But it seems that in tcindex_alloc_perfect_hash(),
each @cp ->exts will be initialized by tcf_exts_init()
as below, and tcindex_set_parms() forgets to free the
old ->exts content, triggering this memory leak.(Please
correct me if I am wrong)

        static int tcindex_alloc_perfect_hash(struct net *net,
                                              struct tcindex_data *cp)
        {
        	int i, err = 0;
        
        	cp->perfect = kcalloc(cp->hash, sizeof(struct tcindex_filter_result),
        			      GFP_KERNEL | __GFP_NOWARN);
        
        	for (i = 0; i < cp->hash; i++) {
        		err = tcf_exts_init(&cp->perfect[i].exts, net,
        				    TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
        		if (err < 0)
        			goto errout;
        		cp->perfect[i].p = cp;
        	}
        }

        static inline int tcf_exts_init(struct tcf_exts *exts, struct net *net,
        				int action, int police)
        {
        #ifdef CONFIG_NET_CLS_ACT
        	exts->type = 0;
        	exts->nr_actions = 0;
        	/* Note: we do not own yet a reference on net.
        	 * This reference might be taken later from tcf_exts_get_net().
        	 */
        	exts->net = net;
        	exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *),
        				GFP_KERNEL);
        	if (!exts->actions)
        		return -ENOMEM;
        #endif
        	exts->action = action;
        	exts->police = police;
        	return 0;
        }

>
> > Somewhat side question: it looks like that the 'perfect hashing' usage
> > is the root cause of the issue addressed here, and very likely is
> > afflicted by other problems, e.g. the data curruption in 'err =
> > tcindex_filter_result_init(old_r, cp, net);'.
> >
> > AFAICS 'perfect hashing' usage is a sort of optimization that the user-
> > space may trigger with some combination of the tcindex arguments. I'm
> > wondering if we could drop all perfect hashing related code?
>
> The thought of "how much of this can we delete" did cross my mind :)
  
Hawkins Jiawei Nov. 16, 2022, 1:21 p.m. UTC | #8
On Wed, 16 Nov 2022 at 20:10, Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> On Wed, 16 Nov 2022 at 10:44, Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 15 Nov 2022 19:57:10 +0100 Paolo Abeni wrote:
> > > This code confuses me more than a bit, and I don't follow ?!?
> >
> > It's very confusing :S
> >
> > For starters I don't know when r != old_r. I mean now it triggers
> > randomly after the RCU-ification, but in the original code when
> > it was just a memset(). When would old_r ever not be null and yet
> > point to a different entry?
>
> I am also confused about the code when I tried to fix this bug.
>
> As for when `old_r != r`, according to the simplified
> code below, this should be probably true if `p->perfect` is true
> or `!p->perfect && !pc->h` is true(please correct me if I am wrong)
>
>         struct tcindex_filter_result new_filter_result, *old_r = r;
>         struct tcindex_data *cp = NULL, *oldp;
>         struct tcf_result cr = {};
>
>         /* tcindex_data attributes must look atomic to classifier/lookup so
>          * allocate new tcindex data and RCU assign it onto root. Keeping
>          * perfect hash and hash pointers from old data.
>          */
>         cp = kzalloc(sizeof(*cp), GFP_KERNEL);
>
>         if (p->perfect) {
>                 if (tcindex_alloc_perfect_hash(net, cp) < 0)
>                         goto errout;
>                 ...
>         }
>         cp->h = p->h;
>
>         if (!cp->perfect && !cp->h) {
>                 if (valid_perfect_hash(cp)) {
>                         if (tcindex_alloc_perfect_hash(net, cp) < 0)
>                                 goto errout_alloc;
>
>                 } else {
>                         struct tcindex_filter __rcu **hash;
>
>                         hash = kcalloc(cp->hash,
>                                        sizeof(struct tcindex_filter *),
>                                        GFP_KERNEL);
>
>                         if (!hash)
>                                 goto errout_alloc;
>
>                         cp->h = hash;
>                 }
>         }
>         ...
>
>         if (cp->perfect)
>                 r = cp->perfect + handle;
>         else
>                 r = tcindex_lookup(cp, handle) ? : &new_filter_result;
>
>         if (old_r && old_r != r) {
>                 err = tcindex_filter_result_init(old_r, cp, net);
>                 if (err < 0) {
>                         kfree(f);
>                         goto errout_alloc;
>                 }
>         }
>
> * If `p->perfect` is true, tcindex_alloc_perfect_hash() newly
> alloctes cp->perfect.
>
> * If `!p->perfect && !p->h` is true, cp->perfect or cp->h is
> newly allocated.
>
> In either case, r probably points to the newly allocated memory,
> which should not equals to the old_r.

Sorry for the error. In the second case, `r` is possibly
pointing to the `&new_filter_result`, which is a stack variable
address, and should still not equal to the `old_r`.

>
> >
> > > it looks like that at this point:
> > >
> > > * the data path could access 'old_r->exts' contents via 'p' just before
> > > the previous 'tcindex_filter_result_init(old_r, cp, net);' but still
> > > potentially within the same RCU grace period
> > >
> > > * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old
> > > exts from 'p'  so that will not be freed by later
> > > tcindex_partial_destroy_work()
> > >
> > > Overall it looks to me that we need some somewhat wait for the RCU
> > > grace period,
> >
> > Isn't it better to make @cp a deeper copy of @p ?
> > I thought it already is but we don't seem to be cloning p->h.
> > Also the cloning of p->perfect looks quite lossy.
>
> Yes, I also think @cp should be a deeper copy of @p.
>
> But it seems that in tcindex_alloc_perfect_hash(),
> each @cp ->exts will be initialized by tcf_exts_init()
> as below, and tcindex_set_parms() forgets to free the
> old ->exts content, triggering this memory leak.(Please
> correct me if I am wrong)
>
>         static int tcindex_alloc_perfect_hash(struct net *net,
>                                               struct tcindex_data *cp)
>         {
>                 int i, err = 0;
>
>                 cp->perfect = kcalloc(cp->hash, sizeof(struct tcindex_filter_result),
>                                       GFP_KERNEL | __GFP_NOWARN);
>
>                 for (i = 0; i < cp->hash; i++) {
>                         err = tcf_exts_init(&cp->perfect[i].exts, net,
>                                             TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
>                         if (err < 0)
>                                 goto errout;
>                         cp->perfect[i].p = cp;
>                 }
>         }
>
>         static inline int tcf_exts_init(struct tcf_exts *exts, struct net *net,
>                                         int action, int police)
>         {
>         #ifdef CONFIG_NET_CLS_ACT
>                 exts->type = 0;
>                 exts->nr_actions = 0;
>                 /* Note: we do not own yet a reference on net.
>                  * This reference might be taken later from tcf_exts_get_net().
>                  */
>                 exts->net = net;
>                 exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *),
>                                         GFP_KERNEL);
>                 if (!exts->actions)
>                         return -ENOMEM;
>         #endif
>                 exts->action = action;
>                 exts->police = police;
>                 return 0;
>         }
>
> >
> > > Somewhat side question: it looks like that the 'perfect hashing' usage
> > > is the root cause of the issue addressed here, and very likely is
> > > afflicted by other problems, e.g. the data curruption in 'err =
> > > tcindex_filter_result_init(old_r, cp, net);'.
> > >
> > > AFAICS 'perfect hashing' usage is a sort of optimization that the user-
> > > space may trigger with some combination of the tcindex arguments. I'm
> > > wondering if we could drop all perfect hashing related code?
> >
> > The thought of "how much of this can we delete" did cross my mind :)
  

Patch

diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index 1c9eeb98d826..d2fac9559d3e 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -338,6 +338,7 @@  tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	struct tcf_result cr = {};
 	int err, balloc = 0;
 	struct tcf_exts e;
+	struct tcf_exts old_e = {};
 
 	err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
 	if (err < 0)
@@ -479,6 +480,7 @@  tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	}
 
 	if (old_r && old_r != r) {
+		old_e = old_r->exts;
 		err = tcindex_filter_result_init(old_r, cp, net);
 		if (err < 0) {
 			kfree(f);
@@ -510,6 +512,12 @@  tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 		tcf_exts_destroy(&new_filter_result.exts);
 	}
 
+	/* Note: old_e should be destroyed after the RCU grace period,
+	 * to avoid possible use-after-free by concurrent readers.
+	 */
+	synchronize_rcu();
+	tcf_exts_destroy(&old_e);
+
 	if (oldp)
 		tcf_queue_work(&oldp->rwork, tcindex_partial_destroy_work);
 	return 0;