Message ID | 20230410130826.1492525-1-libaokun1@huawei.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp1902434vqo; Mon, 10 Apr 2023 06:40:26 -0700 (PDT) X-Google-Smtp-Source: AKy350aXhHQ5u3mTeZwlLF/dvr/Gan5tKZHdocI2dsOpenOqycQOK9Szqq45qabhq7vYtGdwiUYa X-Received: by 2002:a17:906:8694:b0:94e:2d:e94b with SMTP id g20-20020a170906869400b0094e002de94bmr3375ejx.63.1681134026595; Mon, 10 Apr 2023 06:40:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1681134026; cv=none; d=google.com; s=arc-20160816; b=WYBdx6Xs/hX+esSgv21cXxBj5+wiG1kRQWS4/idaAqb3nB8PH9+fv1SKltok3hfXMu AtYPvu43w4er/kk98ermO6hD5u/2F5cpY/XPB83i1vqTdivrZDb0kzYQ9GxxIm78S5g5 XzfAroNjtWiVt+dJWYcxNBG3ijEOD5LqG7X675noNbp4TaIrli5gn6EuzeTSYYtAUNko kaV1wb2Rs6xdlrsXP1EISKUt+HPNneZTvq11/mK+F1GFf7eTRps+X5iERbr1iEKciVRi U7H0fdjvE0xPCGfKpPnJUfHlW5Ts0jjoyEdSWlO0yR27hYN34ukO7YliAz4L5At8QyAP de7A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=AtsSyUcQN0DNLCUj6zXxHWOffHK2714ljvzUtrF/j+o=; b=p1/fa30h7imFxGInD8BM1qdf0QCR8Dfoy21AgrIASoi22uOZCoXdHcEmJ1p5aAUHAg GqnsgY87RKVJ8gc4GDCmMMFpI4rOM8qeseiScv9gwlom46ioDFu2GX98JLtfOTQ0bqLu +NhNZ9t9yPSsDy424Pjv+sNdIbM180r4vSoxJmEef+vsu5t+TrPLzD42/XR5fGsVGW7G Ljko31oGymdAg+ZFvUCiH44+pFC1zwP2bOdoGAxGYsn81r4x8RZcvxyLtR7ojyarGLC+ QVHKyaDXvMniFU47G7N6MMjiRqStC+ttrNYAvpnry6MYFRWKEsDjTxGF7znS+uKjNaqB K30Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id vt6-20020a170907a60600b0094a82b4e9d8si2074727ejc.149.2023.04.10.06.40.02; Mon, 10 Apr 2023 06:40:26 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229624AbjDJNJJ (ORCPT <rfc822;yuanzuo1009@gmail.com> + 99 others); Mon, 10 Apr 2023 09:09:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44290 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229595AbjDJNJH (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 10 Apr 2023 09:09:07 -0400 Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 66B8D30DB; Mon, 10 Apr 2023 06:09:06 -0700 (PDT) Received: from dggpeml500021.china.huawei.com (unknown [172.30.72.57]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4Pw8NB6P2RznbW7; Mon, 10 Apr 2023 21:05:30 +0800 (CST) Received: from huawei.com (10.175.127.227) by dggpeml500021.china.huawei.com (7.185.36.21) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.23; Mon, 10 Apr 2023 21:09:02 +0800 From: Baokun Li <libaokun1@huawei.com> To: <linux-fsdevel@vger.kernel.org> CC: <viro@zeniv.linux.org.uk>, <brauner@kernel.org>, <jack@suse.cz>, <tj@kernel.org>, <dennis@kernel.org>, <adilger.kernel@dilger.ca>, <akpm@linux-foundation.org>, <linux-kernel@vger.kernel.org>, <yi.zhang@huawei.com>, <yangerkun@huawei.com>, <houtao1@huawei.com>, <libaokun1@huawei.com>, <stable@vger.kernel.org> Subject: [PATCH v2] writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs Date: Mon, 10 Apr 2023 21:08:26 +0800 Message-ID: <20230410130826.1492525-1-libaokun1@huawei.com> X-Mailer: git-send-email 2.31.1 MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII X-Originating-IP: [10.175.127.227] X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpeml500021.china.huawei.com (7.185.36.21) X-CFilter-Loop: Reflected X-Spam-Status: No, score=-2.3 required=5.0 tests=RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1762796792790947947?= X-GMAIL-MSGID: =?utf-8?q?1762796792790947947?= |
Series |
[v2] writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
|
|
Commit Message
Baokun Li
April 10, 2023, 1:08 p.m. UTC
KASAN report null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0xc0
print_report+0x2ba/0x340
kasan_report+0xc4/0x120
kasan_check_range+0x1b7/0x2e0
__kasan_check_write+0x24/0x40
bdi_split_work_to_wbs+0x5c5/0x7b0
sync_inodes_sb+0x195/0x630
sync_inodes_one_sb+0x3a/0x50
iterate_supers+0x106/0x1b0
ksys_sync+0x98/0x160
[...]
==================================================================
The race that causes the above issue is as follows:
cpu1 cpu2
-------------------------|-------------------------
inode_switch_wbs
INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
queue_rcu_work(isw_wq, &isw->work)
// queue_work async
inode_switch_wbs_work_fn
wb_put_many(old_wb, nr_switched)
percpu_ref_put_many
ref->data->release(ref)
cgwb_release
queue_work(cgwb_release_wq, &wb->release_work)
// queue_work async
&wb->release_work
cgwb_release_workfn
ksys_sync
iterate_supers
sync_inodes_one_sb
sync_inodes_sb
bdi_split_work_to_wbs
kmalloc(sizeof(*work), GFP_ATOMIC)
// alloc memory failed
percpu_ref_exit
ref->data = NULL
kfree(data)
wb_get(wb)
percpu_ref_get(&wb->refcnt)
percpu_ref_get_many(ref, 1)
atomic_long_add(nr, &ref->data->count)
atomic64_add(i, v)
// trigger null-ptr-deref
bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs.
If the allocation of new work fails, the on-stack fallback will be used and
the reference count of the current wb is increased afterwards. If cgroup
writeback membership switches occur before getting the reference count and
the current wb is released as old_wd, then calling wb_get() or wb_put()
will trigger the null pointer dereference above.
This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb()
and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger
this issue. For scenarios called via sync_inodes_sb(), originally commit
7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback
membership switches") reduced the possibility of the issue by adding
wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the
"inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn()
so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped
when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes
easily reproducible again.
To solve this problem, percpu_ref_exit() is called under RCU protection
to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shutdown.
Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones")
Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback")
Cc: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
V1->V2:
Use RCU instead of wb_switch_rwsem to avoid race.
fs/fs-writeback.c | 17 ++++++++++-------
mm/backing-dev.c | 12 ++++++++++--
2 files changed, 20 insertions(+), 9 deletions(-)
Comments
On Mon, 10 Apr 2023 21:08:26 +0800 Baokun Li <libaokun1@huawei.com> wrote: > > ... > > To solve this problem, percpu_ref_exit() is called under RCU protection > to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). > Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), > and skip the current wb if wb_tryget() fails because the wb has already > been shutdown. > > Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") > Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") Two Fixes: is awkward. The Fixes: serves a guide to which kernel versions should be patched, but those two commits are six years apart. So... how far back should this fix be backported? > fs/fs-writeback.c | 17 ++++++++++------- > mm/backing-dev.c | 12 ++++++++++-- Jens, which tree do you think should carry this?
On 2023/4/11 11:53, Andrew Morton wrote: > On Mon, 10 Apr 2023 21:08:26 +0800 Baokun Li <libaokun1@huawei.com> wrote: > >> ... >> >> To solve this problem, percpu_ref_exit() is called under RCU protection >> to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). >> Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), >> and skip the current wb if wb_tryget() fails because the wb has already >> been shutdown. >> >> Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") >> Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") > Two Fixes: is awkward. The Fixes: serves a guide to which kernel > versions should be patched, but those two commits are six years apart. > > So... how far back should this fix be backported? This issue was introduced in v4.3-rc7 by commit b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones"), so anything that has this commit incorporated is problematic. Another fix tag patch invalidates a previously unintentional fix, and then the problem becomes more easily reproducible. This fix tag can actually be removed, and is added here so that people who see the patch will know what happened.
On Mon 10-04-23 21:08:26, Baokun Li wrote: > KASAN report null-ptr-deref: > ================================================================== > BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 > Write of size 8 at addr 0000000000000000 by task sync/943 > CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 > Call Trace: > <TASK> > dump_stack_lvl+0x7f/0xc0 > print_report+0x2ba/0x340 > kasan_report+0xc4/0x120 > kasan_check_range+0x1b7/0x2e0 > __kasan_check_write+0x24/0x40 > bdi_split_work_to_wbs+0x5c5/0x7b0 > sync_inodes_sb+0x195/0x630 > sync_inodes_one_sb+0x3a/0x50 > iterate_supers+0x106/0x1b0 > ksys_sync+0x98/0x160 > [...] > ================================================================== > > The race that causes the above issue is as follows: > > cpu1 cpu2 > -------------------------|------------------------- > inode_switch_wbs > INIT_WORK(&isw->work, inode_switch_wbs_work_fn) > queue_rcu_work(isw_wq, &isw->work) > // queue_work async > inode_switch_wbs_work_fn > wb_put_many(old_wb, nr_switched) > percpu_ref_put_many > ref->data->release(ref) > cgwb_release > queue_work(cgwb_release_wq, &wb->release_work) > // queue_work async > &wb->release_work > cgwb_release_workfn > ksys_sync > iterate_supers > sync_inodes_one_sb > sync_inodes_sb > bdi_split_work_to_wbs > kmalloc(sizeof(*work), GFP_ATOMIC) > // alloc memory failed > percpu_ref_exit > ref->data = NULL > kfree(data) > wb_get(wb) > percpu_ref_get(&wb->refcnt) > percpu_ref_get_many(ref, 1) > atomic_long_add(nr, &ref->data->count) > atomic64_add(i, v) > // trigger null-ptr-deref > > bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. > If the allocation of new work fails, the on-stack fallback will be used and > the reference count of the current wb is increased afterwards. If cgroup > writeback membership switches occur before getting the reference count and > the current wb is released as old_wd, then calling wb_get() or wb_put() > will trigger the null pointer dereference above. > > This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() > and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger > this issue. For scenarios called via sync_inodes_sb(), originally commit > 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback > membership switches") reduced the possibility of the issue by adding > wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the > "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() > so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped > when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes > easily reproducible again. > > To solve this problem, percpu_ref_exit() is called under RCU protection > to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). > Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), > and skip the current wb if wb_tryget() fails because the wb has already > been shutdown. > > Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") > Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") > Cc: stable@vger.kernel.org > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > V1->V2: > Use RCU instead of wb_switch_rwsem to avoid race. The cgwb shutdown code is really messy. But your change looks good to me and I don't see an easier way around this race. Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > > fs/fs-writeback.c | 17 ++++++++++------- > mm/backing-dev.c | 12 ++++++++++-- > 2 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 195dc23e0d83..1db3e3c24b43 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -978,6 +978,16 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > continue; > } > > + /* > + * If wb_tryget fails, the wb has been shutdown, skip it. > + * > + * Pin @wb so that it stays on @bdi->wb_list. This allows > + * continuing iteration from @wb after dropping and > + * regrabbing rcu read lock. > + */ > + if (!wb_tryget(wb)) > + continue; > + > /* alloc failed, execute synchronously using on-stack fallback */ > work = &fallback_work; > *work = *base_work; > @@ -986,13 +996,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > work->done = &fallback_work_done; > > wb_queue_work(wb, work); > - > - /* > - * Pin @wb so that it stays on @bdi->wb_list. This allows > - * continuing iteration from @wb after dropping and > - * regrabbing rcu read lock. > - */ > - wb_get(wb); > last_wb = wb; > > rcu_read_unlock(); > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index ad011308cebe..43b48750b491 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -507,6 +507,15 @@ static LIST_HEAD(offline_cgwbs); > static void cleanup_offline_cgwbs_workfn(struct work_struct *work); > static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn); > > +static void cgwb_free_rcu(struct rcu_head *rcu_head) > +{ > + struct bdi_writeback *wb = container_of(rcu_head, > + struct bdi_writeback, rcu); > + > + percpu_ref_exit(&wb->refcnt); > + kfree(wb); > +} > + > static void cgwb_release_workfn(struct work_struct *work) > { > struct bdi_writeback *wb = container_of(work, struct bdi_writeback, > @@ -529,11 +538,10 @@ static void cgwb_release_workfn(struct work_struct *work) > list_del(&wb->offline_node); > spin_unlock_irq(&cgwb_lock); > > - percpu_ref_exit(&wb->refcnt); > wb_exit(wb); > bdi_put(bdi); > WARN_ON_ONCE(!list_empty(&wb->b_attached)); > - kfree_rcu(wb, rcu); > + call_rcu(&wb->rcu, cgwb_free_rcu); > } > > static void cgwb_release(struct percpu_ref *refcnt) > -- > 2.31.1 >
On Mon, Apr 10, 2023 at 09:08:26PM +0800, Baokun Li wrote: > KASAN report null-ptr-deref: > ================================================================== > BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 > Write of size 8 at addr 0000000000000000 by task sync/943 > CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 > Call Trace: > <TASK> > dump_stack_lvl+0x7f/0xc0 > print_report+0x2ba/0x340 > kasan_report+0xc4/0x120 > kasan_check_range+0x1b7/0x2e0 > __kasan_check_write+0x24/0x40 > bdi_split_work_to_wbs+0x5c5/0x7b0 > sync_inodes_sb+0x195/0x630 > sync_inodes_one_sb+0x3a/0x50 > iterate_supers+0x106/0x1b0 > ksys_sync+0x98/0x160 > [...] > ================================================================== > > The race that causes the above issue is as follows: > > cpu1 cpu2 > -------------------------|------------------------- > inode_switch_wbs > INIT_WORK(&isw->work, inode_switch_wbs_work_fn) > queue_rcu_work(isw_wq, &isw->work) > // queue_work async > inode_switch_wbs_work_fn > wb_put_many(old_wb, nr_switched) > percpu_ref_put_many > ref->data->release(ref) > cgwb_release > queue_work(cgwb_release_wq, &wb->release_work) > // queue_work async > &wb->release_work > cgwb_release_workfn > ksys_sync > iterate_supers > sync_inodes_one_sb > sync_inodes_sb > bdi_split_work_to_wbs > kmalloc(sizeof(*work), GFP_ATOMIC) > // alloc memory failed > percpu_ref_exit > ref->data = NULL > kfree(data) > wb_get(wb) > percpu_ref_get(&wb->refcnt) > percpu_ref_get_many(ref, 1) > atomic_long_add(nr, &ref->data->count) > atomic64_add(i, v) > // trigger null-ptr-deref > > bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. > If the allocation of new work fails, the on-stack fallback will be used and > the reference count of the current wb is increased afterwards. If cgroup > writeback membership switches occur before getting the reference count and > the current wb is released as old_wd, then calling wb_get() or wb_put() > will trigger the null pointer dereference above. > > This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() > and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger > this issue. For scenarios called via sync_inodes_sb(), originally commit > 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback > membership switches") reduced the possibility of the issue by adding > wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the > "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() > so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped > when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes > easily reproducible again. > > To solve this problem, percpu_ref_exit() is called under RCU protection > to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). > Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), > and skip the current wb if wb_tryget() fails because the wb has already > been shutdown. > > Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") > Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") > Cc: stable@vger.kernel.org > Signed-off-by: Baokun Li <libaokun1@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Thanks.
On Mon, 10 Apr 2023 21:08:26 +0800 Baokun Li <libaokun1@huawei.com> wrote: > KASAN report null-ptr-deref: > ================================================================== > BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 > Write of size 8 at addr 0000000000000000 by task sync/943 > CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 > Call Trace: > <TASK> > dump_stack_lvl+0x7f/0xc0 > print_report+0x2ba/0x340 > kasan_report+0xc4/0x120 > kasan_check_range+0x1b7/0x2e0 > __kasan_check_write+0x24/0x40 > bdi_split_work_to_wbs+0x5c5/0x7b0 > sync_inodes_sb+0x195/0x630 > sync_inodes_one_sb+0x3a/0x50 > iterate_supers+0x106/0x1b0 > ksys_sync+0x98/0x160 > [...] > ================================================================== > > The race that causes the above issue is as follows: > > cpu1 cpu2 > -------------------------|------------------------- > inode_switch_wbs > INIT_WORK(&isw->work, inode_switch_wbs_work_fn) > queue_rcu_work(isw_wq, &isw->work) > // queue_work async > inode_switch_wbs_work_fn > wb_put_many(old_wb, nr_switched) > percpu_ref_put_many > ref->data->release(ref) > cgwb_release > queue_work(cgwb_release_wq, &wb->release_work) > // queue_work async > &wb->release_work > cgwb_release_workfn > ksys_sync > iterate_supers > sync_inodes_one_sb > sync_inodes_sb > bdi_split_work_to_wbs > kmalloc(sizeof(*work), GFP_ATOMIC) > // alloc memory failed > percpu_ref_exit > ref->data = NULL > kfree(data) > wb_get(wb) > percpu_ref_get(&wb->refcnt) > percpu_ref_get_many(ref, 1) > atomic_long_add(nr, &ref->data->count) > atomic64_add(i, v) > // trigger null-ptr-deref > > bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. > If the allocation of new work fails, the on-stack fallback will be used and > the reference count of the current wb is increased afterwards. If cgroup > writeback membership switches occur before getting the reference count and > the current wb is released as old_wd, then calling wb_get() or wb_put() > will trigger the null pointer dereference above. > > This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() > and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger > this issue. For scenarios called via sync_inodes_sb(), originally commit > 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback > membership switches") reduced the possibility of the issue by adding > wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the > "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() > so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped > when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes > easily reproducible again. > > To solve this problem, percpu_ref_exit() is called under RCU protection > to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). > Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), > and skip the current wb if wb_tryget() fails because the wb has already > been shutdown. > > Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") > Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") Cc Roman for this second commit. > Cc: stable@vger.kernel.org Having two Fixes: is awkward. These serve as a guide to tell -stable maintainers which kernels need the fix. Can we be more precise? > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -978,6 +978,16 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > continue; > } > > + /* > + * If wb_tryget fails, the wb has been shutdown, skip it. > + * > + * Pin @wb so that it stays on @bdi->wb_list. This allows > + * continuing iteration from @wb after dropping and > + * regrabbing rcu read lock. > + */ > + if (!wb_tryget(wb)) > + continue; > + > /* alloc failed, execute synchronously using on-stack fallback */ > work = &fallback_work; > *work = *base_work; > @@ -986,13 +996,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > work->done = &fallback_work_done; > > wb_queue_work(wb, work); > - > - /* > - * Pin @wb so that it stays on @bdi->wb_list. This allows > - * continuing iteration from @wb after dropping and > - * regrabbing rcu read lock. > - */ > - wb_get(wb); > last_wb = wb; > > rcu_read_unlock(); > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index ad011308cebe..43b48750b491 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -507,6 +507,15 @@ static LIST_HEAD(offline_cgwbs); > static void cleanup_offline_cgwbs_workfn(struct work_struct *work); > static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn); > > +static void cgwb_free_rcu(struct rcu_head *rcu_head) > +{ > + struct bdi_writeback *wb = container_of(rcu_head, > + struct bdi_writeback, rcu); nit: struct bdi_writeback *wb; wb = container_of(rcu_head, struct bdi_writeback, rcu); looks nicer, no? > + percpu_ref_exit(&wb->refcnt); > + kfree(wb); > +} > + > static void cgwb_release_workfn(struct work_struct *work) > { > struct bdi_writeback *wb = container_of(work, struct bdi_writeback, > @@ -529,11 +538,10 @@ static void cgwb_release_workfn(struct work_struct *work) > list_del(&wb->offline_node); > spin_unlock_irq(&cgwb_lock); > > - percpu_ref_exit(&wb->refcnt); > wb_exit(wb); > bdi_put(bdi); > WARN_ON_ONCE(!list_empty(&wb->b_attached)); > - kfree_rcu(wb, rcu); > + call_rcu(&wb->rcu, cgwb_free_rcu); > } > > static void cgwb_release(struct percpu_ref *refcnt)
On Tue, May 02, 2023 at 05:17:01PM -0700, Andrew Morton wrote: > On Mon, 10 Apr 2023 21:08:26 +0800 Baokun Li <libaokun1@huawei.com> wrote: > > > KASAN report null-ptr-deref: > > ================================================================== > > BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 > > Write of size 8 at addr 0000000000000000 by task sync/943 > > CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 > > Call Trace: > > <TASK> > > dump_stack_lvl+0x7f/0xc0 > > print_report+0x2ba/0x340 > > kasan_report+0xc4/0x120 > > kasan_check_range+0x1b7/0x2e0 > > __kasan_check_write+0x24/0x40 > > bdi_split_work_to_wbs+0x5c5/0x7b0 > > sync_inodes_sb+0x195/0x630 > > sync_inodes_one_sb+0x3a/0x50 > > iterate_supers+0x106/0x1b0 > > ksys_sync+0x98/0x160 > > [...] > > ================================================================== > > > > The race that causes the above issue is as follows: > > > > cpu1 cpu2 > > -------------------------|------------------------- > > inode_switch_wbs > > INIT_WORK(&isw->work, inode_switch_wbs_work_fn) > > queue_rcu_work(isw_wq, &isw->work) > > // queue_work async > > inode_switch_wbs_work_fn > > wb_put_many(old_wb, nr_switched) > > percpu_ref_put_many > > ref->data->release(ref) > > cgwb_release > > queue_work(cgwb_release_wq, &wb->release_work) > > // queue_work async > > &wb->release_work > > cgwb_release_workfn > > ksys_sync > > iterate_supers > > sync_inodes_one_sb > > sync_inodes_sb > > bdi_split_work_to_wbs > > kmalloc(sizeof(*work), GFP_ATOMIC) > > // alloc memory failed > > percpu_ref_exit > > ref->data = NULL > > kfree(data) > > wb_get(wb) > > percpu_ref_get(&wb->refcnt) > > percpu_ref_get_many(ref, 1) > > atomic_long_add(nr, &ref->data->count) > > atomic64_add(i, v) > > // trigger null-ptr-deref > > > > bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. > > If the allocation of new work fails, the on-stack fallback will be used and > > the reference count of the current wb is increased afterwards. If cgroup > > writeback membership switches occur before getting the reference count and > > the current wb is released as old_wd, then calling wb_get() or wb_put() > > will trigger the null pointer dereference above. > > > > This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() > > and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger > > this issue. For scenarios called via sync_inodes_sb(), originally commit > > 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback > > membership switches") reduced the possibility of the issue by adding > > wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the > > "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() > > so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped > > when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes > > easily reproducible again. > > > > To solve this problem, percpu_ref_exit() is called under RCU protection > > to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). > > Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), > > and skip the current wb if wb_tryget() fails because the wb has already > > been shutdown. > > > > Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") > > Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") > > Cc Roman for this second commit. Thanks for the heads up! The patch looks good to me. Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Thanks!
On 2023/5/3 8:17, Andrew Morton wrote: > On Mon, 10 Apr 2023 21:08:26 +0800 Baokun Li <libaokun1@huawei.com> wrote: > >> KASAN report null-ptr-deref: >> ================================================================== >> BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 >> Write of size 8 at addr 0000000000000000 by task sync/943 >> CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 >> Call Trace: >> <TASK> >> dump_stack_lvl+0x7f/0xc0 >> print_report+0x2ba/0x340 >> kasan_report+0xc4/0x120 >> kasan_check_range+0x1b7/0x2e0 >> __kasan_check_write+0x24/0x40 >> bdi_split_work_to_wbs+0x5c5/0x7b0 >> sync_inodes_sb+0x195/0x630 >> sync_inodes_one_sb+0x3a/0x50 >> iterate_supers+0x106/0x1b0 >> ksys_sync+0x98/0x160 >> [...] >> ================================================================== >> >> The race that causes the above issue is as follows: >> >> cpu1 cpu2 >> -------------------------|------------------------- >> inode_switch_wbs >> INIT_WORK(&isw->work, inode_switch_wbs_work_fn) >> queue_rcu_work(isw_wq, &isw->work) >> // queue_work async >> inode_switch_wbs_work_fn >> wb_put_many(old_wb, nr_switched) >> percpu_ref_put_many >> ref->data->release(ref) >> cgwb_release >> queue_work(cgwb_release_wq, &wb->release_work) >> // queue_work async >> &wb->release_work >> cgwb_release_workfn >> ksys_sync >> iterate_supers >> sync_inodes_one_sb >> sync_inodes_sb >> bdi_split_work_to_wbs >> kmalloc(sizeof(*work), GFP_ATOMIC) >> // alloc memory failed >> percpu_ref_exit >> ref->data = NULL >> kfree(data) >> wb_get(wb) >> percpu_ref_get(&wb->refcnt) >> percpu_ref_get_many(ref, 1) >> atomic_long_add(nr, &ref->data->count) >> atomic64_add(i, v) >> // trigger null-ptr-deref >> >> bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. >> If the allocation of new work fails, the on-stack fallback will be used and >> the reference count of the current wb is increased afterwards. If cgroup >> writeback membership switches occur before getting the reference count and >> the current wb is released as old_wd, then calling wb_get() or wb_put() >> will trigger the null pointer dereference above. >> >> This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() >> and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger >> this issue. For scenarios called via sync_inodes_sb(), originally commit >> 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback >> membership switches") reduced the possibility of the issue by adding >> wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the >> "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() >> so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped >> when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes >> easily reproducible again. >> >> To solve this problem, percpu_ref_exit() is called under RCU protection >> to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). >> Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), >> and skip the current wb if wb_tryget() fails because the wb has already >> been shutdown. >> >> Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") >> Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback") > Cc Roman for this second commit. Thanks! I forgot to cc to Roman! > >> Cc: stable@vger.kernel.org > Having two Fixes: is awkward. These serve as a guide to tell -stable > maintainers which kernels need the fix. Can we be more precise? Yes! I'm very sorry about that. As I replied earlier, this issue was introduced in v4.3-rc7. >> --- a/fs/fs-writeback.c >> +++ b/fs/fs-writeback.c >> @@ -978,6 +978,16 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, >> continue; >> } >> >> + /* >> + * If wb_tryget fails, the wb has been shutdown, skip it. >> + * >> + * Pin @wb so that it stays on @bdi->wb_list. This allows >> + * continuing iteration from @wb after dropping and >> + * regrabbing rcu read lock. >> + */ >> + if (!wb_tryget(wb)) >> + continue; >> + >> /* alloc failed, execute synchronously using on-stack fallback */ >> work = &fallback_work; >> *work = *base_work; >> @@ -986,13 +996,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, >> work->done = &fallback_work_done; >> >> wb_queue_work(wb, work); >> - >> - /* >> - * Pin @wb so that it stays on @bdi->wb_list. This allows >> - * continuing iteration from @wb after dropping and >> - * regrabbing rcu read lock. >> - */ >> - wb_get(wb); >> last_wb = wb; >> >> rcu_read_unlock(); >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c >> index ad011308cebe..43b48750b491 100644 >> --- a/mm/backing-dev.c >> +++ b/mm/backing-dev.c >> @@ -507,6 +507,15 @@ static LIST_HEAD(offline_cgwbs); >> static void cleanup_offline_cgwbs_workfn(struct work_struct *work); >> static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn); >> >> +static void cgwb_free_rcu(struct rcu_head *rcu_head) >> +{ >> + struct bdi_writeback *wb = container_of(rcu_head, >> + struct bdi_writeback, rcu); > nit: > > struct bdi_writeback *wb; > > wb = container_of(rcu_head, struct bdi_writeback, rcu); > > looks nicer, no? Yes, it looks neater. > >> + percpu_ref_exit(&wb->refcnt); >> + kfree(wb); >> +} >> + >> static void cgwb_release_workfn(struct work_struct *work) >> { >> struct bdi_writeback *wb = container_of(work, struct bdi_writeback, >> @@ -529,11 +538,10 @@ static void cgwb_release_workfn(struct work_struct *work) >> list_del(&wb->offline_node); >> spin_unlock_irq(&cgwb_lock); >> >> - percpu_ref_exit(&wb->refcnt); >> wb_exit(wb); >> bdi_put(bdi); >> WARN_ON_ONCE(!list_empty(&wb->b_attached)); >> - kfree_rcu(wb, rcu); >> + call_rcu(&wb->rcu, cgwb_free_rcu); >> } >> >> static void cgwb_release(struct percpu_ref *refcnt) Thanks!
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 195dc23e0d83..1db3e3c24b43 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -978,6 +978,16 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, continue; } + /* + * If wb_tryget fails, the wb has been shutdown, skip it. + * + * Pin @wb so that it stays on @bdi->wb_list. This allows + * continuing iteration from @wb after dropping and + * regrabbing rcu read lock. + */ + if (!wb_tryget(wb)) + continue; + /* alloc failed, execute synchronously using on-stack fallback */ work = &fallback_work; *work = *base_work; @@ -986,13 +996,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, work->done = &fallback_work_done; wb_queue_work(wb, work); - - /* - * Pin @wb so that it stays on @bdi->wb_list. This allows - * continuing iteration from @wb after dropping and - * regrabbing rcu read lock. - */ - wb_get(wb); last_wb = wb; rcu_read_unlock(); diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ad011308cebe..43b48750b491 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -507,6 +507,15 @@ static LIST_HEAD(offline_cgwbs); static void cleanup_offline_cgwbs_workfn(struct work_struct *work); static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn); +static void cgwb_free_rcu(struct rcu_head *rcu_head) +{ + struct bdi_writeback *wb = container_of(rcu_head, + struct bdi_writeback, rcu); + + percpu_ref_exit(&wb->refcnt); + kfree(wb); +} + static void cgwb_release_workfn(struct work_struct *work) { struct bdi_writeback *wb = container_of(work, struct bdi_writeback, @@ -529,11 +538,10 @@ static void cgwb_release_workfn(struct work_struct *work) list_del(&wb->offline_node); spin_unlock_irq(&cgwb_lock); - percpu_ref_exit(&wb->refcnt); wb_exit(wb); bdi_put(bdi); WARN_ON_ONCE(!list_empty(&wb->b_attached)); - kfree_rcu(wb, rcu); + call_rcu(&wb->rcu, cgwb_free_rcu); } static void cgwb_release(struct percpu_ref *refcnt)