Message ID | 20230614070733.113068-1-lujialin4@huawei.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp1050987vqr; Wed, 14 Jun 2023 00:10:16 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ72rkfhL2OjDimVG0U62H7K/pYXajzSw9rEC9VIiBPj7DBH7Oc7h1AZZ7+5IQmEGQCHIOHQ X-Received: by 2002:a17:903:18c:b0:1b0:4527:3716 with SMTP id z12-20020a170903018c00b001b045273716mr12213186plg.15.1686726616107; Wed, 14 Jun 2023 00:10:16 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686726616; cv=none; d=google.com; s=arc-20160816; b=aFbVIWEoN10Fgiz0QfhH+D/S8VxaZPPH63H1pARf2v3uScU8uqpTX8iyAFjap9uE7R 3aSS5LrOZO1lJv8Q2LVLviuU7Yy1bqt3/X0dL2RhL+0ezCOnpOgO9KUVrH0ldswwdXDs Q4mBD0rr/0810oyVnvcY7t4XHf2G80V6EWFTzgg6j/Xl0uR1Ex6gqC8faY1sD21mdNi4 Te5w/RmBbdl45uekHaWgTpo7U260Hzgu5YLOvdPcYZ2ytwLHgbc2ehJ9iezfajeqN3bV TqIh7W+0akXrDRUgLXWW86BTBaC5K7TZsU11ks9ATObw3L010fVqEnkcdk0iKk8asrcT opwg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:message-id:date:subject:cc:to:from; bh=cqSpLHGYaxjGSeqI9ZQ5DogWQnKdIJPsuIu7Y5l/wzQ=; b=ouAKwqjvP0VedMZ1FYexuQDqLiCknExFXE/M0q07Fy4tXL4XTVmkyfoxjfzcKCiNU2 s1eGTHG80pEPC0DKh/AijjllUlxx7/bZeZb/4KVK2gfDG5G6rn4ZONh5ZdSowGoVNv42 SslIXNQ17c79Or4YUFkGHbM6eMPpNK9gCFe9JXyg6LtkOh3yggQqMAyz4n3DabhzX72L fG6yTJOxDwErz3N34Tq8ipZQSfL4UU2zPfa9MaQqaU5zcjDQF3mSpq3PWwrdxZcABg8a SDFLWBOemUs+Gzx7CXDZgu4M4T/h7Avae/zuw7uRviCH/FAhgCoaFWjeBAzFlxZ3a1dJ qEtw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id o18-20020a170902d4d200b001b3de2e77b3si3573478plg.137.2023.06.14.00.10.02; Wed, 14 Jun 2023 00:10:16 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243162AbjFNHJB (ORCPT <rfc822;jesperjuhl76@gmail.com> + 99 others); Wed, 14 Jun 2023 03:09:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34576 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231878AbjFNHI6 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 14 Jun 2023 03:08:58 -0400 Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AC9D219B6; Wed, 14 Jun 2023 00:08:54 -0700 (PDT) Received: from kwepemm600003.china.huawei.com (unknown [172.30.72.53]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4QgxKp3cGmztQbG; Wed, 14 Jun 2023 15:06:22 +0800 (CST) Received: from ubuntu1804.huawei.com (10.67.174.175) by kwepemm600003.china.huawei.com (7.193.23.202) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.23; Wed, 14 Jun 2023 15:08:51 +0800 From: Lu Jialin <lujialin4@huawei.com> To: Lu Jialin <lujialin4@huawei.com>, Johannes Weiner <hannes@cmpxchg.org>, Andrew Morton <akpm@linux-foundation.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Suren Baghdasaryan <surenb@google.com>, Oleg Nesterov <oleg@redhat.com>, Eric Biggers <ebiggers@google.com> CC: <linux-kernel@vger.kernel.org>, <linux-fsdevel@vger.kernel.org> Subject: [PATCH v2] poll: Fix use-after-free in poll_freewait() Date: Wed, 14 Jun 2023 15:07:33 +0800 Message-ID: <20230614070733.113068-1-lujialin4@huawei.com> X-Mailer: git-send-email 2.17.1 MIME-Version: 1.0 Content-Type: text/plain X-Originating-IP: [10.67.174.175] X-ClientProxiedBy: dggems704-chm.china.huawei.com (10.3.19.181) To kwepemm600003.china.huawei.com (7.193.23.202) X-CFilter-Loop: Reflected X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1768661048672826248?= X-GMAIL-MSGID: =?utf-8?q?1768661048672826248?= |
Series |
[v2] poll: Fix use-after-free in poll_freewait()
|
|
Commit Message
Lu Jialin
June 14, 2023, 7:07 a.m. UTC
We found a UAF bug in remove_wait_queue as follows:
==================================================================
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0
Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306
Call Trace:
dump_stack+0x9c/0xd3
print_address_description.constprop.0+0x19/0x170
__kasan_report.cold+0x6c/0x84
kasan_report+0x3a/0x50
check_memory_region+0xfd/0x1f0
_raw_spin_lock_irqsave+0x71/0xe0
remove_wait_queue+0x26/0xc0
poll_freewait+0x6b/0x120
do_sys_poll+0x305/0x400
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Allocated by task 15306:
kasan_save_stack+0x1b/0x40
__kasan_kmalloc.constprop.0+0xb5/0xe0
psi_trigger_create.part.0+0xfc/0x450
cgroup_pressure_write+0xfc/0x3b0
cgroup_file_write+0x1b3/0x390
kernfs_fop_write_iter+0x224/0x2e0
new_sync_write+0x2ac/0x3a0
vfs_write+0x365/0x430
ksys_write+0xd5/0x1b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Freed by task 15850:
kasan_save_stack+0x1b/0x40
kasan_set_track+0x1c/0x30
kasan_set_free_info+0x20/0x40
__kasan_slab_free+0x151/0x180
kfree+0xba/0x680
cgroup_file_release+0x5c/0xe0
kernfs_drain_open_files+0x122/0x1e0
kernfs_drain+0xff/0x1e0
__kernfs_remove.part.0+0x1d1/0x3b0
kernfs_remove_by_name_ns+0x89/0xf0
cgroup_addrm_files+0x393/0x3d0
css_clear_dir+0x8f/0x120
kill_css+0x41/0xd0
cgroup_destroy_locked+0x166/0x300
cgroup_rmdir+0x37/0x140
kernfs_iop_rmdir+0xbb/0xf0
vfs_rmdir.part.0+0xa5/0x230
do_rmdir+0x2e0/0x320
__x64_sys_unlinkat+0x99/0xc0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
==================================================================
If using epoll(), wake_up_pollfree will empty waitqueue and set
wait_queue_head is NULL before free waitqueue of psi trigger. But is
doesn't work when using poll(), which will lead a UAF problem in
poll_freewait coms as following:
(cgroup_rmdir) |
psi_trigger_destroy |
wake_up_pollfree(&t->event_wait) |
synchronize_rcu(); |
kfree(t) |
| (poll_freewait)
| free_poll_entry(pwq->inline_entries + i)
| remove_wait_queue(entry->wait_address)
| spin_lock_irqsave(&wq_head->lock)
entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir().
t->event_wait is free in psi_trigger_destroy before call poll_freewait(),
therefore wq_head in poll_freewait() has been already freed, which would
lead to a UAF.
similar problem for epoll() has been fixed commit c2dbe32d5db5
("sched/psi: Fix use-after-free in ep_remove_wait_queue()").
epoll wakeup function ep_poll_callback() will empty waitqueue and set
wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead
is NULL or not before remove_wait_queue in ep_remove_wait_queue(),
which will fix the UAF bug in ep_remove_wait_queue.
But poll wakeup function pollwake() doesn't do that. To fix the
problem, we empty waitqueue and set wait_address is NULL in pollwake() when
key is POLLFREE. otherwise in remove_wait_queue, which is similar to
epoll().
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
v2: correct commit msg and title suggested by Suren Baghdasaryan
---
fs/select.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
Comments
On Wed, 14 Jun 2023 15:07:33 +0800, Lu Jialin wrote: > We found a UAF bug in remove_wait_queue as follows: > > ================================================================== > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > Call Trace: > dump_stack+0x9c/0xd3 > print_address_description.constprop.0+0x19/0x170 > __kasan_report.cold+0x6c/0x84 > kasan_report+0x3a/0x50 > check_memory_region+0xfd/0x1f0 > _raw_spin_lock_irqsave+0x71/0xe0 > remove_wait_queue+0x26/0xc0 > poll_freewait+0x6b/0x120 > do_sys_poll+0x305/0x400 > do_syscall_64+0x33/0x40 > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > [...] Applied to the vfs.misc branch of the vfs/vfs.git tree. Patches in the vfs.misc branch should appear in linux-next soon. Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it. It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated. tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.misc [1/1] poll: Fix use-after-free in poll_freewait() https://git.kernel.org/vfs/vfs/c/e5f00a6f63bc
On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > We found a UAF bug in remove_wait_queue as follows: > > ================================================================== > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > Call Trace: > dump_stack+0x9c/0xd3 > print_address_description.constprop.0+0x19/0x170 > __kasan_report.cold+0x6c/0x84 > kasan_report+0x3a/0x50 > check_memory_region+0xfd/0x1f0 > _raw_spin_lock_irqsave+0x71/0xe0 > remove_wait_queue+0x26/0xc0 > poll_freewait+0x6b/0x120 > do_sys_poll+0x305/0x400 > do_syscall_64+0x33/0x40 > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > Allocated by task 15306: > kasan_save_stack+0x1b/0x40 > __kasan_kmalloc.constprop.0+0xb5/0xe0 > psi_trigger_create.part.0+0xfc/0x450 > cgroup_pressure_write+0xfc/0x3b0 > cgroup_file_write+0x1b3/0x390 > kernfs_fop_write_iter+0x224/0x2e0 > new_sync_write+0x2ac/0x3a0 > vfs_write+0x365/0x430 > ksys_write+0xd5/0x1b0 > do_syscall_64+0x33/0x40 > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > Freed by task 15850: > kasan_save_stack+0x1b/0x40 > kasan_set_track+0x1c/0x30 > kasan_set_free_info+0x20/0x40 > __kasan_slab_free+0x151/0x180 > kfree+0xba/0x680 > cgroup_file_release+0x5c/0xe0 > kernfs_drain_open_files+0x122/0x1e0 > kernfs_drain+0xff/0x1e0 > __kernfs_remove.part.0+0x1d1/0x3b0 > kernfs_remove_by_name_ns+0x89/0xf0 > cgroup_addrm_files+0x393/0x3d0 > css_clear_dir+0x8f/0x120 > kill_css+0x41/0xd0 > cgroup_destroy_locked+0x166/0x300 > cgroup_rmdir+0x37/0x140 > kernfs_iop_rmdir+0xbb/0xf0 > vfs_rmdir.part.0+0xa5/0x230 > do_rmdir+0x2e0/0x320 > __x64_sys_unlinkat+0x99/0xc0 > do_syscall_64+0x33/0x40 > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > ================================================================== > > If using epoll(), wake_up_pollfree will empty waitqueue and set > wait_queue_head is NULL before free waitqueue of psi trigger. But is > doesn't work when using poll(), which will lead a UAF problem in > poll_freewait coms as following: > > (cgroup_rmdir) | > psi_trigger_destroy | > wake_up_pollfree(&t->event_wait) | > synchronize_rcu(); | > kfree(t) | > | (poll_freewait) > | free_poll_entry(pwq->inline_entries + i) > | remove_wait_queue(entry->wait_address) > | spin_lock_irqsave(&wq_head->lock) > > entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > therefore wq_head in poll_freewait() has been already freed, which would > lead to a UAF. > > similar problem for epoll() has been fixed commit c2dbe32d5db5 > ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > epoll wakeup function ep_poll_callback() will empty waitqueue and set > wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > which will fix the UAF bug in ep_remove_wait_queue. > > But poll wakeup function pollwake() doesn't do that. To fix the > problem, we empty waitqueue and set wait_address is NULL in pollwake() when > key is POLLFREE. otherwise in remove_wait_queue, which is similar to > epoll(). > > Fixes: 0e94682b73bf ("psi: introduce psi monitor") > Suggested-by: Suren Baghdasaryan <surenb@google.com> > Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t > Signed-off-by: Lu Jialin <lujialin4@huawei.com> > --- > v2: correct commit msg and title suggested by Suren Baghdasaryan > --- > fs/select.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/fs/select.c b/fs/select.c > index 0ee55af1a55c..e64c7b4e9959 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > > static void free_poll_entry(struct poll_table_entry *entry) > { > - remove_wait_queue(entry->wait_address, &entry->wait); > + wait_queue_head_t *whead; > + > + rcu_read_lock(); > + /* If it is cleared by POLLFREE, it should be rcu-safe. > + * If we read NULL we need a barrier paired with smp_store_release() > + * in pollwake(). > + */ > + whead = smp_load_acquire(&entry->wait_address); > + if (whead) > + remove_wait_queue(whead, &entry->wait); > + rcu_read_unlock(); > fput(entry->filp); > } > > @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > entry = container_of(wait, struct poll_table_entry, wait); > if (key && !(key_to_poll(key) & entry->key)) > return 0; > + if (key_to_poll(key) & POLLFREE) { > + list_del_init(&wait->entry); > + /* wait_address !=NULL protects us from the race with > + * poll_freewait(). > + */ > + smp_store_release(&entry->wait_address, NULL); > + return 0; > + } > return __pollwake(wait, mode, sync, key); I don't understand why this patch is needed. The last time I looked at POLLFREE, it is only needed because of asynchronous polls. See my explanation in the commit message of commit 50252e4b5e989ce6. In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is tied to the current task rather than to the file being polled. Also refer to the comment above wake_up_pollfree(), which mentions this. fs/select.c is synchronous polling, not asynchronous. Therefore, it should not need to handle POLLFREE. If there's actually a bug here, most likely it's a bug in psi_trigger_poll() where it is using a waitqueue whose lifetime is tied to neither the current task nor the file being polled. That needs to be fixed. - Eric
On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@kernel.org> wrote: > > On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > > We found a UAF bug in remove_wait_queue as follows: > > > > ================================================================== > > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > > Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > > Call Trace: > > dump_stack+0x9c/0xd3 > > print_address_description.constprop.0+0x19/0x170 > > __kasan_report.cold+0x6c/0x84 > > kasan_report+0x3a/0x50 > > check_memory_region+0xfd/0x1f0 > > _raw_spin_lock_irqsave+0x71/0xe0 > > remove_wait_queue+0x26/0xc0 > > poll_freewait+0x6b/0x120 > > do_sys_poll+0x305/0x400 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > > Allocated by task 15306: > > kasan_save_stack+0x1b/0x40 > > __kasan_kmalloc.constprop.0+0xb5/0xe0 > > psi_trigger_create.part.0+0xfc/0x450 > > cgroup_pressure_write+0xfc/0x3b0 > > cgroup_file_write+0x1b3/0x390 > > kernfs_fop_write_iter+0x224/0x2e0 > > new_sync_write+0x2ac/0x3a0 > > vfs_write+0x365/0x430 > > ksys_write+0xd5/0x1b0 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > > Freed by task 15850: > > kasan_save_stack+0x1b/0x40 > > kasan_set_track+0x1c/0x30 > > kasan_set_free_info+0x20/0x40 > > __kasan_slab_free+0x151/0x180 > > kfree+0xba/0x680 > > cgroup_file_release+0x5c/0xe0 > > kernfs_drain_open_files+0x122/0x1e0 > > kernfs_drain+0xff/0x1e0 > > __kernfs_remove.part.0+0x1d1/0x3b0 > > kernfs_remove_by_name_ns+0x89/0xf0 > > cgroup_addrm_files+0x393/0x3d0 > > css_clear_dir+0x8f/0x120 > > kill_css+0x41/0xd0 > > cgroup_destroy_locked+0x166/0x300 > > cgroup_rmdir+0x37/0x140 > > kernfs_iop_rmdir+0xbb/0xf0 > > vfs_rmdir.part.0+0xa5/0x230 > > do_rmdir+0x2e0/0x320 > > __x64_sys_unlinkat+0x99/0xc0 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > ================================================================== > > > > If using epoll(), wake_up_pollfree will empty waitqueue and set > > wait_queue_head is NULL before free waitqueue of psi trigger. But is > > doesn't work when using poll(), which will lead a UAF problem in > > poll_freewait coms as following: > > > > (cgroup_rmdir) | > > psi_trigger_destroy | > > wake_up_pollfree(&t->event_wait) | > > synchronize_rcu(); | > > kfree(t) | > > | (poll_freewait) > > | free_poll_entry(pwq->inline_entries + i) > > | remove_wait_queue(entry->wait_address) > > | spin_lock_irqsave(&wq_head->lock) > > > > entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > > t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > > therefore wq_head in poll_freewait() has been already freed, which would > > lead to a UAF. > > > > similar problem for epoll() has been fixed commit c2dbe32d5db5 > > ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > > epoll wakeup function ep_poll_callback() will empty waitqueue and set > > wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > > is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > > which will fix the UAF bug in ep_remove_wait_queue. > > > > But poll wakeup function pollwake() doesn't do that. To fix the > > problem, we empty waitqueue and set wait_address is NULL in pollwake() when > > key is POLLFREE. otherwise in remove_wait_queue, which is similar to > > epoll(). > > > > Fixes: 0e94682b73bf ("psi: introduce psi monitor") > > Suggested-by: Suren Baghdasaryan <surenb@google.com> > > Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t > > Signed-off-by: Lu Jialin <lujialin4@huawei.com> > > --- > > v2: correct commit msg and title suggested by Suren Baghdasaryan > > --- > > fs/select.c | 20 +++++++++++++++++++- > > 1 file changed, 19 insertions(+), 1 deletion(-) > > > > diff --git a/fs/select.c b/fs/select.c > > index 0ee55af1a55c..e64c7b4e9959 100644 > > --- a/fs/select.c > > +++ b/fs/select.c > > @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > > > > static void free_poll_entry(struct poll_table_entry *entry) > > { > > - remove_wait_queue(entry->wait_address, &entry->wait); > > + wait_queue_head_t *whead; > > + > > + rcu_read_lock(); > > + /* If it is cleared by POLLFREE, it should be rcu-safe. > > + * If we read NULL we need a barrier paired with smp_store_release() > > + * in pollwake(). > > + */ > > + whead = smp_load_acquire(&entry->wait_address); > > + if (whead) > > + remove_wait_queue(whead, &entry->wait); > > + rcu_read_unlock(); > > fput(entry->filp); > > } > > > > @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > > entry = container_of(wait, struct poll_table_entry, wait); > > if (key && !(key_to_poll(key) & entry->key)) > > return 0; > > + if (key_to_poll(key) & POLLFREE) { > > + list_del_init(&wait->entry); > > + /* wait_address !=NULL protects us from the race with > > + * poll_freewait(). > > + */ > > + smp_store_release(&entry->wait_address, NULL); > > + return 0; > > + } > > return __pollwake(wait, mode, sync, key); > > I don't understand why this patch is needed. > > The last time I looked at POLLFREE, it is only needed because of asynchronous > polls. See my explanation in the commit message of commit 50252e4b5e989ce6. Ah, I missed that. Thanks for the correction. > > In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > tied to the current task rather than to the file being polled. Also refer to > the comment above wake_up_pollfree(), which mentions this. > > fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > need to handle POLLFREE. > > If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > where it is using a waitqueue whose lifetime is tied to neither the current task > nor the file being polled. That needs to be fixed. Yeah. We discussed this issue in https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@mail.gmail.com and the root cause is that cgroup_file_release() where psi_trigger_destroy() is called is not tied to the cgroup file's real lifetime (see my analysis here: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t). I guess it's time to do a deeper surgery and figure out a way to call psi_trigger_destroy() when the polled cgroup file is actually being destroyed. I'll take a closer look into this later today. A fix will likely require some cgroup or kernfs code changes, so CC'ing Tejun for visibility. Thanks, Suren. > > - Eric
On Wed, Jun 14, 2023 at 11:19 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@kernel.org> wrote: > > > > On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > > > We found a UAF bug in remove_wait_queue as follows: > > > > > > ================================================================== > > > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > > > Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > > > Call Trace: > > > dump_stack+0x9c/0xd3 > > > print_address_description.constprop.0+0x19/0x170 > > > __kasan_report.cold+0x6c/0x84 > > > kasan_report+0x3a/0x50 > > > check_memory_region+0xfd/0x1f0 > > > _raw_spin_lock_irqsave+0x71/0xe0 > > > remove_wait_queue+0x26/0xc0 > > > poll_freewait+0x6b/0x120 > > > do_sys_poll+0x305/0x400 > > > do_syscall_64+0x33/0x40 > > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > > > > Allocated by task 15306: > > > kasan_save_stack+0x1b/0x40 > > > __kasan_kmalloc.constprop.0+0xb5/0xe0 > > > psi_trigger_create.part.0+0xfc/0x450 > > > cgroup_pressure_write+0xfc/0x3b0 > > > cgroup_file_write+0x1b3/0x390 > > > kernfs_fop_write_iter+0x224/0x2e0 > > > new_sync_write+0x2ac/0x3a0 > > > vfs_write+0x365/0x430 > > > ksys_write+0xd5/0x1b0 > > > do_syscall_64+0x33/0x40 > > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > > > > Freed by task 15850: > > > kasan_save_stack+0x1b/0x40 > > > kasan_set_track+0x1c/0x30 > > > kasan_set_free_info+0x20/0x40 > > > __kasan_slab_free+0x151/0x180 > > > kfree+0xba/0x680 > > > cgroup_file_release+0x5c/0xe0 > > > kernfs_drain_open_files+0x122/0x1e0 > > > kernfs_drain+0xff/0x1e0 > > > __kernfs_remove.part.0+0x1d1/0x3b0 > > > kernfs_remove_by_name_ns+0x89/0xf0 > > > cgroup_addrm_files+0x393/0x3d0 > > > css_clear_dir+0x8f/0x120 > > > kill_css+0x41/0xd0 > > > cgroup_destroy_locked+0x166/0x300 > > > cgroup_rmdir+0x37/0x140 > > > kernfs_iop_rmdir+0xbb/0xf0 > > > vfs_rmdir.part.0+0xa5/0x230 > > > do_rmdir+0x2e0/0x320 > > > __x64_sys_unlinkat+0x99/0xc0 > > > do_syscall_64+0x33/0x40 > > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > ================================================================== > > > > > > If using epoll(), wake_up_pollfree will empty waitqueue and set > > > wait_queue_head is NULL before free waitqueue of psi trigger. But is > > > doesn't work when using poll(), which will lead a UAF problem in > > > poll_freewait coms as following: > > > > > > (cgroup_rmdir) | > > > psi_trigger_destroy | > > > wake_up_pollfree(&t->event_wait) | > > > synchronize_rcu(); | > > > kfree(t) | > > > | (poll_freewait) > > > | free_poll_entry(pwq->inline_entries + i) > > > | remove_wait_queue(entry->wait_address) > > > | spin_lock_irqsave(&wq_head->lock) > > > > > > entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > > > t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > > > therefore wq_head in poll_freewait() has been already freed, which would > > > lead to a UAF. Hi Lu, Could you please share your reproducer along with the kernel config you used? I'm trying to reproduce this UAF but every time I delete the cgroup being polled, poll() simply returns POLLERR. Thanks, Suren. > > > > > > similar problem for epoll() has been fixed commit c2dbe32d5db5 > > > ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > > > epoll wakeup function ep_poll_callback() will empty waitqueue and set > > > wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > > > is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > > > which will fix the UAF bug in ep_remove_wait_queue. > > > > > > But poll wakeup function pollwake() doesn't do that. To fix the > > > problem, we empty waitqueue and set wait_address is NULL in pollwake() when > > > key is POLLFREE. otherwise in remove_wait_queue, which is similar to > > > epoll(). > > > > > > Fixes: 0e94682b73bf ("psi: introduce psi monitor") > > > Suggested-by: Suren Baghdasaryan <surenb@google.com> > > > Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t > > > Signed-off-by: Lu Jialin <lujialin4@huawei.com> > > > --- > > > v2: correct commit msg and title suggested by Suren Baghdasaryan > > > --- > > > fs/select.c | 20 +++++++++++++++++++- > > > 1 file changed, 19 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/select.c b/fs/select.c > > > index 0ee55af1a55c..e64c7b4e9959 100644 > > > --- a/fs/select.c > > > +++ b/fs/select.c > > > @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > > > > > > static void free_poll_entry(struct poll_table_entry *entry) > > > { > > > - remove_wait_queue(entry->wait_address, &entry->wait); > > > + wait_queue_head_t *whead; > > > + > > > + rcu_read_lock(); > > > + /* If it is cleared by POLLFREE, it should be rcu-safe. > > > + * If we read NULL we need a barrier paired with smp_store_release() > > > + * in pollwake(). > > > + */ > > > + whead = smp_load_acquire(&entry->wait_address); > > > + if (whead) > > > + remove_wait_queue(whead, &entry->wait); > > > + rcu_read_unlock(); > > > fput(entry->filp); > > > } > > > > > > @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > > > entry = container_of(wait, struct poll_table_entry, wait); > > > if (key && !(key_to_poll(key) & entry->key)) > > > return 0; > > > + if (key_to_poll(key) & POLLFREE) { > > > + list_del_init(&wait->entry); > > > + /* wait_address !=NULL protects us from the race with > > > + * poll_freewait(). > > > + */ > > > + smp_store_release(&entry->wait_address, NULL); > > > + return 0; > > > + } > > > return __pollwake(wait, mode, sync, key); > > > > I don't understand why this patch is needed. > > > > The last time I looked at POLLFREE, it is only needed because of asynchronous > > polls. See my explanation in the commit message of commit 50252e4b5e989ce6. > > Ah, I missed that. Thanks for the correction. > > > > > In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > > tied to the current task rather than to the file being polled. Also refer to > > the comment above wake_up_pollfree(), which mentions this. > > > > fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > > need to handle POLLFREE. > > > > If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > > where it is using a waitqueue whose lifetime is tied to neither the current task > > nor the file being polled. That needs to be fixed. > > Yeah. We discussed this issue in > https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@mail.gmail.com > and the root cause is that cgroup_file_release() where > psi_trigger_destroy() is called is not tied to the cgroup file's real > lifetime (see my analysis here: > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t). > I guess it's time to do a deeper surgery and figure out a way to call > psi_trigger_destroy() when the polled cgroup file is actually being > destroyed. I'll take a closer look into this later today. > A fix will likely require some cgroup or kernfs code changes, so > CC'ing Tejun for visibility. > Thanks, > Suren. > > > > > - Eric
Hi Suren: kernel config: x86_64_defconfig CONFIG_PSI=y CONFIG_SLUB_DEBUG=y CONFIG_SLUB_DEBUG_ON=y CONFIG_KASAN=y CONFIG_KASAN_INLINE=y I make some change in code, in order to increase the recurrence probability. diff --git a/fs/select.c b/fs/select.c index 5edffee1162c..5ee5b74a8386 100644 --- a/fs/select.c +++ b/fs/select.c @@ -139,6 +139,7 @@ void poll_freewait(struct poll_wqueues *pwq) { struct poll_table_page * p = pwq->table; int i; + mdelay(50); for (i = 0; i < pwq->inline_index; i++) free_poll_entry(pwq->inline_entries + i); while (p) { Here is the simple repo test.sh: #!/bin/bash RESOURCE_TYPES=("cpu" "memory" "io" "irq") #RESOURCE_TYPES=("cpu") cgroup_num=50 test_dir=/sys/fs/cgroup/test function restart_cgroup() { num=$(expr $RANDOM % $cgroup_num + 1) rmdir $test_dir/test_$num mkdir $test_dir/test_$num } function create_triggers() { num=$(expr $RANDOM % $cgroup_num + 1) random=$(expr $RANDOM % "${#RESOURCE_TYPES[@]}") psi_type="${RESOURCE_TYPES[${random}]}" ./psi_monitor $test_dir/test_$num $psi_type & } mkdir $test_dir for i in $(seq 1 $cgroup_num) do mkdir $test_dir/test_$i done for j in $(seq 1 100) do restart_cgroup & create_triggers & done psi_monitor.c: #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <poll.h> #include <string.h> #include <unistd.h> int main(int argc, char *argv[]) { const char trig[] = "full 1000000 1000000"; struct pollfd fds; char filename[100]; sprintf(filename, "%s/%s.pressure", argv[1], argv[2]); fds.fd = open(filename, O_RDWR | O_NONBLOCK); if (fds.fd < 0) { printf("%s open error: %s\n", filename,strerror(errno)); return 1; } fds.events = POLLPRI; if (write(fds.fd, trig, strlen(trig) + 1) < 0) { printf("%s write error: %s\n",filename,strerror(errno)); return 1; } while (1) { poll(&fds, 1, -1); } close(fds.fd); return 0; } Thanks, Lu 在 2023/6/16 7:13, Suren Baghdasaryan 写道: > On Wed, Jun 14, 2023 at 11:19 AM Suren Baghdasaryan <surenb@google.com> wrote: >> >> On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@kernel.org> wrote: >>> >>> On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: >>>> We found a UAF bug in remove_wait_queue as follows: >>>> >>>> ================================================================== >>>> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 >>>> Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 >>>> Call Trace: >>>> dump_stack+0x9c/0xd3 >>>> print_address_description.constprop.0+0x19/0x170 >>>> __kasan_report.cold+0x6c/0x84 >>>> kasan_report+0x3a/0x50 >>>> check_memory_region+0xfd/0x1f0 >>>> _raw_spin_lock_irqsave+0x71/0xe0 >>>> remove_wait_queue+0x26/0xc0 >>>> poll_freewait+0x6b/0x120 >>>> do_sys_poll+0x305/0x400 >>>> do_syscall_64+0x33/0x40 >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 >>>> >>>> Allocated by task 15306: >>>> kasan_save_stack+0x1b/0x40 >>>> __kasan_kmalloc.constprop.0+0xb5/0xe0 >>>> psi_trigger_create.part.0+0xfc/0x450 >>>> cgroup_pressure_write+0xfc/0x3b0 >>>> cgroup_file_write+0x1b3/0x390 >>>> kernfs_fop_write_iter+0x224/0x2e0 >>>> new_sync_write+0x2ac/0x3a0 >>>> vfs_write+0x365/0x430 >>>> ksys_write+0xd5/0x1b0 >>>> do_syscall_64+0x33/0x40 >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 >>>> >>>> Freed by task 15850: >>>> kasan_save_stack+0x1b/0x40 >>>> kasan_set_track+0x1c/0x30 >>>> kasan_set_free_info+0x20/0x40 >>>> __kasan_slab_free+0x151/0x180 >>>> kfree+0xba/0x680 >>>> cgroup_file_release+0x5c/0xe0 >>>> kernfs_drain_open_files+0x122/0x1e0 >>>> kernfs_drain+0xff/0x1e0 >>>> __kernfs_remove.part.0+0x1d1/0x3b0 >>>> kernfs_remove_by_name_ns+0x89/0xf0 >>>> cgroup_addrm_files+0x393/0x3d0 >>>> css_clear_dir+0x8f/0x120 >>>> kill_css+0x41/0xd0 >>>> cgroup_destroy_locked+0x166/0x300 >>>> cgroup_rmdir+0x37/0x140 >>>> kernfs_iop_rmdir+0xbb/0xf0 >>>> vfs_rmdir.part.0+0xa5/0x230 >>>> do_rmdir+0x2e0/0x320 >>>> __x64_sys_unlinkat+0x99/0xc0 >>>> do_syscall_64+0x33/0x40 >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 >>>> ================================================================== >>>> >>>> If using epoll(), wake_up_pollfree will empty waitqueue and set >>>> wait_queue_head is NULL before free waitqueue of psi trigger. But is >>>> doesn't work when using poll(), which will lead a UAF problem in >>>> poll_freewait coms as following: >>>> >>>> (cgroup_rmdir) | >>>> psi_trigger_destroy | >>>> wake_up_pollfree(&t->event_wait) | >>>> synchronize_rcu(); | >>>> kfree(t) | >>>> | (poll_freewait) >>>> | free_poll_entry(pwq->inline_entries + i) >>>> | remove_wait_queue(entry->wait_address) >>>> | spin_lock_irqsave(&wq_head->lock) >>>> >>>> entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). >>>> t->event_wait is free in psi_trigger_destroy before call poll_freewait(), >>>> therefore wq_head in poll_freewait() has been already freed, which would >>>> lead to a UAF. > > Hi Lu, > Could you please share your reproducer along with the kernel config > you used? I'm trying to reproduce this UAF but every time I delete the > cgroup being polled, poll() simply returns POLLERR. > Thanks, > Suren. > >>>> >>>> similar problem for epoll() has been fixed commit c2dbe32d5db5 >>>> ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). >>>> epoll wakeup function ep_poll_callback() will empty waitqueue and set >>>> wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead >>>> is NULL or not before remove_wait_queue in ep_remove_wait_queue(), >>>> which will fix the UAF bug in ep_remove_wait_queue. >>>> >>>> But poll wakeup function pollwake() doesn't do that. To fix the >>>> problem, we empty waitqueue and set wait_address is NULL in pollwake() when >>>> key is POLLFREE. otherwise in remove_wait_queue, which is similar to >>>> epoll(). >>>> >>>> Fixes: 0e94682b73bf ("psi: introduce psi monitor") >>>> Suggested-by: Suren Baghdasaryan <surenb@google.com> >>>> Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t >>>> Signed-off-by: Lu Jialin <lujialin4@huawei.com> >>>> --- >>>> v2: correct commit msg and title suggested by Suren Baghdasaryan >>>> --- >>>> fs/select.c | 20 +++++++++++++++++++- >>>> 1 file changed, 19 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/select.c b/fs/select.c >>>> index 0ee55af1a55c..e64c7b4e9959 100644 >>>> --- a/fs/select.c >>>> +++ b/fs/select.c >>>> @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); >>>> >>>> static void free_poll_entry(struct poll_table_entry *entry) >>>> { >>>> - remove_wait_queue(entry->wait_address, &entry->wait); >>>> + wait_queue_head_t *whead; >>>> + >>>> + rcu_read_lock(); >>>> + /* If it is cleared by POLLFREE, it should be rcu-safe. >>>> + * If we read NULL we need a barrier paired with smp_store_release() >>>> + * in pollwake(). >>>> + */ >>>> + whead = smp_load_acquire(&entry->wait_address); >>>> + if (whead) >>>> + remove_wait_queue(whead, &entry->wait); >>>> + rcu_read_unlock(); >>>> fput(entry->filp); >>>> } >>>> >>>> @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key >>>> entry = container_of(wait, struct poll_table_entry, wait); >>>> if (key && !(key_to_poll(key) & entry->key)) >>>> return 0; >>>> + if (key_to_poll(key) & POLLFREE) { >>>> + list_del_init(&wait->entry); >>>> + /* wait_address !=NULL protects us from the race with >>>> + * poll_freewait(). >>>> + */ >>>> + smp_store_release(&entry->wait_address, NULL); >>>> + return 0; >>>> + } >>>> return __pollwake(wait, mode, sync, key); >>> >>> I don't understand why this patch is needed. >>> >>> The last time I looked at POLLFREE, it is only needed because of asynchronous >>> polls. See my explanation in the commit message of commit 50252e4b5e989ce6. >> >> Ah, I missed that. Thanks for the correction. >> >>> >>> In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is >>> tied to the current task rather than to the file being polled. Also refer to >>> the comment above wake_up_pollfree(), which mentions this. >>> >>> fs/select.c is synchronous polling, not asynchronous. Therefore, it should not >>> need to handle POLLFREE. >>> >>> If there's actually a bug here, most likely it's a bug in psi_trigger_poll() >>> where it is using a waitqueue whose lifetime is tied to neither the current task >>> nor the file being polled. That needs to be fixed. >> >> Yeah. We discussed this issue in >> https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@mail.gmail.com >> and the root cause is that cgroup_file_release() where >> psi_trigger_destroy() is called is not tied to the cgroup file's real >> lifetime (see my analysis here: >> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t). >> I guess it's time to do a deeper surgery and figure out a way to call >> psi_trigger_destroy() when the polled cgroup file is actually being >> destroyed. I'll take a closer look into this later today. >> A fix will likely require some cgroup or kernfs code changes, so >> CC'ing Tejun for visibility. >> Thanks, >> Suren. >> >>> >>> - Eric > . >
On Sun, Jun 18, 2023 at 6:28 AM lujialin (A) <lujialin4@huawei.com> wrote: > > Hi Suren: > > kernel config: > x86_64_defconfig > CONFIG_PSI=y > CONFIG_SLUB_DEBUG=y > CONFIG_SLUB_DEBUG_ON=y > CONFIG_KASAN=y > CONFIG_KASAN_INLINE=y > > I make some change in code, in order to increase the recurrence probability. > diff --git a/fs/select.c b/fs/select.c > index 5edffee1162c..5ee5b74a8386 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -139,6 +139,7 @@ void poll_freewait(struct poll_wqueues *pwq) > { > struct poll_table_page * p = pwq->table; > int i; > + mdelay(50); > for (i = 0; i < pwq->inline_index; i++) > free_poll_entry(pwq->inline_entries + i); > while (p) { > > Here is the simple repo test.sh: > #!/bin/bash > > RESOURCE_TYPES=("cpu" "memory" "io" "irq") > #RESOURCE_TYPES=("cpu") > cgroup_num=50 > test_dir=/sys/fs/cgroup/test > > function restart_cgroup() { > num=$(expr $RANDOM % $cgroup_num + 1) > rmdir $test_dir/test_$num > mkdir $test_dir/test_$num > } > > function create_triggers() { > num=$(expr $RANDOM % $cgroup_num + 1) > random=$(expr $RANDOM % "${#RESOURCE_TYPES[@]}") > psi_type="${RESOURCE_TYPES[${random}]}" > ./psi_monitor $test_dir/test_$num $psi_type & > } > > mkdir $test_dir > for i in $(seq 1 $cgroup_num) > do > mkdir $test_dir/test_$i > done > for j in $(seq 1 100) > do > restart_cgroup & > create_triggers & > done > > psi_monitor.c: > #include <errno.h> > #include <fcntl.h> > #include <stdio.h> > #include <poll.h> > #include <string.h> > #include <unistd.h> > > int main(int argc, char *argv[]) { > const char trig[] = "full 1000000 1000000"; > struct pollfd fds; > char filename[100]; > > sprintf(filename, "%s/%s.pressure", argv[1], argv[2]); > > fds.fd = open(filename, O_RDWR | O_NONBLOCK); > if (fds.fd < 0) { > printf("%s open error: %s\n", filename,strerror(errno)); > return 1; > } > fds.events = POLLPRI; > if (write(fds.fd, trig, strlen(trig) + 1) < 0) { > printf("%s write error: %s\n",filename,strerror(errno)); > return 1; > } > while (1) { > poll(&fds, 1, -1); > } > close(fds.fd); > return 0; > } Thanks a lot! I'll try to get this reproduced and fixed by the end of this week. Suren. > Thanks, > Lu > 在 2023/6/16 7:13, Suren Baghdasaryan 写道: > > On Wed, Jun 14, 2023 at 11:19 AM Suren Baghdasaryan <surenb@google.com> wrote: > >> > >> On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@kernel.org> wrote: > >>> > >>> On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > >>>> We found a UAF bug in remove_wait_queue as follows: > >>>> > >>>> ================================================================== > >>>> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > >>>> Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > >>>> Call Trace: > >>>> dump_stack+0x9c/0xd3 > >>>> print_address_description.constprop.0+0x19/0x170 > >>>> __kasan_report.cold+0x6c/0x84 > >>>> kasan_report+0x3a/0x50 > >>>> check_memory_region+0xfd/0x1f0 > >>>> _raw_spin_lock_irqsave+0x71/0xe0 > >>>> remove_wait_queue+0x26/0xc0 > >>>> poll_freewait+0x6b/0x120 > >>>> do_sys_poll+0x305/0x400 > >>>> do_syscall_64+0x33/0x40 > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >>>> > >>>> Allocated by task 15306: > >>>> kasan_save_stack+0x1b/0x40 > >>>> __kasan_kmalloc.constprop.0+0xb5/0xe0 > >>>> psi_trigger_create.part.0+0xfc/0x450 > >>>> cgroup_pressure_write+0xfc/0x3b0 > >>>> cgroup_file_write+0x1b3/0x390 > >>>> kernfs_fop_write_iter+0x224/0x2e0 > >>>> new_sync_write+0x2ac/0x3a0 > >>>> vfs_write+0x365/0x430 > >>>> ksys_write+0xd5/0x1b0 > >>>> do_syscall_64+0x33/0x40 > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >>>> > >>>> Freed by task 15850: > >>>> kasan_save_stack+0x1b/0x40 > >>>> kasan_set_track+0x1c/0x30 > >>>> kasan_set_free_info+0x20/0x40 > >>>> __kasan_slab_free+0x151/0x180 > >>>> kfree+0xba/0x680 > >>>> cgroup_file_release+0x5c/0xe0 > >>>> kernfs_drain_open_files+0x122/0x1e0 > >>>> kernfs_drain+0xff/0x1e0 > >>>> __kernfs_remove.part.0+0x1d1/0x3b0 > >>>> kernfs_remove_by_name_ns+0x89/0xf0 > >>>> cgroup_addrm_files+0x393/0x3d0 > >>>> css_clear_dir+0x8f/0x120 > >>>> kill_css+0x41/0xd0 > >>>> cgroup_destroy_locked+0x166/0x300 > >>>> cgroup_rmdir+0x37/0x140 > >>>> kernfs_iop_rmdir+0xbb/0xf0 > >>>> vfs_rmdir.part.0+0xa5/0x230 > >>>> do_rmdir+0x2e0/0x320 > >>>> __x64_sys_unlinkat+0x99/0xc0 > >>>> do_syscall_64+0x33/0x40 > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >>>> ================================================================== > >>>> > >>>> If using epoll(), wake_up_pollfree will empty waitqueue and set > >>>> wait_queue_head is NULL before free waitqueue of psi trigger. But is > >>>> doesn't work when using poll(), which will lead a UAF problem in > >>>> poll_freewait coms as following: > >>>> > >>>> (cgroup_rmdir) | > >>>> psi_trigger_destroy | > >>>> wake_up_pollfree(&t->event_wait) | > >>>> synchronize_rcu(); | > >>>> kfree(t) | > >>>> | (poll_freewait) > >>>> | free_poll_entry(pwq->inline_entries + i) > >>>> | remove_wait_queue(entry->wait_address) > >>>> | spin_lock_irqsave(&wq_head->lock) > >>>> > >>>> entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > >>>> t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > >>>> therefore wq_head in poll_freewait() has been already freed, which would > >>>> lead to a UAF. > > > > Hi Lu, > > Could you please share your reproducer along with the kernel config > > you used? I'm trying to reproduce this UAF but every time I delete the > > cgroup being polled, poll() simply returns POLLERR. > > Thanks, > > Suren. > > > >>>> > >>>> similar problem for epoll() has been fixed commit c2dbe32d5db5 > >>>> ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > >>>> epoll wakeup function ep_poll_callback() will empty waitqueue and set > >>>> wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > >>>> is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > >>>> which will fix the UAF bug in ep_remove_wait_queue. > >>>> > >>>> But poll wakeup function pollwake() doesn't do that. To fix the > >>>> problem, we empty waitqueue and set wait_address is NULL in pollwake() when > >>>> key is POLLFREE. otherwise in remove_wait_queue, which is similar to > >>>> epoll(). > >>>> > >>>> Fixes: 0e94682b73bf ("psi: introduce psi monitor") > >>>> Suggested-by: Suren Baghdasaryan <surenb@google.com> > >>>> Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t > >>>> Signed-off-by: Lu Jialin <lujialin4@huawei.com> > >>>> --- > >>>> v2: correct commit msg and title suggested by Suren Baghdasaryan > >>>> --- > >>>> fs/select.c | 20 +++++++++++++++++++- > >>>> 1 file changed, 19 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/fs/select.c b/fs/select.c > >>>> index 0ee55af1a55c..e64c7b4e9959 100644 > >>>> --- a/fs/select.c > >>>> +++ b/fs/select.c > >>>> @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > >>>> > >>>> static void free_poll_entry(struct poll_table_entry *entry) > >>>> { > >>>> - remove_wait_queue(entry->wait_address, &entry->wait); > >>>> + wait_queue_head_t *whead; > >>>> + > >>>> + rcu_read_lock(); > >>>> + /* If it is cleared by POLLFREE, it should be rcu-safe. > >>>> + * If we read NULL we need a barrier paired with smp_store_release() > >>>> + * in pollwake(). > >>>> + */ > >>>> + whead = smp_load_acquire(&entry->wait_address); > >>>> + if (whead) > >>>> + remove_wait_queue(whead, &entry->wait); > >>>> + rcu_read_unlock(); > >>>> fput(entry->filp); > >>>> } > >>>> > >>>> @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > >>>> entry = container_of(wait, struct poll_table_entry, wait); > >>>> if (key && !(key_to_poll(key) & entry->key)) > >>>> return 0; > >>>> + if (key_to_poll(key) & POLLFREE) { > >>>> + list_del_init(&wait->entry); > >>>> + /* wait_address !=NULL protects us from the race with > >>>> + * poll_freewait(). > >>>> + */ > >>>> + smp_store_release(&entry->wait_address, NULL); > >>>> + return 0; > >>>> + } > >>>> return __pollwake(wait, mode, sync, key); > >>> > >>> I don't understand why this patch is needed. > >>> > >>> The last time I looked at POLLFREE, it is only needed because of asynchronous > >>> polls. See my explanation in the commit message of commit 50252e4b5e989ce6. > >> > >> Ah, I missed that. Thanks for the correction. > >> > >>> > >>> In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > >>> tied to the current task rather than to the file being polled. Also refer to > >>> the comment above wake_up_pollfree(), which mentions this. > >>> > >>> fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > >>> need to handle POLLFREE. > >>> > >>> If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > >>> where it is using a waitqueue whose lifetime is tied to neither the current task > >>> nor the file being polled. That needs to be fixed. > >> > >> Yeah. We discussed this issue in > >> https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@mail.gmail.com > >> and the root cause is that cgroup_file_release() where > >> psi_trigger_destroy() is called is not tied to the cgroup file's real > >> lifetime (see my analysis here: > >> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t). > >> I guess it's time to do a deeper surgery and figure out a way to call > >> psi_trigger_destroy() when the polled cgroup file is actually being > >> destroyed. I'll take a closer look into this later today. > >> A fix will likely require some cgroup or kernfs code changes, so > >> CC'ing Tejun for visibility. > >> Thanks, > >> Suren. > >> > >>> > >>> - Eric > > . > >
On Tue, Jun 20, 2023 at 5:09 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Sun, Jun 18, 2023 at 6:28 AM lujialin (A) <lujialin4@huawei.com> wrote: > > > > Hi Suren: > > > > kernel config: > > x86_64_defconfig > > CONFIG_PSI=y > > CONFIG_SLUB_DEBUG=y > > CONFIG_SLUB_DEBUG_ON=y > > CONFIG_KASAN=y > > CONFIG_KASAN_INLINE=y > > > > I make some change in code, in order to increase the recurrence probability. > > diff --git a/fs/select.c b/fs/select.c > > index 5edffee1162c..5ee5b74a8386 100644 > > --- a/fs/select.c > > +++ b/fs/select.c > > @@ -139,6 +139,7 @@ void poll_freewait(struct poll_wqueues *pwq) > > { > > struct poll_table_page * p = pwq->table; > > int i; > > + mdelay(50); > > for (i = 0; i < pwq->inline_index; i++) > > free_poll_entry(pwq->inline_entries + i); > > while (p) { > > > > Here is the simple repo test.sh: > > #!/bin/bash > > > > RESOURCE_TYPES=("cpu" "memory" "io" "irq") > > #RESOURCE_TYPES=("cpu") > > cgroup_num=50 > > test_dir=/sys/fs/cgroup/test > > > > function restart_cgroup() { > > num=$(expr $RANDOM % $cgroup_num + 1) > > rmdir $test_dir/test_$num > > mkdir $test_dir/test_$num > > } > > > > function create_triggers() { > > num=$(expr $RANDOM % $cgroup_num + 1) > > random=$(expr $RANDOM % "${#RESOURCE_TYPES[@]}") > > psi_type="${RESOURCE_TYPES[${random}]}" > > ./psi_monitor $test_dir/test_$num $psi_type & > > } > > > > mkdir $test_dir > > for i in $(seq 1 $cgroup_num) > > do > > mkdir $test_dir/test_$i > > done > > for j in $(seq 1 100) > > do > > restart_cgroup & > > create_triggers & > > done > > > > psi_monitor.c: > > #include <errno.h> > > #include <fcntl.h> > > #include <stdio.h> > > #include <poll.h> > > #include <string.h> > > #include <unistd.h> > > > > int main(int argc, char *argv[]) { > > const char trig[] = "full 1000000 1000000"; > > struct pollfd fds; > > char filename[100]; > > > > sprintf(filename, "%s/%s.pressure", argv[1], argv[2]); > > > > fds.fd = open(filename, O_RDWR | O_NONBLOCK); > > if (fds.fd < 0) { > > printf("%s open error: %s\n", filename,strerror(errno)); > > return 1; > > } > > fds.events = POLLPRI; > > if (write(fds.fd, trig, strlen(trig) + 1) < 0) { > > printf("%s write error: %s\n",filename,strerror(errno)); > > return 1; > > } > > while (1) { > > poll(&fds, 1, -1); > > } > > close(fds.fd); > > return 0; > > } > > Thanks a lot! > I'll try to get this reproduced and fixed by the end of this week. Ok, I was able to reproduce the issue and I think the ultimate problem here is that kernfs_release_file() can be called from both kernfs_fop_release() and kernfs_drain_open_files(). While kernfs_fop_release() is called when the FD's refcount is 0 and there are no users, kernfs_drain_open_files() can be called while there are still other users. In our scenario, kn->attr.ops->release points to cgroup_pressure_release() which destroys the psi trigger thinking that (since the file is "released") there could be no other users. So, shell process which issues the rmdir command will destroy the trigger once cgroup_pressure_release() is called and the reproducer will step on the freed wait_queue_head. The way kernfs_release_file() is implemented, it ensures that kn->attr.ops->release(of) is called only once (the first time), therefore cgroup_pressure_release() is never called in reproducer's context. That prevents me from implementing some kind of refcounting schema for psi triggers because we are never notified when the last user is gone. I think to fix this I would need to modify kernfs_release_file() and maybe add another operation in kernfs_ops to indicate that the last user is gone (smth like final_release()). It's not pretty but so far I did not find a better way. I'll think some more over the weekend and will try to post a patch implementing the fix on Monday. Thanks, Suren. > Suren. > > > Thanks, > > Lu > > 在 2023/6/16 7:13, Suren Baghdasaryan 写道: > > > On Wed, Jun 14, 2023 at 11:19 AM Suren Baghdasaryan <surenb@google.com> wrote: > > >> > > >> On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@kernel.org> wrote: > > >>> > > >>> On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > > >>>> We found a UAF bug in remove_wait_queue as follows: > > >>>> > > >>>> ================================================================== > > >>>> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > > >>>> Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > > >>>> Call Trace: > > >>>> dump_stack+0x9c/0xd3 > > >>>> print_address_description.constprop.0+0x19/0x170 > > >>>> __kasan_report.cold+0x6c/0x84 > > >>>> kasan_report+0x3a/0x50 > > >>>> check_memory_region+0xfd/0x1f0 > > >>>> _raw_spin_lock_irqsave+0x71/0xe0 > > >>>> remove_wait_queue+0x26/0xc0 > > >>>> poll_freewait+0x6b/0x120 > > >>>> do_sys_poll+0x305/0x400 > > >>>> do_syscall_64+0x33/0x40 > > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > >>>> > > >>>> Allocated by task 15306: > > >>>> kasan_save_stack+0x1b/0x40 > > >>>> __kasan_kmalloc.constprop.0+0xb5/0xe0 > > >>>> psi_trigger_create.part.0+0xfc/0x450 > > >>>> cgroup_pressure_write+0xfc/0x3b0 > > >>>> cgroup_file_write+0x1b3/0x390 > > >>>> kernfs_fop_write_iter+0x224/0x2e0 > > >>>> new_sync_write+0x2ac/0x3a0 > > >>>> vfs_write+0x365/0x430 > > >>>> ksys_write+0xd5/0x1b0 > > >>>> do_syscall_64+0x33/0x40 > > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > >>>> > > >>>> Freed by task 15850: > > >>>> kasan_save_stack+0x1b/0x40 > > >>>> kasan_set_track+0x1c/0x30 > > >>>> kasan_set_free_info+0x20/0x40 > > >>>> __kasan_slab_free+0x151/0x180 > > >>>> kfree+0xba/0x680 > > >>>> cgroup_file_release+0x5c/0xe0 > > >>>> kernfs_drain_open_files+0x122/0x1e0 > > >>>> kernfs_drain+0xff/0x1e0 > > >>>> __kernfs_remove.part.0+0x1d1/0x3b0 > > >>>> kernfs_remove_by_name_ns+0x89/0xf0 > > >>>> cgroup_addrm_files+0x393/0x3d0 > > >>>> css_clear_dir+0x8f/0x120 > > >>>> kill_css+0x41/0xd0 > > >>>> cgroup_destroy_locked+0x166/0x300 > > >>>> cgroup_rmdir+0x37/0x140 > > >>>> kernfs_iop_rmdir+0xbb/0xf0 > > >>>> vfs_rmdir.part.0+0xa5/0x230 > > >>>> do_rmdir+0x2e0/0x320 > > >>>> __x64_sys_unlinkat+0x99/0xc0 > > >>>> do_syscall_64+0x33/0x40 > > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > >>>> ================================================================== > > >>>> > > >>>> If using epoll(), wake_up_pollfree will empty waitqueue and set > > >>>> wait_queue_head is NULL before free waitqueue of psi trigger. But is > > >>>> doesn't work when using poll(), which will lead a UAF problem in > > >>>> poll_freewait coms as following: > > >>>> > > >>>> (cgroup_rmdir) | > > >>>> psi_trigger_destroy | > > >>>> wake_up_pollfree(&t->event_wait) | > > >>>> synchronize_rcu(); | > > >>>> kfree(t) | > > >>>> | (poll_freewait) > > >>>> | free_poll_entry(pwq->inline_entries + i) > > >>>> | remove_wait_queue(entry->wait_address) > > >>>> | spin_lock_irqsave(&wq_head->lock) > > >>>> > > >>>> entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > > >>>> t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > > >>>> therefore wq_head in poll_freewait() has been already freed, which would > > >>>> lead to a UAF. > > > > > > Hi Lu, > > > Could you please share your reproducer along with the kernel config > > > you used? I'm trying to reproduce this UAF but every time I delete the > > > cgroup being polled, poll() simply returns POLLERR. > > > Thanks, > > > Suren. > > > > > >>>> > > >>>> similar problem for epoll() has been fixed commit c2dbe32d5db5 > > >>>> ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > > >>>> epoll wakeup function ep_poll_callback() will empty waitqueue and set > > >>>> wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > > >>>> is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > > >>>> which will fix the UAF bug in ep_remove_wait_queue. > > >>>> > > >>>> But poll wakeup function pollwake() doesn't do that. To fix the > > >>>> problem, we empty waitqueue and set wait_address is NULL in pollwake() when > > >>>> key is POLLFREE. otherwise in remove_wait_queue, which is similar to > > >>>> epoll(). > > >>>> > > >>>> Fixes: 0e94682b73bf ("psi: introduce psi monitor") > > >>>> Suggested-by: Suren Baghdasaryan <surenb@google.com> > > >>>> Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t > > >>>> Signed-off-by: Lu Jialin <lujialin4@huawei.com> > > >>>> --- > > >>>> v2: correct commit msg and title suggested by Suren Baghdasaryan > > >>>> --- > > >>>> fs/select.c | 20 +++++++++++++++++++- > > >>>> 1 file changed, 19 insertions(+), 1 deletion(-) > > >>>> > > >>>> diff --git a/fs/select.c b/fs/select.c > > >>>> index 0ee55af1a55c..e64c7b4e9959 100644 > > >>>> --- a/fs/select.c > > >>>> +++ b/fs/select.c > > >>>> @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > > >>>> > > >>>> static void free_poll_entry(struct poll_table_entry *entry) > > >>>> { > > >>>> - remove_wait_queue(entry->wait_address, &entry->wait); > > >>>> + wait_queue_head_t *whead; > > >>>> + > > >>>> + rcu_read_lock(); > > >>>> + /* If it is cleared by POLLFREE, it should be rcu-safe. > > >>>> + * If we read NULL we need a barrier paired with smp_store_release() > > >>>> + * in pollwake(). > > >>>> + */ > > >>>> + whead = smp_load_acquire(&entry->wait_address); > > >>>> + if (whead) > > >>>> + remove_wait_queue(whead, &entry->wait); > > >>>> + rcu_read_unlock(); > > >>>> fput(entry->filp); > > >>>> } > > >>>> > > >>>> @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > > >>>> entry = container_of(wait, struct poll_table_entry, wait); > > >>>> if (key && !(key_to_poll(key) & entry->key)) > > >>>> return 0; > > >>>> + if (key_to_poll(key) & POLLFREE) { > > >>>> + list_del_init(&wait->entry); > > >>>> + /* wait_address !=NULL protects us from the race with > > >>>> + * poll_freewait(). > > >>>> + */ > > >>>> + smp_store_release(&entry->wait_address, NULL); > > >>>> + return 0; > > >>>> + } > > >>>> return __pollwake(wait, mode, sync, key); > > >>> > > >>> I don't understand why this patch is needed. > > >>> > > >>> The last time I looked at POLLFREE, it is only needed because of asynchronous > > >>> polls. See my explanation in the commit message of commit 50252e4b5e989ce6. > > >> > > >> Ah, I missed that. Thanks for the correction. > > >> > > >>> > > >>> In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > > >>> tied to the current task rather than to the file being polled. Also refer to > > >>> the comment above wake_up_pollfree(), which mentions this. > > >>> > > >>> fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > > >>> need to handle POLLFREE. > > >>> > > >>> If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > > >>> where it is using a waitqueue whose lifetime is tied to neither the current task > > >>> nor the file being polled. That needs to be fixed. > > >> > > >> Yeah. We discussed this issue in > > >> https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@mail.gmail.com > > >> and the root cause is that cgroup_file_release() where > > >> psi_trigger_destroy() is called is not tied to the cgroup file's real > > >> lifetime (see my analysis here: > > >> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t). > > >> I guess it's time to do a deeper surgery and figure out a way to call > > >> psi_trigger_destroy() when the polled cgroup file is actually being > > >> destroyed. I'll take a closer look into this later today. > > >> A fix will likely require some cgroup or kernfs code changes, so > > >> CC'ing Tejun for visibility. > > >> Thanks, > > >> Suren. > > >> > > >>> > > >>> - Eric > > > . > > >
On Fri, Jun 23, 2023 at 6:58 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Jun 20, 2023 at 5:09 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Sun, Jun 18, 2023 at 6:28 AM lujialin (A) <lujialin4@huawei.com> wrote: > > > > > > Hi Suren: > > > > > > kernel config: > > > x86_64_defconfig > > > CONFIG_PSI=y > > > CONFIG_SLUB_DEBUG=y > > > CONFIG_SLUB_DEBUG_ON=y > > > CONFIG_KASAN=y > > > CONFIG_KASAN_INLINE=y > > > > > > I make some change in code, in order to increase the recurrence probability. > > > diff --git a/fs/select.c b/fs/select.c > > > index 5edffee1162c..5ee5b74a8386 100644 > > > --- a/fs/select.c > > > +++ b/fs/select.c > > > @@ -139,6 +139,7 @@ void poll_freewait(struct poll_wqueues *pwq) > > > { > > > struct poll_table_page * p = pwq->table; > > > int i; > > > + mdelay(50); > > > for (i = 0; i < pwq->inline_index; i++) > > > free_poll_entry(pwq->inline_entries + i); > > > while (p) { > > > > > > Here is the simple repo test.sh: > > > #!/bin/bash > > > > > > RESOURCE_TYPES=("cpu" "memory" "io" "irq") > > > #RESOURCE_TYPES=("cpu") > > > cgroup_num=50 > > > test_dir=/sys/fs/cgroup/test > > > > > > function restart_cgroup() { > > > num=$(expr $RANDOM % $cgroup_num + 1) > > > rmdir $test_dir/test_$num > > > mkdir $test_dir/test_$num > > > } > > > > > > function create_triggers() { > > > num=$(expr $RANDOM % $cgroup_num + 1) > > > random=$(expr $RANDOM % "${#RESOURCE_TYPES[@]}") > > > psi_type="${RESOURCE_TYPES[${random}]}" > > > ./psi_monitor $test_dir/test_$num $psi_type & > > > } > > > > > > mkdir $test_dir > > > for i in $(seq 1 $cgroup_num) > > > do > > > mkdir $test_dir/test_$i > > > done > > > for j in $(seq 1 100) > > > do > > > restart_cgroup & > > > create_triggers & > > > done > > > > > > psi_monitor.c: > > > #include <errno.h> > > > #include <fcntl.h> > > > #include <stdio.h> > > > #include <poll.h> > > > #include <string.h> > > > #include <unistd.h> > > > > > > int main(int argc, char *argv[]) { > > > const char trig[] = "full 1000000 1000000"; > > > struct pollfd fds; > > > char filename[100]; > > > > > > sprintf(filename, "%s/%s.pressure", argv[1], argv[2]); > > > > > > fds.fd = open(filename, O_RDWR | O_NONBLOCK); > > > if (fds.fd < 0) { > > > printf("%s open error: %s\n", filename,strerror(errno)); > > > return 1; > > > } > > > fds.events = POLLPRI; > > > if (write(fds.fd, trig, strlen(trig) + 1) < 0) { > > > printf("%s write error: %s\n",filename,strerror(errno)); > > > return 1; > > > } > > > while (1) { > > > poll(&fds, 1, -1); > > > } > > > close(fds.fd); > > > return 0; > > > } > > > > Thanks a lot! > > I'll try to get this reproduced and fixed by the end of this week. > > Ok, I was able to reproduce the issue and I think the ultimate problem > here is that kernfs_release_file() can be called from both > kernfs_fop_release() and kernfs_drain_open_files(). While > kernfs_fop_release() is called when the FD's refcount is 0 and there > are no users, kernfs_drain_open_files() can be called while there are > still other users. In our scenario, kn->attr.ops->release points to > cgroup_pressure_release() which destroys the psi trigger thinking that > (since the file is "released") there could be no other users. So, > shell process which issues the rmdir command will destroy the trigger > once cgroup_pressure_release() is called and the reproducer will step > on the freed wait_queue_head. The way kernfs_release_file() is > implemented, it ensures that kn->attr.ops->release(of) is called only > once (the first time), therefore cgroup_pressure_release() is never > called in reproducer's context. That prevents me from implementing > some kind of refcounting schema for psi triggers because we are never > notified when the last user is gone. > I think to fix this I would need to modify kernfs_release_file() and > maybe add another operation in kernfs_ops to indicate that the last > user is gone (smth like final_release()). It's not pretty but so far I > did not find a better way. I'll think some more over the weekend and > will try to post a patch implementing the fix on Monday. Posted 2 patches to fix this at: https://lore.kernel.org/all/20230626201713.1204982-1-surenb@google.com/ Thanks, Suren. > Thanks, > Suren. > > > > Suren. > > > > > Thanks, > > > Lu > > > 在 2023/6/16 7:13, Suren Baghdasaryan 写道: > > > > On Wed, Jun 14, 2023 at 11:19 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > >> > > > >> On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@kernel.org> wrote: > > > >>> > > > >>> On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > > > >>>> We found a UAF bug in remove_wait_queue as follows: > > > >>>> > > > >>>> ================================================================== > > > >>>> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > > > >>>> Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > > > >>>> Call Trace: > > > >>>> dump_stack+0x9c/0xd3 > > > >>>> print_address_description.constprop.0+0x19/0x170 > > > >>>> __kasan_report.cold+0x6c/0x84 > > > >>>> kasan_report+0x3a/0x50 > > > >>>> check_memory_region+0xfd/0x1f0 > > > >>>> _raw_spin_lock_irqsave+0x71/0xe0 > > > >>>> remove_wait_queue+0x26/0xc0 > > > >>>> poll_freewait+0x6b/0x120 > > > >>>> do_sys_poll+0x305/0x400 > > > >>>> do_syscall_64+0x33/0x40 > > > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > >>>> > > > >>>> Allocated by task 15306: > > > >>>> kasan_save_stack+0x1b/0x40 > > > >>>> __kasan_kmalloc.constprop.0+0xb5/0xe0 > > > >>>> psi_trigger_create.part.0+0xfc/0x450 > > > >>>> cgroup_pressure_write+0xfc/0x3b0 > > > >>>> cgroup_file_write+0x1b3/0x390 > > > >>>> kernfs_fop_write_iter+0x224/0x2e0 > > > >>>> new_sync_write+0x2ac/0x3a0 > > > >>>> vfs_write+0x365/0x430 > > > >>>> ksys_write+0xd5/0x1b0 > > > >>>> do_syscall_64+0x33/0x40 > > > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > >>>> > > > >>>> Freed by task 15850: > > > >>>> kasan_save_stack+0x1b/0x40 > > > >>>> kasan_set_track+0x1c/0x30 > > > >>>> kasan_set_free_info+0x20/0x40 > > > >>>> __kasan_slab_free+0x151/0x180 > > > >>>> kfree+0xba/0x680 > > > >>>> cgroup_file_release+0x5c/0xe0 > > > >>>> kernfs_drain_open_files+0x122/0x1e0 > > > >>>> kernfs_drain+0xff/0x1e0 > > > >>>> __kernfs_remove.part.0+0x1d1/0x3b0 > > > >>>> kernfs_remove_by_name_ns+0x89/0xf0 > > > >>>> cgroup_addrm_files+0x393/0x3d0 > > > >>>> css_clear_dir+0x8f/0x120 > > > >>>> kill_css+0x41/0xd0 > > > >>>> cgroup_destroy_locked+0x166/0x300 > > > >>>> cgroup_rmdir+0x37/0x140 > > > >>>> kernfs_iop_rmdir+0xbb/0xf0 > > > >>>> vfs_rmdir.part.0+0xa5/0x230 > > > >>>> do_rmdir+0x2e0/0x320 > > > >>>> __x64_sys_unlinkat+0x99/0xc0 > > > >>>> do_syscall_64+0x33/0x40 > > > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > >>>> ================================================================== > > > >>>> > > > >>>> If using epoll(), wake_up_pollfree will empty waitqueue and set > > > >>>> wait_queue_head is NULL before free waitqueue of psi trigger. But is > > > >>>> doesn't work when using poll(), which will lead a UAF problem in > > > >>>> poll_freewait coms as following: > > > >>>> > > > >>>> (cgroup_rmdir) | > > > >>>> psi_trigger_destroy | > > > >>>> wake_up_pollfree(&t->event_wait) | > > > >>>> synchronize_rcu(); | > > > >>>> kfree(t) | > > > >>>> | (poll_freewait) > > > >>>> | free_poll_entry(pwq->inline_entries + i) > > > >>>> | remove_wait_queue(entry->wait_address) > > > >>>> | spin_lock_irqsave(&wq_head->lock) > > > >>>> > > > >>>> entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > > > >>>> t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > > > >>>> therefore wq_head in poll_freewait() has been already freed, which would > > > >>>> lead to a UAF. > > > > > > > > Hi Lu, > > > > Could you please share your reproducer along with the kernel config > > > > you used? I'm trying to reproduce this UAF but every time I delete the > > > > cgroup being polled, poll() simply returns POLLERR. > > > > Thanks, > > > > Suren. > > > > > > > >>>> > > > >>>> similar problem for epoll() has been fixed commit c2dbe32d5db5 > > > >>>> ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > > > >>>> epoll wakeup function ep_poll_callback() will empty waitqueue and set > > > >>>> wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > > > >>>> is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > > > >>>> which will fix the UAF bug in ep_remove_wait_queue. > > > >>>> > > > >>>> But poll wakeup function pollwake() doesn't do that. To fix the > > > >>>> problem, we empty waitqueue and set wait_address is NULL in pollwake() when > > > >>>> key is POLLFREE. otherwise in remove_wait_queue, which is similar to > > > >>>> epoll(). > > > >>>> > > > >>>> Fixes: 0e94682b73bf ("psi: introduce psi monitor") > > > >>>> Suggested-by: Suren Baghdasaryan <surenb@google.com> > > > >>>> Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@mail.gmail.com/#t > > > >>>> Signed-off-by: Lu Jialin <lujialin4@huawei.com> > > > >>>> --- > > > >>>> v2: correct commit msg and title suggested by Suren Baghdasaryan > > > >>>> --- > > > >>>> fs/select.c | 20 +++++++++++++++++++- > > > >>>> 1 file changed, 19 insertions(+), 1 deletion(-) > > > >>>> > > > >>>> diff --git a/fs/select.c b/fs/select.c > > > >>>> index 0ee55af1a55c..e64c7b4e9959 100644 > > > >>>> --- a/fs/select.c > > > >>>> +++ b/fs/select.c > > > >>>> @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > > > >>>> > > > >>>> static void free_poll_entry(struct poll_table_entry *entry) > > > >>>> { > > > >>>> - remove_wait_queue(entry->wait_address, &entry->wait); > > > >>>> + wait_queue_head_t *whead; > > > >>>> + > > > >>>> + rcu_read_lock(); > > > >>>> + /* If it is cleared by POLLFREE, it should be rcu-safe. > > > >>>> + * If we read NULL we need a barrier paired with smp_store_release() > > > >>>> + * in pollwake(). > > > >>>> + */ > > > >>>> + whead = smp_load_acquire(&entry->wait_address); > > > >>>> + if (whead) > > > >>>> + remove_wait_queue(whead, &entry->wait); > > > >>>> + rcu_read_unlock(); > > > >>>> fput(entry->filp); > > > >>>> } > > > >>>> > > > >>>> @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > > > >>>> entry = container_of(wait, struct poll_table_entry, wait); > > > >>>> if (key && !(key_to_poll(key) & entry->key)) > > > >>>> return 0; > > > >>>> + if (key_to_poll(key) & POLLFREE) { > > > >>>> + list_del_init(&wait->entry); > > > >>>> + /* wait_address !=NULL protects us from the race with > > > >>>> + * poll_freewait(). > > > >>>> + */ > > > >>>> + smp_store_release(&entry->wait_address, NULL); > > > >>>> + return 0; > > > >>>> + } > > > >>>> return __pollwake(wait, mode, sync, key); > > > >>> > > > >>> I don't understand why this patch is needed. > > > >>> > > > >>> The last time I looked at POLLFREE, it is only needed because of asynchronous > > > >>> polls. See my explanation in the commit message of commit 50252e4b5e989ce6. > > > >> > > > >> Ah, I missed that. Thanks for the correction. > > > >> > > > >>> > > > >>> In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > > > >>> tied to the current task rather than to the file being polled. Also refer to > > > >>> the comment above wake_up_pollfree(), which mentions this. > > > >>> > > > >>> fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > > > >>> need to handle POLLFREE. > > > >>> > > > >>> If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > > > >>> where it is using a waitqueue whose lifetime is tied to neither the current task > > > >>> nor the file being polled. That needs to be fixed. > > > >> > > > >> Yeah. We discussed this issue in > > > >> https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@mail.gmail.com > > > >> and the root cause is that cgroup_file_release() where > > > >> psi_trigger_destroy() is called is not tied to the cgroup file's real > > > >> lifetime (see my analysis here: > > > >> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t). > > > >> I guess it's time to do a deeper surgery and figure out a way to call > > > >> psi_trigger_destroy() when the polled cgroup file is actually being > > > >> destroyed. I'll take a closer look into this later today. > > > >> A fix will likely require some cgroup or kernfs code changes, so > > > >> CC'ing Tejun for visibility. > > > >> Thanks, > > > >> Suren. > > > >> > > > >>> > > > >>> - Eric > > > > . > > > >
diff --git a/fs/select.c b/fs/select.c index 0ee55af1a55c..e64c7b4e9959 100644 --- a/fs/select.c +++ b/fs/select.c @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); static void free_poll_entry(struct poll_table_entry *entry) { - remove_wait_queue(entry->wait_address, &entry->wait); + wait_queue_head_t *whead; + + rcu_read_lock(); + /* If it is cleared by POLLFREE, it should be rcu-safe. + * If we read NULL we need a barrier paired with smp_store_release() + * in pollwake(). + */ + whead = smp_load_acquire(&entry->wait_address); + if (whead) + remove_wait_queue(whead, &entry->wait); + rcu_read_unlock(); fput(entry->filp); } @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key entry = container_of(wait, struct poll_table_entry, wait); if (key && !(key_to_poll(key) & entry->key)) return 0; + if (key_to_poll(key) & POLLFREE) { + list_del_init(&wait->entry); + /* wait_address !=NULL protects us from the race with + * poll_freewait(). + */ + smp_store_release(&entry->wait_address, NULL); + return 0; + } return __pollwake(wait, mode, sync, key); }