Message ID | 20230616085608.42435-1-libaokun1@huawei.com |
---|---|
State | New |
Series | quota: fix race condition between dqput() and dquot_mark_dquot_dirty() |
Commit Message
Baokun Li
June 16, 2023, 8:56 a.m. UTC
We ran into a problem where dqput() and dquot_mark_dquot_dirty() may race
as in the function graph below, causing a released dquot to be added to the
dqi_dirty_list. This leads to the dquot being released again in
dquot_writeback_dquots(), leaving two entries for the same dquot in
free_dquots.
cpu1                                          cpu2
_____________________|_____________________
wb_do_writeback                               CHOWN(1)
 ...
 ext4_da_update_reserve_space
  dquot_claim_block
   ...
    dquot_mark_dquot_dirty    // try to dirty old quota
     test_bit(DQ_ACTIVE_B, &dquot->dq_flags)  // still ACTIVE
     if (test_bit(DQ_MOD_B, &dquot->dq_flags))
     // test no dirty, wait dq_list_lock
     ...
                                              dquot_transfer
                                               __dquot_transfer
                                                dqput_all(transfer_from)  // rls old dquot
                                                 dqput  // last dqput
                                                  dquot_release
                                                   clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
                                                 atomic_dec(&dquot->dq_count)
                                                 put_dquot_last(dquot)
                                                  list_add_tail(&dquot->dq_free, &free_dquots)
                                                  // first add the dquot to free_dquots
     if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
      add dqi_dirty_list    // add freed dquot to dirty_list

P3:
ksys_sync
 ...
 dquot_writeback_dquots
  WARN_ON(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
  dqgrab(dquot)
   WARN_ON_ONCE(!atomic_read(&dquot->dq_count))
   WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
  dqput(dquot)
   put_dquot_last(dquot)
    list_add_tail(&dquot->dq_free, &free_dquots)
    // Double add the dquot to free_dquots
This causes list_del corruption when the entry is later removed from
free_dquots, and freeing the dquot twice via dqcache_shrink_scan() triggers
a use-after-free.
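To make the corruption concrete, here is a minimal userspace sketch (a
hand-rolled list_head in plain C, not kernel code) of how the double
list_add_tail() leaves free_dquots pointing at an entry that has already
been deleted -- the inconsistency the kernel's list debugging reports as
list_del corruption:

	#include <stdio.h>

	/* Minimal circular doubly-linked list, same shape as the kernel's. */
	struct list_head { struct list_head *next, *prev; };

	static void list_init(struct list_head *h) { h->next = h->prev = h; }

	static void list_add_tail(struct list_head *new, struct list_head *head)
	{
		new->prev = head->prev;
		new->next = head;
		head->prev->next = new;
		head->prev = new;
	}

	static void list_del(struct list_head *entry)
	{
		entry->prev->next = entry->next;
		entry->next->prev = entry->prev;
	}

	int main(void)
	{
		struct list_head free_dquots, dq;

		list_init(&free_dquots);
		list_add_tail(&dq, &free_dquots); /* first add: list consistent */
		list_add_tail(&dq, &free_dquots); /* double add: dq links to itself */
		list_del(&dq);                    /* "removes" dq, but ...        */

		/* ... the head still points at dq, so any later traversal walks
		 * into memory the kernel would by now have freed. */
		printf("head->next == &dq after list_del: %s\n",
		       free_dquots.next == &dq ? "yes (corrupted)" : "no");
		return 0;
	}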
A warning may also be triggered by a race like the one in the function
diagram below:
cpu1                        cpu2                      cpu3
________________|_______________|________________
wb_do_writeback             CHOWN(1)                  QUOTASYNC(1)
 ...                         ...
 ext4_da_update_reserve_space
  ...                        __dquot_transfer
                              dqput  // last dqput
                               dquot_release
                                dquot_is_busy
  if (test_bit(DQ_MOD_B, &dquot->dq_flags))
  // not dirty and still active
  dquot_mark_dquot_dirty
   if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
    add dqi_dirty_list
                                clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
                                                      dquot_writeback_dquots
                                                       WARN_ON(!test_bit(DQ_ACTIVE_B))
To solve this problem, we do something similar to the way dqget() avoids
racing with dquot_release(): first set the DQ_MOD_B flag, then execute
wait_on_dquot(). After that we know that either dquot_release() has already
finished, or it will be canceled because of its DQ_MOD_B flag test. At this
point, if the dquot is still DQ_ACTIVE_B, we can safely add it to the
dqi_dirty_list; otherwise we clear the DQ_MOD_B flag and exit directly.
Fixes: 4580b30ea887 ("quota: Do not dirty bad dquots")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
Hello Honza,
This problem can also be solved by modifying the reference counting
mechanism so that dquots hold a reference from allocation until they are
destroyed, i.e. the dquots in the free_dquots list have dq_count == 1. This
would allow us to drop the reference count as soon as we enter dqput(), and
to add the dquot to the dqi_dirty_list only when dq_count > 1. It would
also guarantee that a dquot on the dqi_dirty_list always has the
DQ_ACTIVE_B flag. But this is a more invasive modification, so we chose to
follow dqget() in avoiding the race with dquot_release(). If you prefer the
solution that modifies the dq_count mechanism, I would be happy to send
another version of the patch.
Thanks,
Baokun.
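For reference, a rough sketch of what dqput() might look like under that
alternative scheme (hypothetical code, not part of this patch; it assumes
every allocated dquot keeps one base reference so dq_count never reaches 0
while the structure exists):

	void dqput(struct dquot *dquot)
	{
		spin_lock(&dq_list_lock);
		/* Drop our reference immediately; the base reference taken at
		 * allocation keeps the structure alive, so dq_count >= 1 here. */
		if (atomic_dec_return(&dquot->dq_count) > 1) {
			spin_unlock(&dq_list_lock);
			return;
		}
		/*
		 * dq_count == 1: only the base reference remains, so nobody can
		 * legitimately dirty this dquot any more (dquot_mark_dquot_dirty()
		 * would be changed to add to dqi_dirty_list only when
		 * dq_count > 1).  Write back, dquot_release() and park on
		 * free_dquots as today.
		 */
		put_dquot_last(dquot);
		spin_unlock(&dq_list_lock);
	}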
fs/quota/dquot.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
Comments
Hello Baokun! On Fri 16-06-23 16:56:08, Baokun Li wrote:
> [...]
The way this *should* work is that dquot_mark_dquot_dirty() using dquot
references from the inode should be protected by dquot_srcu. The quota_off
code takes care to call synchronize_srcu(&dquot_srcu) to not drop dquot
references while they are used by other users. But you are right that
dquot_transfer() breaks this assumption. Most callers are fine since they
are also protected by inode->i_lock, but still I'd prefer to fix
dquot_transfer() to follow the guarantees dquot_srcu should provide.

Now calling synchronize_srcu() directly from dquot_transfer() is too
expensive (and mostly unnecessary), so what I would rather suggest is to
create another dquot list (use the dq_free list_head inside struct dquot
for it) and add each dquot whose last reference should be dropped there.
We'd then queue a work item which would call synchronize_srcu() and after
that perform the final cleanup of all the dquots on the list.

Now this also needs some modifications to dqget() and to the quotaoff code
to handle various races with the new dqput() code, so if you feel it is too
complex for your taste, I can implement this myself.

								Honza
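A rough sketch of that suggestion, to make the moving parts concrete
(illustrative names and simplified locking; the dqget()/quotaoff races Jan
mentions are deliberately left out):

	/* Staging list for dquots whose last reference has been dropped. */
	static LIST_HEAD(releasing_dquots);

	static void quota_release_workfn(struct work_struct *work)
	{
		LIST_HEAD(rls_head);

		/* Grab everything parked so far. */
		spin_lock(&dq_list_lock);
		list_splice_init(&releasing_dquots, &rls_head);
		spin_unlock(&dq_list_lock);

		/*
		 * After this, no dquot_srcu reader (e.g. mark_all_dquot_dirty())
		 * can still hold a pointer to any dquot on rls_head.  The
		 * scheduling latency before the work runs batches many puts
		 * into a single synchronize_srcu() call.
		 */
		synchronize_srcu(&dquot_srcu);

		while (!list_empty(&rls_head)) {
			struct dquot *dquot = list_first_entry(&rls_head,
							       struct dquot, dq_free);
			list_del_init(&dquot->dq_free);
			/* write back if dirty, dquot_release(), then move the
			 * dquot to free_dquots under dq_list_lock */
		}
	}
	static DECLARE_WORK(quota_release_work, quota_release_workfn);

	/* dqput() would then park the dquot instead of releasing it inline: */
	/*	list_add_tail(&dquot->dq_free, &releasing_dquots);	*/
	/*	queue_work(system_unbound_wq, &quota_release_work);	*/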
Hello Honza! On 2023/6/16 23:28, Jan Kara wrote:
> The way this *should* work is that dquot_mark_dquot_dirty() using dquot
> references from the inode should be protected by dquot_srcu. The quota_off
> code takes care to call synchronize_srcu(&dquot_srcu) to not drop dquot
> references while they are used by other users. But you are right that
> dquot_transfer() breaks this assumption. Most callers are fine since they
> are also protected by inode->i_lock, but still I'd prefer to fix
> dquot_transfer() to follow the guarantees dquot_srcu should provide.
Indeed! Operations accessing dquots via inode pointers should be protected
by dquot_srcu. And inode->i_lock ensures that we do not record usage
changes through a stale dquot pointer, even when racing with
dquot_transfer().
> Now calling synchronize_srcu() directly from dquot_transfer() is too
> expensive (and mostly unnecessary), so what I would rather suggest is to
> create another dquot list (use the dq_free list_head inside struct dquot
> for it) and add each dquot whose last reference should be dropped there.
> We'd then queue a work item which would call synchronize_srcu() and after
> that perform the final cleanup of all the dquots on the list.
>
> Now this also needs some modifications to dqget() and to the quotaoff code
> to handle various races with the new dqput() code, so if you feel it is
> too complex for your taste, I can implement this myself.
I see what you mean; what we are doing here is very similar to
drop_dquot_ref(), and if we have to modify it this way, I am happy to
implement it.

But as you said, calling synchronize_srcu() is too expensive and it blocks
almost all mark-dirty processes, so today we only call it in
performance-insensitive scenarios like dquot_disable(). And how do we
control how often synchronize_srcu() is called? Do we flush once there are
more than a certain number of dquots in releasing_dquots, or at regular
intervals? It would also introduce various new races. Is it worthwhile to
do this for a corner scenario like this one?
I think we can simply focus on the race between the DQ_ACTIVE_B flag and
the DQ_MOD_B flag, which is the core problem, because the same quota should
never have both flags set. These two flags are protected by dq_list_lock
and dquot->dq_lock respectively, so it makes sense to add a wait_on_dquot()
to ensure the accuracy of DQ_ACTIVE_B.

Adding wait_on_dquot() to this solution also looks expensive, and I had
similar concerns, but testing found no performance impact because the fast
path takes no locks at all. We return 1 directly when the current dquot is
already dirty, so no locking is involved from the moment a dquot becomes
dirty until its DQ_MOD_B is cleared. And removal from the dqi_dirty_list
only happens on the last dqput() and in dquot_writeback_dquots(), both of
which occur very infrequently.

And if we don't care about the harmless warning in dquot_writeback_dquots()
in the second function graph (we could just skip it), the wait_on_dquot()
in this solution can be removed. We would only need to re-check whether the
dquot is DQ_ACTIVE_B under dq_list_lock protection to solve the problem in
the first function graph. This is why there are two function graphs in the
patch description: because of the second problem, we have to be more
careful if we want to keep the warning. Thanks!
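For context, the lock-free fast path being referred to is the existing top
of dquot_mark_dquot_dirty() in fs/quota/dquot.c (quoted from the unpatched
function; the patch leaves it untouched):

	/* If quota is dirty already, we don't have to acquire dq_list_lock */
	if (test_bit(DQ_MOD_B, &dquot->dq_flags))
		return 1;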
Hello! On Mon 19-06-23 14:44:03, Baokun Li wrote:
> But as you said, calling synchronize_srcu() is too expensive and it
> blocks almost all mark-dirty processes, so today we only call it in
> performance-insensitive scenarios like dquot_disable(). And how do we
> control how often synchronize_srcu() is called?
So the way this is handled (e.g. in the fsnotify subsystem) is that we just
queue a work item when we drop the last reference to the protected
structure. The scheduling latency before the work item gets executed is
enough to batch synchronize_srcu() calls, and once synchronize_srcu()
finishes, we add all items from the "staging list" to the free_dquots list.

> I think we can simply focus on the race between the DQ_ACTIVE_B flag and
> the DQ_MOD_B flag, which is the core problem, because the same quota
> should never have both flags set.
But the fundamental problem is not only the race with DQ_MOD_B setting. The
dquot structure can be completely freed by the time
dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
why I think making __dquot_transfer() obey dquot_srcu rules is the right
solution.

								Honza
Hello! Sorry for the late reply, we just had the Dragon Boat Festival
holiday. On 2023/6/22 22:56, Jan Kara wrote:
> So the way this is handled (e.g. in the fsnotify subsystem) is that we
> just queue a work item when we drop the last reference to the protected
> structure. The scheduling latency before the work item gets executed is
> enough to batch synchronize_srcu() calls, and once synchronize_srcu()
> finishes, we add all items from the "staging list" to the free_dquots
> list.
Cool, thanks a lot for clearing up the confusion! I will implement it in
the next version.

> But the fundamental problem is not only the race with DQ_MOD_B setting.
> The dquot structure can be completely freed by the time
> dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
> why I think making __dquot_transfer() obey dquot_srcu rules is the right
> solution.
Yes, now I also think that making __dquot_transfer() obey dquot_srcu rules
is the better solution. But with inode->i_lock protection, why would the
dquot structure be completely freed?

Thanks!
Hello! On Sun 25-06-23 15:56:10, Baokun Li wrote:
> Yes, now I also think that making __dquot_transfer() obey dquot_srcu
> rules is the better solution. But with inode->i_lock protection, why
> would the dquot structure be completely freed?
Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty(), it
does not hold any locks (only dquot_srcu). So nothing prevents
dquot_transfer() from going ahead, swapping the dquot structure pointers
and dropping the dquot references, after which mark_all_dquot_dirty() can
use a stale pointer to call mark_dquot_dirty() on already freed memory.

								Honza
Hello! On 2023/6/26 21:09, Jan Kara wrote:
> Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty(), it
> does not hold any locks (only dquot_srcu). So nothing prevents
> dquot_transfer() from going ahead, swapping the dquot structure pointers
> and dropping the dquot references, after which mark_all_dquot_dirty() can
> use a stale pointer to call mark_dquot_dirty() on already freed memory.
No, this doesn't look like it can happen. mark_all_dquot_dirty() takes a
pointer to the dquot pointer array, and the dquots in that array change
dynamically, so after the dquot structure pointers are swapped,
mark_all_dquot_dirty() uses the new pointer. The stale pointer is only
destroyed after the swap, so there is no case of using the stale pointer
here.
Hello! On Mon 26-06-23 21:55:49, Baokun Li wrote:
> No, this doesn't look like it can happen. mark_all_dquot_dirty() takes a
> pointer to the dquot pointer array, and the dquots in that array change
> dynamically, so after the dquot structure pointers are swapped,
> mark_all_dquot_dirty() uses the new pointer. The stale pointer is only
> destroyed after the swap, so there is no case of using the stale pointer
> here.
There is a case - CPU0 can prefetch the values from the dquots[] array into
its local cache, then CPU1 can update the dquots[] array (these writes can
happily stay in CPU1's store cache, invisible to other CPUs) and free the
dquots via dqput(). Then CPU0 can pass the prefetched dquot pointers to
mark_dquot_dirty(). There are no locks or memory barriers preventing CPUs
from ordering instructions and memory operations like this in the code...
You can read Documentation/memory-barriers.txt about all the perils current
CPU architecture brings wrt coordination of memory accesses among CPUs ;)

								Honza
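Laid out as a step-by-step timeline (an illustration of the scenario above,
not code from the tree):

	/*
	 *   CPU0 (mark_all_dquot_dirty,           CPU1 (dquot_transfer /
	 *         holds only dquot_srcu)                last dqput)
	 *   -------------------------------       -------------------------------
	 *   dquot = dquots[cnt];
	 *     // load can be satisfied from a
	 *     // value read (prefetched) before
	 *     // CPU1's store becomes visible
	 *                                         dquots[cnt] = new_dquot;
	 *                                           // store may linger in
	 *                                           // CPU1's store buffer
	 *                                         dqput(old_dquot);
	 *                                           // last ref: old_dquot freed
	 *   mark_dquot_dirty(dquot);
	 *     // dquot == old_dquot, so this
	 *     // touches freed memory
	 *
	 * No common lock orders CPU0's load against CPU1's store-and-free, and
	 * dquot_srcu only helps if the side dropping the reference waits for
	 * readers (synchronize_srcu()) before freeing -- which dquot_transfer()
	 * currently does not do.
	 */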
Hello! On 2023/6/27 16:34, Jan Kara wrote:
> There is a case - CPU0 can prefetch the values from the dquots[] array
> into its local cache, then CPU1 can update the dquots[] array (these
> writes can happily stay in CPU1's store cache, invisible to other CPUs)
> and free the dquots via dqput(). Then CPU0 can pass the prefetched dquot
> pointers to mark_dquot_dirty(). There are no locks or memory barriers
> preventing CPUs from ordering instructions and memory operations like
> this in the code...
Got it!

Sorry for misunderstanding you (I thought "completely freed" meant
dquot_destroy(), but you must have meant dquot_release()). Then this is
exactly the scenario in the first function graph in my patch description.
In this scenario the data from the old dquot has already been moved to the
new dquot, but the old dquot may still get its DQ_MOD_B bit set after its
DQ_ACTIVE_B bit has been cleared. That is exactly the situation we want to
avoid.
On Tue 27-06-23 17:08:27, Baokun Li wrote:
> Sorry for misunderstanding you (I thought "completely freed" meant
> dquot_destroy(), but you must have meant dquot_release()).
Well, the dquot can even get to dquot_destroy(). There's nothing really
preventing CPU2 from going into memory reclaim and freeing the dquot in
dqcache_shrink_scan() still before CPU0 even calls mark_dquot_dirty() on
it. Sure, such timing on real hardware is very unlikely, but in a VM where
a virtual CPU can get starved for a significant amount of time this could
happen.

								Honza
On 2023/6/27 17:28, Jan Kara wrote:
> Well, the dquot can even get to dquot_destroy(). There's nothing really
> preventing CPU2 from going into memory reclaim and freeing the dquot in
> dqcache_shrink_scan() still before CPU0 even calls mark_dquot_dirty() on
> it. Sure, such timing on real hardware is very unlikely, but in a VM
> where a virtual CPU can get starved for a significant amount of time this
> could happen.
Yes. invalidate_dquots() calling do_destroy_dquot() does not have this
problem, because it calls synchronize_srcu(&dquot_srcu) in drop_dquot_ref()
beforehand.
However, calling do_destroy_dquot() from dqcache_shrink_scan() is not
protected this way. Calling dqcache_shrink_scan() after P3 runs triggers
the UAF by calling do_destroy_dquot() twice, as shown in function graph 1
in the patch description; if dqcache_shrink_scan() is called after the
dquot is added to free_dquots but before P3 runs, the UAF may instead be
triggered in dquot_mark_dquot_dirty().

Thank you for your patient explanation! The new version of the solution is
almost complete and is going through some stress testing, which I will send
out once it passes.
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index e3e4f4047657..2a04cd74c7c5 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -362,11 +362,26 @@ int dquot_mark_dquot_dirty(struct dquot *dquot)
 		return 1;
 
 	spin_lock(&dq_list_lock);
-	if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) {
+	ret = test_and_set_bit(DQ_MOD_B, &dquot->dq_flags);
+	if (ret)
+		goto out_lock;
+	spin_unlock(&dq_list_lock);
+
+	/*
+	 * Wait for dq_lock - after this we know that either dquot_release() is
+	 * already finished or it will be canceled due to DQ_MOD_B flag test.
+	 */
+	wait_on_dquot(dquot);
+	spin_lock(&dq_list_lock);
+	if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+		clear_bit(DQ_MOD_B, &dquot->dq_flags);
+		goto out_lock;
+	}
+	/* Cleared DQ_MOD_B means that the dquot has been written back */
+	if (test_bit(DQ_MOD_B, &dquot->dq_flags))
 		list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)->
 				info[dquot->dq_id.type].dqi_dirty_list);
-		ret = 0;
-	}
+out_lock:
 	spin_unlock(&dq_list_lock);
 	return ret;
 }
@@ -791,7 +806,7 @@ void dqput(struct dquot *dquot)
 		return;
 	}
 	/* Need to release dquot? */
-	if (dquot_dirty(dquot)) {
+	if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags) && dquot_dirty(dquot)) {
 		spin_unlock(&dq_list_lock);
 		/* Commit dquot before releasing */
 		ret = dquot->dq_sb->dq_op->write_dquot(dquot);