quota: fix race condition between dqput() and dquot_mark_dquot_dirty()

Message ID 20230616085608.42435-1-libaokun1@huawei.com
State New
Headers
Series quota: fix race condition between dqput() and dquot_mark_dquot_dirty() |

Commit Message

Baokun Li June 16, 2023, 8:56 a.m. UTC
  We ran into a problem that dqput() and dquot_mark_dquot_dirty() may race
like the function graph below, causing a released dquot to be added to the
dqi_dirty_list, and this leads to that dquot being released again in
dquot_writeback_dquots(), making two identical quotas in free_dquots.

       cpu1              cpu2
_________________|_________________
wb_do_writeback         CHOWN(1)
 ...
  ext4_da_update_reserve_space
   dquot_claim_block
    ...
     dquot_mark_dquot_dirty // try to dirty old quota
      test_bit(DQ_ACTIVE_B, &dquot->dq_flags) // still ACTIVE
      if (test_bit(DQ_MOD_B, &dquot->dq_flags))
      // test no dirty, wait dq_list_lock
                    ...
                     dquot_transfer
                      __dquot_transfer
                      dqput_all(transfer_from) // rls old dquot
                       dqput // last dqput
                        dquot_release
                         clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
                        atomic_dec(&dquot->dq_count)
                        put_dquot_last(dquot)
                         list_add_tail(&dquot->dq_free, &free_dquots)
                         // first add the dquot to free_dquots
      if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
        add dqi_dirty_list // add freed dquot to dirty_list
P3:
ksys_sync
 ...
  dquot_writeback_dquots
   WARN_ON(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
   dqgrab(dquot)
    WARN_ON_ONCE(!atomic_read(&dquot->dq_count))
    WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
   dqput(dquot)
    put_dquot_last(dquot)
     list_add_tail(&dquot->dq_free, &free_dquots)
     // Double add the dquot to free_dquots

This causes a list_del corruption when removing the entry from free_dquots,
and even trying to free the dquot twice in dqcache_shrink_scan triggers a
use-after-free.

A warning may also be triggered by a race like the function diagram below:

       cpu1            cpu2           cpu3
________________|_______________|________________
wb_do_writeback   CHOWN(1)        QUOTASYNC(1)
 ...                              ...
  ext4_da_update_reserve_space
    ...           __dquot_transfer
                   dqput // last dqput
                    dquot_release
                     dquot_is_busy
                      if (test_bit(DQ_MOD_B, &dquot->dq_flags))
                       // not dirty and still active
     dquot_mark_dquot_dirty
      if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
        add dqi_dirty_list
                       clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
                                   dquot_writeback_dquots
                                    WARN_ON(!test_bit(DQ_ACTIVE_B))

To solve this problem, it is similar to the way dqget() avoids racing with
dquot_release(). First set the DQ_MOD_B flag, then execute wait_on_dquot(),
after this we know that either dquot_release() is already finished or it
will be canceled due to DQ_MOD_B flag test, at this point if the quota is
DQ_ACTIVE_B, then we can safely add the dquot to the dqi_dirty_list,
otherwise clear the DQ_MOD_B flag and exit directly.

Fixes: 4580b30ea887 ("quota: Do not dirty bad dquots")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---

Hello Honza,

This problem can also be solved by modifying the reference count mechanism,
where dquots hold a reference count after they are allocated until they are
destroyed, i.e. the dquots in the free_dquots list have dq_count == 1. This
allows us to reduce the reference count as soon as we enter the dqput(),
and then add the dquot to the dqi_dirty_list only when dq_count > 1. This
also prevents the dquot in the dqi_dirty_list from not having the
DQ_ACTIVE_B flag, but this is a more impactful modification, so we chose to
refer to dqget() to avoid racing with dquot_release(). If you prefer this
solution by modifying the dq_count mechanism, I would be happy to send
another version of the patch.

Thanks,
Baokun.

 fs/quota/dquot.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)
  

Comments

Jan Kara June 16, 2023, 3:28 p.m. UTC | #1
Hello Baokun!

On Fri 16-06-23 16:56:08, Baokun Li wrote:
> We ran into a problem that dqput() and dquot_mark_dquot_dirty() may race
> like the function graph below, causing a released dquot to be added to the
> dqi_dirty_list, and this leads to that dquot being released again in
> dquot_writeback_dquots(), making two identical quotas in free_dquots.
> 
>        cpu1              cpu2
> _________________|_________________
> wb_do_writeback         CHOWN(1)
>  ...
>   ext4_da_update_reserve_space
>    dquot_claim_block
>     ...
>      dquot_mark_dquot_dirty // try to dirty old quota
>       test_bit(DQ_ACTIVE_B, &dquot->dq_flags) // still ACTIVE
>       if (test_bit(DQ_MOD_B, &dquot->dq_flags))
>       // test no dirty, wait dq_list_lock
>                     ...
>                      dquot_transfer
>                       __dquot_transfer
>                       dqput_all(transfer_from) // rls old dquot
>                        dqput // last dqput
>                         dquot_release
>                          clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
>                         atomic_dec(&dquot->dq_count)
>                         put_dquot_last(dquot)
>                          list_add_tail(&dquot->dq_free, &free_dquots)
>                          // first add the dquot to free_dquots
>       if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
>         add dqi_dirty_list // add freed dquot to dirty_list
> P3:
> ksys_sync
>  ...
>   dquot_writeback_dquots
>    WARN_ON(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
>    dqgrab(dquot)
>     WARN_ON_ONCE(!atomic_read(&dquot->dq_count))
>     WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
>    dqput(dquot)
>     put_dquot_last(dquot)
>      list_add_tail(&dquot->dq_free, &free_dquots)
>      // Double add the dquot to free_dquots
> 
> This causes a list_del corruption when removing the entry from free_dquots,
> and even trying to free the dquot twice in dqcache_shrink_scan triggers a
> use-after-free.
> 
> A warning may also be triggered by a race like the function diagram below:
> 
>        cpu1            cpu2           cpu3
> ________________|_______________|________________
> wb_do_writeback   CHOWN(1)        QUOTASYNC(1)
>  ...                              ...
>   ext4_da_update_reserve_space
>     ...           __dquot_transfer
>                    dqput // last dqput
>                     dquot_release
>                      dquot_is_busy
>                       if (test_bit(DQ_MOD_B, &dquot->dq_flags))
>                        // not dirty and still active
>      dquot_mark_dquot_dirty
>       if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
>         add dqi_dirty_list
>                        clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
>                                    dquot_writeback_dquots
>                                     WARN_ON(!test_bit(DQ_ACTIVE_B))
> 
> To solve this problem, it is similar to the way dqget() avoids racing with
> dquot_release(). First set the DQ_MOD_B flag, then execute wait_on_dquot(),
> after this we know that either dquot_release() is already finished or it
> will be canceled due to DQ_MOD_B flag test, at this point if the quota is
> DQ_ACTIVE_B, then we can safely add the dquot to the dqi_dirty_list,
> otherwise clear the DQ_MOD_B flag and exit directly.
> 
> Fixes: 4580b30ea887 ("quota: Do not dirty bad dquots")
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
> 
> Hello Honza,
> 
> This problem can also be solved by modifying the reference count mechanism,
> where dquots hold a reference count after they are allocated until they are
> destroyed, i.e. the dquots in the free_dquots list have dq_count == 1. This
> allows us to reduce the reference count as soon as we enter the dqput(),
> and then add the dquot to the dqi_dirty_list only when dq_count > 1. This
> also prevents the dquot in the dqi_dirty_list from not having the
> DQ_ACTIVE_B flag, but this is a more impactful modification, so we chose to
> refer to dqget() to avoid racing with dquot_release(). If you prefer this
> solution by modifying the dq_count mechanism, I would be happy to send
> another version of the patch.

The way this *should* work is that dquot_mark_dquot_dirty() using dquot
references from the inode should be protected by dquot_srcu. quota_off
code takes care to call synchronize_srcu(&dquot_srcu) to not drop dquot
references while they are used by other users. But you are right
dquot_transfer() breaks this assumption. Most callers are fine since they
are also protected by inode->i_lock but still I'd prefer to fix
dquot_transfer() to follow the guarantees dquot_srcu should provide.

Now calling synchronize_srcu() directly from dquot_transfer() is too
expensive (and mostly unnecessary) so what I would rather suggest is to
create another dquot list (use dq_free list_head inside struct dquot for
it) and add dquot whose last reference should be dropped there. We'd then
queue work item which would call synchronize_srcu() and after that perform
the final cleanup of all the dquots on the list.

Now this also needs some modifications to dqget() and to quotaoff code to
handle various races with the new dqput() code so if you feel it is too
complex for your taste, I can implement this myself.

								Honza
  
Baokun Li June 19, 2023, 6:44 a.m. UTC | #2
Hello Honza !

On 2023/6/16 23:28, Jan Kara wrote:
> Hello Baokun!
>
> On Fri 16-06-23 16:56:08, Baokun Li wrote:
>> To solve this problem, it is similar to the way dqget() avoids racing with
>> dquot_release(). First set the DQ_MOD_B flag, then execute wait_on_dquot(),
>> after this we know that either dquot_release() is already finished or it
>> will be canceled due to DQ_MOD_B flag test, at this point if the quota is
>> DQ_ACTIVE_B, then we can safely add the dquot to the dqi_dirty_list,
>> otherwise clear the DQ_MOD_B flag and exit directly.
>>
>> Fixes: 4580b30ea887 ("quota: Do not dirty bad dquots")
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>
>> Hello Honza,
>>
>> This problem can also be solved by modifying the reference count mechanism,
>> where dquots hold a reference count after they are allocated until they are
>> destroyed, i.e. the dquots in the free_dquots list have dq_count == 1. This
>> allows us to reduce the reference count as soon as we enter the dqput(),
>> and then add the dquot to the dqi_dirty_list only when dq_count > 1. This
>> also prevents the dquot in the dqi_dirty_list from not having the
>> DQ_ACTIVE_B flag, but this is a more impactful modification, so we chose to
>> refer to dqget() to avoid racing with dquot_release(). If you prefer this
>> solution by modifying the dq_count mechanism, I would be happy to send
>> another version of the patch.
> The way this *should* work is that dquot_mark_dquot_dirty() using dquot
> references from the inode should be protected by dquot_srcu. quota_off
> code takes care to call synchronize_srcu(&dquot_srcu) to not drop dquot
> references while they are used by other users. But you are right
> dquot_transfer() breaks this assumption. Most callers are fine since they
> are also protected by inode->i_lock but still I'd prefer to fix
> dquot_transfer() to follow the guarantees dquot_srcu should provide.
Indeed!
Operation accessing dquots via inode pointers shuould be protectedby 
dquot_srcu.
And inode->i_lock ensures that we do not record usage changes in a 
deprecated
dquota pointer, even when concurrent with dquot_transfer().
> Now calling synchronize_srcu() directly from dquot_transfer() is too
> expensive (and mostly unnecessary) so what I would rather suggest is to
> create another dquot list (use dq_free list_head inside struct dquot for
> it) and add dquot whose last reference should be dropped there. We'd then
> queue work item which would call synchronize_srcu() and after that perform
> the final cleanup of all the dquots on the list.
>
> Now this also needs some modifications to dqget() and to quotaoff code to
> handle various races with the new dqput() code so if you feel it is too
> complex for your taste, I can implement this myself.
>
> 								Honza
I see what you mean, what we are doing here is very similar to 
drop_dquot_ref(),
and if we have to modify it this way, I am happy to implement it.

But as you said, calling synchronize_srcu() is too expensive and it 
blocks almost all
mark dirty processes, so we only call it now in performance insensitive 
scenarios
like dquot_disable(). And how do we control how often synchronize_srcu() 
is called?
Are there more than a certain number of dquots in releasing_dquots or 
are they
executed at regular intervals? And it would introduce various new 
competitions.
Is it worthwhile to do this for a corner scenario like this one?

I think we can simply focus on the race between the DQ_ACTIVE_B flag and the
DQ_MOD_B flag, which is the core problem, because the same quota should not
have both flags. These two flags are protected by dq_list_lock and 
dquot->dq_lock
respectively, so it makes sense to add a wait_on_dquot() to ensure the 
accuracy of
DQ_ACTIVE_B.

The addition of wait_on_dquot() to this solution also seems very 
expensive, and I had
similar concerns before, but testing found no performance impact due to 
the fast path
without any locks. We returns 1 directly when the current dquot is 
already dirty, so there
is no locking involved after dquot is dirty until DQ_MOD_B is cleared. 
And clear the
dirtying of the dqi_dirty_list only happens in last dqput and 
dquot_writeback_dquots(),
both of which occur very infrequently.

And if we don't care about the harmless warning in 
dquot_writeback_dquots() in the
second function graph (just skip it), wait_on_dquot() in the solution 
can be removed.
We only need to determine again whether dquot is DQ_ACTIVE_B under 
dq_list_lock
protection to solve the problem in the first function graph. This is why 
there are two
function graphs in the patch description, because with the second 
problem, we have
to be more careful if we want to keep the warning.

Thanks!
  
Jan Kara June 22, 2023, 2:56 p.m. UTC | #3
Hello!

On Mon 19-06-23 14:44:03, Baokun Li wrote:
> On 2023/6/16 23:28, Jan Kara wrote:
> > Now calling synchronize_srcu() directly from dquot_transfer() is too
> > expensive (and mostly unnecessary) so what I would rather suggest is to
> > create another dquot list (use dq_free list_head inside struct dquot for
> > it) and add dquot whose last reference should be dropped there. We'd then
> > queue work item which would call synchronize_srcu() and after that perform
> > the final cleanup of all the dquots on the list.
> > 
> > Now this also needs some modifications to dqget() and to quotaoff code to
> > handle various races with the new dqput() code so if you feel it is too
> > complex for your taste, I can implement this myself.
> > 
> > 								Honza
> I see what you mean, what we are doing here is very similar to
> drop_dquot_ref(),
> and if we have to modify it this way, I am happy to implement it.
> 
> But as you said, calling synchronize_srcu() is too expensive and it blocks
> almost all
> mark dirty processes, so we only call it now in performance insensitive
> scenarios
> like dquot_disable(). And how do we control how often synchronize_srcu() is
> called?
> Are there more than a certain number of dquots in releasing_dquots or are
> they
> executed at regular intervals? And it would introduce various new
> competitions.
> Is it worthwhile to do this for a corner scenario like this one?

So the way this is handled (e.g. in fsnotify subsystem) is that we just
queue work item when we drop the last reference to the protected structure.
The scheduling latency before the work item gets executed is enough to
batch synchronize_srcu() calls and once synchronize_srcu() finishes, we add
all items from the "staging list" to the free_dquots list.

> I think we can simply focus on the race between the DQ_ACTIVE_B flag and
> the DQ_MOD_B flag, which is the core problem, because the same quota
> should not have both flags. These two flags are protected by dq_list_lock
> and dquot->dq_lock respectively, so it makes sense to add a
> wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.

But the fundamental problem is not only the race with DQ_MOD_B setting. The
dquot structure can be completely freed by the time
dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
why I think making __dquot_transfer() obey dquot_srcu rules is the right
solution.

								Honza
  
Baokun Li June 25, 2023, 7:56 a.m. UTC | #4
Hello!

Sorry for the late reply, just had a Dragon Boat holiday.

On 2023/6/22 22:56, Jan Kara wrote:
> Hello!
>
> On Mon 19-06-23 14:44:03, Baokun Li wrote:
>> On 2023/6/16 23:28, Jan Kara wrote:
>>> Now calling synchronize_srcu() directly from dquot_transfer() is too
>>> expensive (and mostly unnecessary) so what I would rather suggest is to
>>> create another dquot list (use dq_free list_head inside struct dquot for
>>> it) and add dquot whose last reference should be dropped there. We'd then
>>> queue work item which would call synchronize_srcu() and after that perform
>>> the final cleanup of all the dquots on the list.
>>>
>>> Now this also needs some modifications to dqget() and to quotaoff code to
>>> handle various races with the new dqput() code so if you feel it is too
>>> complex for your taste, I can implement this myself.
>>>
>>> 								Honza
>> I see what you mean, what we are doing here is very similar to
>> drop_dquot_ref(),
>> and if we have to modify it this way, I am happy to implement it.
>>
>> But as you said, calling synchronize_srcu() is too expensive and it blocks
>> almost all
>> mark dirty processes, so we only call it now in performance insensitive
>> scenarios
>> like dquot_disable(). And how do we control how often synchronize_srcu() is
>> called?
>> Are there more than a certain number of dquots in releasing_dquots or are
>> they
>> executed at regular intervals? And it would introduce various new
>> competitions.
>> Is it worthwhile to do this for a corner scenario like this one?
> So the way this is handled (e.g. in fsnotify subsystem) is that we just
> queue work item when we drop the last reference to the protected structure.
> The scheduling latency before the work item gets executed is enough to
> batch synchronize_srcu() calls and once synchronize_srcu() finishes, we add
> all items from the "staging list" to the free_dquots list.

Cool, thanks a lot for clearing up the confusion!

I will implement it in the next version.

>
>> I think we can simply focus on the race between the DQ_ACTIVE_B flag and
>> the DQ_MOD_B flag, which is the core problem, because the same quota
>> should not have both flags. These two flags are protected by dq_list_lock
>> and dquot->dq_lock respectively, so it makes sense to add a
>> wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
> But the fundamental problem is not only the race with DQ_MOD_B setting. The
> dquot structure can be completely freed by the time
> dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
> why I think making __dquot_transfer() obey dquot_srcu rules is the right
> solution.
>
> 								Honza
Yes, now I also think that making __dquot_transfer() obey dquot_srcu 
rules is
a better solution. But with inode->i_lock protection, why would the dquot
structure be completely freed?

Thanks!
  
Jan Kara June 26, 2023, 1:09 p.m. UTC | #5
Hello!

On Sun 25-06-23 15:56:10, Baokun Li wrote:
> > > I think we can simply focus on the race between the DQ_ACTIVE_B flag and
> > > the DQ_MOD_B flag, which is the core problem, because the same quota
> > > should not have both flags. These two flags are protected by dq_list_lock
> > > and dquot->dq_lock respectively, so it makes sense to add a
> > > wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
> > But the fundamental problem is not only the race with DQ_MOD_B setting. The
> > dquot structure can be completely freed by the time
> > dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
> > why I think making __dquot_transfer() obey dquot_srcu rules is the right
> > solution.
> Yes, now I also think that making __dquot_transfer() obey dquot_srcu
> rules is a better solution. But with inode->i_lock protection, why would
> the dquot structure be completely freed?

Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty() it does
not hold any locks (only dquot_srcu). So nothing prevents dquot_transfer()
to go, swap dquot structure pointers and drop dquot references and after
that mark_all_dquot_dirty() can use a stale pointer to call
mark_dquot_dirty() on already freed memory.

								Honza
  
Baokun Li June 26, 2023, 1:55 p.m. UTC | #6
Hello!

On 2023/6/26 21:09, Jan Kara wrote:
> Hello!
>
> On Sun 25-06-23 15:56:10, Baokun Li wrote:
>>>> I think we can simply focus on the race between the DQ_ACTIVE_B flag and
>>>> the DQ_MOD_B flag, which is the core problem, because the same quota
>>>> should not have both flags. These two flags are protected by dq_list_lock
>>>> and dquot->dq_lock respectively, so it makes sense to add a
>>>> wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
>>> But the fundamental problem is not only the race with DQ_MOD_B setting. The
>>> dquot structure can be completely freed by the time
>>> dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
>>> why I think making __dquot_transfer() obey dquot_srcu rules is the right
>>> solution.
>> Yes, now I also think that making __dquot_transfer() obey dquot_srcu
>> rules is a better solution. But with inode->i_lock protection, why would
>> the dquot structure be completely freed?
> Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty() it does
> not hold any locks (only dquot_srcu). So nothing prevents dquot_transfer()
> to go, swap dquot structure pointers and drop dquot references and after
> that mark_all_dquot_dirty() can use a stale pointer to call
> mark_dquot_dirty() on already freed memory.
>
> 								Honza
No, this doesn't look like it's going to happen.
The mark_all_dquot_dirty() uses a pointer array pointer, the dquot in 
the array is
dynamically changing, so after swap dquot structure pointers, 
mark_all_dquot_dirty()
uses the new pointer, and the stale pointer is always destroyed after 
swap, so there
is no case of using the stale pointer here.
  
Jan Kara June 27, 2023, 8:34 a.m. UTC | #7
Hello!

On Mon 26-06-23 21:55:49, Baokun Li wrote:
> On 2023/6/26 21:09, Jan Kara wrote:
> > On Sun 25-06-23 15:56:10, Baokun Li wrote:
> > > > > I think we can simply focus on the race between the DQ_ACTIVE_B flag and
> > > > > the DQ_MOD_B flag, which is the core problem, because the same quota
> > > > > should not have both flags. These two flags are protected by dq_list_lock
> > > > > and dquot->dq_lock respectively, so it makes sense to add a
> > > > > wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
> > > > But the fundamental problem is not only the race with DQ_MOD_B setting. The
> > > > dquot structure can be completely freed by the time
> > > > dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
> > > > why I think making __dquot_transfer() obey dquot_srcu rules is the right
> > > > solution.
> > > Yes, now I also think that making __dquot_transfer() obey dquot_srcu
> > > rules is a better solution. But with inode->i_lock protection, why would
> > > the dquot structure be completely freed?
> > Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty() it does
> > not hold any locks (only dquot_srcu). So nothing prevents dquot_transfer()
> > to go, swap dquot structure pointers and drop dquot references and after
> > that mark_all_dquot_dirty() can use a stale pointer to call
> > mark_dquot_dirty() on already freed memory.
> > 
> No, this doesn't look like it's going to happen.  The
> mark_all_dquot_dirty() uses a pointer array pointer, the dquot in the
> array is dynamically changing, so after swap dquot structure pointers,
> mark_all_dquot_dirty() uses the new pointer, and the stale pointer is
> always destroyed after swap, so there is no case of using the stale
> pointer here.

There is a case - CPU0 can prefetch the values from dquots[] array into its
local cache, then CPU1 can update the dquots[] array (these writes can
happily stay in CPU1 store cache invisible to other CPUs) and free the
dquots via dqput(). Then CPU0 can pass the prefetched dquot pointers to
mark_dquot_dirty(). There are no locks or memory barries preventing CPUs
from ordering instructions and memory operations like this in the code...
You can read Documentation/memory-barriers.txt about all the perils current
CPU architecture brings wrt coordination of memory accesses among CPUs ;)

								Honza
  
Baokun Li June 27, 2023, 9:08 a.m. UTC | #8
Hello!

On 2023/6/27 16:34, Jan Kara wrote:
> Hello!
>
> On Mon 26-06-23 21:55:49, Baokun Li wrote:
>> On 2023/6/26 21:09, Jan Kara wrote:
>>> On Sun 25-06-23 15:56:10, Baokun Li wrote:
>>>>>> I think we can simply focus on the race between the DQ_ACTIVE_B flag and
>>>>>> the DQ_MOD_B flag, which is the core problem, because the same quota
>>>>>> should not have both flags. These two flags are protected by dq_list_lock
>>>>>> and dquot->dq_lock respectively, so it makes sense to add a
>>>>>> wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
>>>>> But the fundamental problem is not only the race with DQ_MOD_B setting. The
>>>>> dquot structure can be completely freed by the time
>>>>> dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
>>>>> why I think making __dquot_transfer() obey dquot_srcu rules is the right
>>>>> solution.
>>>> Yes, now I also think that making __dquot_transfer() obey dquot_srcu
>>>> rules is a better solution. But with inode->i_lock protection, why would
>>>> the dquot structure be completely freed?
>>> Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty() it does
>>> not hold any locks (only dquot_srcu). So nothing prevents dquot_transfer()
>>> to go, swap dquot structure pointers and drop dquot references and after
>>> that mark_all_dquot_dirty() can use a stale pointer to call
>>> mark_dquot_dirty() on already freed memory.
>>>
>> No, this doesn't look like it's going to happen.  The
>> mark_all_dquot_dirty() uses a pointer array pointer, the dquot in the
>> array is dynamically changing, so after swap dquot structure pointers,
>> mark_all_dquot_dirty() uses the new pointer, and the stale pointer is
>> always destroyed after swap, so there is no case of using the stale
>> pointer here.
> There is a case - CPU0 can prefetch the values from dquots[] array into its
> local cache, then CPU1 can update the dquots[] array (these writes can
> happily stay in CPU1 store cache invisible to other CPUs) and free the
> dquots via dqput(). Then CPU0 can pass the prefetched dquot pointers to
> mark_dquot_dirty(). There are no locks or memory barries preventing CPUs
> from ordering instructions and memory operations like this in the code...
> You can read Documentation/memory-barriers.txt about all the perils current
> CPU architecture brings wrt coordination of memory accesses among CPUs ;)
>
> 								Honza

Got it!

Sorry for misunderstanding you (I thought "completely freed" meant
dquot_destroy(), but you should have meant dquot_release()).
Then this is exactly the scenario in the first function graph in my patch
description. In this scenario the data from the old dquot has been moved
to the new dquot. The old dquot may set DQ_MOD_B bit after it has been
cleared of DQ_ACTIVE_B bit.

We want to do exactly the thing to avoid this situation.
  
Jan Kara June 27, 2023, 9:28 a.m. UTC | #9
On Tue 27-06-23 17:08:27, Baokun Li wrote:
> Hello!
> 
> On 2023/6/27 16:34, Jan Kara wrote:
> > Hello!
> > 
> > On Mon 26-06-23 21:55:49, Baokun Li wrote:
> > > On 2023/6/26 21:09, Jan Kara wrote:
> > > > On Sun 25-06-23 15:56:10, Baokun Li wrote:
> > > > > > > I think we can simply focus on the race between the DQ_ACTIVE_B flag and
> > > > > > > the DQ_MOD_B flag, which is the core problem, because the same quota
> > > > > > > should not have both flags. These two flags are protected by dq_list_lock
> > > > > > > and dquot->dq_lock respectively, so it makes sense to add a
> > > > > > > wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
> > > > > > But the fundamental problem is not only the race with DQ_MOD_B setting. The
> > > > > > dquot structure can be completely freed by the time
> > > > > > dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
> > > > > > why I think making __dquot_transfer() obey dquot_srcu rules is the right
> > > > > > solution.
> > > > > Yes, now I also think that making __dquot_transfer() obey dquot_srcu
> > > > > rules is a better solution. But with inode->i_lock protection, why would
> > > > > the dquot structure be completely freed?
> > > > Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty() it does
> > > > not hold any locks (only dquot_srcu). So nothing prevents dquot_transfer()
> > > > to go, swap dquot structure pointers and drop dquot references and after
> > > > that mark_all_dquot_dirty() can use a stale pointer to call
> > > > mark_dquot_dirty() on already freed memory.
> > > > 
> > > No, this doesn't look like it's going to happen.  The
> > > mark_all_dquot_dirty() uses a pointer array pointer, the dquot in the
> > > array is dynamically changing, so after swap dquot structure pointers,
> > > mark_all_dquot_dirty() uses the new pointer, and the stale pointer is
> > > always destroyed after swap, so there is no case of using the stale
> > > pointer here.
> > There is a case - CPU0 can prefetch the values from dquots[] array into its
> > local cache, then CPU1 can update the dquots[] array (these writes can
> > happily stay in CPU1 store cache invisible to other CPUs) and free the
> > dquots via dqput(). Then CPU0 can pass the prefetched dquot pointers to
> > mark_dquot_dirty(). There are no locks or memory barries preventing CPUs
> > from ordering instructions and memory operations like this in the code...
> > You can read Documentation/memory-barriers.txt about all the perils current
> > CPU architecture brings wrt coordination of memory accesses among CPUs ;)
> > 
> > 								Honza
> 
> Got it!
> 
> Sorry for misunderstanding you (I thought "completely freed" meant
> dquot_destroy(), but you should have meant dquot_release()).

Well, the dquot can even get to dquot_destroy(). There's nothing really
preventing CPU2 going into memory reclaim and free the dquot in
dqcache_shrink_scan() still before CPU0 even calls mark_dquot_dirty() on
it. Sure such timing on real hardware is very unlikely but in a VM where a
virtual CPU can get starved for a significant amount of time this could
happen.

								Honza
  
Baokun Li June 27, 2023, 2:06 p.m. UTC | #10
On 2023/6/27 17:28, Jan Kara wrote:
> On Tue 27-06-23 17:08:27, Baokun Li wrote:
>> Hello!
>>
>> On 2023/6/27 16:34, Jan Kara wrote:
>>> Hello!
>>>
>>> On Mon 26-06-23 21:55:49, Baokun Li wrote:
>>>> On 2023/6/26 21:09, Jan Kara wrote:
>>>>> On Sun 25-06-23 15:56:10, Baokun Li wrote:
>>>>>>>> I think we can simply focus on the race between the DQ_ACTIVE_B flag and
>>>>>>>> the DQ_MOD_B flag, which is the core problem, because the same quota
>>>>>>>> should not have both flags. These two flags are protected by dq_list_lock
>>>>>>>> and dquot->dq_lock respectively, so it makes sense to add a
>>>>>>>> wait_on_dquot() to ensure the accuracy of DQ_ACTIVE_B.
>>>>>>> But the fundamental problem is not only the race with DQ_MOD_B setting. The
>>>>>>> dquot structure can be completely freed by the time
>>>>>>> dquot_claim_space_nodirty() calls dquot_mark_dquot_dirty() on it. That's
>>>>>>> why I think making __dquot_transfer() obey dquot_srcu rules is the right
>>>>>>> solution.
>>>>>> Yes, now I also think that making __dquot_transfer() obey dquot_srcu
>>>>>> rules is a better solution. But with inode->i_lock protection, why would
>>>>>> the dquot structure be completely freed?
>>>>> Well, when dquot_claim_space_nodirty() calls mark_all_dquot_dirty() it does
>>>>> not hold any locks (only dquot_srcu). So nothing prevents dquot_transfer()
>>>>> to go, swap dquot structure pointers and drop dquot references and after
>>>>> that mark_all_dquot_dirty() can use a stale pointer to call
>>>>> mark_dquot_dirty() on already freed memory.
>>>>>
>>>> No, this doesn't look like it's going to happen.  The
>>>> mark_all_dquot_dirty() uses a pointer array pointer, the dquot in the
>>>> array is dynamically changing, so after swap dquot structure pointers,
>>>> mark_all_dquot_dirty() uses the new pointer, and the stale pointer is
>>>> always destroyed after swap, so there is no case of using the stale
>>>> pointer here.
>>> There is a case - CPU0 can prefetch the values from dquots[] array into its
>>> local cache, then CPU1 can update the dquots[] array (these writes can
>>> happily stay in CPU1 store cache invisible to other CPUs) and free the
>>> dquots via dqput(). Then CPU0 can pass the prefetched dquot pointers to
>>> mark_dquot_dirty(). There are no locks or memory barries preventing CPUs
>>> from ordering instructions and memory operations like this in the code...
>>> You can read Documentation/memory-barriers.txt about all the perils current
>>> CPU architecture brings wrt coordination of memory accesses among CPUs ;)
>>>
>>> 								Honza
>> Got it!
>>
>> Sorry for misunderstanding you (I thought "completely freed" meant
>> dquot_destroy(), but you should have meant dquot_release()).
> Well, the dquot can even get to dquot_destroy(). There's nothing really
> preventing CPU2 going into memory reclaim and free the dquot in
> dqcache_shrink_scan() still before CPU0 even calls mark_dquot_dirty() on
> it. Sure such timing on real hardware is very unlikely but in a VM where a
> virtual CPU can get starved for a significant amount of time this could
> happen.
>
> 								Honza
Yes, invalidate_dquots() calling do_destroy_dquot() does not have this 
problem
because it calls synchronize_srcu(&dquot_srcu) in drop_dquot_ref() before.

However, calling do_destroy_dquot() from dqcache_shrink_scan() is not
protected, and calling dqcache_shrink_scan() after P3 execution will trigger
the UAF by calling do_destroy_dquot() twice, as shown in function graph 1
in the patch description; If dqcache_shrink_scan() is called after dquot is
added to free_dquots and before P3 is executed, the UAF may be
triggered in dquot_mark_dquot_dirty().

Thank you for your patient explanation!
The new version of the solution is almost complete, and is doing some stress
testing, which I will send out once it passes.
  

Patch

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index e3e4f4047657..2a04cd74c7c5 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -362,11 +362,26 @@  int dquot_mark_dquot_dirty(struct dquot *dquot)
 		return 1;
 
 	spin_lock(&dq_list_lock);
-	if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) {
+	ret = test_and_set_bit(DQ_MOD_B, &dquot->dq_flags);
+	if (ret)
+		goto out_lock;
+	spin_unlock(&dq_list_lock);
+
+	/*
+	 * Wait for dq_lock - after this we know that either dquot_release() is
+	 * already finished or it will be canceled due to DQ_MOD_B flag test.
+	 */
+	wait_on_dquot(dquot);
+	spin_lock(&dq_list_lock);
+	if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+		clear_bit(DQ_MOD_B, &dquot->dq_flags);
+		goto out_lock;
+	}
+	/* DQ_MOD_B is cleared means that the dquot has been written back */
+	if (test_bit(DQ_MOD_B, &dquot->dq_flags))
 		list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)->
 				info[dquot->dq_id.type].dqi_dirty_list);
-		ret = 0;
-	}
+out_lock:
 	spin_unlock(&dq_list_lock);
 	return ret;
 }
@@ -791,7 +806,7 @@  void dqput(struct dquot *dquot)
 		return;
 	}
 	/* Need to release dquot? */
-	if (dquot_dirty(dquot)) {
+	if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags) && dquot_dirty(dquot)) {
 		spin_unlock(&dq_list_lock);
 		/* Commit dquot before releasing */
 		ret = dquot->dq_sb->dq_op->write_dquot(dquot);