[v2,1/2] fscache: Use wait_on_bit() to wait for the freeing of relinquished volume

Message ID 20221226103309.953112-2-houtao@huaweicloud.com
State New
Series Fixes for fscache volume operations

Commit Message

Hou Tao Dec. 26, 2022, 10:33 a.m. UTC
  From: Hou Tao <houtao1@huawei.com>

Freeing a relinquished volume wakes up the pending volume acquisition
with wake_up_bit(). However, this is mismatched with the wait_var_event()
used in fscache_wait_on_volume_collision(), so the waiter in the
wait-queue is never woken up, because the two functions operate on
different wait-queues.

According to the implementation of fscache_wait_on_volume_collision(),
if the wake-up of the pending acquisition is delayed by more than 20
seconds (e.g., because on-demand fd closing is delayed), the first
wait_var_event_timeout() will time out and the following wait_var_event()
will hang forever, as shown below:

 FS-Cache: Potential volume collision new=00000024 old=00000022
 ......
 INFO: task mount:1148 blocked for more than 122 seconds.
       Not tainted 6.1.0-rc6+ #1
 task:mount           state:D stack:0     pid:1148  ppid:1
 Call Trace:
  <TASK>
  __schedule+0x2f6/0xb80
  schedule+0x67/0xe0
  fscache_wait_on_volume_collision.cold+0x80/0x82
  __fscache_acquire_volume+0x40d/0x4e0
  erofs_fscache_register_volume+0x51/0xe0 [erofs]
  erofs_fscache_register_fs+0x19c/0x240 [erofs]
  erofs_fc_fill_super+0x746/0xaf0 [erofs]
  vfs_get_super+0x7d/0x100
  get_tree_nodev+0x16/0x20
  erofs_fc_get_tree+0x20/0x30 [erofs]
  vfs_get_tree+0x24/0xb0
  path_mount+0x2fa/0xa90
  do_mount+0x7c/0xa0
  __x64_sys_mount+0x8b/0xe0
  do_syscall_64+0x30/0x60
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Since wake_up_bit() is more selective, fix this by using wait_on_bit()
instead of wait_var_event() to wait for the freeing of the relinquished
volume. In addition, because wake_up_bit() uses waitqueue_active() and
clear_bit() doesn't imply any memory barrier, add smp_mb__after_atomic()
before wake_up_bit().

Fixes: 62ab63352350 ("fscache: Implement volume registration")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 fs/fscache/volume.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
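
A note on the mismatch described in the commit message. The following is
a simplified userspace model, not kernel code. It assumes that a waiter is
identified by a key made of the waited-on address plus a bit number, and
that a wait_var_event() waiter carries a sentinel bit number of -1. The
names wait_key, bit_key, var_key, wakeup_matches and demo_key_mismatch are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a wait key (assumption, not the kernel's struct). */
struct wait_key {
	const void *flags;	/* address of the word being waited on */
	int bit_nr;		/* bit number, or -1 for a "whole variable" waiter */
};

/* Key used by a waiter or waker that targets one bit of a word. */
static struct wait_key bit_key(const unsigned long *word, int bit)
{
	struct wait_key key = { word, bit };
	return key;
}

/* Key used by a waiter that waits on a variable as a whole. */
static struct wait_key var_key(const void *var)
{
	struct wait_key key = { var, -1 };
	return key;
}

/* In this model a wakeup reaches a waiter only if both fields match. */
static bool wakeup_matches(struct wait_key waiter, struct wait_key wake)
{
	return waiter.flags == wake.flags && waiter.bit_nr == wake.bit_nr;
}

/* Usage: same word, but a bit wakeup does not reach a variable waiter. */
static void demo_key_mismatch(void)
{
	unsigned long flags = 0;

	assert(!wakeup_matches(var_key(&flags), bit_key(&flags, 3)));
	assert(wakeup_matches(bit_key(&flags, 3), bit_key(&flags, 3)));
}
```

Under this model, a wake-up keyed on a word and a bit only reaches a
waiter registered on the same word and bit, which is the pairing the patch
restores by switching the waiter to wait_on_bit().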
  

Comments

David Howells Jan. 11, 2023, 4:06 p.m. UTC | #1
Hou Tao <houtao@huaweicloud.com> wrote:

>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
> +			/*
> +			 * Paired with barrier in wait_on_bit(). Check
> +			 * wake_up_bit() and waitqueue_active() for details.
> +			 */
> +			smp_mb__after_atomic();
>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);

What two values are you applying a partial ordering to?

David
  
Hou Tao Jan. 12, 2023, 1:05 a.m. UTC | #2
Hi,

On 1/12/2023 12:06 AM, David Howells wrote:
> Hou Tao <houtao@huaweicloud.com> wrote:
>
>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>> +			/*
>> +			 * Paired with barrier in wait_on_bit(). Check
>> +			 * wake_up_bit() and waitqueue_active() for details.
>> +			 */
>> +			smp_mb__after_atomic();
>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
> What two values are you applying a partial ordering to?
cursor->flags and wq->head. fscache_wake_pending_volume() will write
cursor->flags and then read wq->head through waitqueue_active(), and the
waiter will write wq->head and then read cursor->flags.
>
> David
>
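
To make the pairing above concrete, here is a userspace analogue written
with C11 atomics rather than kernel primitives. It is only a sketch under
stated assumptions: the variables flags and queued stand in for
cursor->flags and a non-empty wq->head, atomic_thread_fence() with
memory_order_seq_cst stands in for the full barriers on each side, and
all function names are made up for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ACQUIRE_PENDING 0x1u	/* hypothetical stand-in for the pending bit */

static atomic_uint flags = ACQUIRE_PENDING;	/* plays cursor->flags */
static atomic_bool queued = false;		/* plays "wq->head is non-empty" */

/*
 * Waker side: write flags, full fence, then read the queue state.
 * Returns true if a waiter was seen, i.e. a wakeup would be issued.
 */
static bool waker_clear_and_check(void)
{
	atomic_fetch_and_explicit(&flags, ~ACQUIRE_PENDING,
				  memory_order_relaxed);
	/* Userspace analogue of smp_mb__after_atomic(). */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&queued, memory_order_relaxed);
}

/*
 * Waiter side: write the queue state, full fence, then read flags.
 * Returns true if the bit is still set, i.e. the waiter must sleep.
 */
static bool waiter_enqueue_and_check(void)
{
	atomic_store_explicit(&queued, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&flags, memory_order_relaxed) &
		ACQUIRE_PENDING) != 0;
}

/* Usage, run sequentially: the waiter enqueues first, so the waker sees it. */
static void demo_sequential(void)
{
	assert(waiter_enqueue_and_check());	/* bit still set: must sleep */
	assert(waker_clear_and_check());	/* waiter visible: wake it */
}
```

With a full fence between the write and the read on both sides, at least
one side observes the other's write: either the waker sees the queued
waiter, or the waiter sees the cleared bit.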
  
Jingbo Xu Jan. 12, 2023, 3:47 a.m. UTC | #3
On 12/26/22 6:33 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> The freeing of relinquished volume will wake up the pending volume
> acquisition by using wake_up_bit(), however it is mismatched with
> wait_var_event() used in fscache_wait_on_volume_collision() and it will
> never wake up the waiter in the wait-queue because these two functions
> operate on different wait-queues.
> 
> According to the implementation in fscache_wait_on_volume_collision(),
> if the wake-up of pending acquisition is delayed longer than 20 seconds
> (e.g., due to the delay of on-demand fd closing), the first
> wait_var_event_timeout() will timeout and the following wait_var_event()
> will hang forever as shown below:
> 
>  FS-Cache: Potential volume collision new=00000024 old=00000022
>  ......
>  INFO: task mount:1148 blocked for more than 122 seconds.
>        Not tainted 6.1.0-rc6+ #1
>  task:mount           state:D stack:0     pid:1148  ppid:1
>  Call Trace:
>   <TASK>
>   __schedule+0x2f6/0xb80
>   schedule+0x67/0xe0
>   fscache_wait_on_volume_collision.cold+0x80/0x82
>   __fscache_acquire_volume+0x40d/0x4e0
>   erofs_fscache_register_volume+0x51/0xe0 [erofs]
>   erofs_fscache_register_fs+0x19c/0x240 [erofs]
>   erofs_fc_fill_super+0x746/0xaf0 [erofs]
>   vfs_get_super+0x7d/0x100
>   get_tree_nodev+0x16/0x20
>   erofs_fc_get_tree+0x20/0x30 [erofs]
>   vfs_get_tree+0x24/0xb0
>   path_mount+0x2fa/0xa90
>   do_mount+0x7c/0xa0
>   __x64_sys_mount+0x8b/0xe0
>   do_syscall_64+0x30/0x60
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> Considering that wake_up_bit() is more selective, so fixing it by using
							^
						       fix
> wait_on_bit() instead of wait_var_event() to wait for the freeing of
> relinquished volume. In addition because waitqueue_active() is used in
> wake_up_bit() and clear_bit() doesn't imply any memory barrier, so also
> adding smp_mb__after_atomic() before wake_up_bit().

... doesn't imply any memory barrier, add ...

> 
> Fixes: 62ab63352350 ("fscache: Implement volume registration")
> Signed-off-by: Hou Tao <houtao1@huawei.com>


Otherwise LGTM :)

Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>

> ---
>  fs/fscache/volume.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fscache/volume.c b/fs/fscache/volume.c
> index ab8ceddf9efa..fc3dd3bc851d 100644
> --- a/fs/fscache/volume.c
> +++ b/fs/fscache/volume.c
> @@ -141,13 +141,14 @@ static bool fscache_is_acquire_pending(struct fscache_volume *volume)
>  static void fscache_wait_on_volume_collision(struct fscache_volume *candidate,
>  					     unsigned int collidee_debug_id)
>  {
> -	wait_var_event_timeout(&candidate->flags,
> -			       !fscache_is_acquire_pending(candidate), 20 * HZ);
> +	wait_on_bit_timeout(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
> +			    TASK_UNINTERRUPTIBLE, 20 * HZ);
>  	if (fscache_is_acquire_pending(candidate)) {
>  		pr_notice("Potential volume collision new=%08x old=%08x",
>  			  candidate->debug_id, collidee_debug_id);
>  		fscache_stat(&fscache_n_volumes_collision);
> -		wait_var_event(&candidate->flags, !fscache_is_acquire_pending(candidate));
> +		wait_on_bit(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
> +			    TASK_UNINTERRUPTIBLE);
>  	}
>  }
>  
> @@ -348,6 +349,11 @@ static void fscache_wake_pending_volume(struct fscache_volume *volume,
>  		if (fscache_volume_same(cursor, volume)) {
>  			fscache_see_volume(cursor, fscache_volume_see_hash_wake);
>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
> +			/*
> +			 * Paired with barrier in wait_on_bit(). Check
> +			 * wake_up_bit() and waitqueue_active() for details.
> +			 */
> +			smp_mb__after_atomic();
>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>  			return;
>  		}
  
Jingbo Xu Jan. 12, 2023, 3:58 a.m. UTC | #4
On 1/12/23 12:06 AM, David Howells wrote:
> Hou Tao <houtao@huaweicloud.com> wrote:
> 
>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>> +			/*
>> +			 * Paired with barrier in wait_on_bit(). Check
>> +			 * wake_up_bit() and waitqueue_active() for details.
>> +			 */
>> +			smp_mb__after_atomic();
>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
> 
> What two values are you applying a partial ordering to?

Yeah, Hou Tao has explained that a full barrier is needed here to avoid
the potential reordering on the waker side.

As I have also been looking into this recently, I'd like to share my
thoughts, in the hope that they give some insight :)

Without the barrier on the waker side, the code may suffer from the
following race:

```
CPU0 - waker                    CPU1 - waiter

if (waitqueue_active(wq_head)) <-- find no wq_entry in wq_head list
    wake_up(wq_head);

                                for (;;) {
                                    prepare_to_wait(...);
                                        # add wq_entry into wq_head list

                                    if (@cond)  <-- @cond is false
                                        break;
                                    schedule(); <-- wq_entry still in
                                                    wq_head list,
                                                    wait for next wakeup
                                }
                                finish_wait(&wq_head, &wait);

@cond = true;
```

in which case the waiter misses the wakeup.
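
The race can also be checked mechanically. The sketch below is a small
single-threaded model, not kernel code: it enumerates every interleaving
of two waker steps (store @cond, check the queue) and two waiter steps
(enqueue, test @cond), and counts the interleavings in which the waiter
goes to sleep while the waker skips the wakeup. The missing barrier is
modelled, as an assumption, by letting the waker's queue check run before
its store. The names count_lost_wakeups and demo_lost_wakeups are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count interleavings in which the waiter sleeps (it saw @cond false)
 * and the waker skips the wakeup (it saw an empty queue).
 */
static int count_lost_wakeups(bool waker_reordered)
{
	int lost = 0;

	for (unsigned int mask = 0; mask < 16; mask++) {
		/* Bits set in mask are the time slots used by the waker. */
		int waker_slots = 0;

		for (int s = 0; s < 4; s++)
			waker_slots += (mask >> s) & 1;
		if (waker_slots != 2)
			continue;

		bool cond = false, queued = false;
		bool waker_saw_waiter = false, waiter_saw_cond = false;
		int wi = 0, ti = 0;

		for (int s = 0; s < 4; s++) {
			if ((mask >> s) & 1) {
				/* Waker: store first unless reordered. */
				bool do_store = waker_reordered ? wi == 1
								: wi == 0;
				if (do_store)
					cond = true;
				else
					waker_saw_waiter = queued;
				wi++;
			} else {
				/* Waiter: enqueue, then test the condition. */
				if (ti == 0)
					queued = true;
				else
					waiter_saw_cond = cond;
				ti++;
			}
		}
		if (!waker_saw_waiter && !waiter_saw_cond)
			lost++;
	}
	return lost;
}

/* Usage: only the reordered waker can lose a wakeup in this model. */
static void demo_lost_wakeups(void)
{
	assert(count_lost_wakeups(false) == 0);
	assert(count_lost_wakeups(true) == 1);
}
```

In this model the only losing interleaving is the one drawn in the
diagram: queue check, enqueue, condition test, then the store.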
  
Hou Tao Jan. 12, 2023, 6:12 a.m. UTC | #5
Hi,

On 1/12/2023 11:58 AM, Jingbo Xu wrote:
>
> On 1/12/23 12:06 AM, David Howells wrote:
>> Hou Tao <houtao@huaweicloud.com> wrote:
>>
>>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>>> +			/*
>>> +			 * Paired with barrier in wait_on_bit(). Check
>>> +			 * wake_up_bit() and waitqueue_active() for details.
>>> +			 */
>>> +			smp_mb__after_atomic();
>>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>> What two values are you applying a partial ordering to?
> Yeah Hou Tao has explained that a full barrier is needed here to avoid
> the potential reordering at the waker side.
>
> As I was also researching on this these days, I'd like to share my
> thought on this, hopefully if it could give some insight :)
>
> Without the barrier at the waker side, it may suffer from the following
> race:
>
> ```
> CPU0 - waker                    CPU1 - waiter
>
> if (waitqueue_active(wq_head)) <-- find no wq_entry in wq_head list
>     wake_up(wq_head);
>
>                                 for (;;) {
>                                    prepare_to_wait(...);
>                                         # add wq_entry into wq_head list
>
>                                     if (@cond)  <-- @cond is false
>                                         break;
>                                     schedule(); <-- wq_entry still in
>                                                     wq_head list,
>                                                     wait for next wakeup
>                                  }
>                                  finish_wait(&wq_head, &wait);
>
> @cond = true;
> ```
>
> in which case the waiter misses the wakeup for one time.
Thanks for the detailed annotation. It is exactly what I tried to say but failed to.
>
  
Hou Tao Jan. 12, 2023, 6:14 a.m. UTC | #6
Hi,

On 1/12/2023 11:47 AM, Jingbo Xu wrote:
>
> On 12/26/22 6:33 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> The freeing of relinquished volume will wake up the pending volume
>> acquisition by using wake_up_bit(), however it is mismatched with
>> wait_var_event() used in fscache_wait_on_volume_collision() and it will
>> never wake up the waiter in the wait-queue because these two functions
>> operate on different wait-queues.
>>
>> According to the implementation in fscache_wait_on_volume_collision(),
>> if the wake-up of pending acquisition is delayed longer than 20 seconds
>> (e.g., due to the delay of on-demand fd closing), the first
>> wait_var_event_timeout() will timeout and the following wait_var_event()
>> will hang forever as shown below:
>>
>>  FS-Cache: Potential volume collision new=00000024 old=00000022
>>  ......
>>  INFO: task mount:1148 blocked for more than 122 seconds.
>>        Not tainted 6.1.0-rc6+ #1
>>  task:mount           state:D stack:0     pid:1148  ppid:1
>>  Call Trace:
>>   <TASK>
>>   __schedule+0x2f6/0xb80
>>   schedule+0x67/0xe0
>>   fscache_wait_on_volume_collision.cold+0x80/0x82
>>   __fscache_acquire_volume+0x40d/0x4e0
>>   erofs_fscache_register_volume+0x51/0xe0 [erofs]
>>   erofs_fscache_register_fs+0x19c/0x240 [erofs]
>>   erofs_fc_fill_super+0x746/0xaf0 [erofs]
>>   vfs_get_super+0x7d/0x100
>>   get_tree_nodev+0x16/0x20
>>   erofs_fc_get_tree+0x20/0x30 [erofs]
>>   vfs_get_tree+0x24/0xb0
>>   path_mount+0x2fa/0xa90
>>   do_mount+0x7c/0xa0
>>   __x64_sys_mount+0x8b/0xe0
>>   do_syscall_64+0x30/0x60
>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>> Considering that wake_up_bit() is more selective, so fixing it by using
> 							^
> 						       fix
>> wait_on_bit() instead of wait_var_event() to wait for the freeing of
>> relinquished volume. In addition because waitqueue_active() is used in
>> wake_up_bit() and clear_bit() doesn't imply any memory barrier, so also
>> adding smp_mb__after_atomic() before wake_up_bit().
> ... doesn't imply any memory barrier, add ...
Thanks for the suggestions above. Will update in v3.
>
>> Fixes: 62ab63352350 ("fscache: Implement volume registration")
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> Otherwise LGTM :)
>
> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Thanks for the review.
>
>> ---
>>  fs/fscache/volume.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/fscache/volume.c b/fs/fscache/volume.c
>> index ab8ceddf9efa..fc3dd3bc851d 100644
>> --- a/fs/fscache/volume.c
>> +++ b/fs/fscache/volume.c
>> @@ -141,13 +141,14 @@ static bool fscache_is_acquire_pending(struct fscache_volume *volume)
>>  static void fscache_wait_on_volume_collision(struct fscache_volume *candidate,
>>  					     unsigned int collidee_debug_id)
>>  {
>> -	wait_var_event_timeout(&candidate->flags,
>> -			       !fscache_is_acquire_pending(candidate), 20 * HZ);
>> +	wait_on_bit_timeout(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
>> +			    TASK_UNINTERRUPTIBLE, 20 * HZ);
>>  	if (fscache_is_acquire_pending(candidate)) {
>>  		pr_notice("Potential volume collision new=%08x old=%08x",
>>  			  candidate->debug_id, collidee_debug_id);
>>  		fscache_stat(&fscache_n_volumes_collision);
>> -		wait_var_event(&candidate->flags, !fscache_is_acquire_pending(candidate));
>> +		wait_on_bit(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
>> +			    TASK_UNINTERRUPTIBLE);
>>  	}
>>  }
>>  
>> @@ -348,6 +349,11 @@ static void fscache_wake_pending_volume(struct fscache_volume *volume,
>>  		if (fscache_volume_same(cursor, volume)) {
>>  			fscache_see_volume(cursor, fscache_volume_see_hash_wake);
>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>> +			/*
>> +			 * Paired with barrier in wait_on_bit(). Check
>> +			 * wake_up_bit() and waitqueue_active() for details.
>> +			 */
>> +			smp_mb__after_atomic();
>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>>  			return;
>>  		}
  

Patch

diff --git a/fs/fscache/volume.c b/fs/fscache/volume.c
index ab8ceddf9efa..fc3dd3bc851d 100644
--- a/fs/fscache/volume.c
+++ b/fs/fscache/volume.c
@@ -141,13 +141,14 @@  static bool fscache_is_acquire_pending(struct fscache_volume *volume)
 static void fscache_wait_on_volume_collision(struct fscache_volume *candidate,
 					     unsigned int collidee_debug_id)
 {
-	wait_var_event_timeout(&candidate->flags,
-			       !fscache_is_acquire_pending(candidate), 20 * HZ);
+	wait_on_bit_timeout(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
+			    TASK_UNINTERRUPTIBLE, 20 * HZ);
 	if (fscache_is_acquire_pending(candidate)) {
 		pr_notice("Potential volume collision new=%08x old=%08x",
 			  candidate->debug_id, collidee_debug_id);
 		fscache_stat(&fscache_n_volumes_collision);
-		wait_var_event(&candidate->flags, !fscache_is_acquire_pending(candidate));
+		wait_on_bit(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
+			    TASK_UNINTERRUPTIBLE);
 	}
 }
 
@@ -348,6 +349,11 @@  static void fscache_wake_pending_volume(struct fscache_volume *volume,
 		if (fscache_volume_same(cursor, volume)) {
 			fscache_see_volume(cursor, fscache_volume_see_hash_wake);
 			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
+			/*
+			 * Paired with barrier in wait_on_bit(). Check
+			 * wake_up_bit() and waitqueue_active() for details.
+			 */
+			smp_mb__after_atomic();
 			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
 			return;
 		}