[v2] jbd2: Fix data missing when reusing bh which is ready to be checkpointed

Message ID 20230106115603.2624644-1-chengzhihao1@huawei.com
State New
Headers
Series [v2] jbd2: Fix data missing when reusing bh which is ready to be checkpointed |

Commit Message

Zhihao Cheng Jan. 6, 2023, 11:56 a.m. UTC
  Following process will make data lost and could lead to a filesystem
corrupted problem:

1. jh(bh) is inserted into T1->t_checkpoint_list, bh is dirty, and
   jh->b_transaction = NULL
2. T1 is added into journal->j_checkpoint_transactions.
3. Get bh prepare to write while doing checkpoing:
           PA				    PB
   do_get_write_access             jbd2_log_do_checkpoint
    spin_lock(&jh->b_state_lock)
     if (buffer_dirty(bh))
      clear_buffer_dirty(bh)   // clear buffer dirty
       set_buffer_jbddirty(bh)
				    transaction =
				    journal->j_checkpoint_transactions
				    jh = transaction->t_checkpoint_list
				    if (!buffer_dirty(bh))
		                      __jbd2_journal_remove_checkpoint(jh)
				      // bh won't be flushed
		                    jbd2_cleanup_journal_tail
    __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved)
4. Aborting journal/Power-cut before writing latest bh on journal area.

In this way we get a corrupted filesystem with bh's data lost.

Fix it by moving the clearing of buffer_dirty bit just before the call
to __jbd2_journal_file_buffer(), both bit clearing and jh->b_transaction
assignment are under journal->j_list_lock locked, so that
jbd2_log_do_checkpoint() will wait until jh's new transaction fininshed
even bh is currently not dirty. And journal_shrink_one_cp_list() won't
remove jh from checkpoint list if the buffer head is reused in
do_get_write_access().

Cc: <stable@kernel.org>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 v1->v2: Adopt Jan's suggestion, move the clearing of buffer_dirty bit
	 and __jbd2_journal_file_buffer() inside journal->j_list_lock
	 locking area.
 fs/jbd2/transaction.c | 51 ++++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 17 deletions(-)
  

Comments

Jan Kara Jan. 6, 2023, 2:22 p.m. UTC | #1
On Fri 06-01-23 19:56:03, Zhihao Cheng wrote:
> Following process will make data lost and could lead to a filesystem
> corrupted problem:
> 
> 1. jh(bh) is inserted into T1->t_checkpoint_list, bh is dirty, and
>    jh->b_transaction = NULL
> 2. T1 is added into journal->j_checkpoint_transactions.
> 3. Get bh prepare to write while doing checkpoing:
>            PA				    PB
>    do_get_write_access             jbd2_log_do_checkpoint
>     spin_lock(&jh->b_state_lock)
>      if (buffer_dirty(bh))
>       clear_buffer_dirty(bh)   // clear buffer dirty
>        set_buffer_jbddirty(bh)
> 				    transaction =
> 				    journal->j_checkpoint_transactions
> 				    jh = transaction->t_checkpoint_list
> 				    if (!buffer_dirty(bh))
> 		                      __jbd2_journal_remove_checkpoint(jh)
> 				      // bh won't be flushed
> 		                    jbd2_cleanup_journal_tail
>     __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved)
> 4. Aborting journal/Power-cut before writing latest bh on journal area.
> 
> In this way we get a corrupted filesystem with bh's data lost.
> 
> Fix it by moving the clearing of buffer_dirty bit just before the call
> to __jbd2_journal_file_buffer(), both bit clearing and jh->b_transaction
> assignment are under journal->j_list_lock locked, so that
> jbd2_log_do_checkpoint() will wait until jh's new transaction fininshed
> even bh is currently not dirty. And journal_shrink_one_cp_list() won't
> remove jh from checkpoint list if the buffer head is reused in
> do_get_write_access().
> 
> Cc: <stable@kernel.org>
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com>
> Suggested-by: Jan Kara <jack@suse.cz>

Thanks for the patch! It looks good, some suggestions for making it a bit
more tidy below:

> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 6a404ac1c178..06a5e7961ef2 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1010,36 +1010,37 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  	 * ie. locked but not dirty) or tune2fs (which may actually have
>  	 * the buffer dirtied, ugh.)  */
>  
> -	if (buffer_dirty(bh)) {
> +	if (buffer_dirty(bh) && jh->b_transaction) {
>  		/*
>  		 * First question: is this buffer already part of the current
>  		 * transaction or the existing committing transaction?
>  		 */
> -		if (jh->b_transaction) {
> -			J_ASSERT_JH(jh,
> -				jh->b_transaction == transaction ||
> -				jh->b_transaction ==
> -					journal->j_committing_transaction);
> -			if (jh->b_next_transaction)
> -				J_ASSERT_JH(jh, jh->b_next_transaction ==
> -							transaction);
> -			warn_dirty_buffer(bh);
> -		}
> +		J_ASSERT_JH(jh, jh->b_transaction == transaction ||
> +			jh->b_transaction == journal->j_committing_transaction);
> +		if (jh->b_next_transaction)
> +			J_ASSERT_JH(jh, jh->b_next_transaction == transaction);
> +		warn_dirty_buffer(bh);
>  		/*
> -		 * In any case we need to clean the dirty flag and we must
> -		 * do it under the buffer lock to be sure we don't race
> -		 * with running write-out.
> +		 * We need to clean the dirty flag and we must do it under the
> +		 * buffer lock to be sure we don't race with running write-out.
>  		 */
>  		JBUFFER_TRACE(jh, "Journalling dirty buffer");
>  		clear_buffer_dirty(bh);
> +		/*
> +		 * Setting jbddirty after clearing buffer dirty is necessary.
> +		 * Function jbd2_journal_restart() could keep buffer on
> +		 * BJ_Reserved list until the transaction committing, then the
> +		 * buffer won't be dirtied by jbd2_journal_refile_buffer()
> +		 * after committing, the buffer couldn't fall on disk even
> +		 * last checkpoint finished, which may corrupt filesystem.
> +		 */
>  		set_buffer_jbddirty(bh);
>  	}

So I think the sequence:

	if (buffer_dirty(bh)) {
		warn_dirty_buffer(bh);
		JBUFFER_TRACE(jh, "Journalling dirty buffer");
		clear_buffer_dirty(bh);
		set_buffer_jbddirty(bh);
	}

can be moved into the branch

  	if (jh->b_transaction == transaction ||
	    jh->b_next_transaction == transaction) {

below. That way you can drop the assertions as well because they happen
later in do_get_write_access() again anyway.

Also I don't quite understand the new comment you have added. Do you mean
we need to not only clear BH_Dirty bit but also set BH_JBDdirty as dirtying
(through jbd2_journal_dirty_metadata()) does not have to follow after
do_get_write_access()?

Otherwise the patch looks good.

								Honza
>  
> -	unlock_buffer(bh);
> -
>  	error = -EROFS;
>  	if (is_handle_aborted(handle)) {
>  		spin_unlock(&jh->b_state_lock);
> +		unlock_buffer(bh);
>  		goto out;
>  	}
>  	error = 0;
> @@ -1049,8 +1050,10 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  	 * b_next_transaction points to it
>  	 */
>  	if (jh->b_transaction == transaction ||
> -	    jh->b_next_transaction == transaction)
> +	    jh->b_next_transaction == transaction) {
> +		unlock_buffer(bh);
>  		goto done;
> +	}
>  
>  	/*
>  	 * this is the first time this transaction is touching this buffer,
> @@ -1074,10 +1077,24 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  		 */
>  		smp_wmb();
>  		spin_lock(&journal->j_list_lock);
> +		if (test_clear_buffer_dirty(bh)) {
> +			/*
> +			 * Execute buffer dirty clearing and jh->b_transaction
> +			 * assignment under journal->j_list_lock locked to
> +			 * prevent bh being removed from checkpoint list if
> +			 * the buffer is in an intermediate state (not dirty
> +			 * and jh->b_transaction is NULL).
> +			 */
> +			JBUFFER_TRACE(jh, "Journalling dirty buffer");
> +			set_buffer_jbddirty(bh);
> +		}
>  		__jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
>  		spin_unlock(&journal->j_list_lock);
> +		unlock_buffer(bh);
>  		goto done;
>  	}
> +	unlock_buffer(bh);
> +
>  	/*
>  	 * If there is already a copy-out version of this buffer, then we don't
>  	 * need to make another one
> -- 
> 2.31.1
>
  
Zhihao Cheng Jan. 7, 2023, 9:16 a.m. UTC | #2
在 2023/1/6 22:22, Jan Kara 写道:

Hi Jan, thanks for reviewing.Some discussions below:


>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 6a404ac1c178..06a5e7961ef2 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1010,36 +1010,37 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>>   	 * ie. locked but not dirty) or tune2fs (which may actually have
>>   	 * the buffer dirtied, ugh.)  */
>>   
>> -	if (buffer_dirty(bh)) {
>> +	if (buffer_dirty(bh) && jh->b_transaction) {
>>   		/*
>>   		 * First question: is this buffer already part of the current
>>   		 * transaction or the existing committing transaction?
>>   		 */
>> -		if (jh->b_transaction) {
>> -			J_ASSERT_JH(jh,
>> -				jh->b_transaction == transaction ||
>> -				jh->b_transaction ==
>> -					journal->j_committing_transaction);
>> -			if (jh->b_next_transaction)
>> -				J_ASSERT_JH(jh, jh->b_next_transaction ==
>> -							transaction);
>> -			warn_dirty_buffer(bh);
>> -		}
>> +		J_ASSERT_JH(jh, jh->b_transaction == transaction ||
>> +			jh->b_transaction == journal->j_committing_transaction);
>> +		if (jh->b_next_transaction)
>> +			J_ASSERT_JH(jh, jh->b_next_transaction == transaction);
>> +		warn_dirty_buffer(bh);
>>   		/*
>> -		 * In any case we need to clean the dirty flag and we must
>> -		 * do it under the buffer lock to be sure we don't race
>> -		 * with running write-out.
>> +		 * We need to clean the dirty flag and we must do it under the
>> +		 * buffer lock to be sure we don't race with running write-out.
>>   		 */
>>   		JBUFFER_TRACE(jh, "Journalling dirty buffer");
>>   		clear_buffer_dirty(bh);
>> +		/*
>> +		 * Setting jbddirty after clearing buffer dirty is necessary.
>> +		 * Function jbd2_journal_restart() could keep buffer on
>> +		 * BJ_Reserved list until the transaction committing, then the
>> +		 * buffer won't be dirtied by jbd2_journal_refile_buffer()
>> +		 * after committing, the buffer couldn't fall on disk even
>> +		 * last checkpoint finished, which may corrupt filesystem.
>> +		 */
>>   		set_buffer_jbddirty(bh);
>>   	}
> 
> So I think the sequence:
> 
> 	if (buffer_dirty(bh)) {
> 		warn_dirty_buffer(bh);
> 		JBUFFER_TRACE(jh, "Journalling dirty buffer");
> 		clear_buffer_dirty(bh);
> 		set_buffer_jbddirty(bh);
> 	}
> 
> can be moved into the branch
> 
>    	if (jh->b_transaction == transaction ||
> 	    jh->b_next_transaction == transaction) {
> 
> below. That way you can drop the assertions as well because they happen
> later in do_get_write_access() again anyway.

1. If we move the squence:
  	if (buffer_dirty(bh)) {
  		warn_dirty_buffer(bh);
  		JBUFFER_TRACE(jh, "Journalling dirty buffer");
  		clear_buffer_dirty(bh);
  		set_buffer_jbddirty(bh);
  	}

into the branch

         if (jh->b_transaction == transaction ||
  	    jh->b_next_transaction == transaction) {

, then we have a new situation(jh->b_transaction == 
journal->j_committing_transaction) to clear buffer dirty, so we need to 
add an else-branch like(based on v2 patch):
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1092,6 +1092,10 @@ do_get_write_access(handle_t *handle, struct 
journal_head *jh,
                 spin_unlock(&journal->j_list_lock);
                 unlock_buffer(bh);
                 goto done;
+       } else if (test_clear_buffer_dirty(bh)) {
+               warn_dirty_buffer(bh);
+               JBUFFER_TRACE(jh, "Journalling dirty buffer");
+               set_buffer_jbddirty(bh);
         }
         unlock_buffer(bh);

I think we'd better not to move the sequence?

2. I agree that the assertions in branch 'if (jh->b_transaction)' are 
redundant, I will remove them in v3. Thanks for pointing that.

> Also I don't quite understand the new comment you have added. Do you mean
> we need to not only clear BH_Dirty bit but also set BH_JBDdirty as dirtying
> (through jbd2_journal_dirty_metadata()) does not have to follow after
> do_get_write_access()?
> 

Yes.
I think one reason we have non-empty commit_transaction->t_reserved_list 
is that: jbd2_journal_restart() could let jh attach to transaction_1 and 
dirty jh in transaction_2.

buffer is dirty after trans_0 committed
do_get_write_access() =>  jh->trans = old_handle->trans_1, clear buffer 
dirty & set jbddirty, BJ_Reserved
jbd2_journal_restart()  => stop old_handle && may jbd2_log_start_commit 
&& start new_handle with trans_2
jbd2_journal_commit_transaction() => clear jbddirty & set buffer dirty & 
set jh->b_transaction = NULL
do_checkpoint  => buffer is fell on disk. If do_get_write_access() not 
mark jbddirty, buffer won't be fell on disk after checkpoint, which 
could corrupt filesystem.

I'm not sure whether we have the above path in realworld. I guess it 
exists in theory according to the comments:
        /* 

         * First thing we are allowed to do is to discard any remaining 

         * BJ_Reserved buffers.  Note, it is _not_ permissible to assume 

         * that there are no such buffers: if a large filesystem 

         * operation like a truncate needs to split itself over multiple 

         * transactions, then it may try to do a jbd2_journal_restart() 
while
          * there are still BJ_Reserved buffers outstanding.  These must 

          * be released cleanly from the current transaction. 

          * 

          * In this case, the filesystem must still reserve write access 

          * again before modifying the buffer in the new transaction, 
but
          * we do not require it to remember exactly which old buffers 
it
          * has reserved.  This is consistent with the existing 
behaviour
          * that multiple jbd2_journal_get_write_access() calls to the 
same
          * buffer are perfectly permissible. 

          * We use journal->j_state_lock here to serialize processing of 

          * t_reserved_list with eviction of buffers from 
journal_unmap_buffer().
          */ 

         while (commit_transaction->t_reserved_list) {  [...]
  
Jan Kara Jan. 9, 2023, 11:20 a.m. UTC | #3
On Sat 07-01-23 17:16:10, Zhihao Cheng wrote:
> 在 2023/1/6 22:22, Jan Kara 写道:
> 
> Hi Jan, thanks for reviewing.Some discussions below:
> 
> 
> > > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > > index 6a404ac1c178..06a5e7961ef2 100644
> > > --- a/fs/jbd2/transaction.c
> > > +++ b/fs/jbd2/transaction.c
> > > @@ -1010,36 +1010,37 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> > >   	 * ie. locked but not dirty) or tune2fs (which may actually have
> > >   	 * the buffer dirtied, ugh.)  */
> > > -	if (buffer_dirty(bh)) {
> > > +	if (buffer_dirty(bh) && jh->b_transaction) {
> > >   		/*
> > >   		 * First question: is this buffer already part of the current
> > >   		 * transaction or the existing committing transaction?
> > >   		 */
> > > -		if (jh->b_transaction) {
> > > -			J_ASSERT_JH(jh,
> > > -				jh->b_transaction == transaction ||
> > > -				jh->b_transaction ==
> > > -					journal->j_committing_transaction);
> > > -			if (jh->b_next_transaction)
> > > -				J_ASSERT_JH(jh, jh->b_next_transaction ==
> > > -							transaction);
> > > -			warn_dirty_buffer(bh);
> > > -		}
> > > +		J_ASSERT_JH(jh, jh->b_transaction == transaction ||
> > > +			jh->b_transaction == journal->j_committing_transaction);
> > > +		if (jh->b_next_transaction)
> > > +			J_ASSERT_JH(jh, jh->b_next_transaction == transaction);
> > > +		warn_dirty_buffer(bh);
> > >   		/*
> > > -		 * In any case we need to clean the dirty flag and we must
> > > -		 * do it under the buffer lock to be sure we don't race
> > > -		 * with running write-out.
> > > +		 * We need to clean the dirty flag and we must do it under the
> > > +		 * buffer lock to be sure we don't race with running write-out.
> > >   		 */
> > >   		JBUFFER_TRACE(jh, "Journalling dirty buffer");
> > >   		clear_buffer_dirty(bh);
> > > +		/*
> > > +		 * Setting jbddirty after clearing buffer dirty is necessary.
> > > +		 * Function jbd2_journal_restart() could keep buffer on
> > > +		 * BJ_Reserved list until the transaction committing, then the
> > > +		 * buffer won't be dirtied by jbd2_journal_refile_buffer()
> > > +		 * after committing, the buffer couldn't fall on disk even
> > > +		 * last checkpoint finished, which may corrupt filesystem.
> > > +		 */
> > >   		set_buffer_jbddirty(bh);
> > >   	}
> > 
> > So I think the sequence:
> > 
> > 	if (buffer_dirty(bh)) {
> > 		warn_dirty_buffer(bh);
> > 		JBUFFER_TRACE(jh, "Journalling dirty buffer");
> > 		clear_buffer_dirty(bh);
> > 		set_buffer_jbddirty(bh);
> > 	}
> > 
> > can be moved into the branch
> > 
> >    	if (jh->b_transaction == transaction ||
> > 	    jh->b_next_transaction == transaction) {
> > 
> > below. That way you can drop the assertions as well because they happen
> > later in do_get_write_access() again anyway.
> 
> 1. If we move the squence:
>  	if (buffer_dirty(bh)) {
>  		warn_dirty_buffer(bh);
>  		JBUFFER_TRACE(jh, "Journalling dirty buffer");
>  		clear_buffer_dirty(bh);
>  		set_buffer_jbddirty(bh);
>  	}
> 
> into the branch
> 
>         if (jh->b_transaction == transaction ||
>  	    jh->b_next_transaction == transaction) {
> 
> , then we have a new situation(jh->b_transaction ==
> journal->j_committing_transaction) to clear buffer dirty, so we need to add
> an else-branch like(based on v2 patch):
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1092,6 +1092,10 @@ do_get_write_access(handle_t *handle, struct
> journal_head *jh,
>                 spin_unlock(&journal->j_list_lock);
>                 unlock_buffer(bh);
>                 goto done;
> +       } else if (test_clear_buffer_dirty(bh)) {
> +               warn_dirty_buffer(bh);
> +               JBUFFER_TRACE(jh, "Journalling dirty buffer");
> +               set_buffer_jbddirty(bh);
>         }
>         unlock_buffer(bh);
> 
> I think we'd better not to move the sequence?

Oh, you're right. So yeah, keep this sequence where it was.

> 2. I agree that the assertions in branch 'if (jh->b_transaction)' are
> redundant, I will remove them in v3. Thanks for pointing that.

OK, thanks!

								Honza
  

Patch

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6a404ac1c178..06a5e7961ef2 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1010,36 +1010,37 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
 	 * ie. locked but not dirty) or tune2fs (which may actually have
 	 * the buffer dirtied, ugh.)  */
 
-	if (buffer_dirty(bh)) {
+	if (buffer_dirty(bh) && jh->b_transaction) {
 		/*
 		 * First question: is this buffer already part of the current
 		 * transaction or the existing committing transaction?
 		 */
-		if (jh->b_transaction) {
-			J_ASSERT_JH(jh,
-				jh->b_transaction == transaction ||
-				jh->b_transaction ==
-					journal->j_committing_transaction);
-			if (jh->b_next_transaction)
-				J_ASSERT_JH(jh, jh->b_next_transaction ==
-							transaction);
-			warn_dirty_buffer(bh);
-		}
+		J_ASSERT_JH(jh, jh->b_transaction == transaction ||
+			jh->b_transaction == journal->j_committing_transaction);
+		if (jh->b_next_transaction)
+			J_ASSERT_JH(jh, jh->b_next_transaction == transaction);
+		warn_dirty_buffer(bh);
 		/*
-		 * In any case we need to clean the dirty flag and we must
-		 * do it under the buffer lock to be sure we don't race
-		 * with running write-out.
+		 * We need to clean the dirty flag and we must do it under the
+		 * buffer lock to be sure we don't race with running write-out.
 		 */
 		JBUFFER_TRACE(jh, "Journalling dirty buffer");
 		clear_buffer_dirty(bh);
+		/*
+		 * Setting jbddirty after clearing buffer dirty is necessary.
+		 * Function jbd2_journal_restart() could keep buffer on
+		 * BJ_Reserved list until the transaction committing, then the
+		 * buffer won't be dirtied by jbd2_journal_refile_buffer()
+		 * after committing, the buffer couldn't fall on disk even
+		 * last checkpoint finished, which may corrupt filesystem.
+		 */
 		set_buffer_jbddirty(bh);
 	}
 
-	unlock_buffer(bh);
-
 	error = -EROFS;
 	if (is_handle_aborted(handle)) {
 		spin_unlock(&jh->b_state_lock);
+		unlock_buffer(bh);
 		goto out;
 	}
 	error = 0;
@@ -1049,8 +1050,10 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
 	 * b_next_transaction points to it
 	 */
 	if (jh->b_transaction == transaction ||
-	    jh->b_next_transaction == transaction)
+	    jh->b_next_transaction == transaction) {
+		unlock_buffer(bh);
 		goto done;
+	}
 
 	/*
 	 * this is the first time this transaction is touching this buffer,
@@ -1074,10 +1077,24 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
 		 */
 		smp_wmb();
 		spin_lock(&journal->j_list_lock);
+		if (test_clear_buffer_dirty(bh)) {
+			/*
+			 * Execute buffer dirty clearing and jh->b_transaction
+			 * assignment under journal->j_list_lock locked to
+			 * prevent bh being removed from checkpoint list if
+			 * the buffer is in an intermediate state (not dirty
+			 * and jh->b_transaction is NULL).
+			 */
+			JBUFFER_TRACE(jh, "Journalling dirty buffer");
+			set_buffer_jbddirty(bh);
+		}
 		__jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
 		spin_unlock(&journal->j_list_lock);
+		unlock_buffer(bh);
 		goto done;
 	}
+	unlock_buffer(bh);
+
 	/*
 	 * If there is already a copy-out version of this buffer, then we don't
 	 * need to make another one