Message ID | 20230317013553.1009553-1-chengzhihao1@huawei.com |
---|---|
State | New |
Headers | From: Zhihao Cheng <chengzhihao1@huawei.com> To: <tytso@mit.edu>, <adilger.kernel@dilger.ca>, <jack@suse.com>, <tudor.ambarus@linaro.org> CC: <linux-ext4@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <chengzhihao1@huawei.com>, <yi.zhang@huawei.com> Subject: [PATCH] ext4: Fix i_disksize exceeding i_size problem in paritally written case Date: Fri, 17 Mar 2023 09:35:53 +0800 Message-ID: <20230317013553.1009553-1-chengzhihao1@huawei.com> X-Mailer: git-send-email 2.31.1 List-ID: <linux-kernel.vger.kernel.org> |
Series |
ext4: Fix i_disksize exceeding i_size problem in paritally written case
|
|
Commit Message
Zhihao Cheng
March 17, 2023, 1:35 a.m. UTC
The following process makes i_disksize exceed i_size:

generic_perform_write
 copied = iov_iter_copy_from_user_atomic(len) // copied < len
 ext4_da_write_end
 | ext4_update_i_disksize
 |  new_i_size = pos + copied;
 |  WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize) // update i_disksize
 | generic_write_end
 |  copied = block_write_end(copied, len) // copied = 0
 |   if (unlikely(copied < len))
 |    if (!PageUptodate(page))
 |     copied = 0;
 |  if (pos + copied > inode->i_size) // return false
 if (unlikely(copied == 0))
  goto again;
 if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
  status = -EFAULT;
  break;
 }

We get i_disksize greater than i_size here, which could trigger the WARNING
check 'i_size_read(inode) < EXT4_I(inode)->i_disksize' while doing dio:

ext4_dio_write_iter
 iomap_dio_rw
  __iomap_dio_rw // return err, length is not aligned to 512
 ext4_handle_inode_extension
  WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize) // Oops

 WARNING: CPU: 2 PID: 2609 at fs/ext4/file.c:319
 CPU: 2 PID: 2609 Comm: aa Not tainted 6.3.0-rc2
 RIP: 0010:ext4_file_write_iter+0xbc7
 Call Trace:
  vfs_write+0x3b1
  ksys_write+0x77
  do_syscall_64+0x39

Fix it by calling block_write_end() before updating i_disksize, just
as ext4_write_end() does.

A reproducer is available at [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217209
Fixes: 64769240bd07f ("ext4: Add delayed allocation support in data=writeback mode")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
fs/ext4/inode.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
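
The ordering problem described in the commit message can be illustrated with a small userspace model. This is only a sketch: `struct model_inode`, `clamp_copied()`, `write_end_buggy()` and `write_end_reordered()` are hypothetical names, not kernel functions, and the model ignores the `ext4_da_should_update_i_disksize()` check, locking, and journaling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two size fields of an ext4 inode. */
struct model_inode {
	long long i_size;     /* in-memory size visible to users */
	long long i_disksize; /* size that will be recorded on disk */
};

/* Models the check in block_write_end(): a short copy into a page that
 * is not uptodate is discarded entirely. */
static unsigned clamp_copied(unsigned len, unsigned copied, bool uptodate)
{
	if (copied < len && !uptodate)
		return 0;
	return copied;
}

/* Buggy order: i_disksize is raised from the raw 'copied' value, and
 * only afterwards is 'copied' clamped and i_size updated. */
static unsigned write_end_buggy(struct model_inode *inode, long long pos,
				unsigned len, unsigned copied, bool uptodate)
{
	if (copied && pos + copied > inode->i_size)
		inode->i_disksize = pos + copied;

	copied = clamp_copied(len, copied, uptodate);
	if (pos + copied > inode->i_size)
		inode->i_size = pos + copied;
	return copied;
}

/* Reordered: clamp first, then derive both sizes from the same value. */
static unsigned write_end_reordered(struct model_inode *inode, long long pos,
				    unsigned len, unsigned copied, bool uptodate)
{
	copied = clamp_copied(len, copied, uptodate);
	if (copied && pos + copied > inode->i_size)
		inode->i_disksize = pos + copied;
	if (pos + copied > inode->i_size)
		inode->i_size = pos + copied;

	/* The invariant that the dio path warns about. */
	assert(inode->i_disksize <= inode->i_size);
	return copied;
}
```

With an empty file, a 200-byte write at offset 0 that copies only 100 bytes into a non-uptodate page leaves the buggy variant with i_disksize = 100 and i_size = 0, while the reordered variant leaves both at 0.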
Comments
On Fri 17-03-23 09:35:53, Zhihao Cheng wrote:
> Following process makes i_disksize exceed i_size:
> [...]
> Fix it by putting block_write_end() before i_disksize updating just
> like ext4_write_end() does.
> [...]

Good catch (although practically this will hardly have any negative
effect). But rather than opencoding generic_write_end() I'd do:

	if (unlikely(copied < len) && !PageUptodate(page))
		copied = 0;

at the beginning of ext4_da_write_end() and that should solve these
problems as well?

								Honza
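
The suggestion above keeps the generic helper untouched and only adds an early check in the ext4-specific function. A rough userspace sketch of that shape follows; `struct da_inode`, `generic_write_end_model()` and `da_write_end_model()` are hypothetical names for illustration, and the page-uptodate state is passed as a plain flag rather than read from a page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the size fields of an inode. */
struct da_inode {
	long long i_size;
	long long i_disksize;
};

/* Models the unchanged generic helper: it discards a short copy into a
 * non-uptodate page and then grows i_size if needed. */
static unsigned generic_write_end_model(struct da_inode *inode, long long pos,
					unsigned len, unsigned copied,
					bool uptodate)
{
	if (copied < len && !uptodate)
		copied = 0;
	if (pos + copied > inode->i_size)
		inode->i_size = pos + copied;
	return copied;
}

/* Models the delalloc write-end path with the early check added, so the
 * i_disksize update already sees the value the generic helper will use. */
static unsigned da_write_end_model(struct da_inode *inode, long long pos,
				   unsigned len, unsigned copied, bool uptodate)
{
	if (copied < len && !uptodate)
		copied = 0;

	if (copied && pos + copied > inode->i_size)
		inode->i_disksize = pos + copied;

	copied = generic_write_end_model(inode, pos, len, copied, uptodate);
	assert(inode->i_disksize <= inode->i_size);
	return copied;
}
```

In this model a short copy into an uptodate page is still kept (both sizes grow to pos + copied), while a short copy into a non-uptodate page changes neither size.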
Hi, Jan

> On Fri 17-03-23 09:35:53, Zhihao Cheng wrote:
>> Following process makes i_disksize exceed i_size:
>> [...]
>
> Good catch (although practically this will hardly have any negative
> effect). But rather than opencoding generic_write_end() I'd do:
>
> 	if (unlikely(copied < len) && !PageUptodate(page))
> 		copied = 0;
>
> at the beginning of ext4_da_write_end() and that should solve these
> problems as well?
>

Yes, your suggestion looks good. I think we can put the check just
after the ext4_write_inline_data_end() call (line 3150)? On the one hand,
we can still pass the original 'copied' value to trace_ext4_da_write_end();
on the other hand, ext4_write_inline_data_end() already has this check.
> Hi, Jan
>> On Fri 17-03-23 09:35:53, Zhihao Cheng wrote:
>>> Following process makes i_disksize exceed i_size:
>>> [...]
>
> Yes, your suggestion looks good, and I think we can put the checking
> just after ext4_write_inline_data_end(Line 3150)?
> [...]

BTW, I want to send another patch as follows:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf0b7dea4900..570a687ae847 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3149,7 +3149,7 @@ static int ext4_da_write_end(struct file *file,
 		return ext4_write_inline_data_end(inode, pos, len, copied, page);
 
 	start = pos & (PAGE_SIZE - 1);
-	end = start + copied - 1;
+	end = start + (copied ? copied - 1 : copied);
 
 	/*
 	 * Since we are holding inode lock, we are sure i_disksize <=
@@ -3167,7 +3167,7 @@ static int ext4_da_write_end(struct file *file,
 	 * ext4_da_write_inline_data_end().
 	 */
 	new_i_size = pos + copied;
-	if (copied && new_i_size > inode->i_size &&
+	if (new_i_size > inode->i_size &&
 	    ext4_da_should_update_i_disksize(page, end))
 		ext4_update_i_disksize(inode, new_i_size);

This modification handles the inconsistency between i_size and i_disksize
introduced by ea51d132dbf9 ("ext4: avoid hangs in
ext4_da_should_update_i_disksize()").

A partial write may show a fake inode size to the user, for example:

 i_disksize=1
 generic_perform_write
  copied = iov_iter_copy_from_user_atomic(len) // copied = 0
  ext4_da_write_end // skip updating i_disksize
   generic_write_end
    if (pos + copied > inode->i_size) { // 10 + 0 > 1, true
     i_size_write(inode, pos + copied); // i_size = 10
    }

 0 1       10               4096
 |_|_______|_________..._____|
   |       |
 i_size    pos

Now the user sees an i_size of 10 (i_disksize is still 1). After the inode
is destroyed, the user will see an i_size of 1, read from disk.
On Sat 18-03-23 16:51:00, Zhihao Cheng wrote:
> [...]
> BTW, I want send another patch as follows:
> [...]
> Now, user see the i_size is 10 (i_disksize is still 1). After inode
> destroyed, user will get the i_size is 1 read from disk.

OK, but shouldn't we rather change generic_write_end() to not increase
i_size if no write happened? Because that is what seems somewhat
problematic...

								Honza
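
The question raised here, whether i_size should grow at all when nothing was copied, can be made concrete with the numbers from the example (i_size = 1, pos = 10, copied = 0). The sketch below is a userspace model only: `struct gw_inode`, `grow_isize_unguarded()` and `grow_isize_guarded()` are hypothetical names, and the `copied` guard is one possible shape of such a change, not a change taken from this thread.

```c
#include <assert.h>

/* Hypothetical stand-in for the size fields of an inode. */
struct gw_inode {
	long long i_size;     /* in-memory size visible to users */
	long long i_disksize; /* size recorded on disk */
};

/* Models the size update as described in the thread: i_size grows to
 * pos + copied even when copied is zero. */
static void grow_isize_unguarded(struct gw_inode *inode, long long pos,
				 unsigned copied)
{
	if (pos + copied > inode->i_size)
		inode->i_size = pos + copied;
}

/* Assumed variant: skip the update entirely when no bytes were written. */
static void grow_isize_guarded(struct gw_inode *inode, long long pos,
			       unsigned copied)
{
	assert(inode->i_disksize <= inode->i_size);
	if (copied && pos + copied > inode->i_size)
		inode->i_size = pos + copied;
}
```

Starting from i_size = i_disksize = 1, the unguarded variant with pos = 10 and copied = 0 reports an i_size of 10 while i_disksize stays 1; the guarded variant leaves i_size at 1.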
[...]
>> Now, user see the i_size is 10 (i_disksize is still 1). After inode
>> destroyed, user will get the i_size is 1 read from disk.
>
> OK, but shouldn't we rather change generic_write_end() to not increase
> i_size if no write happened? Because that is what seems somewhat
> problematic...
>
> 								Honza
>

After looking through some code, I found that some other places have a
similar problem:
1. In ext4_write_end(), i_size is updated by ext4, not by generic_write_end().
2. In the iomap framework, i_size is updated even if copied is zero.
3. In ubifs_write_end(), i_size is updated even if copied is zero.

It seems that fixing all of these places is not easy work.
On Mon 20-03-23 20:49:07, Zhihao Cheng wrote:
> [...]
> After looking through some code, I find some other places have similar
> problem:
> [...]
> It seems that fixing all places is not an easy work.

Well, yeah, probably not trivial but still desirable ;). Will you look into
that?

								Honza
> On Mon 20-03-23 20:49:07, Zhihao Cheng wrote:
>> [...]
>> It seems that fixing all places is not an easy work.
>
> Well, yeah, probably not trivial but still desirable ;). Will you look into
> that?
>
> 								Honza
>

I'm happy to investigate it, though it may take some time. I'm also glad
to help review code if somebody comes up with a solution first.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf0b7dea4900..577dc23f3b78 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3136,6 +3136,8 @@ static int ext4_da_write_end(struct file *file,
 	loff_t new_i_size;
 	unsigned long start, end;
 	int write_mode = (int)(unsigned long)fsdata;
+	bool i_size_changed = false;
+	loff_t old_size = inode->i_size;
 
 	if (write_mode == FALL_BACK_TO_NONDELALLOC)
 		return ext4_write_end(file, mapping, pos,
@@ -3148,6 +3150,8 @@ static int ext4_da_write_end(struct file *file,
 	    ext4_has_inline_data(inode))
 		return ext4_write_inline_data_end(inode, pos, len, copied, page);
 
+	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
 	start = pos & (PAGE_SIZE - 1);
 	end = start + copied - 1;
 
@@ -3162,16 +3166,30 @@ static int ext4_da_write_end(struct file *file,
 	 * check), we need to update i_disksize here as neither
 	 * ext4_writepage() nor certain ext4_writepages() paths not
 	 * allocating blocks update i_disksize.
-	 *
-	 * Note that we defer inode dirtying to generic_write_end() /
-	 * ext4_da_write_inline_data_end().
 	 */
 	new_i_size = pos + copied;
-	if (copied && new_i_size > inode->i_size &&
-	    ext4_da_should_update_i_disksize(page, end))
-		ext4_update_i_disksize(inode, new_i_size);
+	if (new_i_size > inode->i_size) {
+		i_size_write(inode, new_i_size);
+		i_size_changed = true;
+		if (copied && ext4_da_should_update_i_disksize(page, end))
+			ext4_update_i_disksize(inode, new_i_size);
+	}
+
+	unlock_page(page);
+	put_page(page);
+
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
+	/*
+	 * Don't mark the inode dirty under page lock. First, it unnecessarily
+	 * makes the holding time of page lock longer. Second, it forces lock
+	 * ordering of page lock and transaction start for journaling
+	 * filesystems.
+	 */
+	if (i_size_changed)
+		mark_inode_dirty(inode);
 
-	return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	return copied;
 }
 
 /*