[2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks()
Message ID | f7aaea4a59bc1c69f87e178dae34c38d2bcfee12.1695987265.git.ojaswin@linux.ibm.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:2a8e:b0:403:3b70:6f57 with SMTP id in14csp14686vqb; Fri, 29 Sep 2023 13:04:13 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFSkLA2qkXlAkrtvURp/0vTFuOogaZP6puU6X9SXQ84qnVEp2YPQTrQ38i/M/KvvU1Eq4CZ X-Received: by 2002:a05:6a00:15c7:b0:690:3abc:8043 with SMTP id o7-20020a056a0015c700b006903abc8043mr5742705pfu.8.1696017852924; Fri, 29 Sep 2023 13:04:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1696017852; cv=none; d=google.com; s=arc-20160816; b=FcL35TsWgR6ijNk70Sn1A8UaL2vK9SanfmtbeExvts9v+0DWKHaOhhxtEdbW7+BSCC I+lgujBAv0fv6jPJO9RplGb4jZ59USFxqRPP1/7mi7QjYBI/YKHZUqwMQ/oFImlHcpsQ Bsyz1jZXYxb7mwZeXBiRakVLoy1oAUbINhejPrIQW4EIGdPJ7x/rAi5xrCLjjFp+CiWs 268dNCWmccqVHWhgNPQv5bTmSRxkzRnEmsXMmmjJ7fJGmkl6osIaEz4g2XXrSR/qqYkz uUdQH3MxckIODYMHKNJtSbGxvQtnYpwY0PUCeIN2BcnK+C8E/hO+VePSMx1lX1l0qIoF C1LQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=w8SJmSny0Zw7pXex9C2zi5+DJoyv5ZO/nLsi0spalII=; fh=PJMzc3Hp2p3Dvzr5TvkhdjVsfQ/xiDeCtVytQiIqKjY=; b=Q3kOtcHve0ArdiaSn7PIjoeGrw0cBq8B1i6WQGM0ZfApvG33VlJdH03uNvT68QACFA qU7E2i+b2ydcCza/PcZm0RWFb17Dhm6jmZKm5bmD1qSI9ESvNq1uiPJescGwdP2eoEDM Q3dJImdBwstY3u65xsmUe1EDG7rJfdpdqLtHno9xZujZmP6tG9Vx9gR6IttZDLlQwBMz gM5JsEoy7eeZdMplSopkW4vjWPRoJOX0++YznGblY4qbiIuvsyVhsLIFJRJrPWOpEn9a E6CAzAWKtVFSoQImlkwUTXO4DKbPybMO8U+ppB9hN+229i33wcyRoQZSRe9ilGqC79Dk aE6A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=HnAP9Fzr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from lipwig.vger.email (lipwig.vger.email. [2620:137:e000::3:3]) by mx.google.com with ESMTPS id cb3-20020a056a00430300b0068fb986fcf3si21280969pfb.377.2023.09.29.13.04.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Sep 2023 13:04:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) client-ip=2620:137:e000::3:3; Authentication-Results: mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=HnAP9Fzr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id DAAE58098B03; Fri, 29 Sep 2023 07:11:43 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233217AbjI2OLL (ORCPT <rfc822;pwkd43@gmail.com> + 20 others); Fri, 29 Sep 2023 10:11:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33486 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233125AbjI2OLE (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 29 Sep 2023 10:11:04 -0400 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4394F1A8; Fri, 29 Sep 2023 07:11:02 -0700 (PDT) Received: from pps.filterd (m0356517.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 38TDDvsh019380; Fri, 29 Sep 2023 14:10:58 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=w8SJmSny0Zw7pXex9C2zi5+DJoyv5ZO/nLsi0spalII=; b=HnAP9FzrYno+GyCptuOatT5/zccTsE0llfrgvLZqy4n1Lt9ksjjp4exoTReuamBUHe2t ff9quY+UW0bshBCBB1fhkFHwdP+eA5CyvcVVwL64GUXgvz8GrjaaFngEeTi5KtIBO/CU RlySCDx6j0Sfj1DNPbdHKaHTIvyuV0zZ1NIX9Zb9zYP2gMcoQJWhhze4NTIdR+7QHJ3Y ixngUXDoI4Z+n33q7K5GenqpFRqpuV2qtx1RCv8VV6TPV1QzC0Nk1qQ+P35yyy3KbgBo dxveYAjE1ZVw2aTUxPZF+p/gbVVdkColQDb19biMx6avbN8fwHdN/zzw81BHVHpt9e/S 1g== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3tdxtfjvtn-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 29 Sep 2023 14:10:58 +0000 Received: from m0356517.ppops.net (m0356517.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 38TDEs9E022682; Fri, 29 Sep 2023 14:10:57 GMT Received: from ppma23.wdc07v.mail.ibm.com (5d.69.3da9.ip4.static.sl-reverse.com [169.61.105.93]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3tdxtfjvt0-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 29 Sep 2023 14:10:57 +0000 Received: from pps.filterd (ppma23.wdc07v.mail.ibm.com [127.0.0.1]) by ppma23.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 38TCuFRT010995; Fri, 29 Sep 2023 14:10:56 GMT Received: from smtprelay06.fra02v.mail.ibm.com ([9.218.2.230]) by ppma23.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3tabum4wtp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 29 Sep 2023 14:10:56 +0000 Received: from smtpav07.fra02v.mail.ibm.com (smtpav07.fra02v.mail.ibm.com [10.20.54.106]) by smtprelay06.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 38TEAsmn44892648 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 29 Sep 2023 14:10:54 GMT Received: from smtpav07.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 61EC720043; Fri, 29 Sep 2023 14:10:54 +0000 (GMT) Received: from smtpav07.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A0CD820040; Fri, 29 Sep 2023 14:10:52 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.84.228]) by smtpav07.fra02v.mail.ibm.com (Postfix) with ESMTP; Fri, 29 Sep 2023 14:10:52 +0000 (GMT) From: Ojaswin Mujoo <ojaswin@linux.ibm.com> To: linux-ext4@vger.kernel.org, "Theodore Ts'o" <tytso@mit.edu> Cc: Ritesh Harjani <ritesh.list@gmail.com>, linux-kernel@vger.kernel.org, Jan Kara <jack@suse.cz> Subject: [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks() Date: Fri, 29 Sep 2023 19:40:44 +0530 Message-Id: <f7aaea4a59bc1c69f87e178dae34c38d2bcfee12.1695987265.git.ojaswin@linux.ibm.com> X-Mailer: git-send-email 2.39.3 In-Reply-To: <cover.1695987265.git.ojaswin@linux.ibm.com> References: <cover.1695987265.git.ojaswin@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: mLO0E7J62S03FXhdZmwf1R1eOP5SmG0d X-Proofpoint-GUID: BZ4DcwI7GjGeCo8UQtqWZ2YRaQq4ey4j X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.267,Aquarius:18.0.980,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-09-29_11,2023-09-28_03,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 lowpriorityscore=0 suspectscore=0 mlxlogscore=999 spamscore=0 clxscore=1015 malwarescore=0 priorityscore=1501 adultscore=0 bulkscore=0 mlxscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2309180000 definitions=main-2309290120 X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 29 Sep 2023 07:11:43 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1778403616006137823 X-GMAIL-MSGID: 1778403616006137823 |
Series |
Optimize __ext4_block_zero_page_range() for unwritten buffers
|
|
Commit Message
Ojaswin Mujoo
Sept. 29, 2023, 2:10 p.m. UTC
In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned
however we only zero out the pagecache range that is block aligned. These
functions are relying on ext4_zero_partial_blocks() ->
__ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
the pageacache. However, the right thing to do is to properly zero out the whole
range in these functions before and not rely on a different function to do it
for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
complete range.
This will also allow us to now exit early for unwritten buffer heads in
__ext4_block_zero_page_range(), in upcoming patch.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/ext4/extents.c | 17 +++++++++++------
fs/ext4/inode.c | 3 +--
2 files changed, 12 insertions(+), 8 deletions(-)
Comments
On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote: > In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned > however we only zero out the pagecache range that is block aligned. These > functions are relying on ext4_zero_partial_blocks() -> > __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in > the pageacache. However, the right thing to do is to properly zero out the whole > range in these functions before and not rely on a different function to do it > for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the > complete range. > > This will also allow us to now exit early for unwritten buffer heads in > __ext4_block_zero_page_range(), in upcoming patch. > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> > --- > fs/ext4/extents.c | 17 +++++++++++------ > fs/ext4/inode.c | 3 +-- > 2 files changed, 12 insertions(+), 8 deletions(-) > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index c79b4c25afc4..2dc681cab6a5 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > /* Zero range excluding the unaligned edges */ > if (max_blocks > 0) { > - flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | > - EXT4_EX_NOCACHE); > - > /* > * Prevent page faults from reinstantiating pages we have > * released from page cache. > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset, > * disk in case of crash before zeroing trans is committed. > */ > if (ext4_should_journal_data(inode)) { > - ret = filemap_write_and_wait_range(mapping, start, end - 1); > + ret = filemap_write_and_wait_range(mapping, start, > + end - 1); I think this accidentally creeped in, will fix it in next rev. Anyways, I had some questions that might be unrelated to this patch, I'll add them inline: > if (ret) { > filemap_invalidate_unlock(mapping); > goto out_mutex; > } > } > + } So the above if (max_blocks) {...} block runs when the range spans multiple blocks but I think the filemap_write_and_wait_range() and ext4_update_disksize_before_punch() should be called when we are actually spanning multiple pages, since the disksize not updating issue and the truncate racing with checkpoint only happen when the complete page is truncated. Is this understanding correct? > + > + /* > + * Now truncate the pagecache and zero out non page aligned edges of the > + * range (if any) > + */ > + truncate_pagecache_range(inode, offset, offset + len - 1); > > - /* Now release the pages and zero block aligned part of pages */ > - truncate_pagecache_range(inode, start, end - 1); > + if (max_blocks > 0) { > inode->i_mtime = inode->i_ctime = current_time(inode); > > + flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE); > ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, > flags); > filemap_invalidate_unlock(mapping); > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 6c490f05e2ba..de8ea8430d30 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > ret = ext4_update_disksize_before_punch(inode, offset, length); In this function ext4_punch_hole() I see that we call filemap_write_and_wait_range() and then take the inode_lock() later. Doesn't this leave a window for the pages to get dirty again? For example, in ext4_zero_range(), we checkpoint using filemap_write_and_wait_range() in case of data=journal under inode_lock() but that's not the case here. Just wondering if this or any other code path might still race here? Regards, ojaswin > if (ret) > goto out_dio; > - truncate_pagecache_range(inode, first_block_offset, > - last_block_offset); > } > + truncate_pagecache_range(inode, offset, offset + length - 1); > > if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > credits = ext4_writepage_trans_blocks(inode); > -- > 2.39.3 >
On Fri 29-09-23 21:45:29, Ojaswin Mujoo wrote: > On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote: > > In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned > > however we only zero out the pagecache range that is block aligned. These > > functions are relying on ext4_zero_partial_blocks() -> > > __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in > > the pageacache. However, the right thing to do is to properly zero out the whole > > range in these functions before and not rely on a different function to do it > > for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the > > complete range. > > > > This will also allow us to now exit early for unwritten buffer heads in > > __ext4_block_zero_page_range(), in upcoming patch. > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> > > --- > > fs/ext4/extents.c | 17 +++++++++++------ > > fs/ext4/inode.c | 3 +-- > > 2 files changed, 12 insertions(+), 8 deletions(-) > > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > index c79b4c25afc4..2dc681cab6a5 100644 > > --- a/fs/ext4/extents.c > > +++ b/fs/ext4/extents.c > > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > > > /* Zero range excluding the unaligned edges */ > > if (max_blocks > 0) { > > - flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | > > - EXT4_EX_NOCACHE); > > - > > /* > > * Prevent page faults from reinstantiating pages we have > > * released from page cache. > > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > * disk in case of crash before zeroing trans is committed. > > */ > > if (ext4_should_journal_data(inode)) { > > - ret = filemap_write_and_wait_range(mapping, start, end - 1); > > + ret = filemap_write_and_wait_range(mapping, start, > > + end - 1); > > I think this accidentally creeped in, will fix it in next rev. Yeah, just pull it in patch 1. > > if (ret) { > > filemap_invalidate_unlock(mapping); > > goto out_mutex; > > } > > } > > + } > > So the above if (max_blocks) {...} block runs when the range spans > multiple blocks but I think the filemap_write_and_wait_range() and > ext4_update_disksize_before_punch() should be called when we are actually > spanning multiple pages, since the disksize not updating issue and the > truncate racing with checkpoint only happen when the complete page is > truncated. Is this understanding correct? Why do you think the issues apply only to multiple pages? I mean even if a single block is dirty in memory, it may be pushing i_disksize or carrying journalled data we need to commit. > > + /* > > + * Now truncate the pagecache and zero out non page aligned edges of the > > + * range (if any) > > + */ > > + truncate_pagecache_range(inode, offset, offset + len - 1); > > > > - /* Now release the pages and zero block aligned part of pages */ > > - truncate_pagecache_range(inode, start, end - 1); > > + if (max_blocks > 0) { > > inode->i_mtime = inode->i_ctime = current_time(inode); > > > > + flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE); > > ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, > > flags); > > filemap_invalidate_unlock(mapping); > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 6c490f05e2ba..de8ea8430d30 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > ret = ext4_update_disksize_before_punch(inode, offset, length); > > In this function ext4_punch_hole() I see that we call > filemap_write_and_wait_range() and then take the inode_lock() later. > Doesn't this leave a window for the pages to get dirty again? There's definitely a race window but I think the call to filemap_write_and_wait_range() is a leftover from the past when hole punching could race in a nasty way. These days we have invalidate_lock so I *think* we can just remove that filemap_write_and_wait_range() call. At least I don't see a good reason for it now because the pages are going away anyway. But it needs testing :). > For example, in ext4_zero_range(), we checkpoint using > filemap_write_and_wait_range() in case of data=journal under > inode_lock() but that's not the case here. Just wondering if this > or any other code path might still race here? Well, that's a bit different story as the comment there explains. And yes, invalidate_lock protects us from possible races there. > > if (ret) > > goto out_dio; > > - truncate_pagecache_range(inode, first_block_offset, > > - last_block_offset); > > } > > + truncate_pagecache_range(inode, offset, offset + length - 1); But I have realized that changes done in this patch actually don't help with changing ext4_zero_partial_blocks() because as soon as we drop invalidate_lock, a page fault can come in and modify contents of partial pages we need zeroed. So thinking about this again with fresh mind, these block vs pagecache consistency issues aren't probably worth it and current code flow is good enough. Sorry for misleading you. We might just add a comment to __ext4_block_zero_page_range() to explain that buffer_unwritten() buffers can get there but they should be already zeroed-out and uptodate and we do need to process them because of page cache zeroing. What do you think? Honza
On Thu, Oct 19, 2023 at 06:55:46PM +0200, Jan Kara wrote: > On Fri 29-09-23 21:45:29, Ojaswin Mujoo wrote: > > On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote: > > > In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned > > > however we only zero out the pagecache range that is block aligned. These > > > functions are relying on ext4_zero_partial_blocks() -> > > > __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in > > > the pageacache. However, the right thing to do is to properly zero out the whole > > > range in these functions before and not rely on a different function to do it > > > for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the > > > complete range. > > > > > > This will also allow us to now exit early for unwritten buffer heads in > > > __ext4_block_zero_page_range(), in upcoming patch. > > > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> > > > --- > > > fs/ext4/extents.c | 17 +++++++++++------ > > > fs/ext4/inode.c | 3 +-- > > > 2 files changed, 12 insertions(+), 8 deletions(-) > > > > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > > index c79b4c25afc4..2dc681cab6a5 100644 > > > --- a/fs/ext4/extents.c > > > +++ b/fs/ext4/extents.c > > > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > > > > > /* Zero range excluding the unaligned edges */ > > > if (max_blocks > 0) { > > > - flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | > > > - EXT4_EX_NOCACHE); > > > - > > > /* > > > * Prevent page faults from reinstantiating pages we have > > > * released from page cache. > > > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > > * disk in case of crash before zeroing trans is committed. > > > */ > > > if (ext4_should_journal_data(inode)) { > > > - ret = filemap_write_and_wait_range(mapping, start, end - 1); > > > + ret = filemap_write_and_wait_range(mapping, start, > > > + end - 1); > > > > I think this accidentally creeped in, will fix it in next rev. > > Yeah, just pull it in patch 1. > > > > if (ret) { > > > filemap_invalidate_unlock(mapping); > > > goto out_mutex; > > > } > > > } > > > + } > > > > So the above if (max_blocks) {...} block runs when the range spans > > multiple blocks but I think the filemap_write_and_wait_range() and > > ext4_update_disksize_before_punch() should be called when we are actually > > spanning multiple pages, since the disksize not updating issue and the > > truncate racing with checkpoint only happen when the complete page is > > truncated. Is this understanding correct? > > Why do you think the issues apply only to multiple pages? I mean even if a > single block is dirty in memory, it may be pushing i_disksize or carrying > journalled data we need to commit. Hey Jan, You are right, I think I was misunderstanding this code a bit, thinking that these things would only be needed if the complete folio is absent. Upon rechecking the paths like the writeback path I can now see that even if the blocks till i_size are already allocated (eg, through ext4_zero_range) then we won't actually be calling mpage_map_and_submit_extent() which is where disksize updates. > > > > + /* > > > + * Now truncate the pagecache and zero out non page aligned edges of the > > > + * range (if any) > > > + */ > > > + truncate_pagecache_range(inode, offset, offset + len - 1); > > > > > > - /* Now release the pages and zero block aligned part of pages */ > > > - truncate_pagecache_range(inode, start, end - 1); > > > + if (max_blocks > 0) { > > > inode->i_mtime = inode->i_ctime = current_time(inode); > > > > > > + flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE); > > > ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, > > > flags); > > > filemap_invalidate_unlock(mapping); > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > > index 6c490f05e2ba..de8ea8430d30 100644 > > > --- a/fs/ext4/inode.c > > > +++ b/fs/ext4/inode.c > > > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > > ret = ext4_update_disksize_before_punch(inode, offset, length); > > > > In this function ext4_punch_hole() I see that we call > > filemap_write_and_wait_range() and then take the inode_lock() later. > > Doesn't this leave a window for the pages to get dirty again? > > There's definitely a race window but I think the call to > filemap_write_and_wait_range() is a leftover from the past when hole > punching could race in a nasty way. These days we have invalidate_lock so I > *think* we can just remove that filemap_write_and_wait_range() call. At > least I don't see a good reason for it now because the pages are going away > anyway. But it needs testing :). > > > For example, in ext4_zero_range(), we checkpoint using > > filemap_write_and_wait_range() in case of data=journal under > > inode_lock() but that's not the case here. Just wondering if this > > or any other code path might still race here? > > Well, that's a bit different story as the comment there explains. And yes, > invalidate_lock protects us from possible races there. Ahh okay, got it. > > > > if (ret) > > > goto out_dio; > > > - truncate_pagecache_range(inode, first_block_offset, > > > - last_block_offset); > > > } > > > + truncate_pagecache_range(inode, offset, offset + length - 1); > > But I have realized that changes done in this patch actually don't help > with changing ext4_zero_partial_blocks() because as soon as we drop > invalidate_lock, a page fault can come in and modify contents of partial > pages we need zeroed. > > So thinking about this again with fresh mind, these block vs pagecache > consistency issues aren't probably worth it and current code flow is good > enough. Sorry for misleading you. We might just add a comment to > __ext4_block_zero_page_range() to explain that buffer_unwritten() buffers > can get there but they should be already zeroed-out and uptodate and we do > need to process them because of page cache zeroing. What do you think? Oh right, I was not thinking from the mmap path, sorry about that. In that case I think your point makes sense, lets just let this be for now. I'll send a v2 with the first patch of the series and also add a comment as you suggested. Thanks for the review and taking the time to answer my questions! Regards, ojaswin > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c79b4c25afc4..2dc681cab6a5 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset, /* Zero range excluding the unaligned edges */ if (max_blocks > 0) { - flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | - EXT4_EX_NOCACHE); - /* * Prevent page faults from reinstantiating pages we have * released from page cache. @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset, * disk in case of crash before zeroing trans is committed. */ if (ext4_should_journal_data(inode)) { - ret = filemap_write_and_wait_range(mapping, start, end - 1); + ret = filemap_write_and_wait_range(mapping, start, + end - 1); if (ret) { filemap_invalidate_unlock(mapping); goto out_mutex; } } + } + + /* + * Now truncate the pagecache and zero out non page aligned edges of the + * range (if any) + */ + truncate_pagecache_range(inode, offset, offset + len - 1); - /* Now release the pages and zero block aligned part of pages */ - truncate_pagecache_range(inode, start, end - 1); + if (max_blocks > 0) { inode->i_mtime = inode->i_ctime = current_time(inode); + flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE); ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags); filemap_invalidate_unlock(mapping); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6c490f05e2ba..de8ea8430d30 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) ret = ext4_update_disksize_before_punch(inode, offset, length); if (ret) goto out_dio; - truncate_pagecache_range(inode, first_block_offset, - last_block_offset); } + truncate_pagecache_range(inode, offset, offset + length - 1); if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) credits = ext4_writepage_trans_blocks(inode);