Message ID | 20221123055827.26996-11-nj.shetty@samsung.com |
---|---|
State | New |
Headers |
From: Nitesh Shetty <nj.shetty@samsung.com>
To: axboe@kernel.dk, agk@redhat.com, snitzer@kernel.org, dm-devel@redhat.com, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, james.smart@broadcom.com, kch@nvidia.com, damien.lemoal@opensource.wdc.com, naohiro.aota@wdc.com, jth@kernel.org, viro@zeniv.linux.org.uk
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org, anuj20.g@samsung.com, joshi.k@samsung.com, p.raghav@samsung.com, nitheshshetty@gmail.com, gost.dev@samsung.com, Nitesh Shetty <nj.shetty@samsung.com>
Subject: [PATCH v5 10/10] fs: add support for copy file range in zonefs
Date: Wed, 23 Nov 2022 11:28:27 +0530
Message-Id: <20221123055827.26996-11-nj.shetty@samsung.com>
In-Reply-To: <20221123055827.26996-1-nj.shetty@samsung.com>
References: <20221123055827.26996-1-nj.shetty@samsung.com> |
Series | Implement copy offload support |
Commit Message
Nitesh Shetty
Nov. 23, 2022, 5:58 a.m. UTC
copy_file_range is implemented using copy offload,
copy offloading to device is always enabled.
To disable copy offloading mount with "no_copy_offload" mount option.
At present copy offload is only used, if the source and destination files
are on same block device, otherwise copy file range is completed by
generic copy file range.

copy file range implemented as following:
- write pending writes on the src and dest files
- drop page cache for dest file if its conv zone
- copy the range using offload
- update dest file info

For all failure cases we fallback to generic file copy range
At present this implementation does not support conv aggregation

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)
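For context, the new file operation is reached through the copy_file_range(2)
system call. A minimal userspace sketch of how it would be exercised (not part
of the patch; the file paths, offsets and 1 MiB length are illustrative, and on
a zonefs sequential zone file dst_off would have to match the zone write
pointer):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		off64_t src_off = 0, dst_off = 0;
		int src, dst;
		ssize_t n;

		if (argc < 3)
			return 1;
		src = open(argv[1], O_RDONLY);
		dst = open(argv[2], O_WRONLY);
		if (src < 0 || dst < 0)
			return 1;

		/* flags must be 0; the kernel dispatches to the file
		 * system's ->copy_file_range() handler, here zonefs's. */
		n = copy_file_range(src, &src_off, dst, &dst_off,
				    1024 * 1024, 0);
		printf("copied %zd bytes\n", n);
		return n < 0;
	}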
Comments
On Wed, Nov 23, 2022 at 8:26 AM Nitesh Shetty <nj.shetty@samsung.com> wrote:
>
> copy_file_range is implemented using copy offload,
> copy offloading to device is always enabled.
> To disable copy offloading mount with "no_copy_offload" mount option.

[...]

> +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
> +				      struct file *dst_file, loff_t dst_off,
> +				      size_t len, unsigned int flags)
> +{
> +	ssize_t ret = -EIO;
> +
> +	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
> +				      &len, flags);
> +	if (ret > 0)
> +		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
> +				       len, flags);
> +	else if (ret < 0 && ret == -EXDEV)

First of all, ret < 0 is redundant.

> +		ret = generic_copy_file_range(src_file, src_off, dst_file,
> +					      dst_off, len, flags);

But more importantly, why do you want to fall back to do_splice_direct()
in zonefs copy_file_range? How does it serve your patch set or the
prospective consumers of zonefs copy_file_range?

The reason I am asking is because commit 5dae222a5ff0 ("vfs: allow
copy_file_range to copy across devices") turned out to be an API mistake
that was later reverted by 868f9f2f8e00 ("vfs: fix copy_file_range()
regression in cross-fs copies").

It is always better to return EXDEV to userspace, which can always fall
back to splice itself, but maybe it has something smarter to do.

The places where it made sense for the kernel to fall back to direct
splice were for network servers' server-side copy, but that is
independent of any specific filesystem copy_file_range() implementation.

Thanks,
Amir.
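To make the suggested division of labor concrete, here is a hedged
userspace-side sketch of the EXDEV fallback Amir describes (the helper name
and buffer size are illustrative, not from the thread):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <unistd.h>

	/* Try the offloaded path first; degrade to a plain read/write
	 * loop when the kernel reports a cross-device copy (or no
	 * copy_file_range support at all). */
	static ssize_t copy_with_fallback(int src, int dst, size_t len)
	{
		ssize_t n = copy_file_range(src, NULL, dst, NULL, len, 0);
		ssize_t total = 0;
		char buf[65536];

		if (n >= 0 || (errno != EXDEV && errno != ENOSYS))
			return n;

		while (len > 0) {
			ssize_t rd = read(src, buf,
					len < sizeof(buf) ? len : sizeof(buf));
			if (rd <= 0)
				break;
			for (ssize_t off = 0; off < rd; ) {
				ssize_t wr = write(dst, buf + off, rd - off);
				if (wr < 0)
					return total > 0 ? total : -1;
				off += wr;
				total += wr;
			}
			len -= (size_t)rd;
		}
		return total;
	}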
On Wed, Nov 23, 2022 at 08:53:14AM +0200, Amir Goldstein wrote:
> On Wed, Nov 23, 2022 at 8:26 AM Nitesh Shetty <nj.shetty@samsung.com> wrote:

[...]

> > +	else if (ret < 0 && ret == -EXDEV)
>
> First of all, ret < 0 is redundant.

Acked.

> > +		ret = generic_copy_file_range(src_file, src_off, dst_file,
> > +					      dst_off, len, flags);
>
> But more importantly, why do you want to fall back to do_splice_direct()
> in zonefs copy_file_range? How does it serve your patch set or the
> prospective consumers of zonefs copy_file_range?
>
> The reason I am asking is because commit 5dae222a5ff0 ("vfs: allow
> copy_file_range to copy across devices") turned out to be an API mistake
> that was later reverted by 868f9f2f8e00 ("vfs: fix copy_file_range()
> regression in cross-fs copies").
>
> It is always better to return EXDEV to userspace, which can always fall
> back to splice itself, but maybe it has something smarter to do.
>
> The places where it made sense for the kernel to fall back to direct
> splice were for network servers' server-side copy, but that is
> independent of any specific filesystem copy_file_range() implementation.
>
> Thanks,
> Amir.

At present we don't handle a few cases, such as an IO getting split in the
case of copy offload, so we wanted to fall back to the existing mechanism.
Hence we went with the default operation, do_splice_direct.

Regards,
Nitesh Shetty
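For what it's worth, one way the partial-completion case could be handled
without splicing is to reissue the offload for the uncopied tail, using the
per-range comp_len the patch already tracks; a hypothetical fragment (the
variable names come from the patch, the loop is a sketch, not the posted
code):

	/* Reissue the copy offload for whatever zonefs_issue_copy()
	 * reported as not yet copied, instead of do_splice_direct(). */
	ssize_t copied = 0;

	while (copied < len) {
		ssize_t n = zonefs_issue_copy(src_zi, src_off + copied,
					      dst_zi, dst_off + copied,
					      len - copied);
		if (n <= 0)
			break;	/* error or no progress: give up */
		copied += n;
	}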
On 11/23/22 14:58, Nitesh Shetty wrote:
> copy_file_range is implemented using copy offload,
> copy offloading to device is always enabled.
> To disable copy offloading mount with "no_copy_offload" mount option.

And where is the code that handles this option?

[...]

> For all failure cases we fallback to generic file copy range

For all cases? That would be weird. What would be the point of trying to
copy again if e.g. the dest zone has gone offline or read only?

> At present this implementation does not support conv aggregation

Please check this commit message overall: the grammar and punctuation
could really be improved.

[...]

> +	/* Don't copy beyond source file EOF. */
> +	if (src_off < size) {
> +		if (src_off + *len > size)
> +			*len = (size - (src_off + *len));
> +	} else
> +		*len = 0;

Missing curly brackets for the else.

> +	endoff = dst_off + *len;
> +	inode_lock(dst_inode);
> +	if (endoff > dst_zi->i_max_size ||
> +			inode_newsize_ok(dst_inode, endoff)) {
> +		inode_unlock(dst_inode);
> +		goto err;

And here the truncate mutex is not locked, but goto err will unlock it.
This is broken...

> +	}
> +	inode_unlock(dst_inode);

...The locking is completely broken in this function anyway. You take the
lock, look at something, then release the lock. Then what if a write or a
truncate comes in when the inode is not locked? This is completely broken.
The inode should be locked with no dio pending when this function is
called. This is only to check if everything is ok. This has no business
playing with the inode and truncate locks.

[...]

> +	rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);

GFP_NOIO ?

[...]

> +	/* Start by sync'ing the source and destination files for conv zones */
> +	if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
> +		ret = file_write_and_wait_range(src_file, src_off,
> +				(src_off + *len));
> +		if (ret < 0)
> +			goto io_error;
> +	}
> +	inode_dio_wait(src_inode);

That is not a "check". So having this in a function called
zonefs_copy_file_checks() is a little strange.

> +	/* Start by sync'ing the source and destination files ifor conv zones */

Same comment repeated, with typos.

[...]

> +	ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
> +			dst_off, len);
> +	if (ret < 0)

if (ret)

> +		return ret;
> +
> +	return *len;

[...]

> +	inode_lock(src_inode);
> +	inode_lock(dst_inode);

So you did zonefs_copy_file_checks() outside of these locks, which means
that everything about the source and destination files may have changed.
This does not work.

> +	bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
> +	if (bytes < 0)
> +		goto unlock_exit;
> +
> +	ret += bytes;
> +
> +	file_update_time(dst_file);
> +	mutex_lock(&dst_zi->i_truncate_mutex);
> +	zonefs_update_stats(dst_inode, dst_off + bytes);
> +	zonefs_i_size_write(dst_inode, dst_off + bytes);
> +	dst_zi->i_wpoffset += bytes;

This is weird. isize for dst will be dst_zi->i_wpoffset. So please do:

	dst_zi->i_wpoffset += bytes;
	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);

> +	mutex_unlock(&dst_zi->i_truncate_mutex);

And you are not taking care of the accounting for active zones here. If
the copy made the dst zone full, it is not active anymore. You need to
call zonefs_account_active();

> +	/* if we still have some bytes left, do splice copy */
> +	if (bytes && (bytes < len)) {
> +		bytes = do_splice_direct(src_file, &src_off, dst_file,
> +				&dst_off, len, flags);

No way.

> +		if (bytes > 0)
> +			ret += bytes;
> +	}
> +unlock_exit:
> +	if (ret < 0)
> +		zonefs_io_error(dst_inode, true);

How can you be sure that you even did an IO when you get an error?
zonefs_issue_copy() may have failed on its kmalloc() and no IO was done.

[...]

> +	ssize_t ret = -EIO;

This does not need to be initialized.

> +
> +	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
> +			&len, flags);

These checks need to be done for the generic implementation too, no? Why
would checking this automatically trigger the offload? What if the device
does not support offloading?

> +	if (ret > 0)
> +		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
> +				len, flags);

return here, then no need for the else. But see above. This seems all
broken to me.

[...]

> +	/* set copy support by default */

What is this comment supposed to be for?

> 	ret = zonefs_read_super(sb);
> 	if (ret)
> 		return ret;
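Putting Damien's two size-accounting comments together, the destination
update in zonefs_copy_file() would look roughly like this hypothetical
rework (zonefs_account_active() is the zonefs helper he names; the ordering
follows his suggestion):

	mutex_lock(&dst_zi->i_truncate_mutex);
	zonefs_update_stats(dst_inode, dst_off + bytes);
	/* isize of a zonefs file tracks the zone write pointer. */
	dst_zi->i_wpoffset += bytes;
	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);
	/* A destination zone that became full is no longer active. */
	zonefs_account_active(dst_inode);
	mutex_unlock(&dst_zi->i_truncate_mutex);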
On 11/24/22 10:32, Damien Le Moal wrote:
> On 11/23/22 14:58, Nitesh Shetty wrote:

[...]

>> +	inode_lock(src_inode);
>> +	inode_lock(dst_inode);
>
> So you did zonefs_copy_file_checks() outside of these locks, which means
> that everything about the source and destination files may have changed.
> This does not work.

I forgot to mention that locking 2 inodes blindly like this can lead to
deadlocks if another process tries a copy range from dst to src at the
same time (the lock order is reversed and so can deadlock).
On Thu, Nov 24, 2022 at 10:47:55AM +0900, Damien Le Moal wrote:
> >> +	inode_lock(src_inode);
> >> +	inode_lock(dst_inode);
> >
> > So you did zonefs_copy_file_checks() outside of these locks, which
> > means that everything about the source and destination files may have
> > changed. This does not work.
>
> I forgot to mention that locking 2 inodes blindly like this can lead to
> deadlocks if another process tries a copy range from dst to src at the
> same time (the lock order is reversed and so can deadlock).

Not to mention the deadlocks with existing places where fs/namei.c locks
two inodes...
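For reference, the usual way to avoid that ABBA inversion is to take both
inode locks in a fixed order; the VFS already provides
lock_two_nondirectories() for exactly this. A minimal sketch (wiring it
into zonefs this way is an illustration, not the posted patch):

	#include <linux/fs.h>

	/* lock_two_nondirectories() orders the two locks internally, so
	 * concurrent src->dst and dst->src copies cannot deadlock. */
	static void zonefs_lock_file_pair(struct inode *src_inode,
					  struct inode *dst_inode)
	{
		lock_two_nondirectories(src_inode, dst_inode);
	}

	static void zonefs_unlock_file_pair(struct inode *src_inode,
					    struct inode *dst_inode)
	{
		unlock_two_nondirectories(src_inode, dst_inode);
	}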
On Thu, Nov 24, 2022 at 10:47:55AM +0900, Damien Le Moal wrote:
> On 11/24/22 10:32, Damien Le Moal wrote:
> > On 11/23/22 14:58, Nitesh Shetty wrote:

[...]

> I forgot to mention that locking 2 inodes blindly like this can lead to
> deadlocks if another process tries a copy range from dst to src at the
> same time (the lock order is reversed and so can deadlock).

Acked. I do see a gap in the current zonefs cfr implementation. I will
drop this implementation for the next version. Once we finalize on the
block copy offload implementation, we will re-pick this and send it with
the above reviews fixed.

Thank you,
Nitesh
On 11/29/22 21:22, Nitesh Shetty wrote:
> Acked. I do see a gap in the current zonefs cfr implementation. I will drop

cfr ?

> this implementation for the next version. Once we finalize the block copy
> offload implementation, I will re-pick this and send it with the above
> review comments fixed.
>
> Thank you,
> Nitesh

Please trim your replies.
On Wed, Nov 30, 2022 at 08:45:55AM +0900, Damien Le Moal wrote:
> On 11/29/22 21:22, Nitesh Shetty wrote:
> > Acked. I do see a gap in the current zonefs cfr implementation. I will drop
>
> cfr ?
>

Yes, I will drop the zonefs cfr implementation for the next version.

> > this implementation for the next version. Once we finalize the block copy
> > offload implementation, I will re-pick this and send it with the above
> > review comments fixed.
> >
> > Thank you,
> > Nitesh
>
> Please trim your replies.
>

Noted.

> --
> Damien Le Moal
> Western Digital Research
>

Thanks,
Nitesh Shetty
On 11/30/22 13:17, Nitesh Shetty wrote:
> On Wed, Nov 30, 2022 at 08:45:55AM +0900, Damien Le Moal wrote:
>> On 11/29/22 21:22, Nitesh Shetty wrote:
>>> Acked. I do see a gap in the current zonefs cfr implementation. I will drop
>>
>> cfr ?
>>
>
> Yes, I will drop the zonefs cfr implementation for the next version.

I meant: I do not understand "cfr". I now realize that it probably means
copy-file-range ? Please be clear and do not use abbreviations.
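For reference, the destination size update requested in the review of
zonefs_copy_file() would look roughly as follows. This is a sketch only;
it assumes zonefs_account_active() is the existing zonefs helper that
re-evaluates a zone's active state and that, as elsewhere in zonefs, it is
called with i_truncate_mutex held:

	mutex_lock(&dst_zi->i_truncate_mutex);
	zonefs_update_stats(dst_inode, dst_off + bytes);
	/* Advance the write pointer first; isize then simply follows it. */
	dst_zi->i_wpoffset += bytes;
	zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset);
	/* If the copy filled the zone, it is no longer active. */
	zonefs_account_active(dst_inode);
	mutex_unlock(&dst_zi->i_truncate_mutex);

The full patch under discussion follows.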
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index abc9a85106f2..15613433d4ae 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static int zonefs_is_file_copy_offset_ok(struct inode *src_inode,
+		struct inode *dst_inode, loff_t src_off, loff_t dst_off,
+		size_t *len)
+{
+	loff_t size, endoff;
+	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+
+	inode_lock(src_inode);
+	size = i_size_read(src_inode);
+	inode_unlock(src_inode);
+	/* Don't copy beyond source file EOF. */
+	if (src_off < size) {
+		if (src_off + *len > size)
+			*len = (size - (src_off + *len));
+	} else
+		*len = 0;
+
+	mutex_lock(&dst_zi->i_truncate_mutex);
+	if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset)
+			*len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
+
+		if (dst_off != dst_zi->i_wpoffset)
+			goto err;
+	}
+	mutex_unlock(&dst_zi->i_truncate_mutex);
+
+	endoff = dst_off + *len;
+	inode_lock(dst_inode);
+	if (endoff > dst_zi->i_max_size ||
+			inode_newsize_ok(dst_inode, endoff)) {
+		inode_unlock(dst_inode);
+		goto err;
+	}
+	inode_unlock(dst_inode);
+
+	return 0;
+err:
+	mutex_unlock(&dst_zi->i_truncate_mutex);
+	return -EINVAL;
+}
+
+static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi,
+		loff_t src_off, struct zonefs_inode_info *dst_zi,
+		loff_t dst_off, size_t len)
+{
+	struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
+	struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
+	struct range_entry *rlist = NULL;
+	int ret = len;
+
+	rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
+	if (!rlist)
+		return -ENOMEM;
+
+	rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
+	rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
+	rlist[0].len = len;
+	rlist[0].comp_len = 0;
+	ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL,
+			GFP_KERNEL);
+	if (rlist[0].comp_len > 0)
+		ret = rlist[0].comp_len;
+	kfree(rlist);
+
+	return ret;
+}
+
+/* Returns length of possible copy, else returns error */
+static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off,
+		size_t *len, unsigned int flags)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *dst_inode = file_inode(dst_file);
+	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
+	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+	ssize_t ret;
+
+	if (src_inode->i_sb != dst_inode->i_sb)
+		return -EXDEV;
+
+	/* Start by sync'ing the source and destination files for conv zones */
+	if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = file_write_and_wait_range(src_file, src_off,
+				(src_off + *len));
+		if (ret < 0)
+			goto io_error;
+	}
+	inode_dio_wait(src_inode);
+
+	/* Sync the destination file for conv zones */
+	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = file_write_and_wait_range(dst_file, dst_off,
+				(dst_off + *len));
+		if (ret < 0)
+			goto io_error;
+	}
+	inode_dio_wait(dst_inode);
+
+	/* Drop dst file cached pages for a conv zone */
+	if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+		ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
+				dst_off >> PAGE_SHIFT,
+				(dst_off + *len) >> PAGE_SHIFT);
+		if (ret < 0)
+			goto io_error;
+	}
+
+	ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off,
+			dst_off, len);
+	if (ret < 0)
+		return ret;
+
+	return *len;
+
+io_error:
+	zonefs_io_error(dst_inode, true);
+	return ret;
+}
+
+static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off,
+		size_t len, unsigned int flags)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *dst_inode = file_inode(dst_file);
+	struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
+	struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+	ssize_t ret = 0, bytes;
+
+	inode_lock(src_inode);
+	inode_lock(dst_inode);
+	bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len);
+	if (bytes < 0)
+		goto unlock_exit;
+
+	ret += bytes;
+
+	file_update_time(dst_file);
+	mutex_lock(&dst_zi->i_truncate_mutex);
+	zonefs_update_stats(dst_inode, dst_off + bytes);
+	zonefs_i_size_write(dst_inode, dst_off + bytes);
+	dst_zi->i_wpoffset += bytes;
+	mutex_unlock(&dst_zi->i_truncate_mutex);
+	/* if we still have some bytes left, do splice copy */
+	if (bytes && (bytes < len)) {
+		bytes = do_splice_direct(src_file, &src_off, dst_file,
+				&dst_off, len, flags);
+		if (bytes > 0)
+			ret += bytes;
+	}
+unlock_exit:
+	if (ret < 0)
+		zonefs_io_error(dst_inode, true);
+	inode_unlock(src_inode);
+	inode_unlock(dst_inode);
+	return ret;
+}
+
+static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off,
+		size_t len, unsigned int flags)
+{
+	ssize_t ret = -EIO;
+
+	ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
+			&len, flags);
+	if (ret > 0)
+		ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off,
+				len, flags);
+	else if (ret < 0 && ret == -EXDEV)
+		ret = generic_copy_file_range(src_file, src_off, dst_file,
+				dst_off, len, flags);
+	return ret;
+}
+
 static const struct file_operations zonefs_file_operations = {
 	.open		= zonefs_file_open,
 	.release	= zonefs_file_release,
@@ -1234,6 +1411,7 @@ static const struct file_operations zonefs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.iopoll		= iocb_bio_iopoll,
+	.copy_file_range = zonefs_copy_file_range,
 };
 
 static struct kmem_cache *zonefs_inode_cachep;
@@ -1804,6 +1982,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	atomic_set(&sbi->s_active_seq_files, 0);
 	sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);
 
+	/* set copy support by default */
 	ret = zonefs_read_super(sb);
 	if (ret)
 		return ret;
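On the dispatch logic flagged in the review, the mechanical fixes alone
(drop the unnecessary initialization, the dead "ret < 0" test next to
"ret == -EXDEV", and the else) would give roughly the following shape.
This is only a sketch of those mechanical points; as the review notes,
whether the checks should also gate the generic fallback is a separate,
unresolved design question:

	static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
			struct file *dst_file, loff_t dst_off,
			size_t len, unsigned int flags)
	{
		ssize_t ret;

		ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off,
				&len, flags);
		if (ret > 0)
			return zonefs_copy_file(src_file, src_off, dst_file, dst_off,
					len, flags);
		/* Fall back to the generic, page cache based copy on -EXDEV. */
		if (ret == -EXDEV)
			return generic_copy_file_range(src_file, src_off, dst_file,
					dst_off, len, flags);
		return ret;
	}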