Message ID | cover.1701339358.git.ojaswin@linux.ibm.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:bcd1:0:b0:403:3b70:6f57 with SMTP id r17csp400473vqy; Thu, 30 Nov 2023 05:54:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IEkTeKfhttHGr4N7aZe9QWeiopsTNVt9OLeD3lao4hDqs7G7EVNdbCjggSDInLaR6rHVFp3 X-Received: by 2002:a17:90b:1b44:b0:285:b0fa:f7c6 with SMTP id nv4-20020a17090b1b4400b00285b0faf7c6mr18514732pjb.10.1701352443790; Thu, 30 Nov 2023 05:54:03 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701352443; cv=none; d=google.com; s=arc-20160816; b=01WRrYh1Mjbo2UQKyaIKW5FHV0Yd9P+Fc5T5epNeLIZIZxKxgJkDqHyV/XAzRJyB1g s5d/NMFNe4KSo/bVvl2xwhB8w4S9pG/2xEJsRZDhF0gppijCvk2Jy6uCwY/UdCi6TfNZ ejNVfktGQ/7kAKXrqqXrDRZSVMUd+iH3ZIQpFOiBw3S9/C25f1KHGKskgqP2L+ZLqUlD wNTqPR3vSE/VEInhnDdY9xyQ2hWJWWV64JBrcBzNL3WXUJ0f3wMYnUYerxld05IhwDaK wOOCYfgQQobs0BIMRu8Idl6j48j6vGWi6E/MkNe8RBzrcKKXDKqEewHc5Xw+4pCSZmo0 /8uw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:content-transfer-encoding :message-id:date:subject:cc:to:from:dkim-signature; bh=HeP1XueRhjsHpWAhcUWUNoyK0cnx/fEeEMwzhs6stek=; fh=o0c9oRQqGcw+lX35liWyz4mGWjDo/97N/+E1+y9wprk=; b=or2DS8t5mLRclg/K0rQBykiBeE8KHlbwp++k05xSgVez588Qc36iQnsA+a8dEUTWIN NFptHW5S+X1mk6B5ptFelKVDnvNDHO4xc1rRqQh9f0r+EUvNWZkYt6klWWfWmbPx3lG7 LQNM71S7L/tzNX4kuJu7Lz8vnpjHS+4iNVAoBHqz0fvnoCFkVfdVOgqUhEkwjaR5lpbQ WJ+ZQBSZ74xuLse2FEToL0+RxOACwTVPhc1XXRrh/0wjEzSBwV3zmIGges7qgko7ezXu BopkHorggyrSKbc/kWBDp2K0WlSAj6VHnEfdg3OQODGg9ihHfdi5IrvLsVG/eVzNReH6 lmUg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=j0M+lOl+; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from snail.vger.email (snail.vger.email. [23.128.96.37]) by mx.google.com with ESMTPS id h11-20020a17090acf0b00b002839bc84a0esi3662278pju.139.2023.11.30.05.54.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 30 Nov 2023 05:54:03 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; Authentication-Results: mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=j0M+lOl+; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id A459B802795B; Thu, 30 Nov 2023 05:54:00 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345826AbjK3Nxo (ORCPT <rfc822;ruipengqi7@gmail.com> + 99 others); Thu, 30 Nov 2023 08:53:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40572 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345774AbjK3Nxa (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 30 Nov 2023 08:53:30 -0500 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 97900170C; Thu, 30 Nov 2023 05:53:35 -0800 (PST) Received: from pps.filterd (m0356516.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnXcU001367; Thu, 30 Nov 2023 13:53:25 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : content-transfer-encoding : mime-version; s=pp1; bh=HeP1XueRhjsHpWAhcUWUNoyK0cnx/fEeEMwzhs6stek=; b=j0M+lOl+IpGdqDyYvXUzR7gbQAgSqThCFgfMYqKNSKd8sbyOQiIWq+1rSYlpbnJ3T/1w UyxeBzbZEw+DYcO38YmgbMINrTYt1GhA2kJaUxURDpSWzF0Gmno2dgqpG2oCY7KatdDp tvq39mWK1Q7+n/43FwGJuq0m4XMGg31GJvBieIkV6++gBIBORymAPwoqAjXMun7RmB9R FR1UiaQ2+kOsywwmAsyfRnl193efWwThZC1jaKbht1dvmZVjRRZ3vQofcrOQtJBfO/T7 jwpcd0f5E0wtKZMt1/UWMtcmhfkG8Ef//wsoBrHsOOeg96ZT4dBM81ZtuCACOUdjJHp+ Dw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3b1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:25 +0000 Received: from m0356516.ppops.net (m0356516.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDniCu002718; Thu, 30 Nov 2023 13:53:24 GMT Received: from ppma13.dal12v.mail.ibm.com (dd.9e.1632.ip4.static.sl-reverse.com [50.22.158.221]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3ap-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:24 +0000 Received: from pps.filterd (ppma13.dal12v.mail.ibm.com [127.0.0.1]) by ppma13.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnBHU029486; Thu, 30 Nov 2023 13:53:23 GMT Received: from smtprelay05.fra02v.mail.ibm.com ([9.218.2.225]) by ppma13.dal12v.mail.ibm.com (PPS) with ESMTPS id 3ukwfke1qv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:23 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay05.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrLWM23528130 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:22 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D6CD82004D; Thu, 30 Nov 2023 13:53:21 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 7149B20043; Thu, 30 Nov 2023 13:53:19 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:19 +0000 (GMT) From: Ojaswin Mujoo <ojaswin@linux.ibm.com> To: linux-ext4@vger.kernel.org, "Theodore Ts'o" <tytso@mit.edu> Cc: Ritesh Harjani <ritesh.list@gmail.com>, linux-kernel@vger.kernel.org, "Darrick J . Wong" <djwong@kernel.org>, linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry <john.g.garry@oracle.com>, dchinner@redhat.com Subject: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Date: Thu, 30 Nov 2023 19:23:08 +0530 Message-Id: <cover.1701339358.git.ojaswin@linux.ibm.com> X-Mailer: git-send-email 2.39.3 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: hBD7kWuO7Nb_vj1mPT_00MLQichdLJ_H X-Proofpoint-ORIG-GUID: zeJyHyfHU1L2PEvisQEtnw9KtFa1_K0H Content-Transfer-Encoding: 8bit X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1011 bulkscore=0 mlxlogscore=999 adultscore=0 lowpriorityscore=0 impostorscore=0 suspectscore=0 phishscore=0 malwarescore=0 spamscore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_EF,RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Thu, 30 Nov 2023 05:54:00 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783997339758468181 X-GMAIL-MSGID: 1783997339758468181 |
Series |
ext4: Allocator changes for atomic write support with DIO
|
|
Message
Ojaswin Mujoo
Nov. 30, 2023, 1:53 p.m. UTC
This patch series builds on top of John Gary's atomic direct write patch series [1] and enables this support in ext4. This is a 2 step process: 1. Enable aligned allocation in ext4 mballoc. This allows us to allocate power-of-2 aligned physical blocks, which is needed for atomic writes. 2. Hook the direct IO path in ext4 to use aligned allocation to obtain physical blocks at a given alignment, which is needed for atomic IO. If for any reason we are not able to obtain blocks at given alignment we fail the atomic write. Currently this RFC does not impose any restrictions for atomic and non-atomic allocations to any inode, which also leaves policy decisions to user-space as much as possible. So, for example, the user space can: * Do an atomic direct IO at any alignment and size provided it satisfies underlying device constraints. The only restriction for now is that it should be power of 2 len and atleast of FS block size. * Do any combination of non atomic and atomic writes on the same file in any order. As long as the user space is passing the RWF_ATOMIC flag to pwritev2() it is guaranteed to do an atomic IO (or fail if not possible). There are some TODOs on the allocator side which are remaining like... 1. Fallback to original request size when normalized request size (due to preallocation) allocation is not possible. 2. Testing some edge cases. But since all the basic test scenarios were covered, hence we wanted to get this RFC out for discussion on atomic write support for DIO in ext4. Further points for discussion - 1. We might need an inode flag to identify that the inode has blocks/extents atomically allocated. So that other userspace tools do not move the blocks of the inode for e.g. during resize/fsck etc. a. Should inode be marked as atomic similar to how we have IS_DAX(inode) implementation? Any thoughts? 2. Should there be support for open flags like O_ATOMIC. So that in case if user wants to do only atomic writes to an open fd, then all writes can be considered atomic. 3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look like since say if there are block allocations done which were done atomically, it should not matter to FS w.r.t compatibility. 4. Mostly aligned allocations are required when we don't have data=journal mode. So should we return -EIO with data journalling mode for DIO request? Script to test using pwritev2() can be found here: https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9 Regards, ojaswin [1] https://lore.kernel.org/linux-fsdevel/20230929102726.2985188-1-john.g.garry@oracle.com Ojaswin Mujoo (7): iomap: Don't fall back to buffered write if the write is atomic ext4: Factor out size and start prediction from ext4_mb_normalize_request() ext4: add aligned allocation support in mballoc ext4: allow inode preallocation for aligned alloc block: export blkdev_atomic_write_valid() and refactor api ext4: Add aligned allocation support for atomic direct io ext4: Support atomic write for statx block/fops.c | 18 ++- fs/ext4/ext4.h | 10 +- fs/ext4/extents.c | 14 ++ fs/ext4/file.c | 49 ++++++ fs/ext4/inode.c | 142 ++++++++++++++++- fs/ext4/mballoc.c | 302 +++++++++++++++++++++++++----------- fs/iomap/direct-io.c | 8 +- include/linux/blkdev.h | 2 + include/trace/events/ext4.h | 2 + 9 files changed, 442 insertions(+), 105 deletions(-)
Comments
On 30/11/2023 13:53, Ojaswin Mujoo wrote: Thanks for putting this together. > This patch series builds on top of John Gary's atomic direct write > patch series [1] and enables this support in ext4. This is a 2 step > process: > > 1. Enable aligned allocation in ext4 mballoc. This allows us to allocate > power-of-2 aligned physical blocks, which is needed for atomic writes. > > 2. Hook the direct IO path in ext4 to use aligned allocation to obtain > physical blocks at a given alignment, which is needed for atomic IO. If > for any reason we are not able to obtain blocks at given alignment we > fail the atomic write. So are we supposed to be doing atomic writes on unwritten ranges only in the file to get the aligned allocations? I actually tried that, and I got a WARN triggered: # mkfs.ext4 /dev/sda mke2fs 1.46.5 (30-Dec-2021) Creating filesystem with 358400 1k blocks and 89760 inodes Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 Superblock backups stored on blocks: 8193, 24577, 40961, 57345, 73729, 204801, 221185 Allocating group tables: done Writing inode tables: done Creating journal (8192 blocks): done Writing superblocks and filesystem accounting information: done [ 12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left # mount /dev/sda mnt [ 12.798804] EXT4-fs (sda): mounted filesystem 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota mode: none. # touch mnt/file # # /test-statx -a /root/mnt/file statx(/root/mnt/file) = 0 dump_statx results=5fff Size: 0 Blocks: 0 IO Block: 1024 regular file Device: 08:00 Inode: 12 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2023-12-04 10:27:40.002848720+0000 Modify: 2023-12-04 10:27:40.002848720+0000 Change: 2023-12-04 10:27:40.002848720+0000 Birth: 2023-12-04 10:27:40.002848720+0000 stx_attributes_mask=0x703874 STATX_ATTR_WRITE_ATOMIC set unit min: 1024 uunit max: 524288 Attributes: 0000000000400000 (........ ........ ........ ........ ........ .?--.... ..---... .---.-..) # looks ok so far, then write 4KB at offset 0: # /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 [ 46.813720] ------------[ cut here ]------------ [ 46.814934] WARNING: CPU: 1 PID: 158 at fs/ext4/mballoc.c:2991 ext4_mb_regular_allocator+0xeca/0xf20 [ 46.816344] Modules linked in: [ 46.816831] CPU: 1 PID: 158 Comm: test-pwritev2 Not tainted 6.7.0-rc1-00038-gae3807f27e7d-dirty #968 [ 46.818220] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 [ 46.819886] RIP: 0010:ext4_mb_regular_allocator+0xeca/0xf20 [ 46.820734] Code: fd ff ff f0 41 ff 81 e4 03 00 00 e9 63 fd ff ff 90 0f 0b 90 e9 fe f3 ff ff 90 48 c7 c7 e2 7a b2 86 44 89 ca e8 f7 f1 d2 ff 90 <0f> 0b 90 90 45 8b 44 24 3c e9 d4 f3 ff ff 4d 8b 45 08 4c 89 c2 4d [ 46.823577] RSP: 0018:ffffb77dc056b7c0 EFLAGS: 00010286 [ 46.824379] RAX: 0000000000000000 RBX: ffff9b2ad77dea80 RCX: 0000000000000000 [ 46.825458] RDX: 0000000000000001 RSI: ffff9b2b3491b5c0 RDI: ffff9b2b3491b5c0 [ 46.826557] RBP: ffff9b2adc7cd000 R08: 0000000000000000 R09: c0000000ffffdfff [ 46.827634] R10: ffff9b2adcb9d780 R11: ffffb77dc056b648 R12: ffff9b2ac6778000 [ 46.828714] R13: ffff9b2adc7cd000 R14: ffff9b2adc7d0000 R15: 000000000000002a [ 46.829796] FS: 00007f726dece740(0000) GS:ffff9b2b34900000(0000) knlGS:0000000000000000 [ 46.830706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 46.831299] CR2: 0000000001ed72b8 CR3: 000000001c794006 CR4: 0000000000370ef0 [ 46.832041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 46.832813] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 46.833546] Call Trace: [ 46.833901] <TASK> [ 46.834163] ? __warn+0x78/0x130 [ 46.834504] ? ext4_mb_regular_allocator+0xeca/0xf20 [ 46.835037] ? report_bug+0xf8/0x1e0 [ 46.835527] ? console_unlock+0x45/0xd0 [ 46.835963] ? handle_bug+0x40/0x70 [ 46.836419] ? exc_invalid_op+0x13/0x70 [ 46.836865] ? asm_exc_invalid_op+0x16/0x20 [ 46.837329] ? ext4_mb_regular_allocator+0xeca/0xf20 [ 46.837852] ext4_mb_new_blocks+0x7e8/0xe60 [ 46.838382] ? __kmalloc+0x4b/0x130 [ 46.838824] ? __kmalloc+0x4b/0x130 [ 46.839243] ? ext4_find_extent+0x347/0x360 [ 46.839743] ext4_ext_map_blocks+0xc44/0xff0 [ 46.840395] ext4_map_blocks+0x162/0x5b0 [ 46.841010] ? jbd2__journal_start+0x84/0x1f0 [ 46.841694] ext4_map_blocks_aligned+0x20/0xa0 [ 46.842382] ext4_iomap_begin+0x1e9/0x320 [ 46.843006] iomap_iter+0x16d/0x350 [ 46.843554] __iomap_dio_rw+0x3be/0x830 [ 46.844150] iomap_dio_rw+0x9/0x30 [ 46.844680] ext4_file_write_iter+0x597/0x800 [ 46.845346] do_iter_readv_writev+0xe1/0x150 [ 46.846029] do_iter_write+0x86/0x1f0 [ 46.846638] vfs_writev+0x96/0x190 [ 46.847176] ? do_pwritev+0x98/0xd0 [ 46.847721] do_pwritev+0x98/0xd0 [ 46.848230] ? syscall_trace_enter.isra.19+0x130/0x1b0 [ 46.849028] do_syscall_64+0x42/0xf0 [ 46.849590] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 46.850405] RIP: 0033:0x7f726df9666f [ 46.850964] Code: d5 41 54 49 89 f4 55 89 fd 53 44 89 c3 48 83 ec 18 80 3d bb fd 0b 00 00 74 2a 45 89 c1 49 89 ca 45 31 c0 b8 48 01 00 00 0f 05 <48> 3d 00 f0 ff ff 76 5c 48 8b 15 7a 77 0b 00 f7 d8 64 89 02 48 83 [ 46.854020] RSP: 002b:00007fff28b9bff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000148 [ 46.855178] RAX: ffffffffffffffda RBX: 0000000000000024 RCX: 00007f726df9666f [ 46.856248] RDX: 0000000000000001 RSI: 00007fff28b9c050 RDI: 0000000000000003 [ 46.857303] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000024 [ 46.858365] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff28b9c050 [ 46.859407] R13: 0000000000000001 R14: 0000000000000000 R15: 00007f726e08aa60 [ 46.860448] </TASK> [ 46.860797] ---[ end trace 0000000000000000 ]--- [ 46.861497] EXT4-fs warning (device sda): ext4_map_blocks_aligned:520: Returned extent couldn't satisfy alignment requirements main wrote -1 bytes at offset 0 [ 46.863855] test-pwritev2 (158) used greatest stack depth: 11920 bytes left # Please note that I tested on my own dev branch, which contains changes over [1], but I expect it would not make a difference for this test. > > Currently this RFC does not impose any restrictions for atomic and non-atomic > allocations to any inode, which also leaves policy decisions to user-space > as much as possible. So, for example, the user space can: > > * Do an atomic direct IO at any alignment and size provided it > satisfies underlying device constraints. The only restriction for now > is that it should be power of 2 len and atleast of FS block size. > > * Do any combination of non atomic and atomic writes on the same file > in any order. As long as the user space is passing the RWF_ATOMIC flag > to pwritev2() it is guaranteed to do an atomic IO (or fail if not > possible). > > There are some TODOs on the allocator side which are remaining like... > > 1. Fallback to original request size when normalized request size (due to > preallocation) allocation is not possible. > 2. Testing some edge cases. > > But since all the basic test scenarios were covered, hence we wanted to get > this RFC out for discussion on atomic write support for DIO in ext4. > > Further points for discussion - > > 1. We might need an inode flag to identify that the inode has blocks/extents > atomically allocated. So that other userspace tools do not move the blocks of > the inode for e.g. during resize/fsck etc. > a. Should inode be marked as atomic similar to how we have IS_DAX(inode) > implementation? Any thoughts? > > 2. Should there be support for open flags like O_ATOMIC. So that in case if > user wants to do only atomic writes to an open fd, then all writes can be > considered atomic. > > 3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look > like since say if there are block allocations done which were done atomically, > it should not matter to FS w.r.t compatibility. > > 4. Mostly aligned allocations are required when we don't have data=journal > mode. So should we return -EIO with data journalling mode for DIO request? > > Script to test using pwritev2() can be found here: > https://urldefense.com/v3/__https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9__;!!ACWV5N9M2RV99hQ!LbVSb-43597CLDmYnhgOwH6MAcikusRh75-4fUbUrA_8go3B6JL1lWJPmhij8siPJE031qtQb6-bpdLEa1qrVA$ Please note that the posix_memalign() call in the program should PAGE align. Thanks, John
Hi John, Thanks for the review. On Mon, Dec 04, 2023 at 10:36:01AM +0000, John Garry wrote: > On 30/11/2023 13:53, Ojaswin Mujoo wrote: > > Thanks for putting this together. > > > This patch series builds on top of John Gary's atomic direct write > > patch series [1] and enables this support in ext4. This is a 2 step > > process: > > > > 1. Enable aligned allocation in ext4 mballoc. This allows us to allocate > > power-of-2 aligned physical blocks, which is needed for atomic writes. > > > > 2. Hook the direct IO path in ext4 to use aligned allocation to obtain > > physical blocks at a given alignment, which is needed for atomic IO. If > > for any reason we are not able to obtain blocks at given alignment we > > fail the atomic write. > > So are we supposed to be doing atomic writes on unwritten ranges only in the > file to get the aligned allocations? If we do an atomic write on a hole, ext4 will give us an aligned extent provided the hole is big enough to accomodate it. However, if we do an atomic write on a existing extent (written or unwritten) ext4 would check if it satisfies the alignment and length requirement and returns an error if it doesn't. Since we don't have cow like functionality afaik the only way we could let this kind of write go through is by punching the pre-existing extent which is not something we can directly do in the same write operation, hence we return error. > > I actually tried that, and I got a WARN triggered: > > # mkfs.ext4 /dev/sda > mke2fs 1.46.5 (30-Dec-2021) > Creating filesystem with 358400 1k blocks and 89760 inodes > Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 > Superblock backups stored on blocks: > 8193, 24577, 40961, 57345, 73729, 204801, 221185 > > Allocating group tables: done > Writing inode tables: done > Creating journal (8192 blocks): done > Writing superblocks and filesystem accounting information: done > > [ 12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left > # mount /dev/sda mnt > [ 12.798804] EXT4-fs (sda): mounted filesystem > 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota > mode: none. > # touch mnt/file > # > # /test-statx -a /root/mnt/file > statx(/root/mnt/file) = 0 > dump_statx results=5fff > Size: 0 Blocks: 0 IO Block: 1024 regular file > Device: 08:00 Inode: 12 Links: 1 > Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 > Access: 2023-12-04 10:27:40.002848720+0000 > Modify: 2023-12-04 10:27:40.002848720+0000 > Change: 2023-12-04 10:27:40.002848720+0000 > Birth: 2023-12-04 10:27:40.002848720+0000 > stx_attributes_mask=0x703874 > STATX_ATTR_WRITE_ATOMIC set > unit min: 1024 > uunit max: 524288 > Attributes: 0000000000400000 (........ ........ ........ ........ > ........ .?--.... ..---... .---.-..) > # > > > > looks ok so far, then write 4KB at offset 0: > > # /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file > file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 > [ 46.813720] ------------[ cut here ]------------ > [ 46.814934] WARNING: CPU: 1 PID: 158 at fs/ext4/mballoc.c:2991 > ext4_mb_regular_allocator+0xeca/0xf20 > [ 46.816344] Modules linked in: > [ 46.816831] CPU: 1 PID: 158 Comm: test-pwritev2 Not tainted > 6.7.0-rc1-00038-gae3807f27e7d-dirty #968 > [ 46.818220] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), > BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 > [ 46.819886] RIP: 0010:ext4_mb_regular_allocator+0xeca/0xf20 > [ 46.820734] Code: fd ff ff f0 41 ff 81 e4 03 00 00 e9 63 fd ff ff > 90 0f 0b 90 e9 fe f3 ff ff 90 48 c7 c7 e2 7a b2 86 44 89 ca e8 f7 f1 > d2 ff 90 <0f> 0b 90 90 45 8b 44 24 3c e9 d4 f3 ff ff 4d 8b 45 08 4c 89 > c2 4d > [ 46.823577] RSP: 0018:ffffb77dc056b7c0 EFLAGS: 00010286 > [ 46.824379] RAX: 0000000000000000 RBX: ffff9b2ad77dea80 RCX: > 0000000000000000 > [ 46.825458] RDX: 0000000000000001 RSI: ffff9b2b3491b5c0 RDI: > ffff9b2b3491b5c0 > [ 46.826557] RBP: ffff9b2adc7cd000 R08: 0000000000000000 R09: > c0000000ffffdfff > [ 46.827634] R10: ffff9b2adcb9d780 R11: ffffb77dc056b648 R12: > ffff9b2ac6778000 > [ 46.828714] R13: ffff9b2adc7cd000 R14: ffff9b2adc7d0000 R15: > 000000000000002a > [ 46.829796] FS: 00007f726dece740(0000) GS:ffff9b2b34900000(0000) > knlGS:0000000000000000 > [ 46.830706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 46.831299] CR2: 0000000001ed72b8 CR3: 000000001c794006 CR4: > 0000000000370ef0 > [ 46.832041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 46.832813] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [ 46.833546] Call Trace: > [ 46.833901] <TASK> > [ 46.834163] ? __warn+0x78/0x130 > [ 46.834504] ? ext4_mb_regular_allocator+0xeca/0xf20 > [ 46.835037] ? report_bug+0xf8/0x1e0 > [ 46.835527] ? console_unlock+0x45/0xd0 > [ 46.835963] ? handle_bug+0x40/0x70 > [ 46.836419] ? exc_invalid_op+0x13/0x70 > [ 46.836865] ? asm_exc_invalid_op+0x16/0x20 > [ 46.837329] ? ext4_mb_regular_allocator+0xeca/0xf20 > [ 46.837852] ext4_mb_new_blocks+0x7e8/0xe60 > [ 46.838382] ? __kmalloc+0x4b/0x130 > [ 46.838824] ? __kmalloc+0x4b/0x130 > [ 46.839243] ? ext4_find_extent+0x347/0x360 > [ 46.839743] ext4_ext_map_blocks+0xc44/0xff0 > [ 46.840395] ext4_map_blocks+0x162/0x5b0 > [ 46.841010] ? jbd2__journal_start+0x84/0x1f0 > [ 46.841694] ext4_map_blocks_aligned+0x20/0xa0 > [ 46.842382] ext4_iomap_begin+0x1e9/0x320 > [ 46.843006] iomap_iter+0x16d/0x350 > [ 46.843554] __iomap_dio_rw+0x3be/0x830 > [ 46.844150] iomap_dio_rw+0x9/0x30 > [ 46.844680] ext4_file_write_iter+0x597/0x800 > [ 46.845346] do_iter_readv_writev+0xe1/0x150 > [ 46.846029] do_iter_write+0x86/0x1f0 > [ 46.846638] vfs_writev+0x96/0x190 > [ 46.847176] ? do_pwritev+0x98/0xd0 > [ 46.847721] do_pwritev+0x98/0xd0 > [ 46.848230] ? syscall_trace_enter.isra.19+0x130/0x1b0 > [ 46.849028] do_syscall_64+0x42/0xf0 > [ 46.849590] entry_SYSCALL_64_after_hwframe+0x6f/0x77 > [ 46.850405] RIP: 0033:0x7f726df9666f > [ 46.850964] Code: d5 41 54 49 89 f4 55 89 fd 53 44 89 c3 48 83 ec > 18 80 3d bb fd 0b 00 00 74 2a 45 89 c1 49 89 ca 45 31 c0 b8 48 01 00 > 00 0f 05 <48> 3d 00 f0 ff ff 76 5c 48 8b 15 7a 77 0b 00 f7 d8 64 89 02 > 48 83 > [ 46.854020] RSP: 002b:00007fff28b9bff0 EFLAGS: 00000246 ORIG_RAX: > 0000000000000148 > [ 46.855178] RAX: ffffffffffffffda RBX: 0000000000000024 RCX: > 00007f726df9666f > [ 46.856248] RDX: 0000000000000001 RSI: 00007fff28b9c050 RDI: > 0000000000000003 > [ 46.857303] RBP: 0000000000000003 R08: 0000000000000000 R09: > 0000000000000024 > [ 46.858365] R10: 0000000000000000 R11: 0000000000000246 R12: > 00007fff28b9c050 > [ 46.859407] R13: 0000000000000001 R14: 0000000000000000 R15: > 00007f726e08aa60 > [ 46.860448] </TASK> > [ 46.860797] ---[ end trace 0000000000000000 ]--- > [ 46.861497] EXT4-fs warning (device sda): > ext4_map_blocks_aligned:520: Returned extent couldn't satisfy > alignment requirements > main wrote -1 bytes at offset 0 > [ 46.863855] test-pwritev2 (158) used greatest stack depth: 11920 bytes > left > # > > Please note that I tested on my own dev branch, which contains changes over > [1], but I expect it would not make a difference for this test. Hmm this should not ideally happen, can you please share your test script with me if possible? > > > > > > Currently this RFC does not impose any restrictions for atomic and non-atomic > > allocations to any inode, which also leaves policy decisions to user-space > > as much as possible. So, for example, the user space can: > > > > * Do an atomic direct IO at any alignment and size provided it > > satisfies underlying device constraints. The only restriction for now > > is that it should be power of 2 len and atleast of FS block size. > > > > * Do any combination of non atomic and atomic writes on the same file > > in any order. As long as the user space is passing the RWF_ATOMIC flag > > to pwritev2() it is guaranteed to do an atomic IO (or fail if not > > possible). > > > > There are some TODOs on the allocator side which are remaining like... > > > > 1. Fallback to original request size when normalized request size (due to > > preallocation) allocation is not possible. > > 2. Testing some edge cases. > > > > But since all the basic test scenarios were covered, hence we wanted to get > > this RFC out for discussion on atomic write support for DIO in ext4. > > > > Further points for discussion - > > > > 1. We might need an inode flag to identify that the inode has blocks/extents > > atomically allocated. So that other userspace tools do not move the blocks of > > the inode for e.g. during resize/fsck etc. > > a. Should inode be marked as atomic similar to how we have IS_DAX(inode) > > implementation? Any thoughts? > > > > 2. Should there be support for open flags like O_ATOMIC. So that in case if > > user wants to do only atomic writes to an open fd, then all writes can be > > considered atomic. > > > > 3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look > > like since say if there are block allocations done which were done atomically, > > it should not matter to FS w.r.t compatibility. > > > > 4. Mostly aligned allocations are required when we don't have data=journal > > mode. So should we return -EIO with data journalling mode for DIO request? > > > > Script to test using pwritev2() can be found here: > > https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9 > > Please note that the posix_memalign() call in the program should PAGE align. Why do you say that? direct IO seems to be working when the userspace buffer is 512 byte aligned, am I missing something? Regards, ojaswin PS: I'm on vacation this week so might be a bit slow to address all the review comments. > > Thanks, > John
On 04/12/2023 13:38, Ojaswin Mujoo wrote: >> So are we supposed to be doing atomic writes on unwritten ranges only in the >> file to get the aligned allocations? > If we do an atomic write on a hole, ext4 will give us an aligned extent > provided the hole is big enough to accomodate it. > > However, if we do an atomic write on a existing extent (written or > unwritten) ext4 would check if it satisfies the alignment and length > requirement and returns an error if it doesn't. This seems a rather big drawback. > Since we don't have cow > like functionality afaik the only way we could let this kind of write go > through is by punching the pre-existing extent which is not something we > can directly do in the same write operation, hence we return error. Well, as you prob saw, for XFS we were relying on forcing extent alignment, and not CoW (yet). > >> I actually tried that, and I got a WARN triggered: >> >> # mkfs.ext4 /dev/sda >> mke2fs 1.46.5 (30-Dec-2021) >> Creating filesystem with 358400 1k blocks and 89760 inodes >> Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 >> Superblock backups stored on blocks: >> 8193, 24577, 40961, 57345, 73729, 204801, 221185 >> >> Allocating group tables: done >> Writing inode tables: done >> Creating journal (8192 blocks): done >> Writing superblocks and filesystem accounting information: done >> >> [ 12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left >> # mount /dev/sda mnt >> [ 12.798804] EXT4-fs (sda): mounted filesystem >> 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota >> mode: none. >> # touch mnt/file >> # >> # /test-statx -a /root/mnt/file >> statx(/root/mnt/file) = 0 >> dump_statx results=5fff >> Size: 0 Blocks: 0 IO Block: 1024 regular file >> Device: 08:00 Inode: 12 Links: 1 >> Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 >> Access: 2023-12-04 10:27:40.002848720+0000 >> Modify: 2023-12-04 10:27:40.002848720+0000 >> Change: 2023-12-04 10:27:40.002848720+0000 >> Birth: 2023-12-04 10:27:40.002848720+0000 >> stx_attributes_mask=0x703874 >> STATX_ATTR_WRITE_ATOMIC set >> unit min: 1024 >> uunit max: 524288 >> Attributes: 0000000000400000 (........ ........ ........ ........ >> ........ .?--.... ..---... .---.-..) >> # >> >> >> >> looks ok so far, then write 4KB at offset 0: >> >> # /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file >> file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 ... >> Please note that I tested on my own dev branch, which contains changes over >> [1], but I expect it would not make a difference for this test. > Hmm this should not ideally happen, can you please share your test > script with me if possible? It's doing nothing special, just RWF_ATOMIC flag is set for DIO write: https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c >>> >>> Script to test using pwritev2() can be found here: >>> https://urldefense.com/v3/__https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9__;!!ACWV5N9M2RV99hQ!Lr0j4iDHrfXisXOGZ82HNPefBtVDDGe9zbLhey7rRDfPE7A_tsrrQ9Dw_4Ng_qS7xTGCZaEWBKtd6pqA_LIBfA$ >> Please note that the posix_memalign() call in the program should PAGE align. > Why do you say that? direct IO seems to be working when the userspace > buffer is 512 byte aligned, am I missing something? ah, sorry, if you just use 1x IOV vector then no special alignment are required, so ignore this. Indeed, I need to improve kernel checks for alignment anyway. Thanks, John
Hi John, On Mon, Dec 04, 2023 at 02:44:36PM +0000, John Garry wrote: > On 04/12/2023 13:38, Ojaswin Mujoo wrote: > > > So are we supposed to be doing atomic writes on unwritten ranges only in the > > > file to get the aligned allocations? > > If we do an atomic write on a hole, ext4 will give us an aligned extent > > provided the hole is big enough to accomodate it. > > > > However, if we do an atomic write on a existing extent (written or > > unwritten) ext4 would check if it satisfies the alignment and length > > requirement and returns an error if it doesn't. > > This seems a rather big drawback. So if I'm not wrong, we force the extent alignment as well as the size of the extent in xfs right. We didn't want to overly restrict the users of atomic writes by forcing the extents to be of a certain alignment/size irrespective of the size of write. The design in this patchset provides this flexibility at the cost of some added precautions that the user should take (eg not doing an atomic write on a pre existing unaligned extent etc). However, I don't think it should be too hard to provide an optional forced alignment feature on top of this if there's interest in that. > > > Since we don't have cow > > like functionality afaik the only way we could let this kind of write go > > through is by punching the pre-existing extent which is not something we > > can directly do in the same write operation, hence we return error. > > Well, as you prob saw, for XFS we were relying on forcing extent alignment, > and not CoW (yet). That's right. > > > > > > I actually tried that, and I got a WARN triggered: > > > > > > # mkfs.ext4 /dev/sda > > > mke2fs 1.46.5 (30-Dec-2021) > > > Creating filesystem with 358400 1k blocks and 89760 inodes > > > Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 > > > Superblock backups stored on blocks: > > > 8193, 24577, 40961, 57345, 73729, 204801, 221185 > > > > > > Allocating group tables: done > > > Writing inode tables: done > > > Creating journal (8192 blocks): done > > > Writing superblocks and filesystem accounting information: done > > > > > > [ 12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left > > > # mount /dev/sda mnt > > > [ 12.798804] EXT4-fs (sda): mounted filesystem > > > 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota > > > mode: none. > > > # touch mnt/file > > > # > > > # /test-statx -a /root/mnt/file > > > statx(/root/mnt/file) = 0 > > > dump_statx results=5fff > > > Size: 0 Blocks: 0 IO Block: 1024 regular file > > > Device: 08:00 Inode: 12 Links: 1 > > > Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 > > > Access: 2023-12-04 10:27:40.002848720+0000 > > > Modify: 2023-12-04 10:27:40.002848720+0000 > > > Change: 2023-12-04 10:27:40.002848720+0000 > > > Birth: 2023-12-04 10:27:40.002848720+0000 > > > stx_attributes_mask=0x703874 > > > STATX_ATTR_WRITE_ATOMIC set > > > unit min: 1024 > > > uunit max: 524288 > > > Attributes: 0000000000400000 (........ ........ ........ ........ > > > ........ .?--.... ..---... .---.-..) > > > # > > > > > > > > > > > > looks ok so far, then write 4KB at offset 0: > > > > > > # /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file > > > file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 > > ... > > > > Please note that I tested on my own dev branch, which contains changes over > > > [1], but I expect it would not make a difference for this test. > > Hmm this should not ideally happen, can you please share your test > > script with me if possible? > > It's doing nothing special, just RWF_ATOMIC flag is set for DIO write: > > https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c Thanks for the script, will try to replicate this today and get back to you. > > > > > > > > > Script to test using pwritev2() can be found here: > > > > https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9 > > > Please note that the posix_memalign() call in the program should PAGE align. > > Why do you say that? direct IO seems to be working when the userspace > > buffer is 512 byte aligned, am I missing something? > > ah, sorry, if you just use 1x IOV vector then no special alignment are > required, so ignore this. Indeed, I need to improve kernel checks for > alignment anyway. > > Thanks, > John > Regards, ojaswin
On 11/12/2023 10:54, Ojaswin Mujoo wrote: >> This seems a rather big drawback. > So if I'm not wrong, we force the extent alignment as well as the size > of the extent in xfs right. For XFS in my v1 patchset, we only force alignment (but not size). It is assumed that the user will fallocate/dd the complete file before issuing atomic writes, and we will have extent alignment and length as required. However - as we have seen with a trial user - it can create a problem if we don't do that and we write 4K and then overwrite with a 16K atomic write to a file, as 2x extents may be allocated for the complete 16K and it cannot be issued as a single BIO. > > We didn't want to overly restrict the users of atomic writes by forcing > the extents to be of a certain alignment/size irrespective of the size > of write. The design in this patchset provides this flexibility at the > cost of some added precautions that the user should take (eg not doing > an atomic write on a pre existing unaligned extent etc). Doesn't bigalloc already give you what you require here? Thanks, John
On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote: > It is assumed that the user will fallocate/dd the complete file before > issuing atomic writes, and we will have extent alignment and length as > required. I don't think that's a long time maintainable usage model.
On Tue, Dec 12, 2023 at 05:10:42AM -0800, Christoph Hellwig wrote: > On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote: > > It is assumed that the user will fallocate/dd the complete file before > > issuing atomic writes, and we will have extent alignment and length as > > required. > > I don't think that's a long time maintainable usage model. For databases that are trying to use this to significantly improve their performance by eliminating double writes, the allocation and writes are being done by a single process. So for *that* use case, it is quite maintainable. Cheers, - Ted
On Tue, Dec 12, 2023 at 10:16:13AM -0500, Theodore Ts'o wrote: > On Tue, Dec 12, 2023 at 05:10:42AM -0800, Christoph Hellwig wrote: > > On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote: > > > It is assumed that the user will fallocate/dd the complete file before > > > issuing atomic writes, and we will have extent alignment and length as > > > required. > > > > I don't think that's a long time maintainable usage model. > > For databases that are trying to use this to significantly improve > their performance by eliminating double writes, the allocation and > writes are being done by a single process. So for *that* use case, it > is quite maintainable. That's not the freaking point. We need to have proper kernel interfaces that don't rely on intimate knowledge and control of details. We need to build proper genral purpose interfaces and not layer hacks on top of hacks.
On 12/12/2023 13:10, Christoph Hellwig wrote: > On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote: >> It is assumed that the user will fallocate/dd the complete file before >> issuing atomic writes, and we will have extent alignment and length as >> required. > I don't think that's a long time maintainable usage model. ok, sure - as I may have mentioned, this was even causing issues in porting. I sent the v2 series without XFS support, but the same API proposal as before. Let's discuss any potential API changes there. Thanks, John
On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote: > On 11/12/2023 10:54, Ojaswin Mujoo wrote: > > > This seems a rather big drawback. > > So if I'm not wrong, we force the extent alignment as well as the size > > of the extent in xfs right. > > For XFS in my v1 patchset, we only force alignment (but not size). > > It is assumed that the user will fallocate/dd the complete file before > issuing atomic writes, and we will have extent alignment and length as > required. > > However - as we have seen with a trial user - it can create a problem if we > don't do that and we write 4K and then overwrite with a 16K atomic write to > a file, as 2x extents may be allocated for the complete 16K and it cannot be > issued as a single BIO. So currently, if we don't fallocate beforehand in xfs and the user tries to do the 16k overwrite to an offset having a 4k extent, how are we handling it? Here ext4 will return an error indicating atomic write can't happen at this particular offset. The way I see it is if the user passes atomic flag to pwritev2 and we are unable to ensure atomicity for any reason we return error, which seems like a fair approach for a generic interface. > > > > > We didn't want to overly restrict the users of atomic writes by > > forcing > > the extents to be of a certain alignment/size irrespective of the > > size > > of write. The design in this patchset provides this flexibility at > > the > > cost of some added precautions that the user should take (eg not > > doing > > an atomic write on a pre existing unaligned extent etc). > > Doesn't bigalloc already give you what you require here? Yes, but its an mkfs time feature and it also applies to each an every file which might not be desirable for all use cases. Regards, ojaswin
On Mon, Dec 11, 2023 at 04:24:14PM +0530, Ojaswin Mujoo wrote: > > > > looks ok so far, then write 4KB at offset 0: > > > > > > > > # /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file > > > > file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 > > > > ... > > > > > > Please note that I tested on my own dev branch, which contains changes over > > > > [1], but I expect it would not make a difference for this test. > > > Hmm this should not ideally happen, can you please share your test > > > script with me if possible? > > > > It's doing nothing special, just RWF_ATOMIC flag is set for DIO write: > > > > https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c > > Thanks for the script, will try to replicate this today and get back to > you. > Hi John, So I don't seem to be able to hit the warn on: $ touch mnt/testfile $ ./test-pwritev2 -a -d -p 0 -l 4096 mnt/testfile file=mnt/testfile write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 main wrote 4096 bytes at offset 0 $ filefrag -v mnt/testfile Filesystem type is: ef53 File size of testfile is 4096 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 32900.. 32900: 1: last,eof $ ./test-pwritev2 -a -d -p 8192 -l 8192 mnt/testfile file=mnt/testfile write_size=8192 offset=8192 o_flags=0x4002 wr_flags=0x24 main wrote 8192 bytes at offset 8192 $ filefrag -v mnt/testfile Filesystem type is: ef53 File size of mnt/testfile is 16384 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 32900.. 32900: 1: 1: 2.. 3: 33288.. 33289: 2: 32902: last,eof mnt/testfile: 2 extents found Not sure why you are hitting the WARN_ON. The tree I used is: Latest ted/dev + your atomic patchset v1 + this patchset Regards, ojaswin
On 13/12/2023 05:59, Ojaswin Mujoo wrote: >> However - as we have seen with a trial user - it can create a problem if we >> don't do that and we write 4K and then overwrite with a 16K atomic write to >> a file, as 2x extents may be allocated for the complete 16K and it cannot be >> issued as a single BIO. > So currently, if we don't fallocate beforehand in xfs and the user > tries to do the 16k overwrite to an offset having a 4k extent, how are > we handling it? > > Here ext4 will return an error indicating atomic write can't happen at > this particular offset. The way I see it is if the user passes atomic > flag to pwritev2 and we are unable to ensure atomicity for any reason we > return error, which seems like a fair approach for a generic interface. ok, but this does not seem like a good user experience. > >>> We didn't want to overly restrict the users of atomic writes by >>> forcing >>> the extents to be of a certain alignment/size irrespective of the >>> size >>> of write. The design in this patchset provides this flexibility at >>> the >>> cost of some added precautions that the user should take (eg not >>> doing >>> an atomic write on a pre existing unaligned extent etc). >> Doesn't bigalloc already give you what you require here? > Yes, but its an mkfs time feature and it also applies to each an every > file which might not be desirable for all use cases. Sure, but to get started could you initially support that option (as long as it does not conflict with other per-file option)? Thanks, John
On 13/12/2023 06:42, Ojaswin Mujoo wrote: > Hi John, > > So I don't seem to be able to hit the warn on: > > $ touch mnt/testfile > $ ./test-pwritev2 -a -d -p 0 -l 4096 mnt/testfile > > file=mnt/testfile write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 > main wrote 4096 bytes at offset 0 > > $ filefrag -v mnt/testfile > > Filesystem type is: ef53 > File size of testfile is 4096 (1 block of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 32900.. 32900: 1: last,eof > > $ ./test-pwritev2 -a -d -p 8192 -l 8192 mnt/testfile > > file=mnt/testfile write_size=8192 offset=8192 o_flags=0x4002 wr_flags=0x24 > main wrote 8192 bytes at offset 8192 > > $ filefrag -v mnt/testfile > > Filesystem type is: ef53 > File size of mnt/testfile is 16384 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 32900.. 32900: 1: > 1: 2.. 3: 33288.. 33289: 2: 32902: last,eof > mnt/testfile: 2 extents found > > Not sure why you are hitting the WARN_ON. The tree I used is: > > Latest ted/dev + your atomic patchset v1 + this patchset What commit/tree is ted/dev exactly? Anyway, my v2 series is at https://github.com/johnpgarry/linux/commits/atomic-writes-v6.7-v2-blk I'll try that later for ext4. Thanks, John