Message ID: 20230202204428.3267832-1-willy@infradead.org
Headers:
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>, linux-afs@lists.infradead.org, linux-btrfs@vger.kernel.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, Hugh Dickins <hughd@google.com>, linux-kernel@vger.kernel.org, fstests@vger.kernel.org
Subject: [PATCH 0/5] Fix a minor POSIX conformance problem
Date: Thu, 2 Feb 2023 20:44:22 +0000
Message-Id: <20230202204428.3267832-1-willy@infradead.org>
Series: Fix a minor POSIX conformance problem
Message
Matthew Wilcox
Feb. 2, 2023, 8:44 p.m. UTC
POSIX requires that on ftruncate() expansion, the new bytes must read as zeroes. If someone's mmap()ed the file and stored past EOF, for most filesystems the bytes in that page will be not-zero. It's a pretty minor violation; someone could race you and write to the file between the ftruncate() call and you reading from it, but it's a bit of a QOI violation.

I've tested xfs (passes before & after), ext4 and tmpfs (both fail before, pass after). Testing from other FS developers appreciated. fstest to follow; not sure how to persuade git-send-email to work on multiple repositories

Matthew Wilcox (Oracle) (5):
  truncate: Zero bytes after 'oldsize' if we're expanding the file
  ext4: Zero bytes after 'oldsize' if we're expanding the file
  tmpfs: Zero bytes after 'oldsize' if we're expanding the file
  afs: Zero bytes after 'oldsize' if we're expanding the file
  btrfs: Zero bytes after 'oldsize' if we're expanding the file

 fs/afs/inode.c   | 2 ++
 fs/btrfs/inode.c | 1 +
 fs/ext4/inode.c  | 1 +
 mm/shmem.c       | 2 ++
 mm/truncate.c    | 7 +++++--
 5 files changed, 11 insertions(+), 2 deletions(-)
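The scenario in the cover letter can be sketched as a small user-space check. This is a hypothetical illustration, not the fstest mentioned above: the function name `stale_bytes_after_extend`, the `/tmp` path and the half-page EOF position are assumptions. It places EOF in the middle of page 0, stores non-zero bytes past EOF through a shared mapping, expands the file with ftruncate() to the end of that page, and counts non-zero bytes in the new range with pread().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical reproducer (not the fstest from this thread):
 * 1. create a file whose EOF falls in the middle of page 0,
 * 2. mmap() page 0 and store non-zero bytes past EOF,
 * 3. ftruncate() the file up to the end of that page,
 * 4. pread() the newly added range and count non-zero bytes.
 * Returns that count (0 means the new bytes read as zeroes), or -1 on error.
 */
static long stale_bytes_after_extend(void)
{
    char path[] = "/tmp/trunc-zero-XXXXXX";   /* assumed writable directory */
    long pagesize = sysconf(_SC_PAGESIZE);
    off_t oldsize = pagesize / 2;              /* EOF in the middle of page 0 */
    size_t len = (size_t)(pagesize - oldsize); /* bytes added by the expansion */
    long result = -1;
    char *map, *buf;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);  /* the file stays alive until fd is closed */

    if (ftruncate(fd, oldsize) == 0) {
        map = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            /* Past EOF, but inside the last partial page: no SIGBUS. */
            memset(map + oldsize, 0xAA, len);
            munmap(map, (size_t)pagesize);

            buf = malloc(len);
            /* Expand the file, then read back the range it just gained. */
            if (buf && ftruncate(fd, pagesize) == 0 &&
                pread(fd, buf, len, oldsize) == (ssize_t)len) {
                result = 0;
                for (size_t i = 0; i < len; i++)
                    if (buf[i] != 0)
                        result++;
            }
            free(buf);
        }
    }
    close(fd);
    return result;
}
```

Going by the test results in the cover letter, ext4 and tmpfs would be expected to report a non-zero count before the fix and zero after it. Because page writeback zeroes the bytes beyond EOF (as noted later in the thread), a zero count can also simply mean writeback ran first, so it is not proof that a kernel has the fix.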
Comments
On Feb 2, 2023, at 1:44 PM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
>
> POSIX requires that on ftruncate() expansion, the new bytes must read
> as zeroes. If someone's mmap()ed the file and stored past EOF, for
> most filesystems the bytes in that page will be not-zero. It's a
> pretty minor violation; someone could race you and write to the file
> between the ftruncate() call and you reading from it, but it's a bit
> of a QOI violation.

Is it possible to have mmap return SIGBUS for the writes beyond EOF?
On the one hand, that might indicate incorrect behavior of the application,
and on the other hand, it seems possible that the application doesn't
know it is writing beyond EOF and expects that data to be read back OK?

What happens if it is writing beyond EOF, but the block hasn't even been
allocated because PAGE_SIZE > blocksize?

IMHO, this seems better to stop the root of the problem (mmap() allowing
bad writes), rather than trying to fix it after the fact.

Cheers, Andreas
On Thu, Feb 02, 2023 at 04:08:49PM -0700, Andreas Dilger wrote:
> On Feb 2, 2023, at 1:44 PM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> >
> > POSIX requires that on ftruncate() expansion, the new bytes must read
> > as zeroes. If someone's mmap()ed the file and stored past EOF, for
> > most filesystems the bytes in that page will be not-zero. It's a
> > pretty minor violation; someone could race you and write to the file
> > between the ftruncate() call and you reading from it, but it's a bit
> > of a QOI violation.
>
> Is it possible to have mmap return SIGBUS for the writes beyond EOF?

Well, no. The hardware only tells us about accesses on a per-page
basis. We could SIGBUS on writes that _start_ after EOF, but this
test doesn't do that (it starts before EOF and extends past EOF).
And once the page is mapped writable, there's no page fault taken
for subsequent writes.

> On the one hand, that might indicate incorrect behavior of the application,
> and on the other hand, it seems possible that the application doesn't
> know it is writing beyond EOF and expects that data to be read back OK?

POSIX says:

"The system shall always zero-fill any partial page at the end of an
object. Further, the system shall never write out any modified portions
of the last page of an object which are beyond its end. References
within the address range starting at pa and continuing for len bytes to
whole pages following the end of an object shall result in delivery of
a SIGBUS signal."

https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html

So the application can't expect to read back anything it's written (and if you look at page writeback, we currently zero beyond EOF at writeback time).

> IMHO, this seems better to stop the root of the problem (mmap() allowing
> bad writes), rather than trying to fix it after the fact.

That would be nice, but we're rather stuck with the hardware that exists. IIUC Cray-1 had byte-granularity range registers, but page-granularity is what we have.
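The page-granularity point above can be observed from user space. A minimal sketch, assuming Linux and a writable `/tmp`; the names `probe_eof_faults`, `store_faults` and `on_sigbus` are hypothetical. It maps two pages of a one-byte file, then stores past EOF inside page 0 and into page 1. Per the POSIX text quoted above, only the store to page 1, which lies wholly beyond EOF, should raise SIGBUS; the store into the partial page 0 goes through silently.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* SIGBUS handler: jump back to the probe that triggered the fault. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);
}

/* Store one byte at map[off]; return 1 if the store raised SIGBUS. */
static int store_faults(volatile char *map, size_t off)
{
    if (sigsetjmp(fault_env, 1))  /* 1: restore the signal mask on jump */
        return 1;
    map[off] = 'x';
    return 0;
}

/*
 * Map two pages of a one-byte file and probe:
 *   *partial: store at offset 100, past EOF but inside page 0
 *   *whole:   store at offset pagesize, page 1 lies wholly past EOF
 * Each result is 1 if SIGBUS was delivered, else 0.
 * Returns 0 on success, -1 if setup fails.
 */
static int probe_eof_faults(int *partial, int *whole)
{
    char path[] = "/tmp/sigbus-probe-XXXXXX";  /* assumed writable directory */
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    char *map;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 1) != 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, 2 * pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping keeps the file alive */
    if (map == MAP_FAILED)
        return -1;

    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGBUS, &sa, &old) != 0) {
        munmap(map, 2 * pagesize);
        return -1;
    }
    *partial = store_faults(map, 100);
    *whole = store_faults(map, pagesize);
    sigaction(SIGBUS, &old, NULL);  /* restore the previous handler */
    munmap(map, 2 * pagesize);
    return 0;
}
```

Using sigsetjmp() with a non-zero second argument matters here: the kernel blocks SIGBUS while the handler runs, and restoring the saved mask on siglongjmp() keeps a later probe from hitting a blocked synchronous signal.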
From: Matthew Wilcox
> Sent: 03 February 2023 13:21
>
> On Thu, Feb 02, 2023 at 04:08:49PM -0700, Andreas Dilger wrote:
> > On Feb 2, 2023, at 1:44 PM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> > >
> > > POSIX requires that on ftruncate() expansion, the new bytes must read
> > > as zeroes. If someone's mmap()ed the file and stored past EOF, for
> > > most filesystems the bytes in that page will be not-zero. It's a
> > > pretty minor violation; someone could race you and write to the file
> > > between the ftruncate() call and you reading from it, but it's a bit
> > > of a QOI violation.
> >
> > Is it possible to have mmap return SIGBUS for the writes beyond EOF?
>
> Well, no. The hardware only tells us about accesses on a per-page
> basis. We could SIGBUS on writes that _start_ after EOF, but this
> test doesn't do that (it starts before EOF and extends past EOF).
> And once the page is mapped writable, there's no page fault taken
> for subsequent writes.
>
> > On the one hand, that might indicate incorrect behavior of the application,
> > and on the other hand, it seems possible that the application doesn't
> > know it is writing beyond EOF and expects that data to be read back OK?
>
> POSIX says:
>
> "The system shall always zero-fill any partial page at the end of an
> object. Further, the system shall never write out any modified portions
> of the last page of an object which are beyond its end. References
> within the address range starting at pa and continuing for len bytes to
> whole pages following the end of an object shall result in delivery of
> a SIGBUS signal."
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html

It also says (down at the bottom of the rationale):

"The mmap() function can be used to map a region of memory that is larger
than the current size of the object. Memory access within the mapping but
beyond the current end of the underlying objects may result in SIGBUS
signals being sent to the process. The reason for this is that the size
of the object can be manipulated by other processes and can change at any
moment. The implementation should tell the application that a memory
reference is outside the object where this can be detected; otherwise,
written data may be lost and read data may not reflect actual data in the
object."

There are a lot of 'may's in that sentence.
Note that it only says that 'data written beyond the current eof may be
lost'.
I think that could be taken as taking precedence over the zeroing clause
in ftruncate().

I'd bet a lot of beer that the original SYSV implementation (on which the
description is based) didn't zero the page buffer when ftruncate()
increased the file size.
Whether anything (important) actually relies on that is an interesting
question!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, Feb 03, 2023 at 04:23:32PM +0000, David Laight wrote:
> From: Matthew Wilcox
> > "The system shall always zero-fill any partial page at the end of an
> > object. Further, the system shall never write out any modified portions
> > of the last page of an object which are beyond its end. References
> > within the address range starting at pa and continuing for len bytes to
> > whole pages following the end of an object shall result in delivery of
> > a SIGBUS signal."
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html
>
> It also says (down at the bottom of the rationale):
>
> "The mmap() function can be used to map a region of memory that is larger
> than the current size of the object. Memory access within the mapping but
> beyond the current end of the underlying objects may result in SIGBUS
> signals being sent to the process. The reason for this is that the size
> of the object can be manipulated by other processes and can change at any
> moment. The implementation should tell the application that a memory
> reference is outside the object where this can be detected; otherwise,
> written data may be lost and read data may not reflect actual data in the
> object."
>
> There are a lot of 'may's in that sentence.
> Note that it only says that 'data written beyond the current eof may be
> lost'.
> I think that could be taken as taking precedence over the zeroing clause
> in ftruncate().

How can the _rationale_ (explicitly labelled as informative) for one function take precedence over the requirements for another function? This is nonsense.

> I'd bet a lot of beer that the original SYSV implementation (on which the
> description is based) didn't zero the page buffer when ftruncate()
> increased the file size.
> Whether anything (important) actually relies on that is an interesting
> question!