Message ID | 20230621083823.1724337-1-p.raghav@samsung.com |
---|---|
Headers | From: Pankaj Raghav <p.raghav@samsung.com>; To: hare@suse.de, willy@infradead.org, david@fromorbit.com; Cc: gost.dev@samsung.com, mcgrof@kernel.org, hch@lst.de, jwong@kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org; Subject: [RFC 0/4] minimum folio order support in filemap; Date: Wed, 21 Jun 2023 10:38:19 +0200 |
Series | minimum folio order support in filemap | |
Message
Pankaj Raghav
June 21, 2023, 8:38 a.m. UTC
There has been a lot of discussion recently about supporting devices and filesystems with bs > ps. One of the key pieces of plumbing needed for buffered IO is enforcing a minimum order when allocating folios in the page cache.

Hannes recently sent a series[1] in which he deduces the minimum folio order from i_blkbits in struct inode. Based on the discussion in that thread, this series takes a different approach: the minimum and maximum folio order can be set individually per inode.

This series is based on top of Christoph's patches that add iomap aops for the block cache[2]. I rebased his remaining patches onto next-20230621. The whole tree can be found here[3].

Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do buffered IO on an NVMe drive with bs > ps in QEMU without any issues:

    [root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
    16384
    [root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
    io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
    fio-3.34
    Starting 1 process
    Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
    io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
      read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
    <snip>
    Run status group 0 (all jobs):
       READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
      WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec

    Disk stats (read/write):
      nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%

One of the main dependencies for working on a block device with bs > ps is Christoph's work converting the block device aops to use iomap.

[1] https://lwn.net/Articles/934651/
[2] https://lwn.net/ml/linux-kernel/20230424054926.26927-1-hch@lst.de/
[3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1

Luis Chamberlain (1):
  block: set mapping order for the block cache in set_init_blocksize

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (2):
  filemap: use minimum order while allocating folios
  nvme: enable logical block size > PAGE_SIZE

    block/bdev.c             |  9 ++++++++
    drivers/nvme/host/core.c |  2 +-
    include/linux/pagemap.h  | 46 ++++++++++++++++++++++++++++++++++++----
    mm/filemap.c             |  9 +++++---
    mm/readahead.c           | 34 ++++++++++++++++++++---------
    5 files changed, 82 insertions(+), 18 deletions(-)
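To make the per-inode minimum/maximum order idea concrete, here is a minimal sketch of what the block-device side could look like. This is an illustration only: a pagemap helper such as mapping_set_folio_order_range() and the MAX_PAGECACHE_ORDER upper bound are assumed for the sketch; the actual helper names and plumbing in the series may differ.

```c
/*
 * Illustrative sketch only -- not code from this series. It assumes a
 * helper along the lines of mapping_set_folio_order_range() exists to
 * record the allowed folio order range in the address_space, so that
 * filemap and readahead never allocate folios smaller than one logical
 * block.
 */
#include <linux/blkdev.h>
#include <linux/log2.h>
#include <linux/pagemap.h>

static void demo_set_bdev_min_folio_order(struct block_device *bdev)
{
	struct address_space *mapping = bdev->bd_inode->i_mapping;
	unsigned int lbs = bdev_logical_block_size(bdev);
	unsigned int min_order = 0;

	/* e.g. 16k logical blocks on 4k pages -> minimum folio order 2 */
	if (lbs > PAGE_SIZE)
		min_order = ilog2(lbs) - PAGE_SHIFT;

	/* Assumed helper: clamp folio allocations to [min, max] order. */
	mapping_set_folio_order_range(mapping, min_order, MAX_PAGECACHE_ORDER);
}
```

The point of such a constraint is that a single logical block always fits inside one folio, which is what allows the bs > ps fio run above to work with CONFIG_BUFFER_HEAD=n.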
Comments
On 6/21/23 10:38, Pankaj Raghav wrote:
> There has been a lot of discussion recently to support devices and fs for
> bs > ps. One of the main plumbing to support buffered IO is to have a minimum
> order while allocating folios in the page cache.
[...]

Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
of this work) to get 'brd' running with large folios.
And it even works this time, 'fsx' from the xfstest suite runs happily on
that.

Guess we'll need to reconcile our patches.

Cheers,

Hannes
On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
> On 6/21/23 10:38, Pankaj Raghav wrote:
> > There has been a lot of discussion recently to support devices and fs for
> > bs > ps. One of the main plumbing to support buffered IO is to have a minimum
> > order while allocating folios in the page cache.
[...]
>
> Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
> of this work) to get 'brd' running with large folios.
> And it even works this time, 'fsx' from the xfstest suite runs happily on
> that.

So you've converted a filesystem to use bs > ps, too? Or is the
filesystem that fsx is running on just using normal 4kB block size?
If the latter, then fsx is not actually testing the large folio page
cache support, it's mostly just doing 4kB aligned IO to brd....

Cheers,

Dave.
On 6/22/23 00:07, Dave Chinner wrote:
> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>> There has been a lot of discussion recently to support devices and fs for
>>> bs > ps. One of the main plumbing to support buffered IO is to have a minimum
>>> order while allocating folios in the page cache.
[...]
>>
>> Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
>> of this work) to get 'brd' running with large folios.
>> And it even works this time, 'fsx' from the xfstest suite runs happily on
>> that.
>
> So you've converted a filesystem to use bs > ps, too? Or is the
> filesystem that fsx is running on just using normal 4kB block size?
> If the latter, then fsx is not actually testing the large folio page
> cache support, it's mostly just doing 4kB aligned IO to brd....
>
I have been running fsx on an xfs with bs=16k, and it worked like a charm.
I'll try to run the xfstest suite once I'm finished with merging
Pankajs patches into my patchset.

Cheers,

Hannes
On 6/22/23 07:51, Hannes Reinecke wrote:
> On 6/22/23 00:07, Dave Chinner wrote:
>> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>>> There has been a lot of discussion recently to support devices and fs for
>>>> bs > ps. One of the main plumbing to support buffered IO is to have a minimum
>>>> order while allocating folios in the page cache.
[...]
>>>
>>> Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
>>> of this work) to get 'brd' running with large folios.
>>> And it even works this time, 'fsx' from the xfstest suite runs happily on
>>> that.
>>
>> So you've converted a filesystem to use bs > ps, too? Or is the
>> filesystem that fsx is running on just using normal 4kB block size?
>> If the latter, then fsx is not actually testing the large folio page
>> cache support, it's mostly just doing 4kB aligned IO to brd....
>>
> I have been running fsx on an xfs with bs=16k, and it worked like a charm.
> I'll try to run the xfstest suite once I'm finished with merging
> Pankajs patches into my patchset.
>
Well, would've been too easy.
'fsx' bails out at test 27 (collapse), with:

XFS (ram0): Corruption detected. Unmount and run xfs_repair
XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
of file fs/xfs/libxfs/xfs_bmap.c. Caller
xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]

Guess some more work needs to be done here.

Cheers,

Hannes
On Thu, Jun 22, 2023 at 08:50:06AM +0200, Hannes Reinecke wrote:
> On 6/22/23 07:51, Hannes Reinecke wrote:
> > On 6/22/23 00:07, Dave Chinner wrote:
> > > On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
> > > > On 6/21/23 10:38, Pankaj Raghav wrote:
> > > > Hmm. Most unfortunate; I've just finished my own patchset
> > > > (duplicating much
> > > > of this work) to get 'brd' running with large folios.
> > > > And it even works this time, 'fsx' from the xfstest suite runs
> > > > happily on
> > > > that.
> > >
> > > So you've converted a filesystem to use bs > ps, too? Or is the
> > > filesystem that fsx is running on just using normal 4kB block size?
> > > If the latter, then fsx is not actually testing the large folio page
> > > cache support, it's mostly just doing 4kB aligned IO to brd....
> > >
> > I have been running fsx on an xfs with bs=16k, and it worked like a charm.
> > I'll try to run the xfstest suite once I'm finished with merging
> > Pankajs patches into my patchset.
>
> Well, would've been too easy.
> 'fsx' bails out at test 27 (collapse), with:
>
> XFS (ram0): Corruption detected. Unmount and run xfs_repair
> XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
> of file fs/xfs/libxfs/xfs_bmap.c. Caller
> xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
>
> Guess some more work needs to be done here.

Yup, start by trying to get the fstests that run fsx through cleanly
first. That'll get you through the first 100,000 or so test ops
in a few different run configs. Those canned tests are:

tests/generic/075
tests/generic/112
tests/generic/127
tests/generic/231
tests/generic/455
tests/generic/457

Cheers,

Dave.
On 6/22/23 12:20, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 08:50:06AM +0200, Hannes Reinecke wrote:
>> On 6/22/23 07:51, Hannes Reinecke wrote:
>>> On 6/22/23 00:07, Dave Chinner wrote:
>>>> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>>>>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>>>> Hmm. Most unfortunate; I've just finished my own patchset
>>>>> (duplicating much
>>>>> of this work) to get 'brd' running with large folios.
>>>>> And it even works this time, 'fsx' from the xfstest suite runs
>>>>> happily on
>>>>> that.
>>>>
>>>> So you've converted a filesystem to use bs > ps, too? Or is the
>>>> filesystem that fsx is running on just using normal 4kB block size?
>>>> If the latter, then fsx is not actually testing the large folio page
>>>> cache support, it's mostly just doing 4kB aligned IO to brd....
>>>>
>>> I have been running fsx on an xfs with bs=16k, and it worked like a charm.
>>> I'll try to run the xfstest suite once I'm finished with merging
>>> Pankajs patches into my patchset.
>>
>> Well, would've been too easy.
>> 'fsx' bails out at test 27 (collapse), with:
>>
>> XFS (ram0): Corruption detected. Unmount and run xfs_repair
>> XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
>> of file fs/xfs/libxfs/xfs_bmap.c. Caller
>> xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
>>
>> Guess some more work needs to be done here.
>
> Yup, start by trying to get the fstests that run fsx through cleanly
> first. That'll get you through the first 100,000 or so test ops
> in a few different run configs. Those canned tests are:
>
> tests/generic/075
> tests/generic/112
> tests/generic/127
> tests/generic/231
> tests/generic/455
> tests/generic/457
>
THX.

Any preferences for the filesystem size?
I'm currently running off two ramdisks with 512M each; if that's too small I
need to increase the memory of the VM ...

Cheers,

Hannes
On Thu, Jun 22, 2023 at 12:23:10PM +0200, Hannes Reinecke wrote:
> On 6/22/23 12:20, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 08:50:06AM +0200, Hannes Reinecke wrote:
[...]
> > > Well, would've been too easy.
> > > 'fsx' bails out at test 27 (collapse), with:
> > >
> > > XFS (ram0): Corruption detected. Unmount and run xfs_repair
> > > XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
> > > of file fs/xfs/libxfs/xfs_bmap.c. Caller
> > > xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
> > >
> > > Guess some more work needs to be done here.
> >
> > Yup, start by trying to get the fstests that run fsx through cleanly
> > first. That'll get you through the first 100,000 or so test ops
> > in a few different run configs. Those canned tests are:
> >
> > tests/generic/075
> > tests/generic/112
> > tests/generic/127
> > tests/generic/231
> > tests/generic/455
> > tests/generic/457
> >
> THX.
>
> Any preferences for the filesystem size?
> I'm currently running off two ramdisks with 512M each; if that's too small I
> need to increase the memory of the VM ...

I generally run my pmem/ramdisk VM on a pair of 8GB ramdisks for 4kB
filesystem testing. Because you are using larger block sizes, you are
going to want to use larger rather than smaller because there are fewer
blocks for a given size, and metadata blocks hold many more records
before they spill to multiple nodes/levels.

e.g. going from 4kB to 16kB needs a 16x larger fs and file sizes for the
16kB filesystem to exercise the same metadata tree depth coverage as the
4kB filesystem (i.e. each single block extent is 4x larger, each single
block metadata block holds 4x as much metadata before it spills).

With this in mind, I'd say you want the 16kB block size ramdisks to be
as large as you can make them when running fstests....

Cheers,

Dave.
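As a rough, back-of-the-envelope reading of the sizing argument above (the 4x and 16x factors come from the reply itself; the concrete sizes are only worked arithmetic, not recommendations):

$$
\frac{16\ \mathrm{kB}}{4\ \mathrm{kB}} = 4
\qquad\Rightarrow\qquad
\underbrace{4}_{\text{extent size}} \times \underbrace{4}_{\text{records per metadata block}} = 16\times \text{ larger fs/file sizes}
$$

So matching the metadata tree depth coverage of an 8 GB, 4 kB-block ramdisk would take a 16 kB-block filesystem of roughly 8 GB × 16 = 128 GB, while a 512 MB ramdisk at 16 kB exercises only about what a 32 MB filesystem does at 4 kB, hence the advice to make the 16 kB ramdisks as large as possible.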