[RFC,00/23] Enable block size > page size in XFS

Message ID 20230915183848.1018717-1-kernel@pankajraghav.com

Pankaj Raghav (Samsung) Sept. 15, 2023, 6:38 p.m. UTC
  From: Pankaj Raghav <p.raghav@samsung.com>

There have been efforts over the last 16 years to enable Large Block
Sizes (LBS), that is, block sizes in filesystems where bs > page size
[1] [2]. Through these efforts we have learned that one of the main
blockers to supporting bs > ps in filesystems has been the lack of a way
to allocate pages in the page cache that are at least the filesystem
block size [3]. Another blocker was the changes required in filesystems
due to buffer-heads. Thanks to these previous efforts, the surgery by
Matthew Wilcox in the page cache to adopt xarray's multi-index support,
and iomap support, supporting bs > ps in XFS is now possible with only a
few lines of change to XFS. Most of the changes are to the page cache,
to support a minimum folio order matching the target block size of the
filesystem.

A new motivation for LBS today is to support high-capacity (many
terabytes) QLC SSDs where the internal Indirection Unit (IU) is
typically larger than 4k [4], which helps reduce DRAM and in turn cost
and space. In practice this allows different architectures to use a
base page size of 4k while still supporting block sizes aligned to the
larger IUs by relying on high-order folios in the page cache when
needed. It also makes it possible to take advantage of these same
drives' support for atomic writes larger than 4k with buffered IO in
Linux. As described this year at LSFMM, supporting atomics larger than
4k lets databases stop relying on their own journaling, so they can
disable double buffered writes [5], a feature that different cloud
providers are already innovating on and enabling for customers through
custom storage solutions.

This series still needs some polishing and has some crashes to fix, but
it is mainly intended to get initial feedback from the community and to
enable initial experimentation, hence the RFC. It's being posted now
because the results from our testing are much better than expected, and
we hope to polish this up together with the community. After all, this
has been a 16-year-old effort and none of this would have been possible
without that effort.

Implementation:

This series only adds the notion of a minimum order of a folio in the
page cache that was initially proposed by Willy. The minimum folio order
requirement is set during inode creation. The minimum order will
typically correspond to the filesystem block size. The page cache will
in turn respect the minimum folio order requirement while allocating a
folio. This series mainly changes the page cache's filemap, readahead, and
truncation code to allocate and align the folios to the minimum order set for the
filesystem's inode's respective address space mapping.
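
As a rough illustration of the arithmetic described above, here is a
minimal userspace sketch of how a minimum folio order could be derived
from the filesystem block size, and how a page cache index could be
aligned to it. All names here (sketch_min_order() and friends) are
hypothetical and are not taken from the series, and the 4k base page
size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4k base pages; other architectures may differ. */
#define SKETCH_PAGE_SHIFT 12u

/*
 * Hypothetical helper: smallest folio order whose size covers one
 * filesystem block. blocksize must be a power of two.
 */
static unsigned int sketch_min_order(uint32_t blocksize)
{
	unsigned int shift = 0;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	while ((1u << shift) < blocksize)
		shift++;
	/* Blocks no larger than a page need only an order-0 folio. */
	return shift > SKETCH_PAGE_SHIFT ? shift - SKETCH_PAGE_SHIFT : 0;
}

/* Number of base pages in a folio of the given order. */
static uint64_t sketch_nr_pages(unsigned int order)
{
	return (uint64_t)1 << order;
}

/* Round a page cache index down to the start of its minimum-order folio. */
static uint64_t sketch_align_index(uint64_t index, unsigned int min_order)
{
	return index & ~(sketch_nr_pages(min_order) - 1);
}
```

With these assumptions, a 16k block size gives a minimum order of 2
(four 4k pages per folio), and page index 7 aligns down to index 4.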

Only XFS was enabled and tested as a part of this series as it has
supported block sizes up to 64k and sector sizes up to 32k for years.
The only thing missing was the page cache magic to enable bs > ps.
However, any filesystem that doesn't depend on buffer-heads and already
supports larger block sizes should be able to leverage this effort to
also support LBS, bs > ps.

This also paves the way for supporting block devices whose logical
block size > page size in the future by leveraging the iomap address
space operations added to the block device cache by Christoph Hellwig
[6]. We have work in progress to enable support for this, enabling
LBAs > 4k on NVMe, while at the same time allowing coexistence with
buffer-heads on the same block device, so that a drive can switch
between filesystems which may depend on buffer-heads or need the iomap
address space operations for the block device cache. Patches for this
will be posted shortly after this patch series.

Testing:

The test results show this isn't so scary. So far there are only a few
regressions on XFS where CRCs are disabled on block sizes smaller than
4k, and some generic tests crash the system for bs > 4k. The crashes are
at most a handful at this point. This series has been cleaned up 3 times
now after we passed our first billion fsx ops on different block sizes.
Not surprisingly, there are a few test bugs for the bs > ps world.

We first established a baseline against linux-next using 14 different
XFS test profiles as maintained in kdevops [7]:

xfs_crc
xfs_reflink
xfs_reflink_normapbt
xfs_reflink_1024
xfs_reflink_2k
xfs_reflink_4k
xfs_nocrc
xfs_nocrc_512
xfs_nocrc_1k
xfs_nocrc_2k
xfs_nocrc_4k
xfs_logdev
xfs_rtdev
xfs_rtlogdev

We first established a high confidence baseline for linux-next and have
kept following that to ensure we don't regress it. The majority of
regressions are fsx ops on no CRC block sizes of 512 and 2k, and we plan
to fix that, but welcome others at this point to jump in and collaborate.

The list of known possible regressions can be seen in kdevops with
git grep:

git grep regression workflows/fstests/expunges/6.6.0-rc1-large-block-20230914/ | awk -F"unassigned/" '{print $2}'
xfs_nocrc_2k.txt:generic/075 # possible regression
xfs_nocrc_2k.txt:generic/112 # possible regression
xfs_nocrc_2k.txt:generic/127 # possible regression
xfs_nocrc_2k.txt:generic/231 # possible regression
xfs_nocrc_2k.txt:generic/263 # possible regression
xfs_nocrc_2k.txt:generic/469 # possible regression
xfs_nocrc_512.txt:generic/075 # possible regression
xfs_nocrc_512.txt:generic/112 # possible regression
xfs_nocrc_512.txt:generic/127 # possible regression
xfs_nocrc_512.txt:generic/231 # possible regression
xfs_nocrc_512.txt:generic/263 # possible regression
xfs_nocrc_512.txt:generic/469 # possible regression
xfs_reflink_1024.txt:generic/457 # possible regression crash https://gist.github.com/mcgrof/f182b250a9d091f77dc85782a83224b3
xfs_rtdev.txt:generic/333 # might crash might be a regression, takes forever...

A billion fsx ops are possible with 16k, and hundreds of millions of
fsx ops have so far also been successful against 32k and 64k with 4k
sector size.

To verify larger IOs are used we have been using Daniel Gomez's lbs-ctl
tool which uses eBPF to verify different IO counts on the block layer.
That tool will soon be published.

For more details please refer to the kernel newbies page on LBS [8].

[1] https://lwn.net/Articles/231793/
[2] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/
[3] https://lore.kernel.org/linux-mm/20230308075952.GU2825702@dread.disaster.area/
[4] https://cdrdv2-public.intel.com/605724/Achieving_Optimal_Perf_IU_SSDs-338395-003US.pdf
[5] https://lwn.net/Articles/932900/
[6] https://lore.kernel.org/lkml/20230801172201.1923299-2-hch@lst.de/T/
[7] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/xfs/xfs.config
[8] https://kernelnewbies.org/KernelProjects/large-block-size

--
Regards,
Pankaj
Luis

Dave Chinner (1):
  xfs: expose block size in stat

Luis Chamberlain (12):
  filemap: set the order of the index in page_cache_delete_batch()
  filemap: align index to mapping_min_order in filemap_range_has_page()
  mm: call xas_set_order() in replace_page_cache_folio()
  filemap: align the index to mapping_min_order in __filemap_add_folio()
  filemap: align the index to mapping_min_order in
    filemap_get_folios_tag()
  filemap: align the index to mapping_min_order in filemap_get_pages()
  readahead: set file_ra_state->ra_pages to be at least
    mapping_min_order
  readahead: add folio with at least mapping_min_order in
    page_cache_ra_order
  readahead: set the minimum ra size in get_(init|next)_ra
  readahead: align ra start and size to mapping_min_order in
    ondemand_ra()
  truncate: align index to mapping_min_order
  mm: round down folio split requirements

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (9):
  pagemap: use mapping_min_order in fgf_set_order()
  filemap: add folio with at least mapping_min_order in
    __filemap_get_folio
  filemap: use mapping_min_order while allocating folios
  filemap: align the index to mapping_min_order in
    do_[a]sync_mmap_readahead
  filemap: align index to mapping_min_order in filemap_fault()
  readahead: allocate folios with mapping_min_order in ra_unbounded()
  readahead: align with mapping_min_order in force_page_cache_ra()
  xfs: enable block size larger than page size support
  xfs: set minimum order folio for page cache based on blocksize

 fs/iomap/buffered-io.c  |  2 +-
 fs/xfs/xfs_icache.c     |  8 +++-
 fs/xfs/xfs_iops.c       |  4 +-
 fs/xfs/xfs_mount.c      |  9 ++++-
 fs/xfs/xfs_super.c      |  7 +---
 include/linux/pagemap.h | 87 ++++++++++++++++++++++++++++++-----------
 mm/filemap.c            | 87 +++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c        | 14 +++++--
 mm/readahead.c          | 86 ++++++++++++++++++++++++++++++++++------
 mm/truncate.c           | 34 +++++++++++-----
 10 files changed, 263 insertions(+), 75 deletions(-)


base-commit: e143016b56ecb0fcda5bb6026b0a25fe55274f56
  

Comments

Matthew Wilcox Sept. 15, 2023, 7:03 p.m. UTC | #1
On Fri, Sep 15, 2023 at 08:38:26PM +0200, Pankaj Raghav wrote:
> +static inline void mapping_set_folio_orders(struct address_space *mapping,
> +					    unsigned int min, unsigned int max)
> +{
> +	/*
> +	 * XXX: max is ignored as only minimum folio order is supported
> +	 * currently.
> +	 */

I think we need some sanity checking ...

	if (min == 1)
		min = 2;
	if (max < min)
		max = min;
	if (max > MAX_PAGECACHE_ORDER)
		max = MAX_PAGECACHE_ORDER;

> +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> +			 (min << AS_FOLIO_ORDER_MIN) |
> +			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> +}
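
To make the suggested sanity checking concrete, here is a minimal
userspace sketch that applies the three checks above and then packs both
orders into a flags word. It is an illustration only: the bit positions,
field width and maximum order are assumptions chosen for the example
(the real definitions live in include/linux/pagemap.h), the final guard
on min is an addition not present in the suggestion, and unlike the
posted hunk, which ignores max, the sketch stores the clamped max.

```c
#include <assert.h>

/* Assumed values for illustration; not the kernel's definitions. */
#define SKETCH_MAX_PAGECACHE_ORDER 8u
#define SKETCH_ORDER_BITS          5u
#define SKETCH_ORDER_MIN_SHIFT     16u
#define SKETCH_ORDER_MAX_SHIFT     (SKETCH_ORDER_MIN_SHIFT + SKETCH_ORDER_BITS)
#define SKETCH_ORDER_FIELD         ((1ul << SKETCH_ORDER_BITS) - 1)
#define SKETCH_ORDER_MASK \
	((SKETCH_ORDER_FIELD << SKETCH_ORDER_MIN_SHIFT) | \
	 (SKETCH_ORDER_FIELD << SKETCH_ORDER_MAX_SHIFT))

/* Return flags with the min/max folio order fields replaced. */
static unsigned long sketch_set_folio_orders(unsigned long flags,
					     unsigned int min,
					     unsigned int max)
{
	/* The three checks suggested in the review comment. */
	if (min == 1)
		min = 2;
	if (max < min)
		max = min;
	if (max > SKETCH_MAX_PAGECACHE_ORDER)
		max = SKETCH_MAX_PAGECACHE_ORDER;
	/* Extra guard added for this sketch only: keep min <= max. */
	if (min > max)
		min = max;

	return (flags & ~SKETCH_ORDER_MASK) |
	       ((unsigned long)min << SKETCH_ORDER_MIN_SHIFT) |
	       ((unsigned long)max << SKETCH_ORDER_MAX_SHIFT);
}

/* Read the minimum order back out of the flags word. */
static unsigned int sketch_min_folio_order(unsigned long flags)
{
	return (flags >> SKETCH_ORDER_MIN_SHIFT) & SKETCH_ORDER_FIELD;
}

/* Read the maximum order back out of the flags word. */
static unsigned int sketch_max_folio_order(unsigned long flags)
{
	return (flags >> SKETCH_ORDER_MAX_SHIFT) & SKETCH_ORDER_FIELD;
}
```

Under these assumptions, a request for min = 1 and max = 0 ends up as
min = max = 2, and unrelated low flag bits are left untouched.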
  
Matthew Wilcox Sept. 15, 2023, 7:45 p.m. UTC | #2
On Fri, Sep 15, 2023 at 08:38:30PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> page cache is mapping min_folio_order aligned. Use mapping min_folio_order
> to align the start_byte and end_byte in filemap_range_has_page().

What goes wrong if you don't?  Seems to me like it should work.

> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/filemap.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2c47729dc8b0..4dee24b5b61c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -477,9 +477,12 @@ EXPORT_SYMBOL(filemap_flush);
>  bool filemap_range_has_page(struct address_space *mapping,
>  			   loff_t start_byte, loff_t end_byte)
>  {
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int nrpages = 1UL << min_order;
> +	pgoff_t index = round_down(start_byte >> PAGE_SHIFT, nrpages);
>  	struct folio *folio;
> -	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
> -	pgoff_t max = end_byte >> PAGE_SHIFT;
> +	XA_STATE(xas, &mapping->i_pages, index);
> +	pgoff_t max = round_down(end_byte >> PAGE_SHIFT, nrpages);
>  
>  	if (end_byte < start_byte)
>  		return false;
> -- 
> 2.40.1
>
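
For readers following the hunk above, this is a small userspace sketch
of the index arithmetic it introduces: both ends of the byte range are
converted to page indices and rounded down to a minimum-order folio
boundary. The struct and function names are hypothetical and the 4k page
size is an assumption. The sketch only shows what the patch computes; it
does not settle the review question of whether the unaligned lookup
would already work.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4k base pages. */
#define SKETCH_PAGE_SHIFT 12u

/* Hypothetical container for the aligned page index range. */
struct sketch_index_range {
	uint64_t first; /* index the lookup would start from */
	uint64_t last;  /* upper bound index, inclusive */
};

/*
 * Mirror the round_down() calls in the quoted hunk: shift the byte
 * offsets to page indices, then clear the low min_order bits.
 */
static struct sketch_index_range
sketch_aligned_range(uint64_t start_byte, uint64_t end_byte,
		     unsigned int min_order)
{
	uint64_t nrpages = (uint64_t)1 << min_order;
	struct sketch_index_range r;

	assert(start_byte <= end_byte);
	r.first = (start_byte >> SKETCH_PAGE_SHIFT) & ~(nrpages - 1);
	r.last = (end_byte >> SKETCH_PAGE_SHIFT) & ~(nrpages - 1);
	return r;
}
```

For example, with a minimum order of 2, bytes 0x5000 through 0xBFFF
cover page indices 5 to 11, which round down to 4 and 8 respectively.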
  
Matthew Wilcox Sept. 15, 2023, 7:46 p.m. UTC | #3
On Fri, Sep 15, 2023 at 08:38:31PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Call xas_set_order() in replace_page_cache_folio() for non hugetlb
> pages.

This function definitely should work without this patch.  What goes wrong?

> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/filemap.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4dee24b5b61c..33de71bfa953 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -815,12 +815,14 @@ EXPORT_SYMBOL(file_write_and_wait_range);
>  void replace_page_cache_folio(struct folio *old, struct folio *new)
>  {
>  	struct address_space *mapping = old->mapping;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	void (*free_folio)(struct folio *) = mapping->a_ops->free_folio;
>  	pgoff_t offset = old->index;
>  	XA_STATE(xas, &mapping->i_pages, offset);
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
> +	VM_BUG_ON_FOLIO(folio_order(new) != folio_order(old), new);
>  	VM_BUG_ON_FOLIO(new->mapping, new);
>  
>  	folio_get(new);
> @@ -829,6 +831,11 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
>  
>  	mem_cgroup_migrate(old, new);
>  
> +	if (!folio_test_hugetlb(new)) {
> +		VM_BUG_ON_FOLIO(folio_order(new) < min_order, new);
> +		xas_set_order(&xas, offset, folio_order(new));
> +	}
> +
>  	xas_lock_irq(&xas);
>  	xas_store(&xas, new);
>  
> -- 
> 2.40.1
>
  
Dave Chinner Sept. 17, 2023, 10:05 p.m. UTC | #4
On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> [...]

Do you have a git tree branch that I can pull this from
somewhere?

As it is, I'd really prefer stuff that adds significant XFS
functionality that we need to test to be based on a current Linus
TOT kernel so that we can test it without being impacted by all
the random unrelated breakages that regularly happen in linux-next
kernels....

-Dave.
  
Luis Chamberlain Sept. 18, 2023, 2:04 a.m. UTC | #5
On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> Do you have a git tree branch that I can pull this from
> somewhere?
> 
> As it is, I'd really prefer stuff that adds significant XFS
> functionality that we need to test to be based on a current Linus
> TOT kernel so that we can test it without being impacted by all
> the random unrelated breakages that regularly happen in linux-next
> kernels....

That's understandable! I just rebased onto Linus' tree, this only
has the bs > ps support on 4k sector size:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev

I just did a cursory build / boot / fsx test with 16k block size / 4k
sector size on this tree only. I haven't run fstests on it.

Just a heads up, using 512 byte sector size will fail for now, it's a
regression we have to fix. Likewise using block sizes 1k, 2k will also
regress on fsx right now. These are regressions we are aware of but
haven't had time yet to bisect / fix.

  Luis
  
Dave Chinner Sept. 18, 2023, 5:07 a.m. UTC | #6
On Sun, Sep 17, 2023 at 07:04:24PM -0700, Luis Chamberlain wrote:
> That's understandable! I just rebased onto Linus' tree, this only
> has the bs > ps support on 4k sector size:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev


> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
> test with this tree only. I haven't run fstests on it.

W/ 64k block size, generic/042 fails (maybe just a test block size
thing), generic/091 fails (data corruption on read after ~70 ops)
and then generic/095 hung with a crash in iomap_readpage_iter()
during readahead.

Looks like a null folio was passed to ifs_alloc(), which implies the
iomap_readpage_ctx didn't have a folio attached to it. Something
isn't working properly in the readahead code, which would also
explain the quick fsx failure...

> Just a heads up, using 512 byte sector size will fail for now, it's a
> regression we have to fix. Likewise using block sizes 1k, 2k will also
> regress on fsx right now. These are regressions we are aware of but
> haven't had time yet to bisect / fix.

I'm betting that the recently added sub-folio dirty tracking code
got broken by this patchset....

Cheers,

Dave.
  
Pankaj Raghav Sept. 18, 2023, 12:29 p.m. UTC | #7
>>>
>>> As it is, I'd really prefer stuff that adds significant XFS
>>> functionality that we need to test to be based on a current Linus
>>> TOT kernel so that we can test it without being impacted by all
>>> the random unrelated breakages that regularly happen in linux-next
>>> kernels....
>>
>> That's understandable! I just rebased onto Linus' tree, this only
>> has the bs > ps support on 4k sector size:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
> 

I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
sync with Luis offline regarding that.

> 
>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
>> test with this tree only. I haven't run fstests on it.
> 
> W/ 64k block size, generic/042 fails (maybe just a test block size
> thing), generic/091 fails (data corruption on read after ~70 ops)
> and then generic/095 hung with a crash in iomap_readpage_iter()
> during readahead.
> 
> Looks like a null folio was passed to ifs_alloc(), which implies the
> iomap_readpage_ctx didn't have a folio attached to it. Something
> isn't working properly in the readahead code, which would also
> explain the quick fsx failure...
> 

Yeah, I have noticed this as well. This is the main crash scenario I am noticing
when I am running xfstests, and hopefully we will be able to fix it soon.

In general, we have had better results with 16k block size than 64k block size. I still don't
know why, but the ifs_alloc crash happens in generic/451 with 16k block size.


>> Just a heads up, using 512 byte sector size will fail for now, it's a
>> regression we have to fix. Likewise using block sizes 1k, 2k will also
>> regress on fsx right now. These are regressions we are aware of but
>> haven't had time yet to bisect / fix.
> 
> I'm betting that the recently added sub-folio dirty tracking code
> got broken by this patchset....
> 

Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
tracking code on a system which has a page size greater than the block size? Or are there
some tests that can already test this? CCing Ritesh as well.

> Cheers,
> 
> Dave.
  
Pankaj Raghav Sept. 18, 2023, 12:35 p.m. UTC | #8
On 2023-09-15 20:50, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
>> Only XFS was enabled and tested as a part of this series as it has
>> supported block sizes up to 64k and sector sizes up to 32k for years.
>> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
>> that doesn't depend on buffer-heads and support larger block sizes
>> already should be able to leverage this effort to also support LBS,
>> bs > ps.
> 
> I think you should choose whether you're going to use 'bs > ps' or LBS
> and stick to it.  They're both pretty inscrutable and using both
> interchangeably is worse.
> 

Got it! Probably I will stick to Large block size and explain what it means
at the start of the patchset.

> But I think filesystems which use buffer_heads should be fine to support
> bs > ps.  The problems with the buffer cache are really when you try to
> support small block sizes and large folio sizes (eg arrays of bhs on
> the stack).  Supporting bs == folio_size shouldn't be a problem.
> 

I remember some patches from you trying to avoid the stack limitation while working
with bh. Thanks for the clarification!
  
Ritesh Harjani (IBM) Sept. 19, 2023, 11:56 a.m. UTC | #9
Pankaj Raghav <p.raghav@samsung.com> writes:

>>>>
>>>> As it is, I'd really prefer stuff that adds significant XFS
>>>> functionality that we need to test to be based on a current Linus
>>>> TOT kernel so that we can test it without being impacted by all
>>>> the random unrelated breakages that regularly happen in linux-next
>>>> kernels....
>>>
>>> That's understandable! I just rebased onto Linus' tree, this only
>>> has the bs > ps support on 4k sector size:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>> 
>
> I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
> sync with Luis offline regarding that.
>
>> 
>>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
>>> test with this tree only. I haven't run fstests on it.
>> 
>> W/ 64k block size, generic/042 fails (maybe just a test block size
>> thing), generic/091 fails (data corruption on read after ~70 ops)
>> and then generic/095 hung with a crash in iomap_readpage_iter()
>> during readahead.
>> 
>> Looks like a null folio was passed to ifs_alloc(), which implies the
>> iomap_readpage_ctx didn't have a folio attached to it. Something
>> isn't working properly in the readahead code, which would also
>> explain the quick fsx failure...
>> 
>
> Yeah, I have noticed this as well. This is the main crash scenario I am noticing
> when I am running xfstests, and hopefully we will be able to fix it soon.
>
> In general, we have had better results with 16k block size than 64k block size. I still don't
> know why, but the ifs_alloc crash happens in generic/451 with 16k block size.
>
>
>>> Just a heads up, using 512 byte sector size will fail for now, it's a
>>> regression we have to fix. Likewise using block sizes 1k, 2k will also
>>> regress on fsx right now. These are regressions we are aware of but
>>> haven't had time yet to bisect / fix.
>> 
>> I'm betting that the recently added sub-folio dirty tracking code
>> got broken by this patchset....
>> 
>
> Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
> tracking code on a system which has a page size greater than the block size? Or is there
> some tests that can already test this? CCing Ritesh as well.
>

Sorry, I haven't looked into this series yet. I will spend some time
reading it. I will also give the fstests a spin.

But to answer your question on how to test the sub-folio dirty
tracking code [1] [2] with XFS: just use blocksize < pagesize in the mkfs options
and run fstests. There are a number of tests which check data
correctness for various types of writes.

1. Test 1k blocksize on a 4k pagesize machine (as long as bs < ps)
2. Test 4k blocksize on a 64k pagesize machine (if you have one) (as long as bs < ps)
3. Or also enable large folio support and test bs < ps
(with large folios, the system starts instantiating large folios > 4k on a 4k
pagesize machine, so the blocksize automatically becomes smaller than the folio size)

You will need CONFIG_TRANSPARENT_HUGEPAGE to be enabled along with
willy's series which enables large folios in the buffered write path [3].
(This is already in Linux 6.6-rc1)

<snip>
/*                                                                            
 * Large folio support currently depends on THP.  These dependencies are      
 * being worked on but are not yet fixed.                                     
 */                                                                           
static inline bool mapping_large_folio_support(struct address_space *mapping) 
{                                                                             
        return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&                     
                test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}
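
As a rough illustration of the three setups listed above (a sketch, not kernel code): per-block dirty state only matters when one folio holds more than one filesystem block, which is the condition all three setups create. The helper names and the 16k large-folio size below are hypothetical and chosen only for the example.

```python
KIB = 1024


def blocks_per_folio(folio_size: int, block_size: int) -> int:
    """Number of whole filesystem blocks that fit in one folio."""
    if folio_size <= 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    return folio_size // block_size


def needs_subfolio_tracking(folio_size: int, block_size: int) -> bool:
    # Per-block state is only interesting when a folio spans several blocks.
    return blocks_per_folio(folio_size, block_size) > 1


# (folio size, block size) pairs; the last entry is the LBS case for contrast.
setups = {
    "1k bs on 4k page": (4 * KIB, 1 * KIB),
    "4k bs on 64k page": (64 * KIB, 4 * KIB),
    "4k bs, 16k large folio on 4k page": (16 * KIB, 4 * KIB),
    "64k bs, 64k folio": (64 * KIB, 64 * KIB),
}
for name, (folio, bs) in setups.items():
    print(f"{name}: {blocks_per_folio(folio, bs)} blocks/folio, "
          f"tracking={needs_subfolio_tracking(folio, bs)}")
```

In this simplified model the block size == folio size case has exactly one block per folio, so it does not exercise the sub-folio tracking path at all.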

<links>
[1]: https://lore.kernel.org/linux-xfs/20230725122932.144426-1-ritesh.list@gmail.com/
[2]: https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=4ce02c67972211be488408c275c8fbf19faf29b3
[3]: https://lore.kernel.org/all/ZLVrEkVU2YCneoXR@casper.infradead.org/

Hope this helps!

-ritesh
  
Luis Chamberlain Sept. 19, 2023, 9:15 p.m. UTC | #10
On Tue, Sep 19, 2023 at 05:26:44PM +0530, Ritesh Harjani wrote:
> Pankaj Raghav <p.raghav@samsung.com> writes:
> 
> >>>>
> >>>> As it is, I'd really prefer stuff that adds significant XFS
> >>>> functionality that we need to test to be based on a current Linus
> >>>> TOT kernel so that we can test it without being impacted by all
> >>>> the random unrelated breakages that regularly happen in linux-next
> >>>> kernels....
> >>>
> >>> That's understandable! I just rebased onto Linus' tree, this only
> >>> has the bs > ps support on 4k sector size:
> >>>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
> >> 
> >
> > I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
> > sync with Luis offline regarding that.
> >
> >> 
> >>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
> >>> test with this tree only. I haven't run fstests on it.
> >> 
> >> W/ 64k block size, generic/042 fails (maybe just a test block size
> >> thing), generic/091 fails (data corruption on read after ~70 ops)
> >> and then generic/095 hung with a crash in iomap_readpage_iter()
> >> during readahead.
> >> 
> >> Looks like a null folio was passed to ifs_alloc(), which implies the
> >> iomap_readpage_ctx didn't have a folio attached to it. Something
> >> isn't working properly in the readahead code, which would also
> >> explain the quick fsx failure...
> >> 
> >
> > Yeah, I have noticed this as well. This is the main crash scenario I am noticing
> > when I am running xfstests, and hopefully we will be able to fix it soon.
> >
> > In general, we have had better results with 16k block size than 64k block size. I still don't
> > know why, but the ifs_alloc crash happens in generic/451 with 16k block size.
> >
> >
> >>> Just a heads up, using 512 byte sector size will fail for now, it's a
> >>> regression we have to fix. Likewise using block sizes 1k, 2k will also
> >>> regress on fsx right now. These are regressions we are aware of but
> >>> haven't had time yet to bisect / fix.
> >> 
> >> I'm betting that the recently added sub-folio dirty tracking code
> >> got broken by this patchset....
> >> 
> >
> > Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
> > tracking code on a system which has a page size greater than the block size? Or is there
> > some tests that can already test this? CCing Ritesh as well.
> >
> 
> Sorry I haven't yet looked into this series yet. I will spend sometime
> reading it. Will also give a spin to run the fstests.

Ritesh,

You can save yourself time by not testing the patch series with fstests
for block sizes below ps, as we are already aware that a patch in the
series breaks this. We just wanted to get the patch series out early for
review given the progress. There's probably one patch which regresses
this; if each patch regresses this, that's a bigger issue :P

  Luis
  
Luis Chamberlain Sept. 21, 2023, 3 a.m. UTC | #11
On Mon, Sep 18, 2023 at 02:29:22PM +0200, Pankaj Raghav wrote:
> I think this tree doesn't have some of the last minute changes I did
> before I sent the RFC. I will sync with Luis offline regarding that.

OK, we sorted out the small changes, and the posted patch series is now
rebased onto Linus' v6.6-rc2 and available here, for those that want more
stability than the wild wild linux-next:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev

If you wanna muck with the coexistence stuff, which you will need if you
want to actually use an LBS device, that is this patch series
and then the coexistence stuff:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus

Given this is a fresh rebase, I started running fsx on the nobdev branch
which only has this series and managed to get fsx ops up to over 1 million
for:

512 sector size:
  * 16k block size
  * 32k block size
  * 64k block size
4k sector size:
  * 16k block size
  * 32k block size
  * 64k block size
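
For anyone wanting to reproduce a similar matrix, here is a minimal sketch that enumerates the six sector size / block size pairs listed above as mkfs.xfs command lines. This is an assumption about how one might format the filesystems, not the invocation actually used for these runs; the device path is a placeholder, and only the standard `-b size=` and `-s size=` options are relied on.

```python
from itertools import product

SECTOR_SIZES = ["512", "4k"]
BLOCK_SIZES = ["16k", "32k", "64k"]


def mkfs_commands(device: str = "/dev/nvmeXnY") -> list[str]:
    """Build one mkfs.xfs command line per sector/block size pair."""
    return [
        f"mkfs.xfs -f -b size={bs} -s size={ss} {device}"
        for ss, bs in product(SECTOR_SIZES, BLOCK_SIZES)
    ]


# Only prints the commands; nothing is formatted by this script.
for cmd in mkfs_commands():
    print(cmd)
```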

That's at least enough of a cursory test to git push it. I haven't yet
tested the second branch I pushed, but it applied without any changes
so it should be good (usual famous last words).

  Luis
  
Luis Chamberlain Sept. 21, 2023, 4:57 a.m. UTC | #12
On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> 
> I haven't tested yet the second branch I pushed though but it applied without any changes
> so it should be good (usual famous last words).

I have now run some preliminary tests on the branch above as well, using fsx
with larger LBA formats, running them all on the *same* system at the
same time. Kernel is happy.

root@linus ~ # uname -r
6.6.0-rc2-large-block-linus+

root@linus ~ # mount | grep mnt
/dev/nvme17n1 on /mnt-16k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme13n1 on /mnt-32k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme11n1 on /mnt-64k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
/dev/nvme18n1 on /mnt-32k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme14n1 on /mnt-64k-32ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
/dev/nvme7n1 on /mnt-64k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme4n1 on /mnt-32k-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme3n1 on /mnt-16k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme9n1 on /mnt-64k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
/dev/nvme8n1 on /mnt-32k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme6n1 on /mnt-16k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme5n1 on /mnt-4k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme1n1 on /mnt-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

root@linus ~ # ps -ef| grep fsx
root       45601   45172 44 04:02 pts/3    00:20:26 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k/foo
root       46207   45658 39 04:04 pts/5    00:17:18 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-16ks/foo
root       46792   46289 35 04:06 pts/7    00:14:36 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-16ks/foo
root       47293   46899 39 04:08 pts/9    00:15:30 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k/foo
root       47921   47338 34 04:10 pts/11   00:12:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-32ks/foo
root       48898   48484 32 04:14 pts/13   00:10:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-512b/foo
root       49313   48939 35 04:15 pts/15   00:11:38 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-512/foo
root       49729   49429 40 04:17 pts/17   00:12:27 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-512b/foo
root       50085   49794 33 04:18 pts/19   00:09:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-4ks/foo
root       50449   50130 36 04:19 pts/21   00:10:28 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-4ks/foo
root       50844   50517 41 04:20 pts/23   00:11:22 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-4ks/foo
root       51135   50893 52 04:21 pts/25   00:13:57 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-4k/foo
root       52061   51193 49 04:25 pts/27   00:11:21 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-512/foo
root       57668   52131  0 04:48 pts/29   00:00:00 grep fsx


root@linuslbs ~ # grep XFS /boot/config-6.6.0-rc2-large-block-linus-nobdev+ 
CONFIG_XFS_FS=m
CONFIG_XFS_SUPPORT_V4=y
# CONFIG_XFS_SUPPORT_ASCII_CI is not set
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_DRAIN_INTENTS=y
CONFIG_XFS_ONLINE_SCRUB=y
CONFIG_XFS_ONLINE_SCRUB_STATS=y
CONFIG_XFS_ONLINE_REPAIR=y
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y
# CONFIG_VXFS_FS is not set

root@linuslbs ~ # grep DEBUG_VM /boot/config-6.6.0-rc2-large-block-linus-nobdev+ 
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_MAPLE_TREE is not set
# CONFIG_DEBUG_VM_RB is not set
CONFIG_DEBUG_VM_PGFLAGS=y
# CONFIG_DEBUG_VM_PGTABLE is not set

And the status gets updated then:

https://docs.google.com/spreadsheets/d/e/2PACX-1vSA6z8C3u0WsiU2EtJNM7O_1n1-_4DYz66s53DbSuNTDDboi70EsR0hSmeogCCjqieBqALmRJ2AIKBW/pubhtml

We will run fstests for all profiles as well, but figured I'd at
least mention a quick test on the linus branch which also adds
the coex stuff.

  Luis
  
Dave Chinner Sept. 21, 2023, 6:03 a.m. UTC | #13
On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> > 
> > I haven't tested yet the second branch I pushed though but it applied without any changes
> > so it should be good (usual famous last words).
> 
> I have run some preliminary tests on that branch as well above using fsx
> with larger LBA formats running them all on the *same* system at the
> same time. Kernel is happy.
> 
> root@linus ~ # uname -r
> 6.6.0-rc2-large-block-linus+
> 
> root@linus ~ # mount | grep mnt
> /dev/nvme17n1 on /mnt-16k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme13n1 on /mnt-32k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme11n1 on /mnt-64k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme18n1 on /mnt-32k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme14n1 on /mnt-64k-32ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme7n1 on /mnt-64k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme4n1 on /mnt-32k-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme3n1 on /mnt-16k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme9n1 on /mnt-64k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme8n1 on /mnt-32k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme6n1 on /mnt-16k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme5n1 on /mnt-4k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme1n1 on /mnt-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> 
> root@linus ~ # ps -ef| grep fsx
> root       45601   45172 44 04:02 pts/3    00:20:26 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k/foo
> root       46207   45658 39 04:04 pts/5    00:17:18 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-16ks/foo
> root       46792   46289 35 04:06 pts/7    00:14:36 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-16ks/foo
> root       47293   46899 39 04:08 pts/9    00:15:30 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k/foo
> root       47921   47338 34 04:10 pts/11   00:12:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-32ks/foo
> root       48898   48484 32 04:14 pts/13   00:10:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-512b/foo
> root       49313   48939 35 04:15 pts/15   00:11:38 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-512/foo
> root       49729   49429 40 04:17 pts/17   00:12:27 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-512b/foo
> root       50085   49794 33 04:18 pts/19   00:09:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-4ks/foo
> root       50449   50130 36 04:19 pts/21   00:10:28 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-4ks/foo
> root       50844   50517 41 04:20 pts/23   00:11:22 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-4ks/foo
> root       51135   50893 52 04:21 pts/25   00:13:57 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-4k/foo
> root       52061   51193 49 04:25 pts/27   00:11:21 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-512/foo
> root       57668   52131  0 04:48 pts/29   00:00:00 grep fsx

So I just pulled this, built it and ran generic/091 as the very
first test on this:

# ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"
.....
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=32768 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=131072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=2613, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
....
Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_64k generic/091
SECTION       -- xfs_64k
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test3 6.6.0-rc2-large-block-linus-dgc+ #1906 SMP PREEMPT_DYNAMIC Thu Sep 21 15:19:47 AEST 2023
MKFS_OPTIONS  -- -f -m rmapbt=1 -b size=64k /dev/pmem1
MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch

generic/091 10s ... [failed, exit status 1]- output mismatch (see /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad)
    --- tests/generic/091.out   2022-12-21 15:53:25.467044754 +1100
    +++ /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad        2023-09-21 15:47:48.222559248 +1000
    @@ -1,7 +1,113 @@
     QA output created by 091
     fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
    ...
    (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/091.out /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad'  to see the entire diff)
Failures: generic/091
Failed 1 of 1 tests
Xunit report: /home/dave/src/xfstests-dev/results//xfs_64k/result.xml

SECTION       -- xfs_64k
=========================
Failures: generic/091
Failed 1 of 1 tests


real    0m4.214s
user    0m0.972s
sys     0m3.603s
#

For all these assertions about how none of your testing is finding
bugs in this code, it's taken me *4 seconds* of test runtime to find
the first failure.

And, well, it's the same failure as I reported for the previous
version of this code:

# cat /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad
/home/dave/src/xfstests-dev/ltp/fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/test/junk
mapped writes DISABLED    
Seed set to 1
main: filesystem does not support exchange range, disabling!
fallocating to largest ever: 0x79f06
READ BAD DATA: offset = 0x18000, size = 0xf000, fname = /mnt/test/junk
OFFSET      GOOD    BAD     RANGE
0x21000     0x0000  0x9008  0x0
operation# (mod 256) for the bad data may be 144
0x21001     0x0000  0x0810  0x1
operation# (mod 256) for the bad data may be 16
0x21002     0x0000  0x1000  0x2
operation# (mod 256) for the bad data may be 16
0x21005     0x0000  0x8e00  0x3
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21007     0x0000  0x82ff  0x4
operation# (mod 256) for the bad data may be 255
0x21008     0x0000  0xffff  0x5
operation# (mod 256) for the bad data may be 255
0x21009     0x0000  0xffff  0x6
operation# (mod 256) for the bad data may be 255
0x2100a     0x0000  0xffff  0x7
operation# (mod 256) for the bad data may be 255
0x2100b     0x0000  0xff00  0x8
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21010     0x0000  0x700b  0x9
operation# (mod 256) for the bad data may be 112
0x21011     0x0000  0x0b10  0xa
operation# (mod 256) for the bad data may be 16
0x21012     0x0000  0x1000  0xb
operation# (mod 256) for the bad data may be 16
0x21014     0x0000  0x038e  0xc
operation# (mod 256) for the bad data may be 3
0x21015     0x0000  0x8e00  0xd
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21017     0x0000  0x82ff  0xe
operation# (mod 256) for the bad data may be 255
0x21018     0x0000  0xffff  0xf
operation# (mod 256) for the bad data may be 255
LOG DUMP (69 total operations):
1(  1 mod 256): FALLOC   0x6ba10 thru 0x79f06   (0xe4f6 bytes) EXTENDING
2(  2 mod 256): SKIPPED (no operation)
3(  3 mod 256): SKIPPED (no operation)
4(  4 mod 256): TRUNCATE DOWN   from 0x79f06 to 0x51800
5(  5 mod 256): SKIPPED (no operation)
6(  6 mod 256): READ     0x1b000 thru 0x21fff   (0x7000 bytes)
7(  7 mod 256): PUNCH    0x2ce7a thru 0x39b9e   (0xcd25 bytes)
8(  8 mod 256): PUNCH    0x29238 thru 0x29f57   (0xd20 bytes)
9(  9 mod 256): COPY 0x3000 thru 0x9fff (0x7000 bytes) to 0x40400 thru 0x473ff
10( 10 mod 256): READ     0x16000 thru 0x21fff  (0xc000 bytes)
11( 11 mod 256): FALLOC   0x4a42b thru 0x4b8f7  (0x14cc bytes) INTERIOR
12( 12 mod 256): TRUNCATE DOWN  from 0x51800 to 0x15c00 ******WWWW
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): READ     0xb000 thru 0x14fff   (0xa000 bytes)
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): SKIPPED (no operation)
17( 17 mod 256): SKIPPED (no operation)
18( 18 mod 256): READ     0x3000 thru 0x11fff   (0xf000 bytes)
19( 19 mod 256): FALLOC   0x69b94 thru 0x6c922  (0x2d8e bytes) EXTENDING
20( 20 mod 256): SKIPPED (no operation)
21( 21 mod 256): SKIPPED (no operation)
22( 22 mod 256): WRITE    0x23000 thru 0x285ff  (0x5600 bytes)
23( 23 mod 256): SKIPPED (no operation)
24( 24 mod 256): SKIPPED (no operation)
25( 25 mod 256): SKIPPED (no operation)
26( 26 mod 256): ZERO     0x1fba0 thru 0x2c568  (0xc9c9 bytes)  ******ZZZZ
27( 27 mod 256): READ     0x4f000 thru 0x50fff  (0x2000 bytes)
28( 28 mod 256): READ     0x39000 thru 0x3afff  (0x2000 bytes)
29( 29 mod 256): WRITE    0x40200 thru 0x4cdff  (0xcc00 bytes)
30( 30 mod 256): SKIPPED (no operation)
31( 31 mod 256): WRITE    0x47e00 thru 0x547ff  (0xca00 bytes)
32( 32 mod 256): SKIPPED (no operation)
33( 33 mod 256): READ     0x28000 thru 0x29fff  (0x2000 bytes)
34( 34 mod 256): SKIPPED (no operation)
35( 35 mod 256): READ     0x69000 thru 0x6bfff  (0x3000 bytes)
36( 36 mod 256): READ     0x16000 thru 0x20fff  (0xb000 bytes)
37( 37 mod 256): ZERO     0x45150 thru 0x47e9c  (0x2d4d bytes)
38( 38 mod 256): SKIPPED (no operation)
39( 39 mod 256): SKIPPED (no operation)
40( 40 mod 256): COPY 0x10000 thru 0x11fff      (0x2000 bytes) to 0x22a00 thru 0x249ff
41( 41 mod 256): WRITE    0x29000 thru 0x2efff  (0x6000 bytes)
42( 42 mod 256): ZERO     0x59c7 thru 0x13eee   (0xe528 bytes)
43( 43 mod 256): FALLOC   0x1fdbf thru 0x2e694  (0xe8d5 bytes) INTERIOR ******FFFF
44( 44 mod 256): SKIPPED (no operation)
45( 45 mod 256): ZERO     0x740f5 thru 0x7a11f  (0x602b bytes)
46( 46 mod 256): SKIPPED (no operation)
47( 47 mod 256): WRITE    0x14200 thru 0x1e3ff  (0xa200 bytes)
48( 48 mod 256): READ     0x69000 thru 0x6bfff  (0x3000 bytes)
49( 49 mod 256): TRUNCATE DOWN  from 0x6c922 to 0x16a00 ******WWWW
50( 50 mod 256): WRITE    0x15000 thru 0x163ff  (0x1400 bytes)
51( 51 mod 256): PUNCH    0x3b5e thru 0xa2c1    (0x6764 bytes)
52( 52 mod 256): SKIPPED (no operation)
53( 53 mod 256): SKIPPED (no operation)
54( 54 mod 256): WRITE    0x34a00 thru 0x3fdff  (0xb400 bytes) HOLE     ***WWWW
55( 55 mod 256): WRITE    0x38000 thru 0x397ff  (0x1800 bytes)
56( 56 mod 256): PUNCH    0x7922 thru 0x115f0   (0x9ccf bytes)
57( 57 mod 256): SKIPPED (no operation)
58( 58 mod 256): SKIPPED (no operation)
59( 59 mod 256): SKIPPED (no operation)
60( 60 mod 256): FALLOC   0x300a8 thru 0x331d0  (0x3128 bytes) INTERIOR
61( 61 mod 256): ZERO     0x3799c thru 0x39245  (0x18aa bytes)
62( 62 mod 256): ZERO     0x62fc3 thru 0x6b630  (0x866e bytes)
63( 63 mod 256): SKIPPED (no operation)
64( 64 mod 256): ZERO     0x6110a thru 0x61dad  (0xca4 bytes)
65( 65 mod 256): FALLOC   0x1d8ca thru 0x20876  (0x2fac bytes) INTERIOR
66( 66 mod 256): COPY 0x65000 thru 0x68fff      (0x4000 bytes) to 0x22400 thru 0x263ff
67( 67 mod 256): SKIPPED (no operation)
68( 68 mod 256): WRITE    0x36a00 thru 0x415ff  (0xac00 bytes)
69( 69 mod 256): READ     0x18000 thru 0x26fff  (0xf000 bytes)  ***RRRR***
Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Guess what? The fsx parameters being used mean it is testing things you
aren't. Yes, the '-Z -R -W' means it is using direct IO for reads and writes,
and mmap() is disabled. Other parameters indicate that it is using 4k aligned
reads and 512 byte aligned writes and truncates.
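
To make the reading of that command line concrete, here is a small sketch that decodes the fsx invocation from the failing run. The flag descriptions paraphrase the fsx usage text and should be treated as a reading aid, not a replacement for it; the `describe_fsx` helper is hypothetical.

```python
import shlex

# Flags that take no argument.
BOOL_FLAGS = {
    "-Z": "O_DIRECT I/O",
    "-R": "mapped (mmap) reads disabled",
    "-W": "mapped (mmap) writes disabled",
}
# Flags followed by a numeric argument.
VALUE_FLAGS = {
    "-N": "number of operations",
    "-l": "upper bound on file size",
    "-o": "upper bound on operation size",
    "-r": "read boundary (alignment)",
    "-t": "truncate boundary (alignment)",
    "-w": "write boundary (alignment)",
}


def describe_fsx(cmdline: str) -> dict[str, str]:
    """Map each recognised fsx flag in cmdline to a short description."""
    args = shlex.split(cmdline)
    out = {}
    i = 0
    while i < len(args):
        arg = args[i]
        if arg in BOOL_FLAGS:
            out[arg] = BOOL_FLAGS[arg]
        elif arg in VALUE_FLAGS and i + 1 < len(args):
            out[arg] = f"{VALUE_FLAGS[arg]} = {args[i + 1]}"
            i += 1  # skip the flag's value
        i += 1
    return out


cmd = "fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/test/junk"
for flag, meaning in describe_fsx(cmd).items():
    print(flag, "->", meaning)
```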

There is a reason there are multiple different fsx tests in fstests;
they all exercise different sets of IO behaviours and alignments,
and they exercise the IO paths differently.

So there's clearly something wrong here - it's likely that the
filesystem IO alignment parameters pulled from the underlying block
device (4k physical, 512 byte logical sector sizes) are improperly
interpreted.  i.e. for a filesystem with a sector size of 4kB,
direct IO with an alignment of 512 bytes should be rejected......
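
A simplified model of that alignment argument, using a few WRITE operations copied from the fsx log above. This is only a sketch: it checks file offset and length against a sector size, and ignores the memory buffer alignment and other conditions a real direct IO path would also check. The helper name is hypothetical.

```python
def is_dio_aligned(offset: int, length: int, sector_size: int) -> bool:
    """True if both the offset and the length are multiples of sector_size."""
    return offset % sector_size == 0 and length % sector_size == 0


# (offset, length) pairs taken from WRITE ops 22, 29 and 41 in the log dump.
writes = [
    (0x23000, 0x5600),
    (0x40200, 0xCC00),
    (0x29000, 0x6000),
]
for off, length in writes:
    print(hex(off), hex(length),
          "512-aligned:", is_dio_aligned(off, length, 512),
          "4096-aligned:", is_dio_aligned(off, length, 4096))
```

Under this model, ops 22 and 29 are 512 byte aligned but not 4k aligned, so a filesystem with a 4kB sector size would be expected to reject them as direct IO, while op 41 passes either check.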

-Dave.
  
Luis Chamberlain Sept. 21, 2023, 7:18 a.m. UTC | #14
On Thu, Sep 21, 2023 at 04:03:56PM +1000, Dave Chinner wrote:
> On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> > On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> > > 
> > > I haven't tested yet the second branch I pushed though but it applied without any changes
> > > so it should be good (usual famous last words).
> > 
> > I have run some preliminary tests on that branch as well above using fsx
> > with larger LBA formats running them all on the *same* system at the
> > same time. Kernel is happy.

<-- snip -->

> So I just pulled this, built it and run generic/091 as the very
> first test on this:
> 
> # ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"

The cover letter for this patch series acknowledged failures in fstests.

For kdevops now, we borrow the same last linux-next baseline:

git grep "generic/091" workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_1024.txt:generic/091 # possible regression
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_16k.txt:generic/091
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_32k.txt:generic/091
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_64k_4ks.txt:generic/091

So well, we already know this fails.

> For all these assertions about how none of your testing is finding
> bugs in this code, It's taken me *4 seconds* of test runtime to find
> the first failure.

Because you know what to look for and this is not yet perfect.

> And, well, it's the same failure as I reported for the previous
> version of this code:

And we haven't done *any* new changes to the patch series so no surprise
either.

> Guess what? The fsx parameters being used means it is testing things you
> aren't.

I actually found quite a few issues with -W. And it was useful.

> Yes, the '-Z -R -W' mean it is using direct IO for reads and writes,
> mmap() is disabled. Other parameters indicate that using 4k aligned reads and
> 512 byte aligned writes and truncates.

Thanks! This will help for sure!

> There is a reason there are multiple different fsx tests in fstests;

You made it clear, and I documented the goal to ensure we get to the
point we pass all those:

https://kernelnewbies.org/KernelProjects/large-block-size#fsx

> they all exercise different sets of IO behaviours and alignments,
> and they exercise the IO paths differently.
> 
> So there's clearly something wrong here - it's likely that the
> filesystem IO alignment parameters pulled from the underlying block
> device (4k physical, 512 byte logical sector sizes) are improperly
> interpreted.  i.e. for a filesystem with a sector size of 4kB,
> direct IO with an alignment of 512 bytes should be rejected......

So yes, this is not yet complete.

But now let's step back and I want you to realize where we started
and why we decided to post, in particular me, I was suggesting we
post now, instead of waiting for us to resolve *it all*.

When we first started this work we simply thought it was impossible.
Unless of course you are Matthew and you believed hard in your work.

The progress, which you don't see, is that steps towards fixing fsx
issues have been logarithmic. Days, weeks, months before decent
progress, but the progress was steady...

And so getting to where we are today just shows that, well, this is
actually not impossible, that Matthew did the right thing with the
right data structure, and that the changes to the page cache with the
multi-index array stuff seem to be usable for LBS as well.

At this point, from a logarithmic perspective, we have huge progress,
and I don't think it will stop. It gives us confidence Matthew was
right and LBS is possible indeed with the multi-index stuff.

It's not about, can this crash. Yes, we know, it can crash. It's about
how many different ways, and how many fixes left. Because clearly the
multi-index stuff is working well. The code feedback so far on this
patch series has mostly been "I don't think this patch is needed" or
"perhaps this way is better", and that's the kind of feedback we're
looking for. Because *each* new patch adds a huge milestone. And
it seems the progress has been logarithmic. It is exactly why this
series went out with a few patches which ... we felt safer with them
than without. For instance the batch delete.. I am still suspicious
about us not needing it, as Hannes' patches also seem to rely on similar
rounding on the wait stuff, and it seems to bring back memories
on issues found on permissions. But anyway, the point is that this
is clearly not ready. But try to think of progress here as logarithmic,
and any *dent* we make on the page cache to fix the last corner cases
will be huge, not small.

If you want to try, you can see for yourself, what's the next fix? :)
And if found, was it logarithmic? How do we polish this? That's the
goal of this patch series.

  Luis
  
Luis Chamberlain Sept. 21, 2023, 7:20 a.m. UTC | #15
On Thu, Sep 21, 2023 at 12:18:13AM -0700, Luis Chamberlain wrote:
> When we first started this work we simply thought it was impossible.

*not possible*

  Luis
  
Dave Chinner Sept. 22, 2023, 5:03 a.m. UTC | #16
On Thu, Sep 21, 2023 at 12:18:13AM -0700, Luis Chamberlain wrote:
> On Thu, Sep 21, 2023 at 04:03:56PM +1000, Dave Chinner wrote:
> > On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> > > On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> > > > 
> > > > I haven't tested yet the second branch I pushed though but it applied without any changes
> > > > so it should be good (usual famous last words).
> > > 
> > > I have run some preliminary tests on that branch as well above using fsx
> > > with larger LBA formats running them all on the *same* system at the
> > > same time. Kernel is happy.
> 
> <-- snip -->
> 
> > So I just pulled this, built it and run generic/091 as the very
> > first test on this:
> > 
> > # ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"
> 
> The cover letter for this patch series acknowledged failures in fstests.

But this is a new update, which you said fixed various issues, and
you posted this in direct response to the bug report I gave you.

> For kdevops now, we borrow the same last linux-next baseline:
> 
> git grep "generic/091" workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_1024.txt:generic/091 # possible regression
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_16k.txt:generic/091
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_32k.txt:generic/091
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_64k_4ks.txt:generic/091
> 
> So well, we already know this fails.

*cough*

-You- know it already fails.

And you are expecting people who try the code to somehow know
that you've explicitly ignored this fsx failure, especially after
all your words to tell us how much fsx testing it has passed?

And that's kinda my point - you're effusing about how much fsx
testing this has passed, yet it still fails after just a handful of
ops in generic/091. The dissonance could break windows...

----

Fundamentally, when it comes to data integrity, it is important to
exercise as much of the operational application space as quickly as
possible as it is that breadth of variation in operations that
flushes out more bugs and helps stabilise the code faster.

Why do you think we talk about the massive test matrix most
filesystems have and how long it takes to iterate so much? It's
because iterating that complex test matrix is how we find all the
whacky, weird bugs in the code.

Concentrating on a single test configuration and running it over and
over again won't find bugs in code it doesn't exercise no matter how
long it is run for. Running such a setup in an automated environment
doesn't mean you get better code coverage, it just means you cover
the same narrow set of corner cases faster and more times. If it
works once, it should work a million times. Iterating it a billion
more times doesn't tell us anything additional, either.

Put simply: performing deep, homogenous testing on code that has known
data corruption bugs outside the narrow scope of the test case is
not telling us anything useful about the overall state of the code.
Indeed, turning off failing tests that are critical to validating the
correct operation of the code you are modifying is bad practice.

For code changes like this, all fsx testing in fstests should pass
before you post anything for review - even for an RFC. There is no
point reviewing code that doesn't work properly, nor wasting
people's time by encouraging them to test it when it's clear to you
that it's going to fail in various important ways.

Hence I think your testing is focussing on the wrong things and I
suspect that you've misunderstood the statements of "we'll need
billions of fsx ops to test this code" that various people have made
really meant.  You've elevated running billions of fsx ops to your
primary "it works" gating condition, at the expense of making sure
all the other parts of the filesystem still work correctly.

The reality is that the returns from fsx diminish as the number of
ops goes up. Once you've run the first hundred million fsx ops for a
given operations set, the chance that the next 100M ops will find a
new problem is -greatly- reduced. The vast majority of problems will
be found in the first 10M ops that are run in any given fsx
operation, and few bugs are found beyond the 100M mark. Yes, we
occasionally find one up in the billions, but that's rare and most
definitely not something to focus on when still developing RFC level
code.

Different fsx configurations change the operation set that is run -
mixing DIO reads with buffered writes, turning mmap on and off,
using AIO or io_uring rather than synchronous IO, etc. These all
exercise different code paths and corner cases and have vastly
different code interactions, and that is what we need to cover when
developing new code.
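
A minimal sketch of how that operation space multiplies out. The
dimension names below are hypothetical labels for the variations listed
above, not real fsx options, and fsx may not accept every combination:

```python
from itertools import product

# Hypothetical labels for the fsx variations named above.
# These are illustrative only and are NOT real fsx command-line flags.
READ_MODES = ("buffered", "direct")
WRITE_MODES = ("buffered", "direct")
MMAP = (False, True)
IO_ENGINES = ("sync", "aio", "io_uring")


def fsx_operation_space():
    """Return every combination of the dimensions as a list of dicts."""
    keys = ("read", "write", "mmap", "engine")
    combos = product(READ_MODES, WRITE_MODES, MMAP, IO_ENGINES)
    return [dict(zip(keys, combo)) for combo in combos]


configs = fsx_operation_space()
print(len(configs))  # 2 * 2 * 2 * 3 = 24 distinct operation sets
```

Each entry is a distinct operation set, so a fixed budget of ops spread
across all of them touches far more code paths than the same budget
spent on any single entry.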

IOWs, we need coverage of the *entire operation space*, not just the
same narrow set of operations run billions of times.  A wide focus
requires billions of ops to cover because it requires lots of
different application configurations to be run. In contrast, there
are only three fs configurations that matter: bs < PS, bs == PS and
bs > PS.

For example, 16kB, 32kB and 64kB filesystem configs exercise exactly
the same code paths in exactly the same way (e.g. they all have non-zero
minimum folio orders and differ only in what that order is). Hence
running the same test application configs on these different
filesystem configurations does not actually improve code coverage of
the testing at all. Testing all of them only increases the resources
required to test a change; it does not improve the quality of
coverage of the testing being performed at all....
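
The equivalence-class argument can be sketched as follows, assuming a
4kB page size. The function names are hypothetical and the order
calculation is a simplified model, not the kernel's implementation:

```python
def config_class(block_size, page_size=4096):
    """Map a filesystem block size to one of the three classes that matter."""
    if block_size < page_size:
        return "bs < PS"
    if block_size == page_size:
        return "bs == PS"
    return "bs > PS"


def min_folio_order(block_size, page_size=4096):
    """Smallest order such that (page_size << order) covers one block."""
    order = 0
    while (page_size << order) < block_size:
        order += 1
    return order


for bs in (1024, 4096, 16384, 32768, 65536):
    print(bs, config_class(bs), min_folio_order(bs))
```

The 16kB, 32kB and 64kB cases all land in the "bs > PS" class with
orders 2, 3 and 4; only the order value differs between them.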

Hence I'd strongly suggest that, for the next posting of these
changes, you focus on making fstests pass without turning off any
failing tests, and that fsx is run with a wide variety of
configurations (e.g. modify all the fstests cases to run for a
configurable number of ops (e.g. via SOAK_DURATION)). We just don't
care at this point about finding that 1 in 10^15 ops bug because
it's code in development; what we actually care about is that
-everything- works correctly for the vast majority of use cases....

-Dave.
  
Matthew Wilcox Sept. 22, 2023, 7:38 p.m. UTC | #17
On Thu, Sep 21, 2023 at 04:03:56PM +1000, Dave Chinner wrote:
> So there's clearly something wrong here - it's likely that the
> filesystem IO alignment parameters pulled from the underlying block
> device (4k physical, 512 byte logical sector sizes) are improperly
> interpreted.  i.e. for a filesystem with a sector size of 4kB,
> direct IO with an alignment of 512 bytes should be rejected......

I wonder if it's something in the truncation code that's splitting folios
that ought not to be split.  Does this test possibly keep folios in
cache that maybe get invalidated?

truncate_inode_partial_folio() is the one i'm most concerned about.
but i'm also severely jetlagged.