[RFC,0/4] convert create_page_buffers to create_folio_buffers

Message ID	20230414110821.21548-1-p.raghav@samsung.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Pankaj Raghav <p.raghav@samsung.com> To: brauner@kernel.org, willy@infradead.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, mcgrof@kernel.org, gost.dev@samsung.com, hare@suse.de, Pankaj Raghav <p.raghav@samsung.com> Subject: [RFC 0/4] convert create_page_buffers to create_folio_buffers Date: Fri, 14 Apr 2023 13:08:17 +0200 Message-Id: <20230414110821.21548-1-p.raghav@samsung.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="utf-8" CMS-TYPE: 201P References: <CGME20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d@eucas1p1.samsung.com> Precedence: bulk
Series	convert create_page_buffers to create_folio_buffers \| [RFC,0/4] convert create_page_buffers to create_folio_buffers [RFC,1/4] fs/buffer: add set_bh_folio helper [RFC,2/4] buffer: add alloc_folio_buffers() helper [RFC,3/4] fs/buffer: add folio_create_empty_buffers helper [RFC,4/4] fs/buffer: convert create_page_buffers to create_folio_buffers

Message ID

20230414110821.21548-1-p.raghav@samsung.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Pankaj Raghav <p.raghav@samsung.com>
To: brauner@kernel.org, willy@infradead.org, viro@zeniv.linux.org.uk,
        akpm@linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        mcgrof@kernel.org, gost.dev@samsung.com, hare@suse.de,
        Pankaj Raghav <p.raghav@samsung.com>
Subject: [RFC 0/4] convert create_page_buffers to create_folio_buffers
Date: Fri, 14 Apr 2023 13:08:17 +0200
Message-Id: <20230414110821.21548-1-p.raghav@samsung.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"
CMS-TYPE: 201P
References: 
 <CGME20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d@eucas1p1.samsung.com>
Precedence: bulk

Series

convert create_page_buffers to create_folio_buffers |

Message

Pankaj Raghav April 14, 2023, 11:08 a.m. UTC

  One of the first kernel panic we hit when we try to increase the
block size > 4k is inside create_page_buffers()[1]. Even though buffer.c
function do not support large folios (folios > PAGE_SIZE) at the moment,
these changes are required when we want to remove that constraint.

Willy had already mentioned that he wanted to convert create_page_buffers to
create_folio_buffers but didn't get to it yet, so I decided to take a
shot.

No functional changes introduced.

OI:
- I don't like the fact that I had to introduce
  folio_create_empty_buffers() as create_empty_buffers() is used in
  many parts of the kernel. Should I do a big bang change as a part of
  this series where we convert create_empty_buffers to take a folio and
  change the callers to pass a folio instead of a page?

- I split the series into 4 commits for clarity. I could squash them
  into one patch if needed.

[1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/

Pankaj Raghav (4):
  fs/buffer: add set_bh_folio helper
  buffer: add alloc_folio_buffers() helper
  fs/buffer: add folio_create_empty_buffers helper
  fs/buffer: convert create_page_buffers to create_folio_buffers

 fs/buffer.c                 | 131 +++++++++++++++++++++++++++++++++---
 include/linux/buffer_head.h |   4 ++
 2 files changed, 125 insertions(+), 10 deletions(-)

Comments

Hannes Reinecke April 14, 2023, 1:47 p.m. UTC | #1

On 4/14/23 13:08, Pankaj Raghav wrote:
> One of the first kernel panic we hit when we try to increase the
> block size > 4k is inside create_page_buffers()[1]. Even though buffer.c
> function do not support large folios (folios > PAGE_SIZE) at the moment,
> these changes are required when we want to remove that constraint.
> 
> Willy had already mentioned that he wanted to convert create_page_buffers to
> create_folio_buffers but didn't get to it yet, so I decided to take a
> shot.
> 
> No functional changes introduced.
> 
> OI:
> - I don't like the fact that I had to introduce
>    folio_create_empty_buffers() as create_empty_buffers() is used in
>    many parts of the kernel. Should I do a big bang change as a part of
>    this series where we convert create_empty_buffers to take a folio and
>    change the callers to pass a folio instead of a page?
> 
> - I split the series into 4 commits for clarity. I could squash them
>    into one patch if needed.
> 
> [1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/
> 
> Pankaj Raghav (4):
>    fs/buffer: add set_bh_folio helper
>    buffer: add alloc_folio_buffers() helper
>    fs/buffer: add folio_create_empty_buffers helper
>    fs/buffer: convert create_page_buffers to create_folio_buffers
> 
>   fs/buffer.c                 | 131 +++++++++++++++++++++++++++++++++---
>   include/linux/buffer_head.h |   4 ++
>   2 files changed, 125 insertions(+), 10 deletions(-)
> 
Funnily enough, I've been tinkering along the same lines, and ended up 
with pretty similar patches.
I've had to use two additional patches to get my modified 'brd' driver 
off the ground with logical blocksize of 16k:
- mm/filemap: allocate folios according to the blocksize
   (will be sending the patch separately)
- Modify read_folio() to use the correct order:

@@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, 
get_block_t *get_block)
         if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
                 limit = inode->i_sb->s_maxbytes;

-       VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
         head = create_folio_buffers(folio, inode, 0);
         blocksize = head->b_size;
         bbits = block_size_bits(blocksize);

-       iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+       if (WARN_ON(PAGE_SHIFT < bbits)) {
+               iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
+       } else {
+               iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+       }
         lblock = (limit+blocksize-1) >> bbits;
         bh = head;
         nr = 0;


With that (and my modified brd driver) I've been able to set the logical 
blocksize to 16k for brd and have it happily loaded.
Haven't tested the write path yet, though; there's surely quite some 
work to be done.

BTW; I've got another patch replacing 'writepage' with 'write_folio'
(and the corresponding argument update). Is that a direction you want to go?

Cheers,

Hannes

Matthew Wilcox April 14, 2023, 1:51 p.m. UTC | #2

On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> BTW; I've got another patch replacing 'writepage' with 'write_folio'
> (and the corresponding argument update). Is that a direction you want to go?

No; ->writepage is being deleted.  It's already gone from ext4 and xfs.

Hannes Reinecke April 14, 2023, 1:56 p.m. UTC | #3

On 4/14/23 15:51, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
>> BTW; I've got another patch replacing 'writepage' with 'write_folio'
>> (and the corresponding argument update). Is that a direction you want to go?
> 
> No; ->writepage is being deleted.  It's already gone from ext4 and xfs.

Aw.
And here's me having converted block/fops over to using iomap w/ 
iomap_writepage(). Tough.

Oh well.
Wasn't a great fit anyway as for a sb_bread() replacement we would need 
a sub-page access for iomap.
Question is whether we really need that or shouldn't read PAGE_SIZE 
sectors always. Surely would make life easier. And would save us all the 
logic with bh_lru etc as we can rely on the page cache.

But probably an item for the iomap discussion at LSF.

Unless you got plans already ...

Cheers,

Hannes

Pankaj Raghav April 14, 2023, 3 p.m. UTC | #4

>> Pankaj Raghav (4):
>>    fs/buffer: add set_bh_folio helper
>>    buffer: add alloc_folio_buffers() helper
>>    fs/buffer: add folio_create_empty_buffers helper
>>    fs/buffer: convert create_page_buffers to create_folio_buffers
>>
>>   fs/buffer.c                 | 131 +++++++++++++++++++++++++++++++++---
>>   include/linux/buffer_head.h |   4 ++
>>   2 files changed, 125 insertions(+), 10 deletions(-)
>>
> Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches.
> I've had to use two additional patches to get my modified 'brd' driver off the ground with logical
> blocksize of 16k:

Good to know that we are working on a similar direction here.

> - mm/filemap: allocate folios according to the blocksize
>   (will be sending the patch separately)
> - Modify read_folio() to use the correct order:
> 
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>         if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>                 limit = inode->i_sb->s_maxbytes;
> 
> -       VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>         head = create_folio_buffers(folio, inode, 0);
>         blocksize = head->b_size;
>         bbits = block_size_bits(blocksize);
> 
> -       iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +       if (WARN_ON(PAGE_SHIFT < bbits)) {
> +               iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> +       } else {
> +               iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +       }
>         lblock = (limit+blocksize-1) >> bbits;
>         bh = head;
>         nr = 0;
> 
> 
> With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd
> and have it happily loaded.

I will give your patches a try as well.

Luis Chamberlain April 15, 2023, 1:01 a.m. UTC | #5

On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> On 4/14/23 13:08, Pankaj Raghav wrote:
> > One of the first kernel panic we hit when we try to increase the
> > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c
> > function do not support large folios (folios > PAGE_SIZE) at the moment,
> > these changes are required when we want to remove that constraint.

> Funnily enough, I've been tinkering along the same lines, and ended up with
> pretty similar patches.
> I've had to use two additional patches to get my modified 'brd' driver off
> the ground with logical blocksize of 16k:
> - mm/filemap: allocate folios according to the blocksize
>   (will be sending the patch separately)
> - Modify read_folio() to use the correct order:
> 
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
> get_block_t *get_block)
>         if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>                 limit = inode->i_sb->s_maxbytes;
> 
> -       VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>         head = create_folio_buffers(folio, inode, 0);
>         blocksize = head->b_size;
>         bbits = block_size_bits(blocksize);
> 
> -       iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +       if (WARN_ON(PAGE_SHIFT < bbits)) {
> +               iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> +       } else {
> +               iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +       }
>         lblock = (limit+blocksize-1) >> bbits;
>         bh = head;
>         nr = 0;

At a quick glance I think both approaches (unless Hannes already did it)
seem to just miss that pesky static *arr[MAX_BUF_PER_PAGE], and so I
think we need to:

a) dynamically allocate those now
b) do a cursory review of the users of that and prepare them
   to grok buffer heads which are blocksize based rather than
   PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE.

Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for
bs > PAGE_SIZE right now.

Luis

Matthew Wilcox April 15, 2023, 2:31 a.m. UTC | #6

On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote:
> a) dynamically allocate those now
> b) do a cursory review of the users of that and prepare them
>    to grok buffer heads which are blocksize based rather than
>    PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE.
> 
> Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for
> bs > PAGE_SIZE right now.

Worse, we'll overflow the array and corrupt the stack.

This one is a simple fix ...

+++ b/fs/buffer.c
@@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
        struct inode *inode = folio->mapping->host;
        sector_t iblock, lblock;
-       struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+       struct buffer_head *bh, *head;
        unsigned int blocksize, bbits;
        int nr, i;
        int fully_mapped = 1;
@@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
                        if (buffer_uptodate(bh))
                                continue;
                }
-               arr[nr++] = bh;
        } while (i++, iblock++, (bh = bh->b_this_page) != head);
 
        if (fully_mapped)
@@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
        }
 
        /* Stage two: lock the buffers */
-       for (i = 0; i < nr; i++) {
-               bh = arr[i];
+       bh = head;
+       do {
                lock_buffer(bh);
                mark_buffer_async_read(bh);
-       }
+               bh = bh->b_this_page;
+       } while (bh != head);
 
        /*
         * Stage 3: start the IO.  Check for uptodateness
         * inside the buffer lock in case another process reading
         * the underlying blockdev brought it uptodate (the sct fix).
         */
-       for (i = 0; i < nr; i++) {
-               bh = arr[i];
+       bh = head;
+       do {
                if (buffer_uptodate(bh))
                        end_buffer_async_read(bh, 1);
                else
                        submit_bh(REQ_OP_READ, bh);
-       }
+               bh = bh->b_this_page;
+       } while (bh != head);
+
        return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);

Luis Chamberlain April 15, 2023, 3:24 a.m. UTC | #7

On Sat, Apr 15, 2023 at 03:31:54AM +0100, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote:
> > a) dynamically allocate those now
> > b) do a cursory review of the users of that and prepare them
> >    to grok buffer heads which are blocksize based rather than
> >    PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE.
> > 
> > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for
> > bs > PAGE_SIZE right now.
> 
> Worse, we'll overflow the array and corrupt the stack.
> 
> This one is a simple fix ...
> 
> +++ b/fs/buffer.c
> @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  {
>         struct inode *inode = folio->mapping->host;
>         sector_t iblock, lblock;
> -       struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> +       struct buffer_head *bh, *head;
>         unsigned int blocksize, bbits;
>         int nr, i;
>         int fully_mapped = 1;
> @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>                         if (buffer_uptodate(bh))
>                                 continue;
>                 }
> -               arr[nr++] = bh;
>         } while (i++, iblock++, (bh = bh->b_this_page) != head);
>  
>         if (fully_mapped)
> @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>         }
>  
>         /* Stage two: lock the buffers */
> -       for (i = 0; i < nr; i++) {
> -               bh = arr[i];
> +       bh = head;
> +       do {
>                 lock_buffer(bh);
>                 mark_buffer_async_read(bh);
> -       }
> +               bh = bh->b_this_page;
> +       } while (bh != head);
>  
>         /*
>          * Stage 3: start the IO.  Check for uptodateness
>          * inside the buffer lock in case another process reading
>          * the underlying blockdev brought it uptodate (the sct fix).
>          */
> -       for (i = 0; i < nr; i++) {
> -               bh = arr[i];
> +       bh = head;
> +       do {
>                 if (buffer_uptodate(bh))
>                         end_buffer_async_read(bh, 1);
>                 else
>                         submit_bh(REQ_OP_READ, bh);
> -       }
> +               bh = bh->b_this_page;
> +       } while (bh != head);
> +
>         return 0;

I thought of that but I saw that the loop that assigns the arr only
pegs a bh if we don't "continue" for certain conditions, which made me
believe that we only wanted to keep on the array as non-null items which
meet the initial loop's criteria. If that is not accurate then yes,
the simplication is nice!

  Luis

Matthew Wilcox April 15, 2023, 3:44 a.m. UTC | #8

On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote:
> I thought of that but I saw that the loop that assigns the arr only
> pegs a bh if we don't "continue" for certain conditions, which made me
> believe that we only wanted to keep on the array as non-null items which
> meet the initial loop's criteria. If that is not accurate then yes,
> the simplication is nice!

Uh, right.  A little bit more carefully this time ... how does this
look?

diff --git a/fs/buffer.c b/fs/buffer.c
index 5e67e21b350a..dff671079b02 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
 	struct inode *inode = folio->mapping->host;
 	sector_t iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr, i;
 	int fully_mapped = 1;
@@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 			if (buffer_uptodate(bh))
 				continue;
 		}
-		arr[nr++] = bh;
+		nr++;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
 	if (fully_mapped)
@@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 		return 0;
 	}
 
-	/* Stage two: lock the buffers */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
+	/*
+	 * Stage two: lock the buffers.  Recheck the uptodate flag under
+	 * the lock in case somebody else brought it uptodate first.
+	 */
+	bh = head;
+	do {
+		if (buffer_uptodate(bh))
+			continue;
 		lock_buffer(bh);
+		if (buffer_uptodate(bh)) {
+			unlock_buffer(bh);
+			continue;
+		}
 		mark_buffer_async_read(bh);
-	}
+	} while ((bh = bh->b_this_page) != head);
 
-	/*
-	 * Stage 3: start the IO.  Check for uptodateness
-	 * inside the buffer lock in case another process reading
-	 * the underlying blockdev brought it uptodate (the sct fix).
-	 */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
-		else
+	/* Stage 3: start the IO */
+	bh = head;
+	do {
+		if (buffer_async_read(bh))
 			submit_bh(REQ_OP_READ, bh);
-	}
+	} while ((bh = bh->b_this_page) != head);
+
 	return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);


I do wonder how much it's worth doing this vs switching to non-BH methods.
I appreciate that's a lot of work still.

Hannes Reinecke April 15, 2023, 1:14 p.m. UTC | #9

On 4/15/23 05:44, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote:
>> I thought of that but I saw that the loop that assigns the arr only
>> pegs a bh if we don't "continue" for certain conditions, which made me
>> believe that we only wanted to keep on the array as non-null items which
>> meet the initial loop's criteria. If that is not accurate then yes,
>> the simplication is nice!
> 
> Uh, right.  A little bit more carefully this time ... how does this
> look?
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 5e67e21b350a..dff671079b02 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>   {
>   	struct inode *inode = folio->mapping->host;
>   	sector_t iblock, lblock;
> -	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> +	struct buffer_head *bh, *head;
>   	unsigned int blocksize, bbits;
>   	int nr, i;
>   	int fully_mapped = 1;
> @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>   			if (buffer_uptodate(bh))
>   				continue;
>   		}
> -		arr[nr++] = bh;
> +		nr++;
>   	} while (i++, iblock++, (bh = bh->b_this_page) != head);
>   
>   	if (fully_mapped)
> @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>   		return 0;
>   	}
>   
> -	/* Stage two: lock the buffers */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> +	/*
> +	 * Stage two: lock the buffers.  Recheck the uptodate flag under
> +	 * the lock in case somebody else brought it uptodate first.
> +	 */
> +	bh = head;
> +	do {
> +		if (buffer_uptodate(bh))
> +			continue;
>   		lock_buffer(bh);
> +		if (buffer_uptodate(bh)) {
> +			unlock_buffer(bh);
> +			continue;
> +		}
>   		mark_buffer_async_read(bh);
> -	}
> +	} while ((bh = bh->b_this_page) != head);
>   
> -	/*
> -	 * Stage 3: start the IO.  Check for uptodateness
> -	 * inside the buffer lock in case another process reading
> -	 * the underlying blockdev brought it uptodate (the sct fix).
> -	 */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		if (buffer_uptodate(bh))
> -			end_buffer_async_read(bh, 1);
> -		else
> +	/* Stage 3: start the IO */
> +	bh = head;
> +	do {
> +		if (buffer_async_read(bh))
>   			submit_bh(REQ_OP_READ, bh);
> -	}
> +	} while ((bh = bh->b_this_page) != head);
> +
>   	return 0;
>   }
>   EXPORT_SYMBOL(block_read_full_folio);
> 
> 
> I do wonder how much it's worth doing this vs switching to non-BH methods.
> I appreciate that's a lot of work still.

That's what I've been wondering, too.

I would _vastly_ prefer to switch over to iomap; however, the blasted
sb_bread() is getting in the way. Currently iomap only runs on entire
pages / folios, but a lot of (older) filesystems insist on doing 512
byte I/O. While this seem logical (seeing that 512 bytes is the
default, and, in most cases, the only supported sector size) question
is whether _we_ from the linux side need to do that.
We _could_ upgrade to always do full page I/O; there's a good
chance we'll be using the entire page anyway eventually.
And with storage bandwidth getting larger and larger we might even
get a performance boost there.

And it would save us having to implement sub-page I/O for iomap.

Hmm?

Cheers,

Hannes

Matthew Wilcox April 15, 2023, 5:09 p.m. UTC | #10

On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> On 4/15/23 05:44, Matthew Wilcox wrote:
> > I do wonder how much it's worth doing this vs switching to non-BH methods.
> > I appreciate that's a lot of work still.
> 
> That's what I've been wondering, too.
> 
> I would _vastly_ prefer to switch over to iomap; however, the blasted
> sb_bread() is getting in the way. Currently iomap only runs on entire
> pages / folios, but a lot of (older) filesystems insist on doing 512

Hang on, no, iomap can issue sub-page reads.  eg iomap_read_folio_sync()
will read the parts of the folio which have not yet been read when
called from __iomap_write_begin().

> byte I/O. While this seem logical (seeing that 512 bytes is the
> default, and, in most cases, the only supported sector size) question
> is whether _we_ from the linux side need to do that.
> We _could_ upgrade to always do full page I/O; there's a good
> chance we'll be using the entire page anyway eventually.
> And with storage bandwidth getting larger and larger we might even
> get a performance boost there.

I think we need to look at this from the filesystem side.  What do
filesystems actually want to do?  The first thing is they want to read
the superblock.  That's either going to be immediately freed ("Oh,
this isn't a JFS filesystem after all") or it's going to hang around
indefinitely.  There's no particular need to keep it in any kind of
cache (buffer or page).  Except ... we want to probe a dozen different
filesystems, and half of them keep their superblock at the same offset
from the start of the block device.  So we do want to keep it cached.
That's arguing for using the page cache, at least to read it.

Now, do we want userspace to be able to dd a new superblock into place
and have the mounted filesystem see it?  I suspect that confuses just
about every filesystem out there.  So I think the right answer is to read
the page into the bdev's page cache and then copy it into a kmalloc'ed
buffer which the filesystem is then responsible for freeing.  It's also
responsible for writing it back (so that's another API we need), and for
a journalled filesystem, it needs to fit into the journalling scheme.
Also, we may need to write back multiple copies of the superblock,
possibly with slight modifications.

There are a lot of considerations here, and I don't feel like I have
enough of an appreciation of filesystem needs to come up with a decent
API.  I'd hope we can get a good discussion going at LSFMM.

Luis Chamberlain April 16, 2023, 1:28 a.m. UTC | #11

On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > On 4/15/23 05:44, Matthew Wilcox wrote:
> > We _could_ upgrade to always do full page I/O; there's a good
> > chance we'll be using the entire page anyway eventually.

*Iff* doing away with buffer head 512 granularity could help block sizes
greater than page size where physical and logical block size > PAGE_SIZE
we whould also be able to see it on 4kn drives (logical and physical
block size == 4k). A projection could be made after.

In so far as experimenting with this, if you already have some
effort on IOMAP for bdev aops one possibility for pure experimentation
for now would be to peg a new set of aops could be set in the path
of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early
for us to know if the device is has (lbs = pbs) > 512. For NVMe for
instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
put together and set the logical and phyiscal block size on NVMe on
nvme_update_ns_info() --> nvme_update_disk_info(), right before we
call device_add_disk(). The only way to override the aops then would
be right before device_add_disk(), or as part of a new device_add_disk_aops()
or whatever.

> > And with storage bandwidth getting larger and larger we might even
> > get a performance boost there.
> 
> I think we need to look at this from the filesystem side.

Before that let's recap the bdev cache current issues.

Today by just adding the disk we move on to partition scanning
immediately unless your block driver has a flag that says otherwise. The
current crash we're evaluating with brd and that we also hit with NVMe
is due to this part.

device_add_disk() -->
  disk_scan_partitions() -->
    blkdev_get_whole() -->
      bdev_disk_changed() -->
        filemap_read_folio() --> filler()

The filler is from aops.

We don't even have a filesystem yet on these devices at this point. The entire
partition core does this partition scanning. Refer to:

disk_scan_partitions() --> block/partitions/core.c : bdev_disk_changed()

And all of that stuff is also under a 512-byte atomic operation assumption,
we could do better if wanted to.

> What do filesystems actually want to do?

So you are suggesting that the early reads of the block device by the
block cache and its use of the page cache cache should be aligned /
perhaps redesigned to assist more clearly with what modern filesystems
might actually would want today?

> The first thing is they want to read
> the superblock.  That's either going to be immediately freed ("Oh,
> this isn't a JFS filesystem after all") or it's going to hang around
> indefinitely.  There's no particular need to keep it in any kind of
> cache (buffer or page). 

And the bdev cache would not be able to know before hand that's the case.

> Except ... we want to probe a dozen different
> filesystems, and half of them keep their superblock at the same offset
> from the start of the block device.  So we do want to keep it cached.
> That's arguing for using the page cache, at least to read it.

Do we currently share anything from the bdev cache with the fs for this?
Let's say that first block device blocksize in memory.

> Now, do we want userspace to be able to dd a new superblock into place
> and have the mounted filesystem see it? 

Not sure I follow this. dd a new super block?

> I suspect that confuses just
> about every filesystem out there.  So I think the right answer is to read
> the page into the bdev's page cache and then copy it into a kmalloc'ed
> buffer which the filesystem is then responsible for freeing.  It's also
> responsible for writing it back (so that's another API we need), and for
> a journalled filesystem, it needs to fit into the journalling scheme.
> Also, we may need to write back multiple copies of the superblock,
> possibly with slight modifications.

Are you considering these as extentions to the bdev cache?

  Luis

Matthew Wilcox April 16, 2023, 3:40 a.m. UTC | #12

On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > > On 4/15/23 05:44, Matthew Wilcox wrote:
> > > We _could_ upgrade to always do full page I/O; there's a good
> > > chance we'll be using the entire page anyway eventually.
> 
> *Iff* doing away with buffer head 512 granularity could help block sizes
> greater than page size where physical and logical block size > PAGE_SIZE
> we whould also be able to see it on 4kn drives (logical and physical
> block size == 4k). A projection could be made after.
> 
> In so far as experimenting with this, if you already have some
> effort on IOMAP for bdev aops one possibility for pure experimentation
> for now would be to peg a new set of aops could be set in the path
> of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early
> for us to know if the device is has (lbs = pbs) > 512. For NVMe for
> instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
> put together and set the logical and phyiscal block size on NVMe on
> nvme_update_ns_info() --> nvme_update_disk_info(), right before we
> call device_add_disk(). The only way to override the aops then would
> be right before device_add_disk(), or as part of a new device_add_disk_aops()
> or whatever.

I think you're making this harder than you need to.
For LBA size > PAGE_SIZE, there is always only going to be
one BH per folio-that-is-LBA-size, so all the problems with
more-than-8-BHs-per-4k-page don't actually exist.  I don't think we
should be overriding the aops, and if we narrow the scope of large folio
support in blockdev t only supporting folio_size == LBA size, it becomes
much more feasible.

> > > And with storage bandwidth getting larger and larger we might even
> > > get a performance boost there.
> > 
> > I think we need to look at this from the filesystem side.
> 
> Before that let's recap the bdev cache current issues.

Ooh, yes, this is good!  I was totally neglecting the partition
scanning code.

> Today by just adding the disk we move on to partition scanning
> immediately unless your block driver has a flag that says otherwise. The
> current crash we're evaluating with brd and that we also hit with NVMe
> is due to this part.
> 
> device_add_disk() -->
>   disk_scan_partitions() -->
>     blkdev_get_whole() -->
>       bdev_disk_changed() -->
>         filemap_read_folio() --> filler()
> 
> The filler is from aops.

Right, so you missed a step in that callchain, which is
read_mapping_folio().  That ends up in do_read_cache_folio(), which
contains the deadly:

                folio = filemap_alloc_folio(gfp, 0);

so that needs to be changed to get the minimum folio order from the
mapping, and then it should work.

> > What do filesystems actually want to do?
> 
> So you are suggesting that the early reads of the block device by the
> block cache and its use of the page cache cache should be aligned /
> perhaps redesigned to assist more clearly with what modern filesystems
> might actually would want today?

Perhaps?  I'm just saying the needs of the block device are not the
only consideration here.  I'd like an API that makes sense for the fs
author.

> > The first thing is they want to read
> > the superblock.  That's either going to be immediately freed ("Oh,
> > this isn't a JFS filesystem after all") or it's going to hang around
> > indefinitely.  There's no particular need to keep it in any kind of
> > cache (buffer or page). 
> 
> And the bdev cache would not be able to know before hand that's the case.

Right, nobody knows until it's been read and examined.

> > Except ... we want to probe a dozen different
> > filesystems, and half of them keep their superblock at the same offset
> > from the start of the block device.  So we do want to keep it cached.
> > That's arguing for using the page cache, at least to read it.
> 
> Do we currently share anything from the bdev cache with the fs for this?
> Let's say that first block device blocksize in memory.

sb_bread() is used by most filesystems, and the buffer cache aliases
into the page cache.

> > Now, do we want userspace to be able to dd a new superblock into place
> > and have the mounted filesystem see it? 
> 
> Not sure I follow this. dd a new super block?

In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
I can overwrite the superblock.  Do we want filesystems to see that
kind of vandalism, or do we want the mounted filesystem to have its
own copy of the data and overwrite what userspace wrote the next time it
updates the superblock?

(the trick is that this may not be vandalism, it might be the sysadmin
updating the uuid or running some fsck-ish program or trying to update
the superblock to support fabulous-new-feature on next mount.  does this
change the answer?)

> > I suspect that confuses just
> > about every filesystem out there.  So I think the right answer is to read
> > the page into the bdev's page cache and then copy it into a kmalloc'ed
> > buffer which the filesystem is then responsible for freeing.  It's also
> > responsible for writing it back (so that's another API we need), and for
> > a journalled filesystem, it needs to fit into the journalling scheme.
> > Also, we may need to write back multiple copies of the superblock,
> > possibly with slight modifications.
> 
> Are you considering these as extentions to the bdev cache?

I'm trying to suggest some of the considerations that need to go into
a replacement for sb_bread().

Luis Chamberlain April 16, 2023, 5:26 a.m. UTC | #13

On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote:
> > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > > > On 4/15/23 05:44, Matthew Wilcox wrote:
> > > > We _could_ upgrade to always do full page I/O; there's a good
> > > > chance we'll be using the entire page anyway eventually.
> > 
> > *Iff* doing away with buffer head 512 granularity could help block sizes
> > greater than page size where physical and logical block size > PAGE_SIZE
> > we whould also be able to see it on 4kn drives (logical and physical
> > block size == 4k). A projection could be made after.
> > 
> > In so far as experimenting with this, if you already have some
> > effort on IOMAP for bdev aops one possibility for pure experimentation
> > for now would be to peg a new set of aops could be set in the path
> > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early
> > for us to know if the device is has (lbs = pbs) > 512. For NVMe for
> > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
> > put together and set the logical and phyiscal block size on NVMe on
> > nvme_update_ns_info() --> nvme_update_disk_info(), right before we
> > call device_add_disk(). The only way to override the aops then would
> > be right before device_add_disk(), or as part of a new device_add_disk_aops()
> > or whatever.
> 
> I think you're making this harder than you need to.
> For LBA size > PAGE_SIZE, there is always only going to be
> one BH per folio-that-is-LBA-size, so all the problems with
> more-than-8-BHs-per-4k-page don't actually exist.

Oh! Then yes, sorry!

> I don't think we
> should be overriding the aops, and if we narrow the scope of large folio
> support in blockdev t only supporting folio_size == LBA size, it becomes
> much more feasible.

I'm trying to think of the possible use cases where folio_size != LBA size
and I cannot immediately think of some. Yes there are cases where a
filesystem may use a different block for say meta data than data, but that
I believe is side issue, ie, read/writes for small metadata would have
to be accepted. At least for NVMe we have metadata size as part of the
LBA format, but from what I understand no Linux filesystem yet uses that.

> > > > And with storage bandwidth getting larger and larger we might even
> > > > get a performance boost there.
> > > 
> > > I think we need to look at this from the filesystem side.
> > 
> > Before that let's recap the bdev cache current issues.
> 
> Ooh, yes, this is good!  I was totally neglecting the partition
> scanning code.
> 
> > Today by just adding the disk we move on to partition scanning
> > immediately unless your block driver has a flag that says otherwise. The
> > current crash we're evaluating with brd and that we also hit with NVMe
> > is due to this part.
> > 
> > device_add_disk() -->
> >   disk_scan_partitions() -->
> >     blkdev_get_whole() -->
> >       bdev_disk_changed() -->
> >         filemap_read_folio() --> filler()
> > 
> > The filler is from aops.
> 
> Right, so you missed a step in that callchain, which is
> read_mapping_folio().  That ends up in do_read_cache_folio(), which
> contains the deadly:
> 
>                 folio = filemap_alloc_folio(gfp, 0);

Right and before this we have:

	if (!filler)                                                            
		filler = mapping->a_ops->read_folio;

The folio is order 0 and after filemap_alloc_folio() its added to the
page cache, then filemap_read_folio(file, filler, folio)
uses the filler blkdev_read_folio()  --> fs/buffer.c block_read_full_folio()

Just for posterity:

fs/buffer.c
int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
	...
	struct buffer_head *bh, *head,...;
	...
	head = create_page_buffers(&folio->page, inode, 0);
	...
}

static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
{                                                                               
        BUG_ON(!PageLocked(page));                                              
                                                                                
        if (!page_has_buffers(page))                                            
                create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),    
                                     b_state);                                  
        return page_buffers(page);                                              
}

void create_empty_buffers(struct page *page,                                    
                        unsigned long blocksize, unsigned long b_state)         
{
        struct buffer_head *bh, *head, *tail;                                   
                                                                                
        head = alloc_page_buffers(page, blocksize, true);
        bh = head; 
        do {                                                                    
                bh->b_state |= b_state; -----> CRASH HERE head is NULL
                tail = bh;                                                      
                bh = bh->b_this_page;                                           
        } while (bh);                                                           
        tail->b_this_page = head;
}

Why is it NULL? The blocksize passed to alloc_page_buffers() is larger
than PAGE_SIZE and below offset is PAGE_SIZE. PAGE_SIZE - blocksize
gives a negative number and so the loop is not traversed.

struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,   
                bool retry)                                                     
{ 
        struct buffer_head *bh, *head;                                          
        gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;                                   
        long offset;                                                            
        struct mem_cgroup *memcg, *old_memcg;                                   
                                                                                
        if (retry)                                                              
                gfp |= __GFP_NOFAIL;                                            
                                                                                
        /* The page lock pins the memcg */ 
        memcg = page_memcg(page);                                               
        old_memcg = set_active_memcg(memcg);                                    
                                                                                
        head = NULL;  
        offset = PAGE_SIZE;                                                     
        while ((offset -= size) >= 0) {                                         
                bh = alloc_buffer_head(gfp);                                    
                if (!bh)                                                        
                        goto no_grow;                                           
                                                                                
                bh->b_this_page = head;                                         
                bh->b_blocknr = -1;                                             
                head = bh;                                                      
                                                                                
                bh->b_size = size;                                              
                                                                                
                /* Link the buffer to its page */                               
                set_bh_page(bh, page, offset);                                  
        }                                                                       
out:                                                                            
        set_active_memcg(old_memcg);                                            
        return head;
	...
}

I see now what you say about the buffer head being of the block size
bh->b_size = size above.

> so that needs to be changed to get the minimum folio order from the
> mapping, and then it should work.
> 
> > > What do filesystems actually want to do?
> > 
> > So you are suggesting that the early reads of the block device by the
> > block cache and its use of the page cache cache should be aligned /
> > perhaps redesigned to assist more clearly with what modern filesystems
> > might actually would want today?
> 
> Perhaps?  I'm just saying the needs of the block device are not the
> only consideration here.  I'd like an API that makes sense for the fs
> author.

Makes sense!

> > > The first thing is they want to read
> > > the superblock.  That's either going to be immediately freed ("Oh,
> > > this isn't a JFS filesystem after all") or it's going to hang around
> > > indefinitely.  There's no particular need to keep it in any kind of
> > > cache (buffer or page). 
> > 
> > And the bdev cache would not be able to know before hand that's the case.
> 
> Right, nobody knows until it's been read and examined.
> 
> > > Except ... we want to probe a dozen different
> > > filesystems, and half of them keep their superblock at the same offset
> > > from the start of the block device.  So we do want to keep it cached.
> > > That's arguing for using the page cache, at least to read it.
> > 
> > Do we currently share anything from the bdev cache with the fs for this?
> > Let's say that first block device blocksize in memory.
> 
> sb_bread() is used by most filesystems, and the buffer cache aliases
> into the page cache.

I see thanks. I checked what xfs does and its xfs_readsb() uses its own
xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
they did that.

> > > Now, do we want userspace to be able to dd a new superblock into place
> > > and have the mounted filesystem see it? 
> > 
> > Not sure I follow this. dd a new super block?
> 
> In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> I can overwrite the superblock.  Do we want filesystems to see that
> kind of vandalism, or do we want the mounted filesystem to have its
> own copy of the data and overwrite what userspace wrote the next time it
> updates the superblock?

Oh, what happens today?

> (the trick is that this may not be vandalism, it might be the sysadmin
> updating the uuid or running some fsck-ish program or trying to update
> the superblock to support fabulous-new-feature on next mount.  does this
> change the answer?)
> 
> > > I suspect that confuses just
> > > about every filesystem out there.  So I think the right answer is to read
> > > the page into the bdev's page cache and then copy it into a kmalloc'ed
> > > buffer which the filesystem is then responsible for freeing.  It's also
> > > responsible for writing it back (so that's another API we need), and for
> > > a journalled filesystem, it needs to fit into the journalling scheme.
> > > Also, we may need to write back multiple copies of the superblock,
> > > possibly with slight modifications.
> > 
> > Are you considering these as extentions to the bdev cache?
> 
> I'm trying to suggest some of the considerations that need to go into
> a replacement for sb_bread().

I see! That would also help EOL buffer heads for that purpose.

  Luis

Matthew Wilcox April 16, 2023, 2:07 p.m. UTC | #14

On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> > I don't think we
> > should be overriding the aops, and if we narrow the scope of large folio
> > support in blockdev t only supporting folio_size == LBA size, it becomes
> > much more feasible.
> 
> I'm trying to think of the possible use cases where folio_size != LBA size
> and I cannot immediately think of some. Yes there are cases where a
> filesystem may use a different block for say meta data than data, but that
> I believe is side issue, ie, read/writes for small metadata would have
> to be accepted. At least for NVMe we have metadata size as part of the
> LBA format, but from what I understand no Linux filesystem yet uses that.

NVMe metadata is per-block metadata -- a CRC or similar.  Filesystem
metadata is things like directories, inode tables, free space bitmaps,
etc.

> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,   
>                 bool retry)                                                     
> { 
[...]
>         head = NULL;  
>         offset = PAGE_SIZE;                                                     
>         while ((offset -= size) >= 0) {                                         
> 
> I see now what you say about the buffer head being of the block size
> bh->b_size = size above.

Yes, just changing that to 'offset = page_size(page);' will do the trick.

> > sb_bread() is used by most filesystems, and the buffer cache aliases
> > into the page cache.
> 
> I see thanks. I checked what xfs does and its xfs_readsb() uses its own
> xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
> they did that.

IRIX didn't have an sb_bread() ;-)

> > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > I can overwrite the superblock.  Do we want filesystems to see that
> > kind of vandalism, or do we want the mounted filesystem to have its
> > own copy of the data and overwrite what userspace wrote the next time it
> > updates the superblock?
> 
> Oh, what happens today?

Depends on the filesystem, I think?  Not really sure, to be honest.

Dave Chinner April 16, 2023, 10:57 p.m. UTC | #15

On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > > > Except ... we want to probe a dozen different
> > > > filesystems, and half of them keep their superblock at the same offset
> > > > from the start of the block device.  So we do want to keep it cached.
> > > > That's arguing for using the page cache, at least to read it.
> > > 
> > > Do we currently share anything from the bdev cache with the fs for this?
> > > Let's say that first block device blocksize in memory.
> > 
> > sb_bread() is used by most filesystems, and the buffer cache aliases
> > into the page cache.
> 
> I see thanks. I checked what xfs does and its xfs_readsb() uses its own
> xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
> they did that.

XFS has it's own metadata address space for caching - it does not
use the block device page cache at all. This is not new, it never
has.

The xfs_buf buffer cache does not use the page cache, either. It
does it's own thing, has it's own indexing, locking, shrinkers, etc.
IOWs, it does not use the iomap infrastructure at all - iomap is
used by XFS exclusively for data IO.

As for why we use an uncached buffer for the superblock? That's
largely historic because prior to 2007 every modification that did
allocation/free needed to lock and modify the superblock at
transaction commit. Hence it's always needed in memory but a
critical fast path, so it is always directly available without
needing to do a cache lookup to callers that need it.

In 2007, lazy superblock counters got rid of the requirement to lock
the superblock buffer in every transaction commit, so the uncached
buffer optimisation hasn't really been needed for the past decade.
But if it ain't broke, don't try to fix it....

> > > > Now, do we want userspace to be able to dd a new superblock into place
> > > > and have the mounted filesystem see it? 
> > > 
> > > Not sure I follow this. dd a new super block?
> > 
> > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > I can overwrite the superblock.  Do we want filesystems to see that
> > kind of vandalism, or do we want the mounted filesystem to have its
> > own copy of the data and overwrite what userspace wrote the next time it
> > updates the superblock?
> 
> Oh, what happens today?

In XFS, it will completely ignore the fact the the superblock got
trashed like this. When the fs goes idle, or the sb modified for
some other reason, it will relog the in-memory superblock and write
it back to disk, thereby fixing the corruption. i.e. while the
filesystem is mounted, the superblock is _write-only_...

> > (the trick is that this may not be vandalism, it might be the sysadmin
> > updating the uuid or running some fsck-ish program or trying to update
> > the superblock to support fabulous-new-feature on next mount.  does this
> > change the answer?)

If you need to change anything in the superblock while the XFS fs is
mounted, then you have to use ioctls to modify the superblock
contents through the running transaction subsystem. Editting the
block device directly breaks the security model of filesystems that
assume they have exclusive access to the block device whilst the
filesystem is mounted....

-Dave.

Luis Chamberlain April 17, 2023, 2:27 a.m. UTC | #16

On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
> get_block_t *get_block)
>         if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>                 limit = inode->i_sb->s_maxbytes;
> 
> -       VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>         head = create_folio_buffers(folio, inode, 0);
>         blocksize = head->b_size;
>         bbits = block_size_bits(blocksize);
> 
> -       iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +       if (WARN_ON(PAGE_SHIFT < bbits)) {
> +               iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> +       } else {
> +               iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +       }
>         lblock = (limit+blocksize-1) >> bbits;
>         bh = head;
>         nr = 0;

BTW I See this pattern in:

fs/mpage.c: do_mpage_readpage()
fs/mpage.c: __mpage_writepage()

A helper might be in order.

  Luis

Hannes Reinecke April 17, 2023, 6:04 a.m. UTC | #17

On 4/17/23 04:27, Luis Chamberlain wrote:
> On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
>> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
>> get_block_t *get_block)
>>          if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>>                  limit = inode->i_sb->s_maxbytes;
>>
>> -       VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>>          head = create_folio_buffers(folio, inode, 0);
>>          blocksize = head->b_size;
>>          bbits = block_size_bits(blocksize);
>>
>> -       iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
>> +       if (WARN_ON(PAGE_SHIFT < bbits)) {
>> +               iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
>> +       } else {
>> +               iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
>> +       }
>>          lblock = (limit+blocksize-1) >> bbits;
>>          bh = head;
>>          nr = 0;
> 
> BTW I See this pattern in:
> 
> fs/mpage.c: do_mpage_readpage()
> fs/mpage.c: __mpage_writepage()
> 
> A helper might be in order.
> 

But only _iff_ we decide to stick with buffer_heads and upgrade the 
buffer_head code to handle multi-page block sizes.
The idea was to move over to iomap, hence I'm not sure into which
lengths we should go keeping buffer_heads alive ...

Cheers,

Hannes

Darrick J. Wong April 17, 2023, 3:40 p.m. UTC | #18

On Sun, Apr 16, 2023 at 03:07:33PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> > > I don't think we
> > > should be overriding the aops, and if we narrow the scope of large folio
> > > support in blockdev t only supporting folio_size == LBA size, it becomes
> > > much more feasible.
> > 
> > I'm trying to think of the possible use cases where folio_size != LBA size
> > and I cannot immediately think of some. Yes there are cases where a
> > filesystem may use a different block for say meta data than data, but that
> > I believe is side issue, ie, read/writes for small metadata would have
> > to be accepted. At least for NVMe we have metadata size as part of the
> > LBA format, but from what I understand no Linux filesystem yet uses that.
> 
> NVMe metadata is per-block metadata -- a CRC or similar.  Filesystem
> metadata is things like directories, inode tables, free space bitmaps,
> etc.
> 
> > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,   
> >                 bool retry)                                                     
> > { 
> [...]
> >         head = NULL;  
> >         offset = PAGE_SIZE;                                                     
> >         while ((offset -= size) >= 0) {                                         
> > 
> > I see now what you say about the buffer head being of the block size
> > bh->b_size = size above.
> 
> Yes, just changing that to 'offset = page_size(page);' will do the trick.
> 
> > > sb_bread() is used by most filesystems, and the buffer cache aliases
> > > into the page cache.
> > 
> > I see thanks. I checked what xfs does and its xfs_readsb() uses its own
> > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
> > they did that.
> 
> IRIX didn't have an sb_bread() ;-)
> 
> > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > > I can overwrite the superblock.  Do we want filesystems to see that
> > > kind of vandalism, or do we want the mounted filesystem to have its
> > > own copy of the data and overwrite what userspace wrote the next time it
> > > updates the superblock?
> > 
> > Oh, what happens today?
> 
> Depends on the filesystem, I think?  Not really sure, to be honest.

The filesystem driver sees the vandalism, and can very well crash as a
result[1].  In that case it was corrupted journal contents being
replayed, but the same thing would happen if you wrote a malicious
userspace program to set the metadata_csum feature flag in the ondisk
superblock after mounting the fs.

https://bugzilla.kernel.org/show_bug.cgi?id=82201#c4

I've tried to prevent people from writing to mounted block devices in
the past, but did not succeed.  If you try to prevent programs from
opening such devices with O_RDWR/O_WRONLY you then break lvm tools which
require that ability even though they don't actually write anything to
the block device.  If you make the block device write_iter function
fail, then old e2fsprogs breaks and you get shouted at for breaking
userspace.

Hence I decided to let security researchers find these bugs and control
the design discussion via CVE.  That's not correct and it's not smart,
but it preserves some of my sanity.

--D