writeback: Account the number of pages written back

Message ID: 20230628185548.981888-1-willy@infradead.org
State: New
Series: writeback: Account the number of pages written back

Commit Message

Matthew Wilcox June 28, 2023, 6:55 p.m. UTC
  nr_to_write is a count of pages, so we need to decrease it by the number
of pages in the folio we just wrote, not by 1.  Most callers specify
either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
might end up writing 512x as many pages as it asked for.

Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page-writeback.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
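
For illustration, a standalone userspace sketch (not kernel code; the names here are invented for the example, and 512 pages corresponds to the 2MB-folio-on-4KB-pages case behind the "512x" figure above) of how decrementing the budget by one per folio overshoots:

#include <stdio.h>

#define FOLIO_PAGES 512	/* e.g. a 2MB folio with 4KB pages */

/* Simulate the writeback budget loop when every folio is large. */
static long pages_written(long nr_to_write, long per_folio_decrement)
{
	long written = 0;

	while (nr_to_write > 0) {
		written += FOLIO_PAGES;			/* one whole folio written back */
		nr_to_write -= per_folio_decrement;	/* the accounting being compared */
	}
	return written;
}

int main(void)
{
	long budget = 1024;	/* page budget a caller such as writeback_sb_inodes() might set */

	printf("budget %ld pages: decrement by 1 -> %ld pages, decrement by folio size -> %ld pages\n",
	       budget, pages_written(budget, 1), pages_written(budget, FOLIO_PAGES));
	return 0;
}

With a 1024-page budget, decrementing by 1 per folio writes 524288 pages; decrementing by the folio size writes 1024.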
  

Comments

Dave Chinner June 28, 2023, 9:53 p.m. UTC | #1
On Wed, Jun 28, 2023 at 07:55:48PM +0100, Matthew Wilcox (Oracle) wrote:
> nr_to_write is a count of pages, so we need to decrease it by the number
> of pages in the folio we just wrote, not by 1.  Most callers specify
> either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> might end up writing 512x as many pages as it asked for.
> 
> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/page-writeback.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 1d17fb1ec863..d3f42009bb70 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping,
>  
>  		for (i = 0; i < nr_folios; i++) {
>  			struct folio *folio = fbatch.folios[i];
> +			unsigned long nr;
>  
>  			done_index = folio->index;
>  
> @@ -2471,6 +2472,7 @@ int write_cache_pages(struct address_space *mapping,
>  
>  			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
>  			error = writepage(folio, wbc, data);
> +			nr = folio_nr_pages(folio);

This should really be done before writepage() is called, right? By
the time the writepage() returns, the folio can be unlocked, the
entire write completed and the folio partially invalidated which may
try to split the folio...

Even if this can't happen (folio refcount is elevated, right?), it
makes much more sense to me to sample the size of the folio while it
is obviously locked and not going to change...
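
For reference, a sketch of the ordering being described here, using the same names as the hunk quoted above (illustrative only, not a tested change):

			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
			nr = folio_nr_pages(folio);	/* sample while the folio is certainly still locked */
			error = writepage(folio, wbc, data);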

Other than that, change looks good.

-Dave.
  
Matthew Wilcox June 29, 2023, 12:01 a.m. UTC | #2
On Thu, Jun 29, 2023 at 07:53:44AM +1000, Dave Chinner wrote:
> On Wed, Jun 28, 2023 at 07:55:48PM +0100, Matthew Wilcox (Oracle) wrote:
> > nr_to_write is a count of pages, so we need to decrease it by the number
> > of pages in the folio we just wrote, not by 1.  Most callers specify
> > either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> > might end up writing 512x as many pages as it asked for.
> > 
> > Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  mm/page-writeback.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 1d17fb1ec863..d3f42009bb70 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping,
> >  
> >  		for (i = 0; i < nr_folios; i++) {
> >  			struct folio *folio = fbatch.folios[i];
> > +			unsigned long nr;
> >  
> >  			done_index = folio->index;
> >  
> > @@ -2471,6 +2472,7 @@ int write_cache_pages(struct address_space *mapping,
> >  
> >  			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> >  			error = writepage(folio, wbc, data);
> > +			nr = folio_nr_pages(folio);
> 
> This should really be done before writepage() is called, right? By
> the time the writepage() returns, the folio can be unlocked, the
> entire write completed and the folio partially invalidated which may
> try to split the folio...
> 
> Even if this can't happen (folio refcount is elevated, right?), it
> makes much more sense to me to sample the size of the folio while it
> is obviously locked and not going to change...

It can't change because of the refcount we hold (that's put in the call
to folio_batch_release()).  I didn't want to call it before the call to
writepage() because that makes the compiler spill it to the stack instead
of keeping it in a register.  Also, when we transform this into an iterator (see
patches posted yesterday), we'd have to stash it away in the iterator.
  
Dave Chinner June 29, 2023, 1:02 a.m. UTC | #3
On Thu, Jun 29, 2023 at 01:01:59AM +0100, Matthew Wilcox wrote:
> On Thu, Jun 29, 2023 at 07:53:44AM +1000, Dave Chinner wrote:
> > On Wed, Jun 28, 2023 at 07:55:48PM +0100, Matthew Wilcox (Oracle) wrote:
> > > nr_to_write is a count of pages, so we need to decrease it by the number
> > > of pages in the folio we just wrote, not by 1.  Most callers specify
> > > either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> > > might end up writing 512x as many pages as it asked for.
> > > 
> > > Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > ---
> > >  mm/page-writeback.c | 8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index 1d17fb1ec863..d3f42009bb70 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping,
> > >  
> > >  		for (i = 0; i < nr_folios; i++) {
> > >  			struct folio *folio = fbatch.folios[i];
> > > +			unsigned long nr;
> > >  
> > >  			done_index = folio->index;
> > >  
> > > @@ -2471,6 +2472,7 @@ int write_cache_pages(struct address_space *mapping,
> > >  
> > >  			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> > >  			error = writepage(folio, wbc, data);
> > > +			nr = folio_nr_pages(folio);
> > 
> > This should really be done before writepage() is called, right? By
> > the time the writepage() returns, the folio can be unlocked, the
> > entire write completed and the folio partially invalidated which may
> > try to split the folio...
> > 
> > Even if this can't happen (folio refcount is elevated, right?), it
> > makes much more sense to me to sample the size of the folio while it
> > is obviously locked and not going to change...
> 
> It can't change because of the refcount we hold (that's put in the call
> to folio_batch_release()).  I didn't want to call it before the call to
> writepage() because that makes the compiler spill it to the stack instead
> of keeping it in a register.

I don't care for micro-optimisations when the result is code
that looks dodgy and suspect and requires lots of additional
thinking to determine that it is safe.

> Also, when we transform this into an iterator (see
> patches posted yesterday), we'd have to stash it away in the iterator.

That's no big deal, either.

-Dave.
  
Andrew Morton July 2, 2023, 8:06 p.m. UTC | #4
On Wed, 28 Jun 2023 19:55:48 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> nr_to_write is a count of pages, so we need to decrease it by the number
> of pages in the folio we just wrote, not by 1.  Most callers specify
> either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> might end up writing 512x as many pages as it asked for.

512 is a big number.  Should we backport this?

> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")

I'm not seeing how a readahead change messed up writeback accounting?
  
Matthew Wilcox July 3, 2023, 2:13 a.m. UTC | #5
On Sun, Jul 02, 2023 at 01:06:15PM -0700, Andrew Morton wrote:
> On Wed, 28 Jun 2023 19:55:48 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
> 
> > nr_to_write is a count of pages, so we need to decrease it by the number
> > of pages in the folio we just wrote, not by 1.  Most callers specify
> > either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> > might end up writing 512x as many pages as it asked for.
> 
> 512 is a big number.  Should we backport this?

I'm really not sure.  Maybe?  I'm hoping one of the bots comes up with a
meaningful performance change as a result of this patch and we find out.

> > Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> 
> I'm not seeing how a readahead change messed up writeback accounting?

That was the first patch which allowed large folios to be added to the
page cache.  Until that point, this was latent.  We could probably argue
for one of a dozen other commits around the same time.
  
Dave Chinner July 3, 2023, 9:37 p.m. UTC | #6
On Mon, Jul 03, 2023 at 03:13:15AM +0100, Matthew Wilcox wrote:
> On Sun, Jul 02, 2023 at 01:06:15PM -0700, Andrew Morton wrote:
> > On Wed, 28 Jun 2023 19:55:48 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
> > 
> > > nr_to_write is a count of pages, so we need to decrease it by the number
> > > of pages in the folio we just wrote, not by 1.  Most callers specify
> > > either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> > > might end up writing 512x as many pages as it asked for.
> > 
> > 512 is a big number.  Should we backport this?
> 
> I'm really not sure.  Maybe?  I'm hoping one of the bots comes up with a
> meaningful performance change as a result of this patch and we find out.

XFS is the only filesystem this would affect, right? AFAIA, nothing
else enables large folios and uses writeback through
write_cache_pages() at this point...

In which case, I'd be surprised if much difference, if any, gets
noticed by anyone.

-Dave.
  
Matthew Wilcox July 4, 2023, 6:03 p.m. UTC | #7
On Tue, Jul 04, 2023 at 07:37:17AM +1000, Dave Chinner wrote:
> On Mon, Jul 03, 2023 at 03:13:15AM +0100, Matthew Wilcox wrote:
> > On Sun, Jul 02, 2023 at 01:06:15PM -0700, Andrew Morton wrote:
> > > On Wed, 28 Jun 2023 19:55:48 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
> > > 
> > > > nr_to_write is a count of pages, so we need to decrease it by the number
> > > > of pages in the folio we just wrote, not by 1.  Most callers specify
> > > > either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> > > > might end up writing 512x as many pages as it asked for.
> > > 
> > > 512 is a big number.  Should we backport this?
> > 
> > I'm really not sure.  Maybe?  I'm hoping one of the bots comes up with a
> > meaningful performance change as a result of this patch and we find out.
> 
> XFS is the only filesystem this would affect, right? AFAIA, nothing
> else enables large folios and uses writeback through
> write_cache_pages() at this point...

Good point.  Still, Intel's 0day has squawked about a loss of performance
when large folios have _stopped_ being used, so they are at least testing
with XFS.
  
Christoph Hellwig July 6, 2023, 2:21 p.m. UTC | #8
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
  

Patch

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1d17fb1ec863..d3f42009bb70 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2434,6 +2434,7 @@  int write_cache_pages(struct address_space *mapping,
 
 		for (i = 0; i < nr_folios; i++) {
 			struct folio *folio = fbatch.folios[i];
+			unsigned long nr;
 
 			done_index = folio->index;
 
@@ -2471,6 +2472,7 @@  int write_cache_pages(struct address_space *mapping,
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
 			error = writepage(folio, wbc, data);
+			nr = folio_nr_pages(folio);
 			if (unlikely(error)) {
 				/*
 				 * Handle errors according to the type of
@@ -2489,8 +2491,7 @@  int write_cache_pages(struct address_space *mapping,
 					error = 0;
 				} else if (wbc->sync_mode != WB_SYNC_ALL) {
 					ret = error;
-					done_index = folio->index +
-						folio_nr_pages(folio);
+					done_index = folio->index + nr;
 					done = 1;
 					break;
 				}
@@ -2504,7 +2505,8 @@  int write_cache_pages(struct address_space *mapping,
 			 * keep going until we have written all the pages
 			 * we tagged for writeback prior to entering this loop.
 			 */
-			if (--wbc->nr_to_write <= 0 &&
+			wbc->nr_to_write -= nr;
+			if (wbc->nr_to_write <= 0 &&
 			    wbc->sync_mode == WB_SYNC_NONE) {
 				done = 1;
 				break;