Message ID | 20230628185548.981888-1-willy@infradead.org |
---|---|
State | New |
Headers | From: "Matthew Wilcox (Oracle)" <willy@infradead.org>; To: Andrew Morton <akpm@linux-foundation.org>, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>; Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>; Subject: [PATCH] writeback: Account the number of pages written back; Date: Wed, 28 Jun 2023 19:55:48 +0100; Message-Id: <20230628185548.981888-1-willy@infradead.org> |
Series | writeback: Account the number of pages written back |
Commit Message
Matthew Wilcox
June 28, 2023, 6:55 p.m. UTC
nr_to_write is a count of pages, so we need to decrease it by the number
of pages in the folio we just wrote, not by 1. Most callers specify
either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
might end up writing 512x as many pages as it asked for.
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/page-writeback.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
Comments
On Wed, Jun 28, 2023 at 07:55:48PM +0100, Matthew Wilcox (Oracle) wrote:
> nr_to_write is a count of pages, so we need to decrease it by the number
> of pages in the folio we just wrote, not by 1. Most callers specify
> either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> might end up writing 512x as many pages as it asked for.
>
> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/page-writeback.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 1d17fb1ec863..d3f42009bb70 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping,
>
>  	for (i = 0; i < nr_folios; i++) {
>  		struct folio *folio = fbatch.folios[i];
> +		unsigned long nr;
>
>  		done_index = folio->index;
>
> @@ -2471,6 +2472,7 @@ int write_cache_pages(struct address_space *mapping,
>
>  			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
>  			error = writepage(folio, wbc, data);
> +			nr = folio_nr_pages(folio);

This should really be done before writepage() is called, right? By
the time the writepage() returns, the folio can be unlocked, the
entire write completed and the folio partially invalidated which may
try to split the folio...

Even if this can't happen (folio refcount is elevated, right?), it
makes much more sense to me to sample the size of the folio while it
is obviously locked and not going to change...

Other than that, change looks good.

-Dave.
On Thu, Jun 29, 2023 at 07:53:44AM +1000, Dave Chinner wrote:
> On Wed, Jun 28, 2023 at 07:55:48PM +0100, Matthew Wilcox (Oracle) wrote:
> > [...]
> > 			error = writepage(folio, wbc, data);
> > +			nr = folio_nr_pages(folio);
>
> This should really be done before writepage() is called, right? By
> the time the writepage() returns, the folio can be unlocked, the
> entire write completed and the folio partially invalidated which may
> try to split the folio...
>
> Even if this can't happen (folio refcount is elevated, right?), it
> makes much more sense to me to sample the size of the folio while it
> is obviously locked and not going to change...

It can't change because of the refcount we hold (that's put in the call
to folio_batch_release()). I didn't want to call it before the call to
writepage() because that makes the compiler stick it on the stack instead
of a local variable. Also, when we transform this into an iterator (see
patches posted yesterday), we'd have to stash it away in the iterator.
On Thu, Jun 29, 2023 at 01:01:59AM +0100, Matthew Wilcox wrote:
> On Thu, Jun 29, 2023 at 07:53:44AM +1000, Dave Chinner wrote:
> > [...]
> > Even if this can't happen (folio refcount is elevated, right?), it
> > makes much more sense to me to sample the size of the folio while it
> > is obviously locked and not going to change...
>
> It can't change because of the refcount we hold (that's put in the call
> to folio_batch_release()). I didn't want to call it before the call to
> writepage() because that makes the compiler stick it on the stack instead
> of a local variable.

I don't care for micro-optimisations when the result is code that looks
dodgy and suspect and requires lots of additional thinking about to
determine that it is safe.

> Also, when we transform this into an iterator (see
> patches posted yesterday), we'd have to stash it away in the iterator.

That's no big deal, either.

-Dave.
On Wed, 28 Jun 2023 19:55:48 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> nr_to_write is a count of pages, so we need to decrease it by the number
> of pages in the folio we just wrote, not by 1. Most callers specify
> either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes()
> might end up writing 512x as many pages as it asked for.

512 is a big number, Should we backport this?

> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")

I'm not seeing how a readahead change messed up writeback accounting?
On Sun, Jul 02, 2023 at 01:06:15PM -0700, Andrew Morton wrote:
> On Wed, 28 Jun 2023 19:55:48 +0100 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
> > [...] but writeback_sb_inodes()
> > might end up writing 512x as many pages as it asked for.
>
> 512 is a big number, Should we backport this?

I'm really not sure. Maybe? I'm hoping one of the bots comes up with a
meaningful performance change as a result of this patch and we find out.

> > Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>
> I'm not seeing how a readahead change messed up writeback accounting?

That was the first patch which allowed large folios to be added to the
page cache. Until that point, this was latent. We could probably argue
for one of a dozen other commits around the same time.
On Mon, Jul 03, 2023 at 03:13:15AM +0100, Matthew Wilcox wrote:
> On Sun, Jul 02, 2023 at 01:06:15PM -0700, Andrew Morton wrote:
> > 512 is a big number, Should we backport this?
>
> I'm really not sure. Maybe? I'm hoping one of the bots comes up with a
> meaningful performance change as a result of this patch and we find out.

XFS is the only filesystem this would affect, right? AFAIA, nothing
else enables large folios and uses writeback through
write_cache_pages() at this point...

In which case, I'd be surprised if much difference, if any, gets
noticed by anyone.

-Dave.
On Tue, Jul 04, 2023 at 07:37:17AM +1000, Dave Chinner wrote:
> XFS is the only filesystem this would affect, right? AFAIA, nothing
> else enables large folios and uses writeback through
> write_cache_pages() at this point...

Good point. Still, Intel's 0day has squawked about a loss of performance
when large folios have _stopped_ being used, so they are at least
testing with XFS.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1d17fb1ec863..d3f42009bb70 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping,
 
 	for (i = 0; i < nr_folios; i++) {
 		struct folio *folio = fbatch.folios[i];
+		unsigned long nr;
 
 		done_index = folio->index;
 
@@ -2471,6 +2472,7 @@ int write_cache_pages(struct address_space *mapping,
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
 			error = writepage(folio, wbc, data);
+			nr = folio_nr_pages(folio);
 			if (unlikely(error)) {
 				/*
 				 * Handle errors according to the type of
@@ -2489,8 +2491,7 @@ int write_cache_pages(struct address_space *mapping,
 					error = 0;
 				} else if (wbc->sync_mode != WB_SYNC_ALL) {
 					ret = error;
-					done_index = folio->index +
-						folio_nr_pages(folio);
+					done_index = folio->index + nr;
 					done = 1;
 					break;
 				}
@@ -2504,7 +2505,8 @@ int write_cache_pages(struct address_space *mapping,
 			 * keep going until we have written all the pages
 			 * we tagged for writeback prior to entering this loop.
 			 */
-			if (--wbc->nr_to_write <= 0 &&
+			wbc->nr_to_write -= nr;
+			if (wbc->nr_to_write <= 0 &&
 			    wbc->sync_mode == WB_SYNC_NONE) {
 				done = 1;
 				break;