Message ID: 20230814144100.596749-1-willy@infradead.org
State: New
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Hugh Dickins <hughd@google.com>, linux-mm@kvack.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH] block: Remove special-casing of compound pages
Date: Mon, 14 Aug 2023 15:41:00 +0100
Message-Id: <20230814144100.596749-1-willy@infradead.org>
Series: block: Remove special-casing of compound pages
Commit Message
Matthew Wilcox
Aug. 14, 2023, 2:41 p.m. UTC
The special casing was originally added in pre-git history; reproducing the
commit log here:

> commit a318a92567d77
> Author: Andrew Morton <akpm@osdl.org>
> Date:   Sun Sep 21 01:42:22 2003 -0700
>
>     [PATCH] Speed up direct-io hugetlbpage handling
>
>     This patch short-circuits all the direct-io page dirtying logic for
>     higher-order pages.  Without this, we pointlessly bounce BIOs up to
>     keventd all the time.

In the last twenty years, compound pages have become used for more than
just hugetlb.  Rewrite these functions to operate on folios instead of
pages and remove the special case for hugetlbfs; I don't think it's needed
any more (and if it is, we can put it back in as a call to
folio_test_hugetlb()).

This was found by inspection; as far as I can tell, this bug can lead to
pages used as the destination of a direct I/O read not being marked as
dirty.  If those pages are then reclaimed by the MM without being dirtied
for some other reason, they won't be written out.  Then when they're
faulted back in, they will not contain the data they should.  It'll take
a pretty unusual setup to produce this problem with several races all
going the wrong way.

This problem predates the folio work; it could for example have been
triggered by mmaping a THP in tmpfs and using that as the target of an
O_DIRECT read.

Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 block/bio.c | 46 ++++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 22 deletions(-)
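The failure mode above hinges on a direct I/O read DMAing straight into user memory, which is why the block layer must redirty the destination pages on completion. As a hedged illustration of the userspace half of that pattern (not a reproducer for the race itself, which needs THP-backed memory and reclaim pressure; `dio_roundtrip` is an invented helper name), an O_DIRECT read with the required buffer alignment might look like:

```c
/* Sketch of the userspace side of the scenario discussed above: an
 * O_DIRECT read bypasses the page cache and lands directly in the
 * caller's buffer.  Returns 0 on success, or 0 if the filesystem
 * simply rejects O_DIRECT (e.g. tmpfs); -1 on real failure. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int dio_roundtrip(void)
{
	const size_t blk = 4096;	/* typical logical block size */
	char path[] = "/tmp/dio-XXXXXX";
	char *out = NULL, *in = NULL;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	/* O_DIRECT requires block-aligned buffers, lengths and offsets. */
	if (posix_memalign((void **)&out, blk, blk) ||
	    posix_memalign((void **)&in, blk, blk))
		goto out_close;
	memset(out, 'x', blk);
	if (write(fd, out, blk) != (ssize_t)blk)
		goto out_close;
	close(fd);

	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		ret = 0;	/* fs doesn't support O_DIRECT: skip */
		goto out_unlink;
	}
	/* The read DMAs into 'in' without touching the page cache; if
	 * 'in' were a mapped THP, this is the write the bug could lose. */
	if (read(fd, in, blk) == (ssize_t)blk && memcmp(in, out, blk) == 0)
		ret = 0;
out_close:
	close(fd);
out_unlink:
	unlink(path);
	free(out);
	free(in);
	return ret;
}
```

Note that misaligned buffers or lengths make the O_DIRECT read fail with EINVAL on most filesystems, which is why the sketch allocates with posix_memalign rather than malloc.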
Comments
On Wed, Aug 16, 2023 at 01:27:17PM -0700, Hugh Dickins wrote:
> > This problem predates the folio work; it could for example have been
> > triggered by mmaping a THP in tmpfs and using that as the target of an
> > O_DIRECT read.
> >
> > Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
>
> No.  It's a good catch, but bug looks specific to the folio work to me.
>
> Almost all shmem pages are dirty from birth, even as soon as they are
> brought back from swap; so it is not necessary to re-mark them dirty.
>
> The exceptions are pages allocated to holes when faulted: so you did
> get me worried as to whether khugepaged could collapse a pmd-ful of
> those into a THP without marking the result as dirty.
>
> But no, in v6.5-rc6 the collapse_file() success path has
> 		if (is_shmem)
> 			folio_mark_dirty(folio);
> and in v5.10 the same appears as
> 		if (is_shmem)
> 			set_page_dirty(new_page);
>
> (IIRC, that or marking pmd dirty was missed from early shmem THP
> support, but fairly soon corrected, and backported to stable then.
> I have a faint memory of versions which assembled pmd_dirty from
> collected pte_dirtys.)
>
> And the !is_shmem case is for CONFIG_READ_ONLY_THP_FOR_FS: writing
> into those pages, by direct IO or whatever, is already prohibited.
>
> It's dem dirty (or not dirty) folios dat's the trouble!

Thanks for the correction!  Could it happen with anon THP?
They're not kept dirty from birth ... are they?
On Fri, 15 Sep 2023, Matthew Wilcox wrote:
> On Wed, Aug 16, 2023 at 01:27:17PM -0700, Hugh Dickins wrote:
> > [...]
> > It's dem dirty (or not dirty) folios dat's the trouble!
>
> Thanks for the correction!  Could it happen with anon THP?
> They're not kept dirty from birth ... are they?

Anon pages, THP or other, are not marked dirty from birth, right.  But nor
are they considered for freeing without writeout: shrink_folio_list() does
add_to_swap() on them without considering dirtiness, and add_to_swap()
does an unconditional folio_mark_dirty().

Well, not quite unconditional: it is conditional on allocating swap, but
shrink_folio_list() just reactivates when swap is not allocated.  So, I
see no opportunity for data loss there.

When it's read back from swap later on, the folio will be left clean while
it matches swap: I haven't bothered to recheck the latest details of what
happens when it's CoWed, or the swap is deleted; those details won't
matter given the behavior above.

But might there be a direct IO problem while that anon folio (large or
small) remains clean in swapcache, when reclaim's pageout() might be
liable to free it without rewriting?  There ought not to be:
get_user_pages()/follow_page_pte() have taken care of that for many years
with the FOLL_TOUCH+FOLL_WRITE

	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty(page);

and follow_trans_huge_pmd() dirties the pmd when FOLL_TOUCH+FOLL_WRITE.
I forget why follow_page_pte() prefers to dirty page rather than pte, but
either approach should be good enough to avoid the data loss.

However, I don't see equivalent FOLL_TOUCH+FOLL_WRITE code where
get_user_pages_fast() goes its fast route; nor pin_user_pages_fast() used
by lib/iov_iter.c; and it looks odd that pin_user_pages_remote() and
pin_user_pages_unlocked() use FOLL_TOUCH but pin_user_pages() does not.

Problem?  Not particularly for anon or for large, but for any?  Or not a
problem because of the final set_page_dirty() or folio_mark_dirty() on
release - only a problem while that PageCompound check remains?

(Of course filesystems hate behind-the-back dirtying for other reasons,
that they may lose the data without proper notice: separate discussion
we'd better not get back into here.)

I've spent much longer trying to answer this than I could afford; expect
no more from me, back to you and GUP+PIN experts.

Hugh
On Mon, 14 Aug 2023 15:41:00 +0100, Matthew Wilcox (Oracle) wrote:
> The special casing was originally added in pre-git history; reproducing
> the commit log here:
>
> > commit a318a92567d77
> > Author: Andrew Morton <akpm@osdl.org>
> > Date: Sun Sep 21 01:42:22 2003 -0700
> >
> > [PATCH] Speed up direct-io hugetlbpage handling
> >
> > This patch short-circuits all the direct-io page dirtying logic for
> > higher-order pages. Without this, we pointlessly bounce BIOs up to
> > keventd all the time.
>
> [...]

Applied, thanks!

[1/1] block: Remove special-casing of compound pages
      commit: 1b151e2435fc3a9b10c8946c6aebe9f3e1938c55

Best regards,
On Mon, Aug 14, 2023 at 03:41:00PM +0100, Matthew Wilcox (Oracle) wrote:
>  void __bio_release_pages(struct bio *bio, bool mark_dirty)
>  {
> -	struct bvec_iter_all iter_all;
> -	struct bio_vec *bvec;
> +	struct folio_iter fi;
> +
> +	bio_for_each_folio_all(fi, bio) {
> +		struct page *page;
> +		size_t done = 0;
>
> -	bio_for_each_segment_all(bvec, bio, iter_all) {
> -		if (mark_dirty && !PageCompound(bvec->bv_page))
> -			set_page_dirty_lock(bvec->bv_page);
> -		bio_release_page(bio, bvec->bv_page);
> +		if (mark_dirty) {
> +			folio_lock(fi.folio);
> +			folio_mark_dirty(fi.folio);
> +			folio_unlock(fi.folio);
> +		}
> +		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
> +		do {
> +			bio_release_page(bio, page++);
> +			done += PAGE_SIZE;
> +		} while (done < fi.length);
> 	}
>  }

Is it okay to release same-folio pages while creating the bio instead of
releasing all the pages at the completion?  If so, the completion could
provide just the final bio_release_page() instead of looping.  I'm more
confirming if that's an appropriate way to use folios here.
On Thu, Dec 07, 2023 at 03:10:13PM -0700, Keith Busch wrote:
> On Mon, Aug 14, 2023 at 03:41:00PM +0100, Matthew Wilcox (Oracle) wrote:
> > void __bio_release_pages(struct bio *bio, bool mark_dirty)
> > [...]
>
> Is it okay to release same-folio pages while creating the bio instead of
> releasing all the pages at the completion? If so, the completion could
> provide just the final bio_release_page() instead of looping. I'm more
> confirming if that's an appropriate way to use folios here.

For this patch, I'm just replicating the existing behaviour.  We can
probably do much better.  Honestly, the whole thing is kind of grotesque
and needs to be reformed ... but I think that's part of the physr project.
On Thu, Dec 07, 2023 at 02:04:26PM -0700, Jens Axboe wrote:
> On Mon, 14 Aug 2023 15:41:00 +0100, Matthew Wilcox (Oracle) wrote:
> > The special casing was originally added in pre-git history; reproducing
> > the commit log here:
> > [...]
>
> Applied, thanks!
>
> [1/1] block: Remove special-casing of compound pages
>       commit: 1b151e2435fc3a9b10c8946c6aebe9f3e1938c55

This commit results in a change of behavior for QEMU VMs backed by
hugepages that open their VM disk image file with O_DIRECT (QEMU
cache=none or cache.direct=on options).  When the VM shuts down and the
QEMU process exits, one or two hugepages may fail to free correctly.  It
appears to be a race, as it doesn't happen every time.

From debugging on 6.8-rc6, when it occurs, the hugepage that fails to
free has a non-zero refcount when it hits the folio_put_testzero(folio)
test in release_pages().  On a failing test iteration with 1 GiB
hugepages, the failing folio had a mapcount of 0, refcount of 35, and
folio_maybe_dma_pinned was true.

The problem only occurs when the VM disk image file is opened with
O_DIRECT.  When using QEMU cache=writeback or cache.direct=off options,
it does not occur.  We first noticed it on the 6.1.y stable kernel when
this commit landed there (6.1.75).

A very simple reproducer without KVM (just boot the VM up, then shut it
down):

  echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

  qemu-system-x86_64 \
    -cpu qemu64 \
    -m 1024 \
    -nographic \
    -mem-path /dev/hugepages/vm00 \
    -mem-prealloc \
    -drive file=test.qcow2,if=none,cache=none,id=drive0 \
    -device virtio-blk-pci,drive=drive0,id=disk0,bootindex=1

  rm -f /dev/hugepages/vm00

Some testing notes:

* occurs with 6.1.75, 6.6.14, 6.8-rc6, and linux-next-20240229
* occurs with 1 GiB and 2 MiB huge pages, with both hugetlbfs and memfd
* occurs with QEMU 8.0.y, 8.1.y, 8.2.y, and master
* occurs with (-enable-kvm -cpu host) or without (-cpu qemu64) KVM

Thanks for your time!

Greg
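The leak in Greg's reproducer is observable from /proc/meminfo: after the VM exits and the backing file is removed, HugePages_Free should climb back to HugePages_Total, and a leaked page shows up as a persistent gap between the two. A hedged helper for checking that (the function name is invented; assumes a Linux /proc/meminfo) might look like:

```c
#include <stdio.h>
#include <assert.h>

/* Returns HugePages_Total - HugePages_Free from /proc/meminfo, i.e. the
 * number of hugepages still outstanding; 0 means the pool fully drained
 * (no leak), -1 if the fields could not be parsed. */
long hugepages_outstanding(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	long total = -1, avail = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* Lines that don't match leave the variables untouched. */
		sscanf(line, "HugePages_Total: %ld", &total);
		sscanf(line, "HugePages_Free: %ld", &avail);
	}
	fclose(f);
	if (total < 0 || avail < 0)
		return -1;
	return total - avail;
}
```

Polling this after the `rm -f` step should report a non-zero count on affected kernels once the race is hit.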
On Thu, Feb 29, 2024 at 11:25:13AM -0700, Greg Edwards wrote:
> > [1/1] block: Remove special-casing of compound pages
> >       commit: 1b151e2435fc3a9b10c8946c6aebe9f3e1938c55
>
> This commit results in a change of behavior for QEMU VMs backed by
> hugepages that open their VM disk image file with O_DIRECT (QEMU
> cache=none or cache.direct=on options). When the VM shuts down and the
> QEMU process exits, one or two hugepages may fail to free correctly. It
> appears to be a race, as it doesn't happen every time.

Hi Greg,

By sheer coincidence the very next email after this one was:

https://lore.kernel.org/linux-mm/86e592a9-98d4-4cff-a646-0c0084328356@cybernetics.com/T/#u

Can you try Tony's patch and see if it fixes your problem?  I haven't
even begun to analyse either your email or his patch, but there's a
strong likelihood that they're the same thing.
On Thu, Feb 29, 2024 at 07:37:11PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 29, 2024 at 11:25:13AM -0700, Greg Edwards wrote:
> > > [1/1] block: Remove special-casing of compound pages
> > > commit: 1b151e2435fc3a9b10c8946c6aebe9f3e1938c55
> >
> > This commit results in a change of behavior for QEMU VMs backed by
> > hugepages that open their VM disk image file with O_DIRECT (QEMU
> > cache=none or cache.direct=on options). When the VM shuts down and
> > the QEMU process exits, one or two hugepages may fail to free
> > correctly. It appears to be a race, as it doesn't happen every time.
>
> By sheer coincidence the very next email after this one was:
>
> https://lore.kernel.org/linux-mm/86e592a9-98d4-4cff-a646-0c0084328356@cybernetics.com/T/#u
>
> Can you try Tony's patch and see if it fixes your problem?
> I haven't even begun to analyse either your email or his patch,
> but there's a strong likelihood that they're the same thing.

This does appear to fix it.  Thank you!  I'll do some more testing on it
today, then add a Tested-by: tag if it holds up.

Greg
diff --git a/block/bio.c b/block/bio.c
index 8672179213b9..f46d8ec71fbd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1171,13 +1171,22 @@ EXPORT_SYMBOL(bio_add_folio);
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	struct bvec_iter_all iter_all;
-	struct bio_vec *bvec;
+	struct folio_iter fi;
+
+	bio_for_each_folio_all(fi, bio) {
+		struct page *page;
+		size_t done = 0;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (mark_dirty && !PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
-		bio_release_page(bio, bvec->bv_page);
+		if (mark_dirty) {
+			folio_lock(fi.folio);
+			folio_mark_dirty(fi.folio);
+			folio_unlock(fi.folio);
+		}
+		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
+		do {
+			bio_release_page(bio, page++);
+			done += PAGE_SIZE;
+		} while (done < fi.length);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1455,18 +1464,12 @@ EXPORT_SYMBOL(bio_free_pages);
  * bio_set_pages_dirty() and bio_check_pages_dirty() are support functions
  * for performing direct-IO in BIOs.
  *
- * The problem is that we cannot run set_page_dirty() from interrupt context
+ * The problem is that we cannot run folio_mark_dirty() from interrupt context
  * because the required locks are not interrupt-safe.  So what we can do is to
  * mark the pages dirty _before_ performing IO.  And in interrupt context,
  * check that the pages are still dirty.  If so, fine.  If not, redirty them
  * in process context.
  *
- * We special-case compound pages here: normally this means reads into hugetlb
- * pages.  The logic in here doesn't really work right for compound pages
- * because the VM does not uniformly chase down the head page in all cases.
- * But dirtiness of compound pages is pretty meaningless anyway: the VM doesn't
- * handle them at all.  So we skip compound pages here at an early stage.
- *
  * Note that this code is very hard to test under normal circumstances because
  * direct-io pins the pages with get_user_pages().  This makes
  * is_page_cache_freeable return false, and the VM will not clean the pages.
@@ -1482,12 +1485,12 @@ EXPORT_SYMBOL(bio_free_pages);
  */
 void bio_set_pages_dirty(struct bio *bio)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct folio_iter fi;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
+	bio_for_each_folio_all(fi, bio) {
+		folio_lock(fi.folio);
+		folio_mark_dirty(fi.folio);
+		folio_unlock(fi.folio);
 	}
 }
 
@@ -1530,12 +1533,11 @@ static void bio_dirty_fn(struct work_struct *work)
 void bio_check_pages_dirty(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct folio_iter fi;
 	unsigned long flags;
-	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
+	bio_for_each_folio_all(fi, bio) {
+		if (!folio_test_dirty(fi.folio))
 			goto defer;
 	}
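The behavioural change in bio_set_pages_dirty() can be modelled in a few lines of plain C. This is a toy model, not kernel code: `toy_folio` and both function names are invented for illustration, with `nr_pages > 1` standing in for a compound page / large folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	int nr_pages;	/* > 1 models a compound page / large folio */
	bool dirty;
};

/* Old logic: walk pages and skip anything compound.  A large folio used
 * as the destination of a direct I/O read is therefore never redirtied -
 * the bug the patch fixes. */
void old_set_pages_dirty(struct toy_folio **folios, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (folios[i]->nr_pages == 1)	/* !PageCompound() */
			folios[i]->dirty = true;
}

/* New logic: walk folios and dirty each one exactly once, whatever its
 * order - the folio_mark_dirty() path in the patch. */
void new_set_pages_dirty(struct toy_folio **folios, size_t n)
{
	for (size_t i = 0; i < n; i++)
		folios[i]->dirty = true;
}
```

Running both over a bio containing one small page and one 512-page THP shows the old loop leaving the THP clean while the new one dirties both, which is exactly the divergence described in the commit message.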