Message ID | 20240128142522.1524741-1-ming.lei@redhat.com |
---|---|
State | New |
Headers |
From: Ming Lei <ming.lei@redhat.com> To: Andrew Morton <akpm@linux-foundation.org>, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Ming Lei <ming.lei@redhat.com>, Mike Snitzer <snitzer@kernel.org>, Don Dutile <ddutile@redhat.com>, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Subject: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range Date: Sun, 28 Jan 2024 22:25:22 +0800 Message-ID: <20240128142522.1524741-1-ming.lei@redhat.com> |
Series | [RFC] mm/readahead: readahead aggressively if read drops in willneed range |
Commit Message
Ming Lei
Jan. 28, 2024, 2:25 p.m. UTC
Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
only tries to read ahead 512 pages, and the remaining part of the advised
range falls back on normal readahead.
If bdi->ra_pages is set small, readahead will not perform efficiently
enough. Increasing readahead may not be an option since the workload may
have mixed random and sequential I/O.
Improve this situation by maintaining one maple tree of willneed ranges:
if a read falls in any willneed range, read ahead aggressively, just like
we did before commit 6d2be915e589.
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
fs/file_table.c | 10 ++++++++++
include/linux/fs.h | 15 +++++++++++++++
mm/filemap.c | 5 ++++-
mm/internal.h | 7 ++++++-
mm/readahead.c | 32 +++++++++++++++++++++++++++++++-
5 files changed, 66 insertions(+), 3 deletions(-)
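To make the approach above concrete, here is a minimal sketch of how advised ranges could be recorded in a maple tree and queried from the readahead path. It is an illustration rather than the patch body: the helpers ra_mark_willneed()/ra_in_willneed() are hypothetical names, while mt_init(), mtree_store_range() and mt_find() are the existing maple tree API.
```
/*
 * Minimal sketch (not the actual patch): remember WILLNEED ranges in a
 * maple tree keyed by page index, and let readahead ask whether the
 * current index falls inside one of them.  Helper names are hypothetical.
 */
#include <linux/maple_tree.h>
#include <linux/pagemap.h>
#include <linux/xarray.h>

/* Record an advised range [start, end] of page indexes as "willneed". */
static int ra_mark_willneed(struct maple_tree *mt, pgoff_t start, pgoff_t end)
{
	/* any non-NULL entry works; only the presence of the range matters */
	return mtree_store_range(mt, start, end, xa_mk_value(1), GFP_KERNEL);
}

/* True if @index lies inside a previously advised willneed range. */
static bool ra_in_willneed(struct maple_tree *mt, pgoff_t index)
{
	unsigned long idx = index;

	/* an entry spanning idx is found when searching from idx up to idx */
	return mt_find(mt, &idx, index) != NULL;
}
```
A readahead path that sees ra_in_willneed() return true could then size its request from bdi->io_pages instead of bdi->ra_pages, which is what "readahead aggressively" means here.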
Comments
On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > only tries to readahead 512 pages, and the remained part in the advised > range fallback on normal readahead. Does the MAINTAINERS file mean nothing any more? > If bdi->ra_pages is set as small, readahead will perform not efficient > enough. Increasing read ahead may not be an option since workload may > have mixed random and sequential I/O. I think there needs to be a lot more explanation than this about what's going on before we jump to "And therefore this patch is the right answer". > @@ -972,6 +974,7 @@ struct file_ra_state { > unsigned int ra_pages; > unsigned int mmap_miss; > loff_t prev_pos; > + struct maple_tree *need_mt; No. Embed the struct maple tree. Don't allocate it. What made you think this was the right approach?
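For readers following the review comment, the difference between the RFC's field and what is being asked for looks roughly like this (illustrative struct names only, not the real file_ra_state layout):
```
/* What the RFC added: a pointer that must be kmalloc()'d and freed. */
struct file_ra_state_rfc {
	/* ... existing readahead fields ... */
	struct maple_tree *need_mt;	/* separate allocation, extra failure path */
};

/* What the review asks for: embed the tree and initialise it in place. */
struct file_ra_state_embedded {
	/* ... existing readahead fields ... */
	struct maple_tree need_mt;	/* mt_init(&ra->need_mt); no allocation */
};
```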
On Sun, Jan 28 2024 at 5:02P -0500, Matthew Wilcox <willy@infradead.org> wrote: > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > > only tries to readahead 512 pages, and the remained part in the advised > > range fallback on normal readahead. > > Does the MAINTAINERS file mean nothing any more? "Ming, please use scripts/get_maintainer.pl when submitting patches." (I've cc'd accordingly with this email). > > If bdi->ra_pages is set as small, readahead will perform not efficient > > enough. Increasing read ahead may not be an option since workload may > > have mixed random and sequential I/O. > > I thik there needs to be a lot more explanation than this about what's > going on before we jump to "And therefore this patch is the right > answer". The patch is "RFC". Ming didn't declare his RFC is "the right answer". All ideas for how best to fix this issue are welcome. I agree this patch's header could've worked harder to establish the problem that it fixes. But I'll now take a crack at backfilling the regression report that motivated this patch to be developed: Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED) allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb. This regressed with commit 6d2be915e589 (so Linux 3.15) such that mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024 and running this reproducer against a 2G file will take ~5x longer (depending on the system's capabilities). mmap_load_test.java follows:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

public class mmap_load_test {

    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
        if (args.length == 0) {
            System.out.println("Please provide a file");
            System.exit(0);
        }
        FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

        System.out.println("Loading the file");

        long startTime = System.currentTimeMillis();
        mem.load();
        long endTime = System.currentTimeMillis();
        System.out.println("Done! Loading took " + (endTime-startTime) + " ms");

    }
}

reproduce with:

javac mmap_load_test.java
echo 64 > /sys/block/sda/queue/read_ahead_kb
echo 1024 > /sys/block/sda/queue/max_sectors_kb
mkfs.xfs /dev/sda
mount /dev/sda /mnt/test
dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000

echo 3 > /proc/sys/vm/drop_caches
java mmap_load_test /mnt/test/2G_file

Without a fix, like the patch Ming provided, iostat will show rareq-sz is 64 rather than ~1024. > > @@ -972,6 +974,7 @@ struct file_ra_state { > > unsigned int ra_pages; > > unsigned int mmap_miss; > > loff_t prev_pos; > > + struct maple_tree *need_mt; > > No. Embed the struct maple tree. Don't allocate it. Constructive feedback, thanks. > > What made you think this was the right approach? But then you closed with an attack, rather than inform Ming and/or others why you feel so strongly, e.g.: Best to keep memory used for file_ra_state contiguous.
On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > On Sun, Jan 28 2024 at 5:02P -0500, > Matthew Wilcox <willy@infradead.org> wrote: > > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > > > only tries to readahead 512 pages, and the remained part in the advised > > > range fallback on normal readahead. > > > > Does the MAINTAINERS file mean nothing any more? > > "Ming, please use scripts/get_maintainer.pl when submitting patches." That's an appropriate response to a new contributor, sure. Ming has been submitting patches since, what, 2008? Surely they know how to submit patches by now. > I agree this patch's header could've worked harder to establish the > problem that it fixes. But I'll now take a crack at backfilling the > regression report that motivated this patch be developed: Thank you. > Linux 3.14 was the last kernel to allow madvise (MADV_WILLNEED) > allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb. > > Ths regressed with commit 6d2be915e589 (so Linux 3.15) such that > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024 > and running this reproducer against a 2G file will take ~5x longer > (depending on the system's capabilities), mmap_load_test.java follows: > > import java.nio.ByteBuffer; > import java.nio.ByteOrder; > import java.io.RandomAccessFile; > import java.nio.MappedByteBuffer; > import java.nio.channels.FileChannel; > import java.io.File; > import java.io.FileNotFoundException; > import java.io.IOException; > > public class mmap_load_test { > > public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException { > if (args.length == 0) { > System.out.println("Please provide a file"); > System.exit(0); > } > FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel(); > MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()); > > System.out.println("Loading the file"); > > long startTime = System.currentTimeMillis(); > mem.load(); > long endTime = System.currentTimeMillis(); > System.out.println("Done! Loading took " + (endTime-startTime) + " ms"); > > } > } It's good to have the original reproducer. The unfortunate part is that being at such a high level, it doesn't really show what syscalls the library makes on behalf of the application. I'll take your word for it that it calls madvise(MADV_WILLNEED). An strace might not go amiss. > reproduce with: > > javac mmap_load_test.java > echo 64 > /sys/block/sda/queue/read_ahead_kb > echo 1024 > /sys/block/sda/queue/max_sectors_kb > mkfs.xfs /dev/sda > mount /dev/sda /mnt/test > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000 > > echo 3 > /proc/sys/vm/drop_caches (I prefer to unmount/mount /mnt/test; it drops the cache for /mnt/test/2G_file without affecting the rest of the system) > java mmap_load_test /mnt/test/2G_file > > Without a fix, like the patch Ming provided, iostat will show rareq-sz > is 64 rather than ~1024. Understood. But ... the application is asking for as much readahead as possible, and the sysadmin has said "Don't readahead more than 64kB at a time". So why will we not get a bug report in 1-15 years time saying "I put a limit on readahead and the kernel is ignoring it"? I think typically we allow the sysadmin to override application requests, don't we? 
> > > @@ -972,6 +974,7 @@ struct file_ra_state { > > > unsigned int ra_pages; > > > unsigned int mmap_miss; > > > loff_t prev_pos; > > > + struct maple_tree *need_mt; > > > > No. Embed the struct maple tree. Don't allocate it. > > Constructive feedback, thanks. > > > What made you think this was the right approach? > > But then you closed with an attack, rather than inform Ming and/or > others why you feel so strongly, e.g.: Best to keep memory used for > file_ra_state contiguous. That's not an attack, it's a genuine question. Is there somewhere else doing it wrong that Ming copied from? Does the documentation need to be clearer? I can't fix what I don't know.
On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > On Sun, Jan 28 2024 at 5:02P -0500, > > Matthew Wilcox <willy@infradead.org> wrote: > > > > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > > > > only tries to readahead 512 pages, and the remained part in the advised > > > > range fallback on normal readahead. > > > > > > Does the MAINTAINERS file mean nothing any more? > > > > "Ming, please use scripts/get_maintainer.pl when submitting patches." > > That's an appropriate response to a new contributor, sure. Ming has > been submitting patches since, what, 2008? Surely they know how to > submit patches by now. > > > I agree this patch's header could've worked harder to establish the > > problem that it fixes. But I'll now take a crack at backfilling the > > regression report that motivated this patch be developed: > > Thank you. > > > Linux 3.14 was the last kernel to allow madvise (MADV_WILLNEED) > > allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb. > > > > Ths regressed with commit 6d2be915e589 (so Linux 3.15) such that > > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024 > > and running this reproducer against a 2G file will take ~5x longer > > (depending on the system's capabilities), mmap_load_test.java follows: > > > > import java.nio.ByteBuffer; > > import java.nio.ByteOrder; > > import java.io.RandomAccessFile; > > import java.nio.MappedByteBuffer; > > import java.nio.channels.FileChannel; > > import java.io.File; > > import java.io.FileNotFoundException; > > import java.io.IOException; > > > > public class mmap_load_test { > > > > public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException { > > if (args.length == 0) { > > System.out.println("Please provide a file"); > > System.exit(0); > > } > > FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel(); > > MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()); > > > > System.out.println("Loading the file"); > > > > long startTime = System.currentTimeMillis(); > > mem.load(); > > long endTime = System.currentTimeMillis(); > > System.out.println("Done! Loading took " + (endTime-startTime) + " ms"); > > > > } > > } > > It's good to have the original reproducer. The unfortunate part is > that being at such a high level, it doesn't really show what syscalls > the library makes on behalf of the application. I'll take your word > for it that it calls madvise(MADV_WILLNEED). An strace might not go > amiss. > > > reproduce with: > > > > javac mmap_load_test.java > > echo 64 > /sys/block/sda/queue/read_ahead_kb > > echo 1024 > /sys/block/sda/queue/max_sectors_kb > > mkfs.xfs /dev/sda > > mount /dev/sda /mnt/test > > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000 > > > > echo 3 > /proc/sys/vm/drop_caches > > (I prefer to unmount/mount /mnt/test; it drops the cache for > /mnt/test/2G_file without affecting the rest of the system) > > > java mmap_load_test /mnt/test/2G_file > > > > Without a fix, like the patch Ming provided, iostat will show rareq-sz > > is 64 rather than ~1024. > > Understood. But ... the application is asking for as much readahead as > possible, and the sysadmin has said "Don't readahead more than 64kB at > a time". 
So why will we not get a bug report in 1-15 years time saying > "I put a limit on readahead and the kernel is ignoring it"? I think > typically we allow the sysadmin to override application requests, > don't we? The application isn't knowingly asking for readahead. It is asking to mmap the file (and reporter wants it done as quickly as possible.. like occurred before). This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting") -- same logic, just applied to callchain that ends up using madvise(MADV_WILLNEED). > > > > @@ -972,6 +974,7 @@ struct file_ra_state { > > > > unsigned int ra_pages; > > > > unsigned int mmap_miss; > > > > loff_t prev_pos; > > > > + struct maple_tree *need_mt; > > > > > > No. Embed the struct maple tree. Don't allocate it. > > > > Constructive feedback, thanks. > > > > > What made you think this was the right approach? > > > > But then you closed with an attack, rather than inform Ming and/or > > others why you feel so strongly, e.g.: Best to keep memory used for > > file_ra_state contiguous. > > That's not an attack, it's a genuine question. Is there somewhere else > doing it wrong that Ming copied from? Does the documentation need to > be clearer? I can't fix what I don't know. OK
On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > Matthew Wilcox <willy@infradead.org> wrote: > > Understood. But ... the application is asking for as much readahead as > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > a time". So why will we not get a bug report in 1-15 years time saying > > "I put a limit on readahead and the kernel is ignoring it"? I think > > typically we allow the sysadmin to override application requests, > > don't we? > > The application isn't knowingly asking for readahead. It is asking to > mmap the file (and reporter wants it done as quickly as possible.. > like occurred before). .. which we do within the constraints of the given configuration. > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > request size based on read-ahead setting") -- same logic, just applied > to callchain that ends up using madvise(MADV_WILLNEED). Not really. There is a difference between performing a synchronous read IO here that we must complete, compared to optimistic asynchronous read-ahead which we can fail or toss away without the user ever seeing the data the IO returned. We want required IO to be done in as few, larger IOs as possible, and not be limited by constraints placed on background optimistic IOs. madvise(WILLNEED) is optimistic IO - there is no requirement that it complete the data reads successfully. If the data is actually required, we'll guarantee completion when the user accesses it, not when madvise() is called. IOWs, madvise is async readahead, and so really should be constrained by readahead bounds and not user IO bounds. We could change this behaviour for madvise of large ranges that we force into the page cache by ignoring device readahead bounds, but I'm not sure we want to do this in general. Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages value in this situation to override the device limit for large ranges (for some definition of large - say 10x bdi->ra_pages) and restore it once the readahead operation is done. This would make it behave less like readahead and more like a user read from an IO perspective... -Dave.
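A rough sketch of the behaviour Dave suggests, assuming a hypothetical helper and ignoring f_lock handling; only force_page_cache_readahead(), inode_to_bdi(), and the f_ra/bdi fields are existing kernel interfaces, and the 10x threshold is just his example:
```
/*
 * Illustrative only: temporarily widen the readahead window for a large
 * WILLNEED range, then restore the device default.  Locking omitted.
 */
static void willneed_readahead_sketch(struct file *file, pgoff_t index,
				      unsigned long nr_pages)
{
	struct backing_dev_info *bdi = inode_to_bdi(file_inode(file));
	unsigned int saved = file->f_ra.ra_pages;

	if (nr_pages > 10 * bdi->ra_pages)		/* "large" per Dave's example */
		file->f_ra.ra_pages = bdi->io_pages;	/* behave like a user read */

	force_page_cache_readahead(file->f_mapping, file, index, nr_pages);

	file->f_ra.ra_pages = saved;			/* back to the device limit */
}
```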
On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradeadorg> wrote: > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > Matthew Wilcox <willy@infradead.org> wrote: > > > Understood. But ... the application is asking for as much readahead as > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > a time". So why will we not get a bug report in 1-15 years time saying > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > typically we allow the sysadmin to override application requests, > > > don't we? > > > > The application isn't knowingly asking for readahead. It is asking to > > mmap the file (and reporter wants it done as quickly as possible.. > > like occurred before). > > .. which we do within the constraints of the given configuration. > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > request size based on read-ahead setting") -- same logic, just applied > > to callchain that ends up using madvise(MADV_WILLNEED). > > Not really. There is a difference between performing a synchronous > read IO here that we must complete, compared to optimistic > asynchronous read-ahead which we can fail or toss away without the > user ever seeing the data the IO returned. > > We want required IO to be done in as few, larger IOs as possible, > and not be limited by constraints placed on background optimistic > IOs. > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > complete the data reads successfully. If the data is actually > required, we'll guarantee completion when the user accesses it, not > when madvise() is called. IOWs, madvise is async readahead, and so > really should be constrained by readahead bounds and not user IO > bounds. > > We could change this behaviour for madvise of large ranges that we > force into the page cache by ignoring device readahead bounds, but > I'm not sure we want to do this in general. > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > value in this situation to override the device limit for large > ranges (for some definition of large - say 10x bdi->ra_pages) and > restore it once the readahead operation is done. This would make it > behave less like readahead and more like a user read from an IO > perspective... I'm not going to pretend like I'm an expert in this code or all the distinctions that are in play. BUT, if you look at the high-level java reproducer: it is requesting mmap of a finite size, starting from the beginning of the file: FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel(); MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()); Yet you're talking about the application like it is stabbingly triggering unbounded async reads that can get dropped, etc, etc. I just want to make sure the subtlety of (ab)using madvise(WILLNEED) like this app does isn't incorrectly attributed to something it isn't. The app really is effectively requesting a user read of a particular extent in terms of mmap, right? BTW, your suggestion to have this app fiddle with ra_pages and then reset it is pretty awful (that is a global setting, being tweaked for a single use, and exposing random IO to excessive readahead should there be a heavy mix of IO to the backing block device). 
Seems the app is providing plenty of context that it shouldn't be bounded in terms of readahead limits, so much so that Ming's patch is conveying the range the madvise(WILLNEED) is provided by the app so as to _know_ if the requested page(s) fall within the requested size; Linux just happens to be fulfilling the syscall in terms of readahead. FYI, here is the evolution of this use-case back from when it issued larger IO back in 3.14, Ming documented it I'm just sharing in case it helps: 3.14: madvise_willneed() -> force_page_cache_readahead() force_page_cache_readahead() will read all pages in the specified range 3.15: madvise_willneed() -> vfs_fadvise() -> generic_fadvise() -> force_page_cache_readahead() force_page_cache_readahead() only reads at most device max_sectors bytes, and the remainder is read in filemap_fault() Things start to change since: 1) 6d2be915e589 ("mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead max_pages") which limits at most 2Mbytes to be read in madvise_willneed() so the remained bytes are read in filemap_fault() via normal readahead 2) 600e19afc5f8 ("mm: use only per-device readahead limit") limits at most .ra_pages bytes to read in madvise_willneed() 3) 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting") relax the limit by reading at most max sectors bytes in madvise_willneed()
On Sun, Jan 28, 2024 at 10:02:49PM +0000, Matthew Wilcox wrote: > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > > only tries to readahead 512 pages, and the remained part in the advised > > range fallback on normal readahead. > > Does the MAINTAINERS file mean nothing any more? I just missed Cc'ing you, sorry. > > > If bdi->ra_pages is set as small, readahead will perform not efficient > > enough. Increasing read ahead may not be an option since workload may > > have mixed random and sequential I/O. > > I thik there needs to be a lot more explanation than this about what's > going on before we jump to "And therefore this patch is the right > answer". Both 6d2be915e589 and the commit log provide background about this issue; let me explain it more: 1) before commit 6d2be915e589, the madvise/fadvise(WILLNEED)/readahead syscalls try to read ahead the whole specified range if memory allows, and for each readahead in this range the ra size is set to the max sectors of the block device, see force_page_cache_ra(). 2) since commit 6d2be915e589, only 2MB are loaded by these syscalls, and the remaining bytes fall back to future normal readahead when reading from the page cache or mmap buffer. 3) this patch wires the advise(WILLNEED) range info into normal readahead for both the mmap fault and buffered read code paths, so each readahead can use the device's max sectors for the ra size, basically taking the same approach as before commit 6d2be915e589. > > > @@ -972,6 +974,7 @@ struct file_ra_state { > > unsigned int ra_pages; > > unsigned int mmap_miss; > > loff_t prev_pos; > > + struct maple_tree *need_mt; > > No. Embed the struct maple tree. Don't allocate it. What made you think this was the right approach? Can you explain why it has to be embedded? core-api/maple_tree.rst mentions it is fine to call "mt_init() for dynamically allocated ones". The maple tree provides an easy way to record the advised willneed range, so the readahead code path can apply this info to speed up readahead. Thanks, Ming
On Mon, Jan 29, 2024 at 12:21:16AM +0000, Matthew Wilcox wrote: > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > On Sun, Jan 28 2024 at 5:02P -0500, > > Matthew Wilcox <willy@infradead.org> wrote: > > > > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > > > > only tries to readahead 512 pages, and the remained part in the advised > > > > range fallback on normal readahead. > > > > > > Does the MAINTAINERS file mean nothing any more? > > > > "Ming, please use scripts/get_maintainer.pl when submitting patches." > > That's an appropriate response to a new contributor, sure. Ming has > been submitting patches since, what, 2008? Surely they know how to > submit patches by now. > > > I agree this patch's header could've worked harder to establish the > > problem that it fixes. But I'll now take a crack at backfilling the > > regression report that motivated this patch be developed: > > Thank you. > > > Linux 3.14 was the last kernel to allow madvise (MADV_WILLNEED) > > allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb. > > > > Ths regressed with commit 6d2be915e589 (so Linux 3.15) such that > > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024 > > and running this reproducer against a 2G file will take ~5x longer > > (depending on the system's capabilities), mmap_load_test.java follows: > > > > import java.nio.ByteBuffer; > > import java.nio.ByteOrder; > > import java.io.RandomAccessFile; > > import java.nio.MappedByteBuffer; > > import java.nio.channels.FileChannel; > > import java.io.File; > > import java.io.FileNotFoundException; > > import java.io.IOException; > > > > public class mmap_load_test { > > > > public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException { > > if (args.length == 0) { > > System.out.println("Please provide a file"); > > System.exit(0); > > } > > FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel(); > > MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()); > > > > System.out.println("Loading the file"); > > > > long startTime = System.currentTimeMillis(); > > mem.load(); > > long endTime = System.currentTimeMillis(); > > System.out.println("Done! Loading took " + (endTime-startTime) + " ms"); > > > > } > > } > > It's good to have the original reproducer. The unfortunate part is > that being at such a high level, it doesn't really show what syscalls > the library makes on behalf of the application. I'll take your word > for it that it calls madvise(MADV_WILLNEED). An strace might not go > amiss. Yeah, it can be fadvise(WILLNEED)/readahead syscall too. > > > reproduce with: > > > > javac mmap_load_test.java > > echo 64 > /sys/block/sda/queue/read_ahead_kb > > echo 1024 > /sys/block/sda/queue/max_sectors_kb > > mkfs.xfs /dev/sda > > mount /dev/sda /mnt/test > > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000 > > > > echo 3 > /proc/sys/vm/drop_caches > > (I prefer to unmount/mount /mnt/test; it drops the cache for > /mnt/test/2G_file without affecting the rest of the system) > > > java mmap_load_test /mnt/test/2G_file > > > > Without a fix, like the patch Ming provided, iostat will show rareq-sz > > is 64 rather than ~1024. > > Understood. But ... 
the application is asking for as much readahead as > possible, and the sysadmin has said "Don't readahead more than 64kB at > a time". So why will we not get a bug report in 1-15 years time saying > "I put a limit on readahead and the kernel is ignoring it"? I think > typically we allow the sysadmin to override application requests, > don't we? ra_pages is just one hint for readahead; the reality is that the sysadmin can't know how many bytes are ideal for readahead. But the application often knows how many bytes it will need, so here I think the application's requirement should have higher priority, especially when the application doesn't want the kernel to read ahead blindly. And the madvise/fadvise(WILLNEED) syscalls already read bdi->io_pages first, which is bigger than ra_pages. Thanks, Ming
On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > Matthew Wilcox <willy@infradead.org> wrote: > > > Understood. But ... the application is asking for as much readahead as > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > a time". So why will we not get a bug report in 1-15 years time saying > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > typically we allow the sysadmin to override application requests, > > > don't we? > > > > The application isn't knowingly asking for readahead. It is asking to > > mmap the file (and reporter wants it done as quickly as possible.. > > like occurred before). > > ... which we do within the constraints of the given configuration. > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > request size based on read-ahead setting") -- same logic, just applied > > to callchain that ends up using madvise(MADV_WILLNEED). > > Not really. There is a difference between performing a synchronous > read IO here that we must complete, compared to optimistic > asynchronous read-ahead which we can fail or toss away without the > user ever seeing the data the IO returned. Yeah, the big readahead in this patch happens when user starts to read over mmaped buffer instead of madvise(). > > We want required IO to be done in as few, larger IOs as possible, > and not be limited by constraints placed on background optimistic > IOs. > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > complete the data reads successfully. If the data is actually > required, we'll guarantee completion when the user accesses it, not > when madvise() is called. IOWs, madvise is async readahead, and so > really should be constrained by readahead bounds and not user IO > bounds. > > We could change this behaviour for madvise of large ranges that we > force into the page cache by ignoring device readahead bounds, but > I'm not sure we want to do this in general. > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > value in this situation to override the device limit for large > ranges (for some definition of large - say 10x bdi->ra_pages) and > restore it once the readahead operation is done. This would make it > behave less like readahead and more like a user read from an IO > perspective... ->ra_pages is just one hint, which is 128KB at default, and either device or userspace can override it. fadvise/madvise(willneed) already readahead bytes from bdi->io_pages which is the max device sector size(often 10X of ->ra_pages), please see force_page_cache_ra(). Follows the current report: 1) usersapce call madvise(willneed, 1G) 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is readahead in madvise(willneed, 1G) since commit 6d2be915e589 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is set as 64KB by userspace when userspace reads the mmaped buffer, then the whole application becomes slower. This patch changes 3) to use bdi->io_pages as readahead unit. Thanks, Ming
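For reference, the force_page_cache_ra() behaviour Ming points at can be summarised roughly as below (a simplified paraphrase of mm/readahead.c, not verbatim kernel code): the advised size is capped at max(bdi->io_pages, ra_pages) and then issued in 2MB chunks.
```
/*
 * Simplified paraphrase of force_page_cache_ra(); index bookkeeping and
 * error checks omitted.
 */
static void force_ra_paraphrase(struct readahead_control *ractl,
				unsigned long nr_to_read)
{
	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
	unsigned long max_pages = max_t(unsigned long, bdi->io_pages,
					ractl->ra->ra_pages);

	nr_to_read = min(nr_to_read, max_pages);	/* step 2) in Ming's list */
	while (nr_to_read) {
		unsigned long chunk = min_t(unsigned long, nr_to_read,
					    SZ_2M / PAGE_SIZE);

		do_page_cache_ra(ractl, chunk, 0);	/* at most 2MB per call */
		nr_to_read -= chunk;
	}
}
```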
On Sun, Jan 28, 2024 at 09:12:12PM -0500, Mike Snitzer wrote: > On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > > Matthew Wilcox <willy@infradead.org> wrote: > > > > Understood. But ... the application is asking for as much readahead as > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > > a time". So why will we not get a bug report in 1-15 years time saying > > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > > typically we allow the sysadmin to override application requests, > > > > don't we? > > > > > > The application isn't knowingly asking for readahead. It is asking to > > > mmap the file (and reporter wants it done as quickly as possible.. > > > like occurred before). > > > > .. which we do within the constraints of the given configuration. > > > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > > request size based on read-ahead setting") -- same logic, just applied > > > to callchain that ends up using madvise(MADV_WILLNEED). > > > > Not really. There is a difference between performing a synchronous > > read IO here that we must complete, compared to optimistic > > asynchronous read-ahead which we can fail or toss away without the > > user ever seeing the data the IO returned. > > > > We want required IO to be done in as few, larger IOs as possible, > > and not be limited by constraints placed on background optimistic > > IOs. > > > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > > complete the data reads successfully. If the data is actually > > required, we'll guarantee completion when the user accesses it, not > > when madvise() is called. IOWs, madvise is async readahead, and so > > really should be constrained by readahead bounds and not user IO > > bounds. > > > > We could change this behaviour for madvise of large ranges that we > > force into the page cache by ignoring device readahead bounds, but > > I'm not sure we want to do this in general. > > > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > > value in this situation to override the device limit for large > > ranges (for some definition of large - say 10x bdi->ra_pages) and > > restore it once the readahead operation is done. This would make it > > behave less like readahead and more like a user read from an IO > > perspective... > > I'm not going to pretend like I'm an expert in this code or all the > distinctions that are in play. BUT, if you look at the high-level > java reproducer: it is requesting mmap of a finite size, starting from > the beginning of the file: > FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel(); > MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()); Mapping an entire file does not mean "we are going to access the entire file". Lots of code will do this, especially those that do random accesses within the file. > Yet you're talking about the application like it is stabbingly > triggering unbounded async reads that can get dropped, etc, etc. I I don't know what the application actually does. All I see is a microbenchmark that mmaps() a file and walks it sequentially. 
On a system where readahead has been tuned to de-prioritise sequential IO performance. > just want to make sure the subtlety of (ab)using madvise(WILLNEED) > like this app does isn't incorrectly attributed to something it isn't. > The app really is effectively requesting a user read of a particular > extent in terms of mmap, right? madvise() is an -advisory- interface that does not guarantee any specific behaviour. the man page says: MADV_WILLNEED Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.) It says nothing about guaranteeing that all the data is brought into memory, or that if it does get brought into memory that it will remain there until the application accesses it. It doesn't even imply that IO will even be done immediately. Any application relying on madvise() to fully populate the page cache range before returning is expecting behaviour that is not documented nor guaranteed. Similarly, the fadvise64() man page does not say that WILLNEED will bring the entire file into cache: POSIX_FADV_WILLNEED The specified data will be accessed in the near future. POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be de‐ creased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely use‐ ful.) > BTW, your suggestion to have this app fiddle with ra_pages and then No, I did not suggest that the app fiddle with anything. I was talking about the in-kernel FADV_WILLNEED implementation changing file->f_ra.ra_pages similar to how FADV_RANDOM and FADV_SEQUENTIAL do to change readahead IO behaviour. That then allows subsequent readahead on that vma->file to use a larger value than the default value pulled in off the device. Largely, I think the problem is that the application has set a readahead limit too low for optimal sequential performance. Complaining that readahead is slow when it has been explicitly tuned to be slow doesn't really seem like a problem we can fix with code. -Dave.
On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > > Matthew Wilcox <willy@infradead.org> wrote: > > > > Understood. But ... the application is asking for as much readahead as > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > > a time". So why will we not get a bug report in 1-15 years time saying > > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > > typically we allow the sysadmin to override application requests, > > > > don't we? > > > > > > The application isn't knowingly asking for readahead. It is asking to > > > mmap the file (and reporter wants it done as quickly as possible.. > > > like occurred before). > > > > ... which we do within the constraints of the given configuration. > > > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > > request size based on read-ahead setting") -- same logic, just applied > > > to callchain that ends up using madvise(MADV_WILLNEED). > > > > Not really. There is a difference between performing a synchronous > > read IO here that we must complete, compared to optimistic > > asynchronous read-ahead which we can fail or toss away without the > > user ever seeing the data the IO returned. > > Yeah, the big readahead in this patch happens when user starts to read > over mmaped buffer instead of madvise(). Yes, that's how it is intended to work :/ > > We want required IO to be done in as few, larger IOs as possible, > > and not be limited by constraints placed on background optimistic > > IOs. > > > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > > complete the data reads successfully. If the data is actually > > required, we'll guarantee completion when the user accesses it, not > > when madvise() is called. IOWs, madvise is async readahead, and so > > really should be constrained by readahead bounds and not user IO > > bounds. > > > > We could change this behaviour for madvise of large ranges that we > > force into the page cache by ignoring device readahead bounds, but > > I'm not sure we want to do this in general. > > > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > > value in this situation to override the device limit for large > > ranges (for some definition of large - say 10x bdi->ra_pages) and > > restore it once the readahead operation is done. This would make it > > behave less like readahead and more like a user read from an IO > > perspective... > > ->ra_pages is just one hint, which is 128KB at default, and either > device or userspace can override it. > > fadvise/madvise(willneed) already readahead bytes from bdi->io_pages which > is the max device sector size(often 10X of ->ra_pages), please see > force_page_cache_ra(). Yes, but if we also change vma->file->f_ra->ra_pages during the WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a larger readahead window for the demand-paged access portion of the WILLNEED access... 
> > Follows the current report: > > 1) usersapce call madvise(willneed, 1G) > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > set as 64KB by userspace when userspace reads the mmaped buffer, then > the whole application becomes slower. It gets limited by file->f_ra->ra_pages being initialised to bdi->ra_pages and then never changed as the advice for access methods to the file are changed. But the problem here is *not the readahead code*. The problem is that the user has configured the device readahead window to be far smaller than is optimal for the storage. Hence readahead is slow. The fix for that is to either increase the device readahead windows, or to change the specific readahead window for the file that has sequential access patterns. Indeed, we already have that - FADV_SEQUENTIAL will set file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses larger IOs for that access. That's what should happen here - MADV_WILLNEED does not imply a specific access pattern so the application should be running MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED to start the readahead, and then the rest of the on-demand readahead will get the higher readahead limits. > This patch changes 3) to use bdi->io_pages as readahead unit. I think it really should be changing MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the mem.load() implementation in the application converted to use MADV_SEQUENTIAL to properly indicate it's access pattern to the readahead algorithm. Cheers, Dave.
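The FADV_SEQUENTIAL behaviour Dave refers to is, roughly, the following fragment of generic_fadvise() (a paraphrase with f_lock handling omitted); his suggestion amounts to writing bdi->io_pages instead of bdi->ra_pages * 2 in the SEQUENTIAL case:
```
/* Paraphrase of the advice switch in mm/fadvise.c; locking omitted. */
switch (advice) {
case POSIX_FADV_NORMAL:
	file->f_ra.ra_pages = bdi->ra_pages;		/* device default window */
	file->f_mode &= ~FMODE_RANDOM;
	break;
case POSIX_FADV_SEQUENTIAL:
	file->f_ra.ra_pages = bdi->ra_pages * 2;	/* the 2x Dave mentions */
	file->f_mode &= ~FMODE_RANDOM;
	break;
case POSIX_FADV_RANDOM:
	file->f_mode |= FMODE_RANDOM;			/* minimal readahead */
	break;
}
```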
On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote: > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > > > Matthew Wilcox <willy@infradead.org> wrote: > > > > > Understood. But ... the application is asking for as much readahead as > > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > > > a time". So why will we not get a bug report in 1-15 years time saying > > > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > > > typically we allow the sysadmin to override application requests, > > > > > don't we? > > > > > > > > The application isn't knowingly asking for readahead. It is asking to > > > > mmap the file (and reporter wants it done as quickly as possible.. > > > > like occurred before). > > > > > > ... which we do within the constraints of the given configuration. > > > > > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > > > request size based on read-ahead setting") -- same logic, just applied > > > > to callchain that ends up using madvise(MADV_WILLNEED). > > > > > > Not really. There is a difference between performing a synchronous > > > read IO here that we must complete, compared to optimistic > > > asynchronous read-ahead which we can fail or toss away without the > > > user ever seeing the data the IO returned. > > > > Yeah, the big readahead in this patch happens when user starts to read > > over mmaped buffer instead of madvise(). > > Yes, that's how it is intended to work :/ > > > > We want required IO to be done in as few, larger IOs as possible, > > > and not be limited by constraints placed on background optimistic > > > IOs. > > > > > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > > > complete the data reads successfully. If the data is actually > > > required, we'll guarantee completion when the user accesses it, not > > > when madvise() is called. IOWs, madvise is async readahead, and so > > > really should be constrained by readahead bounds and not user IO > > > bounds. > > > > > > We could change this behaviour for madvise of large ranges that we > > > force into the page cache by ignoring device readahead bounds, but > > > I'm not sure we want to do this in general. > > > > > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > > > value in this situation to override the device limit for large > > > ranges (for some definition of large - say 10x bdi->ra_pages) and > > > restore it once the readahead operation is done. This would make it > > > behave less like readahead and more like a user read from an IO > > > perspective... > > > > ->ra_pages is just one hint, which is 128KB at default, and either > > device or userspace can override it. > > > > fadvise/madvise(willneed) already readahead bytes from bdi->io_pages which > > is the max device sector size(often 10X of ->ra_pages), please see > > force_page_cache_ra(). > > Yes, but if we also change vma->file->f_ra->ra_pages during the > WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a > larger readahead window for the demand-paged access portion of the > WILLNEED access... 
> > > > > Follows the current report: > > > > 1) usersapce call madvise(willneed, 1G) > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > > set as 64KB by userspace when userspace reads the mmaped buffer, then > > the whole application becomes slower. > > It gets limited by file->f_ra->ra_pages being initialised to > bdi->ra_pages and then never changed as the advice for access > methods to the file are changed. > > But the problem here is *not the readahead code*. The problem is > that the user has configured the device readahead window to be far > smaller than is optimal for the storage. Hence readahead is slow. > The fix for that is to either increase the device readahead windows, > or to change the specific readahead window for the file that has > sequential access patterns. > > Indeed, we already have that - FADV_SEQUENTIAL will set > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses > larger IOs for that access. > > That's what should happen here - MADV_WILLNEED does not imply a > specific access pattern so the application should be running > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED > to start the readahead, and then the rest of the on-demand readahead > will get the higher readahead limits. > > > This patch changes 3) to use bdi->io_pages as readahead unit. > > I think it really should be changing MADV/FADV_SEQUENTIAL to set > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the > mem.load() implementation in the application converted to use > MADV_SEQUENTIAL to properly indicate it's access pattern to the > readahead algorithm. Here the single .ra_pages may not work, that is why this patch stores the willneed range in maple tree, please see the following words from the original RH report: " Increasing read ahead is not an option as it has a mixed I/O workload of random I/O and sequential I/O, so that a large read ahead is very counterproductive to the random I/O and is unacceptable. " Also almost all these advises(SEQUENTIA, WILLNEED, NORMAL, RANDOM) ignore the passed range, and the behavior becomes all or nothing, instead of something only for the specified range, which may not match with man, please see 'man posix_fadvise': ``` POSIX_FADV_NORMAL Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption. POSIX_FADV_SEQUENTIAL The application expects to access the specified data sequentially (with lower offsets read before higher ones). POSIX_FADV_RANDOM The specified data will be accessed in random order. POSIX_FADV_NOREUSE The specified data will be accessed only once. In kernels before 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since kernel 2.6.18, this flag is a no-op. POSIX_FADV_WILLNEED The specified data will be accessed in the near future. ``` It is even worse for readahead() syscall: ``` DESCRIPTION readahead() initiates readahead on a file so that subsequent reads from that file will be satisfied from the cache, and not block on disk I/O (assuming the readahead was initiated early enough and that other activity on the system did not in the meantime flush pages from the cache). ``` Thanks, Ming
On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote: > Here the single .ra_pages may not work, that is why this patch stores > the willneed range in maple tree, please see the following words from > the original RH report: > > " > Increasing read ahead is not an option as it has a mixed I/O workload of > random I/O and sequential I/O, so that a large read ahead is very counterproductive > to the random I/O and is unacceptable. > " It is really frustrating having to drag this important information out of you. Please send the full bug report (stripping identifying information if that's what the customer needs). We seem to be covering the same ground again in public that apparently you've already done in private. This is no way to work with upstream.
On Mon, Jan 29 2024 at 8:26P -0500, Matthew Wilcox <willy@infradead.org> wrote: > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote: > > Here the single .ra_pages may not work, that is why this patch stores > > the willneed range in maple tree, please see the following words from > > the original RH report: > > > > " > > Increasing read ahead is not an option as it has a mixed I/O workload of > > random I/O and sequential I/O, so that a large read ahead is very counterproductive > > to the random I/O and is unacceptable. > > " > > It is really frustrating having to drag this important information out of > you. Please send the full bug report (stripping identifying information > if that's what the customer needs). We seem to be covering the same > ground again in public that apparently you've already done in private. > This is no way to work with upstream. We are all upstream developers here. And Ming is among the best of us, so please try to stay constructive. You now have the full bug report (I provided the reproducer and additonal context, and now Ming shared the "Increasing read ahead is not an option..." detail from the original report). BUT Ming did mention the potential for mixed workload in his original RFC patch header: On Sun, Jan 28 2024 at 9:25P -0500, Ming Lei <ming.lei@redhat.com> wrote: > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > only tries to readahead 512 pages, and the remained part in the advised > range fallback on normal readahead. > > If bdi->ra_pages is set as small, readahead will perform not efficient > enough. Increasing read ahead may not be an option since workload may > have mixed random and sequential I/O. And while both you and Dave rightly seized on the seemingly "Doctor it hurts when I clamp readahead to be small but then issue larger sequential reads and want it to use larger readahead" aspect of this report... Dave was helpful with his reasoned follow-up responses, culminating with this one (where the discussion evolved to clearly consider the fact that an integral part of addressing the reported issue is the need to allow for mixed workloads not stomping on each other when issuing IO to the same backing block device): On Mon, Jan 29 2024 at 12:15P -0500, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > > > Matthew Wilcox <willy@infradead.org> wrote: > > > > > Understood. But ... the application is asking for as much readahead as > > > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > > > a time". So why will we not get a bug report in 1-15 years time saying > > > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > > > typically we allow the sysadmin to override application requests, > > > > > don't we? > > > > > > > > The application isn't knowingly asking for readahead. It is asking to > > > > mmap the file (and reporter wants it done as quickly as possible.. > > > > like occurred before). > > > > > > ... which we do within the constraints of the given configuration. 
> > > > > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > > > request size based on read-ahead setting") -- same logic, just applied > > > > to callchain that ends up using madvise(MADV_WILLNEED). > > > > > > Not really. There is a difference between performing a synchronous > > > read IO here that we must complete, compared to optimistic > > > asynchronous read-ahead which we can fail or toss away without the > > > user ever seeing the data the IO returned. > > > > Yeah, the big readahead in this patch happens when user starts to read > > over mmaped buffer instead of madvise(). > > Yes, that's how it is intended to work :/ > > > > We want required IO to be done in as few, larger IOs as possible, > > > and not be limited by constraints placed on background optimistic > > > IOs. > > > > > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > > > complete the data reads successfully. If the data is actually > > > required, we'll guarantee completion when the user accesses it, not > > > when madvise() is called. IOWs, madvise is async readahead, and so > > > really should be constrained by readahead bounds and not user IO > > > bounds. > > > > > > We could change this behaviour for madvise of large ranges that we > > > force into the page cache by ignoring device readahead bounds, but > > > I'm not sure we want to do this in general. > > > > > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > > > value in this situation to override the device limit for large > > > ranges (for some definition of large - say 10x bdi->ra_pages) and > > > restore it once the readahead operation is done. This would make it > > > behave less like readahead and more like a user read from an IO > > > perspective... > > > > ->ra_pages is just one hint, which is 128KB at default, and either > > device or userspace can override it. > > > > fadvise/madvise(willneed) already readahead bytes from bdi->io_pages which > > is the max device sector size(often 10X of ->ra_pages), please see > > force_page_cache_ra(). > > Yes, but if we also change vma->file->f_ra->ra_pages during the > WILLNEED operation (as we do for FADV_SEQUENTIAL) then we get a > larger readahead window for the demand-paged access portion of the > WILLNEED access... > > > > > Follows the current report: > > > > 1) usersapce call madvise(willneed, 1G) > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > > set as 64KB by userspace when userspace reads the mmaped buffer, then > > the whole application becomes slower. > > It gets limited by file->f_ra->ra_pages being initialised to > bdi->ra_pages and then never changed as the advice for access > methods to the file are changed. > > But the problem here is *not the readahead code*. The problem is > that the user has configured the device readahead window to be far > smaller than is optimal for the storage. Hence readahead is slow. > The fix for that is to either increase the device readahead windows, > or to change the specific readahead window for the file that has > sequential access patterns. > > Indeed, we already have that - FADV_SEQUENTIAL will set > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses > larger IOs for that access. 
> > That's what should happen here - MADV_WILLNEED does not imply a > specific access pattern so the application should be running > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED > to start the readahead, and then the rest of the on-demand readahead > will get the higher readahead limits. > > > This patch changes 3) to use bdi->io_pages as readahead unit. > > I think it really should be changing MADV/FADV_SEQUENTIAL to set > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the > mem.load() implementation in the application converted to use > MADV_SEQUENTIAL to properly indicate it's access pattern to the > readahead algorithm. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > Anyway, if we could all keep cool and just finish the discussion about how to address the reported issue in a long-term supported way that'd be _awesome_. While I'm sure this legacy application would love to not have to change its code at all, I think we can all agree that we need to just focus on how best to advise applications that have mixed workloads accomplish efficient mmap+read of both sequential and random. To that end, I heard Dave clearly suggest 2 things: 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2 2) Have the application first issue MADV_SEQUENTIAL to convey that for the following MADV_WILLNEED is for sequential file load (so it is desirable to use larger ra_pages) This overrides the default of bdi->io_pages and _should_ provide the required per-file duality of control for readahead, correct? Thanks, Mike
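A minimal userspace sketch of the two-step hint sequence Dave describes and Mike summarizes above (the file path is hypothetical and error handling is abbreviated; this is illustrative, not code from the thread):

```
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/path/to/datafile", O_RDONLY);	/* hypothetical path */
	struct stat st;
	void *addr;

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("open/fstat");
		return 1;
	}

	addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* 1) Declare the sequential access pattern first... */
	if (madvise(addr, st.st_size, MADV_SEQUENTIAL))
		perror("madvise(MADV_SEQUENTIAL)");

	/* 2) ...then ask for readahead; later on-demand faults should see
	 *    the larger per-file readahead window discussed in the thread. */
	if (madvise(addr, st.st_size, MADV_WILLNEED))
		perror("madvise(MADV_WILLNEED)");

	/* ... sequential reads of the mapping follow ... */
	munmap(addr, st.st_size);
	close(fd);
	return 0;
}
```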
On Mon, Jan 29 2024 at 12:19P -0500, Mike Snitzer <snitzer@kernel.org> wrote: > While I'm sure this legacy application would love to not have to > change its code at all, I think we can all agree that we need to just > focus on how best to advise applications that have mixed workloads > accomplish efficient mmap+read of both sequential and random. > > To that end, I heard Dave clearly suggest 2 things: > > 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to > bdi->io_pages, not bdi->ra_pages * 2 > > 2) Have the application first issue MADV_SEQUENTIAL to convey that for > the following MADV_WILLNEED is for sequential file load (so it is > desirable to use larger ra_pages) > > This overrides the default of bdi->io_pages and _should_ provide the > required per-file duality of control for readahead, correct? I meant "This overrides the default of bdi->ra_pages ..." ;)
On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote: > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote: > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > Follows the current report: > > > > > > 1) usersapce call madvise(willneed, 1G) > > > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > > > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > > > set as 64KB by userspace when userspace reads the mmaped buffer, then > > > the whole application becomes slower. > > > > It gets limited by file->f_ra->ra_pages being initialised to > > bdi->ra_pages and then never changed as the advice for access > > methods to the file are changed. > > > > But the problem here is *not the readahead code*. The problem is > > that the user has configured the device readahead window to be far > > smaller than is optimal for the storage. Hence readahead is slow. > > The fix for that is to either increase the device readahead windows, > > or to change the specific readahead window for the file that has > > sequential access patterns. > > > > Indeed, we already have that - FADV_SEQUENTIAL will set > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses > > larger IOs for that access. > > > > That's what should happen here - MADV_WILLNEED does not imply a > > specific access pattern so the application should be running > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED > > to start the readahead, and then the rest of the on-demand readahead > > will get the higher readahead limits. > > > > > This patch changes 3) to use bdi->io_pages as readahead unit. > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the > > mem.load() implementation in the application converted to use > > MADV_SEQUENTIAL to properly indicate it's access pattern to the > > readahead algorithm. > > Here the single .ra_pages may not work, that is why this patch stores > the willneed range in maple tree, please see the following words from > the original RH report: > " > Increasing read ahead is not an option as it has a mixed I/O workload of > random I/O and sequential I/O, so that a large read ahead is very counterproductive > to the random I/O and is unacceptable. > " Yes, I've read the bug. There's no triage that tells us what the root cause of the application perofrmance issue might be. Just an assertion that "this is how we did it 10 years ago, it's been unchanged for all this time, the new kernel we are upgrading to needs to behave exactly like pre-3.10 era kernels did. And to be totally honest, my instincts tell me this is more likely a problem with a root cause in poor IO scheduling decisions than be a problem with the page cache readahead implementation. Readahead has been turned down to stop the bandwidth it uses via background async read IO from starving latency dependent foreground random IO operation, and then we're being asked to turn readahead back up in specific situations because it's actually needed for performance in certain access patterns. This is the sort of thing bfq is intended to solve. 
> Also almost all these advises(SEQUENTIA, WILLNEED, NORMAL, RANDOM) > ignore the passed range, and the behavior becomes all or nothing, > instead of something only for the specified range, which may not > match with man, please see 'man posix_fadvise': The man page says: The advice is not binding; it merely constitutes an expectation on behalf of the application. > It is even worse for readahead() syscall: > > ``` DESCRIPTION readahead() initiates readahead on a file > so that subsequent reads from that file will be satisfied > from the cache, and not block on disk I/O (assuming the > readahead was initiated early enough and that other activity > on the system did not in the meantime flush pages from the > cache). ``` Yes, that's been "broken" for a long time (since the changes to cap force_page_cache_readahead() to ra_pages way back when), but the assumption documented about when readahead(2) will work goes to the heart of why we don't let user controlled readahead actually do much in the way of direct readahead. i.e. too much readahead is typically harmful to IO and system performance and very, very few applications actually need files preloaded entirely into memory. ---- All said, I'm starting to think that there isn't an immediate upstream kernel change needed right now. I just did a quick check through the madvise() man page to see if I'd missed anything, and I most definitely did miss what is a relatively new addition to it: MADV_POPULATE_READ (since Linux 5.14) "Populate (prefault) page tables readable, faulting in all pages in the range just as if manually reading from each page; however, avoid the actual memory access that would have been performed after handling the fault. In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide errors, can be applied to (parts of) existing mappings and will al‐ ways populate (prefault) page tables readable. One example use case is prefaulting a file mapping, reading all file content from disk; however, pages won't be dirtied and consequently won't have to be written back to disk when evicting the pages from memory. That's exactly what the application is apparently wanting MADV_WILLNEED to do. Please read the commit message for commit 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It has some relevant commentary on why MADV_WILLNEED could not be modified to meet the pre-population requirements of the applications that required this pre-population behaviour from the kernel. With this, I suspect that the application needs to be updated to use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go back and do some analysis of the readahead behaviour of the application and the MADV_POPULATE_READ operation. We may need to tweak MADV_POPULATE_READ for large readahead IO, but that's OK because it's no longer "optimistic speculation" about whether the data is needed in cache - the operation being performed guarantees that or it fails with an error. IOWs, MADV_POPULATE_READ is effectively user data IO at this point, not advice about future access patterns... Cheers, Dave.
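A sketch of what switching the range-population step from MADV_WILLNEED to MADV_POPULATE_READ could look like (illustrative only; the fallback #define assumes the uapi value and can be dropped with current headers):

```
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22	/* uapi value, for older userspace headers */
#endif

/*
 * Unlike MADV_WILLNEED, this blocks until the range is populated (or fails)
 * and reports errors, so the caller actually learns whether the data made it
 * into the page cache. Requires Linux 5.14+.
 */
static int populate_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_POPULATE_READ) == 0)
		return 0;

	/* EINVAL typically means the running kernel predates 5.14. */
	fprintf(stderr, "MADV_POPULATE_READ failed: errno=%d\n", errno);
	return -1;
}
```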
On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote: > While I'm sure this legacy application would love to not have to > change its code at all, I think we can all agree that we need to just > focus on how best to advise applications that have mixed workloads > accomplish efficient mmap+read of both sequential and random. > > To that end, I heard Dave clearly suggest 2 things: > > 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to > bdi->io_pages, not bdi->ra_pages * 2 > > 2) Have the application first issue MADV_SEQUENTIAL to convey that for > the following MADV_WILLNEED is for sequential file load (so it is > desirable to use larger ra_pages) > > This overrides the default of bdi->io_pages and _should_ provide the > required per-file duality of control for readahead, correct? I just discovered MADV_POPULATE_READ - see my reply to Ming up-thread about that. The application should use that instead of MADV_WILLNEED because it gives cache population guarantees that WILLNEED doesn't. Then we can look at optimising the performance of MADV_POPULATE_READ (if needed) as there is constrained scope we can optimise within in ways that we cannot do with WILLNEED. -Dave.
On Mon, Jan 29 2024 at 5:12P -0500, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote: > > While I'm sure this legacy application would love to not have to > > change its code at all, I think we can all agree that we need to just > > focus on how best to advise applications that have mixed workloads > > accomplish efficient mmap+read of both sequential and random. > > > > To that end, I heard Dave clearly suggest 2 things: > > > > 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to > > bdi->io_pages, not bdi->ra_pages * 2 > > > > 2) Have the application first issue MADV_SEQUENTIAL to convey that for > > the following MADV_WILLNEED is for sequential file load (so it is > > desirable to use larger ra_pages) > > > > This overrides the default of bdi->ra_pages and _should_ provide the > > required per-file duality of control for readahead, correct? > > I just discovered MADV_POPULATE_READ - see my reply to Ming > up-thread about that. The applicaiton should use that instead of > MADV_WILLNEED because it gives cache population guarantees that > WILLNEED doesn't. Then we can look at optimising the performance of > MADV_POPULATE_READ (if needed) as there is constrained scope we can > optimise within in ways that we cannot do with WILLNEED. Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David Hildenbrand just so he's in the loop. FYI, I proactively raised feedback and questions to the reporter of this issue: CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access, sequential vs random, just the range that may be accessed. Q1: Is your application's sequential vs random (or smaller sequential) access split on a per-file basis? Or is the same file accessed both sequentially and randomly? A1: The same files can be accessed either randomly or sequentially, depending on certain access patterns and optimizing logic. Q2: Can the application be changed to use madvise() MADV_SEQUENTIAL and MADV_RANDOM to indicate its access pattern? A2: No, the application is a Java application. Java does not expose MADVISE API directly. Our application uses Java NIO API via MappedByteBuffer.load() (https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html#load--) that calls MADVISE_WILLNEED at the low level. There is no way for us to switch this behavior, but we take advantage of this behavior to optimize large file sequential I/O with great success. So it's looking like it'll be hard to help this reporter avoid changes... but that's not upstream's problem! Mike
On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote: > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote: > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote: > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > > Follows the current report: > > > > > > > > 1) usersapce call madvise(willneed, 1G) > > > > > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > > > > > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > > > > set as 64KB by userspace when userspace reads the mmaped buffer, then > > > > the whole application becomes slower. > > > > > > It gets limited by file->f_ra->ra_pages being initialised to > > > bdi->ra_pages and then never changed as the advice for access > > > methods to the file are changed. > > > > > > But the problem here is *not the readahead code*. The problem is > > > that the user has configured the device readahead window to be far > > > smaller than is optimal for the storage. Hence readahead is slow. > > > The fix for that is to either increase the device readahead windows, > > > or to change the specific readahead window for the file that has > > > sequential access patterns. > > > > > > Indeed, we already have that - FADV_SEQUENTIAL will set > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses > > > larger IOs for that access. > > > > > > That's what should happen here - MADV_WILLNEED does not imply a > > > specific access pattern so the application should be running > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED > > > to start the readahead, and then the rest of the on-demand readahead > > > will get the higher readahead limits. > > > > > > > This patch changes 3) to use bdi->io_pages as readahead unit. > > > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the > > > mem.load() implementation in the application converted to use > > > MADV_SEQUENTIAL to properly indicate it's access pattern to the > > > readahead algorithm. > > > > Here the single .ra_pages may not work, that is why this patch stores > > the willneed range in maple tree, please see the following words from > > the original RH report: > > > " > > Increasing read ahead is not an option as it has a mixed I/O workload of > > random I/O and sequential I/O, so that a large read ahead is very counterproductive > > to the random I/O and is unacceptable. > > " > > Yes, I've read the bug. There's no triage that tells us what the > root cause of the application perofrmance issue might be. Just an > assertion that "this is how we did it 10 years ago, it's been > unchanged for all this time, the new kernel we are upgrading > to needs to behave exactly like pre-3.10 era kernels did. > > And to be totally honest, my instincts tell me this is more likely a > problem with a root cause in poor IO scheduling decisions than be a > problem with the page cache readahead implementation. 
Readahead has > been turned down to stop the bandwidth it uses via background async > read IO from starving latency dependent foreground random IO > operation, and then we're being asked to turn readahead back up > in specific situations because it's actually needed for performance > in certain access patterns. This is the sort of thing bfq is > intended to solve. Reading mmaped buffer in userspace is sync IO, and page fault just readahead 64KB. I don't understand how block IO scheduler makes a difference in this single 64KB readahead in case of cache miss. > > > > Also almost all these advises(SEQUENTIA, WILLNEED, NORMAL, RANDOM) > > ignore the passed range, and the behavior becomes all or nothing, > > instead of something only for the specified range, which may not > > match with man, please see 'man posix_fadvise': > > The man page says: > > The advice is not binding; it merely constitutes an > expectation on behalf of the application. > > > It is even worse for readahead() syscall: > > > > ``` DESCRIPTION readahead() initiates readahead on a file > > so that subsequent reads from that file will be satisfied > > from the cache, and not block on disk I/O (assuming the > > readahead was initiated early enough and that other activity > > on the system did not in the meantime flush pages from the > > cache). ``` > > Yes, that's been "broken" for a long time (since the changes to cap > force_page_cache_readahead() to ra_pages way back when), but the > assumption documented about when readahead(2) will work goes to the > heart of why we don't let user controlled readahead actually do much > in the way of direct readahead. i.e. too much readahead is > typically harmful to IO and system performance and very, very few > applications actually need files preloaded entirely into memory. It is true for normal readahead, but not sure if it is for advise(willneed) or readahead(). > > ---- > > All said, I'm starting to think that there isn't an immediate > upstream kernel change needed right now. I just did a quick check > through the madvise() man page to see if I'd missed anything, and I > most definitely did miss what is a relatively new addition to it: > > MADV_POPULATE_READ (since Linux 5.14) > "Populate (prefault) page tables readable, faulting in all > pages in the range just as if manually reading from each page; > however, avoid the actual memory access that would have been > performed after handling the fault. > > In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide > errors, can be applied to (parts of) existing mappings and will > al‐ ways populate (prefault) page tables readable. One > example use case is prefaulting a file mapping, reading all > file content from disk; however, pages won't be dirtied and > consequently won't have to be written back to disk when > evicting the pages from memory. > > That's exactly what the application is apparently wanting > MADV_WILLNEED to do. Indeed, it works as expected(all mmapped pages are load in madvise(MADV_POPULATE_READ)) in my test code except for 16 ra_pages, but it is less important now. Thanks for this idea! > > Please read the commit message for commit 4ca9b3859dac ("mm/madvise: > introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It > has some relevant commentary on why MADV_WILLNEED could not be > modified to meet the pre-population requirements of the applications > that required this pre-population behaviour from the kernel. 
> > With this, I suspect that the application needs to be updated to > use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go > back and do some analysis of the readahead behaviour of the > application and the MADV_POPULATE_READ operation. We may need to > tweak MADV_POPULATE_READ for large readahead IO, but that's OK > because it's no longer "optimistic speculation" about whether the > data is needed in cache - the operation being performed guarantees > that or it fails with an error. IOWs, MADV_POPULATE_READ is > effectively user data IO at this point, not advice about future > access patterns... BTW, in this report, MADV_WILLNEED is used by java library[1], and I guess it could be difficult to update to MADV_POPULATE_READ. [1] https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html Thanks, Ming
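One way to check Ming's observation that MADV_POPULATE_READ brings the whole mapping in even with a tiny readahead window is a mincore() residency count over the mapped range; a rough sketch (test scaffolding, not code from the thread):

```
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns how many pages of [addr, addr+len) are resident, or -1 on error. */
static long resident_pages(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = (len + page - 1) / page;
	unsigned char *vec = malloc(npages);
	long resident = 0;
	size_t i;

	if (!vec || mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;
	free(vec);
	return resident;	/* expect npages after MADV_POPULATE_READ */
}
```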
On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote: > On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote: > > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote: > > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote: > > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > > > Follows the current report: > > > > > > > > > > 1) usersapce call madvise(willneed, 1G) > > > > > > > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > > > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > > > > > > > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > > > > > set as 64KB by userspace when userspace reads the mmaped buffer, then > > > > > the whole application becomes slower. > > > > > > > > It gets limited by file->f_ra->ra_pages being initialised to > > > > bdi->ra_pages and then never changed as the advice for access > > > > methods to the file are changed. > > > > > > > > But the problem here is *not the readahead code*. The problem is > > > > that the user has configured the device readahead window to be far > > > > smaller than is optimal for the storage. Hence readahead is slow. > > > > The fix for that is to either increase the device readahead windows, > > > > or to change the specific readahead window for the file that has > > > > sequential access patterns. > > > > > > > > Indeed, we already have that - FADV_SEQUENTIAL will set > > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses > > > > larger IOs for that access. > > > > > > > > That's what should happen here - MADV_WILLNEED does not imply a > > > > specific access pattern so the application should be running > > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED > > > > to start the readahead, and then the rest of the on-demand readahead > > > > will get the higher readahead limits. > > > > > > > > > This patch changes 3) to use bdi->io_pages as readahead unit. > > > > > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set > > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the > > > > mem.load() implementation in the application converted to use > > > > MADV_SEQUENTIAL to properly indicate it's access pattern to the > > > > readahead algorithm. > > > > > > Here the single .ra_pages may not work, that is why this patch stores > > > the willneed range in maple tree, please see the following words from > > > the original RH report: > > > > > " > > > Increasing read ahead is not an option as it has a mixed I/O workload of > > > random I/O and sequential I/O, so that a large read ahead is very counterproductive > > > to the random I/O and is unacceptable. > > > " > > > > Yes, I've read the bug. There's no triage that tells us what the > > root cause of the application perofrmance issue might be. Just an > > assertion that "this is how we did it 10 years ago, it's been > > unchanged for all this time, the new kernel we are upgrading > > to needs to behave exactly like pre-3.10 era kernels did. > > > > And to be totally honest, my instincts tell me this is more likely a > > problem with a root cause in poor IO scheduling decisions than be a > > problem with the page cache readahead implementation. 
Readahead has > > been turned down to stop the bandwidth it uses via background async > > read IO from starving latency dependent foreground random IO > > operation, and then we're being asked to turn readahead back up > > in specific situations because it's actually needed for performance > > in certain access patterns. This is the sort of thing bfq is > > intended to solve. > > Reading mmaped buffer in userspace is sync IO, and page fault just > readahead 64KB. I don't understand how block IO scheduler makes a > difference in this single 64KB readahead in case of cache miss. I think you've misunderstood what I said. I was refering to the original customer problem of "too much readahead IO causes problems for latency sensitive IO" issue that lead to the customer setting 64kB readahead device limits in the first place. That is, if reducing readahead for sequential IO suddenly makes synchronous random IO perform a whole lot better and the application goes faster, then it indicates the problem is IO dispatch prioritisation, not that there is too much readahead. Deprioritising readahead will educe it's impact on other IO, without having to reduce the readahead windows that provide decent sequential IO perofrmance... I really think the customer needs to retune their application from first principles. Start with the defaults, measure where things are slow, address the worst issue by twiddling knobs. Repeat until performance is either good enough or they hit on actual problems that need code changes. > > > It is even worse for readahead() syscall: > > > > > > ``` DESCRIPTION readahead() initiates readahead on a file > > > so that subsequent reads from that file will be satisfied > > > from the cache, and not block on disk I/O (assuming the > > > readahead was initiated early enough and that other activity > > > on the system did not in the meantime flush pages from the > > > cache). ``` > > > > Yes, that's been "broken" for a long time (since the changes to cap > > force_page_cache_readahead() to ra_pages way back when), but the > > assumption documented about when readahead(2) will work goes to the > > heart of why we don't let user controlled readahead actually do much > > in the way of direct readahead. i.e. too much readahead is > > typically harmful to IO and system performance and very, very few > > applications actually need files preloaded entirely into memory. > > It is true for normal readahead, but not sure if it is for > advise(willneed) or readahead(). If we allowed unbound readahead via WILLNEED or readahead(2), then a user can DOS the storage and/or the memory allocation subsystem very easily. In a previous attempt to revert the current WILLNEED readahead bounding behaviour changes, Linus said this: "It's just that historically we've had some issues with people over-doing readahead (because it often helps some made-up microbenchmark), and then we end up with latency issues when somebody does a multi-gigabyte readahead... Iirc, we had exactly that problem with the readahead() system call at some point (long ago)." https://lore.kernel.org/linux-mm/CA+55aFy8kOomnL-C5GwSpHTn+g5R7dY78C9=h-J_Rb_u=iASpg@mail.gmail.com/ Elsewhere in a different thread for a different patchset to try to revert this readahead behaviour, Linus ranted about how it allowed unbound, unkillable user controlled readahead for 64-bit data lengths. Fundamentally, readahead is not functionality we want to expose directly to user control. 
MADV_POPULATE_* is a different in that it isn't actually readahead - it works more like normal sequential user page fault access. It is interruptable, it can fail due to ENOMEM or OOM-kill, it can fail on IO errors, etc. IOWs, The MADV_POPULATE functions are what the application should be using, not trying to hack WILLNEED to do stuff that MADV_POPULATE* already does in a better way... > > Please read the commit message for commit 4ca9b3859dac ("mm/madvise: > > introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It > > has some relevant commentary on why MADV_WILLNEED could not be > > modified to meet the pre-population requirements of the applications > > that required this pre-population behaviour from the kernel. > > > > With this, I suspect that the application needs to be updated to > > use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go > > back and do some analysis of the readahead behaviour of the > > application and the MADV_POPULATE_READ operation. We may need to > > tweak MADV_POPULATE_READ for large readahead IO, but that's OK > > because it's no longer "optimistic speculation" about whether the > > data is needed in cache - the operation being performed guarantees > > that or it fails with an error. IOWs, MADV_POPULATE_READ is > > effectively user data IO at this point, not advice about future > > access patterns... > > BTW, in this report, MADV_WILLNEED is used by java library[1], and I > guess it could be difficult to update to MADV_POPULATE_READ. Yes, but that's not an upstream kernel code development problem. That's a problem for the people paying $$$$$ to their software vendor to sort out. -Dave.
On 29.01.24 23:46, Mike Snitzer wrote: > On Mon, Jan 29 2024 at 5:12P -0500, > Dave Chinner <david@fromorbit.com> wrote: > >> On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote: >>> While I'm sure this legacy application would love to not have to >>> change its code at all, I think we can all agree that we need to just >>> focus on how best to advise applications that have mixed workloads >>> accomplish efficient mmap+read of both sequential and random. >>> >>> To that end, I heard Dave clearly suggest 2 things: >>> >>> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to >>> bdi->io_pages, not bdi->ra_pages * 2 >>> >>> 2) Have the application first issue MADV_SEQUENTIAL to convey that for >>> the following MADV_WILLNEED is for sequential file load (so it is >>> desirable to use larger ra_pages) >>> >>> This overrides the default of bdi->ra_pages and _should_ provide the >>> required per-file duality of control for readahead, correct? >> >> I just discovered MADV_POPULATE_READ - see my reply to Ming >> up-thread about that. The applicaiton should use that instead of >> MADV_WILLNEED because it gives cache population guarantees that >> WILLNEED doesn't. Then we can look at optimising the performance of >> MADV_POPULATE_READ (if needed) as there is constrained scope we can >> optimise within in ways that we cannot do with WILLNEED. > > Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce > MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David > Hildenbrand just so he's in the loop. Thanks for CCing me. MADV_POPULATE_READ is indeed different; it doesn't give hints (not "might be a good idea to read some pages" like MADV_WILLNEED documents), it forces swapin/read/.../. In a sense, MADV_POPULATE_READ is similar to simply reading one byte from each PTE, triggering page faults. However, without actually reading from the target pages. MADV_POPULATE_READ has a conceptual benefit: we know exactly how much memory user space wants to have populated (which range). In contrast, page faults contain no such hints and we have to guess based on historical behavior. One could use that range information to *not* do any faultaround/readahead when we come via MADV_POPULATE_READ, and really only popoulate the range of interest. Further, one can use that range information to allocate larger folios, without having to guess where placement of a large folio is reasonable, and which size we should use. > > FYI, I proactively raised feedback and questions to the reporter of > this issue: > > CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access, > sequential vs random, just the range that may be accessed. Indeed. The "problem" with MADV_SEQUENTIAL/MADV_RANDOM is that it will fragment/split VMAs. So applying it to smaller chunks (like one would do with MADV_WILLNEED) is likely not a good option.
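For contrast with the description above: the traditional userspace way to force population is to touch one byte per page, which takes the fault and performs a real memory access for every page; MADV_POPULATE_READ populates the same range without those accesses. A small illustrative sketch:

```
#include <unistd.h>

static void touch_every_page(const volatile char *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t off;

	for (off = 0; off < len; off += page) {
		/* The loaded byte is discarded; the read exists only to
		 * fault the page in. */
		(void)addr[off];
	}
}
```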
On Tue, Jan 30, 2024 at 04:29:35PM +1100, Dave Chinner wrote: > On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote: > > On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote: > > > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote: > > > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote: > > > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote: > > > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > > > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > > > > > Follows the current report: > > > > > > > > > > > > 1) usersapce call madvise(willneed, 1G) > > > > > > > > > > > > 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is > > > > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589 > > > > > > > > > > > > 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is > > > > > > set as 64KB by userspace when userspace reads the mmaped buffer, then > > > > > > the whole application becomes slower. > > > > > > > > > > It gets limited by file->f_ra->ra_pages being initialised to > > > > > bdi->ra_pages and then never changed as the advice for access > > > > > methods to the file are changed. > > > > > > > > > > But the problem here is *not the readahead code*. The problem is > > > > > that the user has configured the device readahead window to be far > > > > > smaller than is optimal for the storage. Hence readahead is slow. > > > > > The fix for that is to either increase the device readahead windows, > > > > > or to change the specific readahead window for the file that has > > > > > sequential access patterns. > > > > > > > > > > Indeed, we already have that - FADV_SEQUENTIAL will set > > > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses > > > > > larger IOs for that access. > > > > > > > > > > That's what should happen here - MADV_WILLNEED does not imply a > > > > > specific access pattern so the application should be running > > > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED > > > > > to start the readahead, and then the rest of the on-demand readahead > > > > > will get the higher readahead limits. > > > > > > > > > > > This patch changes 3) to use bdi->io_pages as readahead unit. > > > > > > > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set > > > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the > > > > > mem.load() implementation in the application converted to use > > > > > MADV_SEQUENTIAL to properly indicate it's access pattern to the > > > > > readahead algorithm. > > > > > > > > Here the single .ra_pages may not work, that is why this patch stores > > > > the willneed range in maple tree, please see the following words from > > > > the original RH report: > > > > > > > " > > > > Increasing read ahead is not an option as it has a mixed I/O workload of > > > > random I/O and sequential I/O, so that a large read ahead is very counterproductive > > > > to the random I/O and is unacceptable. > > > > " > > > > > > Yes, I've read the bug. There's no triage that tells us what the > > > root cause of the application perofrmance issue might be. Just an > > > assertion that "this is how we did it 10 years ago, it's been > > > unchanged for all this time, the new kernel we are upgrading > > > to needs to behave exactly like pre-3.10 era kernels did. 
> > > > > > And to be totally honest, my instincts tell me this is more likely a > > > problem with a root cause in poor IO scheduling decisions than be a > > > problem with the page cache readahead implementation. Readahead has > > > been turned down to stop the bandwidth it uses via background async > > > read IO from starving latency dependent foreground random IO > > > operation, and then we're being asked to turn readahead back up > > > in specific situations because it's actually needed for performance > > > in certain access patterns. This is the sort of thing bfq is > > > intended to solve. > > > > Reading mmaped buffer in userspace is sync IO, and page fault just > > readahead 64KB. I don't understand how block IO scheduler makes a > > difference in this single 64KB readahead in case of cache miss. > > I think you've misunderstood what I said. I was refering to the > original customer problem of "too much readahead IO causes problems > for latency sensitive IO" issue that lead to the customer setting > 64kB readahead device limits in the first place. Looks we are not in same page, I never see words of "latency sensitive IO" in this report(RHEL-22476). > > That is, if reducing readahead for sequential IO suddenly makes > synchronous random IO perform a whole lot better and the application > goes faster, then it indicates the problem is IO dispatch > prioritisation, not that there is too much readahead. Deprioritising > readahead will educe it's impact on other IO, without having to > reduce the readahead windows that provide decent sequential IO > perofrmance... > > I really think the customer needs to retune their application from > first principles. Start with the defaults, measure where things are > slow, address the worst issue by twiddling knobs. Repeat until > performance is either good enough or they hit on actual problems > that need code changes. io priority is set in blkcg/process level, we even don't know if the random IO and sequential IO are submitted from different process. Also io priority is only applied when IOs with different priority are submitted concurrently. The main input from the report is that iostat shows that read IO request size is reduced to 64K from 1MB, which isn't something io priority can deal with. Here from my understanding the problem is that advise(ADV_RANDOM, ADV_SEQUENTIAL, ADV_WILLNEED) are basically applied on file level instead of range level, even though range is passed in from these syscalls. So sequential and random advise are actually exclusively on the whole file. That is why the customer don't want to set bigger ra_pages because they are afraid(or have tested) that bigger ra_pages hurts performance of random IO workload because unnecessary data may be readahead. But readahead algorithm is supposed to be capable of dealing with it, maybe still not well enough. But yes, more details are needed for understand the issue further. thanks, Ming
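A userspace illustration of the file-level (rather than range-level) behaviour Ming describes; the path is hypothetical, and the comments reflect the thread's description of how these hints are applied, not a verbatim account of the kernel code:

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/path/to/datafile", O_RDONLY);	/* hypothetical path */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The hint nominally covers only the first 1 MiB... */
	posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_SEQUENTIAL);

	/* ...but per the discussion these hints adjust per-file readahead
	 * state, so this later RANDOM hint for a different 1 MiB range
	 * effectively changes behaviour for the whole file. */
	posix_fadvise(fd, 1LL << 30, 1 << 20, POSIX_FADV_RANDOM);

	close(fd);
	return 0;
}
```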
diff --git a/fs/file_table.c b/fs/file_table.c
index b991f90571b4..bb0303683305 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -61,8 +61,18 @@ struct path *backing_file_user_path(struct file *f)
 }
 EXPORT_SYMBOL_GPL(backing_file_user_path);
 
+static inline void file_ra_free(struct file_ra_state *ra)
+{
+	if (ra->need_mt) {
+		mtree_destroy(ra->need_mt);
+		kfree(ra->need_mt);
+		ra->need_mt = NULL;
+	}
+}
+
 static inline void file_free(struct file *f)
 {
+	file_ra_free(&f->f_ra);
 	security_file_free(f);
 	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
 		percpu_counter_dec(&nr_files);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..bdbd16990072 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -43,6 +43,7 @@
 #include <linux/cred.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/slab.h>
+#include <linux/maple_tree.h>
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
 
@@ -961,6 +962,7 @@ struct fown_struct {
  * @ra_pages: Maximum size of a readahead request, copied from the bdi.
  * @mmap_miss: How many mmap accesses missed in the page cache.
  * @prev_pos: The last byte in the most recent read request.
+ * @need_mt: maple tree for tracking WILL_NEED ranges
  *
  * When this structure is passed to ->readahead(), the "most recent"
  * readahead means the current readahead.
@@ -972,6 +974,7 @@ struct file_ra_state {
 	unsigned int ra_pages;
 	unsigned int mmap_miss;
 	loff_t prev_pos;
+	struct maple_tree *need_mt;
 };
 
 /*
@@ -983,6 +986,18 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 		index < ra->start + ra->size);
 }
 
+/*
+ * Check if @index falls in the madvise/fadvise WILLNEED window.
+ */
+static inline bool ra_index_in_need_range(struct file_ra_state *ra,
+		pgoff_t index)
+{
+	if (ra->need_mt)
+		return mtree_load(ra->need_mt, index) != NULL;
+
+	return false;
+}
+
 /*
  * f_{lock,count,pos_lock} members can be highly contended and share
  * the same cacheline. f_{lock,mode} are very frequently used together
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..0ffe63d58421 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3147,7 +3147,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	 */
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
-	ra->size = ra->ra_pages;
+	if (ra_index_in_need_range(ra, vmf->pgoff))
+		ra->size = inode_to_bdi(mapping->host)->io_pages;
+	else
+		ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ractl._index = ra->start;
 	page_cache_ra_order(&ractl, ra, 0);
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..17bd970ff23c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -120,13 +120,18 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
+void file_ra_add_need_range(struct file_ra_state *ra, pgoff_t start,
+		pgoff_t end);
 void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
 		unsigned int order);
 void force_page_cache_ra(struct readahead_control *, unsigned long nr);
 static inline void force_page_cache_readahead(struct address_space *mapping,
 		struct file *file, pgoff_t index, unsigned long nr_to_read)
 {
-	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, index);
+	struct file_ra_state *ra = &file->f_ra;
+	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
+
+	file_ra_add_need_range(ra, index, index + nr_to_read);
 	force_page_cache_ra(&ractl, nr_to_read);
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 23620c57c122..0882ceecf9ff 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -140,9 +140,38 @@ file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
 	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
 	ra->prev_pos = -1;
+	ra->need_mt = NULL;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
 
+static void file_ra_setup_need_mt(struct file_ra_state *ra)
+{
+	struct maple_tree *mt = kzalloc(sizeof(*mt), GFP_KERNEL);
+
+	if (!mt)
+		return;
+
+	mt_init(mt);
+	if (cmpxchg(&ra->need_mt, NULL, mt) != NULL)
+		kfree(mt);
+}
+
+/* Maintain one willneed range hint for speedup readahead */
+void file_ra_add_need_range(struct file_ra_state *ra, pgoff_t start,
+		pgoff_t end)
+{
+	/* ignore small willneed range */
+	if (end - start < 4 * ra->ra_pages)
+		return;
+
+	if (!ra->need_mt)
+		file_ra_setup_need_mt(ra);
+
+	if (ra->need_mt)
+		mtree_insert_range(ra->need_mt, start, end, (void *)1,
+				GFP_KERNEL);
+}
+
 static void read_pages(struct readahead_control *rac)
 {
 	const struct address_space_operations *aops = rac->mapping->a_ops;
@@ -552,9 +581,10 @@ static void ondemand_readahead(struct readahead_control *ractl,
 {
 	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
 	struct file_ra_state *ra = ractl->ra;
-	unsigned long max_pages = ra->ra_pages;
 	unsigned long add_pages;
 	pgoff_t index = readahead_index(ractl);
+	unsigned long max_pages = ra_index_in_need_range(ra, index) ?
+				bdi->io_pages : ra->ra_pages;
 	pgoff_t expected, prev_index;
 	unsigned int order = folio ? folio_order(folio) : 0;