Message ID | 20240202022029.1903629-1-ming.lei@redhat.com |
---|---|
State | New |
Headers | show |
Series | mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ |
Commit Message
Ming Lei
Feb. 2, 2024, 2:20 a.m. UTC
madvise(MADV_POPULATE_READ) tries to populate all page tables in the
specified range, so the IO is usually sequential when the VMA is backed
by a file.

Set ra_pages to the device max request size for the readahead involved
in MADV_POPULATE_READ; this reduces the latency of
madvise(MADV_POPULATE_READ) to 1/10 when running it over one 1GB file
with the usual (default) 128KB read_ahead_kb.
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
mm/madvise.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 51 insertions(+), 1 deletion(-)
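For readers outside the thread, here is a minimal userspace sketch of the workload this patch targets. The file name is illustrative (not from the patch), and on older glibc the MADV_POPULATE_READ constant may need <linux/mman.h>:

/* Populate page tables for a 1GB file-backed mapping in one call.
 * Requires Linux >= 5.14 for MADV_POPULATE_READ. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1UL << 30;          /* 1GB, as in the commit log */
	int fd = open("testfile", O_RDONLY);   /* hypothetical test file */
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One call faults in the whole range; with the default 128KB
	 * read_ahead_kb this is where the reported latency goes. */
	if (madvise(p, len, MADV_POPULATE_READ))
		perror("madvise");

	munmap(p, len);
	close(fd);
	return 0;
}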
Comments
On Fri, Feb 02, 2024 at 10:20:29AM +0800, Ming Lei wrote:
> +static struct file *madvise_override_ra_win(struct file *f,
> +		unsigned long start, unsigned long end,
> +		unsigned int *old_ra_pages)
> +{
> +	unsigned int io_pages;
> +
> +	if (!f || !f->f_mapping || !f->f_mapping->host)
> +		return NULL;

How can ->f_mapping be NULL? How can f_mapping->host be NULL?
On Thu, Feb 01 2024 at 9:20P -0500, Ming Lei <ming.lei@redhat.com> wrote:

> madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> specified range, so the IO is usually sequential when the VMA is backed
> by a file.
>
> Set ra_pages to the device max request size for the readahead involved
> in MADV_POPULATE_READ; this reduces the latency of
> madvise(MADV_POPULATE_READ) to 1/10 when running it over one 1GB file
> with the usual (default) 128KB read_ahead_kb.
>
> [...]
>
> +static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
> +{
> +	if (*file) {
> +		struct file *f = *file;
> +
> +		f->f_ra.ra_pages = ra_pages;
> +		fput(f);
> +		*file = NULL;
> +	}
> +}
> +
> +static struct file *madvise_override_ra_win(struct file *f,
> +		unsigned long start, unsigned long end,
> +		unsigned int *old_ra_pages)
> +{
> +	unsigned int io_pages;
> +
> +	if (!f || !f->f_mapping || !f->f_mapping->host)
> +		return NULL;
> +
> +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> +		return NULL;
> +
> +	f = get_file(f);
> +	*old_ra_pages = f->f_ra.ra_pages;
> +	f->f_ra.ra_pages = io_pages;
> +
> +	return f;
> +}
> +

Does this override imply that madvise_populate resorts to calling
filemap_fault() and here you're just arming it to use the larger
->io_pages for the duration of all associated faulting?

Wouldn't it be better to avoid faulting and build up larger page
vectors that get sent down to the block layer in one go and let the
block layer split using the device's limits? (like happens with
force_page_cache_ra)

I'm concerned that madvise_populate isn't so efficient with filemap
due to excessive faulting (*BUT* I haven't traced to know, I'm just
inferring that is why twiddling f->f_ra.ra_pages helps improve
madvise_populate by having it issue larger IO. Apologies if I'm way
off base)

Mike
On Fri, Feb 02, 2024 at 04:15:39AM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 10:20:29AM +0800, Ming Lei wrote:
> > +static struct file *madvise_override_ra_win(struct file *f,
> > +		unsigned long start, unsigned long end,
> > +		unsigned int *old_ra_pages)
> > +{
> > +	unsigned int io_pages;
> > +
> > +	if (!f || !f->f_mapping || !f->f_mapping->host)
> > +		return NULL;
>
> How can ->f_mapping be NULL? How can f_mapping->host be NULL?

You are right, the two checks can be removed: neither can be NULL for an
opened file, and .f_ra is initialized from f->f_mapping->host->i_mapping
directly too. I will drop the checks in the next version.

BTW, it looks like the same check in madvise_remove() can be removed too.

Thanks,
Ming
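For clarity, this is roughly what the helper would look like with the redundant checks dropped, as agreed above. The v2 patch itself is not part of this thread, so this is an illustrative sketch, not the posted code:

/* Hypothetical v2 of the helper, per the review above -- illustrative,
 * not the actual v2 posting. */
static struct file *madvise_override_ra_win(struct file *f,
		unsigned long start, unsigned long end,
		unsigned int *old_ra_pages)
{
	unsigned int io_pages;

	/* Anonymous VMAs have no backing file; nothing to tune. */
	if (!f)
		return NULL;

	/* ->f_mapping and ->host are never NULL for an opened file. */
	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
	if (((end - start) >> PAGE_SHIFT) < io_pages)
		return NULL;

	f = get_file(f);
	*old_ra_pages = f->f_ra.ra_pages;
	f->f_ra.ra_pages = io_pages;

	return f;
}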
On Thu, Feb 01, 2024 at 11:43:11PM -0500, Mike Snitzer wrote:
> On Thu, Feb 01 2024 at 9:20P -0500, Ming Lei <ming.lei@redhat.com> wrote:
>
> [...]
>
> Does this override imply that madvise_populate resorts to calling
> filemap_fault() and here you're just arming it to use the larger
> ->io_pages for the duration of all associated faulting?

Yes.

> Wouldn't it be better to avoid faulting and build up larger page

How can we avoid the fault handling? It is needed to build the VA->PA
mapping.

> vectors that get sent down to the block layer in one go and let the

filemap_fault() already tries to allocate folios in big sizes (the max
order is MAX_PAGECACHE_ORDER), see page_cache_ra_order() and
ra_alloc_folio().

> block layer split using the device's limits? (like happens with
> force_page_cache_ra)

Here the filemap code won't deal with the block layer directly because
VFS & FS sit in between and io mapping is required; it just calls
aops->readahead() or aops->read_folio(), but a block plug and
readahead_control are applied for handling everything in batch.

> I'm concerned that madvise_populate isn't so efficient with filemap

That is why this patch increases the readahead window, so that
madvise_populate() performance can be improved by 10x for a big
file-backed populate read.

> due to excessive faulting (*BUT* I haven't traced to know, I'm just
> inferring that is why twiddling f->f_ra.ra_pages helps improve
> madvise_populate by having it issue larger IO. Apologies if I'm way
> off base)

As mentioned, fault handling can't be avoided, but we can improve the
involved readahead IO perf.

Thanks,
Ming
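To make the window sizes behind the 10x claim concrete, here is the arithmetic; the device max request size used below is an assumed value for illustration, not a figure from the thread:

#include <stdio.h>

/* With 4KB pages, the default read_ahead_kb of 128 yields a 32-page
 * readahead window; a device max request size of e.g. 1280KB (assumed
 * here) would let the patch raise f_ra.ra_pages to 320 pages, i.e.
 * 10x larger readahead batches per IO. */
int main(void)
{
	const unsigned int page_kb = 4;
	const unsigned int ra_kb = 128;       /* default read_ahead_kb */
	const unsigned int max_req_kb = 1280; /* hypothetical device limit */

	printf("default ra_pages: %u\n", ra_kb / page_kb);      /* 32 */
	printf("io_pages:         %u\n", max_req_kb / page_kb); /* 320 */
	return 0;
}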
On Fri, Feb 02 2024 at 5:52P -0500, Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Feb 01, 2024 at 11:43:11PM -0500, Mike Snitzer wrote:
> > On Thu, Feb 01 2024 at 9:20P -0500, Ming Lei <ming.lei@redhat.com> wrote:
> >
> > [...]
> >
> > Does this override imply that madvise_populate resorts to calling
> > filemap_fault() and here you're just arming it to use the larger
> > ->io_pages for the duration of all associated faulting?
>
> Yes.
>
> > Wouldn't it be better to avoid faulting and build up larger page
>
> How can we avoid the fault handling? It is needed to build the VA->PA
> mapping.

I was wondering if it made sense to add fadvise_populate -- but given my
lack of experience with MM I then get handwavvy quick -- I have more work
ahead to round out my MM understanding so that I'm more informed.

> > vectors that get sent down to the block layer in one go and let the
>
> filemap_fault() already tries to allocate folios in big sizes (the max
> order is MAX_PAGECACHE_ORDER), see page_cache_ra_order() and
> ra_alloc_folio().
>
> > block layer split using the device's limits? (like happens with
> > force_page_cache_ra)
>
> Here the filemap code won't deal with the block layer directly because
> VFS & FS sit in between and io mapping is required; it just calls
> aops->readahead() or aops->read_folio(), but a block plug and
> readahead_control are applied for handling everything in batch.
>
> > I'm concerned that madvise_populate isn't so efficient with filemap
>
> That is why this patch increases the readahead window, so that
> madvise_populate() performance can be improved by 10x for a big
> file-backed populate read.

Right, as you know I've tested your patch; the larger readahead window
certainly did provide the much more desirable performance. I'll reply to
your v2 (with reduced negative checks) with my Reviewed-by and Tested-by.

I was just wondering if there is an opportunity to plumb in a more
specific (and potentially better) fadvise_populate for dealing with
file-backed pages.

> > due to excessive faulting (*BUT* I haven't traced to know, I'm just
> > inferring that is why twiddling f->f_ra.ra_pages helps improve
> > madvise_populate by having it issue larger IO. Apologies if I'm way
> > off base)
>
> As mentioned, fault handling can't be avoided, but we can improve the
> involved readahead IO perf.

Thanks, and sorry for asking such a naive question (put more pressure on
you to educate than I should have).

Mike
On Fri, Feb 02, 2024 at 10:20:29AM +0800, Ming Lei wrote:
> madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> specified range, so the IO is usually sequential when the VMA is backed
> by a file.
>
> Set ra_pages to the device max request size for the readahead involved
> in MADV_POPULATE_READ; this reduces the latency of
> madvise(MADV_POPULATE_READ) to 1/10 when running it over one 1GB file
> with the usual (default) 128KB read_ahead_kb.
>
> [...]
>
> +static struct file *madvise_override_ra_win(struct file *f,
> +		unsigned long start, unsigned long end,
> +		unsigned int *old_ra_pages)
> +{
> +	unsigned int io_pages;
> +
> +	if (!f || !f->f_mapping || !f->f_mapping->host)
> +		return NULL;
> +
> +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> +		return NULL;
> +
> +	f = get_file(f);
> +	*old_ra_pages = f->f_ra.ra_pages;
> +	f->f_ra.ra_pages = io_pages;
> +
> +	return f;
> +}

This won't do what you think if the file has been marked FMODE_RANDOM
before this populate call.

IOWs, I don't think madvise should be digging in the struct file
readahead stuff here. It should call vfs_fadvise(FADV_SEQUENTIAL) to
set the readahead mode, rather than try to duplicate FADV_SEQUENTIAL
(badly). We already do this for WILLNEED to make it do the right thing;
we should be doing the same thing here.

Also, AFAICT, there is no need for get_file()/fput() here - the vma
already has a reference to the struct file, and the vma should not
be going away whilst the madvise() operation is in progress.

-Dave.
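A rough sketch of the alternative Dave suggests, modeled on how madvise_willneed() uses the fadvise plumbing; the helper name and offset math here are illustrative, not from any posted patch:

/* Hypothetical sketch: set sequential readahead mode via vfs_fadvise()
 * instead of writing f_ra.ra_pages directly, mirroring what
 * madvise_willneed() does with POSIX_FADV_WILLNEED. */
static int madvise_populate_set_seq_ra(struct vm_area_struct *vma,
		unsigned long start, unsigned long end)
{
	loff_t offset = (loff_t)(start - vma->vm_start)
			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

	/* FADV_SEQUENTIAL clears FMODE_RANDOM and doubles the readahead
	 * window; note Ming's follow-up below: it never reaches
	 * bdi->io_pages. */
	return vfs_fadvise(vma->vm_file, offset, end - start,
			   POSIX_FADV_SEQUENTIAL);
}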
On Mon, Feb 05, 2024 at 10:34:47AM +1100, Dave Chinner wrote:
> On Fri, Feb 02, 2024 at 10:20:29AM +0800, Ming Lei wrote:
> [...]
>
> This won't do what you think if the file has been marked FMODE_RANDOM
> before this populate call.

Yeah. But madvise(POPULATE_READ) is actually a single action, so
userspace can call fadvise(POSIX_FADV_NORMAL) or
fadvise(POSIX_FADV_SEQUENTIAL) before madvise(POPULATE_READ) and set the
RANDOM advice back after madvise(POPULATE_READ) returns, so it does not
look like a big issue in reality.

> IOWs, I don't think madvise should be digging in the struct file
> readahead stuff here. It should call vfs_fadvise(FADV_SEQUENTIAL) to
> set the readahead mode, rather than try to duplicate FADV_SEQUENTIAL
> (badly). We already do this for WILLNEED to make it do the right thing;
> we should be doing the same thing here.

FADV_SEQUENTIAL doubles the current readahead window, which is far from
enough to get top performance; the latency with a doubled (default) ra
window is still 2x that of setting the ra window to bdi->io_pages. If
the application sets a small 'bdi/read_ahead_kb', as in this report, the
gap can be very big.

Or can we add one API/helper in the fs code to set the file readahead
ra_pages for this use case?

> Also, AFAICT, there is no need for get_file()/fput() here - the vma
> already has a reference to the struct file, and the vma should not
> be going away whilst the madvise() operation is in progress.

You are right, get_file() is only needed in case of dropping the mm
lock.

Thanks,
Ming
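The userspace workaround Ming describes would look roughly like this; fd, addr, and len are assumed to come from the surrounding program:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of the sequence described above: switch the file to
 * sequential readahead around the populate call, then restore the
 * random-access hint. Error handling omitted for brevity. */
static void populate_with_seq_ra(int fd, void *addr, size_t len)
{
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	madvise(addr, len, MADV_POPULATE_READ);
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}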
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..db5452c8abdd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -900,6 +900,37 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 
+static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
+{
+	if (*file) {
+		struct file *f = *file;
+
+		f->f_ra.ra_pages = ra_pages;
+		fput(f);
+		*file = NULL;
+	}
+}
+
+static struct file *madvise_override_ra_win(struct file *f,
+		unsigned long start, unsigned long end,
+		unsigned int *old_ra_pages)
+{
+	unsigned int io_pages;
+
+	if (!f || !f->f_mapping || !f->f_mapping->host)
+		return NULL;
+
+	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
+	if (((end - start) >> PAGE_SHIFT) < io_pages)
+		return NULL;
+
+	f = get_file(f);
+	*old_ra_pages = f->f_ra.ra_pages;
+	f->f_ra.ra_pages = io_pages;
+
+	return f;
+}
+
 static long madvise_populate(struct vm_area_struct *vma,
 			     struct vm_area_struct **prev,
 			     unsigned long start, unsigned long end,
@@ -908,9 +939,21 @@ static long madvise_populate(struct vm_area_struct *vma,
 	const bool write = behavior == MADV_POPULATE_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long tmp_end;
+	unsigned int ra_pages;
+	struct file *file;
 	int locked = 1;
 	long pages;
 
+	/*
+	 * In case of file backing mapping, increase readahead window
+	 * for reducing the whole populate latency, and restore it
+	 * after the populate is done
+	 */
+	if (behavior == MADV_POPULATE_READ)
+		file = madvise_override_ra_win(vma->vm_file, start, end,
+				&ra_pages);
+	else
+		file = NULL;
+
 	*prev = vma;
 	while (start < end) {
@@ -920,8 +963,10 @@ static long madvise_populate(struct vm_area_struct *vma,
 		 */
 		if (!vma || start >= vma->vm_end) {
 			vma = vma_lookup(mm, start);
-			if (!vma)
+			if (!vma) {
+				madvise_restore_ra_win(&file, ra_pages);
 				return -ENOMEM;
+			}
 		}
 
 		tmp_end = min_t(unsigned long, end, vma->vm_end);
@@ -935,6 +980,9 @@ static long madvise_populate(struct vm_area_struct *vma,
 			vma = NULL;
 		}
 		if (pages < 0) {
+			/* restore ra pages back in case of any failure */
+			madvise_restore_ra_win(&file, ra_pages);
+
 			switch (pages) {
 			case -EINTR:
 				return -EINTR;
@@ -954,6 +1002,8 @@ static long madvise_populate(struct vm_area_struct *vma,
 		}
 		start += pages * PAGE_SIZE;
 	}
+
+	madvise_restore_ra_win(&file, ra_pages);
 	return 0;
 }