Message ID | 20240201100835.1626685-3-liushixin2@huawei.com |
---|---|
State | New |
Headers |
From: Liu Shixin <liushixin2@huawei.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>, Matthew Wilcox <willy@infradead.org>, Andrew Morton <akpm@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Liu Shixin <liushixin2@huawei.com>
Subject: [PATCH 2/2] mm/readahead: limit sync readahead while too many active refault
Date: Thu, 1 Feb 2024 18:08:35 +0800
Message-ID: <20240201100835.1626685-3-liushixin2@huawei.com>
In-Reply-To: <20240201100835.1626685-1-liushixin2@huawei.com>
References: <20240201100835.1626685-1-liushixin2@huawei.com>
Series | [1/2] mm/readahead: stop readahead loop if memcg charge fails |
Commit Message
Liu Shixin
Feb. 1, 2024, 10:08 a.m. UTC
When the page fault is not for write and the refault distance is close,
the page will be activated directly. If there are too many such pages in
a file, it means these pages may be reclaimed immediately. In such a
situation read-ahead has no positive effect and only wastes IO. So count
such pages and, when the count gets too large, stop bothering with
read-ahead for a while until the count decreases on its own.

Define 'too large' as 10000, chosen empirically; this solves the problem
and is not affected by occasional active refaults.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
include/linux/fs.h | 2 ++
include/linux/pagemap.h | 1 +
mm/filemap.c | 16 ++++++++++++++++
mm/readahead.c | 4 ++++
4 files changed, 23 insertions(+)
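
The mechanism the commit message describes, reduced to a minimal user-space sketch (the constant and the counter name mirror the patch below; the struct and helper functions are illustrative only, not the actual kernel code):

#include <stdbool.h>

#define ACTIVE_REFAULT_LIMIT 10000	/* "too large", as chosen in the patch */

/* Stand-in for the per-file readahead state the patch extends. */
struct ra_state_sketch {
	unsigned int active_refault;	/* active-page refaults seen during readahead */
};

/* Called for every folio brought in by readahead. */
static void account_folio(struct ra_state_sketch *ra, bool folio_is_workingset)
{
	if (folio_is_workingset)
		ra->active_refault++;	/* page was evicted while still active and is refaulting */
	else if (ra->active_refault)
		ra->active_refault--;	/* decay the counter on ordinary folios */
}

/* Called at fault time: is sync read-around still worth the IO? */
static bool readahead_worthwhile(const struct ra_state_sketch *ra)
{
	return ra->active_refault < ACTIVE_REFAULT_LIMIT;
}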
Comments
On Thu 01-02-24 18:08:35, Liu Shixin wrote:
> When the page fault is not for write and the refault distance is close,
> the page will be activated directly. If there are too many such pages in
> a file, it means these pages may be reclaimed immediately. In such a
> situation read-ahead has no positive effect and only wastes IO. So count
> such pages and, when the count gets too large, stop bothering with
> read-ahead for a while until the count decreases on its own.
>
> Define 'too large' as 10000, chosen empirically; this solves the problem
> and is not affected by occasional active refaults.
>
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>

So I'm not convinced this new logic is needed. We already have
ra->mmap_miss which gets incremented when a page fault has to read the
page (and decremented when a page fault found the page already in cache).
This should already work to detect thrashing as well, shouldn't it? If it
does not, why?

								Honza
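
For readers unfamiliar with the heuristic Jan refers to, this is roughly how the existing mmap_miss accounting is meant to behave (a simplified sketch, not the exact kernel code; MMAP_LOTSAMISS is defined as 100 in mm/filemap.c, as the diff below shows):

#include <limits.h>
#include <stdbool.h>

#define MMAP_LOTSAMISS 100

struct ra_sketch {
	unsigned int mmap_miss;
};

/* A synchronous fault had to go to disk for the page. */
static bool miss_and_check_read_around(struct ra_sketch *ra)
{
	if (ra->mmap_miss < UINT_MAX)
		ra->mmap_miss++;
	/* Too many recent misses: access looks random (or the file is
	 * thrashing), so skip read-around and fetch only the faulting page. */
	return ra->mmap_miss <= MMAP_LOTSAMISS;
}

/* A fault found the page already in the page cache. */
static void hit_cached_page(struct ra_sketch *ra)
{
	if (ra->mmap_miss > 0)
		ra->mmap_miss--;
}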
On 2024/2/1 17:37, Jan Kara wrote:
> So I'm not convinced this new logic is needed. We already have
> ra->mmap_miss which gets incremented when a page fault has to read the
> page (and decremented when a page fault found the page already in cache).
> This should already work to detect thrashing as well, shouldn't it? If it
> does not, why?
>
> 								Honza

ra->mmap_miss doesn't help: it is increased only once in
do_sync_mmap_readahead() and then decreased by one for every page in
filemap_map_pages(). So in this scenario it can't exceed MMAP_LOTSAMISS.

Thanks,
On Thu 01-02-24 18:41:30, Liu Shixin wrote:
> On 2024/2/1 17:37, Jan Kara wrote:
> > So I'm not convinced this new logic is needed. We already have
> > ra->mmap_miss which gets incremented when a page fault has to read the
> > page (and decremented when a page fault found the page already in cache).
> > This should already work to detect thrashing as well, shouldn't it? If it
> > does not, why?
>
> ra->mmap_miss doesn't help: it is increased only once in
> do_sync_mmap_readahead() and then decreased by one for every page in
> filemap_map_pages(). So in this scenario it can't exceed MMAP_LOTSAMISS.

I see, OK. But that's a (longstanding) bug in how mmap_miss is handled. Can
you please test whether the attached patches fix the thrashing for you? At
least now I can see mmap_miss properly increments when we are hitting
uncached pages... Thanks!

								Honza
On 2024/2/2 1:31, Jan Kara wrote:
> I see, OK. But that's a (longstanding) bug in how mmap_miss is handled. Can
> you please test whether the attached patches fix the thrashing for you? At
> least now I can see mmap_miss properly increments when we are hitting
> uncached pages... Thanks!
>
> 								Honza

Thanks for the patch, I will test it.
On 2024/2/2 1:31, Jan Kara wrote:
> I see, OK. But that's a (longstanding) bug in how mmap_miss is handled. Can
> you please test whether the attached patches fix the thrashing for you? At
> least now I can see mmap_miss properly increments when we are hitting
> uncached pages... Thanks!
>
> 								Honza

The patch doesn't seem to have much effect. I will try to analyze why it
doesn't work. The attached files are my testcase.

Thanks,

#!/bin/bash
while true; do
	flag=$(ps -ef | grep -v grep | grep alloc_page | wc -l)
	if [ "$flag" -eq 0 ]; then
		/alloc_page &
	fi
	sleep 30
	start_time=$(date +%s)
	yum install -y expect > /dev/null 2>&1
	end_time=$(date +%s)
	elapsed_time=$((end_time - start_time))
	echo "$elapsed_time seconds"
	yum remove -y expect > /dev/null 2>&1
done

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define SIZE 1*1024*1024 //1M

int main()
{
	void *ptr = NULL;
	int i;

	for (i = 0; i < 1024 * 6 - 50; i++) {
		ptr = (void *)malloc(SIZE);
		if (ptr == NULL) {
			printf("malloc err!");
			return -1;
		}
		memset(ptr, 0, SIZE);
	}
	sleep(99999);
	free(ptr);
	return 0;
}
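
A plausible way to build and drive the attached reproducer (the file names and the roughly 6 GiB machine sizing are assumptions; the mail only carries the two sources):

# Assumed build/run steps, not part of the original mail.
gcc -O2 -o /alloc_page alloc_page.c    # memory hog the script restarts as /alloc_page
bash thrash_test.sh                    # times repeated `yum install expect` runs under memory pressure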
On 2024/2/2 17:02, Liu Shixin wrote:
> On 2024/2/2 1:31, Jan Kara wrote:
> > I see, OK. But that's a (longstanding) bug in how mmap_miss is handled.
> > Can you please test whether the attached patches fix the thrashing for
> > you? At least now I can see mmap_miss properly increments when we are
> > hitting uncached pages... Thanks!
>
> The patch doesn't seem to have much effect. I will try to analyze why it
> doesn't work. The attached files are my testcase.

I think I figured out why mmap_miss doesn't work. After
do_sync_mmap_readahead() there is a __filemap_get_folio() to make sure the
page is ready, so the page is also already present in filemap_map_pages()
and mmap_miss is decreased once. mmap_miss goes back to 0 and can't stop
read-ahead. Overall, I don't think mmap_miss can solve this problem.
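
Liu's explanation can be checked with a toy model of the bookkeeping (hypothetical user-space code, not kernel code; the assumption that each read-around maps 32 pages back in is only illustrative):

#include <stdio.h>

int main(void)
{
	unsigned int mmap_miss = 0;
	const unsigned int mapped_back = 32;	/* assumed pages found cached after each read-around */

	for (int fault = 0; fault < 1000; fault++) {
		mmap_miss++;			/* do_sync_mmap_readahead(): the faulting page was a miss */
		for (unsigned int i = 0; i < mapped_back && mmap_miss > 0; i++)
			mmap_miss--;		/* filemap_map_pages(): those pages are present now */
	}
	/* Each increment is drained right away, so the counter never reaches
	 * MMAP_LOTSAMISS (100) and read-around is never disabled. */
	printf("mmap_miss after 1000 thrashing faults: %u\n", mmap_miss);
	return 0;
}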
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a704951..f2a1825442f5a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -960,6 +960,7 @@ struct fown_struct {
  * the first of these pages is accessed.
  * @ra_pages: Maximum size of a readahead request, copied from the bdi.
  * @mmap_miss: How many mmap accesses missed in the page cache.
+ * @active_refault: Number of active page refault.
  * @prev_pos: The last byte in the most recent read request.
  *
  * When this structure is passed to ->readahead(), the "most recent"
@@ -971,6 +972,7 @@ struct file_ra_state {
 	unsigned int async_size;
 	unsigned int ra_pages;
 	unsigned int mmap_miss;
+	unsigned int active_refault;
 	loff_t prev_pos;
 };
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d2..da9eaf985dec4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1256,6 +1256,7 @@ struct readahead_control {
 	pgoff_t _index;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
+	unsigned int _active_refault;
 	bool _workingset;
 	unsigned long _pflags;
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db7..4de80592ab270 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3037,6 +3037,7 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
+#define ACTIVE_REFAULT_LIMIT  (10000)
 /*
  * lock_folio_maybe_drop_mmap - lock the page, possibly dropping the mmap_lock
  * @vmf - the vm_fault for this fault.
@@ -3142,6 +3143,18 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	if (mmap_miss > MMAP_LOTSAMISS)
 		return fpin;
 
+	ractl._active_refault = READ_ONCE(ra->active_refault);
+	if (ractl._active_refault)
+		WRITE_ONCE(ra->active_refault, --ractl._active_refault);
+
+	/*
+	 * If there are a lot of refault of active pages in this file,
+	 * that means the memory reclaim is ongoing. Stop bothering with
+	 * read-ahead since it will only waste IO.
+	 */
+	if (ractl._active_refault >= ACTIVE_REFAULT_LIMIT)
+		return fpin;
+
 	/*
 	 * mmap read-around
 	 */
@@ -3151,6 +3164,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	ra->async_size = ra->ra_pages / 4;
 	ractl._index = ra->start;
 	page_cache_ra_order(&ractl, ra, 0);
+
+	WRITE_ONCE(ra->active_refault, ractl._active_refault);
+
 	return fpin;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index cc4abb67eb223..d79bb70a232c4 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -263,6 +263,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages++;
+		if (unlikely(folio_test_workingset(folio)))
+			ractl->_active_refault++;
+		else if (unlikely(ractl->_active_refault))
+			ractl->_active_refault--;
 	}
 
 	/*