Message ID | 20230117231632.2734737-1-minchan@kernel.org |
---|---|
State | New |
Headers |
From: Minchan Kim <minchan@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: Suren Baghdasaryan <surenb@google.com>, Matthew Wilcox <willy@infradead.org>, linux-mm <linux-mm@kvack.org>, LKML <linux-kernel@vger.kernel.org>, Michal Hocko <mhocko@suse.com>, SeongJae Park <sj@kernel.org>, Minchan Kim <minchan@kernel.org> Subject: [PATCH 1/3] mm: return the number of pages successfully paged out Date: Tue, 17 Jan 2023 15:16:30 -0800 Message-Id: <20230117231632.2734737-1-minchan@kernel.org> List-ID: <linux-kernel.vger.kernel.org> |
Series |
[1/3] mm: return the number of pages successfully paged out
|
|
Commit Message
Minchan Kim
Jan. 17, 2023, 11:16 p.m. UTC
reclaim_pages, which MADV_PAGEOUT uses, needs to return the number of
pages paged out successfully, not only the number of pages reclaimed
in the operation, because pages that were paged out successfully will be
reclaimed easily under memory pressure thanks to asynchronous writeback
rotation (i.e., PG_reclaim with folio_rotate_reclaimable).

This patch renames reclaim_pages to paging_out (in the hope that
it is clearer from an operational point of view) and adds an additional
stat in reclaim_stat to represent the number of pages paged out but kept
in memory for rotation on writeback completion.

With that stat, madvise_pageout can know how many pages were paged out
successfully as well as reclaimed. The return value will be used for
statistics in the next patch.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 2 +-
include/linux/vmstat.h | 1 +
mm/damon/paddr.c | 2 +-
mm/madvise.c | 4 ++--
mm/vmscan.c | 31 ++++++++++++++++++++++---------
5 files changed, 27 insertions(+), 13 deletions(-)
Comments
I'm all hung up on the naming of everything.

> mm: return the number of pages successfully paged out

This is a vague title - MM is a big place. Perhaps "mm/vmscan: ..."

On Tue, 17 Jan 2023 15:16:30 -0800 Minchan Kim <minchan@kernel.org> wrote:

> The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> pages paged-out successfully, not only the number of reclaimed pages
> in the operation because those pages paged-out successfully will be
> reclaimed easily at the memory pressure due to asynchronous writeback
> rotation(i.e., PG_reclaim with folio_rotate_reclaimable).

So... what does "paged out" actually mean? "writeback to backing
store was initiated"? From an application's point of view it means "no
longer in page tables, needs a fault to get it back", no?

> This patch renames the reclaim_pages with paging_out(with hope that

"page_out" or "pageout" would be better than "paging_out".

> it's clear from operation point of view) and then adds a additional
> stat in reclaim_stat to represent the number of paged-out but kept
> in the memory for rotation on writeback completion.

So it's the number of pages against which we have initiated writeback.
Why not call it "nr_writeback" or similar?

> With that stat, madvise_pageout can know how many pages were paged-out
> successfully as well as reclaimed. The return value will be used for
> statistics in next patch.
>
> ...
>
> -unsigned long reclaim_pages(struct list_head *folio_list)
> +/*
> + * paging_out - reclaim clean pages and write dirty pages into storage
> + * @folio_list: pages for paging out
> + *
> + * paging_out() writes dirty pages to backing storage and/or reclaim
> + * clean pages from memory. Returns the number of written/reclaimed pages.

s/reclaim/reclaims/

"and/or" is vague - just "or", I think.

"written/reclaimed" is vague. "number of reclaimed pages plus the
number of pages against which writeback was initiated" is precise.

On Tue, Jan 17, 2023 at 03:53:12PM -0800, Andrew Morton wrote:
>
> I'm all hung up on the naming of everything.
>
> > mm: return the number of pages successfully paged out
>
> This is a vague title - MM is a big place. Perhaps "mm/vmscan: ..."
>
> On Tue, 17 Jan 2023 15:16:30 -0800 Minchan Kim <minchan@kernel.org> wrote:
>
> > The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> > pages paged-out successfully, not only the number of reclaimed pages
> > in the operation because those pages paged-out successfully will be
> > reclaimed easily at the memory pressure due to asynchronous writeback
> > rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
>
> So... what does "paged out" actually mean? "writeback to backing
> store was initiated"? From an application's point of view it means "no
> longer in page tables needs a fault to get it back", no?

Yes, both are correct in my view, since pageout is initiated after
unmapping the page from the page table, and I think that is better wording
for the description. Let me use that explanation in the description
in the next spin. Thanks.

>
> > This patch renames the reclaim_pages with paging_out(with hope that
>
> "page_out" or "pageout" would be better than "paging_out".

pageout is already taken in vmscan.c, so I will use page_out unless
someone suggests a better name.

>
> > it's clear from operation point of view) and then adds a additional
> > stat in reclaim_stat to represent the number of paged-out but kept
> > in the memory for rotation on writeback completion.
>
> So it's the number of pages against which we have initiated writeback.
> Why not call it "nr_writeback" or similar?

Currently, nr_writeback is used to indicate how many times shrink_folio_list
encountered PG_writeback in the page list.

TL;DR: I need to distinguish synchronous writeback from asynchronous
writeback.

Actually, I wanted to use nr_pageout, but it would cause double counting
from madvise_pageout's PoV, depending on the backing storage's speed.
(For example, madvise_pageout tried 32 pages, and 12 pages were swapped out
very quickly, so those 12 pages were reclaimed in the shrink_folio_list
context and it returns 12. But the other 20 pages were swapped out slowly
because the device was congested, so they were rotated back when the write
was done. In that case, madvise_pageout wants to get 32 as the result of
successful paging out rather than only 12 pages.)

Maybe nr_pageout_async?

>
> > With that stat, madvise_pageout can know how many pages were paged-out
> > successfully as well as reclaimed. The return value will be used for
> > statistics in next patch.
> >
> > ...
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +/*
> > + * paging_out - reclaim clean pages and write dirty pages into storage
> > + * @folio_list: pages for paging out
> > + *
> > + * paging_out() writes dirty pages to backing storage and/or reclaim
> > + * clean pages from memory. Returns the number of written/reclaimed pages.
>
> s/reclaim/reclaims/
>
> "and/or" it vague - just "or", I think.

Since the page list could have immediately reclaimable pages (A) when the
backing storage is fast enough, just-written pages (B) when the backing
storage has slowed down, or even A + B if the device becomes congested in
the middle of the operation, I think "and/or" is right.

>
> "written/reclaimed" is vague. "number of reclaimed pages plus the
> number of pages against which writeback was initiated" is precise.

Sure, let me correct it.

Thanks, Andrew.

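The 32-page example above can be written out as a small accounting sketch. This is only an illustration of the arithmetic being argued for, not kernel code: `enum page_outcome`, `struct pass_counts`, `count_pass()` and `nr_paged_out()` are hypothetical names, and only the field name `nr_pageout_keep` is borrowed from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page outcome of one reclaim pass; not a kernel type. */
enum page_outcome {
	OUTCOME_RECLAIMED,	/* write finished in time, page freed in the pass */
	OUTCOME_WRITE_KEPT,	/* write started, page kept for rotation later */
	OUTCOME_FAILED,		/* neither written nor freed */
};

struct pass_counts {
	unsigned int nr_reclaimed;	/* what the pass reports today */
	unsigned int nr_pageout_keep;	/* the additional stat proposed here */
};

/* Tally the outcomes of a pass over n pages. */
static struct pass_counts count_pass(const enum page_outcome *pages, size_t n)
{
	struct pass_counts c = { 0, 0 };
	size_t i;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (pages[i] == OUTCOME_RECLAIMED)
			c.nr_reclaimed++;
		else if (pages[i] == OUTCOME_WRITE_KEPT)
			c.nr_pageout_keep++;
	}
	return c;
}

/* Pages the caller would count as successfully paged out. */
static unsigned int nr_paged_out(struct pass_counts c)
{
	return c.nr_reclaimed + c.nr_pageout_keep;
}
```

With 12 pages marked `OUTCOME_RECLAIMED` and 20 marked `OUTCOME_WRITE_KEPT`, `nr_paged_out()` gives 32 while `nr_reclaimed` alone gives 12, which is the gap the extra stat is meant to close.
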
On Tue, Jan 17, 2023 at 04:35:00PM -0800, Minchan Kim wrote:
> Yes, both are correct in my view since pageout is initiated after
> unmapping the page from page table and think that's better wording
> to be in description. Let me use the explanation in the description
> at next spin. Thanks.

For the next spin, you'll want to do it against mm-unstable as
deactivate_page() is now folio_deactivate().

On Wed, Jan 18, 2023 at 12:58:09AM +0000, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 04:35:00PM -0800, Minchan Kim wrote:
> > Yes, both are correct in my view since pageout is initiated after
> > unmapping the page from page table and think that's better wording
> > to be in description. Let me use the explanation in the description
> > at next spin. Thanks.
>
> For the next spin, you'll want to do it against mm-unstable as
> deactivate_page() is now folio_deactivate().

I was curious which branch I should use as the baseline for creating a
patch, since I have seen multiple mm branches recently.

Thanks for the hint. Sure, will do.

On Tue 17-01-23 15:16:30, Minchan Kim wrote:
> The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> pages paged-out successfully, not only the number of reclaimed pages
> in the operation because those pages paged-out successfully will be
> reclaimed easily at the memory pressure due to asynchronous writeback
> rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
>
> This patch renames the reclaim_pages with paging_out(with hope that
> it's clear from operation point of view) and then adds a additional
> stat in reclaim_stat to represent the number of paged-out but kept
> in the memory for rotation on writeback completion.
>
> With that stat, madvise_pageout can know how many pages were paged-out
> successfully as well as reclaimed. The return value will be used for
> statistics in next patch.

I really fail to see the reason for the rename, and paging_out doesn't
even make much sense as a name TBH.

On Wed, Jan 18, 2023 at 10:10:44AM +0100, Michal Hocko wrote:
> On Tue 17-01-23 15:16:30, Minchan Kim wrote:
> > The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> > pages paged-out successfully, not only the number of reclaimed pages
> > in the operation because those pages paged-out successfully will be
> > reclaimed easily at the memory pressure due to asynchronous writeback
> > rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
> >
> > This patch renames the reclaim_pages with paging_out(with hope that
> > it's clear from operation point of view) and then adds a additional
> > stat in reclaim_stat to represent the number of paged-out but kept
> > in the memory for rotation on writeback completion.
> >
> > With that stat, madvise_pageout can know how many pages were paged-out
> > successfully as well as reclaimed. The return value will be used for
> > statistics in next patch.
>
> I really fail to see the reson for the rename and paging_out doesn't
> even make much sense as a name TBH.

Currently, what we are doing to reclaim memory is

reclaim_folio_list
    shrink_folio_list
        if (folio_mapped(folio))
            try_to_unmap(folio)

        if (folio_test_dirty(folio))
            pageout

Based on this structure, pageout is just one of the ways to reclaim memory.

With MADV_PAGEOUT, what the user wants to know is how many pages
were paged out as requested (from the userspace PoV, how many page
faults will happen on future accesses), not the number of reclaimed
pages that shrink_folio_list currently returns.

In that sense, I wanted to distinguish between reclaim and pageout.

On Wed 18-01-23 09:09:36, Minchan Kim wrote:
> On Wed, Jan 18, 2023 at 10:10:44AM +0100, Michal Hocko wrote:
> > On Tue 17-01-23 15:16:30, Minchan Kim wrote:
> > > The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> > > pages paged-out successfully, not only the number of reclaimed pages
> > > in the operation because those pages paged-out successfully will be
> > > reclaimed easily at the memory pressure due to asynchronous writeback
> > > rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
> > >
> > > This patch renames the reclaim_pages with paging_out(with hope that
> > > it's clear from operation point of view) and then adds a additional
> > > stat in reclaim_stat to represent the number of paged-out but kept
> > > in the memory for rotation on writeback completion.
> > >
> > > With that stat, madvise_pageout can know how many pages were paged-out
> > > successfully as well as reclaimed. The return value will be used for
> > > statistics in next patch.
> >
> > I really fail to see the reson for the rename and paging_out doesn't
> > even make much sense as a name TBH.
>
> Currently, what we are doing to reclaim memory is
>
> reclaim_folio_list
>     shrink_folio_list
>         if (folio_mapped(folio))
>             try_to_unmap(folio)
>
>         if (folio_test_dirty(folio))
>             pageout
>
> Based on the structure, pageout is just one of way to reclaim memory.
>
> With MADV_PAGEOUT, what user want to know how many pages
> were paged out as they requested(from userspace PoV, how many times
> pages fault happens in future accesses), not the number of reclaimed
> pages shrink_folio_list returns currently.
>
> In the sense, I wanted to distinguish between reclaim and pageout.

But MADV_PAGEOUT is documented to trigger memory reclaim in general,
not a pageout. Let me quote from the man page
: Reclaim a given range of pages. This is done to free up memory occupied
: by these pages.

Sure, anonymous pages can be paged out to the swap storage, but with the
upcoming multi-tiering they can also be "paged out" to a lower tier. All
that leads to freeing up memory that is currently mapped by that address
range.

Anyway, what do you actually mean by distinguishing between reclaim and
pageout? Aren't those just two names for the same thing?

On Wed, Jan 18, 2023 at 06:35:32PM +0100, Michal Hocko wrote:
> On Wed 18-01-23 09:09:36, Minchan Kim wrote:
> > On Wed, Jan 18, 2023 at 10:10:44AM +0100, Michal Hocko wrote:
> > > On Tue 17-01-23 15:16:30, Minchan Kim wrote:
> > > > The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> > > > pages paged-out successfully, not only the number of reclaimed pages
> > > > in the operation because those pages paged-out successfully will be
> > > > reclaimed easily at the memory pressure due to asynchronous writeback
> > > > rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
> > > >
> > > > This patch renames the reclaim_pages with paging_out(with hope that
> > > > it's clear from operation point of view) and then adds a additional
> > > > stat in reclaim_stat to represent the number of paged-out but kept
> > > > in the memory for rotation on writeback completion.
> > > >
> > > > With that stat, madvise_pageout can know how many pages were paged-out
> > > > successfully as well as reclaimed. The return value will be used for
> > > > statistics in next patch.
> > >
> > > I really fail to see the reson for the rename and paging_out doesn't
> > > even make much sense as a name TBH.
> >
> > Currently, what we are doing to reclaim memory is
> >
> > reclaim_folio_list
> >     shrink_folio_list
> >         if (folio_mapped(folio))
> >             try_to_unmap(folio)
> >
> >         if (folio_test_dirty(folio))
> >             pageout
> >
> > Based on the structure, pageout is just one of way to reclaim memory.
> >
> > With MADV_PAGEOUT, what user want to know how many pages
> > were paged out as they requested(from userspace PoV, how many times
> > pages fault happens in future accesses), not the number of reclaimed
> > pages shrink_folio_list returns currently.
> >
> > In the sense, I wanted to distinguish between reclaim and pageout.
>
> But MADV_PAGEOUT is documented to trigger memory reclaim in general
> not a pageout. Let me quote from the man page
> : Reclaim a given range of pages. This is done to free up memory occupied
> : by these pages.

IMO, we need to change the documentation to something like this.

: Try to reclaim a given range of pages. The reclaim unmaps the
pages from the address space and then writes them out to backing
storage. It could help to free up memory occupied by these pages
or improve memory reclaim efficiency.

>
> Sure anonymous pages can be paged out to the swap storage but with the
> upcomming multi-tiering it can be also "paged out" to a lower tier. All
> that leads to freeing up memory that is currently mapped by that address
> range.

I am not familiar with multi-tiering. However, the point is whether the
pageout operation is synchronous or not. If it's synchronous (IOW, when the
pageout returns, the page has really been written to the storage), yes,
it can reclaim memory. If the backing storage is an asynchronous device
(which is the *majority* these days), we cannot reclaim the memory; we have
just written the page to the storage in the hope that it speeds up reclaim
at the next iteration.

>
> Anyway, what do you actually meen by distinguishing between reclaim and
> pageout. Aren't those just two names for the same thing?

Reclaim really frees memory, but pageout is just one of the ways
to achieve that freeing, and it is not guaranteed, depending on the
backing storage's speed.

On Wed 18-01-23 10:07:17, Minchan Kim wrote:
> On Wed, Jan 18, 2023 at 06:35:32PM +0100, Michal Hocko wrote:
> > On Wed 18-01-23 09:09:36, Minchan Kim wrote:
> > > On Wed, Jan 18, 2023 at 10:10:44AM +0100, Michal Hocko wrote:
> > > > On Tue 17-01-23 15:16:30, Minchan Kim wrote:
> > > > > The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> > > > > pages paged-out successfully, not only the number of reclaimed pages
> > > > > in the operation because those pages paged-out successfully will be
> > > > > reclaimed easily at the memory pressure due to asynchronous writeback
> > > > > rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
> > > > >
> > > > > This patch renames the reclaim_pages with paging_out(with hope that
> > > > > it's clear from operation point of view) and then adds a additional
> > > > > stat in reclaim_stat to represent the number of paged-out but kept
> > > > > in the memory for rotation on writeback completion.
> > > > >
> > > > > With that stat, madvise_pageout can know how many pages were paged-out
> > > > > successfully as well as reclaimed. The return value will be used for
> > > > > statistics in next patch.
> > > >
> > > > I really fail to see the reson for the rename and paging_out doesn't
> > > > even make much sense as a name TBH.
> > >
> > > Currently, what we are doing to reclaim memory is
> > >
> > > reclaim_folio_list
> > >     shrink_folio_list
> > >         if (folio_mapped(folio))
> > >             try_to_unmap(folio)
> > >
> > >         if (folio_test_dirty(folio))
> > >             pageout
> > >
> > > Based on the structure, pageout is just one of way to reclaim memory.
> > >
> > > With MADV_PAGEOUT, what user want to know how many pages
> > > were paged out as they requested(from userspace PoV, how many times
> > > pages fault happens in future accesses), not the number of reclaimed
> > > pages shrink_folio_list returns currently.
> > >
> > > In the sense, I wanted to distinguish between reclaim and pageout.
> >
> > But MADV_PAGEOUT is documented to trigger memory reclaim in general
> > not a pageout. Let me quote from the man page
> > : Reclaim a given range of pages. This is done to free up memory occupied
> > : by these pages.
>
> IMO, we need to change the documentation something like this.
>
> : Try to reclaim a given range of pages. The reclaim carries on the
> unmap pages from address space and then write them out to backing
> storage. It could help to free up memory occupied by these pages
> or improve memory reclaim efficiency.

But this is not what the implementation does, nor should it be specific
about what reclaim actually can do. The specific implementation of the
reclaim is an implementation detail.

> > Sure anonymous pages can be paged out to the swap storage but with the
> > upcomming multi-tiering it can be also "paged out" to a lower tier. All
> > that leads to freeing up memory that is currently mapped by that address
> > range.
>
> I am not familiar with multi-tiering. However, thing is the operation
> of pageout is synchronous or not. If it's synchronous(IOW, when the
> pageout returns, the page was really written to the storage), yes,
> it can reclaim memory. If the backing storage is asynchrnous device
> (which is *major* these days), we cannot reclaim the memory but just
> wrote the page to the storage with hope it could help reclaim speed
> at next iteration of reclaim.

I am sorry but I do not follow. Synchronicity of the reclaim should be
completely irrelevant. Even swapout (pageout from your POV AFAIU) can be
async or sync.

> > Anyway, what do you actually meen by distinguishing between reclaim and
> > pageout. Aren't those just two names for the same thing?
>
> reclaim is realy memory freeing but pageout is just one of the way
> to achieve the memory freeing, which is not guaranteed depending on
> backing storage's speed.

Try to think about it some more. Do you really want MADV_PAGEOUT to
be so specific about how the memory reclaim is achieved? How do you
reflect new ways of reclaiming memory - e.g. memory demotion, where the
primary memory gets freed by migrating the content to a slower type of
memory rather than writing it out to ultra slow swap storage (which is just
yet another tier that cannot be accessed directly without an explicit
IO)?

On Wed, Jan 18, 2023 at 10:23:02PM +0100, Michal Hocko wrote:
> On Wed 18-01-23 10:07:17, Minchan Kim wrote:
> > On Wed, Jan 18, 2023 at 06:35:32PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-23 09:09:36, Minchan Kim wrote:
> > > > On Wed, Jan 18, 2023 at 10:10:44AM +0100, Michal Hocko wrote:
> > > > > On Tue 17-01-23 15:16:30, Minchan Kim wrote:
> > > > > > The reclaim_pages MADV_PAGEOUT uses needs to return the number of
> > > > > > pages paged-out successfully, not only the number of reclaimed pages
> > > > > > in the operation because those pages paged-out successfully will be
> > > > > > reclaimed easily at the memory pressure due to asynchronous writeback
> > > > > > rotation(i.e., PG_reclaim with folio_rotate_reclaimable).
> > > > > >
> > > > > > This patch renames the reclaim_pages with paging_out(with hope that
> > > > > > it's clear from operation point of view) and then adds a additional
> > > > > > stat in reclaim_stat to represent the number of paged-out but kept
> > > > > > in the memory for rotation on writeback completion.
> > > > > >
> > > > > > With that stat, madvise_pageout can know how many pages were paged-out
> > > > > > successfully as well as reclaimed. The return value will be used for
> > > > > > statistics in next patch.
> > > > >
> > > > > I really fail to see the reson for the rename and paging_out doesn't
> > > > > even make much sense as a name TBH.
> > > >
> > > > Currently, what we are doing to reclaim memory is
> > > >
> > > > reclaim_folio_list
> > > >     shrink_folio_list
> > > >         if (folio_mapped(folio))
> > > >             try_to_unmap(folio)
> > > >
> > > >         if (folio_test_dirty(folio))
> > > >             pageout
> > > >
> > > > Based on the structure, pageout is just one of way to reclaim memory.
> > > >
> > > > With MADV_PAGEOUT, what user want to know how many pages
> > > > were paged out as they requested(from userspace PoV, how many times
> > > > pages fault happens in future accesses), not the number of reclaimed
> > > > pages shrink_folio_list returns currently.
> > > >
> > > > In the sense, I wanted to distinguish between reclaim and pageout.
> > >
> > > But MADV_PAGEOUT is documented to trigger memory reclaim in general
> > > not a pageout. Let me quote from the man page
> > > : Reclaim a given range of pages. This is done to free up memory occupied
> > > : by these pages.
> >
> > IMO, we need to change the documentation something like this.
> >
> > : Try to reclaim a given range of pages. The reclaim carries on the
> > unmap pages from address space and then write them out to backing
> > storage. It could help to free up memory occupied by these pages
> > or improve memory reclaim efficiency.
>
> But this is not what the implementation does nor should it be specific
> about what reclaim actual can do. The specific implementation of the
> reclaim is an implementation detail.
>
> > > Sure anonymous pages can be paged out to the swap storage but with the
> > > upcomming multi-tiering it can be also "paged out" to a lower tier. All
> > > that leads to freeing up memory that is currently mapped by that address
> > > range.
> >
> > I am not familiar with multi-tiering. However, thing is the operation
> > of pageout is synchronous or not. If it's synchronous(IOW, when the
> > pageout returns, the page was really written to the storage), yes,
> > it can reclaim memory. If the backing storage is asynchrnous device
> > (which is *major* these days), we cannot reclaim the memory but just
> > wrote the page to the storage with hope it could help reclaim speed
> > at next iteration of reclaim.
>
> I am sorry but I do not follow. Synchronicity of the reclaim should be
> completely irrelevant. Even swapout (pageout from your POV AFAIU) can be
> async or sync.
>
> > > Anyway, what do you actually meen by distinguishing between reclaim and
> > > pageout. Aren't those just two names for the same thing?
> >
> > reclaim is realy memory freeing but pageout is just one of the way
> > to achieve the memory freeing, which is not guaranteed depending on
> > backing storage's speed.
>
> Try to think about it some more. Do you really want the MADV_PAGEOUT to
> be so specific about how the memory reclaim is achieved? How do you
> reflect new ways of reclaiming memory - e.g. memory demotion when the
> primary memory gets freed by migrating the content to a slower type of
> memory yet not write it out to ultra slow swap storage (which is just
> yet another tier that cannot be accessed directly without an explicit
> IO)?

I understand your concern now, and I believe a better implementation would
account for the number of virtual addresses scanned and the number of pages
*unmapped from the page table*, so we don't need to worry about which type
of paging out happens (e.g., writing to slower storage or demoting to
a lower tier; in the end, userspace will see the paging in anyway).

"Unmapped the page from page table and demotes the page to secondary
device. User would see page fault when the next access happen"

If you agree with that, then yes, I don't need to change anything in
vmscan.c. Instead, I could do everything in madvise.c.

Let me know if you have any other concerns or suggestions.

Thanks, Michal.

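For readers following the "userspace PoV" argument, here is a minimal userspace sketch of what a caller can observe around MADV_PAGEOUT. It uses only `madvise(2)`, `mincore(2)` and `sysconf(3)`; the helper names `count_resident()` and `try_pageout()` and the `MAX_PAGES` cap are assumptions made for illustration. Residency reported by `mincore()` is only a rough proxy for "unmapped from the page table", and on a system without swap anonymous pages cannot be paged out at all, so no particular outcome is claimed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* Linux uapi value, for libc headers that predate it */
#endif

#define MAX_PAGES 64	/* arbitrary cap for this sketch */

/*
 * Count the pages in [addr, addr + len) that mincore() reports as resident.
 * Returns -1 if the range is too large for this sketch or mincore() fails.
 */
static long count_resident(void *addr, size_t len)
{
	unsigned char vec[MAX_PAGES];
	long page_size = sysconf(_SC_PAGESIZE);
	size_t npages, i;
	long resident = 0;

	assert(addr != NULL && len > 0);
	if (page_size <= 0)
		return -1;
	npages = (len + (size_t)page_size - 1) / (size_t)page_size;
	if (npages > MAX_PAGES || mincore(addr, len, vec) != 0)
		return -1;
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;	/* bit 0 set means resident */
	return resident;
}

/* Ask the kernel to reclaim the range; returns 0 or a negative errno. */
static int try_pageout(void *addr, size_t len)
{
	return madvise(addr, len, MADV_PAGEOUT) == 0 ? 0 : -errno;
}
```

A caller would map and touch a few pages, call `count_resident()`, then `try_pageout()`, then `count_resident()` again. The difference is what that process can see for itself, independent of how the kernel achieved it (swap, demotion, or nothing at all).
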
On Wed 18-01-23 14:27:23, Minchan Kim wrote:
[...]
> Let me know if you have other concern or suggestion.
I would propose using a tracepoint to track this on the madvise side.
This way you can track both per-process effectiveness and per-originator
effectiveness of madvise (if the policy is implemented by a global monitor,
then its numbers won't be disturbed by other users of this
interface). Global counters can do neither of those.
On Thu, Jan 19, 2023 at 10:07:23AM +0100, Michal Hocko wrote:
> On Wed 18-01-23 14:27:23, Minchan Kim wrote:
> [...]
> > Let me know if you have other concern or suggestion.
>
> I would propose to use a tracepoint to track this on the madvise side.
> This way you can both track a per-process effectivity as well a madvise
> originator effectivity (if the policy is implemented by a global monitor
> then it won't get interfering activity by other users of this
> interface). Global counters cannot do neither of that.

I don't think a tracepoint is the right approach for this purpose.

I understand we could get the same result from a tracepoint with bpf or
something similar: whenever the event happens, a daemon gets the result and
accumulates the number, so we end up with exactly the same result as a
global stat. Yeah, technically it's doable. But by that argument, there is
nothing we could not do with tracepoints. Look at the existing vmstat
fields: why do we have them in vmstat instead of tracepoints? TPs are much
easier and more flexible, but with vmstat we can easily get a ballpark
figure across the fleet to sense what's going on, and once we find something
weird, we can turn on tracing to get the details, which is where TPs work
well.

With process control using process_madvise in a centrally controlled
system, I think those two stats are really worth capturing, along with
other memory reclaim statistics, for memory health.

If we need per-process-level tracking (actually, not in our case),
we could add the tracepoint later.

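To make the "ballpark from vmstat" workflow concrete, here is a small sketch of reading one counter out of text laid out like /proc/vmstat (one "name value" pair per line). `vmstat_lookup()` is a hypothetical helper, and `pswpout` is used only because it is an existing vmstat field; the counters discussed in this series are not named here, so none are assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like /proc/vmstat.
 * Returns 0 and stores the value on success, -1 if the name is absent
 * or its value cannot be parsed.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *value)
{
	size_t name_len;
	const char *line = text;

	assert(text != NULL && name != NULL && value != NULL);
	name_len = strlen(name);

	while (line && *line) {
		/* Match the whole field name, not just a prefix. */
		if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ') {
			if (sscanf(line + name_len + 1, "%llu", value) == 1)
				return 0;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next record */
	}
	return -1;
}
```

A fleet-side collector could take two snapshots of the file and subtract the values to get a rate, then enable tracing only on hosts where the rate looks odd.
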
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a18cf4b7c724..0ada46b595cd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -435,7 +435,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
-extern unsigned long reclaim_pages(struct list_head *page_list);
+extern unsigned int paging_out(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..cda903a8fa6e 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -28,6 +28,7 @@ struct reclaim_stat {
 	unsigned nr_writeback;
 	unsigned nr_immediate;
 	unsigned nr_pageout;
+	unsigned nr_pageout_keep;
 	unsigned nr_activate[ANON_AND_FILE];
 	unsigned nr_ref_keep;
 	unsigned nr_unmap_fail;
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index e1a4315c4be6..be2a731d3459 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -226,7 +226,7 @@ static unsigned long damon_pa_pageout(struct damon_region *r)
 			put_page(page);
 		}
 	}
-	applied = reclaim_pages(&page_list);
+	applied = paging_out(&page_list);
 	cond_resched();
 	return applied * PAGE_SIZE;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index c7105ec6d08c..a4a03054ab6b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -400,7 +400,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 huge_unlock:
 		spin_unlock(ptl);
 		if (pageout)
-			reclaim_pages(&page_list);
+			paging_out(&page_list);
 		return 0;
 	}
 
@@ -491,7 +491,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	if (pageout)
-		reclaim_pages(&page_list);
+		paging_out(&page_list);
 	cond_resched();
 
 	return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..579a7ebbe24a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1933,6 +1933,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 				goto activate_locked;
 			case PAGE_SUCCESS:
 				stat->nr_pageout += nr_pages;
+				stat->nr_pageout_keep += nr_pages;
 
 				if (folio_test_writeback(folio))
 					goto keep;
@@ -1948,6 +1949,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 				if (folio_test_dirty(folio) ||
 				    folio_test_writeback(folio))
 					goto keep_locked;
+
+				stat->nr_pageout_keep -= nr_pages;
 				mapping = folio_mapping(folio);
 				fallthrough;
 			case PAGE_CLEAN:
@@ -2646,9 +2649,9 @@ static void shrink_active_list(unsigned long nr_to_scan,
 }
 
 static unsigned int reclaim_folio_list(struct list_head *folio_list,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct reclaim_stat *stat)
 {
-	struct reclaim_stat dummy_stat;
 	unsigned int nr_reclaimed;
 	struct folio *folio;
 	struct scan_control sc = {
@@ -2659,7 +2662,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
 		.no_demotion = 1,
 	};
 
-	nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
+	nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, stat, false);
 	while (!list_empty(folio_list)) {
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
@@ -2669,15 +2672,23 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
 	return nr_reclaimed;
 }
 
-unsigned long reclaim_pages(struct list_head *folio_list)
+/*
+ * paging_out - reclaim clean pages and write dirty pages into storage
+ * @folio_list: pages for paging out
+ *
+ * paging_out() writes dirty pages to backing storage and/or reclaim
+ * clean pages from memory. Returns the number of written/reclaimed pages.
+ */
+unsigned int paging_out(struct list_head *folio_list)
 {
 	int nid;
-	unsigned int nr_reclaimed = 0;
+	unsigned int nr_pageout = 0;
 	LIST_HEAD(node_folio_list);
 	unsigned int noreclaim_flag;
+	struct reclaim_stat stat;
 
 	if (list_empty(folio_list))
-		return nr_reclaimed;
+		return nr_pageout;
 
 	noreclaim_flag = memalloc_noreclaim_save();
 
@@ -2691,15 +2702,17 @@ unsigned long reclaim_pages(struct list_head *folio_list)
 			continue;
 		}
 
-		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
+		nr_pageout += reclaim_folio_list(&node_folio_list, NODE_DATA(nid), &stat);
+		nr_pageout += stat.nr_pageout_keep;
 		nid = folio_nid(lru_to_folio(folio_list));
 	} while (!list_empty(folio_list));
 
-	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
+	nr_pageout += reclaim_folio_list(&node_folio_list, NODE_DATA(nid), &stat);
+	nr_pageout += stat.nr_pageout_keep;
 
 	memalloc_noreclaim_restore(noreclaim_flag);
 
-	return nr_reclaimed;
+	return nr_pageout;
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
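
As a reading aid, here is a rough userspace model of the accounting this patch introduces: a folio that was written out but is still under writeback is kept rather than freed, yet still counted in the return value through nr_pageout_keep. Everything prefixed with `toy_` is a hypothetical stand-in invented for this sketch (the real code works on struct folio, struct reclaim_stat and the LRU lists), so treat it as an approximation of the logic, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified folio states; these are not kernel types. */
enum toy_state {
	TOY_CLEAN,		/* can be freed immediately */
	TOY_DIRTY_SYNC,		/* written out, writeback already finished */
	TOY_DIRTY_ASYNC,	/* written out, still under writeback */
};

struct toy_folio {
	enum toy_state state;
	unsigned int nr_pages;
};

struct toy_stat {
	unsigned int nr_pageout;	/* pages submitted for writeout */
	unsigned int nr_pageout_keep;	/* written out but not freed */
};

/* Models the patched PAGE_SUCCESS path; returns the pages freed. */
static unsigned int toy_shrink(const struct toy_folio *f, size_t n,
			       struct toy_stat *stat)
{
	unsigned int nr_reclaimed = 0;
	size_t i;

	assert(f && stat);
	stat->nr_pageout = 0;
	stat->nr_pageout_keep = 0;

	for (i = 0; i < n; i++) {
		if (f[i].state != TOY_CLEAN) {
			stat->nr_pageout += f[i].nr_pages;
			stat->nr_pageout_keep += f[i].nr_pages;
			if (f[i].state == TOY_DIRTY_ASYNC)
				continue;	/* like "goto keep" */
			/* Writeback finished, so the folio gets freed. */
			stat->nr_pageout_keep -= f[i].nr_pages;
		}
		nr_reclaimed += f[i].nr_pages;
	}
	return nr_reclaimed;
}

/* Models paging_out(): freed pages plus written-but-kept pages. */
static unsigned int toy_paging_out(const struct toy_folio *f, size_t n)
{
	struct toy_stat stat;
	unsigned int nr_pageout = toy_shrink(f, n, &stat);

	return nr_pageout + stat.nr_pageout_keep;
}
```

With one clean page, one synchronously written page and a 4-page folio still under writeback, the model frees 2 pages but reports 6 paged out, which is the difference from the old reclaim_pages() return value that the rename is meant to reflect.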