[v11,0/3] cachestat: a new syscall for page cache state of files

Message ID	20230308032748.609510-1-nphamcs@gmail.com
Headers
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Nhat Pham <nphamcs@gmail.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org, bfoster@redhat.com,
        willy@infradead.org, arnd@arndb.de, linux-api@vger.kernel.org,
        kernel-team@meta.com
Subject: [PATCH v11 0/3] cachestat: a new syscall for page cache state of
 files
Date: Tue,  7 Mar 2023 19:27:45 -0800
Message-Id: <20230308032748.609510-1-nphamcs@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
Series	cachestat: a new syscall for page cache state of files
  [v11,0/3] cachestat: a new syscall for page cache state of files
  [v11,1/3] workingset: refactor LRU refault to expose refault recency check
  [v11,2/3] cachestat: implement cachestat syscall
  [v11,3/3] selftests: Add selftests for cachestat

Message

Nhat Pham March 8, 2023, 3:27 a.m. UTC

  Changelog:
v11:
  * Clean up code and comments/documentation.
    (patch 1 and 2) (suggested by Matthew Wilcox)
  * Drop support for hugetlbfs (patch 2)
    (from discussion with Johannes Weiner and Matthew Wilcox).
v10:
  * Reorder the arguments for archs with alignment requirements.
    (patch 2) (suggested by Arnd Bergmann)
v9:
  * Remove the syscall from all architectures' syscall tables except x86
    (patch 2)
  * API change: handle different cases for offset and add compat syscall.
    (patch 2) (suggested by Johannes Weiner and Arnd Bergmann)
v8:
  * Add syscall to mips syscall tables (detected by kernel test robot)
    (patch 2)
  * Add a missing return (suggested by Yu Zhao) (patch 2)
v7:
  * Fix and use lru_gen_test_recent (suggested by Brian Foster)
    (patch 2)
  * Small formatting and organizational fixes
v6:
  * Add a missing fdput() (suggested by Brian Foster) (patch 2)
  * Replace cstat_size with cstat_version (suggested by Brian Foster)
    (patch 2)
  * Add conditional resched to the xas walk. (suggested by Hillf Danton)
    (patch 2)
v5:
  * Separate first patch into its own series.
    (suggested by Andrew Morton)
  * Expose filemap_cachestat() to non-syscall usage
    (patch 2) (suggested by Brian Foster).
  * Fix some build errors from last version.
    (patch 2)
  * Explain eviction and recent eviction in the draft man page and
    documentation (suggested by Andrew Morton).
    (patch 2)
v4:
  * Refactor cachestat and move it to mm/filemap.c (patch 3)
    (suggested by Brian Foster)
  * Remove redundant checks (!folio, access_ok)
    (patch 3) (suggested by Matthew Wilcox and Al Viro)
  * Fix a bug in handling multi-page folios.
    (patch 3) (suggested by Matthew Wilcox)
  * Add a selftest for shmem files, which can be used to test huge
    pages (patch 4) (suggested by Johannes Weiner)
v3:
  * Fix some minor formatting issues and build errors.
  * Add the new syscall entry to missing architecture syscall tables.
    (patch 3).
  * Add flags argument for the syscall. (patch 3).
  * Clean up the recency refactoring (patch 2) (suggested by Yu Zhao)
  * Add the new Kconfig (CONFIG_CACHESTAT) to disable the syscall.
    (patch 3) (suggested by Josh Triplett)
v2:
  * len == 0 means query to EOF. len < 0 is invalid.
    (patch 3) (suggested by Brian Foster)
  * Make cachestat extensible by adding the `cstat_size` argument in the
    syscall (patch 3)

There is currently no good way to query the page cache state of large
file sets and directory trees. There is mincore(), but it scales poorly:
the kernel writes out a lot of bitmap data that userspace has to
aggregate, when the user really does not care about per-page information
in that case. The user also needs to mmap and unmap each file as it goes
along, which can be quite slow as well.
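
To make that cost concrete, here is a minimal sketch of the mincore()
approach: map the file, have the kernel fill in one status byte per page,
then add the bytes up in userspace. The helper name
resident_pages_mincore() is made up for illustration and is not part of
this series.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the resident page cache pages of a file
 * using mmap() + mincore(). Returns the count, or -1 on error.
 */
long resident_pages_mincore(const char *path)
{
	struct stat st;
	unsigned char *vec = NULL;
	void *map = MAP_FAILED;
	size_t len = 0, npages, i;
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	long resident = -1;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
		goto out;
	if (st.st_size == 0) {
		/* mmap() rejects zero-length mappings; nothing is cached. */
		resident = 0;
		goto out;
	}

	len = (size_t)st.st_size;
	npages = (len + page_size - 1) / page_size;

	/* The file has to be mapped just to be able to ask about it. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* One byte per page, all of it written out by the kernel. */
	vec = malloc(npages);
	if (!vec)
		goto out;

	if (mincore(map, len, vec) == 0) {
		resident = 0;
		/* Bit 0 of each byte says whether that page is resident. */
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}
out:
	free(vec);
	if (map != MAP_FAILED)
		munmap(map, len);
	close(fd);
	return resident;
}
```

Assuming 4 KiB pages, the vector for a 2 TiB file is 2^29 bytes (512 MiB),
all of which is reduced to a single number at the end.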

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty
pages, pages marked for writeback, evicted pages etc.) of a file, in a
specified range of bytes. It also includes a selftest suite that tests some
typical usage. Currently, the syscall is wired up only for the x86
architecture.
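
For illustration, a caller might look roughly like the sketch below. The
cover letter does not spell out the v11 signature, so this sketch assumes
the interface as it was later merged in Linux 6.5: a (fd, range, result,
flags) call with syscall number 451, where len == 0 means "to EOF". The
struct and function names here are local stand-ins, not the uapi names,
and the argument layout in v11 itself may differ.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumption: number used since Linux 6.5 */
#endif

/* Local stand-ins mirroring the layout assumed for the merged uapi. */
struct cstat_range_sketch {
	uint64_t off;	/* start of the byte range */
	uint64_t len;	/* length in bytes; 0 means "to end of file" */
};

struct cstat_sketch {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Hypothetical wrapper. Returns 0 on success, or -1 with errno set
 * (ENOSYS on kernels that do not provide the syscall).
 */
int query_cachestat(int fd, uint64_t off, uint64_t len,
		    struct cstat_sketch *out)
{
	struct cstat_range_sketch range = { .off = off, .len = len };
	long ret = syscall(__NR_cachestat, fd, &range, out, 0u);

	return ret == 0 ? 0 : -1;
}
```

No mapping and no per-page vector are involved: the caller passes an fd
and a byte range and gets back five counters.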

This interface is inspired by past discussion of and concerns with fincore,
which has a design similar to mincore's (and, as a result, similar issues).
Relevant links:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html

For comparison with mincore, I ran both syscalls on a 2TB sparse file:

Using mincore:
real    0m37.510s
user    0m2.934s
sys     0m34.558s

Using cachestat:
real    0m0.009s
user    0m0.000s
sys     0m0.009s

This series should be applied on top of:

workingset: fix confusion around eviction vs refault container
https://lkml.org/lkml/2023/1/4/1066

This series consists of 3 patches:

Nhat Pham (3):
  workingset: refactor LRU refault to expose refault recency check
  cachestat: implement cachestat syscall
  selftests: Add selftests for cachestat

 MAINTAINERS                                   |   7 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 include/linux/compat.h                        |   5 +-
 include/linux/swap.h                          |   1 +
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/mman.h                     |   9 +
 init/Kconfig                                  |  10 +
 kernel/sys_ni.c                               |   1 +
 mm/filemap.c                                  | 166 +++++++++++
 mm/workingset.c                               | 145 ++++++----
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/cachestat/.gitignore  |   2 +
 tools/testing/selftests/cachestat/Makefile    |   8 +
 .../selftests/cachestat/test_cachestat.c      | 257 ++++++++++++++++++
 16 files changed, 574 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/cachestat/.gitignore
 create mode 100644 tools/testing/selftests/cachestat/Makefile
 create mode 100644 tools/testing/selftests/cachestat/test_cachestat.c


base-commit: 1440f576022887004f719883acb094e7e0dd4944
prerequisite-patch-id: 171a43d333e1b267ce14188a5beaea2f313787fb
--
2.39.2

Comments

Andrew Morton March 14, 2023, 11 p.m. UTC | #1

On Tue,  7 Mar 2023 19:27:45 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> There is currently no good way to query the page cache state of large
> file sets and directory trees. There is mincore(), but it scales poorly:
> the kernel writes out a lot of bitmap data that userspace has to
> aggregate, when the user really does not care about per-page information
> in that case. The user also needs to mmap and unmap each file as it goes
> along, which can be quite slow as well.

A while ago I asked about the security implications - could cachestat()
be used to figure out what parts of a file another user is reading. 
This also applies to mincore(), but cachestat() newly permits user A to
work out which parts of a file user B has *written* to.

I don't recall seeing a response to this, and there is no discussion in
the changelogs.

Secondly, I'm not seeing description of any use cases.  OK, it's faster
and better than mincore(), but who cares?  In other words, what
end-user value compels us to add this feature to Linux?

>    struct cachestat {
>	        __u64 nr_cache;
>	        __u64 nr_dirty;
>	        __u64 nr_writeback;
>	        __u64 nr_evicted;
>	        __u64 nr_recently_evicted;
>    };

And these fields are really getting into the weedy details of internal
kernel implementation.  Bear in mind that we must support this API for
ever.

Particularly the "evicted" things.  The workingset code was implemented
eight years ago, which is actually relatively recent.  It could be that
eight years from now workingset will have been removed and possibly
replaced with something else.  Then what do we do?

For these reasons, and because of the lack of enthusiasm I have seen
from others, I don't think a case has yet been made for the addition of
this new syscall.

Johannes Weiner March 15, 2023, 5:09 p.m. UTC | #2

Hi Andrew,

On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> On Tue,  7 Mar 2023 19:27:45 -0800 Nhat Pham <nphamcs@gmail.com> wrote:
> 
> > There is currently no good way to query the page cache state of large
> > file sets and directory trees. There is mincore(), but it scales poorly:
> > the kernel writes out a lot of bitmap data that userspace has to
> > aggregate, when the user really does not care about per-page information
> > in that case. The user also needs to mmap and unmap each file as it goes
> > along, which can be quite slow as well.
> 
> A while ago I asked about the security implications - could cachestat()
> be used to figure out what parts of a file another user is reading. 
> This also applies to mincore(), but cachestat() newly permits user A to
> work out which parts of a file user B has *written* to.

The caller of cachestat() must have the file open for reading. If they
can read the contents that B has written, is the fact that they can
see dirty state really a concern?

Nhat and I were discussing this offlist at the time, but weren't
creative enough to come up with an abuse scenario.

> I don't recall seeing a response to this, and there is no discussion in
> the changelogs.

It might have drowned in the noise, but he did reply:

https://lore.kernel.org/lkml/CAKEwX=Ppf=WbOuV2Rh3+V8ohOYXo=CnfSu9qqSh-DpVvfy2nhA@mail.gmail.com/

> Secondly, I'm not seeing description of any use cases.  OK, it's faster
> and better than mincore(), but who cares?  In other words, what
> end-user value compels us to add this feature to Linux?

Years ago there was a thread about adding dirty bits to mincore(), I
don't know if you remember this:

https://lkml.org/lkml/2013/2/10/162

In that thread, Rusty described a usecase of maintaining a journaling
file alongside a main file. The idea for testing the dirty state isn't
to call sync but to see whether the journal needs to be updated.

The efficiency of mincore() was touched on too. Andres Freund (CC'd,
hopefully I got the email address right) mentioned that Postgres has a
usecase for deciding whether to do an index scan or query tables
directly, based on whether the index is cached. Postgres works with
files rather than memory regions, and Andres mentioned that the index
could be quite large. The consensus was that having to go through
mmap(), and getting a bytemap representing each page when all you need
is a summary for the queried range, was too painful in practice.

Most recently, the database team at Meta reached out to us and asked
about the ability to query dirty state again. The motivation for this
was twofold. One was simply visibility into the writeback algorithm,
i.e. trying to figure out what it's doing when investigating
performance problems.

The second usecase they brought up was to advise writeback from
userspace to manage the tradeoff between integrity and IO utilization:
if IO capacity is available, sync more frequently; if not, let the
work batch up. Blindly syncing through the file in chunks doesn't work
because you don't know in advance how much IO they'll end up doing (or
how much they've done, afterwards.) So it's difficult to build an
algorithm that will reasonably pace through sparsely dirtied regions
without the risk of overwhelming the IO device on dense ones. And it's
not straight-forward to do this from the kernel, since it doesn't know
the IO headroom the application needs for reading (which is dynamic).
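
As a rough sketch of that pacing idea (not the Meta team's code): if
userspace can learn the dirty page count of each chunk, e.g. from
cachestat's nr_dirty, it can pick how many chunks to flush in one round
without exceeding an IO budget. The function name and the notion of a
per-round page budget are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the dirty page count of each consecutive
 * chunk of a file, return how many chunks (starting at the first) can
 * be flushed without exceeding budget_pages pages of write IO.
 */
size_t chunks_within_budget(const uint64_t *dirty_pages, size_t nchunks,
			    uint64_t budget_pages)
{
	uint64_t used = 0;
	size_t n = 0;

	/* used never exceeds budget_pages, so the subtraction is safe. */
	while (n < nchunks && dirty_pages[n] <= budget_pages - used) {
		used += dirty_pages[n];
		n++;
	}
	return n;
}
```

Sparsely dirtied chunks cost little and are skipped through quickly,
while a dense chunk ends the round before it can swamp the device.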

The page cache is often the biggest memory consumer, and so the kernel
heuristics that manage it have a big impact on performance. We have a
rich interface to augment those heuristics with fadvise and the sync
family, but it's not a stretch to say that it's difficult to use them
if you cannot get good insights into what the other hand is doing.

Another query we get almost monthly is service owners trying to
understand where their memory is going and what's causing unexpected
pressure on a host. They see the cache in vmstat, but between a
complex application, shared libraries or a runtime (jvm, hhvm etc.)
and a myriad of host management agents, there is so much going on on
the machine that it's hard to find out who is touching which
files. When it comes to disk usage, the kernel provides the ability to
quickly stat entire filesystem subtrees and drill down with tools like
du. It sure would be useful to have the same for memory usage.

Our current cache interface is seriously lacking in that regard.

It would be great to have a stable, canonical and versatile interface
to inspect what the cache is doing. One that blends in with the
broader VFS and buffered IO interface: an easy to discover, easy to
use syscall (not an obscure tracepoint or fcntl or a drgn script); an
fd instead of a vma; a VFS-based permission model; efficient handling
of the wide range of file sizes that exist in the real world.

cachestat() fits that bill.

> >    struct cachestat {
> >	        __u64 nr_cache;
> >	        __u64 nr_dirty;
> >	        __u64 nr_writeback;
> >	        __u64 nr_evicted;
> >	        __u64 nr_recently_evicted;
> >    };
> 
> And these fields are really getting into the weedy details of internal
> kernel implementation.  Bear in mind that we must support this API for
> ever.
> 
> Particularly the "evicted" things.  The workingset code was implemented
> eight years ago, which is actually relatively recent.  It could be that
> eight years from now workingset will have been removed and possibly
> replaced with something else.  Then what do we do?

;) I'm definitely biased here, but I don't think it's realistic that
we'd ever go back to a cache that doesn't maintain *some* form of
non-residency information.

We now have two reclaim implementations that rely on it at their
core. And psi is designed around the concept of initial faults vs
refaults; that's an ABI we have to maintain indefinitely anyway, and
is widely used for OOM killing and load shedding in datacenters, on
Android, by all systemd-based installations etc.

It seems unlikely that this is a fluke. But even if I'm completely
wrong about that, I think we have options that wouldn't spell the end
of the world. We could report 0 for those fields and be perfectly
backward compatible. There is a flags field that allows versioning of
struct cachestat, too.

Andres Freund March 15, 2023, 7:14 p.m. UTC | #3

Hi,

On 2023-03-15 13:09:34 -0400, Johannes Weiner wrote:
> On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> > A while ago I asked about the security implications - could cachestat()
> > be used to figure out what parts of a file another user is reading.
> > This also applies to mincore(), but cachestat() newly permits user A to
> > work out which parts of a file user B has *written* to.
>
> The caller of cachestat() must have the file open for reading. If they
> can read the contents that B has written, is the fact that they can
> see dirty state really a concern?

Random idea: Only fill ->dirty/writeback if the fd is open for writing.
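
A userspace model of what that could look like, purely for illustration:
the real check would live in the kernel (presumably against the file's
open mode), and the struct and function names below are made up.

```c
#include <fcntl.h>
#include <stdint.h>

/* Local stand-in for the counters shown in the quoted struct. */
struct cstat_counts {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Hypothetical filter: zero the write-related counters unless the fd
 * was opened for writing. Returns 0 on success, or -1 if the fd's
 * flags cannot be read.
 */
int mask_write_state(int fd, struct cstat_counts *cs)
{
	int flags = fcntl(fd, F_GETFL);
	int acc;

	if (flags < 0)
		return -1;

	acc = flags & O_ACCMODE;
	if (acc != O_WRONLY && acc != O_RDWR) {
		/* Read-only callers only get to see residency. */
		cs->nr_dirty = 0;
		cs->nr_writeback = 0;
	}
	return 0;
}
```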


> > Secondly, I'm not seeing description of any use cases.  OK, it's faster
> > and better than mincore(), but who cares?  In other words, what
> > end-user value compels us to add this feature to Linux?
>
> Years ago there was a thread about adding dirty bits to mincore(), I
> don't know if you remember this:
>
> https://lkml.org/lkml/2013/2/10/162
>
> In that thread, Rusty described a usecase of maintaining a journaling
> file alongside a main file. The idea for testing the dirty state isn't
> to call sync but to see whether the journal needs to be updated.
>
> The efficiency of mincore() was touched on too. Andres Freund (CC'd,
> hopefully I got the email address right) mentioned that Postgres has a
> usecase for deciding whether to do an index scan or query tables
> directly, based on whether the index is cached. Postgres works with
> files rather than memory regions, and Andres mentioned that the index
> could be quite large.

This is still relevant, FWIW. And not just for deciding on the optimal query
plan, but also for reporting purposes. We can show the user what part of the
query has done how much IO, but that can end up being quite confusing because
we're not aware of how much IO was fulfilled by the page cache.


> Most recently, the database team at Meta reached out to us and asked
> about the ability to query dirty state again. The motivation for this
> was twofold. One was simply visibility into the writeback algorithm,
> i.e. trying to figure out what it's doing when investigating
> performance problems.
>
> The second usecase they brought up was to advise writeback from
> userspace to manage the tradeoff between integrity and IO utilization:
> if IO capacity is available, sync more frequently; if not, let the
> work batch up. Blindly syncing through the file in chunks doesn't work
> because you don't know in advance how much IO they'll end up doing (or
> how much they've done, afterwards.) So it's difficult to build an
> algorithm that will reasonably pace through sparsely dirtied regions
> without the risk of overwhelming the IO device on dense ones. And it's
> not straight-forward to do this from the kernel, since it doesn't know
> the IO headroom the application needs for reading (which is dynamic).

We ended up building something very roughly like that in userspace - each
backend tracks the last N writes, and once the number reaches a certain
limit, we sort and collapse the outstanding ranges and issue
sync_file_range(SYNC_FILE_RANGE_WRITE) for them. Different types of tasks have
different limits. Without that, latency in write heavy workloads is ... not
good (to this day, but to a lesser degree than 5-10 years ago).
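
The "sort and collapse" step can be sketched as follows. This is not the
Postgres code, just a minimal illustration of the technique; struct
wrange and coalesce_ranges() are made-up names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One pending write: byte offset and length within a file. */
struct wrange {
	uint64_t off;
	uint64_t len;
};

static int cmp_off(const void *a, const void *b)
{
	const struct wrange *x = a, *y = b;

	return (x->off > y->off) - (x->off < y->off);
}

/*
 * Sort the ranges by offset and merge the ones that overlap or touch.
 * The array is rewritten in place; the return value is the number of
 * merged ranges left at the front of the array.
 */
size_t coalesce_ranges(struct wrange *r, size_t n)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;

	qsort(r, n, sizeof(*r), cmp_off);

	for (i = 1; i < n; i++) {
		uint64_t end = r[out].off + r[out].len;

		if (r[i].off <= end) {
			/* Overlapping or adjacent: extend the current range. */
			uint64_t iend = r[i].off + r[i].len;

			if (iend > end)
				r[out].len = iend - r[out].off;
		} else {
			/* Gap: start a new merged range. */
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

Each merged range would then be handed to sync_file_range() with
SYNC_FILE_RANGE_WRITE, so that many small writes turn into a few larger
writeback requests.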


> Another query we get almost monthly is service owners trying to
> understand where their memory is going and what's causing unexpected
> pressure on a host. They see the cache in vmstat, but between a
> complex application, shared libraries or a runtime (jvm, hhvm etc.)
> and a myriad of host management agents, there is so much going on on
> the machine that it's hard to find out who is touching which
> files. When it comes to disk usage, the kernel provides the ability to
> quickly stat entire filesystem subtrees and drill down with tools like
> du. It sure would be useful to have the same for memory usage.

+1

Greetings,

Andres Freund

Nhat Pham March 24, 2023, 9:59 p.m. UTC | #4

On Wed, Mar 15, 2023 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-03-15 13:09:34 -0400, Johannes Weiner wrote:
> > On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> > > A while ago I asked about the security implications - could cachestat()
> > > be used to figure out what parts of a file another user is reading.
> > > This also applies to mincore(), but cachestat() newly permits user A to
> > > work out which parts of a file user B has *written* to.
> >
> > The caller of cachestat() must have the file open for reading. If they
> > can read the contents that B has written, is the fact that they can
> > see dirty state really a concern?
>
> Random idea: Only fill ->dirty/writeback if the fd is open for writing.
>
>
> > > Secondly, I'm not seeing description of any use cases.  OK, it's faster
> > > and better than mincore(), but who cares?  In other words, what
> > > end-user value compels us to add this feature to Linux?
> >
> > Years ago there was a thread about adding dirty bits to mincore(), I
> > don't know if you remember this:
> >
> > https://lkml.org/lkml/2013/2/10/162
> >
> > In that thread, Rusty described a usecase of maintaining a journaling
> > file alongside a main file. The idea for testing the dirty state isn't
> > to call sync but to see whether the journal needs to be updated.
> >
> > The efficiency of mincore() was touched on too. Andres Freund (CC'd,
> > hopefully I got the email address right) mentioned that Postgres has a
> > usecase for deciding whether to do an index scan or query tables
> > directly, based on whether the index is cached. Postgres works with
> > files rather than memory regions, and Andres mentioned that the index
> > could be quite large.
>
> This is still relevant, FWIW. And not just for deciding on the optimal query
> plan, but also for reporting purposes. We can show the user what part of the
> query has done how much IO, but that can end up being quite confusing because
> we're not aware of how much IO was fulfilled by the page cache.
>
>
> > Most recently, the database team at Meta reached out to us and asked
> > about the ability to query dirty state again. The motivation for this
> > was twofold. One was simply visibility into the writeback algorithm,
> > i.e. trying to figure out what it's doing when investigating
> > performance problems.
> >
> > The second usecase they brought up was to advise writeback from
> > userspace to manage the tradeoff between integrity and IO utilization:
> > if IO capacity is available, sync more frequently; if not, let the
> > work batch up. Blindly syncing through the file in chunks doesn't work
> > because you don't know in advance how much IO they'll end up doing (or
> > how much they've done, afterwards.) So it's difficult to build an
> > algorithm that will reasonably pace through sparsely dirtied regions
> > without the risk of overwhelming the IO device on dense ones. And it's
> > not straight-forward to do this from the kernel, since it doesn't know
> > the IO headroom the application needs for reading (which is dynamic).
>
> We ended up building something very roughly like that in userspace - each
> backend tracks the last N writes, and once the number reaches a certain
> limit, we sort and collapse the outstanding ranges and issue
> sync_file_range(SYNC_FILE_RANGE_WRITE) for them. Different types of tasks have
> different limits. Without that, latency in write heavy workloads is ... not
> good (to this day, but to a lesser degree than 5-10 years ago).
>
>
> > Another query we get almost monthly is service owners trying to
> > understand where their memory is going and what's causing unexpected
> > pressure on a host. They see the cache in vmstat, but between a
> > complex application, shared libraries or a runtime (jvm, hhvm etc.)
> > and a myriad of host management agents, there is so much going on on
> > the machine that it's hard to find out who is touching which
> > files. When it comes to disk usage, the kernel provides the ability to
> > quickly stat entire filesystem subtrees and drill down with tools like
> > du. It sure would be useful to have the same for memory usage.
>
> +1
>
> Greetings,
>
> Andres Freund

Thanks for the suggestion/discussion regarding cachestat's use cases,
Johannes and Andres! I'll put a summary of these points (along with a link to
the original discussion thread) in the cover letter and commit message
of the new version of the patch set.

In the meantime, feel free to let me know if there is something else cachestat
could help with (along with any improvements that could facilitate such
use cases).

Best,
Nhat