Message ID | 20230308032748.609510-1-nphamcs@gmail.com |
---|---|
Headers |
From: Nhat Pham <nphamcs@gmail.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, bfoster@redhat.com, willy@infradead.org, arnd@arndb.de, linux-api@vger.kernel.org, kernel-team@meta.com
Subject: [PATCH v11 0/3] cachestat: a new syscall for page cache state of files
Date: Tue, 7 Mar 2023 19:27:45 -0800
Message-Id: <20230308032748.609510-1-nphamcs@gmail.com>
X-Mailer: git-send-email 2.39.2
List-ID: <linux-kernel.vger.kernel.org> |
Series |
cachestat: a new syscall for page cache state of files
|
Message
Nhat Pham
March 8, 2023, 3:27 a.m. UTC
Changelog:

v11:
* Clean up code and comments/documentation (patch 1 and 2) (suggested by Matthew Wilcox)
* Drop support for hugetlbfs (patch 2) (from discussion with Johannes Weiner and Matthew Wilcox)

v10:
* Reorder the arguments for archs with alignment requirements (patch 2) (suggested by Arnd Bergmann)

v9:
* Remove the syscall from all architecture syscall tables except x86 (patch 2)
* API change: handle different cases for offset and add a compat syscall (patch 2) (suggested by Johannes Weiner and Arnd Bergmann)

v8:
* Add the syscall to the mips syscall tables (detected by kernel test robot) (patch 2)
* Add a missing return (suggested by Yu Zhao) (patch 2)

v7:
* Fix and use lru_gen_test_recent (suggested by Brian Foster) (patch 2)
* Small formatting and organizational fixes

v6:
* Add a missing fdput() (suggested by Brian Foster) (patch 2)
* Replace cstat_size with cstat_version (suggested by Brian Foster) (patch 2)
* Add a conditional resched to the xas walk (suggested by Hillf Danton) (patch 2)

v5:
* Separate the first patch into its own series (suggested by Andrew Morton)
* Expose filemap_cachestat() to non-syscall usage (patch 2) (suggested by Brian Foster)
* Fix some build errors from the last version (patch 2)
* Explain eviction and recent eviction in the draft man page and documentation (suggested by Andrew Morton) (patch 2)

v4:
* Refactor cachestat and move it to mm/filemap.c (patch 3) (suggested by Brian Foster)
* Remove redundant checks (!folio, access_ok) (patch 3) (suggested by Matthew Wilcox and Al Viro)
* Fix a bug in handling multipage folios (patch 3) (suggested by Matthew Wilcox)
* Add a selftest for shmem files, which can be used to test huge pages (patch 4) (suggested by Johannes Weiner)

v3:
* Fix some minor formatting issues and build errors
* Add the new syscall entry to missing architecture syscall tables (patch 3)
* Add a flags argument for the syscall (patch 3)
* Clean up the recency refactoring (patch 2) (suggested by Yu Zhao)
* Add a new Kconfig option (CONFIG_CACHESTAT) to disable the syscall (patch 3) (suggested by Josh Triplett)

v2:
* len == 0 means query to EOF; len < 0 is invalid (patch 3) (suggested by Brian Foster)
* Make cachestat extensible by adding the `cstat_size` argument to the syscall (patch 3)

There is currently no good way to query the page cache state of large file sets and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well.

This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages, etc.) of a file over a specified range of bytes. It also includes a selftest suite that tests some typical usage. Currently, the syscall is only wired up for the x86 architecture.

This interface is inspired by past discussion of and concerns with fincore, which has a similar design (and, as a result, similar issues) to mincore.
Relevant links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html

For comparison with mincore, I ran both syscalls on a 2TB sparse file:

Using mincore:
real    0m37.510s
user    0m2.934s
sys     0m34.558s

Using cachestat:
real    0m0.009s
user    0m0.000s
sys     0m0.009s

This series should be applied on top of:
workingset: fix confusion around eviction vs refault container
https://lkml.org/lkml/2023/1/4/1066

This series consists of 3 patches:

Nhat Pham (3):
  workingset: refactor LRU refault to expose refault recency check
  cachestat: implement cachestat syscall
  selftests: Add selftests for cachestat

 MAINTAINERS                                  |   7 +
 arch/x86/entry/syscalls/syscall_32.tbl       |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl       |   1 +
 include/linux/compat.h                       |   5 +-
 include/linux/swap.h                         |   1 +
 include/linux/syscalls.h                     |   3 +
 include/uapi/asm-generic/unistd.h            |   5 +-
 include/uapi/linux/mman.h                    |   9 +
 init/Kconfig                                 |  10 +
 kernel/sys_ni.c                              |   1 +
 mm/filemap.c                                 | 166 +++++++++++
 mm/workingset.c                              | 145 ++++++----
 tools/testing/selftests/Makefile             |   1 +
 tools/testing/selftests/cachestat/.gitignore |   2 +
 tools/testing/selftests/cachestat/Makefile   |   8 +
 .../selftests/cachestat/test_cachestat.c     | 257 ++++++++++++++++++
 16 files changed, 574 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/cachestat/.gitignore
 create mode 100644 tools/testing/selftests/cachestat/Makefile
 create mode 100644 tools/testing/selftests/cachestat/test_cachestat.c

base-commit: 1440f576022887004f719883acb094e7e0dd4944
prerequisite-patch-id: 171a43d333e1b267ce14188a5beaea2f313787fb
--
2.39.2
Comments
On Tue, 7 Mar 2023 19:27:45 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> There is currently no good way to query the page cache state of large
> file sets and directory trees. There is mincore(), but it scales poorly:
> the kernel writes out a lot of bitmap data that userspace has to
> aggregate, when the user really does not care about per-page information
> in that case. The user also needs to mmap and unmap each file as it goes
> along, which can be quite slow as well.

A while ago I asked about the security implications - could cachestat() be used to figure out what parts of a file another user is reading. This also applies to mincore(), but cachestat() newly permits user A to work out which parts of a file user B has *written* to.

I don't recall seeing a response to this, and there is no discussion in the changelogs.

Secondly, I'm not seeing description of any use cases. OK, it's faster and better than mincore(), but who cares? In other words, what end-user value compels us to add this feature to Linux?

> struct cachestat {
> 	__u64 nr_cache;
> 	__u64 nr_dirty;
> 	__u64 nr_writeback;
> 	__u64 nr_evicted;
> 	__u64 nr_recently_evicted;
> };

And these fields are really getting into the weedy details of internal kernel implementation. Bear in mind that we must support this API for ever.

Particularly the "evicted" things. The workingset code was implemented eight years ago, which is actually relatively recent. It could be that eight years from now it will have been removed and possibly replaced with something else. Then what do we do?

For these reasons, and because of the lack of enthusiasm I have seen from others, I don't think a case has yet been made for the addition of this new syscall.
Hi Andrew,

On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> A while ago I asked about the security implications - could cachestat()
> be used to figure out what parts of a file another user is reading.
> This also applies to mincore(), but cachestat() newly permits user A to
> work out which parts of a file user B has *written* to.

The caller of cachestat() must have the file open for reading. If they can read the contents that B has written, is the fact that they can see dirty state really a concern? Nhat and I were discussing this offlist at the time, but weren't creative enough to come up with an abuse scenario.

> I don't recall seeing a response to this, and there is no discussion in
> the changelogs.

It might have drowned in the noise, but he did reply:
https://lore.kernel.org/lkml/CAKEwX=Ppf=WbOuV2Rh3+V8ohOYXo=CnfSu9qqSh-DpVvfy2nhA@mail.gmail.com/

> Secondly, I'm not seeing description of any use cases. OK, it's faster
> and better than mincore(), but who cares? In other words, what
> end-user value compels us to add this feature to Linux?

Years ago there was a thread about adding dirty bits to mincore(), I don't know if you remember this:

https://lkml.org/lkml/2013/2/10/162

In that thread, Rusty described a usecase of maintaining a journaling file alongside a main file. The idea for testing the dirty state isn't to call sync but to see whether the journal needs to be updated.

The efficiency of mincore() was touched on too. Andres Freund (CC'd, hopefully I got the email address right) mentioned that Postgres has a usecase for deciding whether to do an index scan or query tables directly, based on whether the index is cached. Postgres works with files rather than memory regions, and Andres mentioned that the index could be quite large. The consensus was that having to go through mmap(), and getting a bytemap representing each page when all you need is a summary for the queried range, was too painful in practice.

Most recently, the database team at Meta reached out to us and asked about the ability to query dirty state again. The motivation for this was twofold. One was simply visibility into the writeback algorithm, i.e. trying to figure out what it's doing when investigating performance problems.

The second usecase they brought up was to advise writeback from userspace to manage the tradeoff between integrity and IO utilization: if IO capacity is available, sync more frequently; if not, let the work batch up. Blindly syncing through the file in chunks doesn't work because you don't know in advance how much IO they'll end up doing (or how much they've done, afterwards.) So it's difficult to build an algorithm that will reasonably pace through sparsely dirtied regions without the risk of overwhelming the IO device on dense ones. And it's not straightforward to do this from the kernel, since it doesn't know the IO headroom the application needs for reading (which is dynamic).

The page cache is often the biggest memory consumer, and so the kernel heuristics that manage it have a big impact on performance. We have a rich interface to augment those heuristics with fadvise and the sync family, but it's not a stretch to say that it's difficult to use them if you cannot get good insights into what the other hand is doing.

Another query we get almost monthly is service owners trying to understand where their memory is going and what's causing unexpected pressure on a host. They see the cache in vmstat, but between a complex application, shared libraries or a runtime (jvm, hhvm etc.) and a myriad of host management agents, there is so much going on on the machine that it's hard to find out who is touching which files. When it comes to disk usage, the kernel provides the ability to quickly stat entire filesystem subtrees and drill down with tools like du. It sure would be useful to have the same for memory usage.

Our current cache interface is seriously lacking in that regard. It would be great to have a stable, canonical and versatile interface to inspect what the cache is doing. One that blends in with the broader VFS and buffered IO interface: an easy to discover, easy to use syscall (not an obscure tracepoint or fcntl or a drgn script); an fd instead of a vma; a VFS-based permission model; efficient handling of the wide range of file sizes that exist in the real world. cachestat() fits that bill.

> > struct cachestat {
> > 	__u64 nr_cache;
> > 	__u64 nr_dirty;
> > 	__u64 nr_writeback;
> > 	__u64 nr_evicted;
> > 	__u64 nr_recently_evicted;
> > };
>
> And these fields are really getting into the weedy details of internal
> kernel implementation. Bear in mind that we must support this API for
> ever.
>
> Particularly the "evicted" things. The workingset code was implemented
> eight years ago, which is actually relatively recent. It could be that
> eight years from now it will have been removed and possibly replaced
> with something else. Then what do we do?

;) I'm definitely biased here, but I don't think it's realistic that we'd ever go back to a cache that doesn't maintain *some* form of non-residency information. We now have two reclaim implementations that rely on it at its core. And psi is designed around the concept of initial faults vs refaults; that's an ABI we have to maintain indefinitely anyway, and is widely used for OOM killing and load shedding in datacenters, on Android, by all systemd-based installations etc. It seems unlikely that this is a fluke.

But even if I'm completely wrong about that, I think we have options that wouldn't spell the end of the world. We could report 0 for those fields and be perfectly backward compatible. There is a flags field that allows versioning of struct cachestat, too.
Hi,

On 2023-03-15 13:09:34 -0400, Johannes Weiner wrote:
> On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> > A while ago I asked about the security implications - could cachestat()
> > be used to figure out what parts of a file another user is reading.
> > This also applies to mincore(), but cachestat() newly permits user A to
> > work out which parts of a file user B has *written* to.
>
> The caller of cachestat() must have the file open for reading. If they
> can read the contents that B has written, is the fact that they can
> see dirty state really a concern?

Random idea: Only fill ->dirty/writeback if the fd is open for writing.

> > Secondly, I'm not seeing description of any use cases. OK, it's faster
> > and better than mincore(), but who cares? In other words, what
> > end-user value compels us to add this feature to Linux?
>
> Years ago there was a thread about adding dirty bits to mincore(), I
> don't know if you remember this:
>
> https://lkml.org/lkml/2013/2/10/162
>
> In that thread, Rusty described a usecase of maintaining a journaling
> file alongside a main file. The idea for testing the dirty state isn't
> to call sync but to see whether the journal needs to be updated.
>
> The efficiency of mincore() was touched on too. Andres Freund (CC'd,
> hopefully I got the email address right) mentioned that Postgres has a
> usecase for deciding whether to do an index scan or query tables
> directly, based on whether the index is cached. Postgres works with
> files rather than memory regions, and Andres mentioned that the index
> could be quite large.

This is still relevant, FWIW. And not just for deciding on the optimal query plan, but also for reporting purposes. We can show the user what part of the query has done how much IO, but that can end up being quite confusing because we're not aware of how much IO was fulfilled by the page cache.

> Most recently, the database team at Meta reached out to us and asked
> about the ability to query dirty state again. [...] The second usecase
> they brought up was to advise writeback from userspace to manage the
> tradeoff between integrity and IO utilization: if IO capacity is
> available, sync more frequently; if not, let the work batch up. [...]

We ended up building something very roughly like that in userspace - each backend tracks the last N writes, and once the number reaches a certain limit, we sort and collapse the outstanding ranges and issue sync_file_range(SYNC_FILE_RANGE_WRITE) for them. Different types of tasks have different limits. Without that, latency in write-heavy workloads is ... not good (to this day, but to a lesser degree than 5-10 years ago).

> Another query we get almost monthly is service owners trying to
> understand where their memory is going and what's causing unexpected
> pressure on a host. [...] When it comes to disk usage, the kernel
> provides the ability to quickly stat entire filesystem subtrees and
> drill down with tools like du. It sure would be useful to have the
> same for memory usage.

+1

Greetings,

Andres Freund
On Wed, Mar 15, 2023 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
> [...]
> > > When it comes to disk usage, the kernel provides the ability to
> > > quickly stat entire filesystem subtrees and drill down with tools like
> > > du. It sure would be useful to have the same for memory usage.
>
> +1
>
> Greetings,
>
> Andres Freund

Thanks for the suggestions and discussion regarding cachestat's use cases, Johannes and Andres! I'll put a summary of these points (along with a link to the original discussion thread) in the cover letter and commit message of the new version of the patch set.

In the meantime, feel free to let me know if there is anything else cachestat could help with (along with any improvements that could facilitate such use cases).

Best,
Nhat