Message ID | 20221216192149.3902877-4-nphamcs@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:e747:0:0:0:0:0 with SMTP id c7csp1160284wrn; Fri, 16 Dec 2022 11:42:59 -0800 (PST) X-Google-Smtp-Source: AA0mqf7ZHmVExQpRi919wky6030TWF6XOkofW378PNH4qzqkCQTaJunaHE407PSiiX9VlsR8ryNR X-Received: by 2002:a17:907:c687:b0:7b5:911c:9b12 with SMTP id ue7-20020a170907c68700b007b5911c9b12mr26884734ejc.1.1671219779653; Fri, 16 Dec 2022 11:42:59 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671219779; cv=none; d=google.com; s=arc-20160816; b=AWB1NxfCiAGVesuI7uuDznVhdOMgPuqwSDMMn3e03e24sE15EzdsFqbdTF40TMgTdA 0SDeye506mWcGDmh9CZbOZbAQ+EdQ2NoucPtCEnvaioQKDjwAaStIwCC1GJDCFzwyaAu K1QBdvte4+DoW0SDgt/d308YbHq3FTW3IXypS/HKLI9shy+bJDCzLZA0j/262dKEml3d +Qeu5hrHXXSMA+/9iPIY9WJwS7AWtZIzMBEJwOPtPDFrlTQmxBSuaYUCICtYWlrL8y/U pfmj6vHRRVDf7+U5Sk2kqjTWsaVA75YIIZgeYvF/iim7hXMLa7QWmJ/Clvuqd2L4n5Um r/iw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=mcfj7//KZGx3L4vs1asXbZHiEbvSE2W37vabxM0hoJ8=; b=x/Ztw+sd4P25Bsg4Jpck9fPP096QFaqYe8wIMl2JB66kgGJdnc7b/ZOhk9Afinu8oZ q3RGBmB/XlsR7ZdIT2Mmq/FTH2NfI++SLKF8E9d9e5NCs1hgnrI9D04dqIajTjKhG72z +rPwJqTfb4BrjNw5fwxuL4LJvNgeE3gpOTrI3DprsKpsITemd7cem2TQOv5WmvImgGi9 g1POBzEN5e/vTg7bvYDJeQRPy88VSr7SG/QEL51HUF4nWZGnyeSHkUp4LY98KbbOV9e6 p+yOyonE1zMwHGFZvkIEnD/8KD+p2K+drAlvMOKuuFfhlfN6ucvyL0vswZZHmZ6S17I8 8gwA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=XyvoQwzM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id gy18-20020a170906f25200b007c128ebacfcsi2645038ejb.835.2022.12.16.11.42.35; Fri, 16 Dec 2022 11:42:59 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=XyvoQwzM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231496AbiLPTWK (ORCPT <rfc822;jeantsuru.cumc.mandola@gmail.com> + 99 others); Fri, 16 Dec 2022 14:22:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34510 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229742AbiLPTV5 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 16 Dec 2022 14:21:57 -0500 Received: from mail-pj1-x1032.google.com (mail-pj1-x1032.google.com [IPv6:2607:f8b0:4864:20::1032]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D28057B67 for <linux-kernel@vger.kernel.org>; Fri, 16 Dec 2022 11:21:55 -0800 (PST) Received: by mail-pj1-x1032.google.com with SMTP id v13-20020a17090a6b0d00b00219c3be9830so3328771pjj.4 for <linux-kernel@vger.kernel.org>; Fri, 16 Dec 2022 11:21:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=mcfj7//KZGx3L4vs1asXbZHiEbvSE2W37vabxM0hoJ8=; b=XyvoQwzMgd30oKLXbNCSHtBaFKuZiO62uBdvtZ0V+ri6ypZX1wYUEYC6CfR9jnAIUZ oXovRbwVtPxgB8xGuhGfdOAADmxaUqygXmHamXKxCPOW7eJGEjBIOevihiXmmdOUuJyS SgBWKEpogOMb60EJL+GTwP/hvCjtOhvPAz2zhaG85oCoYeHRVFzrS80EM1bztbhLchFv P/XiJqCIg7w3Rlhu8FFDAxviL1e5ahSEpx7hnFilhVJCF0KDDYz1TzJV0glGWt7CDykz uSSwLQYRrFFBnKL4sTFD6V/4jx2878YW6rR4B+iHRm431MXqy3QZIxSiDTMMv3qrp7QO /ehA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=mcfj7//KZGx3L4vs1asXbZHiEbvSE2W37vabxM0hoJ8=; b=5iJn3TONG4RK/n8t1zUefuvXNVT0j6nRV1yuk7kJpIHSs4+tioDEBhT7OBnW8iNZ95 ON7Q5ncyNhYh2g5kdzR38sTaJuhoNccsyvi8HLhcAn9Re+3jIUKfUXpXvEHRBwAWla7U DkIBJx1go6hepZqgh+5DD2TfmhIm1juN5JQcr6F0TzPW8NPD+eTK0NfaYg82MjESUBE/ GmyGox8mAXjMf0GLvjXJkzL6QOImNgdQshwlz+KvUbe7eGVqCDD7ahA8X4+ZRkEwWEoQ enyIyybW7zDzx6qDp96IVBHurEd0agV1Y7OvSjgX8ySoEOMB0J5fOom8ymMBEAorKHgV xh6w== X-Gm-Message-State: ANoB5pmpguLNWV1ygyZyTd8asJ9Rd1LQ2oNXNjG06I0IncClKjQnY4sy YUgMPm/CzWEWwnUiB/82aFE= X-Received: by 2002:a17:902:8548:b0:18c:fd25:a934 with SMTP id d8-20020a170902854800b0018cfd25a934mr30262434plo.61.1671218514611; Fri, 16 Dec 2022 11:21:54 -0800 (PST) Received: from localhost (fwdproxy-prn-118.fbsv.net. [2a03:2880:ff:76::face:b00c]) by smtp.gmail.com with ESMTPSA id ja5-20020a170902efc500b001873aa85e1fsm1953676plb.305.2022.12.16.11.21.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 16 Dec 2022 11:21:54 -0800 (PST) From: Nhat Pham <nphamcs@gmail.com> To: akpm@linux-foundation.org Cc: hannes@cmpxchg.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, bfoster@redhat.com, willy@infradead.org, kernel-team@meta.com Subject: [PATCH v4 3/4] cachestat: implement cachestat syscall Date: Fri, 16 Dec 2022 11:21:48 -0800 Message-Id: <20221216192149.3902877-4-nphamcs@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20221216192149.3902877-1-nphamcs@gmail.com> References: <20221216192149.3902877-1-nphamcs@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752400952015609704?= X-GMAIL-MSGID: =?utf-8?q?1752400952015609704?= |
Series |
cachestat: a new syscall for page cache state of files
|
|
Commit Message
Nhat Pham
Dec. 16, 2022, 7:21 p.m. UTC
Implement a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in
a given range.
NAME
cachestat - query the page cache status of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, off_t off, size_t len,
size_t cstat_size, struct cachestat *cstat,
unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of (recently)
evicted pages, in the bytes range given by `off` and `len`.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
`cstat_size` allows users to obtain partial results. The syscall
will copy the first `csstat_size` bytes to the specified userspace
memory. `cstat_size` must be a non-negative value that is no larger
than the current size of the cachestat struct.
The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat points to an invalid address.
EINVAL invalid `cstat_size` or `flags`
EBADF invalid file descriptor.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/fs.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 9 ++
init/Kconfig | 10 ++
kernel/sys_ni.c | 1 +
mm/filemap.c | 137 ++++++++++++++++++++
20 files changed, 180 insertions(+), 1 deletion(-)
Comments
On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > Implement a new syscall that queries cache state of a file and > summarizes the number of cached pages, number of dirty pages, number of > pages marked for writeback, number of (recently) evicted pages, etc. in > a given range. > > NAME > cachestat - query the page cache status of a file. > > SYNOPSIS > #include <sys/mman.h> > > struct cachestat { > __u64 nr_cache; > __u64 nr_dirty; > __u64 nr_writeback; > __u64 nr_evicted; > __u64 nr_recently_evicted; > }; > > int cachestat(unsigned int fd, off_t off, size_t len, > size_t cstat_size, struct cachestat *cstat, > unsigned int flags); > > DESCRIPTION > cachestat() queries the number of cached pages, number of dirty > pages, number of pages marked for writeback, number of (recently) > evicted pages, in the bytes range given by `off` and `len`. I suggest this be spelled out better: "number of evicted and number or recently evicted pages". I suggest this clearly tell readers what an "evicted" page is - they aren't kernel programmers! What is the benefit of the "recently evicted" pages? "recently" seems very vague - what use is this to anyone? > These values are returned in a cachestat struct, whose address is > given by the `cstat` argument. > > The `off` and `len` arguments must be non-negative integers. If > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > 0, we will query in the range from `off` to the end of the file. > > `cstat_size` allows users to obtain partial results. The syscall > will copy the first `csstat_size` bytes to the specified userspace > memory. `cstat_size` must be a non-negative value that is no larger > than the current size of the cachestat struct. > > The `flags` argument is unused for now, but is included for future > extensibility. User should pass 0 (i.e no flag specified). Why is `flags' here? We could add an unused flags arg to any syscall, but we don't. What's the plan? Are there security implications? If I know that some process has a file open, I can use cachestat() to infer which parts of that file they're looking at (like mincore(), I guess). And I can infer which parts they're writing to, unlike mincore(). I suggest the [patch 1/4] fixup be separated from this series.
Hi Nhat, Thank you for the patch! Yet something to improve: [auto build test ERROR on shuah-kselftest/next] [also build test ERROR on akpm-mm/mm-everything linus/master v6.1 next-20221216] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall config: s390-allmodconfig compiler: s390-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/ If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All error/warnings (new ones prefixed by >>): In file included from include/uapi/asm-generic/mman.h:5, from ./arch/s390/include/generated/uapi/asm/mman.h:1, from include/uapi/linux/mman.h:5, from include/linux/fs.h:49, from include/linux/huge_mm.h:8, from include/linux/mm.h:737, from include/linux/kvm_host.h:16, from arch/s390/kvm/kvm-s390.h:17, from arch/s390/kvm/gaccess.c:16: >> include/uapi/asm-generic/mman-common.h:16:25: error: expected identifier before numeric constant 16 | #define PROT_NONE 0x0 /* page can not be accessed */ | ^~~ arch/s390/kvm/gaccess.c:493:9: note: in expansion of macro 'PROT_NONE' 493 | PROT_NONE, | ^~~~~~~~~ arch/s390/kvm/gaccess.c: In function 'trans_exc_ending': >> arch/s390/kvm/gaccess.c:516:17: error: duplicate case value 516 | case PROT_TYPE_LA: | ^~~~ arch/s390/kvm/gaccess.c:509:17: note: previously used here 509 | case PROT_NONE: | ^~~~ In file included from include/linux/compiler_types.h:79, from <command-line>: >> include/linux/compiler_attributes.h:223:41: warning: attribute 'fallthrough' not preceding a case label or default label 223 | # define fallthrough __attribute__((__fallthrough__)) | ^~~~~~~~~~~~~ arch/s390/kvm/gaccess.c:515:25: note: in expansion of macro 'fallthrough' 515 | fallthrough; | ^~~~~~~~~~~ vim +16 include/uapi/asm-generic/mman-common.h 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 9 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 10 #define PROT_READ 0x1 /* page can be read */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 11 #define PROT_WRITE 0x2 /* page can be written */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 12 #define PROT_EXEC 0x4 /* page can be executed */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 13 #define PROT_SEM 0x8 /* page may be used for atomic ops */ d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 14 /* 0x10 reserved for arch-specific use */ d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 15 /* 0x20 reserved for arch-specific use */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 @16 #define PROT_NONE 0x0 /* page can not be accessed */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 17 #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 18 #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 19
Hi Nhat, Thank you for the patch! Yet something to improve: [auto build test ERROR on shuah-kselftest/next] [also build test ERROR on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall config: arm64-randconfig-r035-20221216 compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 98b13979fb05f3ed288a900deb843e7b27589e58) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install arm64 cross compiling tool for clang build # apt-get install binutils-aarch64-linux-gnu # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 prepare If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): In file included from arch/arm64/kernel/asm-offsets.c:10: In file included from include/linux/arm_sdei.h:8: In file included from include/acpi/ghes.h:5: In file included from include/acpi/apei.h:9: In file included from include/linux/acpi.h:15: In file included from include/linux/device.h:32: In file included from include/linux/device/driver.h:21: In file included from include/linux/module.h:19: In file included from include/linux/elf.h:6: In file included from arch/arm64/include/asm/elf.h:141: In file included from include/linux/fs.h:49: In file included from include/uapi/linux/mman.h:5: >> arch/arm64/include/asm/mman.h:15:10: error: use of undeclared identifier 'VM_ARM64_BTI' ret |= VM_ARM64_BTI; ^ >> arch/arm64/include/asm/mman.h:18:10: error: use of undeclared identifier 'VM_MTE' ret |= VM_MTE; ^ >> arch/arm64/include/asm/mman.h:32:10: error: use of undeclared identifier 'VM_MTE_ALLOWED' return VM_MTE_ALLOWED; ^ arch/arm64/include/asm/mman.h:59:22: error: use of undeclared identifier 'VM_MTE' return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED); ^ arch/arm64/include/asm/mman.h:59:45: error: use of undeclared identifier 'VM_MTE_ALLOWED' return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED); ^ 5 errors generated. make[2]: *** [scripts/Makefile.build:118: arch/arm64/kernel/asm-offsets.s] Error 1 make[2]: Target 'prepare' not remade because of errors. make[1]: *** [Makefile:1270: prepare0] Error 2 make[1]: Target 'prepare' not remade because of errors. make: *** [Makefile:231: __sub-make] Error 2 make: Target 'prepare' not remade because of errors. vim +/VM_ARM64_BTI +15 arch/arm64/include/asm/mman.h 8ef8f360cf30be Dave Martin 2020-03-16 8 8ef8f360cf30be Dave Martin 2020-03-16 9 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, 8ef8f360cf30be Dave Martin 2020-03-16 10 unsigned long pkey __always_unused) 8ef8f360cf30be Dave Martin 2020-03-16 11 { 9f3419315f3cdc Catalin Marinas 2019-11-27 12 unsigned long ret = 0; 9f3419315f3cdc Catalin Marinas 2019-11-27 13 8ef8f360cf30be Dave Martin 2020-03-16 14 if (system_supports_bti() && (prot & PROT_BTI)) 9f3419315f3cdc Catalin Marinas 2019-11-27 @15 ret |= VM_ARM64_BTI; 8ef8f360cf30be Dave Martin 2020-03-16 16 9f3419315f3cdc Catalin Marinas 2019-11-27 17 if (system_supports_mte() && (prot & PROT_MTE)) 9f3419315f3cdc Catalin Marinas 2019-11-27 @18 ret |= VM_MTE; 9f3419315f3cdc Catalin Marinas 2019-11-27 19 9f3419315f3cdc Catalin Marinas 2019-11-27 20 return ret; 8ef8f360cf30be Dave Martin 2020-03-16 21 } 8ef8f360cf30be Dave Martin 2020-03-16 22 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey) 8ef8f360cf30be Dave Martin 2020-03-16 23 9f3419315f3cdc Catalin Marinas 2019-11-27 24 static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) 9f3419315f3cdc Catalin Marinas 2019-11-27 25 { 9f3419315f3cdc Catalin Marinas 2019-11-27 26 /* 9f3419315f3cdc Catalin Marinas 2019-11-27 27 * Only allow MTE on anonymous mappings as these are guaranteed to be 9f3419315f3cdc Catalin Marinas 2019-11-27 28 * backed by tags-capable memory. The vm_flags may be overridden by a 9f3419315f3cdc Catalin Marinas 2019-11-27 29 * filesystem supporting MTE (RAM-based). 9f3419315f3cdc Catalin Marinas 2019-11-27 30 */ 9f3419315f3cdc Catalin Marinas 2019-11-27 31 if (system_supports_mte() && (flags & MAP_ANONYMOUS)) 9f3419315f3cdc Catalin Marinas 2019-11-27 @32 return VM_MTE_ALLOWED; 9f3419315f3cdc Catalin Marinas 2019-11-27 33 9f3419315f3cdc Catalin Marinas 2019-11-27 34 return 0; 9f3419315f3cdc Catalin Marinas 2019-11-27 35 } 9f3419315f3cdc Catalin Marinas 2019-11-27 36 #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags) 9f3419315f3cdc Catalin Marinas 2019-11-27 37
Hi Nhat, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on shuah-kselftest/next] [also build test WARNING on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall config: alpha-randconfig-r036-20221216 compiler: alpha-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash mm/damon/ If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All warnings (new ones prefixed by >>): In file included from include/uapi/linux/mman.h:5, from include/linux/fs.h:49, from include/linux/highmem.h:5, from mm/damon/vaddr.c:11: >> arch/alpha/include/uapi/asm/mman.h:15: warning: "MAP_FIXED" redefined 15 | #define MAP_FIXED 0x100 /* Interpret addr exactly */ | In file included from mm/damon/vaddr.c:10: include/uapi/asm-generic/mman-common.h:22: note: this is the location of the previous definition 22 | #define MAP_FIXED 0x10 /* Interpret addr exactly */ | >> arch/alpha/include/uapi/asm/mman.h:16: warning: "MAP_ANONYMOUS" redefined 16 | #define MAP_ANONYMOUS 0x10 /* don't use a file */ | include/uapi/asm-generic/mman-common.h:23: note: this is the location of the previous definition 23 | #define MAP_ANONYMOUS 0x20 /* don't use a file */ | >> arch/alpha/include/uapi/asm/mman.h:29: warning: "MAP_POPULATE" redefined 29 | #define MAP_POPULATE 0x20000 /* populate (prefault) pagetables */ | include/uapi/asm-generic/mman-common.h:26: note: this is the location of the previous definition 26 | #define MAP_POPULATE 0x008000 /* populate (prefault) pagetables */ | >> arch/alpha/include/uapi/asm/mman.h:30: warning: "MAP_NONBLOCK" redefined 30 | #define MAP_NONBLOCK 0x40000 /* do not block on IO */ | include/uapi/asm-generic/mman-common.h:27: note: this is the location of the previous definition 27 | #define MAP_NONBLOCK 0x010000 /* do not block on IO */ | >> arch/alpha/include/uapi/asm/mman.h:31: warning: "MAP_STACK" redefined 31 | #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */ | include/uapi/asm-generic/mman-common.h:28: note: this is the location of the previous definition 28 | #define MAP_STACK 0x020000 /* give out an address that is best suited for process/thread stacks */ | >> arch/alpha/include/uapi/asm/mman.h:32: warning: "MAP_HUGETLB" redefined 32 | #define MAP_HUGETLB 0x100000 /* create a huge page mapping */ | include/uapi/asm-generic/mman-common.h:29: note: this is the location of the previous definition 29 | #define MAP_HUGETLB 0x040000 /* create a huge page mapping */ | >> arch/alpha/include/uapi/asm/mman.h:33: warning: "MAP_FIXED_NOREPLACE" redefined 33 | #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */ | include/uapi/asm-generic/mman-common.h:31: note: this is the location of the previous definition 31 | #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ | >> arch/alpha/include/uapi/asm/mman.h:36: warning: "MS_SYNC" redefined 36 | #define MS_SYNC 2 /* synchronous memory sync */ | include/uapi/asm-generic/mman-common.h:43: note: this is the location of the previous definition 43 | #define MS_SYNC 4 /* synchronous memory sync */ | >> arch/alpha/include/uapi/asm/mman.h:37: warning: "MS_INVALIDATE" redefined 37 | #define MS_INVALIDATE 4 /* invalidate the caches */ | include/uapi/asm-generic/mman-common.h:42: note: this is the location of the previous definition 42 | #define MS_INVALIDATE 2 /* invalidate the caches */ | >> arch/alpha/include/uapi/asm/mman.h:50: warning: "MADV_DONTNEED" redefined 50 | #define MADV_DONTNEED 6 /* don't need these pages */ | include/uapi/asm-generic/mman-common.h:49: note: this is the location of the previous definition 49 | #define MADV_DONTNEED 4 /* don't need these pages */ | vim +/MAP_FIXED +15 arch/alpha/include/uapi/asm/mman.h ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 12 746c9398f5ac2b arch/alpha/include/uapi/asm/mman.h Michael S. Tsirkin 2019-02-08 13 /* 0x01 - 0x03 are defined in linux/mman.h */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 14 #define MAP_TYPE 0x0f /* Mask for type of mapping (OSF/1 is _wrong_) */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @15 #define MAP_FIXED 0x100 /* Interpret addr exactly */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @16 #define MAP_ANONYMOUS 0x10 /* don't use a file */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 17 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 18 /* not used by linux, but here to make sure we don't clash with OSF/1 defines */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 19 #define _MAP_HASSEMAPHORE 0x0200 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 20 #define _MAP_INHERIT 0x0400 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 21 #define _MAP_UNALIGNED 0x0800 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 22 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 23 /* These are linux-specific */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 24 #define MAP_GROWSDOWN 0x01000 /* stack-like segment */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 25 #define MAP_DENYWRITE 0x02000 /* ETXTBSY */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 26 #define MAP_EXECUTABLE 0x04000 /* mark it as an executable */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 27 #define MAP_LOCKED 0x08000 /* lock the mapping */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 28 #define MAP_NORESERVE 0x10000 /* don't check for reservations */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @29 #define MAP_POPULATE 0x20000 /* populate (prefault) pagetables */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @30 #define MAP_NONBLOCK 0x40000 /* do not block on IO */ 90f72aa58bbf07 arch/alpha/include/asm/mman.h Arnd Bergmann 2009-09-21 @31 #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */ 90f72aa58bbf07 arch/alpha/include/asm/mman.h Arnd Bergmann 2009-09-21 @32 #define MAP_HUGETLB 0x100000 /* create a huge page mapping */ a4ff8e8620d3f4 arch/alpha/include/uapi/asm/mman.h Michal Hocko 2018-04-10 @33 #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 34 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 35 #define MS_ASYNC 1 /* sync memory asynchronously */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @36 #define MS_SYNC 2 /* synchronous memory sync */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @37 #define MS_INVALIDATE 4 /* invalidate the caches */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 38 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 39 #define MCL_CURRENT 8192 /* lock all currently mapped pages */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 40 #define MCL_FUTURE 16384 /* lock all additions to address space */ b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson 2015-11-05 41 #define MCL_ONFAULT 32768 /* lock all pages that are faulted in */ b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson 2015-11-05 42 b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson 2015-11-05 43 #define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 44 ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 45 #define MADV_NORMAL 0 /* no further special treatment */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 46 #define MADV_RANDOM 1 /* expect random page references */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 47 #define MADV_SEQUENTIAL 2 /* expect sequential page references */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 48 #define MADV_WILLNEED 3 /* will need these pages */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 49 #define MADV_SPACEAVAIL 5 /* ensure resources are available */ ^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @50 #define MADV_DONTNEED 6 /* don't need these pages */ 5f6164f3092832 include/asm-alpha/mman.h Michael S. Tsirkin 2006-02-15 51
Hi Nhat, Thank you for the patch! Yet something to improve: [auto build test ERROR on shuah-kselftest/next] [also build test ERROR on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall config: arm64-allyesconfig compiler: aarch64-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 prepare If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): In file included from include/uapi/linux/mman.h:5, from include/linux/fs.h:49, from arch/arm64/include/asm/elf.h:141, from include/linux/elf.h:6, from include/linux/module.h:19, from include/linux/device/driver.h:21, from include/linux/device.h:32, from include/linux/acpi.h:15, from include/acpi/apei.h:9, from include/acpi/ghes.h:5, from include/linux/arm_sdei.h:8, from arch/arm64/kernel/asm-offsets.c:10: arch/arm64/include/asm/mman.h: In function 'arch_calc_vm_prot_bits': >> arch/arm64/include/asm/mman.h:15:24: error: 'VM_ARM64_BTI' undeclared (first use in this function); did you mean 'ARM64_BTI'? 15 | ret |= VM_ARM64_BTI; | ^~~~~~~~~~~~ | ARM64_BTI arch/arm64/include/asm/mman.h:15:24: note: each undeclared identifier is reported only once for each function it appears in >> arch/arm64/include/asm/mman.h:18:24: error: 'VM_MTE' undeclared (first use in this function) 18 | ret |= VM_MTE; | ^~~~~~ arch/arm64/include/asm/mman.h: In function 'arch_calc_vm_flag_bits': >> arch/arm64/include/asm/mman.h:32:24: error: 'VM_MTE_ALLOWED' undeclared (first use in this function) 32 | return VM_MTE_ALLOWED; | ^~~~~~~~~~~~~~ arch/arm64/include/asm/mman.h: In function 'arch_validate_flags': arch/arm64/include/asm/mman.h:59:29: error: 'VM_MTE' undeclared (first use in this function) 59 | return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED); | ^~~~~~ arch/arm64/include/asm/mman.h:59:52: error: 'VM_MTE_ALLOWED' undeclared (first use in this function) 59 | return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED); | ^~~~~~~~~~~~~~ make[2]: *** [scripts/Makefile.build:118: arch/arm64/kernel/asm-offsets.s] Error 1 make[2]: Target 'prepare' not remade because of errors. make[1]: *** [Makefile:1270: prepare0] Error 2 make[1]: Target 'prepare' not remade because of errors. make: *** [Makefile:231: __sub-make] Error 2 make: Target 'prepare' not remade because of errors. vim +15 arch/arm64/include/asm/mman.h 8ef8f360cf30be Dave Martin 2020-03-16 8 8ef8f360cf30be Dave Martin 2020-03-16 9 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, 8ef8f360cf30be Dave Martin 2020-03-16 10 unsigned long pkey __always_unused) 8ef8f360cf30be Dave Martin 2020-03-16 11 { 9f3419315f3cdc Catalin Marinas 2019-11-27 12 unsigned long ret = 0; 9f3419315f3cdc Catalin Marinas 2019-11-27 13 8ef8f360cf30be Dave Martin 2020-03-16 14 if (system_supports_bti() && (prot & PROT_BTI)) 9f3419315f3cdc Catalin Marinas 2019-11-27 @15 ret |= VM_ARM64_BTI; 8ef8f360cf30be Dave Martin 2020-03-16 16 9f3419315f3cdc Catalin Marinas 2019-11-27 17 if (system_supports_mte() && (prot & PROT_MTE)) 9f3419315f3cdc Catalin Marinas 2019-11-27 @18 ret |= VM_MTE; 9f3419315f3cdc Catalin Marinas 2019-11-27 19 9f3419315f3cdc Catalin Marinas 2019-11-27 20 return ret; 8ef8f360cf30be Dave Martin 2020-03-16 21 } 8ef8f360cf30be Dave Martin 2020-03-16 22 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey) 8ef8f360cf30be Dave Martin 2020-03-16 23 9f3419315f3cdc Catalin Marinas 2019-11-27 24 static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) 9f3419315f3cdc Catalin Marinas 2019-11-27 25 { 9f3419315f3cdc Catalin Marinas 2019-11-27 26 /* 9f3419315f3cdc Catalin Marinas 2019-11-27 27 * Only allow MTE on anonymous mappings as these are guaranteed to be 9f3419315f3cdc Catalin Marinas 2019-11-27 28 * backed by tags-capable memory. The vm_flags may be overridden by a 9f3419315f3cdc Catalin Marinas 2019-11-27 29 * filesystem supporting MTE (RAM-based). 9f3419315f3cdc Catalin Marinas 2019-11-27 30 */ 9f3419315f3cdc Catalin Marinas 2019-11-27 31 if (system_supports_mte() && (flags & MAP_ANONYMOUS)) 9f3419315f3cdc Catalin Marinas 2019-11-27 @32 return VM_MTE_ALLOWED; 9f3419315f3cdc Catalin Marinas 2019-11-27 33 9f3419315f3cdc Catalin Marinas 2019-11-27 34 return 0; 9f3419315f3cdc Catalin Marinas 2019-11-27 35 } 9f3419315f3cdc Catalin Marinas 2019-11-27 36 #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags) 9f3419315f3cdc Catalin Marinas 2019-11-27 37
On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote: > Implement a new syscall that queries cache state of a file and > summarizes the number of cached pages, number of dirty pages, number of > pages marked for writeback, number of (recently) evicted pages, etc. in > a given range. > > NAME > cachestat - query the page cache status of a file. > > SYNOPSIS > #include <sys/mman.h> > > struct cachestat { > __u64 nr_cache; > __u64 nr_dirty; > __u64 nr_writeback; > __u64 nr_evicted; > __u64 nr_recently_evicted; > }; > > int cachestat(unsigned int fd, off_t off, size_t len, > size_t cstat_size, struct cachestat *cstat, > unsigned int flags); > > DESCRIPTION > cachestat() queries the number of cached pages, number of dirty > pages, number of pages marked for writeback, number of (recently) > evicted pages, in the bytes range given by `off` and `len`. > > These values are returned in a cachestat struct, whose address is > given by the `cstat` argument. > > The `off` and `len` arguments must be non-negative integers. If > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > 0, we will query in the range from `off` to the end of the file. > > `cstat_size` allows users to obtain partial results. The syscall > will copy the first `csstat_size` bytes to the specified userspace > memory. `cstat_size` must be a non-negative value that is no larger > than the current size of the cachestat struct. > > The `flags` argument is unused for now, but is included for future > extensibility. User should pass 0 (i.e no flag specified). > > RETURN VALUE > On success, cachestat returns 0. On error, -1 is returned, and errno > is set to indicate the error. > > ERRORS > EFAULT cstat points to an invalid address. > > EINVAL invalid `cstat_size` or `flags` > > EBADF invalid file descriptor. > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > --- Hi Nhat, Thanks for the tweaks. FWIW, this by and large looks reasonable to me. Just a couple more random nitty things, if you happen to care about them.. > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > arch/arm/tools/syscall.tbl | 1 + > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > arch/s390/kernel/syscalls/syscall.tbl | 1 + > arch/sh/kernel/syscalls/syscall.tbl | 1 + > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > include/linux/fs.h | 3 + > include/linux/syscalls.h | 3 + > include/uapi/asm-generic/unistd.h | 5 +- > include/uapi/linux/mman.h | 9 ++ > init/Kconfig | 10 ++ > kernel/sys_ni.c | 1 + > mm/filemap.c | 137 ++++++++++++++++++++ > 20 files changed, 180 insertions(+), 1 deletion(-) > ... > diff --git a/mm/filemap.c b/mm/filemap.c > index 08341616ae7a..29ffb906caa4 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c ... > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp) > return try_to_free_buffers(folio); > } > EXPORT_SYMBOL(filemap_release_folio); > + > +#ifdef CONFIG_CACHESTAT > +/** > + * filemap_cachestat() - compute the page cache statistics of a mapping > + * @mapping: The mapping to compute the statistics for. > + * @first_index: The starting page cache index. > + * @last_index: The final page index (inclusive). > + * @cs: the cachestat struct to write the result to. > + * > + * This will query the page cache statistics of a mapping in the > + * page range of [first_index, last_index] (inclusive). THe statistics > + * queried include: number of dirty pages, number of pages marked for > + * writeback, and the number of (recently) evicted pages. > + */ > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index, > + pgoff_t last_index, struct cachestat *cs) > +{ Not sure why the internal helper needs to be wrapped in a config option? Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just wrap the syscall bits with that..? I would also think it might make things simpler to split out all the syscall bits into a separate patch from filemap_cachestat(). > + XA_STATE(xas, &mapping->i_pages, first_index); > + struct folio *folio; > + > + rcu_read_lock(); > + xas_for_each(&xas, folio, last_index) { > + unsigned long nr_pages; > + pgoff_t folio_first_index, folio_last_index; > + > + if (xas_retry(&xas, folio)) > + continue; > + > + nr_pages = folio_nr_pages(folio); > + folio_first_index = folio_pgoff(folio); > + folio_last_index = folio_first_index + nr_pages - 1; > + > + /* Folios might straddle the range boundaries, only count covered subpages */ > + if (folio_first_index < first_index) > + nr_pages -= first_index - folio_first_index; > + > + if (folio_last_index > last_index) > + nr_pages -= folio_last_index - last_index; > + > + if (xa_is_value(folio)) { > + /* page is evicted */ > + void *shadow = (void *)folio; > + bool workingset; /* not used */ > + > + cs->nr_evicted += nr_pages; > + > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */ > + if (shmem_mapping(mapping)) { > + /* shmem file - in swap cache */ > + swp_entry_t swp = radix_to_swp_entry(folio); > + > + shadow = get_shadow_from_swap_cache(swp); > + } > +#endif > + if (workingset_test_recent(shadow, true, &workingset)) > + cs->nr_recently_evicted += nr_pages; > + > + continue; > + } > + > + /* page is in cache */ > + cs->nr_cache += nr_pages; > + > + if (folio_test_dirty(folio)) > + cs->nr_dirty += nr_pages; > + > + if (folio_test_writeback(folio)) > + cs->nr_writeback += nr_pages; I'm not sure this is an issue right now (or if it will ever be for your use cases), but from a filesystem perspective it is possible to have large or variable sized folios in cache. At the moment I think the fs (or XFS+iomap at least) manages dirty/writeback state at folio granularity, but that may change in the near future if/when iomap sub-page dirty tracking comes along. I suspect that means it may become possible to have a large folio of some N number of pages where only a subset of those pages are actually in dirty/writeback state, and thus introduces some inaccuracy here because this assumes that folio state applies to folio_nr_pages() worth of pages. Just something to be aware of.. Brian > + } > + rcu_read_unlock(); > +} > +EXPORT_SYMBOL(filemap_cachestat); > + > +/* > + * The cachestat(5) system call. > + * > + * cachestat() returns the page cache statistics of a file in the > + * bytes specified by `off` and `len`: number of cached pages, > + * number of dirty pages, number of pages marked for writeback, > + * number of (recently) evicted pages. > + * > + * `off` and `len` must be non-negative integers. If `len` > 0, > + * the queried range is [`off`, `off` + `len`]. If `len` == 0, > + * we will query in the range from `off` to the end of the file. > + * > + * `cstat_size` allows users to obtain partial results. The syscall > + * will copy the first `csstat_size` bytes to the specified userspace > + * memory. It also makes the cachestat struct extensible - new fields > + * can be added in the future without breaking existing usage. > + * `cstat_size` must be a non-negative value that is no larger than > + * the current size of the cachestat struct. > + * > + * The `flags` argument is unused for now, but is included for future > + * extensibility. User should pass 0 (i.e no flag specified). > + * > + * Because the status of a page can change after cachestat() checks it > + * but before it returns to the application, the returned values may > + * contain stale information. > + * > + * return values: > + * zero - success > + * -EFAULT - cstat points to an illegal address > + * -EINVAL - invalid arguments > + * -EBADF - invalid file descriptor > + */ > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len, > + size_t, cstat_size, struct cachestat __user *, cstat, > + unsigned int, flags) > +{ > + struct fd f = fdget(fd); > + struct address_space *mapping; > + struct cachestat cs; > + pgoff_t first_index = off >> PAGE_SHIFT; > + pgoff_t last_index = > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT; > + > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0) > + return -EINVAL; > + > + if (!f.file) > + return -EBADF; > + > + memset(&cs, 0, sizeof(struct cachestat)); > + mapping = f.file->f_mapping; > + filemap_cachestat(mapping, first_index, last_index, &cs); > + fdput(f); > + > + if (copy_to_user(cstat, &cs, cstat_size)) > + return -EFAULT; > + > + return 0; > +} > +#endif /* CONFIG_CACHESTAT */ > -- > 2.30.2 >
On Fri, Dec 16, 2022 at 1:48 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > > > Implement a new syscall that queries cache state of a file and > > summarizes the number of cached pages, number of dirty pages, number of > > pages marked for writeback, number of (recently) evicted pages, etc. in > > a given range. > > > > NAME > > cachestat - query the page cache status of a file. > > > > SYNOPSIS > > #include <sys/mman.h> > > > > struct cachestat { > > __u64 nr_cache; > > __u64 nr_dirty; > > __u64 nr_writeback; > > __u64 nr_evicted; > > __u64 nr_recently_evicted; > > }; > > > > int cachestat(unsigned int fd, off_t off, size_t len, > > size_t cstat_size, struct cachestat *cstat, > > unsigned int flags); > > > > DESCRIPTION > > cachestat() queries the number of cached pages, number of dirty > > pages, number of pages marked for writeback, number of (recently) > > evicted pages, in the bytes range given by `off` and `len`. > > I suggest this be spelled out better: "number of evicted and number or > recently evicted pages". > > I suggest this clearly tell readers what an "evicted" page is - they > aren't kernel programmers! Valid points - I'll try to explain this more clearly in the future versions of this patch series, especially in the draft man page. > > What is the benefit of the "recently evicted" pages? "recently" seems > very vague - what use is this to anyone? This eviction recency semantics comes from the LRU's refault computation. Users of cachestat might be interested in two very different questions: 1. How many pages are not resident in the page cache. 2. How many pages are recently evicted (recently enough that their refault will be construed as memory pressure). The first question is answered with nr_evicted, whereas the second is answered with nr_recently_evicted. I will figure out a way to explain this better in the next version. Users definitely do not need to know the nitty gritty details of LRU logic, but they should know the general idea of each field at least. > > > These values are returned in a cachestat struct, whose address is > > given by the `cstat` argument. > > > > The `off` and `len` arguments must be non-negative integers. If > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > > 0, we will query in the range from `off` to the end of the file. > > > > `cstat_size` allows users to obtain partial results. The syscall > > will copy the first `csstat_size` bytes to the specified userspace > > memory. `cstat_size` must be a non-negative value that is no larger > > than the current size of the cachestat struct. > > > > The `flags` argument is unused for now, but is included for future > > extensibility. User should pass 0 (i.e no flag specified). > > Why is `flags' here? We could add an unused flags arg to any syscall, > but we don't. What's the plan? I included this field to ensure that cachestat can be extended safely, especially when different users might want different things out of it. For instance, in the future there might be new fields/computations that are too heavy for certain use cases - a flag could be used to disable/skip such fields/computations. Another thing it might be used for is the huge page counting - we have not implemented this in this version yet, but it might introduce murky semantics to new/existing fields in struct cachestat. Or maybe not - but worst case scenario we can leave this decision to the users to decide through flags. I'm sure there are more potential pitfalls that the flags could save us from, but these are the two on top of my head. > > Are there security implications? If I know that some process has a > file open, I can use cachestat() to infer which parts of that file > they're looking at (like mincore(), I guess). And I can infer which > parts they're writing to, unlike mincore(). This one, I'm not 100% sure, but it is a valid concern. Let me think about it and discuss with more security-oriented minds before responding to this. > > I suggest the [patch 1/4] fixup be separated from this series. Sounds good! I'll loop Johannes in about this breakup as well.
On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <bfoster@redhat.com> wrote: > > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote: > > Implement a new syscall that queries cache state of a file and > > summarizes the number of cached pages, number of dirty pages, number of > > pages marked for writeback, number of (recently) evicted pages, etc. in > > a given range. > > > > NAME > > cachestat - query the page cache status of a file. > > > > SYNOPSIS > > #include <sys/mman.h> > > > > struct cachestat { > > __u64 nr_cache; > > __u64 nr_dirty; > > __u64 nr_writeback; > > __u64 nr_evicted; > > __u64 nr_recently_evicted; > > }; > > > > int cachestat(unsigned int fd, off_t off, size_t len, > > size_t cstat_size, struct cachestat *cstat, > > unsigned int flags); > > > > DESCRIPTION > > cachestat() queries the number of cached pages, number of dirty > > pages, number of pages marked for writeback, number of (recently) > > evicted pages, in the bytes range given by `off` and `len`. > > > > These values are returned in a cachestat struct, whose address is > > given by the `cstat` argument. > > > > The `off` and `len` arguments must be non-negative integers. If > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > > 0, we will query in the range from `off` to the end of the file. > > > > `cstat_size` allows users to obtain partial results. The syscall > > will copy the first `csstat_size` bytes to the specified userspace > > memory. `cstat_size` must be a non-negative value that is no larger > > than the current size of the cachestat struct. > > > > The `flags` argument is unused for now, but is included for future > > extensibility. User should pass 0 (i.e no flag specified). > > > > RETURN VALUE > > On success, cachestat returns 0. On error, -1 is returned, and errno > > is set to indicate the error. > > > > ERRORS > > EFAULT cstat points to an invalid address. > > > > EINVAL invalid `cstat_size` or `flags` > > > > EBADF invalid file descriptor. > > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > --- > > Hi Nhat, > > Thanks for the tweaks. FWIW, this by and large looks reasonable to me. Thanks for suggesting the tweaks! I tried refactoring it out and it did look cleaner to me (and one fewer file that I have to keep track of...) > Just a couple more random nitty things, if you happen to care about > them.. > > > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > > arch/arm/tools/syscall.tbl | 1 + > > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > > arch/s390/kernel/syscalls/syscall.tbl | 1 + > > arch/sh/kernel/syscalls/syscall.tbl | 1 + > > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > > include/linux/fs.h | 3 + > > include/linux/syscalls.h | 3 + > > include/uapi/asm-generic/unistd.h | 5 +- > > include/uapi/linux/mman.h | 9 ++ > > init/Kconfig | 10 ++ > > kernel/sys_ni.c | 1 + > > mm/filemap.c | 137 ++++++++++++++++++++ > > 20 files changed, 180 insertions(+), 1 deletion(-) > > > ... > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 08341616ae7a..29ffb906caa4 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > ... > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp) > > return try_to_free_buffers(folio); > > } > > EXPORT_SYMBOL(filemap_release_folio); > > + > > +#ifdef CONFIG_CACHESTAT > > +/** > > + * filemap_cachestat() - compute the page cache statistics of a mapping > > + * @mapping: The mapping to compute the statistics for. > > + * @first_index: The starting page cache index. > > + * @last_index: The final page index (inclusive). > > + * @cs: the cachestat struct to write the result to. > > + * > > + * This will query the page cache statistics of a mapping in the > > + * page range of [first_index, last_index] (inclusive). THe statistics > > + * queried include: number of dirty pages, number of pages marked for > > + * writeback, and the number of (recently) evicted pages. > > + */ > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index, > > + pgoff_t last_index, struct cachestat *cs) > > +{ > > Not sure why the internal helper needs to be wrapped in a config option? > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just > wrap the syscall bits with that..? No particularly strong reasons - I was just bundling the two together because I was not entirely sure if it's interesting or has any use case outside of the syscall itself. Do you have something in mind? > > I would also think it might make things simpler to split out all the > syscall bits into a separate patch from filemap_cachestat(). Same as above. > > > + XA_STATE(xas, &mapping->i_pages, first_index); > > + struct folio *folio; > > + > > + rcu_read_lock(); > > + xas_for_each(&xas, folio, last_index) { > > + unsigned long nr_pages; > > + pgoff_t folio_first_index, folio_last_index; > > + > > + if (xas_retry(&xas, folio)) > > + continue; > > + > > + nr_pages = folio_nr_pages(folio); > > + folio_first_index = folio_pgoff(folio); > > + folio_last_index = folio_first_index + nr_pages - 1; > > + > > + /* Folios might straddle the range boundaries, only count covered subpages */ > > + if (folio_first_index < first_index) > > + nr_pages -= first_index - folio_first_index; > > + > > + if (folio_last_index > last_index) > > + nr_pages -= folio_last_index - last_index; > > + > > + if (xa_is_value(folio)) { > > + /* page is evicted */ > > + void *shadow = (void *)folio; > > + bool workingset; /* not used */ > > + > > + cs->nr_evicted += nr_pages; > > + > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */ > > + if (shmem_mapping(mapping)) { > > + /* shmem file - in swap cache */ > > + swp_entry_t swp = radix_to_swp_entry(folio); > > + > > + shadow = get_shadow_from_swap_cache(swp); > > + } > > +#endif > > + if (workingset_test_recent(shadow, true, &workingset)) > > + cs->nr_recently_evicted += nr_pages; > > + > > + continue; > > + } > > + > > + /* page is in cache */ > > + cs->nr_cache += nr_pages; > > + > > + if (folio_test_dirty(folio)) > > + cs->nr_dirty += nr_pages; > > + > > + if (folio_test_writeback(folio)) > > + cs->nr_writeback += nr_pages; > > I'm not sure this is an issue right now (or if it will ever be for your > use cases), but from a filesystem perspective it is possible to have > large or variable sized folios in cache. At the moment I think the fs > (or XFS+iomap at least) manages dirty/writeback state at folio > granularity, but that may change in the near future if/when iomap > sub-page dirty tracking comes along. I suspect that means it may become > possible to have a large folio of some N number of pages where only a > subset of those pages are actually in dirty/writeback state, and thus > introduces some inaccuracy here because this assumes that folio state > applies to folio_nr_pages() worth of pages. Just something to be aware > of.. > > Brian Oof, I coded this with the mental framework of folio-as-a-unit, and assumed that the dirty/writeback state is managed at the granularity of folio. Thanks for bringing this up Brian! I'll have to watch out for this as iomap evolves (and the subpage tracking becomes a thing). > > > + } > > + rcu_read_unlock(); > > +} > > +EXPORT_SYMBOL(filemap_cachestat); > > + > > +/* > > + * The cachestat(5) system call. > > + * > > + * cachestat() returns the page cache statistics of a file in the > > + * bytes specified by `off` and `len`: number of cached pages, > > + * number of dirty pages, number of pages marked for writeback, > > + * number of (recently) evicted pages. > > + * > > + * `off` and `len` must be non-negative integers. If `len` > 0, > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0, > > + * we will query in the range from `off` to the end of the file. > > + * > > + * `cstat_size` allows users to obtain partial results. The syscall > > + * will copy the first `csstat_size` bytes to the specified userspace > > + * memory. It also makes the cachestat struct extensible - new fields > > + * can be added in the future without breaking existing usage. > > + * `cstat_size` must be a non-negative value that is no larger than > > + * the current size of the cachestat struct. > > + * > > + * The `flags` argument is unused for now, but is included for future > > + * extensibility. User should pass 0 (i.e no flag specified). > > + * > > + * Because the status of a page can change after cachestat() checks it > > + * but before it returns to the application, the returned values may > > + * contain stale information. > > + * > > + * return values: > > + * zero - success > > + * -EFAULT - cstat points to an illegal address > > + * -EINVAL - invalid arguments > > + * -EBADF - invalid file descriptor > > + */ > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len, > > + size_t, cstat_size, struct cachestat __user *, cstat, > > + unsigned int, flags) > > +{ > > + struct fd f = fdget(fd); > > + struct address_space *mapping; > > + struct cachestat cs; > > + pgoff_t first_index = off >> PAGE_SHIFT; > > + pgoff_t last_index = > > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT; > > + > > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0) > > + return -EINVAL; > > + > > + if (!f.file) > > + return -EBADF; > > + > > + memset(&cs, 0, sizeof(struct cachestat)); > > + mapping = f.file->f_mapping; > > + filemap_cachestat(mapping, first_index, last_index, &cs); > > + fdput(f); > > + > > + if (copy_to_user(cstat, &cs, cstat_size)) > > + return -EFAULT; > > + > > + return 0; > > +} > > +#endif /* CONFIG_CACHESTAT */ > > -- > > 2.30.2 > > >
On Fri, Dec 16, 2022 at 2:27 PM kernel test robot <lkp@intel.com> wrote: > > Hi Nhat, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on shuah-kselftest/next] > [also build test ERROR on akpm-mm/mm-everything linus/master v6.1 next-20221216] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 > base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next > patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com > patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall > config: s390-allmodconfig > compiler: s390-linux-gcc (GCC) 12.1.0 > reproduce (this is a W=1 build): > wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross > chmod +x ~/bin/make.cross > # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18 > git remote add linux-review https://github.com/intel-lab-lkp/linux > git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254 > git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18 > # save the config file > mkdir build_dir && cp config build_dir/.config > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/ > > If you fix the issue, kindly add following tag where applicable > | Reported-by: kernel test robot <lkp@intel.com> > > All error/warnings (new ones prefixed by >>): > > In file included from include/uapi/asm-generic/mman.h:5, > from ./arch/s390/include/generated/uapi/asm/mman.h:1, > from include/uapi/linux/mman.h:5, > from include/linux/fs.h:49, > from include/linux/huge_mm.h:8, > from include/linux/mm.h:737, > from include/linux/kvm_host.h:16, > from arch/s390/kvm/kvm-s390.h:17, > from arch/s390/kvm/gaccess.c:16: > >> include/uapi/asm-generic/mman-common.h:16:25: error: expected identifier before numeric constant > 16 | #define PROT_NONE 0x0 /* page can not be accessed */ > | ^~~ > arch/s390/kvm/gaccess.c:493:9: note: in expansion of macro 'PROT_NONE' > 493 | PROT_NONE, > | ^~~~~~~~~ > arch/s390/kvm/gaccess.c: In function 'trans_exc_ending': > >> arch/s390/kvm/gaccess.c:516:17: error: duplicate case value > 516 | case PROT_TYPE_LA: > | ^~~~ > arch/s390/kvm/gaccess.c:509:17: note: previously used here > 509 | case PROT_NONE: > | ^~~~ > In file included from include/linux/compiler_types.h:79, > from <command-line>: > >> include/linux/compiler_attributes.h:223:41: warning: attribute 'fallthrough' not preceding a case label or default label > 223 | # define fallthrough __attribute__((__fallthrough__)) > | ^~~~~~~~~~~~~ > arch/s390/kvm/gaccess.c:515:25: note: in expansion of macro 'fallthrough' > 515 | fallthrough; > | ^~~~~~~~~~~ > > > vim +16 include/uapi/asm-generic/mman-common.h > > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 9 > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 10 #define PROT_READ 0x1 /* page can be read */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 11 #define PROT_WRITE 0x2 /* page can be written */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 12 #define PROT_EXEC 0x4 /* page can be executed */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 13 #define PROT_SEM 0x8 /* page may be used for atomic ops */ > d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 14 /* 0x10 reserved for arch-specific use */ > d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 15 /* 0x20 reserved for arch-specific use */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 @16 #define PROT_NONE 0x0 /* page can not be accessed */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 17 #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 18 #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ > 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 19 > > -- > 0-DAY CI Kernel Test Service > https://01.org/lkp These build errors (and the ones reported by the later kernel test robot emails) seem to come from my refactoring, which adds: #include <uapi/linux/mman.h> to include/linux/fs.h. Several other files (which chain-include fs.h) also define several same constants elsewhere. This include seems unnecessary, as I can just replace it with: struct cachestat; This should be fixed in the next version of the patch series.
On Thu, Dec 22, 2022 at 01:50:19PM -0800, Nhat Pham wrote: > On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <bfoster@redhat.com> wrote: > > > > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote: > > > Implement a new syscall that queries cache state of a file and > > > summarizes the number of cached pages, number of dirty pages, number of > > > pages marked for writeback, number of (recently) evicted pages, etc. in > > > a given range. > > > > > > NAME > > > cachestat - query the page cache status of a file. > > > > > > SYNOPSIS > > > #include <sys/mman.h> > > > > > > struct cachestat { > > > __u64 nr_cache; > > > __u64 nr_dirty; > > > __u64 nr_writeback; > > > __u64 nr_evicted; > > > __u64 nr_recently_evicted; > > > }; > > > > > > int cachestat(unsigned int fd, off_t off, size_t len, > > > size_t cstat_size, struct cachestat *cstat, > > > unsigned int flags); > > > > > > DESCRIPTION > > > cachestat() queries the number of cached pages, number of dirty > > > pages, number of pages marked for writeback, number of (recently) > > > evicted pages, in the bytes range given by `off` and `len`. > > > > > > These values are returned in a cachestat struct, whose address is > > > given by the `cstat` argument. > > > > > > The `off` and `len` arguments must be non-negative integers. If > > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > > > 0, we will query in the range from `off` to the end of the file. > > > > > > `cstat_size` allows users to obtain partial results. The syscall > > > will copy the first `csstat_size` bytes to the specified userspace > > > memory. `cstat_size` must be a non-negative value that is no larger > > > than the current size of the cachestat struct. > > > > > > The `flags` argument is unused for now, but is included for future > > > extensibility. User should pass 0 (i.e no flag specified). > > > > > > RETURN VALUE > > > On success, cachestat returns 0. On error, -1 is returned, and errno > > > is set to indicate the error. > > > > > > ERRORS > > > EFAULT cstat points to an invalid address. > > > > > > EINVAL invalid `cstat_size` or `flags` > > > > > > EBADF invalid file descriptor. > > > > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > > --- > > > > Hi Nhat, > > > > Thanks for the tweaks. FWIW, this by and large looks reasonable to me. > > Thanks for suggesting the tweaks! I tried refactoring it out and it > did look cleaner to me (and one fewer file that I have to keep > track of...) > > > Just a couple more random nitty things, if you happen to care about > > them.. > > > > > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > > > arch/arm/tools/syscall.tbl | 1 + > > > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > > > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > > > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > > > arch/s390/kernel/syscalls/syscall.tbl | 1 + > > > arch/sh/kernel/syscalls/syscall.tbl | 1 + > > > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > > > include/linux/fs.h | 3 + > > > include/linux/syscalls.h | 3 + > > > include/uapi/asm-generic/unistd.h | 5 +- > > > include/uapi/linux/mman.h | 9 ++ > > > init/Kconfig | 10 ++ > > > kernel/sys_ni.c | 1 + > > > mm/filemap.c | 137 ++++++++++++++++++++ > > > 20 files changed, 180 insertions(+), 1 deletion(-) > > > > > ... > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > index 08341616ae7a..29ffb906caa4 100644 > > > --- a/mm/filemap.c > > > +++ b/mm/filemap.c > > ... > > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp) > > > return try_to_free_buffers(folio); > > > } > > > EXPORT_SYMBOL(filemap_release_folio); > > > + > > > +#ifdef CONFIG_CACHESTAT > > > +/** > > > + * filemap_cachestat() - compute the page cache statistics of a mapping > > > + * @mapping: The mapping to compute the statistics for. > > > + * @first_index: The starting page cache index. > > > + * @last_index: The final page index (inclusive). > > > + * @cs: the cachestat struct to write the result to. > > > + * > > > + * This will query the page cache statistics of a mapping in the > > > + * page range of [first_index, last_index] (inclusive). THe statistics > > > + * queried include: number of dirty pages, number of pages marked for > > > + * writeback, and the number of (recently) evicted pages. > > > + */ > > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index, > > > + pgoff_t last_index, struct cachestat *cs) > > > +{ > > > > Not sure why the internal helper needs to be wrapped in a config option? > > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just > > wrap the syscall bits with that..? > > No particularly strong reasons - I was just bundling the two together > because I was not entirely sure if it's interesting or has any use > case outside of the syscall itself. > > Do you have something in mind? > Not immediately, though it looks like a straightforward and potentially useful helper. It wouldn't surprise me if it grew more in-kernel users eventually. :) That said, I only suggested this for cleanup/consistency reasons. I.e., it seems there is plenty of precedent for CONFIG_*_SYSCALL config options, and the one or two I happened to peek at looked like they only wrapped the syscall bits as opposed to the entire implementation (which also seems like a clean and elegant approach to me). Somebody could easily come along and make that change later if/when they might want to use the helper (though it might be annoying to change the name of the config option), so I have no strong opinion on either suggestion. Brian > > > > I would also think it might make things simpler to split out all the > > syscall bits into a separate patch from filemap_cachestat(). > > Same as above. > > > > > > + XA_STATE(xas, &mapping->i_pages, first_index); > > > + struct folio *folio; > > > + > > > + rcu_read_lock(); > > > + xas_for_each(&xas, folio, last_index) { > > > + unsigned long nr_pages; > > > + pgoff_t folio_first_index, folio_last_index; > > > + > > > + if (xas_retry(&xas, folio)) > > > + continue; > > > + > > > + nr_pages = folio_nr_pages(folio); > > > + folio_first_index = folio_pgoff(folio); > > > + folio_last_index = folio_first_index + nr_pages - 1; > > > + > > > + /* Folios might straddle the range boundaries, only count covered subpages */ > > > + if (folio_first_index < first_index) > > > + nr_pages -= first_index - folio_first_index; > > > + > > > + if (folio_last_index > last_index) > > > + nr_pages -= folio_last_index - last_index; > > > + > > > + if (xa_is_value(folio)) { > > > + /* page is evicted */ > > > + void *shadow = (void *)folio; > > > + bool workingset; /* not used */ > > > + > > > + cs->nr_evicted += nr_pages; > > > + > > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */ > > > + if (shmem_mapping(mapping)) { > > > + /* shmem file - in swap cache */ > > > + swp_entry_t swp = radix_to_swp_entry(folio); > > > + > > > + shadow = get_shadow_from_swap_cache(swp); > > > + } > > > +#endif > > > + if (workingset_test_recent(shadow, true, &workingset)) > > > + cs->nr_recently_evicted += nr_pages; > > > + > > > + continue; > > > + } > > > + > > > + /* page is in cache */ > > > + cs->nr_cache += nr_pages; > > > + > > > + if (folio_test_dirty(folio)) > > > + cs->nr_dirty += nr_pages; > > > + > > > + if (folio_test_writeback(folio)) > > > + cs->nr_writeback += nr_pages; > > > > I'm not sure this is an issue right now (or if it will ever be for your > > use cases), but from a filesystem perspective it is possible to have > > large or variable sized folios in cache. At the moment I think the fs > > (or XFS+iomap at least) manages dirty/writeback state at folio > > granularity, but that may change in the near future if/when iomap > > sub-page dirty tracking comes along. I suspect that means it may become > > possible to have a large folio of some N number of pages where only a > > subset of those pages are actually in dirty/writeback state, and thus > > introduces some inaccuracy here because this assumes that folio state > > applies to folio_nr_pages() worth of pages. Just something to be aware > > of.. > > > > Brian > > Oof, I coded this with the mental framework of folio-as-a-unit, and > assumed that the dirty/writeback state is managed at the granularity of folio. > Thanks for bringing this up Brian! I'll have to watch out for this as iomap > evolves (and the subpage tracking becomes a thing). > > > > > > + } > > > + rcu_read_unlock(); > > > +} > > > +EXPORT_SYMBOL(filemap_cachestat); > > > + > > > +/* > > > + * The cachestat(5) system call. > > > + * > > > + * cachestat() returns the page cache statistics of a file in the > > > + * bytes specified by `off` and `len`: number of cached pages, > > > + * number of dirty pages, number of pages marked for writeback, > > > + * number of (recently) evicted pages. > > > + * > > > + * `off` and `len` must be non-negative integers. If `len` > 0, > > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0, > > > + * we will query in the range from `off` to the end of the file. > > > + * > > > + * `cstat_size` allows users to obtain partial results. The syscall > > > + * will copy the first `csstat_size` bytes to the specified userspace > > > + * memory. It also makes the cachestat struct extensible - new fields > > > + * can be added in the future without breaking existing usage. > > > + * `cstat_size` must be a non-negative value that is no larger than > > > + * the current size of the cachestat struct. > > > + * > > > + * The `flags` argument is unused for now, but is included for future > > > + * extensibility. User should pass 0 (i.e no flag specified). > > > + * > > > + * Because the status of a page can change after cachestat() checks it > > > + * but before it returns to the application, the returned values may > > > + * contain stale information. > > > + * > > > + * return values: > > > + * zero - success > > > + * -EFAULT - cstat points to an illegal address > > > + * -EINVAL - invalid arguments > > > + * -EBADF - invalid file descriptor > > > + */ > > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len, > > > + size_t, cstat_size, struct cachestat __user *, cstat, > > > + unsigned int, flags) > > > +{ > > > + struct fd f = fdget(fd); > > > + struct address_space *mapping; > > > + struct cachestat cs; > > > + pgoff_t first_index = off >> PAGE_SHIFT; > > > + pgoff_t last_index = > > > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT; > > > + > > > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0) > > > + return -EINVAL; > > > + > > > + if (!f.file) > > > + return -EBADF; > > > + > > > + memset(&cs, 0, sizeof(struct cachestat)); > > > + mapping = f.file->f_mapping; > > > + filemap_cachestat(mapping, first_index, last_index, &cs); > > > + fdput(f); > > > + > > > + if (copy_to_user(cstat, &cs, cstat_size)) > > > + return -EFAULT; > > > + > > > + return 0; > > > +} > > > +#endif /* CONFIG_CACHESTAT */ > > > -- > > > 2.30.2 > > > > > >
On Tue, Jan 3, 2023 at 10:42 AM Brian Foster <bfoster@redhat.com> wrote: > > On Thu, Dec 22, 2022 at 01:50:19PM -0800, Nhat Pham wrote: > > On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <bfoster@redhat.com> wrote: > > > > > > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote: > > > > Implement a new syscall that queries cache state of a file and > > > > summarizes the number of cached pages, number of dirty pages, number of > > > > pages marked for writeback, number of (recently) evicted pages, etc. in > > > > a given range. > > > > > > > > NAME > > > > cachestat - query the page cache status of a file. > > > > > > > > SYNOPSIS > > > > #include <sys/mman.h> > > > > > > > > struct cachestat { > > > > __u64 nr_cache; > > > > __u64 nr_dirty; > > > > __u64 nr_writeback; > > > > __u64 nr_evicted; > > > > __u64 nr_recently_evicted; > > > > }; > > > > > > > > int cachestat(unsigned int fd, off_t off, size_t len, > > > > size_t cstat_size, struct cachestat *cstat, > > > > unsigned int flags); > > > > > > > > DESCRIPTION > > > > cachestat() queries the number of cached pages, number of dirty > > > > pages, number of pages marked for writeback, number of (recently) > > > > evicted pages, in the bytes range given by `off` and `len`. > > > > > > > > These values are returned in a cachestat struct, whose address is > > > > given by the `cstat` argument. > > > > > > > > The `off` and `len` arguments must be non-negative integers. If > > > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > > > > 0, we will query in the range from `off` to the end of the file. > > > > > > > > `cstat_size` allows users to obtain partial results. The syscall > > > > will copy the first `csstat_size` bytes to the specified userspace > > > > memory. `cstat_size` must be a non-negative value that is no larger > > > > than the current size of the cachestat struct. > > > > > > > > The `flags` argument is unused for now, but is included for future > > > > extensibility. User should pass 0 (i.e no flag specified). > > > > > > > > RETURN VALUE > > > > On success, cachestat returns 0. On error, -1 is returned, and errno > > > > is set to indicate the error. > > > > > > > > ERRORS > > > > EFAULT cstat points to an invalid address. > > > > > > > > EINVAL invalid `cstat_size` or `flags` > > > > > > > > EBADF invalid file descriptor. > > > > > > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > > > --- > > > > > > Hi Nhat, > > > > > > Thanks for the tweaks. FWIW, this by and large looks reasonable to me. > > > > Thanks for suggesting the tweaks! I tried refactoring it out and it > > did look cleaner to me (and one fewer file that I have to keep > > track of...) > > > > > Just a couple more random nitty things, if you happen to care about > > > them.. > > > > > > > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > > > > arch/arm/tools/syscall.tbl | 1 + > > > > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > > > > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > > > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > > > > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > > > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > > > > arch/s390/kernel/syscalls/syscall.tbl | 1 + > > > > arch/sh/kernel/syscalls/syscall.tbl | 1 + > > > > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > > > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > > > > include/linux/fs.h | 3 + > > > > include/linux/syscalls.h | 3 + > > > > include/uapi/asm-generic/unistd.h | 5 +- > > > > include/uapi/linux/mman.h | 9 ++ > > > > init/Kconfig | 10 ++ > > > > kernel/sys_ni.c | 1 + > > > > mm/filemap.c | 137 ++++++++++++++++++++ > > > > 20 files changed, 180 insertions(+), 1 deletion(-) > > > > > > > ... > > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > > index 08341616ae7a..29ffb906caa4 100644 > > > > --- a/mm/filemap.c > > > > +++ b/mm/filemap.c > > > ... > > > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp) > > > > return try_to_free_buffers(folio); > > > > } > > > > EXPORT_SYMBOL(filemap_release_folio); > > > > + > > > > +#ifdef CONFIG_CACHESTAT > > > > +/** > > > > + * filemap_cachestat() - compute the page cache statistics of a mapping > > > > + * @mapping: The mapping to compute the statistics for. > > > > + * @first_index: The starting page cache index. > > > > + * @last_index: The final page index (inclusive). > > > > + * @cs: the cachestat struct to write the result to. > > > > + * > > > > + * This will query the page cache statistics of a mapping in the > > > > + * page range of [first_index, last_index] (inclusive). THe statistics > > > > + * queried include: number of dirty pages, number of pages marked for > > > > + * writeback, and the number of (recently) evicted pages. > > > > + */ > > > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index, > > > > + pgoff_t last_index, struct cachestat *cs) > > > > +{ > > > > > > Not sure why the internal helper needs to be wrapped in a config option? > > > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just > > > wrap the syscall bits with that..? > > > > No particularly strong reasons - I was just bundling the two together > > because I was not entirely sure if it's interesting or has any use > > case outside of the syscall itself. > > > > Do you have something in mind? > > > > Not immediately, though it looks like a straightforward and potentially > useful helper. It wouldn't surprise me if it grew more in-kernel users > eventually. :) Sounds good! I'll move it out in the next version, unless there is some unexpected issue with that. > That said, I only suggested this for cleanup/consistency reasons. I.e., > it seems there is plenty of precedent for CONFIG_*_SYSCALL config > options, and the one or two I happened to peek at looked like they only > wrapped the syscall bits as opposed to the entire implementation (which > also seems like a clean and elegant approach to me). Somebody could > easily come along and make that change later if/when they might want to > use the helper (though it might be annoying to change the name of the > config option), so I have no strong opinion on either suggestion. > > Brian > > > > > > > I would also think it might make things simpler to split out all the > > > syscall bits into a separate patch from filemap_cachestat(). > > > > Same as above. > > > > > > > > > + XA_STATE(xas, &mapping->i_pages, first_index); > > > > + struct folio *folio; > > > > + > > > > + rcu_read_lock(); > > > > + xas_for_each(&xas, folio, last_index) { > > > > + unsigned long nr_pages; > > > > + pgoff_t folio_first_index, folio_last_index; > > > > + > > > > + if (xas_retry(&xas, folio)) > > > > + continue; > > > > + > > > > + nr_pages = folio_nr_pages(folio); > > > > + folio_first_index = folio_pgoff(folio); > > > > + folio_last_index = folio_first_index + nr_pages - 1; > > > > + > > > > + /* Folios might straddle the range boundaries, only count covered subpages */ > > > > + if (folio_first_index < first_index) > > > > + nr_pages -= first_index - folio_first_index; > > > > + > > > > + if (folio_last_index > last_index) > > > > + nr_pages -= folio_last_index - last_index; > > > > + > > > > + if (xa_is_value(folio)) { > > > > + /* page is evicted */ > > > > + void *shadow = (void *)folio; > > > > + bool workingset; /* not used */ > > > > + > > > > + cs->nr_evicted += nr_pages; > > > > + > > > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */ > > > > + if (shmem_mapping(mapping)) { > > > > + /* shmem file - in swap cache */ > > > > + swp_entry_t swp = radix_to_swp_entry(folio); > > > > + > > > > + shadow = get_shadow_from_swap_cache(swp); > > > > + } > > > > +#endif > > > > + if (workingset_test_recent(shadow, true, &workingset)) > > > > + cs->nr_recently_evicted += nr_pages; > > > > + > > > > + continue; > > > > + } > > > > + > > > > + /* page is in cache */ > > > > + cs->nr_cache += nr_pages; > > > > + > > > > + if (folio_test_dirty(folio)) > > > > + cs->nr_dirty += nr_pages; > > > > + > > > > + if (folio_test_writeback(folio)) > > > > + cs->nr_writeback += nr_pages; > > > > > > I'm not sure this is an issue right now (or if it will ever be for your > > > use cases), but from a filesystem perspective it is possible to have > > > large or variable sized folios in cache. At the moment I think the fs > > > (or XFS+iomap at least) manages dirty/writeback state at folio > > > granularity, but that may change in the near future if/when iomap > > > sub-page dirty tracking comes along. I suspect that means it may become > > > possible to have a large folio of some N number of pages where only a > > > subset of those pages are actually in dirty/writeback state, and thus > > > introduces some inaccuracy here because this assumes that folio state > > > applies to folio_nr_pages() worth of pages. Just something to be aware > > > of.. > > > > > > Brian > > > > Oof, I coded this with the mental framework of folio-as-a-unit, and > > assumed that the dirty/writeback state is managed at the granularity of folio. > > Thanks for bringing this up Brian! I'll have to watch out for this as iomap > > evolves (and the subpage tracking becomes a thing). > > > > > > > > > + } > > > > + rcu_read_unlock(); > > > > +} > > > > +EXPORT_SYMBOL(filemap_cachestat); > > > > + > > > > +/* > > > > + * The cachestat(5) system call. > > > > + * > > > > + * cachestat() returns the page cache statistics of a file in the > > > > + * bytes specified by `off` and `len`: number of cached pages, > > > > + * number of dirty pages, number of pages marked for writeback, > > > > + * number of (recently) evicted pages. > > > > + * > > > > + * `off` and `len` must be non-negative integers. If `len` > 0, > > > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0, > > > > + * we will query in the range from `off` to the end of the file. > > > > + * > > > > + * `cstat_size` allows users to obtain partial results. The syscall > > > > + * will copy the first `csstat_size` bytes to the specified userspace > > > > + * memory. It also makes the cachestat struct extensible - new fields > > > > + * can be added in the future without breaking existing usage. > > > > + * `cstat_size` must be a non-negative value that is no larger than > > > > + * the current size of the cachestat struct. > > > > + * > > > > + * The `flags` argument is unused for now, but is included for future > > > > + * extensibility. User should pass 0 (i.e no flag specified). > > > > + * > > > > + * Because the status of a page can change after cachestat() checks it > > > > + * but before it returns to the application, the returned values may > > > > + * contain stale information. > > > > + * > > > > + * return values: > > > > + * zero - success > > > > + * -EFAULT - cstat points to an illegal address > > > > + * -EINVAL - invalid arguments > > > > + * -EBADF - invalid file descriptor > > > > + */ > > > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len, > > > > + size_t, cstat_size, struct cachestat __user *, cstat, > > > > + unsigned int, flags) > > > > +{ > > > > + struct fd f = fdget(fd); > > > > + struct address_space *mapping; > > > > + struct cachestat cs; > > > > + pgoff_t first_index = off >> PAGE_SHIFT; > > > > + pgoff_t last_index = > > > > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT; > > > > + > > > > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0) > > > > + return -EINVAL; > > > > + > > > > + if (!f.file) > > > > + return -EBADF; > > > > + > > > > + memset(&cs, 0, sizeof(struct cachestat)); > > > > + mapping = f.file->f_mapping; > > > > + filemap_cachestat(mapping, first_index, last_index, &cs); > > > > + fdput(f); > > > > + > > > > + if (copy_to_user(cstat, &cs, cstat_size)) > > > > + return -EFAULT; > > > > + > > > > + return 0; > > > > +} > > > > +#endif /* CONFIG_CACHESTAT */ > > > > -- > > > > 2.30.2 > > > > > > > > > >
On Wed, Dec 21, 2022 at 4:37 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Fri, Dec 16, 2022 at 1:48 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > > > > > Implement a new syscall that queries cache state of a file and > > > summarizes the number of cached pages, number of dirty pages, number of > > > pages marked for writeback, number of (recently) evicted pages, etc. in > > > a given range. > > > > > > NAME > > > cachestat - query the page cache status of a file. > > > > > > SYNOPSIS > > > #include <sys/mman.h> > > > > > > struct cachestat { > > > __u64 nr_cache; > > > __u64 nr_dirty; > > > __u64 nr_writeback; > > > __u64 nr_evicted; > > > __u64 nr_recently_evicted; > > > }; > > > > > > int cachestat(unsigned int fd, off_t off, size_t len, > > > size_t cstat_size, struct cachestat *cstat, > > > unsigned int flags); > > > > > > DESCRIPTION > > > cachestat() queries the number of cached pages, number of dirty > > > pages, number of pages marked for writeback, number of (recently) > > > evicted pages, in the bytes range given by `off` and `len`. > > > > I suggest this be spelled out better: "number of evicted and number or > > recently evicted pages". > > > > I suggest this clearly tell readers what an "evicted" page is - they > > aren't kernel programmers! > > Valid points - I'll try to explain this more clearly in the future > versions of this patch series, especially in the draft man page. > > > > > What is the benefit of the "recently evicted" pages? "recently" seems > > very vague - what use is this to anyone? > > This eviction recency semantics comes from the LRU's refault > computation. Users of cachestat might be interested in two very > different questions: > > 1. How many pages are not resident in the page cache. > 2. How many pages are recently evicted (recently enough that > their refault will be construed as memory pressure). > > The first question is answered with nr_evicted, whereas the second > is answered with nr_recently_evicted. > > I will figure out a way to explain this better in the next version. Users > definitely do not need to know the nitty gritty details of LRU logic, > but they should know the general idea of each field at least. > > > > > > These values are returned in a cachestat struct, whose address is > > > given by the `cstat` argument. > > > > > > The `off` and `len` arguments must be non-negative integers. If > > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == > > > 0, we will query in the range from `off` to the end of the file. > > > > > > `cstat_size` allows users to obtain partial results. The syscall > > > will copy the first `csstat_size` bytes to the specified userspace > > > memory. `cstat_size` must be a non-negative value that is no larger > > > than the current size of the cachestat struct. > > > > > > The `flags` argument is unused for now, but is included for future > > > extensibility. User should pass 0 (i.e no flag specified). > > > > Why is `flags' here? We could add an unused flags arg to any syscall, > > but we don't. What's the plan? > > I included this field to ensure that cachestat can be extended safely, > especially when different users might want different things out of it. > > For instance, in the future there might be new fields/computations > that are too heavy for certain use cases - a flag could be used to > disable/skip such fields/computations. > > Another thing it might be used for is the huge page counting - > we have not implemented this in this version yet, but it might > introduce murky semantics to new/existing fields in struct > cachestat. Or maybe not - but worst case scenario we can > leave this decision to the users to decide through flags. > > I'm sure there are more potential pitfalls that the flags could > save us from, but these are the two on top of my head. > > > > > Are there security implications? If I know that some process has a > > file open, I can use cachestat() to infer which parts of that file > > they're looking at (like mincore(), I guess). And I can infer which > > parts they're writing to, unlike mincore(). > > This one, I'm not 100% sure, but it is a valid concern. Let me think > about it and discuss with more security-oriented minds before > responding to this. Hmm I've given it some more thought. The syscall does not seem to expose any extra security issue, given that the user already has read permission to the file itself (since they have an fd to that file). This means that the user can already know the underlying data in its entirety, which seems like much more information (and as a result, security risk) than the cache status itself. Do you have something concrete in mind that I might have missed? > > > > > I suggest the [patch 1/4] fixup be separated from this series. > > Sounds good! I'll loop Johannes in about this breakup as well.
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 8ebacf37a8cf..1f13995d00d7 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -490,3 +490,4 @@ 558 common process_mrelease sys_process_mrelease 559 common futex_waitv sys_futex_waitv 560 common set_mempolicy_home_node sys_ni_syscall +561 common cachestat sys_cachestat diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index ac964612d8b0..8ebed8a13874 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -464,3 +464,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 72c929d9902b..f8c74ffeeefb 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -371,3 +371,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index b1f3940bc298..4f504783371f 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -450,3 +450,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 820145e47350..858d22bf275c 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -456,3 +456,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 8a99c998da9b..7c84a72306d1 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -448,3 +448,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 2bca64f96164..937460f0a8ec 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -530,3 +530,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 799147658dee..7df0329d46cb 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -453,3 +453,4 @@ 448 common process_mrelease sys_process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat sys_cachestat diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 2de85c977f54..97377e8c5025 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -453,3 +453,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 4398cc6fb68d..faa835f3c54a 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -496,3 +496,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 320480a8db4f..bc0a3c941b35 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -455,3 +455,4 @@ 448 i386 process_mrelease sys_process_mrelease 449 i386 futex_waitv sys_futex_waitv 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node +451 i386 cachestat sys_cachestat diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..227538b0ce80 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 52c94ab5c205..2b69c3c035b6 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -421,3 +421,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common cachestat sys_cachestat diff --git a/include/linux/fs.h b/include/linux/fs.h index e654435f1651..8902799121c9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -46,6 +46,7 @@ #include <asm/byteorder.h> #include <uapi/linux/fs.h> +#include <uapi/linux/mman.h> struct backing_dev_info; struct bdi_writeback; @@ -830,6 +831,8 @@ void filemap_invalidate_lock_two(struct address_space *mapping1, struct address_space *mapping2); void filemap_invalidate_unlock_two(struct address_space *mapping1, struct address_space *mapping2); +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index, + pgoff_t last_index, struct cachestat *cs); /* diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a34b0f9a9972..50f8c6999d99 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -72,6 +72,7 @@ struct open_how; struct mount_attr; struct landlock_ruleset_attr; enum landlock_rule_type; +struct cachestat; #include <linux/types.h> #include <linux/aio_abi.h> @@ -1056,6 +1057,8 @@ asmlinkage long sys_memfd_secret(unsigned int flags); asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, unsigned long home_node, unsigned long flags); +asmlinkage long sys_cachestat(unsigned int fd, off_t off, size_t len, + size_t cstat_size, struct cachestat __user *cstat, unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 45fa180cc56a..cd639fae9086 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_cachestat 451 +__SYSCALL(__NR_cachestat, sys_cachestat) + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h index f55bc680b5b0..fe03ed0b7587 100644 --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -4,6 +4,7 @@ #include <asm/mman.h> #include <asm-generic/hugetlb_encode.h> +#include <linux/types.h> #define MREMAP_MAYMOVE 1 #define MREMAP_FIXED 2 @@ -41,4 +42,12 @@ #define MAP_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB #define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB +struct cachestat { + __u64 nr_cache; + __u64 nr_dirty; + __u64 nr_writeback; + __u64 nr_evicted; + __u64 nr_recently_evicted; +}; + #endif /* _UAPI_LINUX_MMAN_H */ diff --git a/init/Kconfig b/init/Kconfig index 694f7c160c9c..ecc4f781dd6c 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1798,6 +1798,16 @@ config RSEQ If unsure, say Y. +config CACHESTAT + bool "Enable cachestat() system call" if EXPERT + default y + help + Enable the cachestat system call, which queries the page cache + statistics of a file (number of cached pages, dirty pages, + pages marked for writeback, (recently) evicted pages). + + If unsure say Y here. + config DEBUG_RSEQ default n bool "Enabled debugging of rseq() system call" if EXPERT diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 860b2dcf3ac4..04bfb1e4d377 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -299,6 +299,7 @@ COND_SYSCALL(set_mempolicy); COND_SYSCALL(migrate_pages); COND_SYSCALL(move_pages); COND_SYSCALL(set_mempolicy_home_node); +COND_SYSCALL(cachestat); COND_SYSCALL(perf_event_open); COND_SYSCALL(accept4); diff --git a/mm/filemap.c b/mm/filemap.c index 08341616ae7a..29ffb906caa4 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -22,6 +22,7 @@ #include <linux/mm.h> #include <linux/swap.h> #include <linux/swapops.h> +#include <linux/syscalls.h> #include <linux/mman.h> #include <linux/pagemap.h> #include <linux/file.h> @@ -55,6 +56,9 @@ #include <linux/buffer_head.h> /* for try_to_free_buffers */ #include <asm/mman.h> +#include <uapi/linux/mman.h> + +#include "swap.h" /* * Shared mappings implemented 30.11.1994. It's not fully working yet, @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp) return try_to_free_buffers(folio); } EXPORT_SYMBOL(filemap_release_folio); + +#ifdef CONFIG_CACHESTAT +/** + * filemap_cachestat() - compute the page cache statistics of a mapping + * @mapping: The mapping to compute the statistics for. + * @first_index: The starting page cache index. + * @last_index: The final page index (inclusive). + * @cs: the cachestat struct to write the result to. + * + * This will query the page cache statistics of a mapping in the + * page range of [first_index, last_index] (inclusive). THe statistics + * queried include: number of dirty pages, number of pages marked for + * writeback, and the number of (recently) evicted pages. + */ +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index, + pgoff_t last_index, struct cachestat *cs) +{ + XA_STATE(xas, &mapping->i_pages, first_index); + struct folio *folio; + + rcu_read_lock(); + xas_for_each(&xas, folio, last_index) { + unsigned long nr_pages; + pgoff_t folio_first_index, folio_last_index; + + if (xas_retry(&xas, folio)) + continue; + + nr_pages = folio_nr_pages(folio); + folio_first_index = folio_pgoff(folio); + folio_last_index = folio_first_index + nr_pages - 1; + + /* Folios might straddle the range boundaries, only count covered subpages */ + if (folio_first_index < first_index) + nr_pages -= first_index - folio_first_index; + + if (folio_last_index > last_index) + nr_pages -= folio_last_index - last_index; + + if (xa_is_value(folio)) { + /* page is evicted */ + void *shadow = (void *)folio; + bool workingset; /* not used */ + + cs->nr_evicted += nr_pages; + +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */ + if (shmem_mapping(mapping)) { + /* shmem file - in swap cache */ + swp_entry_t swp = radix_to_swp_entry(folio); + + shadow = get_shadow_from_swap_cache(swp); + } +#endif + if (workingset_test_recent(shadow, true, &workingset)) + cs->nr_recently_evicted += nr_pages; + + continue; + } + + /* page is in cache */ + cs->nr_cache += nr_pages; + + if (folio_test_dirty(folio)) + cs->nr_dirty += nr_pages; + + if (folio_test_writeback(folio)) + cs->nr_writeback += nr_pages; + } + rcu_read_unlock(); +} +EXPORT_SYMBOL(filemap_cachestat); + +/* + * The cachestat(5) system call. + * + * cachestat() returns the page cache statistics of a file in the + * bytes specified by `off` and `len`: number of cached pages, + * number of dirty pages, number of pages marked for writeback, + * number of (recently) evicted pages. + * + * `off` and `len` must be non-negative integers. If `len` > 0, + * the queried range is [`off`, `off` + `len`]. If `len` == 0, + * we will query in the range from `off` to the end of the file. + * + * `cstat_size` allows users to obtain partial results. The syscall + * will copy the first `csstat_size` bytes to the specified userspace + * memory. It also makes the cachestat struct extensible - new fields + * can be added in the future without breaking existing usage. + * `cstat_size` must be a non-negative value that is no larger than + * the current size of the cachestat struct. + * + * The `flags` argument is unused for now, but is included for future + * extensibility. User should pass 0 (i.e no flag specified). + * + * Because the status of a page can change after cachestat() checks it + * but before it returns to the application, the returned values may + * contain stale information. + * + * return values: + * zero - success + * -EFAULT - cstat points to an illegal address + * -EINVAL - invalid arguments + * -EBADF - invalid file descriptor + */ +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len, + size_t, cstat_size, struct cachestat __user *, cstat, + unsigned int, flags) +{ + struct fd f = fdget(fd); + struct address_space *mapping; + struct cachestat cs; + pgoff_t first_index = off >> PAGE_SHIFT; + pgoff_t last_index = + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT; + + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0) + return -EINVAL; + + if (!f.file) + return -EBADF; + + memset(&cs, 0, sizeof(struct cachestat)); + mapping = f.file->f_mapping; + filemap_cachestat(mapping, first_index, last_index, &cs); + fdput(f); + + if (copy_to_user(cstat, &cs, cstat_size)) + return -EFAULT; + + return 0; +} +#endif /* CONFIG_CACHESTAT */