inotify: support returning file_handles
Commit Message
This patch adds the watch mask bit "IN_FID". It implements semantics
similar to fanotify's "FAN_REPORT_FID": if the bit is set in a "struct
inotify_event", the event (and the name) are followed by a "struct
inotify_extra_fid" which contains the filesystem id (not yet
implemented) and the file_handle.
It is debatable whether this is a useful feature; not useful for me,
but Jan Kara cited this feature as an advantage of fanotify over
inotify:
https://lore.kernel.org/linux-fsdevel/20230919100112.nlb2t4nm46wmugc2@quack3/
In a follow-up, Amir Goldstein did not want to accept that this
feature could be added easily to inotify, telling me that I "do not
understand the complexity involved":
https://lore.kernel.org/linux-fsdevel/CAOQ4uxgG6ync6dSBJiGW98docJGnajALiV+9tuwGiRt8NE8F+w@mail.gmail.com/
.. which Jan Kara agreed with: "I'll really find out it isn't so easy"
https://lore.kernel.org/linux-fsdevel/20230919132818.4s3n5bsqmokof6n2@quack3/
So here it is, an easy implementation in less than 90 lines of code
(which is slightly incomplete; grep TODO). This is a per-watch flag
and there is no backwards compatibility breakage because the extra
struct is only added if explicitly requested by user space.
This is just a proof-of-concept, to demonstrate that adding such a
feature to inotify is not only possible, but also easy. Even though
this part of the kernel is new to me, apparently I do "understand the
complexity involved", after all.
I don't expect this to be merged. As much as I'd like to see inotify
improved because fanotify is too complex and cumbersome for my taste,
I'm not deluded enough to believe this PoC will convince anybody. But
hacking it was fun and helped me learn about the inotify code which I
may continue to improve in our private kernel fork.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
fs/notify/inotify/inotify.h | 4 ++-
fs/notify/inotify/inotify_fsnotify.c | 31 ++++++++++++++++++++
fs/notify/inotify/inotify_user.c | 43 +++++++++++++++++++++++++++-
include/linux/inotify.h | 1 +
include/uapi/linux/inotify.h | 11 +++++++
5 files changed, 88 insertions(+), 2 deletions(-)
Comments
On Tue, Sep 19, 2023 at 11:23 PM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> This patch adds the watch mask bit "IN_FID". It implements semantics
> similar to fanotify's "FAN_REPORT_FID": if the bit is set in a "struct
> inotify_event", the event (and the name) are followed by a "struct
> inotify_extra_fid" which contains the filesystem id (not yet
> implemented) and the file_handle.
>
> It is debatable whether this is a useful feature; not useful for me,
> but Jan Kara cited this feature as an advantage of fanotify over
> inotify:
> https://lore.kernel.org/linux-fsdevel/20230919100112.nlb2t4nm46wmugc2@quack3/
>
> In a follow-up, Amir Goldstein did not want to accept that this
> feature could be added easily to inotify, telling me that I "do not
> understand the complexity involved":
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxgG6ync6dSBJiGW98docJGnajALiV+9tuwGiRt8NE8F+w@mail.gmail.com/
>
> .. which Jan Kara agreed with: "I'll really find out it isn't so easy"
> https://lore.kernel.org/linux-fsdevel/20230919132818.4s3n5bsqmokof6n2@quack3/
>
> So here it is, an easy implementation in less than 90 lines of code
> (which is slightly incomplete; grep TODO). This is a per-watch flag
> and there is no backwards compatibility breakage because the extra
> struct is only added if explicitly requested by user space.
>
> This is just a proof-of-concept, to demonstrate that adding such a
> feature to inotify is not only possible, but also easy. Even though
> this part of the kernel is new to me, apparently I do "understand the
> complexity involved", after all.
>
A bit more humility couldn't hurt. From both you and I.
Obviously, reporting the fhandle in not the complexity we were
talking about. Looks like you are a fast learner, so maybe one day
you wili dive into fanotify code and see the improvements over inotify.
But your basic point that inotify could have been extended is true.
We wanted to extend functionality in the kernel for filesystem watches [1].
We could have done inotify 2.0 (extend inotify).
We could have done fsnotify 1.0 (i.e. new syscalls).
We decided to go with fanotify 2.0, because the fanotify
syscalls and event format already had all the fields needed to
provide the extended API.
It was an arbitrary design choice. It is futile to argue about it.
The main difference between inotify and fanotify APIs which concerns
your complaint is the watch descriptor - a kernel map of unique inode
objects.
Filesystem watches drove the need for reporting fhandles.
I don't have the time to explain why reporting wd/fd in this case
is not a good idea.
It was a design decision to keep the API consistent and uniform
and report the same information for inode watches as well.
Ignoring the functionality gap that needs to be closed between
fanotify -> inotify for the discussion, the remaining part that you like
about inotify is the integer wd descriptor, which makes your life easier.
I am not saying that wd is not useful.
I am not saying that it will not be considered useful enough to keep it
around when the time comes to consider deprecating inotify.
What I am saying is that the inode already has a unique identifier -
more than one even.
For the context of pinned inodes (as in inotify watches), st_dev/st_ino
is unique and fixed size.
fsid/fhandle is variable size (practically fixes size per fs type), but has
the advantage that the identity is persistent so applications can use it to
store change tracking information in databases and to resolve uptodate
path of objects.
Again, it was a design decision to use the stronger fid always.
I have probably offered more than once to implement
FAN_REPORT_DEV_INO, which simplifies things a bit
(maybe not enough for you ;)) and closes the gap of supporting
all filesystems.
For now, we decided to try and stick with FAN_REPORT_FID
for uniformity and add support for fs that do not currently report
fsid/fhandle.
> I don't expect this to be merged. As much as I'd like to see inotify
> improved because fanotify is too complex and cumbersome for my taste,
> I'm not deluded enough to believe this PoC will convince anybody. But
> hacking it was fun and helped me learn about the inotify code which I
> may continue to improve in our private kernel fork.
>
Learning is always good.
Thanks,
Amir.
[1] https://github.com/amir73il/fsnotify-utils/wiki/filesystem-mark
Hi Max,
kernel test robot noticed the following build errors:
[auto build test ERROR on jack-fs/fsnotify]
[also build test ERROR on linus/master v6.6-rc2 next-20230920]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Max-Kellermann/inotify-support-returning-file_handles/20230920-042458
base: https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify
patch link: https://lore.kernel.org/r/20230919202304.1197654-1-max.kellermann%40ionos.com
patch subject: [PATCH] inotify: support returning file_handles
config: powerpc64-randconfig-002-20230920 (https://download.01.org/0day-ci/archive/20230920/202309201518.s23Ngrnu-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230920/202309201518.s23Ngrnu-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309201518.s23Ngrnu-lkp@intel.com/
All errors (new ones prefixed by >>):
powerpc64-linux-ld: warning: discarding dynamic section .glink
powerpc64-linux-ld: warning: discarding dynamic section .plt
powerpc64-linux-ld: linkage table error against `exportfs_encode_inode_fh'
powerpc64-linux-ld: stubs don't match calculated size
powerpc64-linux-ld: can not build stubs: bad value
powerpc64-linux-ld: fs/notify/inotify/inotify_fsnotify.o: in function `.inotify_handle_inode_event':
>> inotify_fsnotify.c:(.text+0x2f4): undefined reference to `.exportfs_encode_inode_fh'
>> powerpc64-linux-ld: inotify_fsnotify.c:(.text+0x560): undefined reference to `.exportfs_encode_inode_fh'
Hi Max,
kernel test robot noticed the following build errors:
[auto build test ERROR on jack-fs/fsnotify]
[also build test ERROR on linus/master v6.6-rc2 next-20230920]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Max-Kellermann/inotify-support-returning-file_handles/20230920-042458
base: https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify
patch link: https://lore.kernel.org/r/20230919202304.1197654-1-max.kellermann%40ionos.com
patch subject: [PATCH] inotify: support returning file_handles
config: mips-ath25_defconfig (https://download.01.org/0day-ci/archive/20230921/202309210656.zoCH7Ysu-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/202309210656.zoCH7Ysu-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309210656.zoCH7Ysu-lkp@intel.com/
All errors (new ones prefixed by >>):
mips-linux-ld: fs/notify/inotify/inotify_fsnotify.o: in function `inotify_handle_inode_event':
inotify_fsnotify.c:(.text+0x200): undefined reference to `exportfs_encode_inode_fh'
>> mips-linux-ld: inotify_fsnotify.c:(.text+0x318): undefined reference to `exportfs_encode_inode_fh'
@@ -9,6 +9,8 @@ struct inotify_event_info {
int wd;
u32 sync_cookie;
int name_len;
+ __kernel_fsid_t fsid;
+ u8 file_handle_size;
char name[];
};
@@ -27,7 +29,7 @@ static inline struct inotify_event_info *INOTIFY_E(struct fsnotify_event *fse)
* userspace. There is at least one bit (FS_EVENT_ON_CHILD) which is
* used only internally to the kernel.
*/
-#define INOTIFY_USER_MASK (IN_ALL_EVENTS)
+#define INOTIFY_USER_MASK (IN_ALL_EVENTS|IN_FID)
static inline __u32 inotify_mark_user_mask(struct fsnotify_mark *fsn_mark)
{
@@ -17,6 +17,7 @@
#include <linux/fs.h> /* struct inode */
#include <linux/fsnotify_backend.h>
#include <linux/inotify.h>
+#include <linux/exportfs.h>
#include <linux/path.h> /* struct path */
#include <linux/slab.h> /* kmem_* */
#include <linux/types.h>
@@ -43,6 +44,7 @@ static bool event_compare(struct fsnotify_event *old_fsn,
(old->name_len == new->name_len) &&
(!old->name_len || !strcmp(old->name, new->name)))
return true;
+ // TODO compare fsid, file_handle
return false;
}
@@ -56,6 +58,16 @@ static int inotify_merge(struct fsnotify_group *group,
return event_compare(last_event, event);
}
+static int get_file_handle_dwords(struct inode *inode)
+{
+ if (inode == NULL)
+ return 0;
+
+ int dwords = 0;
+ exportfs_encode_fid(inode, NULL, &dwords);
+ return dwords;
+}
+
int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
struct inode *inode, struct inode *dir,
const struct qstr *name, u32 cookie)
@@ -66,7 +78,9 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
struct fsnotify_group *group = inode_mark->group;
int ret;
int len = 0, wd;
+ int file_handle_dwords;
int alloc_len = sizeof(struct inotify_event_info);
+ size_t file_handle_offset = 0;
struct mem_cgroup *old_memcg;
if (name) {
@@ -74,6 +88,14 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
alloc_len += len + 1;
}
+ if (inode_mark->mask & IN_FID) {
+ file_handle_dwords = get_file_handle_dwords(inode);
+ if (file_handle_dwords > 0) {
+ file_handle_offset = roundup(alloc_len, 4);
+ alloc_len = file_handle_offset + sizeof(struct file_handle) + (file_handle_dwords << 2);
+ }
+ }
+
pr_debug("%s: group=%p mark=%p mask=%x\n", __func__, group, inode_mark,
mask);
@@ -120,9 +142,18 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
event->wd = wd;
event->sync_cookie = cookie;
event->name_len = len;
+ memset(&event->fsid, 0, sizeof(event->fsid)); // TODO
if (len)
strcpy(event->name, name->name);
+ if (file_handle_offset > 0) {
+ struct file_handle *fh = (struct file_handle *)((unsigned char *)event + file_handle_offset);
+ fh->handle_type = exportfs_encode_fid(inode, (struct fid *)fh->f_handle, &file_handle_dwords);
+ fh->handle_bytes = file_handle_dwords << 2;
+ event->mask |= IN_FID;
+ event->file_handle_size = sizeof(*fh) + (file_handle_dwords << 2);
+ }
+
ret = fsnotify_add_event(group, fsn_event, inotify_merge);
if (ret) {
/* Our event wasn't used in the end. Free it. */
@@ -133,6 +133,7 @@ static inline unsigned int inotify_arg_to_flags(u32 arg)
static inline u32 inotify_mask_to_arg(__u32 mask)
{
return mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED |
+ IN_FID |
IN_Q_OVERFLOW);
}
@@ -173,6 +174,7 @@ static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
{
size_t event_size = sizeof(struct inotify_event);
struct fsnotify_event *event;
+ const struct inotify_event_info *ie;
event = fsnotify_peek_first_event(group);
if (!event)
@@ -181,6 +183,12 @@ static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
event_size += round_event_name_len(event);
+
+ ie = INOTIFY_E(event);
+ if (ie->mask & IN_FID)
+ event_size += roundup(sizeof(struct inotify_extra_fid) + ie->file_handle_size,
+ sizeof(struct inotify_event));
+
if (event_size > count)
return ERR_PTR(-EINVAL);
@@ -241,9 +249,42 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
/* fill userspace with 0's */
if (clear_user(buf, pad_name_len - name_len))
return -EFAULT;
+
+ buf += pad_name_len - name_len;
event_size += pad_name_len;
}
+ if (event->mask & IN_FID) {
+ const size_t handle_size = event->file_handle_size;
+ const size_t extra_size = sizeof(struct inotify_extra_fid) + handle_size;
+ const size_t padded_extra_size = roundup(extra_size, sizeof(struct inotify_event));
+ const size_t padding = padded_extra_size - extra_size;
+
+ const struct inotify_extra_fid extra = {
+ .fsid = event->fsid,
+ .handle_size = handle_size,
+ .padding = padding,
+ };
+
+ if (copy_to_user(buf, &extra, sizeof(extra)))
+ return -EFAULT;
+
+ buf += sizeof(extra);
+
+ const unsigned char *handle = (const unsigned char *)(event + 1) + roundup(name_len + (name_len > 0), 4);
+ if (copy_to_user(buf, handle, handle_size))
+ return -EFAULT;
+
+ buf += handle_size;
+
+ if (clear_user(buf, padding))
+ return -EFAULT;
+
+ buf += padding;
+
+ event_size += padded_extra_size;
+ }
+
return event_size;
}
@@ -860,7 +901,7 @@ static int __init inotify_user_setup(void)
BUILD_BUG_ON(IN_IGNORED != FS_IN_IGNORED);
BUILD_BUG_ON(IN_ISDIR != FS_ISDIR);
- BUILD_BUG_ON(HWEIGHT32(ALL_INOTIFY_BITS) != 22);
+ BUILD_BUG_ON(HWEIGHT32(ALL_INOTIFY_BITS) != 23);
inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark,
SLAB_PANIC|SLAB_ACCOUNT);
@@ -13,6 +13,7 @@
IN_CLOSE_NOWRITE | IN_OPEN | IN_MOVED_FROM | \
IN_MOVED_TO | IN_CREATE | IN_DELETE | \
IN_DELETE_SELF | IN_MOVE_SELF | IN_UNMOUNT | \
+ IN_FID | \
IN_Q_OVERFLOW | IN_IGNORED | IN_ONLYDIR | \
IN_DONT_FOLLOW | IN_EXCL_UNLINK | IN_MASK_ADD | \
IN_MASK_CREATE | IN_ISDIR | IN_ONESHOT)
@@ -26,6 +26,16 @@ struct inotify_event {
char name[]; /* stub for possible name */
};
+struct inotify_extra_fid {
+ __kernel_fsid_t fsid;
+
+ __u8 handle_size;
+ __u8 padding;
+ __u8 reserved1, reserved2;
+
+ unsigned char handle[];
+};
+
/* the following are legal, implemented events that user-space can watch for */
#define IN_ACCESS 0x00000001 /* File was accessed */
#define IN_MODIFY 0x00000002 /* File was modified */
@@ -50,6 +60,7 @@ struct inotify_event {
#define IN_MOVE (IN_MOVED_FROM | IN_MOVED_TO) /* moves */
/* special flags */
+#define IN_FID 0x00800000 /* return fsid and file_handle (struct inotify_extra_fid) */
#define IN_ONLYDIR 0x01000000 /* only watch the path if it is a directory */
#define IN_DONT_FOLLOW 0x02000000 /* don't follow a sym link */
#define IN_EXCL_UNLINK 0x04000000 /* exclude events on unlinked objects */