[RFC,v2,8/9] namespace: add sb_revalidate_bindmounts helper

Message ID 20230403144517.347517-9-aleksandr.mikhalitsyn@canonical.com
State New
Headers
Series fuse: API for Checkpoint/Restore |

Commit Message

Aleksandr Mikhalitsyn April 3, 2023, 2:45 p.m. UTC
  Useful if for some reason bindmounts root dentries get invalidated
but it's needed to revalidate existing bindmounts without remounting.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Bernd Schubert <bschubert@ddn.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/namespace.c                | 90 +++++++++++++++++++++++++++++++++++
 include/linux/mnt_namespace.h |  3 ++
 2 files changed, 93 insertions(+)
  

Comments

Christian Brauner April 3, 2023, 9:14 p.m. UTC | #1
On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> Useful if for some reason bindmounts root dentries get invalidated
> but it's needed to revalidate existing bindmounts without remounting.
> 
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Stéphane Graber <stgraber@ubuntu.com>
> Cc: Seth Forshee <sforshee@kernel.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: Bernd Schubert <bschubert@ddn.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: criu@openvz.org
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
>  fs/namespace.c                | 90 +++++++++++++++++++++++++++++++++++
>  include/linux/mnt_namespace.h |  3 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index bc0f15257b49..b74d00d6abb0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
>  	return ret;
>  }
>  
> +struct bind_mount_list_item {
> +	struct list_head list;
> +	struct vfsmount *mnt;
> +};
> +
> +/*
> + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> + *
> + * Useful if for some reason bindmount root dentries get invalidated
> + * but it's needed to revalidate existing bindmounts without remounting.
> + */
> +int sb_revalidate_bindmounts(struct super_block *sb)

It's difficult to not strongly dislike this patchset on the basis of the
need for a function like this alone. And I just have to say it even if I
normally try not to do this: This whole function is bonkers in my opinion.

But leaving that aside for a second. This really needs detailed
explanations on locking, assumptions, and an explanation what you're
doing here. This looks crazy to me and definitely isn't fit to be a
generic helper in this form.

> +{
> +	struct mount *mnt;
> +	struct bind_mount_list_item *bmnt, *next;
> +	int err = 0;
> +	struct vfsmount *root_mnt = NULL;
> +	LIST_HEAD(mnt_to_update);
> +	char *buf;
> +
> +	buf = (char *) __get_free_page(GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	lock_mount_hash();
> +	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> +		/* we only want to touch bindmounts */
> +		if (mnt->mnt.mnt_root == sb->s_root) {
> +			if (!root_mnt)
> +				root_mnt = mntget(&mnt->mnt);
> +
> +			continue;
> +		}
> +
> +		bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);

Allocating under lock_mount_hash() even if doable with this flag
combination should be avoided at all costs imho. That just seems hacky.

> +		if (!bmnt) {
> +			err = -ENOMEM;
> +			goto exit;

You're exiting the function with lock_mount_hash() held...

> +		}
> +
> +		bmnt->mnt = mntget(&mnt->mnt);
> +		list_add_tail(&bmnt->list, &mnt_to_update);
> +	}
> +	unlock_mount_hash();

You've dropped unlock_mount_hash() and the function doesn't hold
namespace_lock() and isn't documented as requiring the caller to hold
it. And the later patch that uses this doesn't afaict (although I
haven't looked at any of the fuse specific stuff). So this is open to
all sorts of races with mount and unmount now. This needs an explanation
why that doesn't matter.

> +
> +	/* TODO: get rid of this limitation */

Confused about this comment.

> +	if (!root_mnt) {
> +		err = -ENOENT;
> +		goto exit;
> +	}
> +
> +	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {

No one else can access that list so list_for_each_entry_safe() seems
pointless.

> +		struct vfsmount *cur_mnt = bmnt->mnt;
> +		struct path path;
> +		struct dentry *old_root;
> +		char *p;
> +
> +		p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> +		if (IS_ERR(p))
> +			goto exit;
> +
> +		/* TODO: are these lookup flags fully safe and correct? */
> +		err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> +				p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> +		if (err)
> +			goto exit;
> +
> +		/* replace bindmount root dentry */
> +		lock_mount_hash();
> +		old_root = cur_mnt->mnt_root;
> +		cur_mnt->mnt_root = dget(path.dentry);

mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
it has been set and a lot of code just assumes that it's stable.

Top of my hat, since you're not holding namespace_lock() mount
propagation can be going on concurrently so propagate_one() can check if
(!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
updating it. A lot of code could actually be touching mnt->mnt_root
while you're updating it.

There's probably a lot more issues with this I'm just not seeing without
spending more time on this. But NAK on this as it stands. Sorry.

> +		dput(old_root);
> +		unlock_mount_hash();
> +
> +		path_put(&path);
> +	}
> +
> +exit:
> +	free_page((unsigned long) buf);
> +	mntput(root_mnt);
> +	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> +		list_del(&bmnt->list);
> +		mntput(bmnt->mnt);
> +		kfree(bmnt);
> +	}
> +
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
  
Aleksandr Mikhalitsyn April 5, 2023, 6:45 p.m. UTC | #2
On Mon, Apr 3, 2023 at 11:14 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> > Useful if for some reason bindmounts root dentries get invalidated
> > but it's needed to revalidate existing bindmounts without remounting.
> >
> > Cc: Miklos Szeredi <mszeredi@redhat.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Stéphane Graber <stgraber@ubuntu.com>
> > Cc: Seth Forshee <sforshee@kernel.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > Cc: Bernd Schubert <bschubert@ddn.com>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: criu@openvz.org
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > ---
> >  fs/namespace.c                | 90 +++++++++++++++++++++++++++++++++++
> >  include/linux/mnt_namespace.h |  3 ++
> >  2 files changed, 93 insertions(+)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index bc0f15257b49..b74d00d6abb0 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
> >       return ret;
> >  }
> >
> > +struct bind_mount_list_item {
> > +     struct list_head list;
> > +     struct vfsmount *mnt;
> > +};
> > +
> > +/*
> > + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> > + *
> > + * Useful if for some reason bindmount root dentries get invalidated
> > + * but it's needed to revalidate existing bindmounts without remounting.
> > + */
> > +int sb_revalidate_bindmounts(struct super_block *sb)

Hi Christian!

>
> It's difficult to not strongly dislike this patchset on the basis of the
> need for a function like this alone. And I just have to say it even if I
> normally try not to do this: This whole function is bonkers in my opinion.

This function is the main pain point in this patchset, I believe :)
I thought a lot about other ways (without this helper) and all of them
have their pros and cons.

And that's the reason why this patchset is RFC. I wanted to hear from
you and other folks
about your point of view on the problem. As far as I can see there is
some interest to that problem,
and implementation question is important, but if there is no
objections against the concept of
FUSE mounts healing (and Checkpoint/Restore too as a more complex
case) then I think we will
be able to agree on some approach to the problem. I'm ready to rework
that in the way it is supposed to be done.

>
> But leaving that aside for a second. This really needs detailed
> explanations on locking, assumptions, and an explanation what you're
> doing here. This looks crazy to me and definitely isn't fit to be a
> generic helper in this form.

Of course.

>
> > +{
> > +     struct mount *mnt;
> > +     struct bind_mount_list_item *bmnt, *next;
> > +     int err = 0;
> > +     struct vfsmount *root_mnt = NULL;
> > +     LIST_HEAD(mnt_to_update);
> > +     char *buf;
> > +
> > +     buf = (char *) __get_free_page(GFP_KERNEL);
> > +     if (!buf)
> > +             return -ENOMEM;
> > +
> > +     lock_mount_hash();
> > +     list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> > +             /* we only want to touch bindmounts */
> > +             if (mnt->mnt.mnt_root == sb->s_root) {
> > +                     if (!root_mnt)
> > +                             root_mnt = mntget(&mnt->mnt);
> > +
> > +                     continue;
> > +             }
> > +
> > +             bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
>
> Allocating under lock_mount_hash() even if doable with this flag
> combination should be avoided at all costs imho. That just seems hacky.

got it

>
> > +             if (!bmnt) {
> > +                     err = -ENOMEM;
> > +                     goto exit;
>
> You're exiting the function with lock_mount_hash() held...

+

>
> > +             }
> > +
> > +             bmnt->mnt = mntget(&mnt->mnt);
> > +             list_add_tail(&bmnt->list, &mnt_to_update);
> > +     }
> > +     unlock_mount_hash();
>
> You've dropped unlock_mount_hash() and the function doesn't hold
> namespace_lock() and isn't documented as requiring the caller to hold
> it. And the later patch that uses this doesn't afaict (although I
> haven't looked at any of the fuse specific stuff). So this is open to
> all sorts of races with mount and unmount now. This needs an explanation
> why that doesn't matter.

yes, it doesn't matter, my idea was that this helper should
"actualize" roots for the bind mounts
at the point of call, but it seems no problem if we skip some new
mounts, because
This helper can be called again. I didn't want to take a namespace_lock() there
because we don't want to have such guarantees for this interface.

Of course, this has to be documented. Thanks!

>
> > +
> > +     /* TODO: get rid of this limitation */
>
> Confused about this comment.

Sorry.

This comment means that current implementation relies on the fact that
each fuse superblock has some
root mount in the list (mnt->mnt.mnt_root == sb->s_root). And that's
not always the case, because user
can do something like this:
fusermount ... /mnt/my_fuse_mount
mount --bind /mnt/my_fuse_mount/subtree /mnt/my_fuse_mount_subtree
umount /mnt/my_fuse_mount

In this case, root_mnt == NULL. But we need to have root_mnt, to pass
it in vfs_path_lookup(..) later.

>
> > +     if (!root_mnt) {
> > +             err = -ENOENT;
> > +             goto exit;
> > +     }
> > +
> > +     list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
>
> No one else can access that list so list_for_each_entry_safe() seems
> pointless.

+

>
> > +             struct vfsmount *cur_mnt = bmnt->mnt;
> > +             struct path path;
> > +             struct dentry *old_root;
> > +             char *p;
> > +
> > +             p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> > +             if (IS_ERR(p))
> > +                     goto exit;
> > +
> > +             /* TODO: are these lookup flags fully safe and correct? */
> > +             err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> > +                             p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> > +             if (err)
> > +                     goto exit;
> > +
> > +             /* replace bindmount root dentry */
> > +             lock_mount_hash();
> > +             old_root = cur_mnt->mnt_root;
> > +             cur_mnt->mnt_root = dget(path.dentry);
>
> mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
> it has been set and a lot of code just assumes that it's stable.
>
> Top of my hat, since you're not holding namespace_lock() mount
> propagation can be going on concurrently so propagate_one() can check if
> (!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
> updating it. A lot of code could actually be touching mnt->mnt_root
> while you're updating it.
>
> There's probably a lot more issues with this I'm just not seeing without
> spending more time on this. But NAK on this as it stands. Sorry.

Thanks for looking into it, Christian!

Kind regards,
Alex

>
> > +             dput(old_root);
> > +             unlock_mount_hash();
> > +
> > +             path_put(&path);
> > +     }
> > +
> > +exit:
> > +     free_page((unsigned long) buf);
> > +     mntput(root_mnt);
> > +     list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> > +             list_del(&bmnt->list);
> > +             mntput(bmnt->mnt);
> > +             kfree(bmnt);
> > +     }
> > +
> > +     return err;
> > +}
> > +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
  

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index bc0f15257b49..b74d00d6abb0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -568,6 +568,96 @@  static int mnt_make_readonly(struct mount *mnt)
 	return ret;
 }
 
+struct bind_mount_list_item {
+	struct list_head list;
+	struct vfsmount *mnt;
+};
+
+/*
+ * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
+ *
+ * Useful if for some reason bindmount root dentries get invalidated
+ * but it's needed to revalidate existing bindmounts without remounting.
+ */
+int sb_revalidate_bindmounts(struct super_block *sb)
+{
+	struct mount *mnt;
+	struct bind_mount_list_item *bmnt, *next;
+	int err = 0;
+	struct vfsmount *root_mnt = NULL;
+	LIST_HEAD(mnt_to_update);
+	char *buf;
+
+	buf = (char *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	lock_mount_hash();
+	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
+		/* we only want to touch bindmounts */
+		if (mnt->mnt.mnt_root == sb->s_root) {
+			if (!root_mnt)
+				root_mnt = mntget(&mnt->mnt);
+
+			continue;
+		}
+
+		bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
+		if (!bmnt) {
+			err = -ENOMEM;
+			goto exit;
+		}
+
+		bmnt->mnt = mntget(&mnt->mnt);
+		list_add_tail(&bmnt->list, &mnt_to_update);
+	}
+	unlock_mount_hash();
+
+	/* TODO: get rid of this limitation */
+	if (!root_mnt) {
+		err = -ENOENT;
+		goto exit;
+	}
+
+	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+		struct vfsmount *cur_mnt = bmnt->mnt;
+		struct path path;
+		struct dentry *old_root;
+		char *p;
+
+		p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
+		if (IS_ERR(p))
+			goto exit;
+
+		/* TODO: are these lookup flags fully safe and correct? */
+		err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
+				p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
+		if (err)
+			goto exit;
+
+		/* replace bindmount root dentry */
+		lock_mount_hash();
+		old_root = cur_mnt->mnt_root;
+		cur_mnt->mnt_root = dget(path.dentry);
+		dput(old_root);
+		unlock_mount_hash();
+
+		path_put(&path);
+	}
+
+exit:
+	free_page((unsigned long) buf);
+	mntput(root_mnt);
+	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+		list_del(&bmnt->list);
+		mntput(bmnt->mnt);
+		kfree(bmnt);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
+
 int sb_prepare_remount_readonly(struct super_block *sb)
 {
 	struct mount *mnt;
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 8f882f5881e8..20ac29e702f5 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -3,6 +3,7 @@ 
 #define _NAMESPACE_H_
 #ifdef __KERNEL__
 
+struct super_block;
 struct mnt_namespace;
 struct fs_struct;
 struct user_namespace;
@@ -13,6 +14,8 @@  extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 extern void put_mnt_ns(struct mnt_namespace *ns);
 extern struct ns_common *from_mnt_ns(struct mnt_namespace *);
 
+extern int sb_revalidate_bindmounts(struct super_block *sb);
+
 extern const struct file_operations proc_mounts_operations;
 extern const struct file_operations proc_mountinfo_operations;
 extern const struct file_operations proc_mountstats_operations;