[RESEND,RFC,v2,00/14] device_cgroup: guard mknod for non-initial user namespace

Message ID 20231025094224.72858-1-michael.weiss@aisec.fraunhofer.de
Headers
Series device_cgroup: guard mknod for non-initial user namespace |

Message

Michael Weiß Oct. 25, 2023, 9:42 a.m. UTC
  Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod
in non-initial user namespaces.

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod()
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, we also use in GyroidOS, is to run virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre populated device nodes under /dev are
not sufficient.

To circumvent this limitation, allow mknod() by checking CAP_MKNOD
in the userns by implementing the security_inode_mknod_nscap(). The
hook implementation checks if the corresponding permission flag
BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
To avoid to create unusable inodes in user space the hook also
checks SB_I_NODEV on the corresponding super block.

Further, the security_sb_alloc_userns() hook is implemented using
cgroup_bpf_current_enabled() to allow usage of device nodes on super
blocks mounted by a guarded task.

Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM

Patch 4 to 8 rework explicit calls to devcgroup_check_permission
also as LSM hooks and finalize the conversion of the device_cgroup
subsystem to a LSM.

Patch 9 and 10 introduce new generic security hooks to be used
for the actual mknod device guard implementation.

Patch 11 wires up the security hooks in the vfs

Patch 12 and 13 provide helper functions in the bpf cgroup
subsystem.

Patch 14 finally implement the LSM hooks to grand access

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
Changes in v2:
- Integrate this as LSM (Christian, Paul)
- Switched to a device cgroup specific flag instead of a generic
  bpf program flag (Christian)
- do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
  sb_alloc_super in fs/super.c
- Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de

Michael Weiß (14):
  device_cgroup: Implement devcgroup hooks as lsm security hooks
  vfs: Remove explicit devcgroup_inode calls
  device_cgroup: Remove explicit devcgroup_inode hooks
  lsm: Add security_dev_permission() hook
  device_cgroup: Implement dev_permission() hook
  block: Switch from devcgroup_check_permission to security hook
  drm/amdkfd: Switch from devcgroup_check_permission to security hook
  device_cgroup: Hide devcgroup functionality completely in lsm
  lsm: Add security_inode_mknod_nscap() hook
  lsm: Add security_sb_alloc_userns() hook
  vfs: Wire up security hooks for lsm-based device guard in userns
  bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
  bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
  device_cgroup: Allow mknod in non-initial userns if guarded

 block/bdev.c                                 |   9 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h        |   7 +-
 fs/namei.c                                   |  24 ++--
 fs/super.c                                   |   6 +-
 include/linux/bpf-cgroup.h                   |   2 +
 include/linux/device_cgroup.h                |  67 -----------
 include/linux/lsm_hook_defs.h                |   4 +
 include/linux/security.h                     |  18 +++
 include/uapi/linux/bpf.h                     |   1 +
 init/Kconfig                                 |   4 +
 kernel/bpf/cgroup.c                          |  14 +++
 security/Kconfig                             |   1 +
 security/Makefile                            |   2 +-
 security/device_cgroup/Kconfig               |   7 ++
 security/device_cgroup/Makefile              |   4 +
 security/{ => device_cgroup}/device_cgroup.c |   3 +-
 security/device_cgroup/device_cgroup.h       |  20 ++++
 security/device_cgroup/lsm.c                 | 114 +++++++++++++++++++
 security/security.c                          |  75 ++++++++++++
 19 files changed, 294 insertions(+), 88 deletions(-)
 delete mode 100644 include/linux/device_cgroup.h
 create mode 100644 security/device_cgroup/Kconfig
 create mode 100644 security/device_cgroup/Makefile
 rename security/{ => device_cgroup}/device_cgroup.c (99%)
 create mode 100644 security/device_cgroup/device_cgroup.h
 create mode 100644 security/device_cgroup/lsm.c


base-commit: 58720809f52779dc0f08e53e54b014209d13eebb
  

Comments

Paul Moore Oct. 25, 2023, 1:17 p.m. UTC | #1
On Wed, Oct 25, 2023 at 5:42 AM Michael Weiß
<michael.weiss@aisec.fraunhofer.de> wrote:
>
> Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
> BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod
> in non-initial user namespaces.
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod()
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, we also use in GyroidOS, is to run virsh for
> VMs inside an unprivileged container. virsh creates device nodes,
> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
> in a non-initial userns, even if a cgroup device white list with the
> corresponding major, minor of /dev/null exists. Thus, in this case
> the usual bind mounts or pre populated device nodes under /dev are
> not sufficient.
>
> To circumvent this limitation, allow mknod() by checking CAP_MKNOD
> in the userns by implementing the security_inode_mknod_nscap(). The
> hook implementation checks if the corresponding permission flag
> BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
> To avoid to create unusable inodes in user space the hook also
> checks SB_I_NODEV on the corresponding super block.
>
> Further, the security_sb_alloc_userns() hook is implemented using
> cgroup_bpf_current_enabled() to allow usage of device nodes on super
> blocks mounted by a guarded task.
>
> Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM
>
> Patch 4 to 8 rework explicit calls to devcgroup_check_permission
> also as LSM hooks and finalize the conversion of the device_cgroup
> subsystem to a LSM.
>
> Patch 9 and 10 introduce new generic security hooks to be used
> for the actual mknod device guard implementation.
>
> Patch 11 wires up the security hooks in the vfs
>
> Patch 12 and 13 provide helper functions in the bpf cgroup
> subsystem.
>
> Patch 14 finally implement the LSM hooks to grand access
>
> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> ---
> Changes in v2:
> - Integrate this as LSM (Christian, Paul)
> - Switched to a device cgroup specific flag instead of a generic
>   bpf program flag (Christian)
> - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
>   sb_alloc_super in fs/super.c
> - Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de
>
> Michael Weiß (14):
>   device_cgroup: Implement devcgroup hooks as lsm security hooks
>   vfs: Remove explicit devcgroup_inode calls
>   device_cgroup: Remove explicit devcgroup_inode hooks
>   lsm: Add security_dev_permission() hook
>   device_cgroup: Implement dev_permission() hook
>   block: Switch from devcgroup_check_permission to security hook
>   drm/amdkfd: Switch from devcgroup_check_permission to security hook
>   device_cgroup: Hide devcgroup functionality completely in lsm
>   lsm: Add security_inode_mknod_nscap() hook
>   lsm: Add security_sb_alloc_userns() hook
>   vfs: Wire up security hooks for lsm-based device guard in userns
>   bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
>   bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
>   device_cgroup: Allow mknod in non-initial userns if guarded
>
>  block/bdev.c                                 |   9 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h        |   7 +-
>  fs/namei.c                                   |  24 ++--
>  fs/super.c                                   |   6 +-
>  include/linux/bpf-cgroup.h                   |   2 +
>  include/linux/device_cgroup.h                |  67 -----------
>  include/linux/lsm_hook_defs.h                |   4 +
>  include/linux/security.h                     |  18 +++
>  include/uapi/linux/bpf.h                     |   1 +
>  init/Kconfig                                 |   4 +
>  kernel/bpf/cgroup.c                          |  14 +++
>  security/Kconfig                             |   1 +
>  security/Makefile                            |   2 +-
>  security/device_cgroup/Kconfig               |   7 ++
>  security/device_cgroup/Makefile              |   4 +
>  security/{ => device_cgroup}/device_cgroup.c |   3 +-
>  security/device_cgroup/device_cgroup.h       |  20 ++++
>  security/device_cgroup/lsm.c                 | 114 +++++++++++++++++++
>  security/security.c                          |  75 ++++++++++++
>  19 files changed, 294 insertions(+), 88 deletions(-)
>  delete mode 100644 include/linux/device_cgroup.h
>  create mode 100644 security/device_cgroup/Kconfig
>  create mode 100644 security/device_cgroup/Makefile
>  rename security/{ => device_cgroup}/device_cgroup.c (99%)
>  create mode 100644 security/device_cgroup/device_cgroup.h
>  create mode 100644 security/device_cgroup/lsm.c

Hi Michael,

I think this was lost because it wasn't CC'd to the LSM list (see
below).  I've CC'd the list on my reply, but future patch submissions
that involve the LSM must be posted to the LSM list if you would like
them to be considered.

http://vger.kernel.org/vger-lists.html#linux-security-module
  
Michael Weiß Oct. 25, 2023, 6:11 p.m. UTC | #2
On 25.10.23 15:17, Paul Moore wrote:
> On Wed, Oct 25, 2023 at 5:42 AM Michael Weiß
> <michael.weiss@aisec.fraunhofer.de> wrote:
>>
>> Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
>> BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod
>> in non-initial user namespaces.
>>
>> If a container manager restricts its unprivileged (user namespaced)
>> children by a device cgroup, it is not necessary to deny mknod()
>> anymore. Thus, user space applications may map devices on different
>> locations in the file system by using mknod() inside the container.
>>
>> A use case for this, we also use in GyroidOS, is to run virsh for
>> VMs inside an unprivileged container. virsh creates device nodes,
>> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
>> in a non-initial userns, even if a cgroup device white list with the
>> corresponding major, minor of /dev/null exists. Thus, in this case
>> the usual bind mounts or pre populated device nodes under /dev are
>> not sufficient.
>>
>> To circumvent this limitation, allow mknod() by checking CAP_MKNOD
>> in the userns by implementing the security_inode_mknod_nscap(). The
>> hook implementation checks if the corresponding permission flag
>> BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
>> To avoid to create unusable inodes in user space the hook also
>> checks SB_I_NODEV on the corresponding super block.
>>
>> Further, the security_sb_alloc_userns() hook is implemented using
>> cgroup_bpf_current_enabled() to allow usage of device nodes on super
>> blocks mounted by a guarded task.
>>
>> Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM
>>
>> Patch 4 to 8 rework explicit calls to devcgroup_check_permission
>> also as LSM hooks and finalize the conversion of the device_cgroup
>> subsystem to a LSM.
>>
>> Patch 9 and 10 introduce new generic security hooks to be used
>> for the actual mknod device guard implementation.
>>
>> Patch 11 wires up the security hooks in the vfs
>>
>> Patch 12 and 13 provide helper functions in the bpf cgroup
>> subsystem.
>>
>> Patch 14 finally implement the LSM hooks to grand access
>>
>> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
>> ---
>> Changes in v2:
>> - Integrate this as LSM (Christian, Paul)
>> - Switched to a device cgroup specific flag instead of a generic
>>   bpf program flag (Christian)
>> - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
>>   sb_alloc_super in fs/super.c
>> - Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de
>>
>> Michael Weiß (14):
>>   device_cgroup: Implement devcgroup hooks as lsm security hooks
>>   vfs: Remove explicit devcgroup_inode calls
>>   device_cgroup: Remove explicit devcgroup_inode hooks
>>   lsm: Add security_dev_permission() hook
>>   device_cgroup: Implement dev_permission() hook
>>   block: Switch from devcgroup_check_permission to security hook
>>   drm/amdkfd: Switch from devcgroup_check_permission to security hook
>>   device_cgroup: Hide devcgroup functionality completely in lsm
>>   lsm: Add security_inode_mknod_nscap() hook
>>   lsm: Add security_sb_alloc_userns() hook
>>   vfs: Wire up security hooks for lsm-based device guard in userns
>>   bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
>>   bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
>>   device_cgroup: Allow mknod in non-initial userns if guarded
>>
>>  block/bdev.c                                 |   9 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h        |   7 +-
>>  fs/namei.c                                   |  24 ++--
>>  fs/super.c                                   |   6 +-
>>  include/linux/bpf-cgroup.h                   |   2 +
>>  include/linux/device_cgroup.h                |  67 -----------
>>  include/linux/lsm_hook_defs.h                |   4 +
>>  include/linux/security.h                     |  18 +++
>>  include/uapi/linux/bpf.h                     |   1 +
>>  init/Kconfig                                 |   4 +
>>  kernel/bpf/cgroup.c                          |  14 +++
>>  security/Kconfig                             |   1 +
>>  security/Makefile                            |   2 +-
>>  security/device_cgroup/Kconfig               |   7 ++
>>  security/device_cgroup/Makefile              |   4 +
>>  security/{ => device_cgroup}/device_cgroup.c |   3 +-
>>  security/device_cgroup/device_cgroup.h       |  20 ++++
>>  security/device_cgroup/lsm.c                 | 114 +++++++++++++++++++
>>  security/security.c                          |  75 ++++++++++++
>>  19 files changed, 294 insertions(+), 88 deletions(-)
>>  delete mode 100644 include/linux/device_cgroup.h
>>  create mode 100644 security/device_cgroup/Kconfig
>>  create mode 100644 security/device_cgroup/Makefile
>>  rename security/{ => device_cgroup}/device_cgroup.c (99%)
>>  create mode 100644 security/device_cgroup/device_cgroup.h
>>  create mode 100644 security/device_cgroup/lsm.c
> 
> Hi Michael,
> 
> I think this was lost because it wasn't CC'd to the LSM list (see
> below).  I've CC'd the list on my reply, but future patch submissions
> that involve the LSM must be posted to the LSM list if you would like
> them to be considered.
> 
> http://vger.kernel.org/vger-lists.html#linux-security-module
> 
Hi Paul,

thanks, I'll keep this in mind for the next submissions.

I have also resend because, I realized that some spam filters my
have swallowed the last submission as I used my private smtp server
from another domain in the gitconfig. Sorry for that. I hope now
every one received it.

Thanks,
Michael
  
Christian Brauner Nov. 24, 2023, 4:47 p.m. UTC | #3
> - Integrate this as LSM (Christian, Paul)

Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
thing...

So I dislike this less than the initial version that just worked around
SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:

(1) current device access management
    I summarized the current places where that's done very very briefly in
    https://lore.kernel.org/all/20230815-feigling-kopfsache-56c2d31275bd@brauner

    * inode_permission()
      -> devcgroup_inode_permission()

    * vfs_mknod()
      -> devcgroup_inode_mknod()

    * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
      -> devcgroup_check_permission()

    * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
      -> devcgroup_check_permission()

    but that's not enough. What we need is a summary of how device node
    creation and device node opening currently interact.

    Because it is subtle. Currently you cannot even create device nodes
    without capable(CAP_SYS_ADMIN) and you can't open any existing ones
    if you lack capable(CAP_SYS_ADMIN).

    Years ago we tried that insane model where we enabled userspace to
    create device nodes but not open them. IOW, the capability check
    for device node creation was lifted but the SB_I_NODEV limitation
    wasn't lifted. That broke the whole world and had to be reverted.

(2) LSM device access management

    I really want to be able to see how you envision the permission
    checking to work in the new model. Specifically:

    * How do device node creation and device node opening interact.
    * The consequences of allowing to remove the SB_I_NODEV restriction.
    * Permission checking for users without and without a bpf guarded
      profile.

If you write this down we'll add it to documentation as well or to the
commit messages. It won't be lost. It doesn't have to be some really
long thing. I just want to better understand what you think this is
going to do and what the code does.
  
Christian Brauner Nov. 24, 2023, 6:08 p.m. UTC | #4
On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
> > - Integrate this as LSM (Christian, Paul)
> 
> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
> thing...
> 
> So I dislike this less than the initial version that just worked around

Hm, I wonder if we're being to timid or too complex in how we want to
solve this problem.

The device cgroup management logic is hacked into multiple layers and is
frankly pretty appalling.

What I think device access management wants to look like is that you can
implement a policy in an LSM - be it bpf or regular selinux - and have
this guarded by the main hooks:

security_file_open()
security_inode_mknod()

So, look at:

vfs_get_tree()
-> security_sb_set_mnt_opts()
   -> bpf_sb_set_mnt_opts()

A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
today via bpf_sb_set_mnt_opts() without any kernel changes at all.

I assume that a bpf LSM can also keep state in sb->s_security just like
selinux et al? If so then a device access management program or whatever
can be stored in sb->s_security.

That device access management program would then be run on each call to:

security_file_open()
-> bpf_file_open()

and

security_inode_mknod()
-> bpf_sb_set_mnt_opts()

and take access decisions.

This obviously makes device access management something that's tied
completely to a filesystem. So, you could have the same device node on
two tmpfs filesystems both mounted in the same userns.

The first tmpfs has SB_I_NODEV and doesn't allow you to open that
device. The second tmpfs has a bpf LSM program attached to it that has
stripped SB_I_NODEV and manages device access and allows callers to open
that device.

I guess it's even possible to restrict this on a caller basis by marking
them with a "container id" when the container is started. That can be
done with that task storage thing also via a bpf LSM hook. And then
you can further restrict device access to only those tasks that have a
specific container id in some range or some token or something.

I might just be fantasizing abilities into bpf that it doesn't have so
anyone with the knowledge please speak up.

If this is feasible then the only thing we need to figure out is what to
do with the legacy cgroup access management and specifically the
capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

So, we could introduce a sysctl that makes it possible to turn this
check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
SB_I_NODEV it is inherently safe to do that. It's just that a lot of
container runtimes need to have time to adapt to a world where you may
be able to create a device but not be able to then open it. This isn't
rocket science but it will take time.

But in the end this will mean we get away with minimal kernel changes
and using a lot of existing infrastructure.

Thoughts?
  
Michael Weiß Nov. 28, 2023, 8:54 p.m. UTC | #5
On 24.11.23 19:08, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
>>> - Integrate this as LSM (Christian, Paul)
>>
>> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
>> thing...
>>
>> So I dislike this less than the initial version that just worked around
>> SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:

Since we have some new Ideas, I also will try to better describe
the vision and current device node interaction when submitting the next
version.

> 
> Hm, I wonder if we're being to timid or too complex in how we want to
> solve this problem.
> 
> The device cgroup management logic is hacked into multiple layers and is
> frankly pretty appalling.
> 
> What I think device access management wants to look like is that you can
> implement a policy in an LSM - be it bpf or regular selinux - and have
> this guarded by the main hooks:
> 
> security_file_open()
> security_inode_mknod()
> 
> So, look at:
> 
> vfs_get_tree()
> -> security_sb_set_mnt_opts()
>    -> bpf_sb_set_mnt_opts()
> 
> A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
> today via bpf_sb_set_mnt_opts() without any kernel changes at all.
> > I assume that a bpf LSM can also keep state in sb->s_security just like
> selinux et al? If so then a device access management program or whatever
> can be stored in sb->s_security.
> 
> That device access management program would then be run on each call to:
> 
> security_file_open()
> -> bpf_file_open()
> 
> and
> 
> security_inode_mknod()
> -> bpf_sb_set_mnt_opts()
> 
> and take access security_sb_set_mnt_optsdecisions.
> 
> This obviously makes device access management something that's tied
> completely to a filesystem. So, you could have the same device node on
> two tmpfs filesystems both mounted in the same userns.
> 
> The first tmpfs has SB_I_NODEV and doesn't allow you to open that
> device. The second tmpfs has a bpf LSM program attached to it that has
> stripped SB_I_NODEV and manages device access and allows callers to open
> that device.

I like the approach to clear SB_I_NODEV in security_sb_set_mnt_opts().
I still have to sort this out but I think that was the missing piece in
the current patch set.

> 
> I guess it's even possible to restrict this on a caller basis by marking
> them with a "container id" when the container is started. That can be
> done with that task storage thing also via a bpf LSM hook. And then
> you can further restrict device access to only those tasks that have a
> specific container id in some range or some token or something.
> 
> I might just be fantasizing abilities into bpf that it doesn't have so
> anyone with the knowledge please speak up.
> 
> If this is feasible then the only thing we need to figure out is what to
> do with the legacy cgroup access management and specifically the
> capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

First to make this clear, we speak about CAP_SYS_MKNOD.

My approach was to restructure the cgroup_device in an own cgroup_device
lsm not in the current bpf lsm, to also be able to handle the legacy calls.
Especially, the remaining direct calls to devcgroup_check_permission() are
transformed to a new security_hook security_dev_permission() which is
similar to security_inode_permission() but could be called in such places
where only the dev_t representation is available. However, if we
implement it that way you sketched up above, we can just leave the
devcgroup_check_permission() in its current place and only implement
the devcgroup_inode_permission() and devcgroup_mknode and let the blk
and amd/gpu drivers as is for now, or just leave all the devcgroup_*()
hooks as is.

> 
> So, we could introduce a sysctl that makes it possible to turn this
> check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
> SB_I_NODEV it is inherently safe to do that. It's just that a lot of
> container runtimes need to have time to adapt to a world where you may
> be able to create a device but not be able to then open it. This isn't
> rocket science but it will take time.

True. I think a sysctl would be a good option.

> 
> But in the end this will mean we get away with minimal kernel changes
> and using a lot of existing infrastructure.
> 
> Thoughts?

For the whole bpf lsm part I'm not confident enough to make any
proposition, yet.
But I think an own simple devnode lsm would be feasible with the above
described security_sb_set_mnt_opts() handling to get this idea realized.
Maybe we go that way to implement a simple lsm without any changes to
the device_cgroup and keep the devcgroup hooks in place. To implement
it as bpf lsm with all full blown conatiner_id could then be done
orthogonally. 
So from a simple container runtime perspective one could just use the
simple lsm and the existing bpf device cgroup program without any change
only having to activate the sysctl. A more complex cloud setup Kubernetes
what so ever, could then use bpf lsm approach.