Message ID | 20230326092208.13613-1-laoar.shao@gmail.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp847999vqo; Sun, 26 Mar 2023 02:50:26 -0700 (PDT) X-Google-Smtp-Source: AKy350Y+SqM2bp906Od8LdyzYqDCqLC4Uv69SxIGZpxASd29nT6nSrAByvNr2cs0AURtq+jnwzix X-Received: by 2002:a17:906:7312:b0:91f:5f9c:5db6 with SMTP id di18-20020a170906731200b0091f5f9c5db6mr12343742ejc.52.1679824226001; Sun, 26 Mar 2023 02:50:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1679824225; cv=none; d=google.com; s=arc-20160816; b=Jrc7lkyXbQfFoRbMBt6IqoSb91AX8MTSY0LU7MNk5CqLDgIRP/Z7Kvy1nLlYQV6S2c XrNkQRUzjKg1EFpq90gJl64LsEp623ppRnksOX2Fj6S82+rt7JJ7J77BQUnrFkXeFXx2 t+WWjNXMQ2z6fKaf41LJnEZ2BiRRU7yrPNPb/B2NTkw8xEWlSI7cqm+r6k6uB8Ux6yWP gEAO7oiBdAHIUpeB3/8XJybHPKVxMexlXIJ6xeW/528/QWiLuBJnEqH7x2JlUCuGNlYo VghJMjjtAcJ7FdG+uHwEj65gOSGCwoM5sBYc/tMI71BOGwOAAwVqISlRMBRyRe3UG/PT /P6Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=4JUY0vHvLERzzHWhNq8Rp87qcHqbkDVGp191/I1Fx0o=; b=yyx8aZe1Q2Ylqhrr8UiN8NscK8N2k6pMKkrhG+iyIRr5110QxgQtQ1DdLOEq6kxpZ9 jeNHSKvGnd1VKGQNOVaP2+3AfC5cewhWXreQUA/k20M/CZICaviRUczK2GZ8jHQxP2VH VKQpticmx8pLQ3ws7s6UD5aj6Fr9AG+0Zzy9hDr829LfSjDrJUrhGhaTlLyki8MokuEt 937cc731MIxfs4G5UYXnbE5RWUoCYoEnL/AHxxU6tf8t2G2eQ8pohl3jvntduNvyN2Pz hGtqL194KzVyM7VOppVo9ZlJPC2t8QtveqYpE/cmumShaoHkL6oBxOG6z3qkOwoibGyB yaAg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=hxPGVTIL; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id r21-20020a1709064d1500b009332d678282si5630598eju.692.2023.03.26.02.50.01; Sun, 26 Mar 2023 02:50:25 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=hxPGVTIL; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231906AbjCZJW0 (ORCPT <rfc822;makky5685@gmail.com> + 99 others); Sun, 26 Mar 2023 05:22:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43006 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229655AbjCZJWU (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sun, 26 Mar 2023 05:22:20 -0400 Received: from mail-qt1-x831.google.com (mail-qt1-x831.google.com [IPv6:2607:f8b0:4864:20::831]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 14C2F5BB1; Sun, 26 Mar 2023 02:22:19 -0700 (PDT) Received: by mail-qt1-x831.google.com with SMTP id t19so5854847qta.12; Sun, 26 Mar 2023 02:22:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1679822537; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=4JUY0vHvLERzzHWhNq8Rp87qcHqbkDVGp191/I1Fx0o=; b=hxPGVTILocq4nVaMiPMZ+SChKdyPTIUO2eHI7YCoUEHkOB5mK5TxFtSs1jCKOUPltB xykIzzPNFliOyqGx5MZoG/R2wB4+OgBiBgr9CYJAKiXmBJMNC08/+X8naNgmlbkhbPgy 3KplZNvj6cImdSZ72dAZGR7JS/GvIFsOkBoQp5umvFqDz2VMDNUS8FU6OTfyGfahpeBf UsaKolEb9DWSuloVPXcVhleg0kUaoKxAhWq78chCp7T/wEx8uZP9JoFw51eUGt00joYz VGx7N7K6YHPOZjUjdiwXPwB9YkKdSmT84qQwDFechghEBJhitGcoLk2vS6ZthdODqDIA Q+0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679822537; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=4JUY0vHvLERzzHWhNq8Rp87qcHqbkDVGp191/I1Fx0o=; b=WL7xqxtTK2cmtizlZef+TCvGV2o7/PCUVOd4pOI3JuGubXxz9eiTiHFdeB+XQDNvRu gwJ2nPRlk0KFo7TifJHMo/UEQCTe3LLUqNJ+zt5UFv1TOTKKG0lUZnLyA9Y1uR2D7Vhj W8RImHYtp9kq/1uJbwJoSoQ9+miF7SqfpUxwJHyXa3UtYVknjnHZeG5KTZ33F9zP5YsD D0Wzmry2xGZ3+D0c4ag29ZFzVk6mMKQeXBd0X5qX2SAWQ6rRYy/kIozSjq0ZYnlh/k++ yCDTyBy+CEATlFNfyCby1AyJxuwfE7Vrzyz2Oocmp95jsz+1TnCuKf/Vh4RqjHJDNQIf od4w== X-Gm-Message-State: AO0yUKWNFdWMkNAN/7pC9+ZK+K2xEM9RWaREzQuCOWaKRjpU7hEh4qjT mQl3qyTal7E3lznB4Ua8JbE= X-Received: by 2002:a05:622a:1983:b0:3e0:3d65:19d2 with SMTP id u3-20020a05622a198300b003e03d6519d2mr14458863qtc.20.1679822537149; Sun, 26 Mar 2023 02:22:17 -0700 (PDT) Received: from vultr.guest ([2001:19f0:1000:1a1f:5400:4ff:fe5e:1d32]) by smtp.gmail.com with ESMTPSA id y5-20020ac87085000000b003e014845d9esm10257987qto.74.2023.03.26.02.22.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 26 Mar 2023 02:22:16 -0700 (PDT) From: Yafang Shao <laoar.shao@gmail.com> To: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, kafai@fb.com, songliubraving@fb.com, yhs@fb.com, john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Yafang Shao <laoar.shao@gmail.com> Subject: [RFC PATCH bpf-next 00/13] bpf: Introduce BPF namespace Date: Sun, 26 Mar 2023 09:21:55 +0000 Message-Id: <20230326092208.13613-1-laoar.shao@gmail.com> X-Mailer: git-send-email 2.39.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.2 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1761423367575533488?= X-GMAIL-MSGID: =?utf-8?q?1761423367575533488?= |
Series |
bpf: Introduce BPF namespace
|
|
Message
Yafang Shao
March 26, 2023, 9:21 a.m. UTC
Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs to FDs, that's intended for BPF's security model[1]. Not only does it prevent non-privilidged users from getting other users' bpf program, but also it prevents the user from iterating his own bpf objects. In container environment, some users want to run bpf programs in their containers. These users can run their bpf programs under CAP_BPF and some other specific CAPs, but they can't inspect their bpf programs in a generic way. For example, the bpftool can't be used as it requires CAP_SYS_ADMIN. That is very inconvenient. Without CAP_SYS_ADMIN, the only way to get the information of a bpf object which is not created by the process itself is with SCM_RIGHTS, that requires each processes which created bpf object has to implement a unix domain socket to share the fd of a bpf object between different processes, that is really trivial and troublesome. Hence we need a better mechanism to get bpf object info without CAP_SYS_ADMIN. BPF namespace is introduced in this patchset with an attempt to remove the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and link in a specific bpf namespace, then these bpf objects will not be visible to the users in a different bpf namespace. But these bpf objects are visible to its parent bpf namespace, so the sys admin can still iterate and inspect them. BPF namespace is similar to PID namespace, and the bpf objects are similar to tasks, so BPF namespace is very easy to understand. These patchset only implements BPF namespace for bpf map, prog and link. In the future we may extend it to other bpf objects like btf, bpffs and etc. For example, we can allow some of the BTF objects to be used in non-init bpf namespace, then the container user can only trace the processes running in his container, but can't get the information of tasks running in other containers. A simple example is introduced into selftests/bpf on how to use the bpf namespace. Putting bpf map, prog and link into bpf namespace is the first step. Let's start with it. [1]. https://lore.kernel.org/bpf/20200513230355.7858-1-alexei.starovoitov@gmail.com/ Yafang Shao (13): fork: New clone3 flag for BPF namespace proc_ns: Extend the field type in struct proc_ns_operations to long bpf: Implement bpf namespace bpf: No need to check if id is 0 bpf: Make bpf objects id have the same alloc and free pattern bpf: Helpers to alloc and free object id in bpf namespace bpf: Add bpf helper to get bpf object id bpf: Alloc and free bpf_map id in bpf namespace bpf: Alloc and free bpf_prog id in bpf namespace bpf: Alloc and free bpf_link id in bpf namespace bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace bpf: Use bpf_idr_lock array instead selftests/bpf: Add selftest for bpf namespace fs/proc/namespaces.c | 4 + include/linux/bpf.h | 9 +- include/linux/bpf_namespace.h | 88 ++++++++++ include/linux/nsproxy.h | 4 + include/linux/proc_ns.h | 3 +- include/linux/user_namespace.h | 1 + include/uapi/linux/bpf.h | 7 + include/uapi/linux/sched.h | 1 + kernel/bpf/Makefile | 1 + kernel/bpf/bpf_namespace.c | 283 ++++++++++++++++++++++++++++++ kernel/bpf/offload.c | 16 +- kernel/bpf/syscall.c | 262 ++++++++++----------------- kernel/bpf/task_iter.c | 12 ++ kernel/fork.c | 5 +- kernel/nsproxy.c | 19 +- kernel/trace/bpf_trace.c | 2 + kernel/ucount.c | 1 + tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 13 +- tools/include/uapi/linux/bpf.h | 7 + tools/testing/selftests/bpf/Makefile | 3 +- tools/testing/selftests/bpf/test_bpfns.c | 76 ++++++++ 21 files changed, 637 insertions(+), 180 deletions(-) create mode 100644 include/linux/bpf_namespace.h create mode 100644 kernel/bpf/bpf_namespace.c create mode 100644 tools/testing/selftests/bpf/test_bpfns.c
Comments
Yafang Shao <laoar.shao@gmail.com> writes: > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > to FDs, that's intended for BPF's security model[1]. Not only does it > prevent non-privilidged users from getting other users' bpf program, but > also it prevents the user from iterating his own bpf objects. > > In container environment, some users want to run bpf programs in their > containers. These users can run their bpf programs under CAP_BPF and > some other specific CAPs, but they can't inspect their bpf programs in a > generic way. For example, the bpftool can't be used as it requires > CAP_SYS_ADMIN. That is very inconvenient. > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object > which is not created by the process itself is with SCM_RIGHTS, that > requires each processes which created bpf object has to implement a unix > domain socket to share the fd of a bpf object between different > processes, that is really trivial and troublesome. > > Hence we need a better mechanism to get bpf object info without > CAP_SYS_ADMIN. > > BPF namespace is introduced in this patchset with an attempt to remove > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > link in a specific bpf namespace, then these bpf objects will not be > visible to the users in a different bpf namespace. But these bpf > objects are visible to its parent bpf namespace, so the sys admin can > still iterate and inspect them. > > BPF namespace is similar to PID namespace, and the bpf objects are > similar to tasks, so BPF namespace is very easy to understand. These > patchset only implements BPF namespace for bpf map, prog and link. In the > future we may extend it to other bpf objects like btf, bpffs and etc. May? I think we should cover all of the existing BPF objects from the beginning here, or we may miss important interactions that will invalidate the whole idea. In particular, I'm a little worried about the interaction between namespaces and bpffs; what happens if you're in a bpf namespace and you try to read a BPF object from a bpffs that belongs to a different namespace? Does the operation fail? Is the object hidden entirely? Something else? -Toke
On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > to FDs, that's intended for BPF's security model[1]. Not only does it > > prevent non-privilidged users from getting other users' bpf program, but > > also it prevents the user from iterating his own bpf objects. > > > > In container environment, some users want to run bpf programs in their > > containers. These users can run their bpf programs under CAP_BPF and > > some other specific CAPs, but they can't inspect their bpf programs in a > > generic way. For example, the bpftool can't be used as it requires > > CAP_SYS_ADMIN. That is very inconvenient. > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object > > which is not created by the process itself is with SCM_RIGHTS, that > > requires each processes which created bpf object has to implement a unix > > domain socket to share the fd of a bpf object between different > > processes, that is really trivial and troublesome. > > > > Hence we need a better mechanism to get bpf object info without > > CAP_SYS_ADMIN. > > > > BPF namespace is introduced in this patchset with an attempt to remove > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > link in a specific bpf namespace, then these bpf objects will not be > > visible to the users in a different bpf namespace. But these bpf > > objects are visible to its parent bpf namespace, so the sys admin can > > still iterate and inspect them. > > > > BPF namespace is similar to PID namespace, and the bpf objects are > > similar to tasks, so BPF namespace is very easy to understand. These > > patchset only implements BPF namespace for bpf map, prog and link. In the > > future we may extend it to other bpf objects like btf, bpffs and etc. > > May? I think we should cover all of the existing BPF objects from the > beginning here, or we may miss important interactions that will > invalidate the whole idea. This patchset is intended to address iterating bpf IDs and converting IDs to FDs. To be more specific, it covers BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID. It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID, but I don't implement it because I find we can do more wrt BTF, for example, if we can expose a small amount of BTFs in the vmlinux to non-root bpf namespace. But, yes, I should implement BTF ID in this patchset. > In particular, I'm a little worried about the > interaction between namespaces and bpffs; what happens if you're in a > bpf namespace and you try to read a BPF object from a bpffs that belongs > to a different namespace? Does the operation fail? Is the object hidden > entirely? Something else? > bpffs is a different topic and it can be implemented in later patchsets. bpffs has its own specific problem even without the bpf namespace. 1. The user can always get the information of a bpf object through its corresponding pinned file. In our practice, different container users have different bpffs, and we allow the container user to bind-mount its bpffs only, so others' bpffs are invisible. To make it better with the bpf namespace, I think we can fail the operation if the pinned file doesn't belong to its bpf namespace. That said, we will add pinned bpf files into the bpf namespace in the next step. 2. The user can always iterate bpf objects through progs.debug and maps.debug progs.debug and maps.debug are debugging purposes only. So I think we can handle it later.
On 03/26, Yafang Shao wrote: > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > to FDs, that's intended for BPF's security model[1]. Not only does it > prevent non-privilidged users from getting other users' bpf program, but > also it prevents the user from iterating his own bpf objects. > In container environment, some users want to run bpf programs in their > containers. These users can run their bpf programs under CAP_BPF and > some other specific CAPs, but they can't inspect their bpf programs in a > generic way. For example, the bpftool can't be used as it requires > CAP_SYS_ADMIN. That is very inconvenient. > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object > which is not created by the process itself is with SCM_RIGHTS, that > requires each processes which created bpf object has to implement a unix > domain socket to share the fd of a bpf object between different > processes, that is really trivial and troublesome. > Hence we need a better mechanism to get bpf object info without > CAP_SYS_ADMIN. [..] > BPF namespace is introduced in this patchset with an attempt to remove > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > link in a specific bpf namespace, then these bpf objects will not be > visible to the users in a different bpf namespace. But these bpf > objects are visible to its parent bpf namespace, so the sys admin can > still iterate and inspect them. Does it essentially mean unpriv bpf? Can I, as a non-root, create a new bpf namespace and start loading/attaching progs? Maybe add a paragraph about now vs whatever you're proposing. Otherwise it's not very clear what's the security story. (haven't looked at the whole series, so maybe it's answered somewhere else?) > BPF namespace is similar to PID namespace, and the bpf objects are > similar to tasks, so BPF namespace is very easy to understand. These > patchset only implements BPF namespace for bpf map, prog and link. In the > future we may extend it to other bpf objects like btf, bpffs and etc. > For example, we can allow some of the BTF objects to be used in > non-init bpf namespace, then the container user can only trace the > processes running in his container, but can't get the information of > tasks running in other containers. > A simple example is introduced into selftests/bpf on how to use the bpf > namespace. > Putting bpf map, prog and link into bpf namespace is the first step. > Let's start with it. > [1]. > https://lore.kernel.org/bpf/20200513230355.7858-1-alexei.starovoitov@gmail.com/ > Yafang Shao (13): > fork: New clone3 flag for BPF namespace > proc_ns: Extend the field type in struct proc_ns_operations to long > bpf: Implement bpf namespace > bpf: No need to check if id is 0 > bpf: Make bpf objects id have the same alloc and free pattern > bpf: Helpers to alloc and free object id in bpf namespace > bpf: Add bpf helper to get bpf object id > bpf: Alloc and free bpf_map id in bpf namespace > bpf: Alloc and free bpf_prog id in bpf namespace > bpf: Alloc and free bpf_link id in bpf namespace > bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace > bpf: Use bpf_idr_lock array instead > selftests/bpf: Add selftest for bpf namespace > fs/proc/namespaces.c | 4 + > include/linux/bpf.h | 9 +- > include/linux/bpf_namespace.h | 88 ++++++++++ > include/linux/nsproxy.h | 4 + > include/linux/proc_ns.h | 3 +- > include/linux/user_namespace.h | 1 + > include/uapi/linux/bpf.h | 7 + > include/uapi/linux/sched.h | 1 + > kernel/bpf/Makefile | 1 + > kernel/bpf/bpf_namespace.c | 283 > ++++++++++++++++++++++++++++++ > kernel/bpf/offload.c | 16 +- > kernel/bpf/syscall.c | 262 > ++++++++++----------------- > kernel/bpf/task_iter.c | 12 ++ > kernel/fork.c | 5 +- > kernel/nsproxy.c | 19 +- > kernel/trace/bpf_trace.c | 2 + > kernel/ucount.c | 1 + > tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 13 +- > tools/include/uapi/linux/bpf.h | 7 + > tools/testing/selftests/bpf/Makefile | 3 +- > tools/testing/selftests/bpf/test_bpfns.c | 76 ++++++++ > 21 files changed, 637 insertions(+), 180 deletions(-) > create mode 100644 include/linux/bpf_namespace.h > create mode 100644 kernel/bpf/bpf_namespace.c > create mode 100644 tools/testing/selftests/bpf/test_bpfns.c > -- > 1.8.3.1
On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > to FDs, that's intended for BPF's security model[1]. Not only does it > prevent non-privilidged users from getting other users' bpf program, but > also it prevents the user from iterating his own bpf objects. > > In container environment, some users want to run bpf programs in their > containers. These users can run their bpf programs under CAP_BPF and > some other specific CAPs, but they can't inspect their bpf programs in a > generic way. For example, the bpftool can't be used as it requires > CAP_SYS_ADMIN. That is very inconvenient. Agreed that it is important to enable tools like bpftool without CAP_SYS_ADMIN. However, I am not sure whether we need a new namespace for this. Can we reuse some existing namespace for this? If we do need a new namespace, maybe we should share some effort with tracer namespace proposal [1]? Thanks, Song [1] https://lpc.events/event/16/contributions/1237/
Yafang Shao <laoar.shao@gmail.com> writes: > On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs >> > to FDs, that's intended for BPF's security model[1]. Not only does it >> > prevent non-privilidged users from getting other users' bpf program, but >> > also it prevents the user from iterating his own bpf objects. >> > >> > In container environment, some users want to run bpf programs in their >> > containers. These users can run their bpf programs under CAP_BPF and >> > some other specific CAPs, but they can't inspect their bpf programs in a >> > generic way. For example, the bpftool can't be used as it requires >> > CAP_SYS_ADMIN. That is very inconvenient. >> > >> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object >> > which is not created by the process itself is with SCM_RIGHTS, that >> > requires each processes which created bpf object has to implement a unix >> > domain socket to share the fd of a bpf object between different >> > processes, that is really trivial and troublesome. >> > >> > Hence we need a better mechanism to get bpf object info without >> > CAP_SYS_ADMIN. >> > >> > BPF namespace is introduced in this patchset with an attempt to remove >> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and >> > link in a specific bpf namespace, then these bpf objects will not be >> > visible to the users in a different bpf namespace. But these bpf >> > objects are visible to its parent bpf namespace, so the sys admin can >> > still iterate and inspect them. >> > >> > BPF namespace is similar to PID namespace, and the bpf objects are >> > similar to tasks, so BPF namespace is very easy to understand. These >> > patchset only implements BPF namespace for bpf map, prog and link. In the >> > future we may extend it to other bpf objects like btf, bpffs and etc. >> >> May? I think we should cover all of the existing BPF objects from the >> beginning here, or we may miss important interactions that will >> invalidate the whole idea. > > This patchset is intended to address iterating bpf IDs and converting > IDs to FDs. To be more specific, it covers > BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID. > It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID, > but I don't implement it because I find we can do more wrt BTF, for > example, if we can expose a small amount of BTFs in the vmlinux to > non-root bpf namespace. > But, yes, I should implement BTF ID in this patchset. Right, as you can see by my comment on that patch, not including the btf id is a tad confusing, so yeah, better include that. >> In particular, I'm a little worried about the >> interaction between namespaces and bpffs; what happens if you're in a >> bpf namespace and you try to read a BPF object from a bpffs that belongs >> to a different namespace? Does the operation fail? Is the object hidden >> entirely? Something else? >> > > bpffs is a different topic and it can be implemented in later patchsets. > bpffs has its own specific problem even without the bpf namespace. > 1. The user can always get the information of a bpf object through its > corresponding pinned file. > In our practice, different container users have different bpffs, and > we allow the container user to bind-mount its bpffs only, so others' > bpffs are invisible. > To make it better with the bpf namespace, I think we can fail the > operation if the pinned file doesn't belong to its bpf namespace. That > said, we will add pinned bpf files into the bpf namespace in the next > step. > > 2. The user can always iterate bpf objects through progs.debug and maps.debug > progs.debug and maps.debug are debugging purposes only. So I think we > can handle it later. Well, I disagree. Working out these issues with bpffs is an important aspect to get a consistent API, and handwaving it away risks merging something that will turn out to not be workable further down the line at which point we can't change it. -Toke
On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote: > > On 03/26, Yafang Shao wrote: > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > to FDs, that's intended for BPF's security model[1]. Not only does it > > prevent non-privilidged users from getting other users' bpf program, but > > also it prevents the user from iterating his own bpf objects. > > > In container environment, some users want to run bpf programs in their > > containers. These users can run their bpf programs under CAP_BPF and > > some other specific CAPs, but they can't inspect their bpf programs in a > > generic way. For example, the bpftool can't be used as it requires > > CAP_SYS_ADMIN. That is very inconvenient. > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object > > which is not created by the process itself is with SCM_RIGHTS, that > > requires each processes which created bpf object has to implement a unix > > domain socket to share the fd of a bpf object between different > > processes, that is really trivial and troublesome. > > > Hence we need a better mechanism to get bpf object info without > > CAP_SYS_ADMIN. > > [..] > > > BPF namespace is introduced in this patchset with an attempt to remove > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > link in a specific bpf namespace, then these bpf objects will not be > > visible to the users in a different bpf namespace. But these bpf > > objects are visible to its parent bpf namespace, so the sys admin can > > still iterate and inspect them. > > Does it essentially mean unpriv bpf? Right. With CAP_BPF and some other CAPs enabled. > Can I, as a non-root, create > a new bpf namespace and start loading/attaching progs? No, you can't create a new bpf namespace as a non-root, see also copy_namespaces(). In the container environment, new namespaces are always created by containered, which is started by root. > Maybe add a paragraph about now vs whatever you're proposing. What I'm proposing in this patchset is to put bpf objects (map, prog, link, and btf) into the bpf namespace. Next step I will put bpffs into the bpf namespace as well. That said, I'm trying to put all the objects created in bpf into the bpf namespace. Below is a simple paragraph to illustrate it. Regarding the unpriv user with CAP_BPF enabled, Now | Future ------------------------------------------------------------------------ Iterate his BPF IDs | N | Y | Iterate others' BPF IDs | N | N | Convert his BPF IDs to FDs | N | Y | Convert others' BPF IDs to FDs | N | N | Get others' object info from pinned file | Y(*) | N | ------------------------------------------------------------------------ (*) It can be improved by, 1). Different containers has different bpffs 2). Setting file permission That's not perfect, for example, if one single user has two bpf instances, but we don't want them to inspect each other. > Otherwise it's not very clear what's the security story. > (haven't looked at the whole series, so maybe it's answered somewhere else?) >
On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > to FDs, that's intended for BPF's security model[1]. Not only does it > > prevent non-privilidged users from getting other users' bpf program, but > > also it prevents the user from iterating his own bpf objects. > > > > In container environment, some users want to run bpf programs in their > > containers. These users can run their bpf programs under CAP_BPF and > > some other specific CAPs, but they can't inspect their bpf programs in a > > generic way. For example, the bpftool can't be used as it requires > > CAP_SYS_ADMIN. That is very inconvenient. > > Agreed that it is important to enable tools like bpftool without > CAP_SYS_ADMIN. However, I am not sure whether we need a new > namespace for this. Can we reuse some existing namespace for this? > It seems we can't. > If we do need a new namespace, maybe we should share some effort > with tracer namespace proposal [1]? > Thanks for your information. I will learn the tracer namespace first and try to analyze how to cooperate with it. > Thanks, > Song > > [1] https://lpc.events/event/16/contributions/1237/
On Tue, Mar 28, 2023 at 4:51 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > >> > to FDs, that's intended for BPF's security model[1]. Not only does it > >> > prevent non-privilidged users from getting other users' bpf program, but > >> > also it prevents the user from iterating his own bpf objects. > >> > > >> > In container environment, some users want to run bpf programs in their > >> > containers. These users can run their bpf programs under CAP_BPF and > >> > some other specific CAPs, but they can't inspect their bpf programs in a > >> > generic way. For example, the bpftool can't be used as it requires > >> > CAP_SYS_ADMIN. That is very inconvenient. > >> > > >> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object > >> > which is not created by the process itself is with SCM_RIGHTS, that > >> > requires each processes which created bpf object has to implement a unix > >> > domain socket to share the fd of a bpf object between different > >> > processes, that is really trivial and troublesome. > >> > > >> > Hence we need a better mechanism to get bpf object info without > >> > CAP_SYS_ADMIN. > >> > > >> > BPF namespace is introduced in this patchset with an attempt to remove > >> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > >> > link in a specific bpf namespace, then these bpf objects will not be > >> > visible to the users in a different bpf namespace. But these bpf > >> > objects are visible to its parent bpf namespace, so the sys admin can > >> > still iterate and inspect them. > >> > > >> > BPF namespace is similar to PID namespace, and the bpf objects are > >> > similar to tasks, so BPF namespace is very easy to understand. These > >> > patchset only implements BPF namespace for bpf map, prog and link. In the > >> > future we may extend it to other bpf objects like btf, bpffs and etc. > >> > >> May? I think we should cover all of the existing BPF objects from the > >> beginning here, or we may miss important interactions that will > >> invalidate the whole idea. > > > > This patchset is intended to address iterating bpf IDs and converting > > IDs to FDs. To be more specific, it covers > > BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID. > > It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID, > > but I don't implement it because I find we can do more wrt BTF, for > > example, if we can expose a small amount of BTFs in the vmlinux to > > non-root bpf namespace. > > But, yes, I should implement BTF ID in this patchset. > > Right, as you can see by my comment on that patch, not including the btf > id is a tad confusing, so yeah, better include that. > > >> In particular, I'm a little worried about the > >> interaction between namespaces and bpffs; what happens if you're in a > >> bpf namespace and you try to read a BPF object from a bpffs that belongs > >> to a different namespace? Does the operation fail? Is the object hidden > >> entirely? Something else? > >> > > > > bpffs is a different topic and it can be implemented in later patchsets. > > bpffs has its own specific problem even without the bpf namespace. > > 1. The user can always get the information of a bpf object through its > > corresponding pinned file. > > In our practice, different container users have different bpffs, and > > we allow the container user to bind-mount its bpffs only, so others' > > bpffs are invisible. > > To make it better with the bpf namespace, I think we can fail the > > operation if the pinned file doesn't belong to its bpf namespace. That > > said, we will add pinned bpf files into the bpf namespace in the next > > step. > > > > 2. The user can always iterate bpf objects through progs.debug and maps.debug > > progs.debug and maps.debug are debugging purposes only. So I think we > > can handle it later. > > Well, I disagree. Working out these issues with bpffs is an important > aspect to get a consistent API, and handwaving it away risks merging > something that will turn out to not be workable further down the line at > which point we can't change it. > Sure, I will include bpffs in the next version.
On 03/28, Yafang Shao wrote: > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > On 03/26, Yafang Shao wrote: > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert > IDs > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > prevent non-privilidged users from getting other users' bpf program, > but > > > also it prevents the user from iterating his own bpf objects. > > > > > In container environment, some users want to run bpf programs in their > > > containers. These users can run their bpf programs under CAP_BPF and > > > some other specific CAPs, but they can't inspect their bpf programs > in a > > > generic way. For example, the bpftool can't be used as it requires > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf > object > > > which is not created by the process itself is with SCM_RIGHTS, that > > > requires each processes which created bpf object has to implement a > unix > > > domain socket to share the fd of a bpf object between different > > > processes, that is really trivial and troublesome. > > > > > Hence we need a better mechanism to get bpf object info without > > > CAP_SYS_ADMIN. > > > > [..] > > > > > BPF namespace is introduced in this patchset with an attempt to remove > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > > link in a specific bpf namespace, then these bpf objects will not be > > > visible to the users in a different bpf namespace. But these bpf > > > objects are visible to its parent bpf namespace, so the sys admin can > > > still iterate and inspect them. > > > > Does it essentially mean unpriv bpf? > Right. With CAP_BPF and some other CAPs enabled. > > Can I, as a non-root, create > > a new bpf namespace and start loading/attaching progs? > No, you can't create a new bpf namespace as a non-root, see also > copy_namespaces(). > In the container environment, new namespaces are always created by > containered, which is started by root. Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part from copy_namespaces? Isn't it trivially bypassed with a new user namespace? IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN in this particular user-ns. Then I can go on and create a new bpf namespace (with CAP_BPF) and go wild? I won't see anything from the other namespaces, but I'll be able to load/attach bpf programs? > > Maybe add a paragraph about now vs whatever you're proposing. > What I'm proposing in this patchset is to put bpf objects (map, prog, > link, and btf) into the bpf namespace. Next step I will put bpffs into > the bpf namespace as well. > That said, I'm trying to put all the objects created in bpf into the > bpf namespace. Below is a simple paragraph to illustrate it. > Regarding the unpriv user with CAP_BPF enabled, > Now | Future > ------------------------------------------------------------------------ > Iterate his BPF IDs | N | Y | > Iterate others' BPF IDs | N | N | > Convert his BPF IDs to FDs | N | Y | > Convert others' BPF IDs to FDs | N | N | > Get others' object info from pinned file | Y(*) | N | > ------------------------------------------------------------------------ > (*) It can be improved by, > 1). Different containers has different bpffs > 2). Setting file permission > That's not perfect, for example, if one single user has two bpf > instances, but we don't want them to inspect each other. I think the question here is what happens to the existing capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually? And if not, I don't think it integrates well with the user namespaces? > > Otherwise it's not very clear what's the security story. > > (haven't looked at the whole series, so maybe it's answered somewhere > else?) > > > -- > Regards > Yafang
On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@google.com> wrote: > > On 03/28, Yafang Shao wrote: > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On 03/26, Yafang Shao wrote: > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert > > IDs > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > prevent non-privilidged users from getting other users' bpf program, > > but > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > In container environment, some users want to run bpf programs in their > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > some other specific CAPs, but they can't inspect their bpf programs > > in a > > > > generic way. For example, the bpftool can't be used as it requires > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf > > object > > > > which is not created by the process itself is with SCM_RIGHTS, that > > > > requires each processes which created bpf object has to implement a > > unix > > > > domain socket to share the fd of a bpf object between different > > > > processes, that is really trivial and troublesome. > > > > > > > Hence we need a better mechanism to get bpf object info without > > > > CAP_SYS_ADMIN. > > > > > > [..] > > > > > > > BPF namespace is introduced in this patchset with an attempt to remove > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > > > link in a specific bpf namespace, then these bpf objects will not be > > > > visible to the users in a different bpf namespace. But these bpf > > > > objects are visible to its parent bpf namespace, so the sys admin can > > > > still iterate and inspect them. > > > > > > Does it essentially mean unpriv bpf? > > > Right. With CAP_BPF and some other CAPs enabled. > > > > Can I, as a non-root, create > > > a new bpf namespace and start loading/attaching progs? > > > No, you can't create a new bpf namespace as a non-root, see also > > copy_namespaces(). > > In the container environment, new namespaces are always created by > > containered, which is started by root. > > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part > from copy_namespaces? Isn't it trivially bypassed with a new user > namespace? > > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN > in this particular user-ns. Then I can go on and create a new bpf > namespace (with CAP_BPF) and go wild? I won't see anything from the > other namespaces, but I'll be able to load/attach bpf programs? > I don't think so. If you create a new userspace, and give the process the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the initial namespace, you can't do that. Because currently only CAP_BPF or CAP_SYS_ADMIN in the init user namespace can load/attach bpf programs. > > > Maybe add a paragraph about now vs whatever you're proposing. > > > What I'm proposing in this patchset is to put bpf objects (map, prog, > > link, and btf) into the bpf namespace. Next step I will put bpffs into > > the bpf namespace as well. > > That said, I'm trying to put all the objects created in bpf into the > > bpf namespace. Below is a simple paragraph to illustrate it. > > > Regarding the unpriv user with CAP_BPF enabled, > > Now | Future > > ------------------------------------------------------------------------ > > Iterate his BPF IDs | N | Y | > > Iterate others' BPF IDs | N | N | > > Convert his BPF IDs to FDs | N | Y | > > Convert others' BPF IDs to FDs | N | N | > > Get others' object info from pinned file | Y(*) | N | > > ------------------------------------------------------------------------ > > > (*) It can be improved by, > > 1). Different containers has different bpffs > > 2). Setting file permission > > That's not perfect, for example, if one single user has two bpf > > instances, but we don't want them to inspect each other. > > I think the question here is what happens to the existing > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually? > They won't become ns_capable(CAP_BPF). If it becomes ns_capable(CAP_BPF), it will really go wild then. > And if not, I don't think it integrates well with the user namespaces? > IIUC, it is the CAP_BPF which doesn't integrate with the user namespaces, right?
On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > On 03/28, Yafang Shao wrote: > > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > On 03/26, Yafang Shao wrote: > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert > > > IDs > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > > prevent non-privilidged users from getting other users' bpf program, > > > but > > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > > In container environment, some users want to run bpf programs in their > > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > > some other specific CAPs, but they can't inspect their bpf programs > > > in a > > > > > generic way. For example, the bpftool can't be used as it requires > > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf > > > object > > > > > which is not created by the process itself is with SCM_RIGHTS, that > > > > > requires each processes which created bpf object has to implement a > > > unix > > > > > domain socket to share the fd of a bpf object between different > > > > > processes, that is really trivial and troublesome. > > > > > > > > > Hence we need a better mechanism to get bpf object info without > > > > > CAP_SYS_ADMIN. > > > > > > > > [..] > > > > > > > > > BPF namespace is introduced in this patchset with an attempt to remove > > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > > > > link in a specific bpf namespace, then these bpf objects will not be > > > > > visible to the users in a different bpf namespace. But these bpf > > > > > objects are visible to its parent bpf namespace, so the sys admin can > > > > > still iterate and inspect them. > > > > > > > > Does it essentially mean unpriv bpf? > > > > > Right. With CAP_BPF and some other CAPs enabled. > > > > > > Can I, as a non-root, create > > > > a new bpf namespace and start loading/attaching progs? > > > > > No, you can't create a new bpf namespace as a non-root, see also > > > copy_namespaces(). > > > In the container environment, new namespaces are always created by > > > containered, which is started by root. > > > > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part > > from copy_namespaces? Isn't it trivially bypassed with a new user > > namespace? > > > > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN > > in this particular user-ns. Then I can go on and create a new bpf > > namespace (with CAP_BPF) and go wild? I won't see anything from the > > other namespaces, but I'll be able to load/attach bpf programs? > > > > I don't think so. If you create a new userspace, and give the process > the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the > initial namespace, you can't do that. Because currently only CAP_BPF > or CAP_SYS_ADMIN in the init user namespace can load/attach bpf > programs. > > > > > Maybe add a paragraph about now vs whatever you're proposing. > > > > > What I'm proposing in this patchset is to put bpf objects (map, prog, > > > link, and btf) into the bpf namespace. Next step I will put bpffs into > > > the bpf namespace as well. > > > That said, I'm trying to put all the objects created in bpf into the > > > bpf namespace. Below is a simple paragraph to illustrate it. > > > > > Regarding the unpriv user with CAP_BPF enabled, > > > Now | Future > > > ------------------------------------------------------------------------ > > > Iterate his BPF IDs | N | Y | > > > Iterate others' BPF IDs | N | N | > > > Convert his BPF IDs to FDs | N | Y | > > > Convert others' BPF IDs to FDs | N | N | > > > Get others' object info from pinned file | Y(*) | N | > > > ------------------------------------------------------------------------ > > > > > (*) It can be improved by, > > > 1). Different containers has different bpffs > > > 2). Setting file permission > > > That's not perfect, for example, if one single user has two bpf > > > instances, but we don't want them to inspect each other. > > > > I think the question here is what happens to the existing > > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually? > > > > They won't become ns_capable(CAP_BPF). If it becomes > ns_capable(CAP_BPF), it will really go wild then. > > > And if not, I don't think it integrates well with the user namespaces? > > > > IIUC, it is the CAP_BPF which doesn't integrate with the user > namespaces, right? Yeah. And it's probably fine if we don't, I just wanted to see some explanation on the long-term plan. If the purpose is to have a bpf namespace and use it for pure isolation purposes, let's state it clearly in the cover letter. Otherwise it's not clear whether it's only about isolation or potentially allowing CAP_BPF in user namespaces. Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit error? (or maybe it already does, haven't looked at the patches) One other question I have is: should init bpf namespace see everything? Otherwise it might be hard to chase down the namespaces that loaded some BPF program... > -- > Regards > Yafang
On Thu, Mar 30, 2023 at 4:50 AM Stanislav Fomichev <sdf@google.com> wrote: > > On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On 03/28, Yafang Shao wrote: > > > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > > > On 03/26, Yafang Shao wrote: > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert > > > > IDs > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > > > prevent non-privilidged users from getting other users' bpf program, > > > > but > > > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > > > > In container environment, some users want to run bpf programs in their > > > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > > > some other specific CAPs, but they can't inspect their bpf programs > > > > in a > > > > > > generic way. For example, the bpftool can't be used as it requires > > > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf > > > > object > > > > > > which is not created by the process itself is with SCM_RIGHTS, that > > > > > > requires each processes which created bpf object has to implement a > > > > unix > > > > > > domain socket to share the fd of a bpf object between different > > > > > > processes, that is really trivial and troublesome. > > > > > > > > > > > Hence we need a better mechanism to get bpf object info without > > > > > > CAP_SYS_ADMIN. > > > > > > > > > > [..] > > > > > > > > > > > BPF namespace is introduced in this patchset with an attempt to remove > > > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > > > > > link in a specific bpf namespace, then these bpf objects will not be > > > > > > visible to the users in a different bpf namespace. But these bpf > > > > > > objects are visible to its parent bpf namespace, so the sys admin can > > > > > > still iterate and inspect them. > > > > > > > > > > Does it essentially mean unpriv bpf? > > > > > > > Right. With CAP_BPF and some other CAPs enabled. > > > > > > > > Can I, as a non-root, create > > > > > a new bpf namespace and start loading/attaching progs? > > > > > > > No, you can't create a new bpf namespace as a non-root, see also > > > > copy_namespaces(). > > > > In the container environment, new namespaces are always created by > > > > containered, which is started by root. > > > > > > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part > > > from copy_namespaces? Isn't it trivially bypassed with a new user > > > namespace? > > > > > > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN > > > in this particular user-ns. Then I can go on and create a new bpf > > > namespace (with CAP_BPF) and go wild? I won't see anything from the > > > other namespaces, but I'll be able to load/attach bpf programs? > > > > > > > I don't think so. If you create a new userspace, and give the process > > the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the > > initial namespace, you can't do that. Because currently only CAP_BPF > > or CAP_SYS_ADMIN in the init user namespace can load/attach bpf > > programs. > > > > > > > Maybe add a paragraph about now vs whatever you're proposing. > > > > > > > What I'm proposing in this patchset is to put bpf objects (map, prog, > > > > link, and btf) into the bpf namespace. Next step I will put bpffs into > > > > the bpf namespace as well. > > > > That said, I'm trying to put all the objects created in bpf into the > > > > bpf namespace. Below is a simple paragraph to illustrate it. > > > > > > > Regarding the unpriv user with CAP_BPF enabled, > > > > Now | Future > > > > ------------------------------------------------------------------------ > > > > Iterate his BPF IDs | N | Y | > > > > Iterate others' BPF IDs | N | N | > > > > Convert his BPF IDs to FDs | N | Y | > > > > Convert others' BPF IDs to FDs | N | N | > > > > Get others' object info from pinned file | Y(*) | N | > > > > ------------------------------------------------------------------------ > > > > > > > (*) It can be improved by, > > > > 1). Different containers has different bpffs > > > > 2). Setting file permission > > > > That's not perfect, for example, if one single user has two bpf > > > > instances, but we don't want them to inspect each other. > > > > > > I think the question here is what happens to the existing > > > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually? > > > > > > > They won't become ns_capable(CAP_BPF). If it becomes > > ns_capable(CAP_BPF), it will really go wild then. > > > > > And if not, I don't think it integrates well with the user namespaces? > > > > > > > IIUC, it is the CAP_BPF which doesn't integrate with the user > > namespaces, right? > > Yeah. And it's probably fine if we don't, I just wanted to see some > explanation on the long-term plan. > If the purpose is to have a bpf namespace and use it for pure > isolation purposes, let's state it clearly in the cover letter. Will add it. > Otherwise it's not clear whether it's only about isolation or > potentially allowing CAP_BPF in user namespaces. > Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit > error? (or maybe it already does, haven't looked at the patches) > Good suggestion. It should return an error. I will change it in the next version. > One other question I have is: should init bpf namespace see > everything? Otherwise it might be hard to chase down the namespaces > that loaded some BPF program... Right, the init bpf namespace will see everything.
On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > <...> > > BPF namespace is introduced in this patchset with an attempt to remove > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > link in a specific bpf namespace, then these bpf objects will not be > visible to the users in a different bpf namespace. But these bpf > objects are visible to its parent bpf namespace, so the sys admin can > still iterate and inspect them. > > BPF namespace is similar to PID namespace, and the bpf objects are > similar to tasks, so BPF namespace is very easy to understand. These > patchset only implements BPF namespace for bpf map, prog and link. In the > future we may extend it to other bpf objects like btf, bpffs and etc. > For example, we can allow some of the BTF objects to be used in > non-init bpf namespace, then the container user can only trace the > processes running in his container, but can't get the information of > tasks running in other containers. > Hi Yafang, Thanks for putting effort toward enabling BPF for container users! However, I think the cover letter can be improved. It's unclear to me what exactly is BPF namespace, what exactly it tries to achieve and what is its behavior. If you look at the manpage of pid namespace [1], cgroup namespace[2], and namespace[3], they all have a very precise definition, their goals and explain the intended behaviors well. I felt you intended the BPF namespace to provide isolation of object ids. That is, different views of the bpf object ids for different processes. This is like the PID namespace. But somehow, you also attach CAPs on top of that. That, I think, is not a namespace's job. Well, I could be wrong, but would appreciate you adding more details as follow-up. Hao [1] https://man7.org/linux/man-pages/man7/pid_namespaces.7.html [2] https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html [3] https://man7.org/linux/man-pages/man7/namespaces.7.html
On Fri, Mar 31, 2023 at 1:52 PM Hao Luo <haoluo@google.com> wrote: > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > <...> > > > > BPF namespace is introduced in this patchset with an attempt to remove > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and > > link in a specific bpf namespace, then these bpf objects will not be > > visible to the users in a different bpf namespace. But these bpf > > objects are visible to its parent bpf namespace, so the sys admin can > > still iterate and inspect them. > > > > BPF namespace is similar to PID namespace, and the bpf objects are > > similar to tasks, so BPF namespace is very easy to understand. These > > patchset only implements BPF namespace for bpf map, prog and link. In the > > future we may extend it to other bpf objects like btf, bpffs and etc. > > For example, we can allow some of the BTF objects to be used in > > non-init bpf namespace, then the container user can only trace the > > processes running in his container, but can't get the information of > > tasks running in other containers. > > > > Hi Yafang, > > Thanks for putting effort toward enabling BPF for container users! > > However, I think the cover letter can be improved. It's unclear to me > what exactly is BPF namespace, what exactly it tries to achieve and > what is its behavior. If you look at the manpage of pid namespace [1], > cgroup namespace[2], and namespace[3], they all have a very precise > definition, their goals and explain the intended behaviors well. > Thanks for your suggestion. The covetter should be improved. I will read the man pages of these namespaces and improve it as you suggested. > I felt you intended the BPF namespace to provide isolation of object > ids. That is, different views of the bpf object ids for different > processes. This is like the PID namespace. But somehow, you also > attach CAPs on top of that. That, I think, is not a namespace's job. > Agree with you that it should be independent of CAPs. After the bpf namespace is introduced, actually we don't need the CAPs when the user iterates IDs or converts IDs to FDs in his bpf namespace (except in the init bpf namespace), because these are all readonly operations and the user can only read the bpf objects created by himself. While the CAPs should be required when the user wants to write something, e.g. creating a map, loading a prog. They are really different things. I will change it in the next version. > Well, I could be wrong, but would appreciate you adding more details > as follow-up. > > Hao > > [1] https://man7.org/linux/man-pages/man7/pid_namespaces.7.html > [2] https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html > [3] https://man7.org/linux/man-pages/man7/namespaces.7.html
On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote: > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > prevent non-privilidged users from getting other users' bpf program, but > > > also it prevents the user from iterating his own bpf objects. > > > > > > In container environment, some users want to run bpf programs in their > > > containers. These users can run their bpf programs under CAP_BPF and > > > some other specific CAPs, but they can't inspect their bpf programs in a > > > generic way. For example, the bpftool can't be used as it requires > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > Agreed that it is important to enable tools like bpftool without > > CAP_SYS_ADMIN. However, I am not sure whether we need a new > > namespace for this. Can we reuse some existing namespace for this? > > > > It seems we can't. Yafang, It's a Nack. The only thing you've been trying to "solve" with bpf namespace is to allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN. The concept of bpf namespace is not even close to be thought through. Others pointed out the gaps in the design. Like bpffs. There are plenty. Please do not send patches like this in the future. You need to start with describing the problem you want to solve, then propose _several_ solutions, describe their pros and cons, solicit feedback, present at the conferences (like LSFMMBPF or LPC), and when the community agrees that 1. problem is worth solving, 2. the solution makes sense, only then work on patches. "In container environment, some users want to run bpf programs in their containers." is something Song brought up at LSFMMBPF a year ago. At that meeting most of the folks agreed that there is a need to run bpf in containers and make sure that the effect of bpf prog is limited to a container. This new namespace that creates virtual IDs for progs and maps doesn't come close in solving this task.
On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote: > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > prevent non-privilidged users from getting other users' bpf program, but > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > In container environment, some users want to run bpf programs in their > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > some other specific CAPs, but they can't inspect their bpf programs in a > > > > generic way. For example, the bpftool can't be used as it requires > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > Agreed that it is important to enable tools like bpftool without > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new > > > namespace for this. Can we reuse some existing namespace for this? > > > > > > > It seems we can't. > > Yafang, > > It's a Nack. > > The only thing you've been trying to "solve" with bpf namespace is to > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN. > The concept of bpf namespace is not even close to be thought through. Right, it is more likely a PoC in its current state. > Others pointed out the gaps in the design. Like bpffs. There are plenty. > Please do not send patches like this in the future. The reason I sent it with an early state is that I want to get some early feedback from the community ahead of the LSF/MM/BPF workshop, then I can improve it based on these feedbacks and present it more specifically at the workshop. Then the discussion will be more effective. > You need to start with describing the problem you want to solve, > then propose _several_ solutions, describe their pros and cons, > solicit feedback, present at the conferences (like LSFMMBPF or LPC), > and when the community agrees that 1. problem is worth solving, > 2. the solution makes sense, only then work on patches. > I would like to give a short discussion on the BPF namespace if everything goes fine. > "In container environment, some users want to run bpf programs in their containers." > is something Song brought up at LSFMMBPF a year ago. > At that meeting most of the folks agreed that there is a need to run bpf > in containers and make sure that the effect of bpf prog is limited to a container. > This new namespace that creates virtual IDs for progs and maps doesn't come > close in solving this task. Currently in our production environment, all the containers running bpf programs are privileged, that is risky. So actually the goal of the BPF namespace is to make them (or part of them) non-privileged. But some of the abilities of these bpf programs will be lost in this procedure, like the debug-bility with bpftool, so we need to fix it. Agree with you that this goal is far from making bpf programs safely running in a container environment.
On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote: > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote: > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > > prevent non-privilidged users from getting other users' bpf program, but > > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > > > In container environment, some users want to run bpf programs in their > > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > > some other specific CAPs, but they can't inspect their bpf programs in a > > > > > generic way. For example, the bpftool can't be used as it requires > > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > > Agreed that it is important to enable tools like bpftool without > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new > > > > namespace for this. Can we reuse some existing namespace for this? > > > > > > > > > > It seems we can't. > > > > Yafang, > > > > It's a Nack. > > > > The only thing you've been trying to "solve" with bpf namespace is to > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN. > > The concept of bpf namespace is not even close to be thought through. > > Right, it is more likely a PoC in its current state. > > > Others pointed out the gaps in the design. Like bpffs. There are plenty. > > Please do not send patches like this in the future. > > The reason I sent it with an early state is that I want to get some > early feedback from the community ahead of the LSF/MM/BPF workshop, > then I can improve it based on these feedbacks and present it more > specifically at the workshop. Then the discussion will be more > effective. > > > You need to start with describing the problem you want to solve, > > then propose _several_ solutions, describe their pros and cons, > > solicit feedback, present at the conferences (like LSFMMBPF or LPC), > > and when the community agrees that 1. problem is worth solving, > > 2. the solution makes sense, only then work on patches. > > > > I would like to give a short discussion on the BPF namespace if > everything goes fine. Not in this shape of BPF namespace as done in this patch set. We've talked about BPF namespace in the past. This is not it. > > "In container environment, some users want to run bpf programs in their containers." > > is something Song brought up at LSFMMBPF a year ago. > > At that meeting most of the folks agreed that there is a need to run bpf > > in containers and make sure that the effect of bpf prog is limited to a container. > > This new namespace that creates virtual IDs for progs and maps doesn't come > > close in solving this task. > > Currently in our production environment, all the containers running > bpf programs are privileged, that is risky. So actually the goal of > the BPF namespace is to make them (or part of them) non-privileged. > But some of the abilities of these bpf programs will be lost in this > procedure, like the debug-bility with bpftool, so we need to fix it. > Agree with you that this goal is far from making bpf programs safely > running in a container environment. I disagree that allowing admin to run bpftool without sudo is a task worth solving. The visibility of bpf progs in a container is a different task. Without doing any kernel changes we can add a flag to bpftool to let 'bpftool prog show' list progs that were loaded by processes in the same cgroup. bpftool already does prog->pid mapping with bpf iterators. It can filter by cgroup just as well.
On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote: > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote: > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > > > > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > > > prevent non-privilidged users from getting other users' bpf program, but > > > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > > > > > In container environment, some users want to run bpf programs in their > > > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a > > > > > > generic way. For example, the bpftool can't be used as it requires > > > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > > > > Agreed that it is important to enable tools like bpftool without > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new > > > > > namespace for this. Can we reuse some existing namespace for this? > > > > > > > > > > > > > It seems we can't. > > > > > > Yafang, > > > > > > It's a Nack. > > > > > > The only thing you've been trying to "solve" with bpf namespace is to > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN. > > > The concept of bpf namespace is not even close to be thought through. > > > > Right, it is more likely a PoC in its current state. > > > > > Others pointed out the gaps in the design. Like bpffs. There are plenty. > > > Please do not send patches like this in the future. > > > > The reason I sent it with an early state is that I want to get some > > early feedback from the community ahead of the LSF/MM/BPF workshop, > > then I can improve it based on these feedbacks and present it more > > specifically at the workshop. Then the discussion will be more > > effective. > > > > > You need to start with describing the problem you want to solve, > > > then propose _several_ solutions, describe their pros and cons, > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC), > > > and when the community agrees that 1. problem is worth solving, > > > 2. the solution makes sense, only then work on patches. > > > > > > > I would like to give a short discussion on the BPF namespace if > > everything goes fine. > > Not in this shape of BPF namespace as done in this patch set. > We've talked about BPF namespace in the past. This is not it. > > > > "In container environment, some users want to run bpf programs in their containers." > > > is something Song brought up at LSFMMBPF a year ago. > > > At that meeting most of the folks agreed that there is a need to run bpf > > > in containers and make sure that the effect of bpf prog is limited to a container. > > > This new namespace that creates virtual IDs for progs and maps doesn't come > > > close in solving this task. > > > > Currently in our production environment, all the containers running > > bpf programs are privileged, that is risky. So actually the goal of > > the BPF namespace is to make them (or part of them) non-privileged. > > But some of the abilities of these bpf programs will be lost in this > > procedure, like the debug-bility with bpftool, so we need to fix it. > > Agree with you that this goal is far from making bpf programs safely > > running in a container environment. > > I disagree that allowing admin to run bpftool without sudo is a task > worth solving. The visibility of bpf progs in a container is a different task. > Without doing any kernel changes we can add a flag to bpftool to let > 'bpftool prog show' list progs that were loaded by processes in the same cgroup. > bpftool already does prog->pid mapping with bpf iterators. > It can filter by cgroup just as well. IIUC, at least we need bellow change in the kernel, --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr, if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX) return -EINVAL; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - next_id++; spin_lock_bh(lock); if (!idr_get_next(idr, &next_id)) Because the container doesn't have CAP_SYS_ADMIN enabled, while they only have CAP_BPF and other required CAPs. Another possible solution is that we run an agent in the host, and the user in the container who wants to get the bpf objects info in his container should send a request to this agent via unix domain socket. That is what we are doing now in our production environment. That said, each container has to run a client to get the bpf object fd. There are some downsides, - It can't handle pinned bpf programs For pinned programs, the user can get them from the pinned files directly, so he can use bpftool in his case, only with some complaints. - If the user attached the bpf prog, and then removed the pinned file, but didn't detach it. That happened. But this error case can't be handled. - There may be other corner cases that it can't fit. There's a solution to improve it, but we also need to change the kernel. That is, we can use the wasted space btf->name. diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index b7e5a55..59d73a3 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data, u32 btf_data_size, err = -ENOMEM; goto errout; } + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm, + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id))); env->btf = btf; data = kvmalloc(btf_data_size, GFP_KERNEL | __GFP_NOWARN);
On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote: > On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote: > > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote: > > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > > > > > > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > > > > prevent non-privilidged users from getting other users' bpf program, but > > > > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > > > > > > > In container environment, some users want to run bpf programs in their > > > > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a > > > > > > > generic way. For example, the bpftool can't be used as it requires > > > > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > > > > > > Agreed that it is important to enable tools like bpftool without > > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new > > > > > > namespace for this. Can we reuse some existing namespace for this? > > > > > > > > > > > > > > > > It seems we can't. > > > > > > > > Yafang, > > > > > > > > It's a Nack. > > > > > > > > The only thing you've been trying to "solve" with bpf namespace is to > > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN. > > > > The concept of bpf namespace is not even close to be thought through. > > > > > > Right, it is more likely a PoC in its current state. > > > > > > > Others pointed out the gaps in the design. Like bpffs. There are plenty. > > > > Please do not send patches like this in the future. > > > > > > The reason I sent it with an early state is that I want to get some > > > early feedback from the community ahead of the LSF/MM/BPF workshop, > > > then I can improve it based on these feedbacks and present it more > > > specifically at the workshop. Then the discussion will be more > > > effective. > > > > > > > You need to start with describing the problem you want to solve, > > > > then propose _several_ solutions, describe their pros and cons, > > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC), > > > > and when the community agrees that 1. problem is worth solving, > > > > 2. the solution makes sense, only then work on patches. > > > > > > > > > > I would like to give a short discussion on the BPF namespace if > > > everything goes fine. > > > > Not in this shape of BPF namespace as done in this patch set. > > We've talked about BPF namespace in the past. This is not it. > > > > > > "In container environment, some users want to run bpf programs in their containers." > > > > is something Song brought up at LSFMMBPF a year ago. > > > > At that meeting most of the folks agreed that there is a need to run bpf > > > > in containers and make sure that the effect of bpf prog is limited to a container. > > > > This new namespace that creates virtual IDs for progs and maps doesn't come > > > > close in solving this task. > > > > > > Currently in our production environment, all the containers running > > > bpf programs are privileged, that is risky. So actually the goal of > > > the BPF namespace is to make them (or part of them) non-privileged. > > > But some of the abilities of these bpf programs will be lost in this > > > procedure, like the debug-bility with bpftool, so we need to fix it. > > > Agree with you that this goal is far from making bpf programs safely > > > running in a container environment. > > > > I disagree that allowing admin to run bpftool without sudo is a task > > worth solving. The visibility of bpf progs in a container is a different task. > > Without doing any kernel changes we can add a flag to bpftool to let > > 'bpftool prog show' list progs that were loaded by processes in the same cgroup. > > bpftool already does prog->pid mapping with bpf iterators. > > It can filter by cgroup just as well. > > IIUC, at least we need bellow change in the kernel, No. The user should just 'sudo bpftool ...' instead. > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr, > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX) > return -EINVAL; > > - if (!capable(CAP_SYS_ADMIN)) > - return -EPERM; > - > next_id++; > spin_lock_bh(lock); > if (!idr_get_next(idr, &next_id)) > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they > only have CAP_BPF and other required CAPs. > > Another possible solution is that we run an agent in the host, and the > user in the container who wants to get the bpf objects info in his > container should send a request to this agent via unix domain socket. > That is what we are doing now in our production environment. That > said, each container has to run a client to get the bpf object fd. None of such hacks are necessary. People that debug bpf setups with bpftool can always sudo. > There are some downsides, > - It can't handle pinned bpf programs > For pinned programs, the user can get them from the pinned files > directly, so he can use bpftool in his case, only with some > complaints. > - If the user attached the bpf prog, and then removed the pinned > file, but didn't detach it. > That happened. But this error case can't be handled. > - There may be other corner cases that it can't fit. > > There's a solution to improve it, but we also need to change the > kernel. That is, we can use the wasted space btf->name. > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > index b7e5a55..59d73a3 100644 > --- a/kernel/bpf/btf.c > +++ b/kernel/bpf/btf.c > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data, > u32 btf_data_size, > err = -ENOMEM; > goto errout; > } > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm, > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id))); Unnecessary. comm, pid, cgroup can be printed by bpftool without changing the kernel.
On Thu, Apr 6, 2023 at 10:07 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote: > > On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote: > > > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote: > > > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote: > > > > > > > > > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs > > > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it > > > > > > > > prevent non-privilidged users from getting other users' bpf program, but > > > > > > > > also it prevents the user from iterating his own bpf objects. > > > > > > > > > > > > > > > > In container environment, some users want to run bpf programs in their > > > > > > > > containers. These users can run their bpf programs under CAP_BPF and > > > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a > > > > > > > > generic way. For example, the bpftool can't be used as it requires > > > > > > > > CAP_SYS_ADMIN. That is very inconvenient. > > > > > > > > > > > > > > Agreed that it is important to enable tools like bpftool without > > > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new > > > > > > > namespace for this. Can we reuse some existing namespace for this? > > > > > > > > > > > > > > > > > > > It seems we can't. > > > > > > > > > > Yafang, > > > > > > > > > > It's a Nack. > > > > > > > > > > The only thing you've been trying to "solve" with bpf namespace is to > > > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN. > > > > > The concept of bpf namespace is not even close to be thought through. > > > > > > > > Right, it is more likely a PoC in its current state. > > > > > > > > > Others pointed out the gaps in the design. Like bpffs. There are plenty. > > > > > Please do not send patches like this in the future. > > > > > > > > The reason I sent it with an early state is that I want to get some > > > > early feedback from the community ahead of the LSF/MM/BPF workshop, > > > > then I can improve it based on these feedbacks and present it more > > > > specifically at the workshop. Then the discussion will be more > > > > effective. > > > > > > > > > You need to start with describing the problem you want to solve, > > > > > then propose _several_ solutions, describe their pros and cons, > > > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC), > > > > > and when the community agrees that 1. problem is worth solving, > > > > > 2. the solution makes sense, only then work on patches. > > > > > > > > > > > > > I would like to give a short discussion on the BPF namespace if > > > > everything goes fine. > > > > > > Not in this shape of BPF namespace as done in this patch set. > > > We've talked about BPF namespace in the past. This is not it. > > > > > > > > "In container environment, some users want to run bpf programs in their containers." > > > > > is something Song brought up at LSFMMBPF a year ago. > > > > > At that meeting most of the folks agreed that there is a need to run bpf > > > > > in containers and make sure that the effect of bpf prog is limited to a container. > > > > > This new namespace that creates virtual IDs for progs and maps doesn't come > > > > > close in solving this task. > > > > > > > > Currently in our production environment, all the containers running > > > > bpf programs are privileged, that is risky. So actually the goal of > > > > the BPF namespace is to make them (or part of them) non-privileged. > > > > But some of the abilities of these bpf programs will be lost in this > > > > procedure, like the debug-bility with bpftool, so we need to fix it. > > > > Agree with you that this goal is far from making bpf programs safely > > > > running in a container environment. > > > > > > I disagree that allowing admin to run bpftool without sudo is a task > > > worth solving. The visibility of bpf progs in a container is a different task. > > > Without doing any kernel changes we can add a flag to bpftool to let > > > 'bpftool prog show' list progs that were loaded by processes in the same cgroup. > > > bpftool already does prog->pid mapping with bpf iterators. > > > It can filter by cgroup just as well. > > > > IIUC, at least we need bellow change in the kernel, > > No. The user should just 'sudo bpftool ...' instead. > It seems that I didn't describe the issue clearly. The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is required to run bpftool, so the bpftool running in the container can't get the ID of bpf objects or convert IDs to FDs. Is there something that I missed ? > > --- a/kernel/bpf/syscall.c > > +++ b/kernel/bpf/syscall.c > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr, > > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX) > > return -EINVAL; > > > > - if (!capable(CAP_SYS_ADMIN)) > > - return -EPERM; > > - > > next_id++; > > spin_lock_bh(lock); > > if (!idr_get_next(idr, &next_id)) > > > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they > > only have CAP_BPF and other required CAPs. > > > > Another possible solution is that we run an agent in the host, and the > > user in the container who wants to get the bpf objects info in his > > container should send a request to this agent via unix domain socket. > > That is what we are doing now in our production environment. That > > said, each container has to run a client to get the bpf object fd. > > None of such hacks are necessary. People that debug bpf setups with bpftool > can always sudo. > > > There are some downsides, > > - It can't handle pinned bpf programs > > For pinned programs, the user can get them from the pinned files > > directly, so he can use bpftool in his case, only with some > > complaints. > > - If the user attached the bpf prog, and then removed the pinned > > file, but didn't detach it. > > That happened. But this error case can't be handled. > > - There may be other corner cases that it can't fit. > > > > There's a solution to improve it, but we also need to change the > > kernel. That is, we can use the wasted space btf->name. > > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > > index b7e5a55..59d73a3 100644 > > --- a/kernel/bpf/btf.c > > +++ b/kernel/bpf/btf.c > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data, > > u32 btf_data_size, > > err = -ENOMEM; > > goto errout; > > } > > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm, > > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id))); > > Unnecessary. > comm, pid, cgroup can be printed by bpftool without changing the kernel. Some questions, - What if the process exits after attaching the bpf prog and the prog is not auto-detachable? For example, the reuserport bpf prog is not auto-detachable. After pins the reuserport bpf prog, a task can attach it through the pinned bpf file, but if the task forgets to detach it and the pinned file is removed, then it seems there's no way to figure out which task or cgroup this prog belongs to... - Could you pls. explain in detail how to get comm, pid, or cgroup from a pinned bpffs file?
On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > It seems that I didn't describe the issue clearly. > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > required to run bpftool, so the bpftool running in the container > can't get the ID of bpf objects or convert IDs to FDs. > Is there something that I missed ? Nothing. This is by design. bpftool needs sudo. That's all. > > > > --- a/kernel/bpf/syscall.c > > > +++ b/kernel/bpf/syscall.c > > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr, > > > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX) > > > return -EINVAL; > > > > > > - if (!capable(CAP_SYS_ADMIN)) > > > - return -EPERM; > > > - > > > next_id++; > > > spin_lock_bh(lock); > > > if (!idr_get_next(idr, &next_id)) > > > > > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they > > > only have CAP_BPF and other required CAPs. > > > > > > Another possible solution is that we run an agent in the host, and the > > > user in the container who wants to get the bpf objects info in his > > > container should send a request to this agent via unix domain socket. > > > That is what we are doing now in our production environment. That > > > said, each container has to run a client to get the bpf object fd. > > > > None of such hacks are necessary. People that debug bpf setups with bpftool > > can always sudo. > > > > > There are some downsides, > > > - It can't handle pinned bpf programs > > > For pinned programs, the user can get them from the pinned files > > > directly, so he can use bpftool in his case, only with some > > > complaints. > > > - If the user attached the bpf prog, and then removed the pinned > > > file, but didn't detach it. > > > That happened. But this error case can't be handled. > > > - There may be other corner cases that it can't fit. > > > > > > There's a solution to improve it, but we also need to change the > > > kernel. That is, we can use the wasted space btf->name. > > > > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > > > index b7e5a55..59d73a3 100644 > > > --- a/kernel/bpf/btf.c > > > +++ b/kernel/bpf/btf.c > > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data, > > > u32 btf_data_size, > > > err = -ENOMEM; > > > goto errout; > > > } > > > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm, > > > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id))); > > > > Unnecessary. > > comm, pid, cgroup can be printed by bpftool without changing the kernel. > > Some questions, > - What if the process exits after attaching the bpf prog and the prog > is not auto-detachable? > For example, the reuserport bpf prog is not auto-detachable. After > pins the reuserport bpf prog, a task can attach it through the pinned > bpf file, but if the task forgets to detach it and the pinned file is > removed, then it seems there's no way to figure out which task or > cgroup this prog belongs to... you're saying that there is a bpf prog in the kernel without corresponding user space ? Meaning no user space process has an FD that points to this prog or FD to a map that this prog is using? In such a case this is truly kernel bpf prog. It doesn't belong to cgroup. > - Could you pls. explain in detail how to get comm, pid, or cgroup > from a pinned bpffs file? pinned bpf prog and no user space holds FD to it? It's not part of any cgroup. Nothing to print.
On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > It seems that I didn't describe the issue clearly. > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > required to run bpftool, so the bpftool running in the container > > can't get the ID of bpf objects or convert IDs to FDs. > > Is there something that I missed ? > > Nothing. This is by design. bpftool needs sudo. That's all. > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > --- a/kernel/bpf/syscall.c > > > > +++ b/kernel/bpf/syscall.c > > > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr, > > > > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX) > > > > return -EINVAL; > > > > > > > > - if (!capable(CAP_SYS_ADMIN)) > > > > - return -EPERM; > > > > - > > > > next_id++; > > > > spin_lock_bh(lock); > > > > if (!idr_get_next(idr, &next_id)) > > > > > > > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they > > > > only have CAP_BPF and other required CAPs. > > > > > > > > Another possible solution is that we run an agent in the host, and the > > > > user in the container who wants to get the bpf objects info in his > > > > container should send a request to this agent via unix domain socket. > > > > That is what we are doing now in our production environment. That > > > > said, each container has to run a client to get the bpf object fd. > > > > > > None of such hacks are necessary. People that debug bpf setups with bpftool > > > can always sudo. > > > > > > > There are some downsides, > > > > - It can't handle pinned bpf programs > > > > For pinned programs, the user can get them from the pinned files > > > > directly, so he can use bpftool in his case, only with some > > > > complaints. > > > > - If the user attached the bpf prog, and then removed the pinned > > > > file, but didn't detach it. > > > > That happened. But this error case can't be handled. > > > > - There may be other corner cases that it can't fit. > > > > > > > > There's a solution to improve it, but we also need to change the > > > > kernel. That is, we can use the wasted space btf->name. > > > > > > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > > > > index b7e5a55..59d73a3 100644 > > > > --- a/kernel/bpf/btf.c > > > > +++ b/kernel/bpf/btf.c > > > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data, > > > > u32 btf_data_size, > > > > err = -ENOMEM; > > > > goto errout; > > > > } > > > > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm, > > > > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id))); > > > > > > Unnecessary. > > > comm, pid, cgroup can be printed by bpftool without changing the kernel. > > > > Some questions, > > - What if the process exits after attaching the bpf prog and the prog > > is not auto-detachable? > > For example, the reuserport bpf prog is not auto-detachable. After > > pins the reuserport bpf prog, a task can attach it through the pinned > > bpf file, but if the task forgets to detach it and the pinned file is > > removed, then it seems there's no way to figure out which task or > > cgroup this prog belongs to... > > you're saying that there is a bpf prog in the kernel without > corresponding user space ? No, it is corresponding to user space. For example, it may be corresponding to a socket fd, or a cgroup fd. > Meaning no user space process has an FD > that points to this prog or FD to a map that this prog is using? > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup. > Even if it is kernel bpf prog, it is created by a process. The user needs to know which one created it. > > - Could you pls. explain in detail how to get comm, pid, or cgroup > > from a pinned bpffs file? > > pinned bpf prog and no user space holds FD to it? > It's not part of any cgroup. Nothing to print. As I explained above, even if it holds nothing, the user needs to know the information from it. For example, if it is expected, which one created it?
On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > It seems that I didn't describe the issue clearly. > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > required to run bpftool, so the bpftool running in the container > > > can't get the ID of bpf objects or convert IDs to FDs. > > > Is there something that I missed ? > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > Hmm, what I'm trying to do is make bpftool run without sudo. This is not a task that is worth solving. > > > Some questions, > > > - What if the process exits after attaching the bpf prog and the prog > > > is not auto-detachable? > > > For example, the reuserport bpf prog is not auto-detachable. After > > > pins the reuserport bpf prog, a task can attach it through the pinned > > > bpf file, but if the task forgets to detach it and the pinned file is > > > removed, then it seems there's no way to figure out which task or > > > cgroup this prog belongs to... > > > > you're saying that there is a bpf prog in the kernel without > > corresponding user space ? > > No, it is corresponding to user space. For example, it may be > corresponding to a socket fd, or a cgroup fd. > > > Meaning no user space process has an FD > > that points to this prog or FD to a map that this prog is using? > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup. > > > > Even if it is kernel bpf prog, it is created by a process. The user > needs to know which one created it. In some situations it's certainly interesting to know which process loaded a particular program. In many other situations it's irrelevant. For example, the process that loaded a prog could have been moved to a different cgroup. If you want to track the loading you need to install bpf_lsm that monitors prog_load hook and collect that info. It's not the job of the kernel to do it. > > > - Could you pls. explain in detail how to get comm, pid, or cgroup > > > from a pinned bpffs file? > > > > pinned bpf prog and no user space holds FD to it? > > It's not part of any cgroup. Nothing to print. > > As I explained above, even if it holds nothing, the user needs to know > the information from it. For example, if it is expected, which one > created it? See the answer above. The kernel has enough hooks already to provide this information to user space. No kernel changes necessary.
On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > It seems that I didn't describe the issue clearly. > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > required to run bpftool, so the bpftool running in the container > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > Is there something that I missed ? > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > This is not a task that is worth solving. > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > Some questions, > > > > - What if the process exits after attaching the bpf prog and the prog > > > > is not auto-detachable? > > > > For example, the reuserport bpf prog is not auto-detachable. After > > > > pins the reuserport bpf prog, a task can attach it through the pinned > > > > bpf file, but if the task forgets to detach it and the pinned file is > > > > removed, then it seems there's no way to figure out which task or > > > > cgroup this prog belongs to... > > > > > > you're saying that there is a bpf prog in the kernel without > > > corresponding user space ? > > > > No, it is corresponding to user space. For example, it may be > > corresponding to a socket fd, or a cgroup fd. > > > > > Meaning no user space process has an FD > > > that points to this prog or FD to a map that this prog is using? > > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup. > > > > > > > Even if it is kernel bpf prog, it is created by a process. The user > > needs to know which one created it. > > In some situations it's certainly interesting to know which process > loaded a particular program. > In many other situations it's irrelevant. > For example, the process that loaded a prog could have been moved to a > different cgroup. > If you want to track the loading you need to install bpf_lsm > that monitors prog_load hook and collect that info. > It's not the job of the kernel to do it. > Agreed with you that we can add lots of hooks to track every detail of the operations. But it is not free. More hooks, more overhead. If we can change the kernel to make it lightweight, why not... > > > > - Could you pls. explain in detail how to get comm, pid, or cgroup > > > > from a pinned bpffs file? > > > > > > pinned bpf prog and no user space holds FD to it? > > > It's not part of any cgroup. Nothing to print. > > > > As I explained above, even if it holds nothing, the user needs to know > > the information from it. For example, if it is expected, which one > > created it? > > See the answer above. The kernel has enough hooks already to provide > this information to user space. No kernel changes necessary.
On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > required to run bpftool, so the bpftool running in the container > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > Is there something that I missed ? > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > This is not a task that is worth solving. > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... I'll leave the BPF namespace discussion aside (I agree that it needs way more thought). I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow you to take over someone else's link and stuff like this. But just iterating IDs seems like a pretty innocent functionality, so maybe we should remove CAP_SYS_ADMIN for GET_NEXT_ID? By itself GET_NEXT_ID is relatively useless without capabilities, but we've been floating the idea of providing GET_INFO_BY_ID (not by FD) for a while now, and that seems useful in itself, as it would indeed help tools like bpftool to get *some* information even without privileges. Whether those GET_INFO_BY_ID operations should return same full bpf_{prog,map,link,btf}_info or some trimmed down version of them would be up to discussion, but I think getting some info without creating an FD seems useful in itself. Would it be worth discussing and solving this separately from namespacing issues? > > > > > > Some questions, > > > > > - What if the process exits after attaching the bpf prog and the prog > > > > > is not auto-detachable? > > > > > For example, the reuserport bpf prog is not auto-detachable. After > > > > > pins the reuserport bpf prog, a task can attach it through the pinned > > > > > bpf file, but if the task forgets to detach it and the pinned file is > > > > > removed, then it seems there's no way to figure out which task or > > > > > cgroup this prog belongs to... > > > > > > > > you're saying that there is a bpf prog in the kernel without > > > > corresponding user space ? > > > > > > No, it is corresponding to user space. For example, it may be > > > corresponding to a socket fd, or a cgroup fd. > > > > > > > Meaning no user space process has an FD > > > > that points to this prog or FD to a map that this prog is using? > > > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup. > > > > > > > > > > Even if it is kernel bpf prog, it is created by a process. The user > > > needs to know which one created it. > > > > In some situations it's certainly interesting to know which process > > loaded a particular program. > > In many other situations it's irrelevant. > > For example, the process that loaded a prog could have been moved to a > > different cgroup. > > If you want to track the loading you need to install bpf_lsm > > that monitors prog_load hook and collect that info. > > It's not the job of the kernel to do it. > > > > Agreed with you that we can add lots of hooks to track every detail of > the operations. > But it is not free. More hooks, more overhead. > If we can change the kernel to make it lightweight, why not... > > > > > > - Could you pls. explain in detail how to get comm, pid, or cgroup > > > > > from a pinned bpffs file? > > > > > > > > pinned bpf prog and no user space holds FD to it? > > > > It's not part of any cgroup. Nothing to print. > > > > > > As I explained above, even if it holds nothing, the user needs to know > > > the information from it. For example, if it is expected, which one > > > created it? > > > > See the answer above. The kernel has enough hooks already to provide > > this information to user space. No kernel changes necessary. > > > > -- > Regards > Yafang
On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > Is there something that I missed ? > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > This is not a task that is worth solving. > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > I'll leave the BPF namespace discussion aside (I agree that it needs > way more thought). > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > you to take over someone else's link and stuff like this. But just > iterating IDs seems like a pretty innocent functionality, so maybe we > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > By itself GET_NEXT_ID is relatively useless without capabilities, but > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > for a while now, and that seems useful in itself, as it would indeed > help tools like bpftool to get *some* information even without > privileges. Whether those GET_INFO_BY_ID operations should return same > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > would be up to discussion, but I think getting some info without > creating an FD seems useful in itself. > > Would it be worth discussing and solving this separately from > namespacing issues? Iteration of IDs itself is fine. The set of IDs is not security sensitive, but GET_NEXT_BY_ID has to be carefully restricted. It returns xlated, jited, BTF, line info, etc and with all the restrictions it would need something like CAP_SYS_PTRACE and CAP_PERFMON to be useful. And with that we're not far from CAP_SYS_ADMIN. Why bother then?
On Fri, Apr 7, 2023 at 9:44 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > Is there something that I missed ? > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > way more thought). > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > you to take over someone else's link and stuff like this. But just > > iterating IDs seems like a pretty innocent functionality, so maybe we > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > for a while now, and that seems useful in itself, as it would indeed > > help tools like bpftool to get *some* information even without > > privileges. Whether those GET_INFO_BY_ID operations should return same > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > would be up to discussion, but I think getting some info without > > creating an FD seems useful in itself. > > > > Would it be worth discussing and solving this separately from > > namespacing issues? > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > but GET_NEXT_BY_ID has to be carefully restricted. > It returns xlated, jited, BTF, line info, etc > and with all the restrictions it would need something like > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > And with that we're not far from CAP_SYS_ADMIN. > Why bother then? Not sure if I get your point clearly. I think the reason we introduce CAP_BPF is that we don't want the user to enable CAP_SYS_ADMIN. Enabling specific CAPs instead of CAP_SYS_ADMIN should be our alignment goal, right? If so, then it is worth doing. As Andrii suggested, a trimmed down version is also helpful and should be acceptable. Agreed with you that we have lots of things to do, but if we don't try to move it forward, it will always be as-is.
On Thu, Apr 6, 2023 at 9:34 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Fri, Apr 7, 2023 at 9:44 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > > Is there something that I missed ? > > > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > > way more thought). > > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > > you to take over someone else's link and stuff like this. But just > > > iterating IDs seems like a pretty innocent functionality, so maybe we > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > > for a while now, and that seems useful in itself, as it would indeed > > > help tools like bpftool to get *some* information even without > > > privileges. Whether those GET_INFO_BY_ID operations should return same > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > > would be up to discussion, but I think getting some info without > > > creating an FD seems useful in itself. > > > > > > Would it be worth discussing and solving this separately from > > > namespacing issues? > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > > but GET_NEXT_BY_ID has to be carefully restricted. > > It returns xlated, jited, BTF, line info, etc > > and with all the restrictions it would need something like > > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > > And with that we're not far from CAP_SYS_ADMIN. > > Why bother then? > > Not sure if I get your point clearly. I think the reason we introduce > CAP_BPF is that we don't want the user to enable CAP_SYS_ADMIN. > Enabling specific CAPs instead of CAP_SYS_ADMIN should be our > alignment goal, right? We want users to switch to CAP_BPF (potentially with CAP_PERFMON) from full CAP_SYS_ADMIN to reduce attack surface of production workloads. bpftool is a tool for humans to do introspection and debugging. It will stay root only. > If so, then it is worth doing. As Andrii suggested, a trimmed down > version is also helpful and should be acceptable. It's not helpful. It's actively misleading.
On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > Is there something that I missed ? > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > way more thought). > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > you to take over someone else's link and stuff like this. But just > > iterating IDs seems like a pretty innocent functionality, so maybe we > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > for a while now, and that seems useful in itself, as it would indeed > > help tools like bpftool to get *some* information even without > > privileges. Whether those GET_INFO_BY_ID operations should return same > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > would be up to discussion, but I think getting some info without > > creating an FD seems useful in itself. > > > > Would it be worth discussing and solving this separately from > > namespacing issues? > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > but GET_NEXT_BY_ID has to be carefully restricted. > It returns xlated, jited, BTF, line info, etc > and with all the restrictions it would need something like > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > And with that we're not far from CAP_SYS_ADMIN. > Why bother then? You probably meant that GET_INFO_BY_ID should be carefully restricted? So yeah, that's what I said that this would have to be discussed further. I agree that returning func/line info, program dump, etc is probably a privileged part. But there is plenty of useful info besides that (e.g., prog name, insns cnt, run stats, etc) that would be useful for unpriv applications to monitor their own apps that they opened from BPF FS, or just some observability daemons. There is a lot of useful information in bpf_map_info and bpf_link_info that's way less privileged. I think bpf_link_info is good as is. Same for bpf_map_info. Either way, I'm not insisting, just something that seems pretty simple to add and useful in some scenarios. We can reuse existing code and types for GET_INFO_BY_FD and just zero-out (or prevent filling out) those privileged fields you mentioned. Anyway, something to put on the backburner, perhaps.
On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > > Is there something that I missed ? > > > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > > way more thought). > > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > > you to take over someone else's link and stuff like this. But just > > > iterating IDs seems like a pretty innocent functionality, so maybe we > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > > for a while now, and that seems useful in itself, as it would indeed > > > help tools like bpftool to get *some* information even without > > > privileges. Whether those GET_INFO_BY_ID operations should return same > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > > would be up to discussion, but I think getting some info without > > > creating an FD seems useful in itself. > > > > > > Would it be worth discussing and solving this separately from > > > namespacing issues? > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > > but GET_NEXT_BY_ID has to be carefully restricted. > > It returns xlated, jited, BTF, line info, etc > > and with all the restrictions it would need something like > > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > > And with that we're not far from CAP_SYS_ADMIN. > > Why bother then? > > You probably meant that GET_INFO_BY_ID should be carefully restricted? yes. > So yeah, that's what I said that this would have to be discussed > further. I agree that returning func/line info, program dump, etc is > probably a privileged part. But there is plenty of useful info besides > that (e.g., prog name, insns cnt, run stats, etc) that would be useful > for unpriv applications to monitor their own apps that they opened > from BPF FS, or just some observability daemons. > > There is a lot of useful information in bpf_map_info and bpf_link_info > that's way less privileged. I think bpf_link_info is good as is. Same > for bpf_map_info. > > Either way, I'm not insisting, just something that seems pretty simple > to add and useful in some scenarios. We can reuse existing code and > types for GET_INFO_BY_FD and just zero-out (or prevent filling out) > those privileged fields you mentioned. Anyway, something to put on the > backburner, perhaps. Sorry, but I only see negatives. It's an extra code in the kernel that has to be carefully reviewed when initially submitted and then every patch that touches get_info_by_id would have to go through a microscope every time to avoid introducing a security issue. And for what? So that CAP_BPF application can read prog name and run stats?
On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > > > Is there something that I missed ? > > > > > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > > > way more thought). > > > > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > > > you to take over someone else's link and stuff like this. But just > > > > iterating IDs seems like a pretty innocent functionality, so maybe we > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > > > for a while now, and that seems useful in itself, as it would indeed > > > > help tools like bpftool to get *some* information even without > > > > privileges. Whether those GET_INFO_BY_ID operations should return same > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > > > would be up to discussion, but I think getting some info without > > > > creating an FD seems useful in itself. > > > > > > > > Would it be worth discussing and solving this separately from > > > > namespacing issues? > > > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > > > but GET_NEXT_BY_ID has to be carefully restricted. > > > It returns xlated, jited, BTF, line info, etc > > > and with all the restrictions it would need something like > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > > > And with that we're not far from CAP_SYS_ADMIN. > > > Why bother then? > > > > You probably meant that GET_INFO_BY_ID should be carefully restricted? > > yes. > > > So yeah, that's what I said that this would have to be discussed > > further. I agree that returning func/line info, program dump, etc is > > probably a privileged part. But there is plenty of useful info besides > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful > > for unpriv applications to monitor their own apps that they opened > > from BPF FS, or just some observability daemons. > > > > There is a lot of useful information in bpf_map_info and bpf_link_info > > that's way less privileged. I think bpf_link_info is good as is. Same > > for bpf_map_info. > > > > Either way, I'm not insisting, just something that seems pretty simple > > to add and useful in some scenarios. We can reuse existing code and > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out) > > those privileged fields you mentioned. Anyway, something to put on the > > backburner, perhaps. > > Sorry, but I only see negatives. It's an extra code in the kernel > that has to be carefully reviewed when initially submitted and > then every patch that touches get_info_by_id would have to go > through a microscope every time to avoid introducing a security issue. > And for what? So that CAP_BPF application can read prog name and run stats? Per my experience, observability is a very important part for a project. If the user can't observe the object directly created by it, he will worry about or even mistrust it. However I don't insist on it either if you think we shouldn't do it.
On Fri, Apr 7, 2023 at 9:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > > > > Is there something that I missed ? > > > > > > > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > > > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > > > > way more thought). > > > > > > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > > > > you to take over someone else's link and stuff like this. But just > > > > > iterating IDs seems like a pretty innocent functionality, so maybe we > > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > > > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > > > > for a while now, and that seems useful in itself, as it would indeed > > > > > help tools like bpftool to get *some* information even without > > > > > privileges. Whether those GET_INFO_BY_ID operations should return same > > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > > > > would be up to discussion, but I think getting some info without > > > > > creating an FD seems useful in itself. > > > > > > > > > > Would it be worth discussing and solving this separately from > > > > > namespacing issues? > > > > > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > > > > but GET_NEXT_BY_ID has to be carefully restricted. > > > > It returns xlated, jited, BTF, line info, etc > > > > and with all the restrictions it would need something like > > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > > > > And with that we're not far from CAP_SYS_ADMIN. > > > > Why bother then? > > > > > > You probably meant that GET_INFO_BY_ID should be carefully restricted? > > > > yes. > > > > > So yeah, that's what I said that this would have to be discussed > > > further. I agree that returning func/line info, program dump, etc is > > > probably a privileged part. But there is plenty of useful info besides > > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful > > > for unpriv applications to monitor their own apps that they opened > > > from BPF FS, or just some observability daemons. > > > > > > There is a lot of useful information in bpf_map_info and bpf_link_info > > > that's way less privileged. I think bpf_link_info is good as is. Same > > > for bpf_map_info. > > > > > > Either way, I'm not insisting, just something that seems pretty simple > > > to add and useful in some scenarios. We can reuse existing code and > > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out) > > > those privileged fields you mentioned. Anyway, something to put on the > > > backburner, perhaps. > > > > Sorry, but I only see negatives. It's an extra code in the kernel > > that has to be carefully reviewed when initially submitted and > > then every patch that touches get_info_by_id would have to go > > through a microscope every time to avoid introducing a security issue. > > And for what? So that CAP_BPF application can read prog name and run stats? > > Per my experience, observability is a very important part for a > project. If the user can't observe the object directly created by it, > he will worry about or even mistrust it. The user can observe the objects just fine. That's what get_info_by_fd is for. But the kernel will not report JITed instructions to unpriv user who just loaded a prog and a sole owner of it. By your definition such a user should not trust the kernel. So be it.
On Sat, Apr 8, 2023 at 12:32 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Apr 7, 2023 at 9:22 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote: > > > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov > > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov > > > > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > It seems that I didn't describe the issue clearly. > > > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is > > > > > > > > > > > required to run bpftool, so the bpftool running in the container > > > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs. > > > > > > > > > > > Is there something that I missed ? > > > > > > > > > > > > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo. > > > > > > > > > > > > > > > > This is not a task that is worth solving. > > > > > > > > > > > > > > > > > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ... > > > > > > > > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs > > > > > > way more thought). > > > > > > > > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID > > > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow > > > > > > you to take over someone else's link and stuff like this. But just > > > > > > iterating IDs seems like a pretty innocent functionality, so maybe we > > > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID? > > > > > > > > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but > > > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD) > > > > > > for a while now, and that seems useful in itself, as it would indeed > > > > > > help tools like bpftool to get *some* information even without > > > > > > privileges. Whether those GET_INFO_BY_ID operations should return same > > > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them > > > > > > would be up to discussion, but I think getting some info without > > > > > > creating an FD seems useful in itself. > > > > > > > > > > > > Would it be worth discussing and solving this separately from > > > > > > namespacing issues? > > > > > > > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive, > > > > > but GET_NEXT_BY_ID has to be carefully restricted. > > > > > It returns xlated, jited, BTF, line info, etc > > > > > and with all the restrictions it would need something like > > > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful. > > > > > And with that we're not far from CAP_SYS_ADMIN. > > > > > Why bother then? > > > > > > > > You probably meant that GET_INFO_BY_ID should be carefully restricted? > > > > > > yes. > > > > > > > So yeah, that's what I said that this would have to be discussed > > > > further. I agree that returning func/line info, program dump, etc is > > > > probably a privileged part. But there is plenty of useful info besides > > > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful > > > > for unpriv applications to monitor their own apps that they opened > > > > from BPF FS, or just some observability daemons. > > > > > > > > There is a lot of useful information in bpf_map_info and bpf_link_info > > > > that's way less privileged. I think bpf_link_info is good as is. Same > > > > for bpf_map_info. > > > > > > > > Either way, I'm not insisting, just something that seems pretty simple > > > > to add and useful in some scenarios. We can reuse existing code and > > > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out) > > > > those privileged fields you mentioned. Anyway, something to put on the > > > > backburner, perhaps. > > > > > > Sorry, but I only see negatives. It's an extra code in the kernel > > > that has to be carefully reviewed when initially submitted and > > > then every patch that touches get_info_by_id would have to go > > > through a microscope every time to avoid introducing a security issue. > > > And for what? So that CAP_BPF application can read prog name and run stats? > > > > Per my experience, observability is a very important part for a > > project. If the user can't observe the object directly created by it, > > he will worry about or even mistrust it. > > The user can observe the objects just fine. That's what get_info_by_fd is for. > But the kernel will not report JITed instructions to unpriv user who > just loaded a prog and a sole owner of it. There's no UAPI to create the JITed instructions directly per my understanding. The JITed instructions are created by the kernel. While they're really UAPI to create a map, prog, and link. > By your definition such a user should not trust the kernel. So be it.