Message ID | 20231003083836.100706-1-hengqi.chen@gmail.com |
---|---|
Headers |
From: Hengqi Chen <hengqi.chen@gmail.com> To: linux-kernel@vger.kernel.org, bpf@vger.kernel.org Cc: keescook@chromium.org, luto@amacapital.net, wad@chromium.org, alexyonghe@tencent.com, hengqi.chen@gmail.com Subject: [RFC PATCH 0/2] seccomp: Split set filter into two steps Date: Tue, 3 Oct 2023 08:38:34 +0000 Message-Id: <20231003083836.100706-1-hengqi.chen@gmail.com> X-Mailer: git-send-email 2.34.1 |
Series |
seccomp: Split set filter into two steps
|
|
Message
Hengqi Chen
Oct. 3, 2023, 8:38 a.m. UTC
This patchset introduces two new operations which essentially
splits the SECCOMP_SET_MODE_FILTER process into two steps:
SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.

The SECCOMP_LOAD_FILTER loads the filter and returns a fd
which can be pinned to bpffs. This extends the lifetime of the
filter and thus can be reused by different processes.
With this new operation, we can eliminate a hot path of JITing
BPF program (the filter) where we apply the same seccomp filter
to thousands of micro VMs on a bare metal instance.

The SECCOMP_ATTACH_FILTER is used to attach a loaded filter.
The filter is represented by a fd which is either returned
from SECCOMP_LOAD_FILTER or obtained from bpffs using bpf syscall.

Hengqi Chen (2):
  seccomp: Introduce SECCOMP_LOAD_FILTER operation
  seccomp: Introduce SECCOMP_ATTACH_FILTER operation

 include/uapi/linux/seccomp.h |   2 +
 kernel/seccomp.c             | 138 ++++++++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 4 deletions(-)
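To make the proposed flow concrete, here is a minimal userspace sketch of "load once, pin, attach many times". It is only an illustration under stated assumptions: the operation numbers `4` and `5`, the argument layout of both new operations (a `struct sock_fprog` pointer for load, a pointer to the fd for attach), and every helper name are placeholders, not taken from the patches. The `BPF_OBJ_PIN` and `BPF_OBJ_GET` commands are the existing bpf(2) commands; whether they accept a seccomp filter fd depends on the series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* ASSUMPTION: placeholder numbers for the two proposed operations.
 * The real values are whatever the series adds to
 * include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_LOAD_FILTER
#define SECCOMP_LOAD_FILTER 4
#endif
#ifndef SECCOMP_ATTACH_FILTER
#define SECCOMP_ATTACH_FILTER 5
#endif

/* A one-instruction classic BPF filter that allows every syscall. */
static struct sock_filter allow_all_insns[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Fill in a sock_fprog that describes the allow-all filter above. */
void build_allow_all(struct sock_fprog *prog)
{
	assert(prog != NULL);
	prog->len = sizeof(allow_all_insns) / sizeof(allow_all_insns[0]);
	prog->filter = allow_all_insns;
}

/* Step 1 (hypothetical): load the filter once and get an fd back.
 * Returns -1 with errno set on failure; an unpatched kernel is
 * expected to reject the unknown operation. */
int seccomp_load_filter(const struct sock_fprog *prog)
{
	return (int)syscall(__NR_seccomp, SECCOMP_LOAD_FILTER, 0, prog);
}

/* Step 2 (hypothetical): attach an already-loaded filter to the caller.
 * ASSUMPTION: the fd is passed by pointer. */
int seccomp_attach_filter(int filter_fd)
{
	return (int)syscall(__NR_seccomp, SECCOMP_ATTACH_FILTER, 0, &filter_fd);
}

/* Pin an fd at a bpffs path using the existing BPF_OBJ_PIN command. */
int pin_to_bpffs(int fd, const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	attr.bpf_fd = (uint32_t)fd;
	return (int)syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* Re-open a pinned object using the existing BPF_OBJ_GET command. */
int get_from_bpffs(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	return (int)syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}
```

Under these assumptions, a launcher would call `build_allow_all()`, `seccomp_load_filter()` and `pin_to_bpffs()` once per host, and each micro VM process would then call `get_from_bpffs()` followed by `seccomp_attach_filter()`, so the filter is not prepared and JITed again for every process.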
Comments
On Tue, Oct 03, 2023 at 08:38:34AM +0000, Hengqi Chen wrote:
> This patchset introduces two new operations which essentially
> splits the SECCOMP_SET_MODE_FILTER process into two steps:
> SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
>
> The SECCOMP_LOAD_FILTER loads the filter and returns a fd
> which can be pinned to bpffs. This extends the lifetime of the
> filter and thus can be reused by different processes.
> With this new operation, we can eliminate a hot path of JITing
> BPF program (the filter) where we apply the same seccomp filter
> to thousands of micro VMs on a bare metal instance.
>
> The SECCOMP_ATTACH_FILTER is used to attach a loaded filter.
> The filter is represented by a fd which is either returned
> from SECCOMP_LOAD_FILTER or obtained from bpffs using bpf syscall.

Interesting! I like this idea, thanks for writing it up.

Two design notes:

- Can you reuse/refactor seccomp_prepare_filter() instead of duplicating
  the logic into two new functions?

- Is there a way to make sure the BPF program coming from the fd is one
  that was built via SECCOMP_LOAD_FILTER? (I want to make sure we can
  never confuse a non-seccomp program into getting loaded into seccomp.)

-Kees
On 10/3/23 10:38, Hengqi Chen wrote:
> This patchset introduces two new operations which essentially
> splits the SECCOMP_SET_MODE_FILTER process into two steps:
> SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
>
> The SECCOMP_LOAD_FILTER loads the filter and returns a fd
> which can be pinned to bpffs. This extends the lifetime of the
> filter and thus can be reused by different processes.

A quick question to see if handling something else too is
possible/reasonable to do here too.

Let me explain our use case first.

For us (Alban in cc) it would be great if we can extend the lifetime of
the fd returned, so the process managing a seccomp notification in
userspace can easily crash or be updated. Today, if the agent that got
the fd crashes, all the "notify-syscalls" return ENOSYS in the target
process.

Our use case is we created a seccomp agent to use in Kubernetes
(github.com/kinvolk/seccompagent) and we need to handle either the agent
crashing or upgrading it. We were thinking tricks to have another
container that just stores fds and make sure that never crashes, but it
is not ideal (we checked tricks to use systemd to store our fds, but it
is not simpler either to use from containers).

If the agent crashes today, all the syscalls return ENOSYS. It will be
great if we can make the process doing the syscall just wait until a new
process to handle the notifications is up and the syscalls done in the
meantime are just queued. A mode of saying "if the agent crashes, just
queue notifications, one agent to pick them up will come back soon" (we
can of course limit reasonably the notification queue).

It seems the split here would not just work for that use case. I think
we would need to pin the attachment.

Do you think handling that is something reasonable to do in this series too?

I'll be afk until end next week. I'll catch up as soon as I'm back with
internet :)

Best,
Rodrigo
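For readers unfamiliar with the notification mechanism described in this message, the following is a minimal sketch of how a notify fd is obtained today with the existing seccomp user-notification API (`SECCOMP_FILTER_FLAG_NEW_LISTENER`, `SECCOMP_RET_USER_NOTIF`, `SECCOMP_IOCTL_NOTIF_RECV`). The helper names are made up for illustration, and the filter is deliberately simplified: a production filter should also check `seccomp_data.arch`, and an agent should size its buffers with `SECCOMP_GET_NOTIF_SIZES`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define NOTIFY_FILTER_LEN 4

/* Build a filter that forwards one syscall number to the userspace
 * agent and allows everything else. Returns the instruction count.
 * Simplification: no architecture check. */
size_t build_notify_filter(struct sock_filter *insns, unsigned int syscall_nr)
{
	assert(insns != NULL);
	/* Load the syscall number from seccomp_data. */
	insns[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
						offsetof(struct seccomp_data, nr));
	/* Equal: fall through to the notify return; otherwise skip it. */
	insns[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
						syscall_nr, 0, 1);
	insns[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
						SECCOMP_RET_USER_NOTIF);
	insns[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
						SECCOMP_RET_ALLOW);
	return NOTIFY_FILTER_LEN;
}

/* Install the filter on the calling task and return the listener fd.
 * This is the fd whose lifetime the message above is about: once every
 * copy of it is closed, notified syscalls fail with ENOSYS. */
int install_with_listener(struct sock_filter *insns, size_t len)
{
	struct sock_fprog prog = {
		.len = (unsigned short)len,
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return (int)syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
			    SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Agent side: block until one notification arrives on the listener fd.
 * The kernel expects the request buffer to be zeroed beforehand. */
int receive_one(int listener_fd, struct seccomp_notif *req)
{
	memset(req, 0, sizeof(*req));
	return ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
}
```

In a typical setup the target process calls `install_with_listener()` and hands the returned fd to the agent (for example over a Unix socket), and the agent loops on `receive_one()`. Nothing here keeps the listener alive across an agent crash, which is the gap the message raises.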
+ BPF maintainers

On Wed, Oct 4, 2023 at 2:02 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Tue, Oct 03, 2023 at 08:38:34AM +0000, Hengqi Chen wrote:
> > This patchset introduces two new operations which essentially
> > splits the SECCOMP_SET_MODE_FILTER process into two steps:
> > SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
> >
> > The SECCOMP_LOAD_FILTER loads the filter and returns a fd
> > which can be pinned to bpffs. This extends the lifetime of the
> > filter and thus can be reused by different processes.
> > With this new operation, we can eliminate a hot path of JITing
> > BPF program (the filter) where we apply the same seccomp filter
> > to thousands of micro VMs on a bare metal instance.
> >
> > The SECCOMP_ATTACH_FILTER is used to attach a loaded filter.
> > The filter is represented by a fd which is either returned
> > from SECCOMP_LOAD_FILTER or obtained from bpffs using bpf syscall.
>
> Interesting! I like this idea, thanks for writing it up.
>
> Two design notes:
>
> - Can you reuse/refactor seccomp_prepare_filter() instead of duplicating
>   the logic into two new functions?
>

Sure, will do.

> - Is there a way to make sure the BPF program coming from the fd is one
>   that was built via SECCOMP_LOAD_FILTER? (I want to make sure we can
>   never confuse a non-seccomp program into getting loaded into seccomp.)
>

Maybe we can add a new prog type enum like BPF_PROG_TYPE_SECCOMP for
seccomp filter.

> -Kees
>
> --
> Kees Cook
>

Cheers,
--
Hengqi
On Wed, Oct 4, 2023 at 10:03 PM Rodrigo Campos <rodrigo@sdfg.com.ar> wrote:
>
> On 10/3/23 10:38, Hengqi Chen wrote:
> > This patchset introduces two new operations which essentially
> > splits the SECCOMP_SET_MODE_FILTER process into two steps:
> > SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
> >
> > The SECCOMP_LOAD_FILTER loads the filter and returns a fd
> > which can be pinned to bpffs. This extends the lifetime of the
> > filter and thus can be reused by different processes.
>
> A quick question to see if handling something else too is
> possible/reasonable to do here too.
>
> Let me explain our use case first.
>
> For us (Alban in cc) it would be great if we can extend the lifetime of
> the fd returned, so the process managing a seccomp notification in
> userspace can easily crash or be updated. Today, if the agent that got
> the fd crashes, all the "notify-syscalls" return ENOSYS in the target
> process.
>
> Our use case is we created a seccomp agent to use in Kubernetes
> (github.com/kinvolk/seccompagent) and we need to handle either the agent
> crashing or upgrading it. We were thinking tricks to have another
> container that just stores fds and make sure that never crashes, but it
> is not ideal (we checked tricks to use systemd to store our fds, but it
> is not simpler either to use from containers).
>
> If the agent crashes today, all the syscalls return ENOSYS. It will be
> great if we can make the process doing the syscall just wait until a new
> process to handle the notifications is up and the syscalls done in the
> meantime are just queued. A mode of saying "if the agent crashes, just
> queue notifications, one agent to pick them up will come back soon" (we
> can of course limit reasonably the notification queue).
>
> It seems the split here would not just work for that use case. I think
> we would need to pin the attachment.
>
> Do you think handling that is something reasonable to do in this series too?
>

I am not familiar with this notification mechanism, but it seems unrelated.
This patchset is trying to reuse the seccomp filter itself.

> I'll be afk until end next week. I'll catch up as soon as I'm back with
> internet :)
>
>
>
> Best,
> Rodrigo

--
Hengqi