Message ID | 20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp1042863vqo; Thu, 13 Apr 2023 06:39:14 -0700 (PDT) X-Google-Smtp-Source: AKy350Y5KFZOcBNrs9pqKRU5LUoYMX7gUniJ38HOKC6lCqN5rQpppiIu/DTn5O69WEsYj08UzXtV X-Received: by 2002:a05:6a20:af0a:b0:e7:76ba:d74d with SMTP id dr10-20020a056a20af0a00b000e776bad74dmr2302386pzb.7.1681393154212; Thu, 13 Apr 2023 06:39:14 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1681393154; cv=none; d=google.com; s=arc-20160816; b=lEFTaY2VhsiVclxanRZLi7qBpyFB3gqmseMi20TYkT6HkBjM6h8m6D/8qc2IFVD3GR uVxzvzytl4NGGAtkAHE+JzkeGvOMnQF1BwK+zECZ8SG/Lcu58dEg0kt9sT8wwfwgJBS7 /WqNIhnRqsg4T0PYbcOJoTDu5c+MySnrEPkPgpv4xQ+ONMn02fsS/7L5lcH8eu5OLImz nobQv+gSVJYhOpb1i4Xu/87RH1zVWMX0UNMtokw52YfoYolGPCSQIGsHWXzZvEOavJ5t QKFjOWBWFC3TT5OcjVl/Zvcpw77QKZeNZSKRU+R1z4wIRW/h/DpzoERuFUB5Jie5YsJc QAJA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=pU+XkxxaqxIltFz6HAiJgXotXtTecmR/abIbMqTWsWw=; b=FLgbWj1R6zz9PpVAKgtgWjZiEPlb8lbpDuqRkUukHyzPhcYaYKF2EtUIEbCCB2IBE8 10dA2W8ss8RJdBzz4juoE5KCtqlZ2F0vsm/++LABN61vqnjQeJ61gCL9gJ9UcL1LL9zy I82jO9v4Sfs7Zi+bhQwDrL+cQBtktOxHMsmraHoIZWKqw2BP+ZEy/o6fFHDZKKVPr97X gLwMpo7XvzC4rQ9b/FT/nBqOZMlo3xnsvWTMuwwLc6kDKeD13A41Qqe8TZOnQ2Vbt7wN ddT2lT0keBZLX3ygG8wfAyAwqwgLrtqhlx9i3MFPVmGPN7aQ3USfzG9eZFNpXQLStj1S g+KA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@canonical.com header.s=20210705 header.b=fSdLFv6+; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=canonical.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 20-20020a630f54000000b0051393805ccesi2156446pgp.418.2023.04.13.06.38.58; Thu, 13 Apr 2023 06:39:14 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@canonical.com header.s=20210705 header.b=fSdLFv6+; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=canonical.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231621AbjDMNhj (ORCPT <rfc822;peter110.wang@gmail.com> + 99 others); Thu, 13 Apr 2023 09:37:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37620 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230502AbjDMNhS (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 13 Apr 2023 09:37:18 -0400 Received: from smtp-relay-internal-0.canonical.com (smtp-relay-internal-0.canonical.com [185.125.188.122]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 99916B779 for <linux-kernel@vger.kernel.org>; Thu, 13 Apr 2023 06:35:05 -0700 (PDT) Received: from mail-ed1-f70.google.com (mail-ed1-f70.google.com [209.85.208.70]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-internal-0.canonical.com (Postfix) with ESMTPS id BC7793F43B for <linux-kernel@vger.kernel.org>; Thu, 13 Apr 2023 13:34:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=canonical.com; s=20210705; t=1681392880; bh=pU+XkxxaqxIltFz6HAiJgXotXtTecmR/abIbMqTWsWw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=fSdLFv6+6AwtL1xo8lr+sbRcMDlnQOpxNFbKF1loIK3JtPIuU6Cd255RADbyUjM1e mApLFFd90uIHUgVuW0iLpUUg0YJEPEjXR6k8cjHWN4co1VfoY1Or5iMisiYttIBwTb A9KQuRa67zJnpLR+BvXFa3ATy0doBYvADt595MduF11TQAmkNZDRm9YGr11p0oENXX J8wyW0JjBZh0ZHpc+qqDyJRF2496so/351JVfa3VVrQetQRY/KQrOuGMCwtEESczD7 KXpMmrZXfgsj7YWYlODOiv+8VH4XTKPC6cFmLhO8bF+KjH05nB+EfDcx4wVJIzOMny H/pLsAFXSbvCA== Received: by mail-ed1-f70.google.com with SMTP id 4fb4d7f45d1cf-505149e1a4eso3102652a12.1 for <linux-kernel@vger.kernel.org>; Thu, 13 Apr 2023 06:34:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681392877; x=1683984877; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=pU+XkxxaqxIltFz6HAiJgXotXtTecmR/abIbMqTWsWw=; b=WWs3p7TZSj6AZDvpabfZHP3K0OxgLmiy5K3FdG3Ljg52wtv2B6UTHE4rr5p7drPECg GoJpfA2CpIrwx7Q3a6K1IybkCyF2J5LaI0Huv1Xfr7fTtSWArJkornOV8r9KInA5U3k2 RR0dLoKg80QnMbU23vGeq8sunqYzgbtNXr+NvCZJCwIq4QtuzBzyjvW7NgcOG0vci0MB AdB1KtJnQXTPugSzijxu0mBSO9MKbSzAhROZumC6WpdFbKEN/DNB6KKEpeVTg5ai0DSp PpwVd4GGf5uUHdEbbySAVyPHqpg/1vXce9sEKJikRjFvXTjLdeBazQpjXhLZaUWb4vgN mAzg== X-Gm-Message-State: AAQBX9fmDXO40ciQz6U1PS71wnvBBfCRvP+0objQ1iuo1FhR0OwX1MhD nGVZ/jVdM/Y8cWqZeaSepBaKxefuhGobT9UvQ9SMOVpgSMzXMI07SLFP6ZMtcHXgXnoygXv+hMJ y00Ip0zKW783xrpbqMFQO8QxRY7k41IH0jAUvrbX0rjkY8tzzqQ== X-Received: by 2002:a05:6402:49:b0:504:8a0f:13ca with SMTP id f9-20020a056402004900b005048a0f13camr1932415edu.10.1681392877052; Thu, 13 Apr 2023 06:34:37 -0700 (PDT) X-Received: by 2002:a05:6402:49:b0:504:8a0f:13ca with SMTP id f9-20020a056402004900b005048a0f13camr1932387edu.10.1681392876801; Thu, 13 Apr 2023 06:34:36 -0700 (PDT) Received: from amikhalitsyn.. ([95.91.208.118]) by smtp.gmail.com with ESMTPSA id et22-20020a170907295600b0094a966330fdsm976806ejc.211.2023.04.13.06.34.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 13 Apr 2023 06:34:36 -0700 (PDT) From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> To: davem@davemloft.net Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, daniel@iogearbox.net, Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Leon Romanovsky <leon@kernel.org>, David Ahern <dsahern@kernel.org>, Arnd Bergmann <arnd@arndb.de>, Kees Cook <keescook@chromium.org>, Christian Brauner <brauner@kernel.org>, Kuniyuki Iwashima <kuniyu@amazon.com>, Lennart Poettering <mzxreary@0pointer.de>, linux-arch@vger.kernel.org Subject: [PATCH net-next v4 2/4] net: socket: add sockopts blacklist for BPF cgroup hook Date: Thu, 13 Apr 2023 15:33:53 +0200 Message-Id: <20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230413133355.350571-1-aleksandr.mikhalitsyn@canonical.com> References: <20230413133355.350571-1-aleksandr.mikhalitsyn@canonical.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1763068507948335851?= X-GMAIL-MSGID: =?utf-8?q?1763068507948335851?= |
Series |
Add SCM_PIDFD and SO_PEERPIDFD
|
|
Commit Message
Aleksandr Mikhalitsyn
April 13, 2023, 1:33 p.m. UTC
During work on SO_PEERPIDFD, it was discovered (thanks to Christian),
that bpf cgroup hook can cause FD leaks when used with sockopts which
install FDs into the process fdtable.
After some offlist discussion it was proposed to add a blacklist of
socket options those can cause troubles when BPF cgroup hook is enabled.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
net/socket.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
Comments
On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> wrote: > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > that bpf cgroup hook can cause FD leaks when used with sockopts which > install FDs into the process fdtable. > > After some offlist discussion it was proposed to add a blacklist of We try to replace this word by either denylist or blocklist, even in changelogs. > socket options those can cause troubles when BPF cgroup hook is enabled. > Can we find the appropriate Fixes: tag to help stable teams ? > Cc: "David S. Miller" <davem@davemloft.net> > Cc: Eric Dumazet <edumazet@google.com> > Cc: Jakub Kicinski <kuba@kernel.org> > Cc: Paolo Abeni <pabeni@redhat.com> > Cc: Leon Romanovsky <leon@kernel.org> > Cc: David Ahern <dsahern@kernel.org> > Cc: Arnd Bergmann <arnd@arndb.de> > Cc: Kees Cook <keescook@chromium.org> > Cc: Christian Brauner <brauner@kernel.org> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > Cc: Lennart Poettering <mzxreary@0pointer.de> > Cc: linux-kernel@vger.kernel.org > Cc: netdev@vger.kernel.org > Cc: linux-arch@vger.kernel.org > Suggested-by: Daniel Borkmann <daniel@iogearbox.net> > Suggested-by: Christian Brauner <brauner@kernel.org> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Thanks.
On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > > that bpf cgroup hook can cause FD leaks when used with sockopts which > > install FDs into the process fdtable. > > > > After some offlist discussion it was proposed to add a blacklist of > > We try to replace this word by either denylist or blocklist, even in changelogs. Hi Eric, Oh, I'm sorry about that. :( Sure. > > > socket options those can cause troubles when BPF cgroup hook is enabled. > > > > Can we find the appropriate Fixes: tag to help stable teams ? Sure, I will add next time. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") I think it's better to add Stanislav Fomichev to CC. Kind regards, Alex > > > Cc: "David S. Miller" <davem@davemloft.net> > > Cc: Eric Dumazet <edumazet@google.com> > > Cc: Jakub Kicinski <kuba@kernel.org> > > Cc: Paolo Abeni <pabeni@redhat.com> > > Cc: Leon Romanovsky <leon@kernel.org> > > Cc: David Ahern <dsahern@kernel.org> > > Cc: Arnd Bergmann <arnd@arndb.de> > > Cc: Kees Cook <keescook@chromium.org> > > Cc: Christian Brauner <brauner@kernel.org> > > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > > Cc: Lennart Poettering <mzxreary@0pointer.de> > > Cc: linux-kernel@vger.kernel.org > > Cc: netdev@vger.kernel.org > > Cc: linux-arch@vger.kernel.org > > Suggested-by: Daniel Borkmann <daniel@iogearbox.net> > > Suggested-by: Christian Brauner <brauner@kernel.org> > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> > > Thanks.
On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> wrote: > > On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > > > that bpf cgroup hook can cause FD leaks when used with sockopts which > > > install FDs into the process fdtable. > > > > > > After some offlist discussion it was proposed to add a blacklist of > > > > We try to replace this word by either denylist or blocklist, even in changelogs. > > Hi Eric, > > Oh, I'm sorry about that. :( Sure. > > > > > > socket options those can cause troubles when BPF cgroup hook is enabled. > > > > > > > Can we find the appropriate Fixes: tag to help stable teams ? > > Sure, I will add next time. > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > I think it's better to add Stanislav Fomichev to CC. Can we use 'struct proto' bpf_bypass_getsockopt instead? We already use it for tcp zerocopy, I'm assuming it should work in this case as well? > Kind regards, > Alex > > > > > > Cc: "David S. Miller" <davem@davemloft.net> > > > Cc: Eric Dumazet <edumazet@google.com> > > > Cc: Jakub Kicinski <kuba@kernel.org> > > > Cc: Paolo Abeni <pabeni@redhat.com> > > > Cc: Leon Romanovsky <leon@kernel.org> > > > Cc: David Ahern <dsahern@kernel.org> > > > Cc: Arnd Bergmann <arnd@arndb.de> > > > Cc: Kees Cook <keescook@chromium.org> > > > Cc: Christian Brauner <brauner@kernel.org> > > > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > > > Cc: Lennart Poettering <mzxreary@0pointer.de> > > > Cc: linux-kernel@vger.kernel.org > > > Cc: netdev@vger.kernel.org > > > Cc: linux-arch@vger.kernel.org > > > Suggested-by: Daniel Borkmann <daniel@iogearbox.net> > > > Suggested-by: Christian Brauner <brauner@kernel.org> > > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> > > > > Thanks.
On Thu, 13 Apr 2023 16:38:35 +0200 Aleksandr Mikhalitsyn wrote: > Sure, I will add next time. > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > I think it's better to add Stanislav Fomichev to CC. This should go in separately as a fix, right? Not in a -next series. Also the whole set as is does not build on top of net-next :(
On 04/13, Stanislav Fomichev wrote: > On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > > > > that bpf cgroup hook can cause FD leaks when used with sockopts which > > > > install FDs into the process fdtable. > > > > > > > > After some offlist discussion it was proposed to add a blacklist of > > > > > > We try to replace this word by either denylist or blocklist, even in changelogs. > > > > Hi Eric, > > > > Oh, I'm sorry about that. :( Sure. > > > > > > > > > socket options those can cause troubles when BPF cgroup hook is enabled. > > > > > > > > > > Can we find the appropriate Fixes: tag to help stable teams ? > > > > Sure, I will add next time. > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > > I think it's better to add Stanislav Fomichev to CC. > > Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > use it for tcp zerocopy, I'm assuming it should work in this case as > well? Jakub reminded me of the other things I wanted to ask here bug forgot: - setsockopt is probably not needed, right? setsockopt hook triggers before the kernel and shouldn't leak anything - for getsockopt, instead of bypassing bpf completely, should we instead ignore the error from the bpf program? that would still preserve the observability aspect - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that gets called whenever bpf returns an error to make sure protocols have a chance to handle that condition (and free the fd) > > Kind regards, > > Alex > > > > > > > > > Cc: "David S. Miller" <davem@davemloft.net> > > > > Cc: Eric Dumazet <edumazet@google.com> > > > > Cc: Jakub Kicinski <kuba@kernel.org> > > > > Cc: Paolo Abeni <pabeni@redhat.com> > > > > Cc: Leon Romanovsky <leon@kernel.org> > > > > Cc: David Ahern <dsahern@kernel.org> > > > > Cc: Arnd Bergmann <arnd@arndb.de> > > > > Cc: Kees Cook <keescook@chromium.org> > > > > Cc: Christian Brauner <brauner@kernel.org> > > > > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > > > > Cc: Lennart Poettering <mzxreary@0pointer.de> > > > > Cc: linux-kernel@vger.kernel.org > > > > Cc: netdev@vger.kernel.org > > > > Cc: linux-arch@vger.kernel.org > > > > Suggested-by: Daniel Borkmann <daniel@iogearbox.net> > > > > Suggested-by: Christian Brauner <brauner@kernel.org> > > > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> > > > > > > Thanks.
On Fri, Apr 14, 2023 at 06:55:39PM -0700, Stanislav Fomichev wrote: > On 04/13, Stanislav Fomichev wrote: > > On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > > > > > that bpf cgroup hook can cause FD leaks when used with sockopts which > > > > > install FDs into the process fdtable. > > > > > > > > > > After some offlist discussion it was proposed to add a blacklist of > > > > > > > > We try to replace this word by either denylist or blocklist, even in changelogs. > > > > > > Hi Eric, > > > > > > Oh, I'm sorry about that. :( Sure. > > > > > > > > > > > > socket options those can cause troubles when BPF cgroup hook is enabled. > > > > > > > > > > > > > Can we find the appropriate Fixes: tag to help stable teams ? > > > > > > Sure, I will add next time. > > > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > > > > I think it's better to add Stanislav Fomichev to CC. > > > > Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > > use it for tcp zerocopy, I'm assuming it should work in this case as > > well? > > Jakub reminded me of the other things I wanted to ask here bug forgot: > > - setsockopt is probably not needed, right? setsockopt hook triggers > before the kernel and shouldn't leak anything > - for getsockopt, instead of bypassing bpf completely, should we instead > ignore the error from the bpf program? that would still preserve That's fine by me as well. It'd be great if the net folks could tell Alex how they would want this handled. > the observability aspect Please see for more details https://lore.kernel.org/lkml/20230411-nudelsalat-spreu-3038458f25c4@brauner > - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > gets called whenever bpf returns an error to make sure protocols have > a chance to handle that condition (and free the fd) Installing an fd into an fdtable makes it visible to userspace at which point calling close_fd() is doable but an absolute last resort and generally a good indicator of misdesign. If the bpf hook wants to make decisions based on the file then it should receive a struct file, not an fd.
On Mon, Apr 17, 2023 at 7:42 AM Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Apr 14, 2023 at 06:55:39PM -0700, Stanislav Fomichev wrote: > > On 04/13, Stanislav Fomichev wrote: > > > On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > > > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > > > > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > > > > > > that bpf cgroup hook can cause FD leaks when used with sockopts which > > > > > > install FDs into the process fdtable. > > > > > > > > > > > > After some offlist discussion it was proposed to add a blacklist of > > > > > > > > > > We try to replace this word by either denylist or blocklist, even in changelogs. > > > > > > > > Hi Eric, > > > > > > > > Oh, I'm sorry about that. :( Sure. > > > > > > > > > > > > > > > socket options those can cause troubles when BPF cgroup hook is enabled. > > > > > > > > > > > > > > > > Can we find the appropriate Fixes: tag to help stable teams ? > > > > > > > > Sure, I will add next time. > > > > > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > > > > > > I think it's better to add Stanislav Fomichev to CC. > > > > > > Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > > > use it for tcp zerocopy, I'm assuming it should work in this case as > > > well? > > > > Jakub reminded me of the other things I wanted to ask here bug forgot: > > > > - setsockopt is probably not needed, right? setsockopt hook triggers > > before the kernel and shouldn't leak anything > > - for getsockopt, instead of bypassing bpf completely, should we instead > > ignore the error from the bpf program? that would still preserve > > That's fine by me as well. > > It'd be great if the net folks could tell Alex how they would want this > handled. Doing the bypass seems fine with me for now. If we ever decide that fd-based optvals are worth inspecting in bpf, we can lift that bypass. > > the observability aspect > > Please see for more details > https://lore.kernel.org/lkml/20230411-nudelsalat-spreu-3038458f25c4@brauner Thanks for the context. Yeah, sockopts are being used for a lot of interesting things :-( > > - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > > gets called whenever bpf returns an error to make sure protocols have > > a chance to handle that condition (and free the fd) > > Installing an fd into an fdtable makes it visible to userspace at which > point calling close_fd() is doable but an absolute last resort and > generally a good indicator of misdesign. If the bpf hook wants to make > decisions based on the file then it should receive a struct > file, not an fd. SG! Then let's not over-complicate it for now and do a simple bypass.
On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > On 04/13, Stanislav Fomichev wrote: >> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn >> <aleksandr.mikhalitsyn@canonical.com> wrote: >>> >>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: >>>> >>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn >>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>> >>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), >>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which >>>>> install FDs into the process fdtable. >>>>> >>>>> After some offlist discussion it was proposed to add a blacklist of >>>> >>>> We try to replace this word by either denylist or blocklist, even in changelogs. >>> >>> Hi Eric, >>> >>> Oh, I'm sorry about that. :( Sure. >>> >>>> >>>>> socket options those can cause troubles when BPF cgroup hook is enabled. >>>>> >>>> >>>> Can we find the appropriate Fixes: tag to help stable teams ? >>> >>> Sure, I will add next time. >>> >>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") >>> >>> I think it's better to add Stanislav Fomichev to CC. >> >> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already >> use it for tcp zerocopy, I'm assuming it should work in this case as >> well? > > Jakub reminded me of the other things I wanted to ask here bug forgot: > > - setsockopt is probably not needed, right? setsockopt hook triggers > before the kernel and shouldn't leak anything > - for getsockopt, instead of bypassing bpf completely, should we instead > ignore the error from the bpf program? that would still preserve > the observability aspect stealing this thread to discuss the optlen issue which may make sense to bypass also. There has been issue with optlen. Other than this older post related to optlen > PAGE_SIZE: https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, the recent one related to optlen that we have seen is NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel put the expected optlen (> 0) and 'return 0;' to userspace. The userspace intention is to learn the expected optlen. This makes 'ctx.optlen > max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning -EFAULT to the userspace even the bpf prog has not changed anything. Does it make sense to also bypass the bpf prog when 'ctx.optlen > max_optlen' for now (and this can use a separate patch which as usual requires a bpf selftests)? In the future, does it make sense to have a specific cgroup-bpf-prog (a specific attach type?) that only uses bpf_dynptr kfunc to access the optval such that it can enforce read-only for some optname and potentially also track if bpf-prog has written a new optval? The bpf-prog can only return 1 (OK) and only allows using bpf_set_retval() instead. Likely there is still holes but could be a seed of thought to continue polishing the idea. > - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > gets called whenever bpf returns an error to make sure protocols have > a chance to handle that condition (and free the fd) >
On 04/17, Martin KaFai Lau wrote: > On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > > On 04/13, Stanislav Fomichev wrote: > > > On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > > > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > > > > > During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > > > > > > that bpf cgroup hook can cause FD leaks when used with sockopts which > > > > > > install FDs into the process fdtable. > > > > > > > > > > > > After some offlist discussion it was proposed to add a blacklist of > > > > > > > > > > We try to replace this word by either denylist or blocklist, even in changelogs. > > > > > > > > Hi Eric, > > > > > > > > Oh, I'm sorry about that. :( Sure. > > > > > > > > > > > > > > > socket options those can cause troubles when BPF cgroup hook is enabled. > > > > > > > > > > > > > > > > Can we find the appropriate Fixes: tag to help stable teams ? > > > > > > > > Sure, I will add next time. > > > > > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > > > > > > I think it's better to add Stanislav Fomichev to CC. > > > > > > Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > > > use it for tcp zerocopy, I'm assuming it should work in this case as > > > well? > > > > Jakub reminded me of the other things I wanted to ask here bug forgot: > > > > - setsockopt is probably not needed, right? setsockopt hook triggers > > before the kernel and shouldn't leak anything > > - for getsockopt, instead of bypassing bpf completely, should we instead > > ignore the error from the bpf program? that would still preserve > > the observability aspect > > stealing this thread to discuss the optlen issue which may make sense to > bypass also. > > There has been issue with optlen. Other than this older post related to > optlen > PAGE_SIZE: > https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, > the recent one related to optlen that we have seen is > NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel > put the expected optlen (> 0) and 'return 0;' to userspace. The userspace > intention is to learn the expected optlen. This makes 'ctx.optlen > > max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning > -EFAULT to the userspace even the bpf prog has not changed anything. (ignoring -EFAULT issue) this seems like it needs to be if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { /* error */ } ? > Does it make sense to also bypass the bpf prog when 'ctx.optlen > > max_optlen' for now (and this can use a separate patch which as usual > requires a bpf selftests)? Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something seems like the way to go. It caused too much trouble already :-( Should I prepare a patch or do you want to take a stab at it? > In the future, does it make sense to have a specific cgroup-bpf-prog (a > specific attach type?) that only uses bpf_dynptr kfunc to access the optval > such that it can enforce read-only for some optname and potentially also > track if bpf-prog has written a new optval? The bpf-prog can only return 1 > (OK) and only allows using bpf_set_retval() instead. Likely there is still > holes but could be a seed of thought to continue polishing the idea. Ack, let's think about it. Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, and we have a mostly working bpf_getsockopt (after your refactoring), I don't see why we need to continue the current scheme of triggering after the kernel? > > - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > > gets called whenever bpf returns an error to make sure protocols have > > a chance to handle that condition (and free the fd) > > > >
On 4/18/23 09:47, Stanislav Fomichev wrote: > On 04/17, Martin KaFai Lau wrote: >> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: >>> On 04/13, Stanislav Fomichev wrote: >>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn >>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>> >>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: >>>>>> >>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>>>> >>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), >>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which >>>>>>> install FDs into the process fdtable. >>>>>>> >>>>>>> After some offlist discussion it was proposed to add a blacklist of >>>>>> >>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. >>>>> >>>>> Hi Eric, >>>>> >>>>> Oh, I'm sorry about that. :( Sure. >>>>> >>>>>> >>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. >>>>>>> >>>>>> >>>>>> Can we find the appropriate Fixes: tag to help stable teams ? >>>>> >>>>> Sure, I will add next time. >>>>> >>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") >>>>> >>>>> I think it's better to add Stanislav Fomichev to CC. >>>> >>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already >>>> use it for tcp zerocopy, I'm assuming it should work in this case as >>>> well? >>> >>> Jakub reminded me of the other things I wanted to ask here bug forgot: >>> >>> - setsockopt is probably not needed, right? setsockopt hook triggers >>> before the kernel and shouldn't leak anything >>> - for getsockopt, instead of bypassing bpf completely, should we instead >>> ignore the error from the bpf program? that would still preserve >>> the observability aspect >> >> stealing this thread to discuss the optlen issue which may make sense to >> bypass also. >> >> There has been issue with optlen. Other than this older post related to >> optlen > PAGE_SIZE: >> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, >> the recent one related to optlen that we have seen is >> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel >> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace >> intention is to learn the expected optlen. This makes 'ctx.optlen > >> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning >> -EFAULT to the userspace even the bpf prog has not changed anything. > > (ignoring -EFAULT issue) this seems like it needs to be > > if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { > /* error */ > } > > ? > >> Does it make sense to also bypass the bpf prog when 'ctx.optlen > >> max_optlen' for now (and this can use a separate patch which as usual >> requires a bpf selftests)? > > Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something > seems like the way to go. It caused too much trouble already :-( > > Should I prepare a patch or do you want to take a stab at it? > >> In the future, does it make sense to have a specific cgroup-bpf-prog (a >> specific attach type?) that only uses bpf_dynptr kfunc to access the optval >> such that it can enforce read-only for some optname and potentially also >> track if bpf-prog has written a new optval? The bpf-prog can only return 1 >> (OK) and only allows using bpf_set_retval() instead. Likely there is still >> holes but could be a seed of thought to continue polishing the idea. > > Ack, let's think about it. > > Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea > as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, > and we have a mostly working bpf_getsockopt (after your refactoring), > I don't see why we need to continue the current scheme of triggering > after the kernel? Since a sleepable hook would cause some restrictions, perhaps, we could introduce something like the promise pattern. In our case here, BPF program call an async version of copy_from_user()/copy_to_user() to return a promise. > >>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that >>> gets called whenever bpf returns an error to make sure protocols have >>> a chance to handle that condition (and free the fd) >>> >> >>
On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 4/18/23 09:47, Stanislav Fomichev wrote: > > On 04/17, Martin KaFai Lau wrote: > >> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > >>> On 04/13, Stanislav Fomichev wrote: > >>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > >>>> <aleksandr.mikhalitsyn@canonical.com> wrote: > >>>>> > >>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > >>>>>> > >>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: > >>>>>>> > >>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > >>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which > >>>>>>> install FDs into the process fdtable. > >>>>>>> > >>>>>>> After some offlist discussion it was proposed to add a blacklist of > >>>>>> > >>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. > >>>>> > >>>>> Hi Eric, > >>>>> > >>>>> Oh, I'm sorry about that. :( Sure. > >>>>> > >>>>>> > >>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. > >>>>>>> > >>>>>> > >>>>>> Can we find the appropriate Fixes: tag to help stable teams ? > >>>>> > >>>>> Sure, I will add next time. > >>>>> > >>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > >>>>> > >>>>> I think it's better to add Stanislav Fomichev to CC. > >>>> > >>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > >>>> use it for tcp zerocopy, I'm assuming it should work in this case as > >>>> well? > >>> > >>> Jakub reminded me of the other things I wanted to ask here bug forgot: > >>> > >>> - setsockopt is probably not needed, right? setsockopt hook triggers > >>> before the kernel and shouldn't leak anything > >>> - for getsockopt, instead of bypassing bpf completely, should we instead > >>> ignore the error from the bpf program? that would still preserve > >>> the observability aspect > >> > >> stealing this thread to discuss the optlen issue which may make sense to > >> bypass also. > >> > >> There has been issue with optlen. Other than this older post related to > >> optlen > PAGE_SIZE: > >> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, > >> the recent one related to optlen that we have seen is > >> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel > >> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace > >> intention is to learn the expected optlen. This makes 'ctx.optlen > > >> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning > >> -EFAULT to the userspace even the bpf prog has not changed anything. > > > > (ignoring -EFAULT issue) this seems like it needs to be > > > > if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { > > /* error */ > > } > > > > ? > > > >> Does it make sense to also bypass the bpf prog when 'ctx.optlen > > >> max_optlen' for now (and this can use a separate patch which as usual > >> requires a bpf selftests)? > > > > Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something > > seems like the way to go. It caused too much trouble already :-( > > > > Should I prepare a patch or do you want to take a stab at it? > > > >> In the future, does it make sense to have a specific cgroup-bpf-prog (a > >> specific attach type?) that only uses bpf_dynptr kfunc to access the optval > >> such that it can enforce read-only for some optname and potentially also > >> track if bpf-prog has written a new optval? The bpf-prog can only return 1 > >> (OK) and only allows using bpf_set_retval() instead. Likely there is still > >> holes but could be a seed of thought to continue polishing the idea. > > > > Ack, let's think about it. > > > > Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea > > as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, > > and we have a mostly working bpf_getsockopt (after your refactoring), > > I don't see why we need to continue the current scheme of triggering > > after the kernel? > > Since a sleepable hook would cause some restrictions, perhaps, we could > introduce something like the promise pattern. In our case here, BPF > program call an async version of copy_from_user()/copy_to_user() to > return a promise. Having a promise might work. This is essentially what we already do with sockets/etc with acquire/release pattern. What are the sleepable restrictions you're hinting about? I feel like with the sleepable bpf, we can also remove all the temporary buffer management / extra copies which sounds like a win to me. (we have this ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can allocate temporary buffers if needed.. > >>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > >>> gets called whenever bpf returns an error to make sure protocols have > >>> a chance to handle that condition (and free the fd) > >>> > >> > >>
On 4/25/23 11:42, Stanislav Fomichev wrote: > On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@gmail.com> wrote: >> >> >> >> On 4/18/23 09:47, Stanislav Fomichev wrote: >>> On 04/17, Martin KaFai Lau wrote: >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: >>>>> On 04/13, Stanislav Fomichev wrote: >>>>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>>>> >>>>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: >>>>>>>> >>>>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn >>>>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>>>>>> >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which >>>>>>>>> install FDs into the process fdtable. >>>>>>>>> >>>>>>>>> After some offlist discussion it was proposed to add a blacklist of >>>>>>>> >>>>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. >>>>>>> >>>>>>> Hi Eric, >>>>>>> >>>>>>> Oh, I'm sorry about that. :( Sure. >>>>>>> >>>>>>>> >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. >>>>>>>>> >>>>>>>> >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ? >>>>>>> >>>>>>> Sure, I will add next time. >>>>>>> >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") >>>>>>> >>>>>>> I think it's better to add Stanislav Fomichev to CC. >>>>>> >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case as >>>>>> well? >>>>> >>>>> Jakub reminded me of the other things I wanted to ask here bug forgot: >>>>> >>>>> - setsockopt is probably not needed, right? setsockopt hook triggers >>>>> before the kernel and shouldn't leak anything >>>>> - for getsockopt, instead of bypassing bpf completely, should we instead >>>>> ignore the error from the bpf program? that would still preserve >>>>> the observability aspect >>>> >>>> stealing this thread to discuss the optlen issue which may make sense to >>>> bypass also. >>>> >>>> There has been issue with optlen. Other than this older post related to >>>> optlen > PAGE_SIZE: >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, >>>> the recent one related to optlen that we have seen is >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace >>>> intention is to learn the expected optlen. This makes 'ctx.optlen > >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning >>>> -EFAULT to the userspace even the bpf prog has not changed anything. >>> >>> (ignoring -EFAULT issue) this seems like it needs to be >>> >>> if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { >>> /* error */ >>> } >>> >>> ? >>> >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen > >>>> max_optlen' for now (and this can use a separate patch which as usual >>>> requires a bpf selftests)? >>> >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something >>> seems like the way to go. It caused too much trouble already :-( >>> >>> Should I prepare a patch or do you want to take a stab at it? >>> >>>> In the future, does it make sense to have a specific cgroup-bpf-prog (a >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the optval >>>> such that it can enforce read-only for some optname and potentially also >>>> track if bpf-prog has written a new optval? The bpf-prog can only return 1 >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is still >>>> holes but could be a seed of thought to continue polishing the idea. >>> >>> Ack, let's think about it. >>> >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea >>> as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, >>> and we have a mostly working bpf_getsockopt (after your refactoring), >>> I don't see why we need to continue the current scheme of triggering >>> after the kernel? >> >> Since a sleepable hook would cause some restrictions, perhaps, we could >> introduce something like the promise pattern. In our case here, BPF >> program call an async version of copy_from_user()/copy_to_user() to >> return a promise. > > Having a promise might work. This is essentially what we already do > with sockets/etc with acquire/release pattern. > > What are the sleepable restrictions you're hinting about? I feel like AFAIK, a sleepable program can use only some types of maps; for example, array, hash, ringbuf,... etc. Other types of maps are unavailable to sleepable programs; for example, sockmap, sockhash. > with the sleepable bpf, we can also remove all the temporary buffer > management / extra copies which sounds like a win to me. (we have this > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can > allocate temporary buffers if needed.. Agree! > >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that >>>>> gets called whenever bpf returns an error to make sure protocols have >>>>> a chance to handle that condition (and free the fd) >>>>> >>>> >>>>
On 4/25/23 11:42, Stanislav Fomichev wrote: > On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@gmail.com> wrote: >> >> >> >> On 4/18/23 09:47, Stanislav Fomichev wrote: >>> On 04/17, Martin KaFai Lau wrote: >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: >>>>> On 04/13, Stanislav Fomichev wrote: >>>>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>>>> >>>>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: >>>>>>>> >>>>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn >>>>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: >>>>>>>>> >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which >>>>>>>>> install FDs into the process fdtable. >>>>>>>>> >>>>>>>>> After some offlist discussion it was proposed to add a blacklist of >>>>>>>> >>>>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. >>>>>>> >>>>>>> Hi Eric, >>>>>>> >>>>>>> Oh, I'm sorry about that. :( Sure. >>>>>>> >>>>>>>> >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. >>>>>>>>> >>>>>>>> >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ? >>>>>>> >>>>>>> Sure, I will add next time. >>>>>>> >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") >>>>>>> >>>>>>> I think it's better to add Stanislav Fomichev to CC. >>>>>> >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case as >>>>>> well? >>>>> >>>>> Jakub reminded me of the other things I wanted to ask here bug forgot: >>>>> >>>>> - setsockopt is probably not needed, right? setsockopt hook triggers >>>>> before the kernel and shouldn't leak anything >>>>> - for getsockopt, instead of bypassing bpf completely, should we instead >>>>> ignore the error from the bpf program? that would still preserve >>>>> the observability aspect >>>> >>>> stealing this thread to discuss the optlen issue which may make sense to >>>> bypass also. >>>> >>>> There has been issue with optlen. Other than this older post related to >>>> optlen > PAGE_SIZE: >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, >>>> the recent one related to optlen that we have seen is >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace >>>> intention is to learn the expected optlen. This makes 'ctx.optlen > >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning >>>> -EFAULT to the userspace even the bpf prog has not changed anything. >>> >>> (ignoring -EFAULT issue) this seems like it needs to be >>> >>> if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { >>> /* error */ >>> } >>> >>> ? >>> >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen > >>>> max_optlen' for now (and this can use a separate patch which as usual >>>> requires a bpf selftests)? >>> >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something >>> seems like the way to go. It caused too much trouble already :-( >>> >>> Should I prepare a patch or do you want to take a stab at it? >>> >>>> In the future, does it make sense to have a specific cgroup-bpf-prog (a >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the optval >>>> such that it can enforce read-only for some optname and potentially also >>>> track if bpf-prog has written a new optval? The bpf-prog can only return 1 >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is still >>>> holes but could be a seed of thought to continue polishing the idea. >>> >>> Ack, let's think about it. >>> >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea >>> as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, >>> and we have a mostly working bpf_getsockopt (after your refactoring), >>> I don't see why we need to continue the current scheme of triggering >>> after the kernel? >> >> Since a sleepable hook would cause some restrictions, perhaps, we could >> introduce something like the promise pattern. In our case here, BPF >> program call an async version of copy_from_user()/copy_to_user() to >> return a promise. > > Having a promise might work. This is essentially what we already do > with sockets/etc with acquire/release pattern. Would you mind to give me some context of the socket things? > > What are the sleepable restrictions you're hinting about? I feel like > with the sleepable bpf, we can also remove all the temporary buffer > management / extra copies which sounds like a win to me. (we have this > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can > allocate temporary buffers if needed.. > >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that >>>>> gets called whenever bpf returns an error to make sure protocols have >>>>> a chance to handle that condition (and free the fd) >>>>> >>>> >>>>
On Tue, Apr 25, 2023 at 2:28 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 4/25/23 11:42, Stanislav Fomichev wrote: > > On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@gmail.com> wrote: > >> > >> > >> > >> On 4/18/23 09:47, Stanislav Fomichev wrote: > >>> On 04/17, Martin KaFai Lau wrote: > >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > >>>>> On 04/13, Stanislav Fomichev wrote: > >>>>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: > >>>>>>> > >>>>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > >>>>>>>> > >>>>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > >>>>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: > >>>>>>>>> > >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which > >>>>>>>>> install FDs into the process fdtable. > >>>>>>>>> > >>>>>>>>> After some offlist discussion it was proposed to add a blacklist of > >>>>>>>> > >>>>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. > >>>>>>> > >>>>>>> Hi Eric, > >>>>>>> > >>>>>>> Oh, I'm sorry about that. :( Sure. > >>>>>>> > >>>>>>>> > >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ? > >>>>>>> > >>>>>>> Sure, I will add next time. > >>>>>>> > >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > >>>>>>> > >>>>>>> I think it's better to add Stanislav Fomichev to CC. > >>>>>> > >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case as > >>>>>> well? > >>>>> > >>>>> Jakub reminded me of the other things I wanted to ask here bug forgot: > >>>>> > >>>>> - setsockopt is probably not needed, right? setsockopt hook triggers > >>>>> before the kernel and shouldn't leak anything > >>>>> - for getsockopt, instead of bypassing bpf completely, should we instead > >>>>> ignore the error from the bpf program? that would still preserve > >>>>> the observability aspect > >>>> > >>>> stealing this thread to discuss the optlen issue which may make sense to > >>>> bypass also. > >>>> > >>>> There has been issue with optlen. Other than this older post related to > >>>> optlen > PAGE_SIZE: > >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, > >>>> the recent one related to optlen that we have seen is > >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel > >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace > >>>> intention is to learn the expected optlen. This makes 'ctx.optlen > > >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning > >>>> -EFAULT to the userspace even the bpf prog has not changed anything. > >>> > >>> (ignoring -EFAULT issue) this seems like it needs to be > >>> > >>> if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { > >>> /* error */ > >>> } > >>> > >>> ? > >>> > >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen > > >>>> max_optlen' for now (and this can use a separate patch which as usual > >>>> requires a bpf selftests)? > >>> > >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something > >>> seems like the way to go. It caused too much trouble already :-( > >>> > >>> Should I prepare a patch or do you want to take a stab at it? > >>> > >>>> In the future, does it make sense to have a specific cgroup-bpf-prog (a > >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the optval > >>>> such that it can enforce read-only for some optname and potentially also > >>>> track if bpf-prog has written a new optval? The bpf-prog can only return 1 > >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is still > >>>> holes but could be a seed of thought to continue polishing the idea. > >>> > >>> Ack, let's think about it. > >>> > >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea > >>> as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, > >>> and we have a mostly working bpf_getsockopt (after your refactoring), > >>> I don't see why we need to continue the current scheme of triggering > >>> after the kernel? > >> > >> Since a sleepable hook would cause some restrictions, perhaps, we could > >> introduce something like the promise pattern. In our case here, BPF > >> program call an async version of copy_from_user()/copy_to_user() to > >> return a promise. > > > > Having a promise might work. This is essentially what we already do > > with sockets/etc with acquire/release pattern. > > Would you mind to give me some context of the socket things? I'm mostly referring to the infra around KF_ACQUIRE/KF_RELEASE where the verifier already has some resource tracking functionality. We can probably extend it to verify that the program does copy_to_user equivalent at the end of the run (or somehow specifically marks that it's not needed). > > > > What are the sleepable restrictions you're hinting about? I feel like > > with the sleepable bpf, we can also remove all the temporary buffer > > management / extra copies which sounds like a win to me. (we have this > > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can > > allocate temporary buffers if needed.. > > > >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > >>>>> gets called whenever bpf returns an error to make sure protocols have > >>>>> a chance to handle that condition (and free the fd) > >>>>> > >>>> > >>>>
On Tue, Apr 25, 2023 at 2:11 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 4/25/23 11:42, Stanislav Fomichev wrote: > > On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@gmail.com> wrote: > >> > >> > >> > >> On 4/18/23 09:47, Stanislav Fomichev wrote: > >>> On 04/17, Martin KaFai Lau wrote: > >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > >>>>> On 04/13, Stanislav Fomichev wrote: > >>>>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: > >>>>>>> > >>>>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote: > >>>>>>>> > >>>>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > >>>>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote: > >>>>>>>>> > >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which > >>>>>>>>> install FDs into the process fdtable. > >>>>>>>>> > >>>>>>>>> After some offlist discussion it was proposed to add a blacklist of > >>>>>>>> > >>>>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. > >>>>>>> > >>>>>>> Hi Eric, > >>>>>>> > >>>>>>> Oh, I'm sorry about that. :( Sure. > >>>>>>> > >>>>>>>> > >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ? > >>>>>>> > >>>>>>> Sure, I will add next time. > >>>>>>> > >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > >>>>>>> > >>>>>>> I think it's better to add Stanislav Fomichev to CC. > >>>>>> > >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case as > >>>>>> well? > >>>>> > >>>>> Jakub reminded me of the other things I wanted to ask here bug forgot: > >>>>> > >>>>> - setsockopt is probably not needed, right? setsockopt hook triggers > >>>>> before the kernel and shouldn't leak anything > >>>>> - for getsockopt, instead of bypassing bpf completely, should we instead > >>>>> ignore the error from the bpf program? that would still preserve > >>>>> the observability aspect > >>>> > >>>> stealing this thread to discuss the optlen issue which may make sense to > >>>> bypass also. > >>>> > >>>> There has been issue with optlen. Other than this older post related to > >>>> optlen > PAGE_SIZE: > >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/, > >>>> the recent one related to optlen that we have seen is > >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel > >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace > >>>> intention is to learn the expected optlen. This makes 'ctx.optlen > > >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning > >>>> -EFAULT to the userspace even the bpf prog has not changed anything. > >>> > >>> (ignoring -EFAULT issue) this seems like it needs to be > >>> > >>> if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { > >>> /* error */ > >>> } > >>> > >>> ? > >>> > >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen > > >>>> max_optlen' for now (and this can use a separate patch which as usual > >>>> requires a bpf selftests)? > >>> > >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something > >>> seems like the way to go. It caused too much trouble already :-( > >>> > >>> Should I prepare a patch or do you want to take a stab at it? > >>> > >>>> In the future, does it make sense to have a specific cgroup-bpf-prog (a > >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the optval > >>>> such that it can enforce read-only for some optname and potentially also > >>>> track if bpf-prog has written a new optval? The bpf-prog can only return 1 > >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is still > >>>> holes but could be a seed of thought to continue polishing the idea. > >>> > >>> Ack, let's think about it. > >>> > >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea > >>> as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, > >>> and we have a mostly working bpf_getsockopt (after your refactoring), > >>> I don't see why we need to continue the current scheme of triggering > >>> after the kernel? > >> > >> Since a sleepable hook would cause some restrictions, perhaps, we could > >> introduce something like the promise pattern. In our case here, BPF > >> program call an async version of copy_from_user()/copy_to_user() to > >> return a promise. > > > > Having a promise might work. This is essentially what we already do > > with sockets/etc with acquire/release pattern. > > > > What are the sleepable restrictions you're hinting about? I feel like > AFAIK, a sleepable program can use only some types of maps; for example, > array, hash, ringbuf,... etc. Other types of maps are unavailable to > sleepable programs; for example, sockmap, sockhash. Sure, but it doesn't have to stay that way. (hypothetically) If we think that sleepable makes sense, we can try to expand the scope. > > with the sleepable bpf, we can also remove all the temporary buffer > > management / extra copies which sounds like a win to me. (we have this > > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can > > allocate temporary buffers if needed.. > Agree! > > > > > >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > >>>>> gets called whenever bpf returns an error to make sure protocols have > >>>>> a chance to handle that condition (and free the fd) > >>>>> > >>>> > >>>>
diff --git a/net/socket.c b/net/socket.c index 73e493da4589..9c1ef11de23f 100644 --- a/net/socket.c +++ b/net/socket.c @@ -108,6 +108,8 @@ #include <linux/ptp_clock_kernel.h> #include <trace/events/sock.h> +#include <linux/sctp.h> + #ifdef CONFIG_NET_RX_BUSY_POLL unsigned int sysctl_net_busy_read __read_mostly; unsigned int sysctl_net_busy_poll __read_mostly; @@ -2227,6 +2229,36 @@ static bool sock_use_custom_sol_socket(const struct socket *sock) return test_bit(SOCK_CUSTOM_SOCKOPT, &sock->flags); } +#ifdef CONFIG_CGROUP_BPF +static bool sockopt_installs_fd(int level, int optname) +{ + /* + * These options do fd_install(), and if BPF_CGROUP_RUN_PROG_GETSOCKOPT + * hook returns an error after success of the original handler + * sctp_getsockopt(...), userspace will receive an error from getsockopt + * syscall and will be not aware that fd was successfully installed into fdtable. + * + * Let's prevent bpf cgroup hook from running on them. + */ + if (level == SOL_SCTP) { + switch (optname) { + case SCTP_SOCKOPT_PEELOFF: + case SCTP_SOCKOPT_PEELOFF_FLAGS: + return true; + default: + return false; + } + } + + return false; +} +#else /* CONFIG_CGROUP_BPF */ +static inline bool sockopt_installs_fd(int level, int optname) +{ + return false; +} +#endif /* CONFIG_CGROUP_BPF */ + /* * Set a socket option. Because we don't know the option lengths we have * to pass the user mode parameter for the protocols to sort out. @@ -2250,7 +2282,7 @@ int __sys_setsockopt(int fd, int level, int optname, char __user *user_optval, if (err) goto out_put; - if (!in_compat_syscall()) + if (!in_compat_syscall() && !sockopt_installs_fd(level, optname)) err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, &optname, user_optval, &optlen, &kernel_optval); @@ -2304,7 +2336,7 @@ int __sys_getsockopt(int fd, int level, int optname, char __user *optval, if (err) goto out_put; - if (!in_compat_syscall()) + if (!in_compat_syscall() && !sockopt_installs_fd(level, optname)) max_optlen = BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen); if (level == SOL_SOCKET) @@ -2315,7 +2347,7 @@ int __sys_getsockopt(int fd, int level, int optname, char __user *optval, err = sock->ops->getsockopt(sock, level, optname, optval, optlen); - if (!in_compat_syscall()) + if (!in_compat_syscall() && !sockopt_installs_fd(level, optname)) err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname, optval, optlen, max_optlen, err);