Message ID | 20231011065446.53034-1-cuiyunhui@bytedance.com |
---|---|
State | New |
Headers |
From: Yunhui Cui <cuiyunhui@bytedance.com> To: akpm@linux-foundation.org, keescook@chromium.org, brauner@kernel.org, jeffxu@google.com, frederic@kernel.org, mcgrof@kernel.org, cyphar@cyphar.com, cuiyunhui@bytedance.com, rongtao@cestc.cn, linux-kernel@vger.kernel.org Subject: [PATCH] pid_ns: support pidns switching between sibling Date: Wed, 11 Oct 2023 14:54:46 +0800 Message-Id: <20231011065446.53034-1-cuiyunhui@bytedance.com> X-Mailer: git-send-email 2.39.2 (Apple Git-143) MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
pid_ns: support pidns switching between sibling
|
|
Commit Message
yunhui cui
Oct. 11, 2023, 6:54 a.m. UTC
In the scenario of container acceleration, when a target pstree
is cloned from a temp pstree, we hope that the cloned process is
inherently in the target's pid namespace.
Examples of what we expected:

    /* switch to target ns first. */
    setns(target_ns, CLONE_NEWPID);
    if (!fork()) {
        /* Child */
        ...
    }
    /* switch back */
    setns(temp_ns, CLONE_NEWPID);

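However, the existing implementation does not allow this: pidns_install()
only accepts a target that is the caller's active pid namespace or one of
its descendants, which it verifies by walking the target's parent chain.
CAP_SYS_ADMIN is already checked in pidns_install(), so remove that
parent-traversal restriction to allow switching to a sibling pidns.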
Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
---
kernel/pid_namespace.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
Comments
On Wed, 11 Oct 2023 14:54:46 +0800 Yunhui Cui <cuiyunhui@bytedance.com> wrote:

> In the scenario of container acceleration, when a target pstree
> is cloned from a temp pstree, we hope that the cloned process is
> inherently in the target's pid namespace.
> Examples of what we expected:
>
>     /* switch to target ns first. */
>     setns(target_ns, CLONE_NEWPID);
>     if(!fork()) {
>         /* Child */
>         ...
>     }
>     /* switch back */
>     setns(temp_ns, CLONE_NEWPID);
>
> However, it is limited by the existing implementation, CAP_SYS_ADMIN
> has been checked in pidns_install(), so remove the limitation that only
> by traversing parent can switch pidns.
>

(cc Eric)

> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -389,7 +389,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
>  {
>      struct nsproxy *nsproxy = nsset->nsproxy;
>      struct pid_namespace *active = task_active_pid_ns(current);
> -    struct pid_namespace *ancestor, *new = to_pid_ns(ns);
> +    struct pid_namespace *new = to_pid_ns(ns);
>
>      if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
>          !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
> @@ -406,12 +406,6 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
>      if (new->level < active->level)
>          return -EINVAL;
>
> -    ancestor = new;
> -    while (ancestor->level > active->level)
> -        ancestor = ancestor->parent;
> -    if (ancestor != active)
> -        return -EINVAL;
> -
>      put_pid_ns(nsproxy->pid_ns_for_children);
>      nsproxy->pid_ns_for_children = get_pid_ns(new);
>      return 0;
> --
> 2.20.1
Yunhui Cui <cuiyunhui@bytedance.com> writes:

> In the scenario of container acceleration,

What is container acceleration?

Are you perhaps performing what is essentially checkpoint/restart
from one set of processes to a new set of processes so you can
get a container starting faster?

> when a target pstree is cloned from a temp pstree, we hope that the
> cloned process is inherently in the target's pid namespace.

I am having a hard time figuring out what you are saying here.

> Examples of what we expected:
>
>     /* switch to target ns first. */
>     setns(target_ns, CLONE_NEWPID);
      ^-------- Is this the line that fails for you?
>     if(!fork()) {
>         /* Child */
>         ...
>     }
>     /* switch back */
>     setns(temp_ns, CLONE_NEWPID);

Assuming that the "switch back" means returning to your
task_active_pid_ns that should always work.

If I had to guess I think what you are missing is that entire pid
namespaces can be inside other pid namespaces.

So there is no reason to believe that any random pid namespace
that happens to pass the CAP_SYS_ADMIN permission check is also in
your process's task_active_pid_ns.

> However, it is limited by the existing implementation, CAP_SYS_ADMIN
> has been checked in pidns_install(), so remove the limitation that only
> by traversing parent can switch pidns.

The check you are deleting is what verifies the pid namespace you are
attempting to change pid_ns_for_children to is a member of the task's
current pid namespace (aka task_active_pid_ns).

There is a perfectly good comment describing why what you are attempting
to do is unsupportable.

    /*
     * Only allow entering the current active pid namespace
     * or a child of the current active pid namespace.
     *
     * This is required for fork to return a usable pid value and
     * this maintains the property that processes and their
     * children can not escape their current pid namespace.
     */

If you pick a pid namespace that does not meet the restrictions you are
removing, the pid of the new child can not be mapped into the pid
namespace of the parent that called setns.

AKA the following code can not work.

    pid = fork();
    if (!pid) {
        /* child */
        do_something();
        _exit(0);
    }
    waitpid(pid);

So no. The suggested change to pidns_install makes no sense at all.

The whole not being able to escape your current pid namespace is
also an important invariant when reasoning about pid namespaces.

It would have to be a strong well thought out case for me to agree
it makes sense to abandon the invariant that a process can not escape
its pid namespace.

> Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
> ---
>  kernel/pid_namespace.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 3028b2218aa4..774db1f268f1 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -389,7 +389,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
>  {
>      struct nsproxy *nsproxy = nsset->nsproxy;
>      struct pid_namespace *active = task_active_pid_ns(current);
> -    struct pid_namespace *ancestor, *new = to_pid_ns(ns);
> +    struct pid_namespace *new = to_pid_ns(ns);
>
>      if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
>          !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
> @@ -406,12 +406,6 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
>      if (new->level < active->level)
>          return -EINVAL;
>
> -    ancestor = new;
> -    while (ancestor->level > active->level)
> -        ancestor = ancestor->parent;
> -    if (ancestor != active)
> -        return -EINVAL;
> -
>      put_pid_ns(nsproxy->pid_ns_for_children);
>      nsproxy->pid_ns_for_children = get_pid_ns(new);
>      return 0;

Eric
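
The ancestor walk that the patch deletes can be tried out in userspace. The following is only a minimal sketch: `struct toy_pid_ns` and `toy_may_install()` are hypothetical stand-ins invented for illustration, keeping just the `level` and `parent` fields that the check in pidns_install() reads. It is not the kernel code, and it leaves out the CAP_SYS_ADMIN checks entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pid_namespace: only the two fields
 * that the ancestor check reads. */
struct toy_pid_ns {
	unsigned int level;         /* 0 for the initial namespace */
	struct toy_pid_ns *parent;  /* NULL for the initial namespace */
};

/*
 * Mirrors the logic removed by the patch: 'new' is acceptable only if it
 * is 'active' itself or a descendant of 'active'.
 */
static bool toy_may_install(const struct toy_pid_ns *active,
			    const struct toy_pid_ns *new)
{
	const struct toy_pid_ns *ancestor;

	assert(active != NULL && new != NULL);

	/* A namespace above the active one can never be a descendant. */
	if (new->level < active->level)
		return false;

	/* Climb from 'new' until we reach the level of 'active'. */
	ancestor = new;
	while (ancestor->level > active->level)
		ancestor = ancestor->parent;

	/* Same level but a different namespace means sibling or cousin. */
	return ancestor == active;
}
```

With an initial namespace at level 0 and two level-1 namespaces `temp` and `target` below it, `toy_may_install(&temp, &target)` is false: the walk stops immediately because both are at the same level, and the pointers differ. That is the sibling case the patch tries to permit.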
Hi Eric,

On Thu, Oct 12, 2023 at 11:31 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Yunhui Cui <cuiyunhui@bytedance.com> writes:
>
> > In the scenario of container acceleration,
>
> What is container acceleration?
>
> Are you perhaps performing what is essentially checkpoint/restart
> from one set of processes to a new set of processes so you can
> get a container starting faster?

Yeah, you are right.

>
> > when a target pstree is cloned from a temp pstree, we hope that the
> > cloned process is inherently in the target's pid namespace.
>
> I am having a hard time figuring out what you are saying here.

I think I need to describe in detail our needs and problems we face.
What we need to do is fork a container into a new container, which
means that all processes of the original container need to be forked
out and added to the new container.
Then the forked process needs to be added to the namespace and cgroup
of the new container.

What we are talking about here is the pid namespace.

For example:
Assume that there are three processes A, B, and C in the original container.
What we need to do is A fork A_new, B fork B_new, C fork C_new.

However, in the existing pidns implementation, the parent process
first joins pidns, and then the forked child process will get the new
pidns (the pid of the child process is what we expected), and the
parent process's own pidns has not actually changed (at least pid is
still existing).

To make A_new, B_new, and C_new inherently in the pidns of the new container,
A, B, and C must first switch to the pidns of the new container, right?
From my understanding there is no better way to implement it.

But the existing implementation (the part to be changed in this patch)
is blocking our progress.

>
> > Examples of what we expected:
> >
> >     /* switch to target ns first. */
> >     setns(target_ns, CLONE_NEWPID);
>       ^-------- Is this the line that fails for you?
>
> >     if(!fork()) {
> >         /* Child */
> >         ...
> >     }
> >     /* switch back */
> >     setns(temp_ns, CLONE_NEWPID);
>
> Assuming that the "switch back" means returning to your
> task_active_pid_ns that should always work.

In the scenario I described, "switch back" would certainly work.

    dst_pidns = open("/proc/%d/ns/pid");
    src_pidns = open("/proc/self/ns/pid");
    setns(dst_pidns, CLONE_NEWPID);
    if(!fork()) {
        /* Child */
        /* The child process is born in the pidns of the new container. */
        ...
    }
    /* switch back */
    setns(src_pidns, CLONE_NEWPID);

>
> If I had to guess I think what you are missing is that entire pid
> namespaces can be inside other pid namespaces.
>
> So there is no reason to believe that any random pid namespace
> that happens to pass the CAP_SYS_ADMIN permission check is also in
> your processes task_active_pid_ns.
>
>
> > However, it is limited by the existing implementation, CAP_SYS_ADMIN
> > has been checked in pidns_install(), so remove the limitation that only
> > by traversing parent can switch pidns.
>
> The check you are deleting is what verifies the pid namespaces you are
> attempting to change pid_ns_for_children to is a member of the tasks
> current pid namespace (aka task_active_pid_ns).
>
>
> There is a perfectly good comment describing why what you are attempting
> to do is unsupportable.
>
>     /*
>      * Only allow entering the current active pid namespace
>      * or a child of the current active pid namespace.
>      *
>      * This is required for fork to return a usable pid value and
>      * this maintains the property that processes and their
>      * children can not escape their current pid namespace.
>      */
>
>
> If you pick a pid namespace that does not meet the restrictions you are
> removing the pid of the new child can not be mapped into the pid
> namespace of the parent that called setns.
>
> AKA the following code can not work.
>
>     pid = fork();
>     if (!pid) {
>         /* child */
>         do_something();
>         _exit(0);
>     }
>     waitpid(pid);

Sorry, I don't understand what you mean here.

>
>
> So no. The suggested change to pidns_install makes no sense at all.
>
> The whole not being able to escape your current pid namespace is
> also an important invariant when reasoning about pid namespaces.
>
> It would have to be a strong well thought out case for me to agree
> it makes sense to abandon the invariant that a process can not escape
> its pid namespace.

I think we'd better have a good understanding of the problems we face
first, and then think of a more comprehensive way to solve it.
Although the modification in this patch is not perfect, do we have a
better way?

Thanks,
Yunhui
Hi Eric,

On Fri, Oct 13, 2023 at 10:44 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Hi Eric,
>
> On Thu, Oct 12, 2023 at 11:31 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Yunhui Cui <cuiyunhui@bytedance.com> writes:
> >
> > > In the scenario of container acceleration,
> >
> > What is container acceleration?
> >
> > Are you perhaps performing what is essentially checkpoint/restart
> > from one set of processes to a new set of processes so you can
> > get a container starting faster?
> Yeah, you are right.
>
> >
> > > when a target pstree is cloned from a temp pstree, we hope that the
> > > cloned process is inherently in the target's pid namespace.
> >
> > I am having a hard time figuring out what you are saying here.
>
> I think I need to describe in detail our needs and problems we face.
> What we need to do is fork a container into a new container, which
> means that all processes of the original container need to be forked
> out and added to the new container.
> Then the forked process needs to be added to the namespace and cgroup
> of the new container.
>
> What we are talking about here is the pid namespace.
>
> for example:
> Assume that there are three processes A, B, and C in the original container.
> What we need to do is A fork A_new, B fork B_new, C fork C_new.
>
> However, in the existing pidns implementation, the parent process
> first joins pidns, and then the forked child process will get the new
> pidns (the pid of the child process is what we expected), and the
> parent process's own pidns has not actually changed (at least pid is
> still existing).
>
> To make A_new, B_new, and C_new inherently in the pidns of the new container,
> A, B, and C must first switch to the pidns of the new container, right?
> From my understanding there is no better way to implement it.
>
> But the existing implementation (the part to be changed in this patch)
> is blocking our progress.
>
> >
> > > Examples of what we expected:
> > >
> > >     /* switch to target ns first. */
> > >     setns(target_ns, CLONE_NEWPID);
> >       ^-------- Is this the line that fails for you?

Yes, it failed here.

Thanks,
Yunhui
yunhui cui <cuiyunhui@bytedance.com> writes:

> Hi Eric,
>
> On Thu, Oct 12, 2023 at 11:31 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> The check you are deleting is what verifies the pid namespaces you are
>> attempting to change pid_ns_for_children to is a member of the tasks
>> current pid namespace (aka task_active_pid_ns).
>>
>>
>> There is a perfectly good comment describing why what you are attempting
>> to do is unsupportable.
>>
>>     /*
>>      * Only allow entering the current active pid namespace
>>      * or a child of the current active pid namespace.
>>      *
>>      * This is required for fork to return a usable pid value and
>>      * this maintains the property that processes and their
>>      * children can not escape their current pid namespace.
>>      */
>>
>>
>> If you pick a pid namespace that does not meet the restrictions you are
>> removing the pid of the new child can not be mapped into the pid
>> namespace of the parent that called setns.
>>
>> AKA the following code can not work.
>>
>>     pid = fork();
>>     if (!pid) {
>>         /* child */
>>         do_something();
>>         _exit(0);
>>     }
>>     waitpid(pid);
>
> Sorry, I don't understand what you mean here.

What I mean is that if your simple patch was adopted,
then the classic way of controlling a fork would fail.

    pid = fork();
          ^--------------- Would return 0 for both parent and child
          ^--------------- Look at pid_nr_ns to understand.
    if (!pid) {
        /* child */
        do_something();
        _exit(0);
    }
    waitpid(pid);

For your use case there are more serious problems as well. The entire
process hierarchy built would be incorrect. Which means children
signaling parents when they exit would be incorrect, and that parents
would not be able to wait on their children.

I do understand the desire to want to cow the memory space of all of the
processes. That can potentially save a lot of resources.

In other checkpoint/restart scenarios people have been using userfaultfd
to get a similar benefit. I suggest you look at the CRIU project.

Eric
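
The pointer to pid_nr_ns can be made concrete with a small userspace model. This is a sketch under stated assumptions: the `toy_*` types and `TOY_MAX_LEVEL` are hypothetical simplifications made up for illustration, and `toy_pid_nr_ns()` follows the shape of the kernel helper as the author of this note understands it (one number per namespace level, returned only if the namespace at that level matches). Check kernel/pid.c for the authoritative version.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4  /* arbitrary bound for this sketch */

/* Hypothetical stand-ins for struct pid_namespace, struct upid, struct pid. */
struct toy_pid_ns {
	unsigned int level;
	struct toy_pid_ns *parent;
};

struct toy_upid {
	int nr;                        /* pid number as seen in 'ns' */
	const struct toy_pid_ns *ns;
};

struct toy_pid {
	unsigned int level;            /* level of the deepest namespace */
	struct toy_upid numbers[TOY_MAX_LEVEL]; /* one entry per level */
};

/*
 * Returns the number 'pid' has in 'ns', or 0 if 'pid' is not visible
 * there. A sibling namespace sits at a valid level but holds a different
 * pointer, so the lookup falls through to 0.
 */
static int toy_pid_nr_ns(const struct toy_pid *pid,
			 const struct toy_pid_ns *ns)
{
	int nr = 0;

	assert(ns != NULL && ns->level < TOY_MAX_LEVEL);

	if (pid != NULL && ns->level <= pid->level) {
		const struct toy_upid *upid = &pid->numbers[ns->level];

		if (upid->ns == ns)
			nr = upid->nr;
	}
	return nr;
}
```

Take a child created in `target` (level 1) by a parent whose active namespace is the sibling `temp` (also level 1). The child has a number in the initial namespace and in `target`, but the lookup for `temp` yields 0, which is the value fork() would hand back to the parent.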
Hi Eric,

On Fri, Oct 13, 2023 at 9:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> yunhui cui <cuiyunhui@bytedance.com> writes:
>
> > Hi Eric,
> >
> > On Thu, Oct 12, 2023 at 11:31 AM Eric W. Biederman
> > <ebiederm@xmission.com> wrote:
> >>
> >> The check you are deleting is what verifies the pid namespaces you are
> >> attempting to change pid_ns_for_children to is a member of the tasks
> >> current pid namespace (aka task_active_pid_ns).
> >>
> >>
> >> There is a perfectly good comment describing why what you are attempting
> >> to do is unsupportable.
> >>
> >>     /*
> >>      * Only allow entering the current active pid namespace
> >>      * or a child of the current active pid namespace.
> >>      *
> >>      * This is required for fork to return a usable pid value and
> >>      * this maintains the property that processes and their
> >>      * children can not escape their current pid namespace.
> >>      */
> >>
> >>
> >> If you pick a pid namespace that does not meet the restrictions you are
> >> removing the pid of the new child can not be mapped into the pid
> >> namespace of the parent that called setns.
> >>
> >> AKA the following code can not work.
> >>
> >>     pid = fork();
> >>     if (!pid) {
> >>         /* child */
> >>         do_something();
> >>         _exit(0);
> >>     }
> >>     waitpid(pid);
> >
> > Sorry, I don't understand what you mean here.
>
> What I mean is that if your simple patch was adopted,
> then the classic way of controlling a fork would fail.
>
>     pid = fork();
>           ^--------------- Would return 0 for both parent and child
>           ^--------------- Look at pid_nr_ns to understand.
>     if (!pid) {
>         /* child */
>         do_something();
>         _exit(0);
>     }
>     waitpid(pid);

Okay, the reason here is that pid_nr_ns finds no pid for the child
process in the parent's current pidns, and returns 0.
Can this also support sibling traversal?
If so, it means that the process also has a pid in its sibling's pidns.

>
> For your use case there are more serious problems as well. The entire
> process hierarchy built would be incorrect. Which means children
> signaling parents when they exit would be incorrect, and that parents
> would not be able to wait on their children.

Therefore, support for sibling pidns must be added to the entire logic of pidns.
Do you have any plans to support this, or what are the good reasons
for not supporting it?

Thanks,
Yunhui
yunhui cui <cuiyunhui@bytedance.com> writes:

> Hi Eric,
>
> On Fri, Oct 13, 2023 at 9:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> yunhui cui <cuiyunhui@bytedance.com> writes:
>>
>> > Hi Eric,
>> >
>> > On Thu, Oct 12, 2023 at 11:31 AM Eric W. Biederman
>> > <ebiederm@xmission.com> wrote:
>> >>
>> >> The check you are deleting is what verifies the pid namespaces you are
>> >> attempting to change pid_ns_for_children to is a member of the tasks
>> >> current pid namespace (aka task_active_pid_ns).
>> >>
>> >>
>> >> There is a perfectly good comment describing why what you are attempting
>> >> to do is unsupportable.
>> >>
>> >>     /*
>> >>      * Only allow entering the current active pid namespace
>> >>      * or a child of the current active pid namespace.
>> >>      *
>> >>      * This is required for fork to return a usable pid value and
>> >>      * this maintains the property that processes and their
>> >>      * children can not escape their current pid namespace.
>> >>      */
>> >>
>> >>
>> >> If you pick a pid namespace that does not meet the restrictions you are
>> >> removing the pid of the new child can not be mapped into the pid
>> >> namespace of the parent that called setns.
>> >>
>> >> AKA the following code can not work.
>> >>
>> >>     pid = fork();
>> >>     if (!pid) {
>> >>         /* child */
>> >>         do_something();
>> >>         _exit(0);
>> >>     }
>> >>     waitpid(pid);
>> >
>> > Sorry, I don't understand what you mean here.
>>
>> What I mean is that if your simple patch was adopted,
>> then the classic way of controlling a fork would fail.
>>
>>     pid = fork();
>>           ^--------------- Would return 0 for both parent and child
>>           ^--------------- Look at pid_nr_ns to understand.
>>     if (!pid) {
>>         /* child */
>>         do_something();
>>         _exit(0);
>>     }
>>     waitpid(pid);
>
> Okay, the reason here is that pid_nr_ns finds no pid for the child
> process in the parent's current pidns, and returns 0.
> Can this also support sibling traversal?

Not without a complete redesign.

> If so, it means that the process also has a pid in its sibling's pidns.

>> For your use case there are more serious problems as well. The entire
>> process hierarchy built would be incorrect. Which means children
>> signaling parents when they exit would be incorrect, and that parents
>> would not be able to wait on their children.
>
> Therefore, support for sibling pidns must be added to the entire logic of pidns.
> Do you have any plans to support this,

No plans to support it.

> or what are the good reasons for not supporting it?

I see no point, it is a lot of work, and your container acceleration
still won't work, because you are forking from your original processes
instead of properly building the process hierarchy.

If a pair of your original processes are doing:

    pid = fork();
    if (!pid) {
        /* child */  <-------------------------- clone created here
        do_something();
        _exit(0);
    }
    <---------------------------------- clone created here
    waitpid(pid);

Their clones won't work. Not because the pids aren't the same, but
because the clones are not parent and child. Which causes waitpid
not to see the other process.

I believe you want to do this sibling pid_ns fork so that you can have
copy-on-write of the anonymous pages of the original process. Which is
a completely reasonable thing to want.

For performing copy-on-write between machines we have userfaultfd.
For simply reading the pages we have process_vm_readv.

I think what you want is essentially process_vm_cow_map. Unfortunately
no one has built that yet.

Maybe memfd is a better model to start from? Something where you pause
process a, set up the cow in process a, and place the pages in process b.
With the final result that either process a or process b writing to
the page will cause the copy on write to happen and the page to be
unshared.

I really think you need something that will decouple the copy-on-write
mechanism of fork from the rest of fork, so you can build a proper
process hierarchy.

Eric
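
The remark that waitpid does not see a process that is not the caller's child can be checked from userspace without any namespaces. A minimal sketch, assuming a POSIX system with default SIGCHLD handling; `try_wait()` and `spawn_child()` are hypothetical helper names chosen for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Waits for one specific process. Returns 0 if it was reaped, otherwise
 * the errno reported by waitpid() (ECHILD when 'pid' is not our child).
 */
static int try_wait(pid_t pid)
{
	int status;
	pid_t ret;

	assert(pid > 0);  /* this sketch only waits for a specific pid */

	do {
		ret = waitpid(pid, &status, 0);
	} while (ret == -1 && errno == EINTR);

	return ret == pid ? 0 : errno;
}

/* Forks a child that exits immediately; returns its pid to the parent. */
static pid_t spawn_child(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(0);  /* child */
	return pid;        /* parent: child pid, or -1 on failure */
}
```

Calling `try_wait()` on the pid returned by `spawn_child()` succeeds, while calling it on any pid that is not a child of the caller, for example the caller's own pid, fails with ECHILD. Two clones that were each forked from a different original process end up in that second situation with respect to each other.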
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 3028b2218aa4..774db1f268f1 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -389,7 +389,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 {
     struct nsproxy *nsproxy = nsset->nsproxy;
     struct pid_namespace *active = task_active_pid_ns(current);
-    struct pid_namespace *ancestor, *new = to_pid_ns(ns);
+    struct pid_namespace *new = to_pid_ns(ns);

     if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
         !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
@@ -406,12 +406,6 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
     if (new->level < active->level)
         return -EINVAL;

-    ancestor = new;
-    while (ancestor->level > active->level)
-        ancestor = ancestor->parent;
-    if (ancestor != active)
-        return -EINVAL;
-
     put_pid_ns(nsproxy->pid_ns_for_children);
     nsproxy->pid_ns_for_children = get_pid_ns(new);
     return 0;