Message ID | 20230301193403.1507484-1-surenb@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp3832406wrd; Wed, 1 Mar 2023 11:48:11 -0800 (PST) X-Google-Smtp-Source: AK7set86ewWXRA5lAfqmUaEAQRFPHokUONHSHBlUt5EbP2QxUY7F1FEOEycNuX/jJnWNhHt9qOn6 X-Received: by 2002:a17:903:27d0:b0:19e:25b4:7752 with SMTP id km16-20020a17090327d000b0019e25b47752mr6258335plb.24.1677700091501; Wed, 01 Mar 2023 11:48:11 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1677700091; cv=none; d=google.com; s=arc-20160816; b=f7qTa4KXNdRMY+TMJxhucx7VOzrum99ZBnCYi0Xtytk2d+kMDrY+iE81wioyzXKYtb M/90JEd8QR7azy3fPy9Mxsx8VXU6vAYa3E8H7NZIOdCY7SaK9gp5RpsvEz+dbFJ+6y5T 3KQEieD3C7n4wtKlOpZQ18Sf3fDjTIfdiZuaLw/uNgeNLYO7QSeSHup2Z++oB+Ykhme/ x2n6JDELzD1UQBJckGEg8fdjqvuqxhylacj5FAK16EHljuJJxRuSFotOW2mI55dmGFIi m8VCAaO2IzBenjGQrCNbmB9iobzHlQZ4dYBgMctvVVy/ojyhpu2HZg2P9z3Fm6hsQdQJ wMJQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:mime-version:date :dkim-signature; bh=VW1a/ztFW3oOtIrI9ep2g5ojznwZvfd6phBOKuSBaFU=; b=msNnOYO9kdtFpxVMPU9O8rMXX4C8EB3DARfpGV1cC7dMJBMZhL8f8pONsFciiymaFV kSqt8lUtUXaEBO0U4N596is1vyvMDO6vZHemeMy52MuYLS7ws+xUIYjHLx04tdy3rKCF uyNzZVxF+uZu5gBtP13W+fqFTIt2+RgYQAtTL+M/qYnWfTK45sQgE0FdYj2n66bKLZ+a OS9eewK1xXsKACPd9a8NtN0PayuRd3Oydrt/3FCyy8oxfFYhJKsl0enhmoRQqD1Kg4wR sioXUKttgqPp4QZ6wZOpdhgYrh683tLkSgwA3yjJqfw7FGSMHFLFMOQok4PkepECw6QO qHhg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Ll26EZNi; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c5-20020a6566c5000000b005030006a2desi13247243pgw.182.2023.03.01.11.47.58; Wed, 01 Mar 2023 11:48:11 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Ll26EZNi; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229612AbjCATeL (ORCPT <rfc822;david.simonyants@gmail.com> + 99 others); Wed, 1 Mar 2023 14:34:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38972 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229451AbjCATeK (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 1 Mar 2023 14:34:10 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8BF8220D2F for <linux-kernel@vger.kernel.org>; Wed, 1 Mar 2023 11:34:08 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id d7-20020a25adc7000000b00953ffdfbe1aso1437749ybe.23 for <linux-kernel@vger.kernel.org>; Wed, 01 Mar 2023 11:34:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1677699248; h=cc:to:from:subject:message-id:mime-version:date:from:to:cc:subject :date:message-id:reply-to; bh=VW1a/ztFW3oOtIrI9ep2g5ojznwZvfd6phBOKuSBaFU=; b=Ll26EZNi/jeVAl0OTcHUVwbVkpV9f3gHXLoSYmO1V8V6ZnM8Xix86qmIbRQCiW/uqN JubSJxWHW/nuhVWBP9+ve54r0YG4l+9yACZF9yiPrUuykGvrcanYHwntjSiUthPfmQD6 oUEN2puY+bilBZaqeWPtoysFJZ2cfz8Hoo//99yYDxceQFxOMz+zghx16LhfRoF9ydvr cwBT2ktXIetOsJsU0cr2FxMs6cnP6ugcOTzLRtSxklIHuLgx6Y3cktvHmeTRKzNOXCPb +QGiVN0Bd2LEPJP+3h7pQoEg54t6GPrvyGV5w0HvWufE1ooG0JHvDEVXj29IrNkl7XiY ePuQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1677699248; h=cc:to:from:subject:message-id:mime-version:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=VW1a/ztFW3oOtIrI9ep2g5ojznwZvfd6phBOKuSBaFU=; b=4yyC0eUZgvbSk17JcTVDD4vxBVgqas8etGMMA4Yx9Dmz7pyE3pW6pGZTaIwoDKHG16 bLZF4+VQU+X2j8lTCNF79CIRXsg+Pq3KocXfBuiMOars9ePrJoidPcDV13J8c4q2nrB7 Wpiz0l+7ojRg6emBGb5YSy9+lV+fMNOGIkobugV5iTYeCEMHoIBycWjDYkYACXF4zs6X Qwfi0MoZ5pC63IjtlrHnLEqzfeai+1Na+dLbzUz0o3NrXsMb7tHU6T0z3u7wjebRX+jI p6zs1WANjAa5Bqkv2INQEgRQcqDyvisGwQfEWmJ/46RSP1PEq8eYpMPT5YcctEFFGmd7 3WKQ== X-Gm-Message-State: AO0yUKVWCUnGfHbwoGjkPy3vAz3qA/mtj2vWTzhM7YeVnes6RUdDslz/ 3GaTUS4xsCOKTUa5AZI+ImW9vWsZGAg= X-Received: from surenb-desktop.mtv.corp.google.com ([2620:15c:211:200:3c40:eeb3:7c3a:807e]) (user=surenb job=sendgmr) by 2002:a81:af08:0:b0:536:5557:33a8 with SMTP id n8-20020a81af08000000b00536555733a8mr4702461ywh.9.1677699247788; Wed, 01 Mar 2023 11:34:07 -0800 (PST) Date: Wed, 1 Mar 2023 11:34:03 -0800 Mime-Version: 1.0 X-Mailer: git-send-email 2.40.0.rc0.216.gc4246ad0f0-goog Message-ID: <20230301193403.1507484-1-surenb@google.com> Subject: [PATCH 1/1] psi: remove 500ms min window size limitation for triggers From: Suren Baghdasaryan <surenb@google.com> To: tj@kernel.org Cc: hannes@cmpxchg.org, lizefan.x@bytedance.com, peterz@infradead.org, johunt@akamai.com, mhocko@suse.com, keescook@chromium.org, quic_sudaraja@quicinc.com, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, surenb@google.com Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1759196051259533814?= X-GMAIL-MSGID: =?utf-8?q?1759196051259533814?= |
Series |
[1/1] psi: remove 500ms min window size limitation for triggers
|
|
Commit Message
Suren Baghdasaryan
March 1, 2023, 7:34 p.m. UTC
Current 500ms min window size for psi triggers limits polling interval
to 50ms to prevent polling threads from using too much cpu bandwidth by
polling too frequently. However the number of cgroups with triggers is
unlimited, so this protection can be defeated by creating multiple
cgroups with psi triggers (triggers in each cgroup are served by a single
"psimon" kernel thread).
Instead of limiting min polling period, which also limits the latency of
psi events, it's better to limit psi trigger creation to authorized users
only, like we do for system-wide psi triggers (/proc/pressure/* files can
be written only by processes with CAP_SYS_RESOURCE capability). This also
makes access rules for cgroup psi files consistent with system-wide ones.
Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and
remove the psi window min size limitation.
Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
kernel/cgroup/cgroup.c | 10 ++++++++++
kernel/sched/psi.c | 4 +---
2 files changed, 11 insertions(+), 3 deletions(-)
Comments
On Wed, Mar 01, 2023 at 11:34:03AM -0800, Suren Baghdasaryan wrote: > Current 500ms min window size for psi triggers limits polling interval > to 50ms to prevent polling threads from using too much cpu bandwidth by > polling too frequently. However the number of cgroups with triggers is > unlimited, so this protection can be defeated by creating multiple > cgroups with psi triggers (triggers in each cgroup are served by a single > "psimon" kernel thread). > Instead of limiting min polling period, which also limits the latency of > psi events, it's better to limit psi trigger creation to authorized users > only, like we do for system-wide psi triggers (/proc/pressure/* files can > be written only by processes with CAP_SYS_RESOURCE capability). This also > makes access rules for cgroup psi files consistent with system-wide ones. > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > remove the psi window min size limitation. > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > --- > kernel/cgroup/cgroup.c | 10 ++++++++++ > kernel/sched/psi.c | 4 +--- > 2 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 935e8121b21e..b600a6baaeca 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, > return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); > } > > +static int cgroup_pressure_open(struct kernfs_open_file *of) > +{ > + return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ? > + -EPERM : 0; > +} I agree with the change, but it's a bit unfortunate that this check is duplicated between system and cgroup. What do you think about psi_trigger_create() taking the file and checking FMODE_WRITE and CAP_SYS_RESOURCE against file->f_cred?
On Wed, Mar 1, 2023 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Mar 01, 2023 at 11:34:03AM -0800, Suren Baghdasaryan wrote: > > Current 500ms min window size for psi triggers limits polling interval > > to 50ms to prevent polling threads from using too much cpu bandwidth by > > polling too frequently. However the number of cgroups with triggers is > > unlimited, so this protection can be defeated by creating multiple > > cgroups with psi triggers (triggers in each cgroup are served by a single > > "psimon" kernel thread). > > Instead of limiting min polling period, which also limits the latency of > > psi events, it's better to limit psi trigger creation to authorized users > > only, like we do for system-wide psi triggers (/proc/pressure/* files can > > be written only by processes with CAP_SYS_RESOURCE capability). This also > > makes access rules for cgroup psi files consistent with system-wide ones. > > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > > remove the psi window min size limitation. > > > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > --- > > kernel/cgroup/cgroup.c | 10 ++++++++++ > > kernel/sched/psi.c | 4 +--- > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > index 935e8121b21e..b600a6baaeca 100644 > > --- a/kernel/cgroup/cgroup.c > > +++ b/kernel/cgroup/cgroup.c > > @@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, > > return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); > > } > > > > +static int cgroup_pressure_open(struct kernfs_open_file *of) > > +{ > > + return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ? > > + -EPERM : 0; > > +} > > I agree with the change, but it's a bit unfortunate that this check is > duplicated between system and cgroup. > > What do you think about psi_trigger_create() taking the file and > checking FMODE_WRITE and CAP_SYS_RESOURCE against file->f_cred? That's definitely doable and we don't even need to pass file to psi_trigger_create() since it's called only when we write to the file. However by moving the capability check into psi_trigger_create() we also postpone the check until write() instead of failing early in open(). I always assumed failing early is preferable but if consolidating the code here makes more sense then I can make the switch. Please let me know if you still prefer to move the check.
On Wed, Mar 1, 2023 at 12:48 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Mar 1, 2023 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Mar 01, 2023 at 11:34:03AM -0800, Suren Baghdasaryan wrote: > > > Current 500ms min window size for psi triggers limits polling interval > > > to 50ms to prevent polling threads from using too much cpu bandwidth by > > > polling too frequently. However the number of cgroups with triggers is > > > unlimited, so this protection can be defeated by creating multiple > > > cgroups with psi triggers (triggers in each cgroup are served by a single > > > "psimon" kernel thread). > > > Instead of limiting min polling period, which also limits the latency of > > > psi events, it's better to limit psi trigger creation to authorized users > > > only, like we do for system-wide psi triggers (/proc/pressure/* files can > > > be written only by processes with CAP_SYS_RESOURCE capability). This also > > > makes access rules for cgroup psi files consistent with system-wide ones. > > > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > > > remove the psi window min size limitation. > > > > > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > > > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > --- > > > kernel/cgroup/cgroup.c | 10 ++++++++++ > > > kernel/sched/psi.c | 4 +--- > > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > > index 935e8121b21e..b600a6baaeca 100644 > > > --- a/kernel/cgroup/cgroup.c > > > +++ b/kernel/cgroup/cgroup.c > > > @@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, > > > return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); > > > } > > > > > > +static int cgroup_pressure_open(struct kernfs_open_file *of) > > > +{ > > > + return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ? > > > + -EPERM : 0; > > > +} > > > > I agree with the change, but it's a bit unfortunate that this check is > > duplicated between system and cgroup. > > > > What do you think about psi_trigger_create() taking the file and > > checking FMODE_WRITE and CAP_SYS_RESOURCE against file->f_cred? > > That's definitely doable and we don't even need to pass file to > psi_trigger_create() since it's called only when we write to the file. > However by moving the capability check into psi_trigger_create() we > also postpone the check until write() instead of failing early in > open(). I always assumed failing early is preferable but if > consolidating the code here makes more sense then I can make the > switch. Please let me know if you still prefer to move the check. > I always assumed the convention is to check the credentials on open() per Linus comment here: https://lore.kernel.org/r/CAHk-=whDkekE8n2LdPiKHeTdRnV--ys0V0nPZ76oPaE0fn-d+g@mail.gmail.com/#t --Kalesh
On Wed 01-03-23 11:34:03, Suren Baghdasaryan wrote: > Current 500ms min window size for psi triggers limits polling interval > to 50ms to prevent polling threads from using too much cpu bandwidth by > polling too frequently. However the number of cgroups with triggers is > unlimited, so this protection can be defeated by creating multiple > cgroups with psi triggers (triggers in each cgroup are served by a single > "psimon" kernel thread). > Instead of limiting min polling period, which also limits the latency of > psi events, it's better to limit psi trigger creation to authorized users > only, like we do for system-wide psi triggers (/proc/pressure/* files can > be written only by processes with CAP_SYS_RESOURCE capability). This also > makes access rules for cgroup psi files consistent with system-wide ones. > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > remove the psi window min size limitation. > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> with this to fix [...] > @@ -1278,8 +1277,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > if (state >= PSI_NONIDLE) > return ERR_PTR(-EINVAL); > > - if (window_us < WINDOW_MIN_US || > - window_us > WINDOW_MAX_US) > + if (window_us <= 0 || window_us > WINDOW_MAX_US) > return ERR_PTR(-EINVAL); window_us is u32 sp the check for <= 0 doesn't make any sense. > > /* Check threshold */ > -- > 2.40.0.rc0.216.gc4246ad0f0-goog
On Wed, Mar 01, 2023 at 12:48:38PM -0800, Suren Baghdasaryan wrote: > On Wed, Mar 1, 2023 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Mar 01, 2023 at 11:34:03AM -0800, Suren Baghdasaryan wrote: > > > Current 500ms min window size for psi triggers limits polling interval > > > to 50ms to prevent polling threads from using too much cpu bandwidth by > > > polling too frequently. However the number of cgroups with triggers is > > > unlimited, so this protection can be defeated by creating multiple > > > cgroups with psi triggers (triggers in each cgroup are served by a single > > > "psimon" kernel thread). > > > Instead of limiting min polling period, which also limits the latency of > > > psi events, it's better to limit psi trigger creation to authorized users > > > only, like we do for system-wide psi triggers (/proc/pressure/* files can > > > be written only by processes with CAP_SYS_RESOURCE capability). This also > > > makes access rules for cgroup psi files consistent with system-wide ones. > > > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > > > remove the psi window min size limitation. > > > > > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > > > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > --- > > > kernel/cgroup/cgroup.c | 10 ++++++++++ > > > kernel/sched/psi.c | 4 +--- > > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > > index 935e8121b21e..b600a6baaeca 100644 > > > --- a/kernel/cgroup/cgroup.c > > > +++ b/kernel/cgroup/cgroup.c > > > @@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, > > > return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); > > > } > > > > > > +static int cgroup_pressure_open(struct kernfs_open_file *of) > > > +{ > > > + return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ? > > > + -EPERM : 0; > > > +} > > > > I agree with the change, but it's a bit unfortunate that this check is > > duplicated between system and cgroup. > > > > What do you think about psi_trigger_create() taking the file and > > checking FMODE_WRITE and CAP_SYS_RESOURCE against file->f_cred? > > That's definitely doable and we don't even need to pass file to > psi_trigger_create() since it's called only when we write to the file. > However by moving the capability check into psi_trigger_create() we > also postpone the check until write() instead of failing early in > open(). I always assumed failing early is preferable but if > consolidating the code here makes more sense then I can make the > switch. Please let me know if you still prefer to move the check. Just for context, a person on our team is working on allowing unprivileged polls with windows that are multiples of 2s, which can be triggered from the regular aggregator threads. This should be useful for container delegation, and also for the desktop monitor app usecase that Chris Down brought up some time ago. At that point, everybody can open the file for write, and permissions are checked against the trigger parameters. So I don't think it's a big deal to check this particular permission at write time. But if you prefer we can also merge your patch as-is and do the refactor as part of the other series. Your call. In either case, please feel free to add Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Thu, Mar 2, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 01-03-23 11:34:03, Suren Baghdasaryan wrote: > > Current 500ms min window size for psi triggers limits polling interval > > to 50ms to prevent polling threads from using too much cpu bandwidth by > > polling too frequently. However the number of cgroups with triggers is > > unlimited, so this protection can be defeated by creating multiple > > cgroups with psi triggers (triggers in each cgroup are served by a single > > "psimon" kernel thread). > > Instead of limiting min polling period, which also limits the latency of > > psi events, it's better to limit psi trigger creation to authorized users > > only, like we do for system-wide psi triggers (/proc/pressure/* files can > > be written only by processes with CAP_SYS_RESOURCE capability). This also > > makes access rules for cgroup psi files consistent with system-wide ones. > > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > > remove the psi window min size limitation. > > > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > Acked-by: Michal Hocko <mhocko@suse.com> > > with this to fix > [...] > > @@ -1278,8 +1277,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > > if (state >= PSI_NONIDLE) > > return ERR_PTR(-EINVAL); > > > > - if (window_us < WINDOW_MIN_US || > > - window_us > WINDOW_MAX_US) > > + if (window_us <= 0 || window_us > WINDOW_MAX_US) > > return ERR_PTR(-EINVAL); > > window_us is u32 sp the check for <= 0 doesn't make any sense. Completely missed that. Will change to == 0 and post the new version. window_us is later multiplied by NSEC_PER_USEC and then divided by UPDATES_PER_WINDOW(10), so even the smallest value of 1 still results in a positive poll_min_period. I think we should be fine with that check. Thanks! > > > > > /* Check threshold */ > > -- > > 2.40.0.rc0.216.gc4246ad0f0-goog > > -- > Michal Hocko > SUSE Labs
On Thu, Mar 2, 2023 at 7:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Mar 01, 2023 at 12:48:38PM -0800, Suren Baghdasaryan wrote: > > On Wed, Mar 1, 2023 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Wed, Mar 01, 2023 at 11:34:03AM -0800, Suren Baghdasaryan wrote: > > > > Current 500ms min window size for psi triggers limits polling interval > > > > to 50ms to prevent polling threads from using too much cpu bandwidth by > > > > polling too frequently. However the number of cgroups with triggers is > > > > unlimited, so this protection can be defeated by creating multiple > > > > cgroups with psi triggers (triggers in each cgroup are served by a single > > > > "psimon" kernel thread). > > > > Instead of limiting min polling period, which also limits the latency of > > > > psi events, it's better to limit psi trigger creation to authorized users > > > > only, like we do for system-wide psi triggers (/proc/pressure/* files can > > > > be written only by processes with CAP_SYS_RESOURCE capability). This also > > > > makes access rules for cgroup psi files consistent with system-wide ones. > > > > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > > > > remove the psi window min size limitation. > > > > > > > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > > > > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > > --- > > > > kernel/cgroup/cgroup.c | 10 ++++++++++ > > > > kernel/sched/psi.c | 4 +--- > > > > 2 files changed, 11 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > > > index 935e8121b21e..b600a6baaeca 100644 > > > > --- a/kernel/cgroup/cgroup.c > > > > +++ b/kernel/cgroup/cgroup.c > > > > @@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, > > > > return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); > > > > } > > > > > > > > +static int cgroup_pressure_open(struct kernfs_open_file *of) > > > > +{ > > > > + return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ? > > > > + -EPERM : 0; > > > > +} > > > > > > I agree with the change, but it's a bit unfortunate that this check is > > > duplicated between system and cgroup. > > > > > > What do you think about psi_trigger_create() taking the file and > > > checking FMODE_WRITE and CAP_SYS_RESOURCE against file->f_cred? > > > > That's definitely doable and we don't even need to pass file to > > psi_trigger_create() since it's called only when we write to the file. > > However by moving the capability check into psi_trigger_create() we > > also postpone the check until write() instead of failing early in > > open(). I always assumed failing early is preferable but if > > consolidating the code here makes more sense then I can make the > > switch. Please let me know if you still prefer to move the check. > > Just for context, a person on our team is working on allowing > unprivileged polls with windows that are multiples of 2s, which can be > triggered from the regular aggregator threads. This should be useful > for container delegation, and also for the desktop monitor app usecase > that Chris Down brought up some time ago. At that point, everybody can > open the file for write, and permissions are checked against the > trigger parameters. > > So I don't think it's a big deal to check this particular permission > at write time. But if you prefer we can also merge your patch as-is > and do the refactor as part of the other series. Let's roll this check without additional changes and then consolidate the checking inside psi_trigger_create() in a separate patch. If anybody objects to the late permission check we will just revert that last change without affecting anything else. > > Your call. In either case, please feel free to add > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> Thanks! Will post the final patch with Ack's later today. Originally it was purely cgroup-related change but now it's more of a PSI change. Therefore Peter's tree will probably be the right place for it.
On Thu 02-03-23 08:08:38, Suren Baghdasaryan wrote: > On Thu, Mar 2, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 01-03-23 11:34:03, Suren Baghdasaryan wrote: > > > Current 500ms min window size for psi triggers limits polling interval > > > to 50ms to prevent polling threads from using too much cpu bandwidth by > > > polling too frequently. However the number of cgroups with triggers is > > > unlimited, so this protection can be defeated by creating multiple > > > cgroups with psi triggers (triggers in each cgroup are served by a single > > > "psimon" kernel thread). > > > Instead of limiting min polling period, which also limits the latency of > > > psi events, it's better to limit psi trigger creation to authorized users > > > only, like we do for system-wide psi triggers (/proc/pressure/* files can > > > be written only by processes with CAP_SYS_RESOURCE capability). This also > > > makes access rules for cgroup psi files consistent with system-wide ones. > > > Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and > > > remove the psi window min size limitation. > > > > > > Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> > > > Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > > Acked-by: Michal Hocko <mhocko@suse.com> > > > > with this to fix > > [...] > > > @@ -1278,8 +1277,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > > > if (state >= PSI_NONIDLE) > > > return ERR_PTR(-EINVAL); > > > > > > - if (window_us < WINDOW_MIN_US || > > > - window_us > WINDOW_MAX_US) > > > + if (window_us <= 0 || window_us > WINDOW_MAX_US) > > > return ERR_PTR(-EINVAL); > > > > window_us is u32 sp the check for <= 0 doesn't make any sense. > > Completely missed that. Will change to == 0 and post the new version. I was about to asky check for 0 actually matters but maybe a future extension might give it a special meaning so it is likely a good idea to EINVAL on 0 now. You can keep my ack on that change
On Thu 02-03-23 08:13:54, Suren Baghdasaryan wrote: [...] > Let's roll this check without additional changes and then consolidate > the checking inside psi_trigger_create() in a separate patch. If > anybody objects to the late permission check we will just revert that > last change without affecting anything else. Permissions checks at write time are problematic because userspace cannot drop privileges. Also I think it would be an antipattern for how we do this in general.
On Thu, Mar 02, 2023 at 06:41:47PM +0100, Michal Hocko wrote: > On Thu 02-03-23 08:13:54, Suren Baghdasaryan wrote: > [...] > > Let's roll this check without additional changes and then consolidate > > the checking inside psi_trigger_create() in a separate patch. If > > anybody objects to the late permission check we will just revert that > > last change without affecting anything else. > > Permissions checks at write time are problematic because userspace > cannot drop privileges. Also I think it would be an antipattern for how > we do this in general. The permissions can be checked against opener privileges through file->f_cred. This allows dropping privileges, as well as passing the fd to a trusted but unprivileged process to delegate trigger setup. I agree with keeping it in open() for now. But it will matter when we distinguish between privileged and unprivileged trigger parameters.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 935e8121b21e..b600a6baaeca 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, return psi_trigger_poll(&ctx->psi.trigger, of->file, pt); } +static int cgroup_pressure_open(struct kernfs_open_file *of) +{ + return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ? + -EPERM : 0; +} + static void cgroup_pressure_release(struct kernfs_open_file *of) { struct cgroup_file_ctx *ctx = of->priv; @@ -5266,6 +5272,7 @@ static struct cftype cgroup_psi_files[] = { { .name = "io.pressure", .file_offset = offsetof(struct cgroup, psi_files[PSI_IO]), + .open = cgroup_pressure_open, .seq_show = cgroup_io_pressure_show, .write = cgroup_io_pressure_write, .poll = cgroup_pressure_poll, @@ -5274,6 +5281,7 @@ static struct cftype cgroup_psi_files[] = { { .name = "memory.pressure", .file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]), + .open = cgroup_pressure_open, .seq_show = cgroup_memory_pressure_show, .write = cgroup_memory_pressure_write, .poll = cgroup_pressure_poll, @@ -5282,6 +5290,7 @@ static struct cftype cgroup_psi_files[] = { { .name = "cpu.pressure", .file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]), + .open = cgroup_pressure_open, .seq_show = cgroup_cpu_pressure_show, .write = cgroup_cpu_pressure_write, .poll = cgroup_pressure_poll, @@ -5291,6 +5300,7 @@ static struct cftype cgroup_psi_files[] = { { .name = "irq.pressure", .file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]), + .open = cgroup_pressure_open, .seq_show = cgroup_irq_pressure_show, .write = cgroup_irq_pressure_write, .poll = cgroup_pressure_poll, diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 02e011cabe91..9c02b27052bb 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -160,7 +160,6 @@ __setup("psi=", setup_psi); #define EXP_300s 2034 /* 1/exp(2s/300s) */ /* PSI trigger definitions */ -#define WINDOW_MIN_US 500000 /* Min window size is 500ms */ #define WINDOW_MAX_US 10000000 /* Max window size is 10s */ #define UPDATES_PER_WINDOW 10 /* 10 updates per window */ @@ -1278,8 +1277,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, if (state >= PSI_NONIDLE) return ERR_PTR(-EINVAL); - if (window_us < WINDOW_MIN_US || - window_us > WINDOW_MAX_US) + if (window_us <= 0 || window_us > WINDOW_MAX_US) return ERR_PTR(-EINVAL); /* Check threshold */