Message ID | 20231030182013.40086-5-haitao.huang@linux.intel.com
State | New
Headers |
From: Haitao Huang <haitao.huang@linux.intel.com>
To: jarkko@kernel.org, dave.hansen@linux.intel.com, tj@kernel.org, mkoutny@suse.com, linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org, x86@kernel.org, cgroups@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, sohil.mehta@intel.com
Cc: zhiquan1.li@intel.com, kristen@linux.intel.com, seanjc@google.com, zhanb@microsoft.com, anakrish@microsoft.com, mikko.ylinen@linux.intel.com, yangjie@microsoft.com, Sean Christopherson <sean.j.christopherson@intel.com>, Haitao Huang <haitao.huang@linux.intel.com>
Subject: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality
Date: Mon, 30 Oct 2023 11:20:05 -0700
Message-Id: <20231030182013.40086-5-haitao.huang@linux.intel.com>
In-Reply-To: <20231030182013.40086-1-haitao.huang@linux.intel.com>
References: <20231030182013.40086-1-haitao.huang@linux.intel.com>
Series | Add Cgroup support for SGX EPC memory
Commit Message
Haitao Huang
Oct. 30, 2023, 6:20 p.m. UTC
From: Kristen Carlson Accardi <kristen@linux.intel.com>

Implement support for cgroup control of SGX Enclave Page Cache (EPC) memory using the misc cgroup controller. EPC memory is independent of normal system memory: it must be reserved at boot from RAM and cannot be converted between EPC and normal memory while the system is running. EPC is managed by the SGX subsystem and is not accounted by the memory controller.

Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques, and pages can be swapped out of the EPC to their backing store (normal system memory, e.g. shmem). The SGX EPC subsystem is analogous to the memory subsystem, and the SGX EPC controller is in turn analogous to the memory controller; it implements limit and protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit on EPC usage via the "sgx_epc" resource in "misc.max". The total EPC memory available on the system is reported via the "sgx_epc" resource in "misc.capacity".

This patch was modified from the previous version to only add the basic EPC cgroup structure: account allocations for cgroup usage (charge/uncharge), set up the misc cgroup callbacks, and set the total EPC capacity.

For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list and are only reclaimed by the global reclaimer when the total free page count falls below a threshold.

Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V6:
- Split the original large patch "Limit process EPC usage with misc cgroup controller" and restructure it (Kai)
---
 arch/x86/Kconfig                     |  13 ++++
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++++++++++
 arch/x86/kernel/cpu/sgx/main.c       |  28 ++++++++
 arch/x86/kernel/cpu/sgx/sgx.h        |   3 +
 6 files changed, 184 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
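To make the charge/uncharge contract concrete, here is a minimal sketch from the allocator's point of view, using the helpers this patch adds (their full definitions appear in the diff quoted in the review below); example_alloc_epc_page() and its error handling are illustrative, not part of the patch:

/*
 * Illustrative pairing of the EPC cgroup charge/uncharge helpers
 * added by this patch; a sketch, not a verbatim excerpt of the diff.
 */
static struct sgx_epc_page *example_alloc_epc_page(void)
{
	struct sgx_epc_cgroup *epc_cg;
	struct sgx_epc_page *page;

	/* Charge one page against the current task's "sgx_epc" limit. */
	epc_cg = sgx_epc_cgroup_try_charge();
	if (IS_ERR(epc_cg))
		return ERR_CAST(epc_cg);	/* misc.max hard limit hit */

	page = __sgx_alloc_epc_page();		/* raw EPC allocation */
	if (IS_ERR(page)) {
		sgx_epc_cgroup_uncharge(epc_cg);	/* give the charge back */
		return page;
	}

	/* Remember the cgroup; sgx_free_epc_page() uncharges it later. */
	page->epc_cg = epc_cg;
	return page;
}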
Comments
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote: > From: Kristen Carlson Accardi <kristen@linux.intel.com> > > Implement support for cgroup control of SGX Enclave Page Cache (EPC) > memory using the misc cgroup controller. EPC memory is independent > from normal system memory, e.g. must be reserved at boot from RAM and > cannot be converted between EPC and normal memory while the system is > running. EPC is managed by the SGX subsystem and is not accounted by > the memory controller. > > Much like normal system memory, EPC memory can be overcommitted via > virtual memory techniques and pages can be swapped out of the EPC to > their backing store (normal system memory, e.g. shmem). The SGX EPC > subsystem is analogous to the memory subsystem and the SGX EPC controller > is in turn analogous to the memory controller; it implements limit and > protection models for EPC memory. Nit: The above two paragraphs talk about what is EPC and EPC resource control needs to be done separately, etc, but IMHO it lacks some background about "why" EPC resource control is needed, e.g, from use case's perspective. > > The misc controller provides a mechanism to set a hard limit of EPC > usage via the "sgx_epc" resource in "misc.max". The total EPC memory > available on the system is reported via the "sgx_epc" resource in > "misc.capacity". Please separate what the current misc cgroup provides, and how this patch is going to utilize. Please describe the changes in imperative mood. E.g, "report total EPC memory via ...", instead of "... is reported via ...". > > This patch was modified from the previous version to only add basic EPC > cgroup structure, accounting allocations for cgroup usage > (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity. This isn't changelog material. Please focus on describing the high level design and why you chose such design. > > For now, the EPC cgroup simply blocks additional EPC allocation in > sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are > still tracked in the global active list, only reclaimed by the global > reclaimer when the total free page count is lower than a threshold. > > Later patches will reorganize the tracking and reclamation code in the > globale reclaimer and implement per-cgroup tracking and reclaiming. > > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > --- > V6: > - Split the original large patch"Limit process EPC usage with misc > cgroup controller" and restructure it (Kai) > --- > arch/x86/Kconfig | 13 ++++ > arch/x86/kernel/cpu/sgx/Makefile | 1 + > arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++ > arch/x86/kernel/cpu/sgx/epc_cgroup.h | 36 ++++++++++ > arch/x86/kernel/cpu/sgx/main.c | 28 ++++++++ > arch/x86/kernel/cpu/sgx/sgx.h | 3 + > 6 files changed, 184 insertions(+) > create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c > create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 66bfabae8814..e17c5dc3aea4 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1921,6 +1921,19 @@ config X86_SGX > > If unsure, say N. 
> > +config CGROUP_SGX_EPC > + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX" > + depends on X86_SGX && CGROUP_MISC > + help > + Provides control over the EPC footprint of tasks in a cgroup via > + the Miscellaneous cgroup controller. > + > + EPC is a subset of regular memory that is usable only by SGX > + enclaves and is very limited in quantity, e.g. less than 1% > + of total DRAM. > + > + Say N if unsure. > + > config X86_USER_SHADOW_STACK > bool "X86 userspace shadow stack" > depends on AS_WRUSS > diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile > index 9c1656779b2a..12901a488da7 100644 > --- a/arch/x86/kernel/cpu/sgx/Makefile > +++ b/arch/x86/kernel/cpu/sgx/Makefile > @@ -4,3 +4,4 @@ obj-y += \ > ioctl.o \ > main.o > obj-$(CONFIG_X86_SGX_KVM) += virt.o > +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o > diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > new file mode 100644 > index 000000000000..500627d0563f > --- /dev/null > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > @@ -0,0 +1,103 @@ > +// SPDX-License-Identifier: GPL-2.0 > +// Copyright(c) 2022 Intel Corporation. > + > +#include <linux/atomic.h> > +#include <linux/kernel.h> > +#include "epc_cgroup.h" > + > +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg) > +{ > + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv); > +} > + > +static inline bool sgx_epc_cgroup_disabled(void) > +{ > + return !cgroup_subsys_enabled(misc_cgrp_subsys); From below, the root EPC cgroup is dynamically allocated. Shouldn't it also check whether the root EPC cgroup is valid? > +} > + > +/** > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page > + * > + * Returns EPC cgroup or NULL on success, -errno on failure. > + */ > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) > +{ > + struct sgx_epc_cgroup *epc_cg; > + int ret; > + > + if (sgx_epc_cgroup_disabled()) > + return NULL; > + > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > + > + if (!ret) { > + /* No epc_cg returned, release ref from get_current_misc_cg() */ > + put_misc_cg(epc_cg->cg); > + return ERR_PTR(-ENOMEM); misc_cg_try_charge() returns 0 when successfully charged, no? > + } > + > + /* Ref released in sgx_epc_cgroup_uncharge() */ > + return epc_cg; > +} IMHO the above _try_charge() returning a pointer of EPC cgroup is a little bit odd, because it doesn't match the existing misc_cg_try_charge() which returns whether the charge is successful or not. sev_misc_cg_try_charge() matches misc_cg_try_charge() too. I think it's better to split "getting EPC cgroup" part out as a separate helper, and make this _try_charge() match existing pattern: struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) { if (sgx_epc_cgroup_disabled()) return NULL; return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); } int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) { if (!epc_cg) return -EINVAL; return misc_cg_try_charge(epc_cg->cg); } Having sgx_get_current_epc_cg() also makes the caller easier to read, because we can immediately know we are going to charge the *current* EPC cgroup, but not some cgroup hidden within sgx_epc_cgroup_try_charge(). 
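For reference, the same split with the three-argument misc_cg_try_charge() signature that the posted patch itself uses (the sketch above elides the arguments); this is still only a sketch of the suggestion, not final code:

static struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
{
	if (sgx_epc_cgroup_disabled())
		return NULL;

	/* Takes a reference on the misc cgroup; dropped at uncharge. */
	return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
}

static int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
{
	if (!epc_cg)
		return -EINVAL;

	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
}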
> + > +/** > + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages > + * @epc_cg: the charged epc cgroup > + */ > +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) > +{ > + if (sgx_epc_cgroup_disabled()) > + return; If with above change, check !epc_cg instead. > + > + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > + > + /* Ref got from sgx_epc_cgroup_try_charge() */ > + put_misc_cg(epc_cg->cg); > +} > > + > +static void sgx_epc_cgroup_free(struct misc_cg *cg) > +{ > + struct sgx_epc_cgroup *epc_cg; > + > + epc_cg = sgx_epc_cgroup_from_misc_cg(cg); > + if (!epc_cg) > + return; > + > + kfree(epc_cg); > +} > + > +static int sgx_epc_cgroup_alloc(struct misc_cg *cg); > + > +const struct misc_operations_struct sgx_epc_cgroup_ops = { > + .alloc = sgx_epc_cgroup_alloc, > + .free = sgx_epc_cgroup_free, > +}; > + > +static int sgx_epc_cgroup_alloc(struct misc_cg *cg) > +{ > + struct sgx_epc_cgroup *epc_cg; > + > + epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL); > + if (!epc_cg) > + return -ENOMEM; > + > + cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops; > + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg; > + epc_cg->cg = cg; > + return 0; > +} > + > +static int __init sgx_epc_cgroup_init(void) > +{ > + struct misc_cg *cg; > + > + if (!boot_cpu_has(X86_FEATURE_SGX)) > + return 0; > + > + cg = misc_cg_root(); > + BUG_ON(!cg); BUG_ON() will catch some eyeball, but it cannot be NULL in practice IIUC. I am not sure whether you can just make misc @root_cg visible (instead of having the misc_cg_root() helper) and directly use @root_cg here to avoid using the BUG(). No opinion here. > + > + return sgx_epc_cgroup_alloc(cg); As mentioned above the memory allocation can fail, in which case EPC cgroup is effectively disabled IIUC? One way is to manually check whether root EPC cgroup is valid in sgx_epc_cgroup_disabled(). Alternatively, you can have a static root EPC cgroup here: static struct sgx_epc_cgroup root_epc_cg; In this way you can have a sgx_epc_cgroup_init(&epc_cg), and call it from sgx_epc_cgroup_alloc(). > +} > +subsys_initcall(sgx_epc_cgroup_init); > diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h > new file mode 100644 > index 000000000000..c3abfe82be15 > --- /dev/null > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h > @@ -0,0 +1,36 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* Copyright(c) 2022 Intel Corporation. */ > +#ifndef _INTEL_SGX_EPC_CGROUP_H_ > +#define _INTEL_SGX_EPC_CGROUP_H_ > + > +#include <asm/sgx.h> > +#include <linux/cgroup.h> > +#include <linux/list.h> > +#include <linux/misc_cgroup.h> > +#include <linux/page_counter.h> > +#include <linux/workqueue.h> > + > +#include "sgx.h" > + > +#ifndef CONFIG_CGROUP_SGX_EPC > +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES Do you need this macro? > +struct sgx_epc_cgroup; > + > +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) > +{ > + return NULL; > +} > + > +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { } > +#else > +struct sgx_epc_cgroup { > + struct misc_cg *cg; > +}; > + > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void); > +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); > +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); Why do you need sgx_epc_cgroup_lru_empty() here? 
> + > +#endif > + > +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */ > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > index 166692f2d501..07606f391540 100644 > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -6,6 +6,7 @@ > #include <linux/highmem.h> > #include <linux/kthread.h> > #include <linux/miscdevice.h> > +#include <linux/misc_cgroup.h> > #include <linux/node.h> > #include <linux/pagemap.h> > #include <linux/ratelimit.h> > @@ -17,6 +18,7 @@ > #include "driver.h" > #include "encl.h" > #include "encls.h" > +#include "epc_cgroup.h" > > struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS]; > static int sgx_nr_epc_sections; > @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) > { > struct sgx_epc_page *page; > + struct sgx_epc_cgroup *epc_cg; > + > + epc_cg = sgx_epc_cgroup_try_charge(); > + if (IS_ERR(epc_cg)) > + return ERR_CAST(epc_cg); > > for ( ; ; ) { > page = __sgx_alloc_epc_page(); > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) > break; > } > > + /* > + * Need to do a global reclamation if cgroup was not full but free > + * physical pages run out, causing __sgx_alloc_epc_page() to fail. > + */ > sgx_reclaim_pages(); What's the final behaviour? IIUC it should be reclaiming from the *current* EPC cgroup? If so shouldn't we just pass the @epc_cg to it here? I think we can make this patch as "structure" patch w/o actually having EPC cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL. And we can have one patch to change sgx_reclaim_pages() to take the 'struct sgx_epc_lru_list *' as argument: void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru) { ... } Then here we can have something like: void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg) { struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru; sgx_reclaim_pages_lru(lru); } Makes sense? > cond_resched(); > } > > + if (!IS_ERR(page)) { > + WARN_ON_ONCE(page->epc_cg); > + page->epc_cg = epc_cg; > + } else { > + sgx_epc_cgroup_uncharge(epc_cg); > + } > + > if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) > wake_up(&ksgxd_waitq); > > @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page) > struct sgx_epc_section *section = &sgx_epc_sections[page->section]; > struct sgx_numa_node *node = section->node; > > + if (page->epc_cg) { > + sgx_epc_cgroup_uncharge(page->epc_cg); > + page->epc_cg = NULL; > + } > + > spin_lock(&node->lock); > > page->owner = NULL; > @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size, > section->pages[i].flags = 0; > section->pages[i].owner = NULL; > section->pages[i].poison = 0; > + section->pages[i].epc_cg = NULL; > list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list); > } > > @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int nid) {} > static bool __init sgx_page_cache_init(void) > { > u32 eax, ebx, ecx, edx, type; > + u64 capacity = 0; > u64 pa, size; > int nid; > int i; > @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void) > > sgx_epc_sections[i].node = &sgx_numa_nodes[nid]; > sgx_numa_nodes[nid].size += size; > + capacity += size; > > sgx_nr_epc_sections++; > } > @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void) > return false; > } > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity); > + > return true; > } I would separate setting up capacity as a separate patch. 
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h > index d2dad21259a8..b1786774b8d2 100644 > --- a/arch/x86/kernel/cpu/sgx/sgx.h > +++ b/arch/x86/kernel/cpu/sgx/sgx.h > @@ -29,12 +29,15 @@ > /* Pages on free list */ > #define SGX_EPC_PAGE_IS_FREE BIT(1) > > +struct sgx_epc_cgroup; > + > struct sgx_epc_page { > unsigned int section; > u16 flags; > u16 poison; > struct sgx_encl_page *owner; > struct list_head list; > + struct sgx_epc_cgroup *epc_cg; > }; > Adding @epc_cg unconditionally means even with !CONFIG_CGROUP_SGX_EPC the memory is still occupied. IMHO that would bring non-trivial memory waste as it's 8- bytes for each EPC page. If it's not good to have "#ifdef CONFIG_CGROUP_SGX_EPC" in the .c file, then perhaps we can have some helper here, e.g., static inline sgx_epc_page_set_cg(struct sgx_epc_page *epc_page, struct sgx_epc_cgroup *epc_cg) { #ifdef CONFIG_CGROUP_SGX_EPC epc_page->epc_cg = epc_cg; #endif }
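The accessor sketch above is missing its void return type; a compilable version, together with guarding the field itself, might look like this (a sketch, assuming the #ifdef lives in sgx.h):

struct sgx_epc_page {
	unsigned int section;
	u16 flags;
	u16 poison;
	struct sgx_encl_page *owner;
	struct list_head list;
#ifdef CONFIG_CGROUP_SGX_EPC
	struct sgx_epc_cgroup *epc_cg;	/* only costs space when configured in */
#endif
};

static inline void sgx_epc_page_set_cg(struct sgx_epc_page *epc_page,
				       struct sgx_epc_cgroup *epc_cg)
{
#ifdef CONFIG_CGROUP_SGX_EPC
	epc_page->epc_cg = epc_cg;
#endif
}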
On Mon, 06 Nov 2023 06:09:45 -0600, Huang, Kai <kai.huang@intel.com> wrote: > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote: >> From: Kristen Carlson Accardi <kristen@linux.intel.com> >> >> Implement support for cgroup control of SGX Enclave Page Cache (EPC) >> memory using the misc cgroup controller. EPC memory is independent >> from normal system memory, e.g. must be reserved at boot from RAM and >> cannot be converted between EPC and normal memory while the system is >> running. EPC is managed by the SGX subsystem and is not accounted by >> the memory controller. >> >> Much like normal system memory, EPC memory can be overcommitted via >> virtual memory techniques and pages can be swapped out of the EPC to >> their backing store (normal system memory, e.g. shmem). The SGX EPC >> subsystem is analogous to the memory subsystem and the SGX EPC >> controller >> is in turn analogous to the memory controller; it implements limit and >> protection models for EPC memory. > > Nit: > > The above two paragraphs talk about what is EPC and EPC resource control > needs > to be done separately, etc, but IMHO it lacks some background about > "why" EPC > resource control is needed, e.g, from use case's perspective. > >> >> The misc controller provides a mechanism to set a hard limit of EPC >> usage via the "sgx_epc" resource in "misc.max". The total EPC memory >> available on the system is reported via the "sgx_epc" resource in >> "misc.capacity". > > Please separate what the current misc cgroup provides, and how this > patch is > going to utilize. > > Please describe the changes in imperative mood. E.g, "report total EPC > memory > via ...", instead of "... is reported via ...". > Will update >> >> This patch was modified from the previous version to only add basic EPC >> cgroup structure, accounting allocations for cgroup usage >> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity. > > This isn't changelog material. Please focus on describing the high > level design > and why you chose such design. > >> >> For now, the EPC cgroup simply blocks additional EPC allocation in >> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are >> still tracked in the global active list, only reclaimed by the global >> reclaimer when the total free page count is lower than a threshold. >> >> Later patches will reorganize the tracking and reclamation code in the >> globale reclaimer and implement per-cgroup tracking and reclaiming. 
>> >> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> >> --- >> V6: >> - Split the original large patch"Limit process EPC usage with misc >> cgroup controller" and restructure it (Kai) >> --- >> arch/x86/Kconfig | 13 ++++ >> arch/x86/kernel/cpu/sgx/Makefile | 1 + >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++ >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 36 ++++++++++ >> arch/x86/kernel/cpu/sgx/main.c | 28 ++++++++ >> arch/x86/kernel/cpu/sgx/sgx.h | 3 + >> 6 files changed, 184 insertions(+) >> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c >> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index 66bfabae8814..e17c5dc3aea4 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -1921,6 +1921,19 @@ config X86_SGX >> >> If unsure, say N. >> >> +config CGROUP_SGX_EPC >> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) >> for Intel SGX" >> + depends on X86_SGX && CGROUP_MISC >> + help >> + Provides control over the EPC footprint of tasks in a cgroup via >> + the Miscellaneous cgroup controller. >> + >> + EPC is a subset of regular memory that is usable only by SGX >> + enclaves and is very limited in quantity, e.g. less than 1% >> + of total DRAM. >> + >> + Say N if unsure. >> + >> config X86_USER_SHADOW_STACK >> bool "X86 userspace shadow stack" >> depends on AS_WRUSS >> diff --git a/arch/x86/kernel/cpu/sgx/Makefile >> b/arch/x86/kernel/cpu/sgx/Makefile >> index 9c1656779b2a..12901a488da7 100644 >> --- a/arch/x86/kernel/cpu/sgx/Makefile >> +++ b/arch/x86/kernel/cpu/sgx/Makefile >> @@ -4,3 +4,4 @@ obj-y += \ >> ioctl.o \ >> main.o >> obj-$(CONFIG_X86_SGX_KVM) += virt.o >> +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> new file mode 100644 >> index 000000000000..500627d0563f >> --- /dev/null >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> @@ -0,0 +1,103 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +// Copyright(c) 2022 Intel Corporation. >> + >> +#include <linux/atomic.h> >> +#include <linux/kernel.h> >> +#include "epc_cgroup.h" >> + >> +static inline struct sgx_epc_cgroup >> *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg) >> +{ >> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv); >> +} >> + >> +static inline bool sgx_epc_cgroup_disabled(void) >> +{ >> + return !cgroup_subsys_enabled(misc_cgrp_subsys); > > From below, the root EPC cgroup is dynamically allocated. Shouldn't it > also > check whether the root EPC cgroup is valid? > Good point. I think I'll go with the static instance approach below. >> +} >> + >> +/** >> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single >> EPC page >> + * >> + * Returns EPC cgroup or NULL on success, -errno on failure. 
>> + */ >> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) >> +{ >> + struct sgx_epc_cgroup *epc_cg; >> + int ret; >> + >> + if (sgx_epc_cgroup_disabled()) >> + return NULL; >> + >> + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); >> + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); >> + >> + if (!ret) { >> + /* No epc_cg returned, release ref from get_current_misc_cg() */ >> + put_misc_cg(epc_cg->cg); >> + return ERR_PTR(-ENOMEM); > > misc_cg_try_charge() returns 0 when successfully charged, no? Right. I really made some mess in rebasing :-( > >> + } >> + >> + /* Ref released in sgx_epc_cgroup_uncharge() */ >> + return epc_cg; >> +} > > IMHO the above _try_charge() returning a pointer of EPC cgroup is a > little bit > odd, because it doesn't match the existing misc_cg_try_charge() which > returns > whether the charge is successful or not. sev_misc_cg_try_charge() > matches > misc_cg_try_charge() too. > > I think it's better to split "getting EPC cgroup" part out as a separate > helper, > and make this _try_charge() match existing pattern: > > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) > { > if (sgx_epc_cgroup_disabled()) > return NULL; > > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); > } > > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > { > if (!epc_cg) > return -EINVAL; > > return misc_cg_try_charge(epc_cg->cg); > } > > Having sgx_get_current_epc_cg() also makes the caller easier to read, > because we > can immediately know we are going to charge the *current* EPC cgroup, > but not > some cgroup hidden within sgx_epc_cgroup_try_charge(). > Actually, unlike other misc controllers, we need charge and get the epc_cg reference at the same time. That's why it was returning the pointer. How about rename them sgx_{charge_and_get, uncharge_and_put}_epc_cg()? In final version, there is a __sgx_epc_cgroup_try_charge() that wraps misc_cg_try_charge(). >> + >> +/** >> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages >> + * @epc_cg: the charged epc cgroup >> + */ >> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) >> +{ >> + if (sgx_epc_cgroup_disabled()) >> + return; > > If with above change, check !epc_cg instead. > >> + >> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); >> + >> + /* Ref got from sgx_epc_cgroup_try_charge() */ >> + put_misc_cg(epc_cg->cg); >> +} >> >> + >> +static void sgx_epc_cgroup_free(struct misc_cg *cg) >> +{ >> + struct sgx_epc_cgroup *epc_cg; >> + >> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg); >> + if (!epc_cg) >> + return; >> + >> + kfree(epc_cg); >> +} >> + >> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg); >> + >> +const struct misc_operations_struct sgx_epc_cgroup_ops = { >> + .alloc = sgx_epc_cgroup_alloc, >> + .free = sgx_epc_cgroup_free, >> +}; >> + >> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg) >> +{ >> + struct sgx_epc_cgroup *epc_cg; >> + >> + epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL); >> + if (!epc_cg) >> + return -ENOMEM; >> + >> + cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops; >> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg; >> + epc_cg->cg = cg; >> + return 0; >> +} >> + >> +static int __init sgx_epc_cgroup_init(void) >> +{ >> + struct misc_cg *cg; >> + >> + if (!boot_cpu_has(X86_FEATURE_SGX)) >> + return 0; >> + >> + cg = misc_cg_root(); >> + BUG_ON(!cg); > > BUG_ON() will catch some eyeball, but it cannot be NULL in practice IIUC. 
> > I am not sure whether you can just make misc @root_cg visible (instead > of having > the misc_cg_root() helper) and directly use @root_cg here to avoid using > the > BUG(). No opinion here. > I can remove BUG_ON(). It should never happen anyways. >> + >> + return sgx_epc_cgroup_alloc(cg); > > As mentioned above the memory allocation can fail, in which case EPC > cgroup is > effectively disabled IIUC? > > One way is to manually check whether root EPC cgroup is valid in > sgx_epc_cgroup_disabled(). Alternatively, you can have a static root > EPC cgroup > here: > > static struct sgx_epc_cgroup root_epc_cg; > > In this way you can have a sgx_epc_cgroup_init(&epc_cg), and call it from > sgx_epc_cgroup_alloc(). > Yeah, I think that is reasonable. >> +} >> +subsys_initcall(sgx_epc_cgroup_init); >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h >> new file mode 100644 >> index 000000000000..c3abfe82be15 >> --- /dev/null >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h >> @@ -0,0 +1,36 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +/* Copyright(c) 2022 Intel Corporation. */ >> +#ifndef _INTEL_SGX_EPC_CGROUP_H_ >> +#define _INTEL_SGX_EPC_CGROUP_H_ >> + >> +#include <asm/sgx.h> >> +#include <linux/cgroup.h> >> +#include <linux/list.h> >> +#include <linux/misc_cgroup.h> >> +#include <linux/page_counter.h> >> +#include <linux/workqueue.h> >> + >> +#include "sgx.h" >> + >> +#ifndef CONFIG_CGROUP_SGX_EPC >> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES > > Do you need this macro? I remember I got some compiling error without it but I don't see why it should be needed. I'll double check next round. thanks. > >> +struct sgx_epc_cgroup; >> + >> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) >> +{ >> + return NULL; >> +} >> + >> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup >> *epc_cg) { } >> +#else >> +struct sgx_epc_cgroup { >> + struct misc_cg *cg; >> +}; >> + >> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void); >> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); >> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); > > Why do you need sgx_epc_cgroup_lru_empty() here? > leftover from rebasing. Will remove. 
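The statically allocated root agreed on above might look like the following; a sketch only, with sgx_epc_cgroup_misc_init() as an illustrative helper name:

static struct sgx_epc_cgroup epc_cg_root;

static void sgx_epc_cgroup_misc_init(struct misc_cg *cg,
				     struct sgx_epc_cgroup *epc_cg)
{
	cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
	epc_cg->cg = cg;
}

static int __init sgx_epc_cgroup_init(void)
{
	if (!boot_cpu_has(X86_FEATURE_SGX))
		return 0;

	/* The root is static, so initialization cannot fail with -ENOMEM. */
	sgx_epc_cgroup_misc_init(misc_cg_root(), &epc_cg_root);
	return 0;
}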
>> + >> +#endif >> + >> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */ >> diff --git a/arch/x86/kernel/cpu/sgx/main.c >> b/arch/x86/kernel/cpu/sgx/main.c >> index 166692f2d501..07606f391540 100644 >> --- a/arch/x86/kernel/cpu/sgx/main.c >> +++ b/arch/x86/kernel/cpu/sgx/main.c >> @@ -6,6 +6,7 @@ >> #include <linux/highmem.h> >> #include <linux/kthread.h> >> #include <linux/miscdevice.h> >> +#include <linux/misc_cgroup.h> >> #include <linux/node.h> >> #include <linux/pagemap.h> >> #include <linux/ratelimit.h> >> @@ -17,6 +18,7 @@ >> #include "driver.h" >> #include "encl.h" >> #include "encls.h" >> +#include "epc_cgroup.h" >> >> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS]; >> static int sgx_nr_epc_sections; >> @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct >> sgx_epc_page *page) >> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) >> { >> struct sgx_epc_page *page; >> + struct sgx_epc_cgroup *epc_cg; >> + >> + epc_cg = sgx_epc_cgroup_try_charge(); >> + if (IS_ERR(epc_cg)) >> + return ERR_CAST(epc_cg); >> >> for ( ; ; ) { >> page = __sgx_alloc_epc_page(); >> @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void >> *owner, bool reclaim) >> break; >> } >> >> + /* >> + * Need to do a global reclamation if cgroup was not full but free >> + * physical pages run out, causing __sgx_alloc_epc_page() to fail. >> + */ >> sgx_reclaim_pages(); > > What's the final behaviour? IIUC it should be reclaiming from the > *current* EPC > cgroup? If so shouldn't we just pass the @epc_cg to it here? > > I think we can make this patch as "structure" patch w/o actually having > EPC > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL. > > And we can have one patch to change sgx_reclaim_pages() to take the > 'struct > sgx_epc_lru_list *' as argument: > > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru) > { > ... > } > > Then here we can have something like: > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg) > { > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : > &sgx_global_lru; > > sgx_reclaim_pages_lru(lru); > } > > Makes sense? > This is purely global reclamation. No cgroup involved. You can see it later in changes in patch 10/12. For now I just make a comment there but no real changes. Cgroup reclamation will be done as part of _try_charge call. 
>> cond_resched(); >> } >> >> + if (!IS_ERR(page)) { >> + WARN_ON_ONCE(page->epc_cg); >> + page->epc_cg = epc_cg; >> + } else { >> + sgx_epc_cgroup_uncharge(epc_cg); >> + } >> + >> if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) >> wake_up(&ksgxd_waitq); >> >> @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page) >> struct sgx_epc_section *section = &sgx_epc_sections[page->section]; >> struct sgx_numa_node *node = section->node; >> >> + if (page->epc_cg) { >> + sgx_epc_cgroup_uncharge(page->epc_cg); >> + page->epc_cg = NULL; >> + } >> + >> spin_lock(&node->lock); >> >> page->owner = NULL; >> @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 >> phys_addr, u64 size, >> section->pages[i].flags = 0; >> section->pages[i].owner = NULL; >> section->pages[i].poison = 0; >> + section->pages[i].epc_cg = NULL; >> list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list); >> } >> >> @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int >> nid) {} >> static bool __init sgx_page_cache_init(void) >> { >> u32 eax, ebx, ecx, edx, type; >> + u64 capacity = 0; >> u64 pa, size; >> int nid; >> int i; >> @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void) >> >> sgx_epc_sections[i].node = &sgx_numa_nodes[nid]; >> sgx_numa_nodes[nid].size += size; >> + capacity += size; >> >> sgx_nr_epc_sections++; >> } >> @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void) >> return false; >> } >> >> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity); >> + >> return true; >> } > > I would separate setting up capacity as a separate patch. I thought about that, but again it was only 3-4 lines all in this function and it's also necessary part of basic setup for misc controller... > >> >> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h >> b/arch/x86/kernel/cpu/sgx/sgx.h >> index d2dad21259a8..b1786774b8d2 100644 >> --- a/arch/x86/kernel/cpu/sgx/sgx.h >> +++ b/arch/x86/kernel/cpu/sgx/sgx.h >> @@ -29,12 +29,15 @@ >> /* Pages on free list */ >> #define SGX_EPC_PAGE_IS_FREE BIT(1) >> >> +struct sgx_epc_cgroup; >> + >> struct sgx_epc_page { >> unsigned int section; >> u16 flags; >> u16 poison; >> struct sgx_encl_page *owner; >> struct list_head list; >> + struct sgx_epc_cgroup *epc_cg; >> }; >> > > Adding @epc_cg unconditionally means even with !CONFIG_CGROUP_SGX_EPC > the memory > is still occupied. IMHO that would bring non-trivial memory waste as > it's 8- > bytes for each EPC page. > Ok, I'll add ifdef
> > > > +/** > > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single > > > EPC page > > > + * > > > + * Returns EPC cgroup or NULL on success, -errno on failure. > > > + */ > > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) > > > +{ > > > + struct sgx_epc_cgroup *epc_cg; > > > + int ret; > > > + > > > + if (sgx_epc_cgroup_disabled()) > > > + return NULL; > > > + > > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); > > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > > > + > > > + if (!ret) { > > > + /* No epc_cg returned, release ref from get_current_misc_cg() */ > > > + put_misc_cg(epc_cg->cg); > > > + return ERR_PTR(-ENOMEM); > > > > misc_cg_try_charge() returns 0 when successfully charged, no? > > Right. I really made some mess in rebasing :-( > > > > > > + } > > > + > > > + /* Ref released in sgx_epc_cgroup_uncharge() */ > > > + return epc_cg; > > > +} > > > > IMHO the above _try_charge() returning a pointer of EPC cgroup is a > > little bit > > odd, because it doesn't match the existing misc_cg_try_charge() which > > returns > > whether the charge is successful or not. sev_misc_cg_try_charge() > > matches > > misc_cg_try_charge() too. > > > > I think it's better to split "getting EPC cgroup" part out as a separate > > helper, > > and make this _try_charge() match existing pattern: > > > > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) > > { > > if (sgx_epc_cgroup_disabled()) > > return NULL; > > > > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); > > } > > > > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > > { > > if (!epc_cg) > > return -EINVAL; > > > > return misc_cg_try_charge(epc_cg->cg); > > } > > > > Having sgx_get_current_epc_cg() also makes the caller easier to read, > > because we > > can immediately know we are going to charge the *current* EPC cgroup, > > but not > > some cgroup hidden within sgx_epc_cgroup_try_charge(). > > > > Actually, unlike other misc controllers, we need charge and get the epc_cg > reference at the same time. > Can you elaborate? And in practice you always call sgx_epc_cgroup_try_charge() right after sgx_get_current_epc_cg() anyway. The only difference is the whole thing is done in one function or in separate functions. [...] > > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) > > > { > > > struct sgx_epc_page *page; > > > + struct sgx_epc_cgroup *epc_cg; > > > + > > > + epc_cg = sgx_epc_cgroup_try_charge(); > > > + if (IS_ERR(epc_cg)) > > > + return ERR_CAST(epc_cg); > > > > > > for ( ; ; ) { > > > page = __sgx_alloc_epc_page(); > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void > > > *owner, bool reclaim) > > > break; > > > } > > > > > > + /* > > > + * Need to do a global reclamation if cgroup was not full but free > > > + * physical pages run out, causing __sgx_alloc_epc_page() to fail. > > > + */ > > > sgx_reclaim_pages(); > > > > What's the final behaviour? IIUC it should be reclaiming from the > > *current* EPC > > cgroup? If so shouldn't we just pass the @epc_cg to it here? > > > > I think we can make this patch as "structure" patch w/o actually having > > EPC > > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL. > > > > And we can have one patch to change sgx_reclaim_pages() to take the > > 'struct > > sgx_epc_lru_list *' as argument: > > > > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru) > > { > > ... 
> > } > > > > Then here we can have something like: > > > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg) > > { > > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : > > &sgx_global_lru; > > > > sgx_reclaim_pages_lru(lru); > > } > > > > Makes sense? > > > > This is purely global reclamation. No cgroup involved. > Again why? Here you are allocating one EPC page for enclave in a particular EPC cgroup. When that fails, shouldn't you try only to reclaim from the *current* EPC cgroup? Or at least you should try to reclaim from the *current* EPC cgroup first? > You can see it > later in changes in patch 10/12. For now I just make a comment there but > no real changes. Cgroup reclamation will be done as part of _try_charge > call. > > > > cond_resched(); > > > } > > > > > > + if (!IS_ERR(page)) { > > > + WARN_ON_ONCE(page->epc_cg); > > > + page->epc_cg = epc_cg; > > > + } else { > > > + sgx_epc_cgroup_uncharge(epc_cg); > > > + } > > > + > > > if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) > > > wake_up(&ksgxd_waitq); > > > > > > @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page) > > > struct sgx_epc_section *section = &sgx_epc_sections[page->section]; > > > struct sgx_numa_node *node = section->node; > > > > > > + if (page->epc_cg) { > > > + sgx_epc_cgroup_uncharge(page->epc_cg); > > > + page->epc_cg = NULL; > > > + } > > > + > > > spin_lock(&node->lock); > > > > > > page->owner = NULL; > > > @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 > > > phys_addr, u64 size, > > > section->pages[i].flags = 0; > > > section->pages[i].owner = NULL; > > > section->pages[i].poison = 0; > > > + section->pages[i].epc_cg = NULL; > > > list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list); > > > } > > > > > > @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int > > > nid) {} > > > static bool __init sgx_page_cache_init(void) > > > { > > > u32 eax, ebx, ecx, edx, type; > > > + u64 capacity = 0; > > > u64 pa, size; > > > int nid; > > > int i; > > > @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void) > > > > > > sgx_epc_sections[i].node = &sgx_numa_nodes[nid]; > > > sgx_numa_nodes[nid].size += size; > > > + capacity += size; > > > > > > sgx_nr_epc_sections++; > > > } > > > @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void) > > > return false; > > > } > > > > > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity); Hmm.. I think this is why MISC_CG_RES_SGX_EPC is needed when !CONFIG_CGROUP_SGX_EPC. > > > + > > > return true; > > > } > > > > I would separate setting up capacity as a separate patch. > > I thought about that, but again it was only 3-4 lines all in this function > and it's also necessary part of basic setup for misc controller... Fine. Anyway it depends on what things you want to do on this patch. It's fine to include the capacity if this patch is some "structure" patch that shows the high level flow of how EPC cgroup works.
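The dependency Kai points out is that main.c calls misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, ...) unconditionally, while the enum value only exists when CONFIG_CGROUP_SGX_EPC=y. The fallback define keeps that call compiling; a sketch of the intent:

#ifndef CONFIG_CGROUP_SGX_EPC
/*
 * MISC_CG_RES_SGX_EPC is only added to enum misc_res_type when
 * CONFIG_CGROUP_SGX_EPC=y; map it to MISC_CG_RES_TYPES otherwise so
 * the unconditional misc_cg_set_capacity() call in main.c still
 * builds. The misc controller treats the out-of-range id as invalid,
 * so the call degrades to a harmless -EINVAL.
 */
#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
#endif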
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote: > +static int __init sgx_epc_cgroup_init(void) > +{ > + struct misc_cg *cg; > + > + if (!boot_cpu_has(X86_FEATURE_SGX)) > + return 0; > + > + cg = misc_cg_root(); > + BUG_ON(!cg); > + > + return sgx_epc_cgroup_alloc(cg); > +} > +subsys_initcall(sgx_epc_cgroup_init); This should be called from sgx_init(), which is the place to init SGX related staff. In case you didn't notice, sgx_init() is actually device_initcall(), which is actually called _after_ the subsys_initcall() you used above.
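Concretely, that would mean dropping the subsys_initcall() and calling the setup from sgx_init(); an abridged sketch, with the existing sgx_init() contents elided:

static int __init sgx_init(void)
{
	int ret;

	if (!cpu_feature_enabled(X86_FEATURE_SGX))
		return -ENODEV;

	/* ... existing EPC section, ksgxd and driver setup ... */

	ret = sgx_epc_cgroup_init();
	if (ret)
		return ret;

	return 0;
}
device_initcall(sgx_init);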
On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai <kai.huang@intel.com> wrote: >> >> > > +/** >> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a >> single >> > > EPC page >> > > + * >> > > + * Returns EPC cgroup or NULL on success, -errno on failure. >> > > + */ >> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) >> > > +{ >> > > + struct sgx_epc_cgroup *epc_cg; >> > > + int ret; >> > > + >> > > + if (sgx_epc_cgroup_disabled()) >> > > + return NULL; >> > > + >> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); >> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >> PAGE_SIZE); >> > > + >> > > + if (!ret) { >> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */ >> > > + put_misc_cg(epc_cg->cg); >> > > + return ERR_PTR(-ENOMEM); >> > >> > misc_cg_try_charge() returns 0 when successfully charged, no? >> >> Right. I really made some mess in rebasing :-( >> >> > >> > > + } >> > > + >> > > + /* Ref released in sgx_epc_cgroup_uncharge() */ >> > > + return epc_cg; >> > > +} >> > >> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a >> > little bit >> > odd, because it doesn't match the existing misc_cg_try_charge() which >> > returns >> > whether the charge is successful or not. sev_misc_cg_try_charge() >> > matches >> > misc_cg_try_charge() too. >> > >> > I think it's better to split "getting EPC cgroup" part out as a >> separate >> > helper, >> > and make this _try_charge() match existing pattern: >> > >> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) >> > { >> > if (sgx_epc_cgroup_disabled()) >> > return NULL; >> > >> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); >> > } >> > >> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >> > { >> > if (!epc_cg) >> > return -EINVAL; >> > >> > return misc_cg_try_charge(epc_cg->cg); >> > } >> > >> > Having sgx_get_current_epc_cg() also makes the caller easier to read, >> > because we >> > can immediately know we are going to charge the *current* EPC cgroup, >> > but not >> > some cgroup hidden within sgx_epc_cgroup_try_charge(). >> > >> >> Actually, unlike other misc controllers, we need charge and get the >> epc_cg >> reference at the same time. > > Can you elaborate? > > And in practice you always call sgx_epc_cgroup_try_charge() right after > sgx_get_current_epc_cg() anyway. The only difference is the whole thing > is done > in one function or in separate functions. > > [...] > That's true. I was thinking no need to have them done in separate calls. The caller has to check the return value for epc_cg instance first, then check result of try_charge. But there is really only one caller, sgx_alloc_epc_page() below, so I don't have strong opinions now. With them separate, the checks will look like this: if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled, should continue for allocation { if (ret = sgx_epc_cgroup_try_charge()) return ret } // continue... I can go either way. 
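Written out as compilable C, the fragment above would read (the ERR_PTR() conversion is illustrative, matching sgx_alloc_epc_page()'s return convention):

struct sgx_epc_cgroup *epc_cg;
int ret;

/* NULL means the cgroup is disabled; continue with the allocation. */
epc_cg = sgx_get_current_epc_cg();
if (epc_cg) {
	ret = sgx_epc_cgroup_try_charge(epc_cg);
	if (ret)
		return ERR_PTR(ret);
}
/* continue... */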
> >> > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) >> > > { >> > > struct sgx_epc_page *page; >> > > + struct sgx_epc_cgroup *epc_cg; >> > > + >> > > + epc_cg = sgx_epc_cgroup_try_charge(); >> > > + if (IS_ERR(epc_cg)) >> > > + return ERR_CAST(epc_cg); >> > > >> > > for ( ; ; ) { >> > > page = __sgx_alloc_epc_page(); >> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void >> > > *owner, bool reclaim) >> > > break; >> > > } >> > > >> > > + /* >> > > + * Need to do a global reclamation if cgroup was not full but >> free >> > > + * physical pages run out, causing __sgx_alloc_epc_page() to >> fail. >> > > + */ >> > > sgx_reclaim_pages(); >> > >> > What's the final behaviour? IIUC it should be reclaiming from the >> > *current* EPC >> > cgroup? If so shouldn't we just pass the @epc_cg to it here? >> > >> > I think we can make this patch as "structure" patch w/o actually >> having >> > EPC >> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL. >> > >> > And we can have one patch to change sgx_reclaim_pages() to take the >> > 'struct >> > sgx_epc_lru_list *' as argument: >> > >> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru) >> > { >> > ... >> > } >> > >> > Then here we can have something like: >> > >> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg) >> > { >> > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : >> > &sgx_global_lru; >> > >> > sgx_reclaim_pages_lru(lru); >> > } >> > >> > Makes sense? >> > >> >> This is purely global reclamation. No cgroup involved. > > Again why? Here you are allocating one EPC page for enclave in a > particular EPC > cgroup. When that fails, shouldn't you try only to reclaim from the > *current* > EPC cgroup? Or at least you should try to reclaim from the *current* > EPC cgroup > first? > Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup reclaims synchronously, otherwise in background and returns -EBUSY in that case. This function also returns if no valid epc_cg pointer returned. All reclamation for *current* cgroup is done in sgx_epc_cg_try_charge(). So, by reaching to this point, a valid epc_cg pointer was returned, that means allocation is allowed for the cgroup (it has reclaimed if necessary, and its usage is not above limit after charging). But the system level free count may be low (e.g., limits of all cgroups may add up to be more than capacity). so we need to do a global reclamation here, which may involve reclaiming a few pages (from current or other groups) so the system can be at a performant state with minimal free count. (current behavior of ksgxd). Note this patch does not do per-cgroup reclamation. If we had stopped with this patch without next patches, cgroups will only block allocation by returning invalid epc_cg pointer (-ENOMEM) from sgx_epc_cg_try_charge(). Reclamation only happens when cgroup is not full but system level free count is below threshold. >> You can see it >> later in changes in patch 10/12. For now I just make a comment there but >> no real changes. Cgroup reclamation will be done as part of _try_charge >> call. 
>> >> > > cond_resched(); >> > > } >> > > >> > > + if (!IS_ERR(page)) { >> > > + WARN_ON_ONCE(page->epc_cg); >> > > + page->epc_cg = epc_cg; >> > > + } else { >> > > + sgx_epc_cgroup_uncharge(epc_cg); >> > > + } >> > > + >> > > if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) >> > > wake_up(&ksgxd_waitq); >> > > >> > > @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page >> *page) >> > > struct sgx_epc_section *section = >> &sgx_epc_sections[page->section]; >> > > struct sgx_numa_node *node = section->node; >> > > >> > > + if (page->epc_cg) { >> > > + sgx_epc_cgroup_uncharge(page->epc_cg); >> > > + page->epc_cg = NULL; >> > > + } >> > > + >> > > spin_lock(&node->lock); >> > > >> > > page->owner = NULL; >> > > @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 >> > > phys_addr, u64 size, >> > > section->pages[i].flags = 0; >> > > section->pages[i].owner = NULL; >> > > section->pages[i].poison = 0; >> > > + section->pages[i].epc_cg = NULL; >> > > list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list); >> > > } >> > > >> > > @@ -787,6 +811,7 @@ static void __init >> arch_update_sysfs_visibility(int >> > > nid) {} >> > > static bool __init sgx_page_cache_init(void) >> > > { >> > > u32 eax, ebx, ecx, edx, type; >> > > + u64 capacity = 0; >> > > u64 pa, size; >> > > int nid; >> > > int i; >> > > @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void) >> > > >> > > sgx_epc_sections[i].node = &sgx_numa_nodes[nid]; >> > > sgx_numa_nodes[nid].size += size; >> > > + capacity += size; >> > > >> > > sgx_nr_epc_sections++; >> > > } >> > > @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void) >> > > return false; >> > > } >> > > >> > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity); > > Hmm.. I think this is why MISC_CG_RES_SGX_EPC is needed when > !CONFIG_CGROUP_SGX_EPC. right, that was it :-) > >> > > + >> > > return true; >> > > } >> > >> > I would separate setting up capacity as a separate patch. >> >> I thought about that, but again it was only 3-4 lines all in this >> function >> and it's also necessary part of basic setup for misc controller... > > Fine. Anyway it depends on what things you want to do on this patch. > It's fine > to include the capacity if this patch is some "structure" patch that > shows the > high level flow of how EPC cgroup works. Ok, I'll keep this way for now unless any objections. Thanks Haitao
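For the 'reclaim' flag behavior Haitao describes earlier in this reply, a hypothetical sketch of what the future _try_charge() could look like; sgx_epc_cgroup_reclaim_pages() is an assumed helper, and the real flow only lands later in the series:

/*
 * Hypothetical sketch of the planned try_charge() with a 'reclaim'
 * flag; sgx_epc_cgroup_reclaim_pages() is assumed, not code from
 * this posting.
 */
static int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
				     bool reclaim)
{
	/* misc_cg_try_charge() returns 0 on success. */
	while (misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
				  PAGE_SIZE)) {
		if (!reclaim)
			return -EBUSY;	/* defer to background reclaim */

		/* Synchronously reclaim from this cgroup, then retry. */
		sgx_epc_cgroup_reclaim_pages(epc_cg);
	}

	return 0;
}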
On Mon, 06 Nov 2023 19:16:30 -0600, Haitao Huang <haitao.huang@linux.intel.com> wrote:

> On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai <kai.huang@intel.com> wrote:
>
>>>
>>> > > +/**
>>> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
>>> > > + *
>>> > > + * Returns EPC cgroup or NULL on success, -errno on failure.
>>> > > + */
>>> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
>>> > > +{
>>> > > +	struct sgx_epc_cgroup *epc_cg;
>>> > > +	int ret;
>>> > > +
>>> > > +	if (sgx_epc_cgroup_disabled())
>>> > > +		return NULL;
>>> > > +
>>> > > +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>>> > > +	ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>>> > > +
>>> > > +	if (!ret) {
>>> > > +		/* No epc_cg returned, release ref from get_current_misc_cg() */
>>> > > +		put_misc_cg(epc_cg->cg);
>>> > > +		return ERR_PTR(-ENOMEM);
>>> >
>>> > misc_cg_try_charge() returns 0 when successfully charged, no?
>>>
>>> Right. I really made some mess in rebasing :-(
>>>
>>> >
>>> > > +	}
>>> > > +
>>> > > +	/* Ref released in sgx_epc_cgroup_uncharge() */
>>> > > +	return epc_cg;
>>> > > +}
>>> >
>>> > IMHO the above _try_charge() returning a pointer to the EPC cgroup is a little bit
>>> > odd, because it doesn't match the existing misc_cg_try_charge(), which returns
>>> > whether the charge is successful or not. sev_misc_cg_try_charge() matches
>>> > misc_cg_try_charge() too.
>>> >
>>> > I think it's better to split the "getting EPC cgroup" part out as a separate helper,
>>> > and make this _try_charge() match the existing pattern:
>>> >
>>> > 	struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
>>> > 	{
>>> > 		if (sgx_epc_cgroup_disabled())
>>> > 			return NULL;
>>> >
>>> > 		return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>>> > 	}
>>> >
>>> > 	int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>> > 	{
>>> > 		if (!epc_cg)
>>> > 			return -EINVAL;
>>> >
>>> > 		return misc_cg_try_charge(epc_cg->cg);
>>> > 	}
>>> >
>>> > Having sgx_get_current_epc_cg() also makes the caller easier to read, because we
>>> > can immediately know we are going to charge the *current* EPC cgroup, and not
>>> > some cgroup hidden within sgx_epc_cgroup_try_charge().
>>>
>>> Actually, unlike other misc controllers, we need to charge and get the epc_cg
>>> reference at the same time.
>>
>> Can you elaborate?
>>
>> And in practice you always call sgx_epc_cgroup_try_charge() right after
>> sgx_get_current_epc_cg() anyway. The only difference is whether the whole thing
>> is done in one function or in separate functions.
>>
>> [...]
>
> That's true. I was thinking there is no need to have them done in separate calls.
> The caller has to check the return value for the epc_cg instance first, then
> check the result of try_charge. But there is really only one caller,
> sgx_alloc_epc_page() below, so I don't have strong opinions now.
>
> With them separate, the checks will look like this:
>
> 	/* NULL means the cgroup is disabled; continue with allocation */
> 	epc_cg = sgx_get_current_epc_cg();
> 	if (epc_cg) {
> 		ret = sgx_epc_cgroup_try_charge(epc_cg);
> 		if (ret)
> 			return ret;
> 	}
> 	/* continue... */
>
> I can go either way.
>
>>
>>> > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>>> > >  {
>>> > >  	struct sgx_epc_page *page;
>>> > > +	struct sgx_epc_cgroup *epc_cg;
>>> > > +
>>> > > +	epc_cg = sgx_epc_cgroup_try_charge();
>>> > > +	if (IS_ERR(epc_cg))
>>> > > +		return ERR_CAST(epc_cg);
>>> > >
>>> > >  	for ( ; ; ) {
>>> > >  		page = __sgx_alloc_epc_page();
>>> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>>> > >  			break;
>>> > >  	}
>>> > >
>>> > > +	/*
>>> > > +	 * Need to do a global reclamation if cgroup was not full but free
>>> > > +	 * physical pages run out, causing __sgx_alloc_epc_page() to fail.
>>> > > +	 */
>>> > >  	sgx_reclaim_pages();
>>> >
>>> > What's the final behaviour? IIUC it should be reclaiming from the *current* EPC
>>> > cgroup? If so shouldn't we just pass the @epc_cg to it here?
>>> >
>>> > I think we can make this patch a "structure" patch w/o actually having EPC
>>> > cgroup enabled, i.e., sgx_get_current_epc_cg() always returns NULL.
>>> >
>>> > And we can have one patch to change sgx_reclaim_pages() to take a
>>> > 'struct sgx_epc_lru_list *' as argument:
>>> >
>>> > 	void sgx_reclaim_pages_lru(struct sgx_epc_lru_list *lru)
>>> > 	{
>>> > 		...
>>> > 	}
>>> >
>>> > Then here we can have something like:
>>> >
>>> > 	void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>>> > 	{
>>> > 		struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru;
>>> >
>>> > 		sgx_reclaim_pages_lru(lru);
>>> > 	}
>>> >
>>> > Makes sense?
>>>
>>> This is purely global reclamation. No cgroup involved.
>>
>> Again why? Here you are allocating one EPC page for an enclave in a particular EPC
>> cgroup. When that fails, shouldn't you try only to reclaim from the *current*
>> EPC cgroup? Or at least you should try to reclaim from the *current* EPC cgroup
>> first?
>
> Later, sgx_epc_cg_try_charge() will take a 'reclaim' flag; if true, the cgroup
> reclaims synchronously, otherwise in the background, and it returns -EBUSY in
> the latter case. sgx_alloc_epc_page() also returns early if no valid epc_cg
> pointer is returned.
>
> All reclamation for the *current* cgroup is done in sgx_epc_cg_try_charge().
>
> So, by reaching this point, a valid epc_cg pointer was returned, which means
> allocation is allowed for the cgroup (it has reclaimed if necessary, and its
> usage is not above the limit after charging).
>
> But the system-level free count may be low (e.g., the limits of all cgroups
> may add up to more than the capacity), so we need to do a global reclamation
> here, which may involve reclaiming a few pages (from the current or other
> groups) so the system can be in a performant state with a minimal free count
> (the current behavior of ksgxd).

I should have stuck to the original comment added in the code. Actually
__sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the real
reason for the global reclaim. The free count enforcement is near the end of
this function, after the sgx_should_reclaim() check.

Haitao
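[An editorial illustration of the split under discussion: a minimal sketch of sgx_alloc_epc_page() reworked around the two helpers Kai proposes above. The helper names come from his suggestion; the reference handling and the int return convention are assumptions for illustration, not code from the series.]

	/*
	 * Editorial sketch, not the posted patch: the split-helper calling
	 * pattern, assuming sgx_get_current_epc_cg() returns NULL when the
	 * misc cgroup is disabled and the reworked try_charge() returns 0
	 * on success or a negative errno when a limit is hit.
	 */
	struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
	{
		struct sgx_epc_cgroup *epc_cg;
		struct sgx_epc_page *page;
		int ret;

		epc_cg = sgx_get_current_epc_cg();	/* NULL: cgroup disabled */
		if (epc_cg) {
			ret = sgx_epc_cgroup_try_charge(epc_cg);
			if (ret) {
				/* Drop the ref taken by get_current_misc_cg() */
				put_misc_cg(epc_cg->cg);
				return ERR_PTR(ret);
			}
		}

		/* Allocation loop as in the posted patch, elided here */
		page = __sgx_alloc_epc_page();

		if (!IS_ERR(page))
			page->epc_cg = epc_cg;	/* uncharged in sgx_free_epc_page() */
		else if (epc_cg)
			sgx_epc_cgroup_uncharge(epc_cg);	/* roll back the charge */

		return page;
	}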
> I should have stuck to the original comment added in the code. Actually
> __sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the real
> reason for the global reclaim. The free count enforcement is near the end of
> this function, after the sgx_should_reclaim() check.

Hi Haitao,

Sorry, I have something else to do at this moment and will continue with this series next week.
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. it must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem). The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from the previous version to only add the basic
> EPC cgroup structure: accounting of allocations for cgroup usage
> (charge/uncharge), setting up misc cgroup callbacks, and setting the
> total EPC capacity.
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, and are only reclaimed by the
> global reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
> V6:
> - Split the original large patch "Limit process EPC usage with misc
>   cgroup controller" and restructure it (Kai)
> ---
>  arch/x86/Kconfig                     |  13 ++++
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++++++++++
>  arch/x86/kernel/cpu/sgx/main.c       |  28 ++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h        |   3 +
>  6 files changed, 184 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>
>  	  If unsure, say N.
>
> +config CGROUP_SGX_EPC
> +	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> +	depends on X86_SGX && CGROUP_MISC
> +	help
> +	  Provides control over the EPC footprint of tasks in a cgroup via
> +	  the Miscellaneous cgroup controller.
> +
> +	  EPC is a subset of regular memory that is usable only by SGX
> +	  enclaves and is very limited in quantity, e.g. less than 1%
> +	  of total DRAM.
> +
> +	  Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>  	bool "X86 userspace shadow stack"
>  	depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>  	ioctl.o \
>  	main.o
>  obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)	+= epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> +	return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> +	return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.

This should have a description explaining what "charging hierarchically" is
all about. It is too cryptic as it stands. E.g. consider what non-hierarchical
charging would mean. There must be an opposite end in order for the term to
have a meaning (as for anything expressed in a language).

> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	int ret;
> +
> +	if (sgx_epc_cgroup_disabled())
> +		return NULL;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +	ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	if (!ret) {
> +		/* No epc_cg returned, release ref from get_current_misc_cg() */
> +		put_misc_cg(epc_cg->cg);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Ref released in sgx_epc_cgroup_uncharge() */
> +	return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg:	the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (sgx_epc_cgroup_disabled())
> +		return;
> +
> +	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	/* Ref got from sgx_epc_cgroup_try_charge() */
> +	put_misc_cg(epc_cg->cg);
> +}
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +	if (!epc_cg)
> +		return;
> +
> +	kfree(epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_operations_struct sgx_epc_cgroup_ops = {
> +	.alloc = sgx_epc_cgroup_alloc,
> +	.free = sgx_epc_cgroup_free,
> +};
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> +	if (!epc_cg)
> +		return -ENOMEM;
> +
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
> +	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> +	epc_cg->cg = cg;
> +	return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> +	struct misc_cg *cg;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SGX))
> +		return 0;
> +
> +	cg = misc_cg_root();
> +	BUG_ON(!cg);
> +
> +	return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..c3abfe82be15
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> +	return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +#else
> +struct sgx_epc_cgroup {
> +	struct misc_cg *cg;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
> +
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 166692f2d501..07606f391540 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
>  #include <linux/node.h>
>  #include <linux/pagemap.h>
>  #include <linux/ratelimit.h>
> @@ -17,6 +18,7 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"
> +#include "epc_cgroup.h"
>
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
> @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>  	struct sgx_epc_page *page;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_try_charge();
> +	if (IS_ERR(epc_cg))
> +		return ERR_CAST(epc_cg);
>
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
> @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  	}
>
> +	/*
> +	 * Need to do a global reclamation if cgroup was not full but free
> +	 * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> +	 */
>  	sgx_reclaim_pages();
>  	cond_resched();
>  	}
>
> +	if (!IS_ERR(page)) {
> +		WARN_ON_ONCE(page->epc_cg);
> +		page->epc_cg = epc_cg;
> +	} else {
> +		sgx_epc_cgroup_uncharge(epc_cg);
> +	}
> +
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>  		wake_up(&ksgxd_waitq);
>
> @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>  	struct sgx_numa_node *node = section->node;
>
> +	if (page->epc_cg) {
> +		sgx_epc_cgroup_uncharge(page->epc_cg);
> +		page->epc_cg = NULL;
> +	}
> +
>  	spin_lock(&node->lock);
>
>  	page->owner = NULL;
> @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		section->pages[i].flags = 0;
>  		section->pages[i].owner = NULL;
>  		section->pages[i].poison = 0;
> +		section->pages[i].epc_cg = NULL;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>
> @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
>  static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
> +	u64 capacity = 0;
>  	u64 pa, size;
>  	int nid;
>  	int i;
> @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
>
>  		sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
>  		sgx_numa_nodes[nid].size += size;
> +		capacity += size;
>
>  		sgx_nr_epc_sections++;
>  	}
> @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
>  		return false;
>  	}
>
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +
>  	return true;
>  }
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index d2dad21259a8..b1786774b8d2 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -29,12 +29,15 @@
>  /* Pages on free list */
>  #define SGX_EPC_PAGE_IS_FREE		BIT(1)
>
> +struct sgx_epc_cgroup;
> +
>  struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
>  	u16 poison;
>  	struct sgx_encl_page *owner;
>  	struct list_head list;
> +	struct sgx_epc_cgroup *epc_cg;
>  };
>
>  /*

BR, Jarkko
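[To unpack the term Jarkko flags: "hierarchical" charging means the charge is applied to the target cgroup and every ancestor up to the root, and fails if any level would exceed its max. A simplified paraphrase of misc_cg_try_charge() based on kernel/cgroup/misc.c of this era; this is not the exact upstream code, and details such as watermark and event accounting are omitted:]

	/*
	 * Simplified paraphrase of misc_cg_try_charge(), for illustration
	 * only: walk from @cg up to the root, adding @amount at each level;
	 * if any level would go over its "max" or over the resource's total
	 * capacity, unwind all partial charges and fail.
	 */
	static int try_charge_sketch(enum misc_res_type type, struct misc_cg *cg,
				     unsigned long amount)
	{
		struct misc_cg *i, *j;
		unsigned long new_usage;

		for (i = cg; i; i = parent_misc(i)) {
			struct misc_res *res = &i->res[type];

			new_usage = atomic_long_add_return(amount, &res->usage);
			if (new_usage > READ_ONCE(res->max) ||
			    new_usage > misc_res_capacity[type])
				goto err_charge;
		}
		return 0;

	err_charge:
		/* Unwind the charges already applied, including the failing level */
		for (j = cg; j != i; j = parent_misc(j))
			atomic_long_sub(amount, &j->res[type].usage);
		atomic_long_sub(amount, &i->res[type].usage);
		return -EBUSY;
	}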
> >
> >
> > That's true. I was thinking there is no need to have them done in separate calls.
> > The caller has to check the return value for the epc_cg instance first, then
> > check the result of try_charge. But there is really only one caller,
> > sgx_alloc_epc_page() below, so I don't have strong opinions now.
> >
> > With them separate, the checks will look like this:
> >
> > 	/* NULL means the cgroup is disabled; continue with allocation */
> > 	epc_cg = sgx_get_current_epc_cg();
> > 	if (epc_cg) {
> > 		ret = sgx_epc_cgroup_try_charge(epc_cg);
> > 		if (ret)
> > 			return ret;
> > 	}
> > 	/* continue... */
> >
> > I can go either way.

Let's keep this aligned with the other _try_charge() variants: return an 'int'
to indicate whether the charge is done or not.

> >
> > >
> > > > > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > > > > >  {
> > > > > >  	struct sgx_epc_page *page;
> > > > > > +	struct sgx_epc_cgroup *epc_cg;
> > > > > > +
> > > > > > +	epc_cg = sgx_epc_cgroup_try_charge();
> > > > > > +	if (IS_ERR(epc_cg))
> > > > > > +		return ERR_CAST(epc_cg);
> > > > > >
> > > > > >  	for ( ; ; ) {
> > > > > >  		page = __sgx_alloc_epc_page();
> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > > > > >  			break;
> > > > > >  	}
> > > > > >
> > > > > > +	/*
> > > > > > +	 * Need to do a global reclamation if cgroup was not full but free
> > > > > > +	 * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> > > > > > +	 */
> > > > > >  	sgx_reclaim_pages();
> > > > >
> > > > > What's the final behaviour? IIUC it should be reclaiming from the *current* EPC
> > > > > cgroup? If so shouldn't we just pass the @epc_cg to it here?
> > > > >
> > > > > I think we can make this patch a "structure" patch w/o actually having EPC
> > > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always returns NULL.
> > > > >
> > > > > And we can have one patch to change sgx_reclaim_pages() to take a
> > > > > 'struct sgx_epc_lru_list *' as argument:
> > > > >
> > > > > 	void sgx_reclaim_pages_lru(struct sgx_epc_lru_list *lru)
> > > > > 	{
> > > > > 		...
> > > > > 	}
> > > > >
> > > > > Then here we can have something like:
> > > > >
> > > > > 	void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > > > > 	{
> > > > > 		struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru;
> > > > >
> > > > > 		sgx_reclaim_pages_lru(lru);
> > > > > 	}
> > > > >
> > > > > Makes sense?
> > > >
> > > > This is purely global reclamation. No cgroup involved.
> > >
> > > Again why? Here you are allocating one EPC page for an enclave in a particular EPC
> > > cgroup. When that fails, shouldn't you try only to reclaim from the *current*
> > > EPC cgroup? Or at least you should try to reclaim from the *current* EPC cgroup
> > > first?
> >
> > Later, sgx_epc_cg_try_charge() will take a 'reclaim' flag; if true, the cgroup
> > reclaims synchronously, otherwise in the background, and it returns -EBUSY in
> > the latter case. sgx_alloc_epc_page() also returns early if no valid epc_cg
> > pointer is returned.
> >
> > All reclamation for the *current* cgroup is done in sgx_epc_cg_try_charge().

This is fine, but I believe my question above is about where to reclaim when
"allocation" fails, not when "try charge" fails.

And as for "reclaim for the current cgroup when charge fails", I don't think it's
even necessary in this initial implementation of the EPC cgroup. You can just fail
the allocation when the charge fails (reaching the limit). Trying to reclaim when
the limit is hit can be done later.

Please see Dave and Michal's replies here:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

> >
> > So, by reaching this point, a valid epc_cg pointer was returned, which means
> > allocation is allowed for the cgroup (it has reclaimed if necessary, and its
> > usage is not above the limit after charging).

I found that the memory cgroup uses different logic -- allocate first and then
charge.

For instance:

	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
	{
		......

		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
		if (!folio)
			goto oom;

		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
			goto oom_free_page;

		......
	}

Why does EPC need to "charge first" and "then allocate"?

> >
> > But the system-level free count may be low (e.g., the limits of all cgroups
> > may add up to more than the capacity), so we need to do a global reclamation
> > here, which may involve reclaiming a few pages (from the current or other
> > groups) so the system can be in a performant state with a minimal free count
> > (the current behavior of ksgxd).
>
> I should have stuck to the original comment added in the code. Actually
> __sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the real
> reason for the global reclaim.

I don't see how this can work. With the EPC cgroup, likely all EPC pages will go
to the individual LRU of each cgroup, and the sgx_global_lru will basically be
empty. How can you reclaim from the sgx_global_lru?

I am probably missing something, but anyway, it looks like some high-level design
material is really missing in the changelog.
On Mon, 20 Nov 2023 11:16:42 +0800, Huang, Kai <kai.huang@intel.com> wrote:

>> >
>> >
>> > That's true. I was thinking there is no need to have them done in separate calls.
>> > The caller has to check the return value for the epc_cg instance first, then
>> > check the result of try_charge. But there is really only one caller,
>> > sgx_alloc_epc_page() below, so I don't have strong opinions now.
>> >
>> > With them separate, the checks will look like this:
>> >
>> > 	/* NULL means the cgroup is disabled; continue with allocation */
>> > 	epc_cg = sgx_get_current_epc_cg();
>> > 	if (epc_cg) {
>> > 		ret = sgx_epc_cgroup_try_charge(epc_cg);
>> > 		if (ret)
>> > 			return ret;
>> > 	}
>> > 	/* continue... */
>> >
>> > I can go either way.
>
> Let's keep this aligned with the other _try_charge() variants: return an 'int'
> to indicate whether the charge is done or not.

Fine with me if there are no objections from the maintainers.

>> >
>> > >
>> > > > > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> > > > > >  {
>> > > > > >  	struct sgx_epc_page *page;
>> > > > > > +	struct sgx_epc_cgroup *epc_cg;
>> > > > > > +
>> > > > > > +	epc_cg = sgx_epc_cgroup_try_charge();
>> > > > > > +	if (IS_ERR(epc_cg))
>> > > > > > +		return ERR_CAST(epc_cg);
>> > > > > >
>> > > > > >  	for ( ; ; ) {
>> > > > > >  		page = __sgx_alloc_epc_page();
>> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> > > > > >  			break;
>> > > > > >  	}
>> > > > > >
>> > > > > > +	/*
>> > > > > > +	 * Need to do a global reclamation if cgroup was not full but free
>> > > > > > +	 * physical pages run out, causing __sgx_alloc_epc_page() to fail.
>> > > > > > +	 */
>> > > > > >  	sgx_reclaim_pages();
>> > > > >
>> > > > > What's the final behaviour? IIUC it should be reclaiming from the *current* EPC
>> > > > > cgroup? If so shouldn't we just pass the @epc_cg to it here?
>> > > > >
>> > > > > I think we can make this patch a "structure" patch w/o actually having EPC
>> > > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always returns NULL.
>> > > > >
>> > > > > And we can have one patch to change sgx_reclaim_pages() to take a
>> > > > > 'struct sgx_epc_lru_list *' as argument:
>> > > > >
>> > > > > 	void sgx_reclaim_pages_lru(struct sgx_epc_lru_list *lru)
>> > > > > 	{
>> > > > > 		...
>> > > > > 	}
>> > > > >
>> > > > > Then here we can have something like:
>> > > > >
>> > > > > 	void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>> > > > > 	{
>> > > > > 		struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru;
>> > > > >
>> > > > > 		sgx_reclaim_pages_lru(lru);
>> > > > > 	}
>> > > > >
>> > > > > Makes sense?

The reason we 'isolate' first and then do the real 'reclaim' is that the actual
reclaim is expensive, especially for EBLOCK, ETRACK, etc.

>> > > >
>> > > > This is purely global reclamation. No cgroup involved.
>> > >
>> > > Again why? Here you are allocating one EPC page for an enclave in a particular EPC
>> > > cgroup. When that fails, shouldn't you try only to reclaim from the *current*
>> > > EPC cgroup? Or at least you should try to reclaim from the *current* EPC cgroup
>> > > first?
>> >
>> > Later, sgx_epc_cg_try_charge() will take a 'reclaim' flag; if true, the cgroup
>> > reclaims synchronously, otherwise in the background, and it returns -EBUSY in
>> > the latter case. sgx_alloc_epc_page() also returns early if no valid epc_cg
>> > pointer is returned.
>> >
>> > All reclamation for the *current* cgroup is done in sgx_epc_cg_try_charge().
>
> This is fine, but I believe my question above is about where to reclaim when
> "allocation" fails, not when "try charge" fails.

I meant "will be done" :-) Currently there is no reclaim in try_charge.

> And as for "reclaim for the current cgroup when charge fails", I don't think it's
> even necessary in this initial implementation of the EPC cgroup. You can just fail
> the allocation when the charge fails (reaching the limit). Trying to reclaim when
> the limit is hit can be done later.

Yes. It is done later.

> Please see Dave and Michal's replies here:
>
> https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
>
>> >
>> > So, by reaching this point, a valid epc_cg pointer was returned, which means
>> > allocation is allowed for the cgroup (it has reclaimed if necessary, and its
>> > usage is not above the limit after charging).
>
> I found that the memory cgroup uses different logic -- allocate first and then
> charge.
>
> For instance:
>
> 	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> 	{
> 		......
>
> 		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> 		if (!folio)
> 			goto oom;
>
> 		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> 			goto oom_free_page;
>
> 		......
> 	}
>
> Why does EPC need to "charge first" and "then allocate"?

EPC allocation can involve reclaiming, which is more expensive than regular RAM
reclamation. Also, the misc controller only has a hard "max" limit.

Thanks
Haitao

>> >
>> > But the system-level free count may be low (e.g., the limits of all cgroups
>> > may add up to more than the capacity), so we need to do a global reclamation
>> > here, which may involve reclaiming a few pages (from the current or other
>> > groups) so the system can be in a performant state with a minimal free count
>> > (the current behavior of ksgxd).
>>
>> I should have stuck to the original comment added in the code. Actually
>> __sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the real
>> reason for the global reclaim.
>
> I don't see how this can work. With the EPC cgroup, likely all EPC pages will go
> to the individual LRU of each cgroup, and the sgx_global_lru will basically be
> empty. How can you reclaim from the sgx_global_lru?

Currently, there is nothing in the cgroup LRUs; all EPC pages are in the global list.
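[Putting the two orderings side by side may help: below is the charge-first flow, condensed by the editor from the posted sgx_alloc_epc_page() above for contrast with the memcg allocate-first example; the reclaim/retry loop is elided.]

	/*
	 * Charge-first, condensed from the posted patch: a task over its
	 * cgroup limit fails immediately, before any expensive EPC reclaim
	 * (EBLOCK/ETRACK/EWB) is performed on its behalf; the charge is
	 * rolled back if the physical allocation itself then fails.
	 */
	epc_cg = sgx_epc_cgroup_try_charge();
	if (IS_ERR(epc_cg))
		return ERR_CAST(epc_cg);	/* over the limit: fail fast */

	page = __sgx_alloc_epc_page();		/* may trigger global reclaim */
	if (IS_ERR(page))
		sgx_epc_cgroup_uncharge(epc_cg);	/* roll back the charge */
	else
		page->epc_cg = epc_cg;		/* uncharged in sgx_free_epc_page() */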
On Mon, 27 Nov 2023 00:01:56 +0800, Haitao Huang <haitao.huang@linux.intel.com> wrote:

>>> > > > > Then here we can have something like:
>>> > > > >
>>> > > > > 	void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>>> > > > > 	{
>>> > > > > 		struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru;
>>> > > > >
>>> > > > > 		sgx_reclaim_pages_lru(lru);
>>> > > > > 	}
>>> > > > >
>>> > > > > Makes sense?
>
> The reason we 'isolate' first and then do the real 'reclaim' is that the actual
> reclaim is expensive, especially for EBLOCK, ETRACK, etc.

Sorry, this is out of context. It was meant to be in the other response, for patch 9/12.

Also, FYI, I'm traveling for a vacation and email access may be sporadic.

BR
Haitao