Message ID | 20240205210638.157741-11-haitao.huang@linux.intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-53953-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:168b:b0:106:860b:bbdd with SMTP id ma11csp1169731dyb; Mon, 5 Feb 2024 14:02:44 -0800 (PST) X-Google-Smtp-Source: AGHT+IFnUrdy7ASIKqAnaBz45QLwNckoYBR3/r6iH9VDKm3lVJSw5ZSrx/lJDf2UE1T4e1aabOX7 X-Received: by 2002:a17:906:c407:b0:a36:f314:c8d7 with SMTP id u7-20020a170906c40700b00a36f314c8d7mr473909ejz.77.1707170564771; Mon, 05 Feb 2024 14:02:44 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1707170564; cv=pass; d=google.com; s=arc-20160816; b=Kq1xQoeV2sxwe83TCyaVp5ULTx8tTwECVoe5HTlbCpXF2bZ+xyl4bPbL0C7RWP+Ngj 6yD78FhWqY5T0iGTm4xG4cc1LNEyrbkunSgsn74eKMTADSeb3od1QOmJhtjFY1Jrh3in LPj5hiG/c83f4AUJHkMlrKtqklRRgprEl9j0W9TQqAvFXwxjPWTgoaH5FYVBPnELCVio BRUgvR9AYioDaA9hsRASN+lXMJVfvk1W2ihEQZUBIintOWskM5at8FRocnsW/tA9z/La +JGUJa7eMhuhk+6FfZWK6fuL8LMdvhtJQvV7fUsQJpHN8Ww57prN6eiMDCnixgi5lvIs pSSQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :date:subject:cc:to:from:dkim-signature; bh=z8C7UT3qyOduVKVcVWWKtFwKyTsxN56jWo/OdfggFug=; fh=veYzAuxmNctIrCGbIILjYMWCfm82T4ltO8HCTd2WabQ=; b=kuoBkzoXfZlImvTf7H6kX2KxGBmdBKx8jwkETozTP60+L70AyZl+PbSIxkdSv4BkpE mI1foeM3gwfUXtydnQfDZ1VQinoVk8raU0N9xmEAWY1NPsIdvS0oDpKk4jpBzmxib0yM r25RjNvZVJ5TP8W2lGq5dZsVbJz1bCdi9KiMM7Ym1EnsKKKtiNcrngFKrszMe3T/SIqM OWz0JyENO3e+PKFiMtYlobSEDsz+FPhluubNyyLM+APC1QEcvzSMERKN05fMJSBkio2I JKxzdU+8RB8pqS8CF1sWSjgSlUlhHb3yaV4Axl1hP4ahiIBEy3d+hxKz2frlM9hH8SDE buJg==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=AUy9qQab; arc=pass (i=1 dkim=pass dkdomain=intel.com dmarc=pass fromdomain=linux.intel.com); spf=pass (google.com: domain of linux-kernel+bounces-53953-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-53953-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Forwarded-Encrypted: i=1; AJvYcCWgzInNozEKjhLFwvIceCSvDMSVoNk2rSjeI6+Q7zS1yJ50CwIW8Cccp3L0Q7EqYg076lWHV92oZX/wJ44BGmq+9f827w== Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id qu13-20020a170907110d00b00a37905c1452si278091ejb.678.2024.02.05.14.02.44 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 05 Feb 2024 14:02:44 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-53953-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=AUy9qQab; arc=pass (i=1 dkim=pass dkdomain=intel.com dmarc=pass fromdomain=linux.intel.com); spf=pass (google.com: domain of linux-kernel+bounces-53953-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-53953-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 329861F2BDD6 for <ouuuleilei@gmail.com>; Mon, 5 Feb 2024 22:02:44 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C982A14F9CF; Mon, 5 Feb 2024 21:06:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="AUy9qQab" Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3DA951474D3; Mon, 5 Feb 2024 21:06:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707167210; cv=none; b=jYWz5QZARFakLuIihRAMJqsFkA2+gaJbkFPxTAnkNY2G1M3cC4AH6LI1UMFanUNk+jo3jNuNAx2UYnY2rrB9Nf/QViqniIdIDnSZfUnvsxbVsSnOTLqcm8QSg+aStDZ5eKq6m5zHdTOev8kdXUytgy7eT0vy3SXZshfkE1e4kZI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707167210; c=relaxed/simple; bh=WxK5jOiLmsHbMCLwOxK9jLktiGaCw5o0cemgelWhecY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=YJnqxlth70SAzlPeAkiF2VJ85yW9SFN02+V9sWei5bDxjsSvMgpOfo2SMd6lFrTJSr/mrx+uVON7P2NkKPFZy9qri/7sPDn4TOQZ+KKMamjt0+eJNSdTTUMeo2YGl6z8itdKhR4xR2SSqYFLQf74N3iF0V0HiHsBkyXCYlVnZ6Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=AUy9qQab; arc=none smtp.client-ip=198.175.65.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=linux.intel.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1707167208; x=1738703208; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=WxK5jOiLmsHbMCLwOxK9jLktiGaCw5o0cemgelWhecY=; b=AUy9qQab7y89njuytZRv12iD5Q5G+FmsIvNn/FTWnmCVzenRlQHyBNbo Li4kq2cZq+Dk5DbXslQM4mISimiNHc3xVXReQks0bKieAR98E8igNHpyE h14l9u8Vs1KyRtgS6nOeQSmi6b0pW0IVy7TSMQZDY0Q5zxOYDRjjfRkpF R/Pw1mjf2WWYCvVZwOq/n7FzBE8avId9TuHGLJG3TR295C+GJTL3OeTDl XDPEN7bPhwy6/XJjhcrlGLiB5rxSUVSorfVbkxLvwORbJucw/9+lyUC/D +tWbtW6VlC5iO7VzmQGE7uu6XVxuvmD/gSj/XXNbvPQKVYeUaQwDNYQaT g==; X-IronPort-AV: E=McAfee;i="6600,9927,10975"; a="11960440" X-IronPort-AV: E=Sophos;i="6.05,245,1701158400"; d="scan'208";a="11960440" Received: from orviesa001.jf.intel.com ([10.64.159.141]) by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 13:06:43 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.05,245,1701158400"; d="scan'208";a="38245641" Received: from b4969161e530.jf.intel.com ([10.165.56.46]) by orviesa001.jf.intel.com with ESMTP; 05 Feb 2024 13:06:42 -0800 From: Haitao Huang <haitao.huang@linux.intel.com> To: jarkko@kernel.org, dave.hansen@linux.intel.com, tj@kernel.org, mkoutny@suse.com, linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org, x86@kernel.org, cgroups@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, sohil.mehta@intel.com, tim.c.chen@linux.intel.com Cc: zhiquan1.li@intel.com, kristen@linux.intel.com, seanjc@google.com, zhanb@microsoft.com, anakrish@microsoft.com, mikko.ylinen@linux.intel.com, yangjie@microsoft.com, chrisyan@microsoft.com Subject: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge() Date: Mon, 5 Feb 2024 13:06:33 -0800 Message-Id: <20240205210638.157741-11-haitao.huang@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20240205210638.157741-1-haitao.huang@linux.intel.com> References: <20240205210638.157741-1-haitao.huang@linux.intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1790098082183523190 X-GMAIL-MSGID: 1790098082183523190 |
Series |
Add Cgroup support for SGX EPC memory
|
|
Commit Message
Haitao Huang
Feb. 5, 2024, 9:06 p.m. UTC
From: Kristen Carlson Accardi <kristen@linux.intel.com> When the EPC usage of a cgroup is near its limit, the cgroup needs to reclaim pages used in the same cgroup to make room for new allocations. This is analogous to the behavior that the global reclaimer is triggered when the global usage is close to total available EPC. Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate whether synchronous reclaim is allowed or not. And trigger the synchronous/asynchronous reclamation flow accordingly. Note at this point, all reclaimable EPC pages are still tracked in the global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation is activated yet. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++-- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- arch/x86/kernel/cpu/sgx/main.c | 2 +- 3 files changed, 27 insertions(+), 5 deletions(-)
Comments
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: > From: Kristen Carlson Accardi <kristen@linux.intel.com> > > When the EPC usage of a cgroup is near its limit, the cgroup needs to > reclaim pages used in the same cgroup to make room for new allocations. > This is analogous to the behavior that the global reclaimer is triggered > when the global usage is close to total available EPC. > > Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate > whether synchronous reclaim is allowed or not. And trigger the > synchronous/asynchronous reclamation flow accordingly. > > Note at this point, all reclaimable EPC pages are still tracked in the > global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation > is activated yet. > > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > --- > V7: > - Split this out from the big patch, #10 in V6. (Dave, Kai) > --- > arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++-- > arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- > arch/x86/kernel/cpu/sgx/main.c | 2 +- > 3 files changed, 27 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > index d399fda2b55e..abf74fdb12b4 100644 > --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > @@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) > /** > * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page > * @epc_cg: The EPC cgroup to be charged for the page. > + * @reclaim: Whether or not synchronous reclaim is allowed > * Return: > * * %0 - If successfully charged. > * * -errno - for failures. > */ > -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) > { > - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > + for (;;) { > + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > + PAGE_SIZE)) > + break; > + > + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > + return -ENOMEM; > + + if (signal_pending(current)) > + return -ERESTARTSYS; > + > + if (!reclaim) { > + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); > + return -EBUSY; > + } > + > + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) > + /* All pages were too young to reclaim, try again a little later */ > + schedule(); This will be total pain to backtrack after a while when something needs to be changed so there definitely should be inline comments addressing each branch condition. I'd rethink this as: 1. Create static __sgx_epc_cgroup_try_charge() for addressing single iteration with the new "reclaim" parameter. 2. Add a new sgx_epc_group_try_charge_reclaim() function. There's a bit of redundancy with sgx_epc_cgroup_try_charge() and sgx_epc_cgroup_try_charge_reclaim() because both have almost the same loop calling internal __sgx_epc_cgroup_try_charge() with different parameters. That is totally acceptable. Please also add my suggested-by. BR, Jarkko BR, Jarkko
Hi Jarkko On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org> wrote: > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: >> From: Kristen Carlson Accardi <kristen@linux.intel.com> >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to >> reclaim pages used in the same cgroup to make room for new allocations. >> This is analogous to the behavior that the global reclaimer is triggered >> when the global usage is close to total available EPC. >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate >> whether synchronous reclaim is allowed or not. And trigger the >> synchronous/asynchronous reclamation flow accordingly. >> >> Note at this point, all reclaimable EPC pages are still tracked in the >> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation >> is activated yet. >> >> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> >> --- >> V7: >> - Split this out from the big patch, #10 in V6. (Dave, Kai) >> --- >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++-- >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- >> arch/x86/kernel/cpu/sgx/main.c | 2 +- >> 3 files changed, 27 insertions(+), 5 deletions(-) >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> index d399fda2b55e..abf74fdb12b4 100644 >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> @@ -184,13 +184,35 @@ static void >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) >> /** >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC >> page >> * @epc_cg: The EPC cgroup to be charged for the page. >> + * @reclaim: Whether or not synchronous reclaim is allowed >> * Return: >> * * %0 - If successfully charged. >> * * -errno - for failures. >> */ >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool >> reclaim) >> { >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); >> + for (;;) { >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >> + PAGE_SIZE)) >> + break; >> + >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) >> + return -ENOMEM; >> + + if (signal_pending(current)) >> + return -ERESTARTSYS; >> + >> + if (!reclaim) { >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); >> + return -EBUSY; >> + } >> + >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) >> + /* All pages were too young to reclaim, try again a little later */ >> + schedule(); > > This will be total pain to backtrack after a while when something > needs to be changed so there definitely should be inline comments > addressing each branch condition. > > I'd rethink this as: > > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single > iteration with the new "reclaim" parameter. > 2. Add a new sgx_epc_group_try_charge_reclaim() function. > > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and > sgx_epc_cgroup_try_charge_reclaim() because both have almost the > same loop calling internal __sgx_epc_cgroup_try_charge() with > different parameters. That is totally acceptable. > > Please also add my suggested-by. > > BR, Jarkko > > BR, Jarkko > For #2: The only caller of this function, sgx_alloc_epc_page(), has the same boolean which is passed into this this function. If we separate it into sgx_epc_cgroup_try_charge() and sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the if/else branches. So separation here seems not help? For #1: If we don't do #2, It seems overkill at the moment for such a short function. How about we add inline comments for each branch for now, and if later there are more branches and the function become too long we add __sgx_epc_cgroup_try_charge() as you suggested? Thanks Haitao
On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote: > Hi Jarkko > > On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org> > wrote: > > > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: > >> From: Kristen Carlson Accardi <kristen@linux.intel.com> > >> > >> When the EPC usage of a cgroup is near its limit, the cgroup needs to > >> reclaim pages used in the same cgroup to make room for new allocations. > >> This is analogous to the behavior that the global reclaimer is triggered > >> when the global usage is close to total available EPC. > >> > >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate > >> whether synchronous reclaim is allowed or not. And trigger the > >> synchronous/asynchronous reclamation flow accordingly. > >> > >> Note at this point, all reclaimable EPC pages are still tracked in the > >> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation > >> is activated yet. > >> > >> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> > >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> > >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > >> --- > >> V7: > >> - Split this out from the big patch, #10 in V6. (Dave, Kai) > >> --- > >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++-- > >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- > >> arch/x86/kernel/cpu/sgx/main.c | 2 +- > >> 3 files changed, 27 insertions(+), 5 deletions(-) > >> > >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> index d399fda2b55e..abf74fdb12b4 100644 > >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> @@ -184,13 +184,35 @@ static void > >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) > >> /** > >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC > >> page > >> * @epc_cg: The EPC cgroup to be charged for the page. > >> + * @reclaim: Whether or not synchronous reclaim is allowed > >> * Return: > >> * * %0 - If successfully charged. > >> * * -errno - for failures. > >> */ > >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool > >> reclaim) > >> { > >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > >> + for (;;) { > >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > >> + PAGE_SIZE)) > >> + break; > >> + > >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > >> + return -ENOMEM; > >> + + if (signal_pending(current)) > >> + return -ERESTARTSYS; > >> + > >> + if (!reclaim) { > >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); > >> + return -EBUSY; > >> + } > >> + > >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) > >> + /* All pages were too young to reclaim, try again a little later */ > >> + schedule(); > > > > This will be total pain to backtrack after a while when something > > needs to be changed so there definitely should be inline comments > > addressing each branch condition. > > > > I'd rethink this as: > > > > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single > > iteration with the new "reclaim" parameter. > > 2. Add a new sgx_epc_group_try_charge_reclaim() function. > > > > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and > > sgx_epc_cgroup_try_charge_reclaim() because both have almost the > > same loop calling internal __sgx_epc_cgroup_try_charge() with > > different parameters. That is totally acceptable. > > > > Please also add my suggested-by. > > > > BR, Jarkko > > > > BR, Jarkko > > > For #2: > The only caller of this function, sgx_alloc_epc_page(), has the same > boolean which is passed into this this function. I know. This would be good opportunity to fix that up. Large patch sets should try to make the space for its feature best possible and thus also clean up the code base overally. > If we separate it into sgx_epc_cgroup_try_charge() and > sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the > if/else branches. So separation here seems not help? Of course it does. It makes the code in that location self-documenting and easier to remember what it does. BR, Jarkko
On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen <jarkko@kernel.org> wrote: > On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote: >> Hi Jarkko >> >> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org> >> wrote: >> >> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: >> >> From: Kristen Carlson Accardi <kristen@linux.intel.com> >> >> >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to >> >> reclaim pages used in the same cgroup to make room for new >> allocations. >> >> This is analogous to the behavior that the global reclaimer is >> triggered >> >> when the global usage is close to total available EPC. >> >> >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate >> >> whether synchronous reclaim is allowed or not. And trigger the >> >> synchronous/asynchronous reclamation flow accordingly. >> >> >> >> Note at this point, all reclaimable EPC pages are still tracked in >> the >> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup >> reclamation >> >> is activated yet. >> >> >> >> Co-developed-by: Sean Christopherson >> <sean.j.christopherson@intel.com> >> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> >> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> >> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> >> >> --- >> >> V7: >> >> - Split this out from the big patch, #10 in V6. (Dave, Kai) >> >> --- >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++-- >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- >> >> arch/x86/kernel/cpu/sgx/main.c | 2 +- >> >> 3 files changed, 27 insertions(+), 5 deletions(-) >> >> >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> >> index d399fda2b55e..abf74fdb12b4 100644 >> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c >> >> @@ -184,13 +184,35 @@ static void >> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) >> >> /** >> >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single >> EPC >> >> page >> >> * @epc_cg: The EPC cgroup to be charged for the page. >> >> + * @reclaim: Whether or not synchronous reclaim is allowed >> >> * Return: >> >> * * %0 - If successfully charged. >> >> * * -errno - for failures. >> >> */ >> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool >> >> reclaim) >> >> { >> >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >> PAGE_SIZE); >> >> + for (;;) { >> >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >> >> + PAGE_SIZE)) >> >> + break; >> >> + >> >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) >> >> + return -ENOMEM; >> >> + + if (signal_pending(current)) >> >> + return -ERESTARTSYS; >> >> + >> >> + if (!reclaim) { >> >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); >> >> + return -EBUSY; >> >> + } >> >> + >> >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) >> >> + /* All pages were too young to reclaim, try again a little later >> */ >> >> + schedule(); >> > >> > This will be total pain to backtrack after a while when something >> > needs to be changed so there definitely should be inline comments >> > addressing each branch condition. >> > >> > I'd rethink this as: >> > >> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single >> > iteration with the new "reclaim" parameter. >> > 2. Add a new sgx_epc_group_try_charge_reclaim() function. >> > >> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and >> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the >> > same loop calling internal __sgx_epc_cgroup_try_charge() with >> > different parameters. That is totally acceptable. >> > >> > Please also add my suggested-by. >> > >> > BR, Jarkko >> > >> > BR, Jarkko >> > >> For #2: >> The only caller of this function, sgx_alloc_epc_page(), has the same >> boolean which is passed into this this function. > > I know. This would be good opportunity to fix that up. Large patch > sets should try to make the space for its feature best possible and > thus also clean up the code base overally. > >> If we separate it into sgx_epc_cgroup_try_charge() and >> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the >> if/else branches. So separation here seems not help? > > Of course it does. It makes the code in that location self-documenting > and easier to remember what it does. > > BR, Jarkko > Please let me know if this aligns with your suggestion. static int ___sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) { if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE)) return 0; if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) return -ENOMEM; if (signal_pending(current)) return -ERESTARTSYS; return -EBUSY; } /** * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single page * @epc_cg: The EPC cgroup to be charged for the page. * * Try to reclaim pages in the background if the group reaches its limit and * there are reclaimable pages in the group. * Return: * * %0 - If successfully charged. * * -errno - for failures. */ int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) { int ret = ___sgx_epc_cgroup_try_charge(epc_cg); if (ret == -EBUSY) queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); return ret; } /** * sgx_epc_cgroup_try_charge_reclaim() - try to charge cgroup for a single page * @epc_cg: The EPC cgroup to be charged for the page. * * Try to reclaim pages directly if the group reaches its limit and there are * reclaimable pages in the group. * Return: * * %0 - If successfully charged. * * -errno - for failures. */ int sgx_epc_cgroup_try_charge_reclaim(struct sgx_epc_cgroup *epc_cg) { int ret; for (;;) { ret = ___sgx_epc_cgroup_try_charge(epc_cg); if (ret != -EBUSY) return ret; if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, current->mm)) /* All pages were too young to reclaim, try again a little later */ schedule(); } return 0; } It is a little more involved to remove the boolean for sgx_alloc_epc_page() and its callers like sgx_encl_grow(), sgx_alloc_va_page(). I'll send a separate patch for comments. Thanks Haitao
On 2/19/24 07:39, Haitao Huang wrote: > Remove all boolean parameters for 'reclaim' from the function > sgx_alloc_epc_page() and its callers by making two versions of each > function. > > Also opportunistically remove non-static declaration of > __sgx_alloc_epc_page() and a typo > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > Suggested-by: Jarkko Sakkinen <jarkko@kernel.org> > --- > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------ > arch/x86/kernel/cpu/sgx/encl.h | 6 ++- > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++--- > arch/x86/kernel/cpu/sgx/main.c | 68 ++++++++++++++++++++++----------- > arch/x86/kernel/cpu/sgx/sgx.h | 4 +- > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > 6 files changed, 115 insertions(+), 44 deletions(-) Jarkko, did this turn out how you expected? I think passing around a function pointer to *only* communicate 1 bit of information is a _bit_ overkill here. Simply replacing the bool with: enum sgx_reclaim { SGX_NO_RECLAIM, SGX_DO_RECLAIM }; would do the same thing. Right? Are you sure you want a function pointer for this?
On Mon Feb 19, 2024 at 3:12 PM UTC, Haitao Huang wrote: > On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen <jarkko@kernel.org> > wrote: > > > On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote: > >> Hi Jarkko > >> > >> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org> > >> wrote: > >> > >> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: > >> >> From: Kristen Carlson Accardi <kristen@linux.intel.com> > >> >> > >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to > >> >> reclaim pages used in the same cgroup to make room for new > >> allocations. > >> >> This is analogous to the behavior that the global reclaimer is > >> triggered > >> >> when the global usage is close to total available EPC. > >> >> > >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate > >> >> whether synchronous reclaim is allowed or not. And trigger the > >> >> synchronous/asynchronous reclamation flow accordingly. > >> >> > >> >> Note at this point, all reclaimable EPC pages are still tracked in > >> the > >> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup > >> reclamation > >> >> is activated yet. > >> >> > >> >> Co-developed-by: Sean Christopherson > >> <sean.j.christopherson@intel.com> > >> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > >> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > >> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> > >> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > >> >> --- > >> >> V7: > >> >> - Split this out from the big patch, #10 in V6. (Dave, Kai) > >> >> --- > >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++-- > >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- > >> >> arch/x86/kernel/cpu/sgx/main.c | 2 +- > >> >> 3 files changed, 27 insertions(+), 5 deletions(-) > >> >> > >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> index d399fda2b55e..abf74fdb12b4 100644 > >> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> @@ -184,13 +184,35 @@ static void > >> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) > >> >> /** > >> >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single > >> EPC > >> >> page > >> >> * @epc_cg: The EPC cgroup to be charged for the page. > >> >> + * @reclaim: Whether or not synchronous reclaim is allowed > >> >> * Return: > >> >> * * %0 - If successfully charged. > >> >> * * -errno - for failures. > >> >> */ > >> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > >> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool > >> >> reclaim) > >> >> { > >> >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > >> PAGE_SIZE); > >> >> + for (;;) { > >> >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > >> >> + PAGE_SIZE)) > >> >> + break; > >> >> + > >> >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > >> >> + return -ENOMEM; > >> >> + + if (signal_pending(current)) > >> >> + return -ERESTARTSYS; > >> >> + > >> >> + if (!reclaim) { > >> >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); > >> >> + return -EBUSY; > >> >> + } > >> >> + > >> >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) > >> >> + /* All pages were too young to reclaim, try again a little later > >> */ > >> >> + schedule(); > >> > > >> > This will be total pain to backtrack after a while when something > >> > needs to be changed so there definitely should be inline comments > >> > addressing each branch condition. > >> > > >> > I'd rethink this as: > >> > > >> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single > >> > iteration with the new "reclaim" parameter. > >> > 2. Add a new sgx_epc_group_try_charge_reclaim() function. > >> > > >> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and > >> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the > >> > same loop calling internal __sgx_epc_cgroup_try_charge() with > >> > different parameters. That is totally acceptable. > >> > > >> > Please also add my suggested-by. > >> > > >> > BR, Jarkko > >> > > >> > BR, Jarkko > >> > > >> For #2: > >> The only caller of this function, sgx_alloc_epc_page(), has the same > >> boolean which is passed into this this function. > > > > I know. This would be good opportunity to fix that up. Large patch > > sets should try to make the space for its feature best possible and > > thus also clean up the code base overally. > > > >> If we separate it into sgx_epc_cgroup_try_charge() and > >> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the > >> if/else branches. So separation here seems not help? > > > > Of course it does. It makes the code in that location self-documenting > > and easier to remember what it does. > > > > BR, Jarkko > > > > Please let me know if this aligns with your suggestion. > > > static int ___sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > { > if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > PAGE_SIZE)) > return 0; > > if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > return -ENOMEM; > > if (signal_pending(current)) > return -ERESTARTSYS; > > return -EBUSY; > } > > /** > * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single page > * @epc_cg: The EPC cgroup to be charged for the page. > * > * Try to reclaim pages in the background if the group reaches its limit > and > * there are reclaimable pages in the group. > * Return: > * * %0 - If successfully charged. > * * -errno - for failures. > */ > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > { > int ret = ___sgx_epc_cgroup_try_charge(epc_cg); > > if (ret == -EBUSY) > queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); > > return ret; > } > > /** > * sgx_epc_cgroup_try_charge_reclaim() - try to charge cgroup for a single > page > * @epc_cg: The EPC cgroup to be charged for the page. > * > * Try to reclaim pages directly if the group reaches its limit and there > are > * reclaimable pages in the group. > * Return: > * * %0 - If successfully charged. > * * -errno - for failures. > */ > int sgx_epc_cgroup_try_charge_reclaim(struct sgx_epc_cgroup *epc_cg) > { > int ret; > > for (;;) { > ret = ___sgx_epc_cgroup_try_charge(epc_cg); > if (ret != -EBUSY) > return ret; > > if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, current->mm)) > /* All pages were too young to reclaim, try again > a little later */ > schedule(); > } > > return 0; > } > > It is a little more involved to remove the boolean for > sgx_alloc_epc_page() and its callers like sgx_encl_grow(), > sgx_alloc_va_page(). I'll send a separate patch for comments. With quick look, it is towards right direction for sure. BR, Jarkko
On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote: > On 2/19/24 07:39, Haitao Huang wrote: > > Remove all boolean parameters for 'reclaim' from the function > > sgx_alloc_epc_page() and its callers by making two versions of each > > function. > > > > Also opportunistically remove non-static declaration of > > __sgx_alloc_epc_page() and a typo > > > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > > Suggested-by: Jarkko Sakkinen <jarkko@kernel.org> > > --- > > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------ > > arch/x86/kernel/cpu/sgx/encl.h | 6 ++- > > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++--- > > arch/x86/kernel/cpu/sgx/main.c | 68 ++++++++++++++++++++++----------- > > arch/x86/kernel/cpu/sgx/sgx.h | 4 +- > > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > > 6 files changed, 115 insertions(+), 44 deletions(-) > > Jarkko, did this turn out how you expected? > > I think passing around a function pointer to *only* communicate 1 bit of > information is a _bit_ overkill here. > > Simply replacing the bool with: > > enum sgx_reclaim { > SGX_NO_RECLAIM, > SGX_DO_RECLAIM > }; > > would do the same thing. Right? > > Are you sure you want a function pointer for this? To look this in context I drafted quickly two branches representing imaginary next version of the patch set. I guess this would simpler and totally sufficient approach. With this approach I'd then change also: [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality And add the enum-parameter already in that patch with just "no reclaim" enum. I.e. then 10/15 will add only "do reclaim" and the new functionality. BR, Jarkko
On Mon, 19 Feb 2024 14:42:29 -0600, Jarkko Sakkinen <jarkko@kernel.org> wrote: > On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote: >> On 2/19/24 07:39, Haitao Huang wrote: >> > Remove all boolean parameters for 'reclaim' from the function >> > sgx_alloc_epc_page() and its callers by making two versions of each >> > function. >> > >> > Also opportunistically remove non-static declaration of >> > __sgx_alloc_epc_page() and a typo >> > >> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> >> > Suggested-by: Jarkko Sakkinen <jarkko@kernel.org> >> > --- >> > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------ >> > arch/x86/kernel/cpu/sgx/encl.h | 6 ++- >> > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++--- >> > arch/x86/kernel/cpu/sgx/main.c | 68 >> ++++++++++++++++++++++----------- >> > arch/x86/kernel/cpu/sgx/sgx.h | 4 +- >> > arch/x86/kernel/cpu/sgx/virt.c | 2 +- >> > 6 files changed, 115 insertions(+), 44 deletions(-) >> >> Jarkko, did this turn out how you expected? >> >> I think passing around a function pointer to *only* communicate 1 bit of >> information is a _bit_ overkill here. >> >> Simply replacing the bool with: >> >> enum sgx_reclaim { >> SGX_NO_RECLAIM, >> SGX_DO_RECLAIM >> }; >> >> would do the same thing. Right? >> >> Are you sure you want a function pointer for this? > > To look this in context I drafted quickly two branches representing > imaginary next version of the patch set. > > I guess this would simpler and totally sufficient approach. > > With this approach I'd then change also: > > [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality > > And add the enum-parameter already in that patch with just "no reclaim" > enum. I.e. then 10/15 will add only "do reclaim" and the new > functionality. > > BR, Jarkko > Thanks. My understanding is: 1) For this patch, replace the boolean with the enum as Dave suggested. No two versions of the same functions. And this is a prerequisite for the cgroup series, positioned before [PATCH v9 04/15] 2) For [PATCH v9 04/15], pass a hard coded SGX_NO_RECLAIM to sgx_epc_cg_try_charge() from sgx_alloc_epc_page(). 3) For [PATCH v9 10/15], remove the hard coded value, pass the reclaim enum parameter value from sgx_alloc_epc_page() to sgx_epc_cg_try_charge() and add the reclaim logic. I'll send patches soon. But please let me know if I misunderstood. Thanks Haitao
On Mon Feb 19, 2024 at 10:25 PM UTC, Haitao Huang wrote: > On Mon, 19 Feb 2024 14:42:29 -0600, Jarkko Sakkinen <jarkko@kernel.org> > wrote: > > > On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote: > >> On 2/19/24 07:39, Haitao Huang wrote: > >> > Remove all boolean parameters for 'reclaim' from the function > >> > sgx_alloc_epc_page() and its callers by making two versions of each > >> > function. > >> > > >> > Also opportunistically remove non-static declaration of > >> > __sgx_alloc_epc_page() and a typo > >> > > >> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > >> > Suggested-by: Jarkko Sakkinen <jarkko@kernel.org> > >> > --- > >> > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------ > >> > arch/x86/kernel/cpu/sgx/encl.h | 6 ++- > >> > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++--- > >> > arch/x86/kernel/cpu/sgx/main.c | 68 > >> ++++++++++++++++++++++----------- > >> > arch/x86/kernel/cpu/sgx/sgx.h | 4 +- > >> > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > >> > 6 files changed, 115 insertions(+), 44 deletions(-) > >> > >> Jarkko, did this turn out how you expected? > >> > >> I think passing around a function pointer to *only* communicate 1 bit of > >> information is a _bit_ overkill here. > >> > >> Simply replacing the bool with: > >> > >> enum sgx_reclaim { > >> SGX_NO_RECLAIM, > >> SGX_DO_RECLAIM > >> }; > >> > >> would do the same thing. Right? > >> > >> Are you sure you want a function pointer for this? > > > > To look this in context I drafted quickly two branches representing > > imaginary next version of the patch set. > > > > I guess this would simpler and totally sufficient approach. > > > > With this approach I'd then change also: > > > > [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality > > > > And add the enum-parameter already in that patch with just "no reclaim" > > enum. I.e. then 10/15 will add only "do reclaim" and the new > > functionality. > > > > BR, Jarkko > > > > Thanks. My understanding is: > > 1) For this patch, replace the boolean with the enum as Dave suggested. No > two versions of the same functions. And this is a prerequisite for the > cgroup series, positioned before [PATCH v9 04/15] > > 2) For [PATCH v9 04/15], pass a hard coded SGX_NO_RECLAIM to > sgx_epc_cg_try_charge() from sgx_alloc_epc_page(). Yup, this will make the whole patch set also a bit leaner as the API does not change in the middle. > > 3) For [PATCH v9 10/15], remove the hard coded value, pass the reclaim > enum parameter value from sgx_alloc_epc_page() to sgx_epc_cg_try_charge() > and add the reclaim logic. > > I'll send patches soon. But please let me know if I misunderstood. BR, Jarkko
> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) > { > - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > + for (;;) { > + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > + PAGE_SIZE)) > + break; > + > + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > + return -ENOMEM; > + > + if (signal_pending(current)) > + return -ERESTARTSYS; > + > + if (!reclaim) { > + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); > + return -EBUSY; > + } > + > + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) > + /* All pages were too young to reclaim, try again a little later */ > + schedule(); > + } > + > + return 0; > } > Seems this code change is 90% similar to the existing code in the sgx_alloc_epc_page(): ... for ( ; ; ) { page = __sgx_alloc_epc_page(); if (!IS_ERR(page)) { page->owner = owner; break; } if (list_empty(&sgx_active_page_list)) return ERR_PTR(-ENOMEM); if (!reclaim) { page = ERR_PTR(-EBUSY); break; } if (signal_pending(current)) { page = ERR_PTR(-ERESTARTSYS); break; } sgx_reclaim_pages(); cond_resched(); } ... Is it better to move the logic/code change in try_charge() out to sgx_alloc_epc_page() to unify them? IIUC, the logic is quite similar: When you either failed to allocate one page, or failed to charge one page, you try to reclaim EPC page(s) from the current EPC cgroup, either directly or indirectly. No?
On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <kai.huang@intel.com> wrote: > >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool >> reclaim) >> { >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); >> + for (;;) { >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >> + PAGE_SIZE)) >> + break; >> + >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) >> + return -ENOMEM; >> + >> + if (signal_pending(current)) >> + return -ERESTARTSYS; >> + >> + if (!reclaim) { >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); >> + return -EBUSY; >> + } >> + >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) >> + /* All pages were too young to reclaim, try again a little later */ >> + schedule(); >> + } >> + >> + return 0; >> } >> > > Seems this code change is 90% similar to the existing code in the > sgx_alloc_epc_page(): > > ... > for ( ; ; ) { > page = __sgx_alloc_epc_page(); > if (!IS_ERR(page)) { > page->owner = owner; > break; > } > > if (list_empty(&sgx_active_page_list)) > return ERR_PTR(-ENOMEM); > > if (!reclaim) { > page = ERR_PTR(-EBUSY); > break; > } > > if (signal_pending(current)) { > page = ERR_PTR(-ERESTARTSYS); > break; > } > > sgx_reclaim_pages(); > cond_resched(); > } > ... > > Is it better to move the logic/code change in try_charge() out to > sgx_alloc_epc_page() to unify them? > > IIUC, the logic is quite similar: When you either failed to allocate one > page, > or failed to charge one page, you try to reclaim EPC page(s) from the > current > EPC cgroup, either directly or indirectly. > > No? Only these lines are the same: if (!reclaim) { page = ERR_PTR(-EBUSY); break; } if (signal_pending(current)) { page = ERR_PTR(-ERESTARTSYS); break; } In sgx_alloc_epc_page() we do global reclamation but here we do per-cgroup reclamation. That's why the logic of other lines is different though they look similar due to similar function names. For the global reclamation we need consider case in that cgroup is not enabled. Similarly list_empty(&sgx_active_page_list) would have to be changed to check root cgroup if cgroups enabled otherwise check global LRU. The (!reclaim) case is also different. So I don't see an obvious good way to abstract those to get meaningful savings. Thanks Haitao
On 23/02/2024 6:09 am, Haitao Huang wrote: > On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <kai.huang@intel.com> wrote: > >> >>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool >>> reclaim) >>> { >>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >>> PAGE_SIZE); >>> + for (;;) { >>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >>> + PAGE_SIZE)) >>> + break; >>> + >>> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) >>> + return -ENOMEM; >>> + >>> + if (signal_pending(current)) >>> + return -ERESTARTSYS; >>> + >>> + if (!reclaim) { >>> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); >>> + return -EBUSY; >>> + } >>> + >>> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) >>> + /* All pages were too young to reclaim, try again a >>> little later */ >>> + schedule(); >>> + } >>> + >>> + return 0; >>> } >>> >> >> Seems this code change is 90% similar to the existing code in the >> sgx_alloc_epc_page(): >> >> ... >> for ( ; ; ) { >> page = __sgx_alloc_epc_page(); >> if (!IS_ERR(page)) { >> page->owner = owner; >> break; >> } >> >> if (list_empty(&sgx_active_page_list)) >> return ERR_PTR(-ENOMEM); >> >> if (!reclaim) { >> page = ERR_PTR(-EBUSY); >> break; >> } >> >> if (signal_pending(current)) { >> page = ERR_PTR(-ERESTARTSYS); >> break; >> } >> >> sgx_reclaim_pages(); >> cond_resched(); >> } >> ... >> >> Is it better to move the logic/code change in try_charge() out to >> sgx_alloc_epc_page() to unify them? >> >> IIUC, the logic is quite similar: When you either failed to allocate >> one page, >> or failed to charge one page, you try to reclaim EPC page(s) from the >> current >> EPC cgroup, either directly or indirectly. >> >> No? > > Only these lines are the same: > if (!reclaim) { > page = ERR_PTR(-EBUSY); > break; > } > > if (signal_pending(current)) { > page = ERR_PTR(-ERESTARTSYS); > break; > } > > In sgx_alloc_epc_page() we do global reclamation but here we do > per-cgroup reclamation. But why? If we failed to allocate, shouldn't we try to reclaim from the _current_ EPC cgroup instead of global? E.g., I thought one enclave in one EPC cgroup requesting insane amount of EPC shouldn't impact enclaves inside other cgroups? That's why the logic of other lines is different > though they look similar due to similar function names. For the global > reclamation we need consider case in that cgroup is not enabled. > Similarly list_empty(&sgx_active_page_list) would have to be changed to > check root cgroup if cgroups enabled otherwise check global LRU. The > (!reclaim) case is also different. W/o getting clear on my above question, so far I am not convinced why such difference cannot be hide inside wrapper function(s). So I don't see an obvious good way > to abstract those to get meaningful savings. > > Thanks > Haitao
On Thu, 22 Feb 2024 15:26:05 -0600, Huang, Kai <kai.huang@intel.com> wrote: > > > On 23/02/2024 6:09 am, Haitao Huang wrote: >> On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <kai.huang@intel.com> >> wrote: >> >>> >>>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >>>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool >>>> reclaim) >>>> { >>>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >>>> PAGE_SIZE); >>>> + for (;;) { >>>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >>>> + PAGE_SIZE)) >>>> + break; >>>> + >>>> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) >>>> + return -ENOMEM; >>>> + >>>> + if (signal_pending(current)) >>>> + return -ERESTARTSYS; >>>> + >>>> + if (!reclaim) { >>>> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); >>>> + return -EBUSY; >>>> + } >>>> + >>>> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) >>>> + /* All pages were too young to reclaim, try again a >>>> little later */ >>>> + schedule(); >>>> + } >>>> + >>>> + return 0; >>>> } >>>> >>> >>> Seems this code change is 90% similar to the existing code in the >>> sgx_alloc_epc_page(): >>> >>> ... >>> for ( ; ; ) { >>> page = __sgx_alloc_epc_page(); >>> if (!IS_ERR(page)) { >>> page->owner = owner; >>> break; >>> } >>> >>> if (list_empty(&sgx_active_page_list)) >>> return ERR_PTR(-ENOMEM); >>> >>> if (!reclaim) { >>> page = ERR_PTR(-EBUSY); >>> break; >>> } >>> >>> if (signal_pending(current)) { >>> page = ERR_PTR(-ERESTARTSYS); >>> break; >>> } >>> >>> sgx_reclaim_pages(); >>> cond_resched(); >>> } >>> ... >>> >>> Is it better to move the logic/code change in try_charge() out to >>> sgx_alloc_epc_page() to unify them? >>> >>> IIUC, the logic is quite similar: When you either failed to allocate >>> one page, >>> or failed to charge one page, you try to reclaim EPC page(s) from the >>> current >>> EPC cgroup, either directly or indirectly. >>> >>> No? >> Only these lines are the same: >> if (!reclaim) { >> page = ERR_PTR(-EBUSY); >> break; >> } >> if (signal_pending(current)) { >> page = ERR_PTR(-ERESTARTSYS); >> break; >> } >> In sgx_alloc_epc_page() we do global reclamation but here we do >> per-cgroup reclamation. > > But why? If we failed to allocate, shouldn't we try to reclaim from the > _current_ EPC cgroup instead of global? E.g., I thought one enclave in > one EPC cgroup requesting insane amount of EPC shouldn't impact enclaves > inside other cgroups? > Right. When code reaches to here, we already passed reclaim per cgroup. The cgroup may not at or reach limit but system has run out of physical EPC. Thanks Haitao
> > > Right. When code reaches to here, we already passed reclaim per cgroup. Yes if try_charge() failed we must do pre-cgroup reclaim. > The cgroup may not at or reach limit but system has run out of physical > EPC. > But after try_charge() we can still choose to reclaim from the current group, but not necessarily have to be global, right? I am not sure whether I am missing something, but could you elaborate why we should choose to reclaim from the global?
On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> wrote: >> > >> Right. When code reaches to here, we already passed reclaim per cgroup. > > Yes if try_charge() failed we must do pre-cgroup reclaim. > >> The cgroup may not at or reach limit but system has run out of physical >> EPC. >> > > But after try_charge() we can still choose to reclaim from the current > group, > but not necessarily have to be global, right? I am not sure whether I am > missing something, but could you elaborate why we should choose to > reclaim from > the global? > Once try_charge is done and returns zero that means the cgroup usage is charged and it's not over usage limit. So you really can't reclaim from that cgroup if allocation failed. The only thing you can do is to reclaim globally. This could happen when the sum of limits of all cgroups is greater than the physical EPC, i.e., user is overcommitting. Thanks Haitao
On 24/02/2024 6:00 am, Haitao Huang wrote: > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> wrote: > >>> > >>> Right. When code reaches to here, we already passed reclaim per cgroup. >> >> Yes if try_charge() failed we must do pre-cgroup reclaim. >> >>> The cgroup may not at or reach limit but system has run out of physical >>> EPC. >>> >> >> But after try_charge() we can still choose to reclaim from the current >> group, >> but not necessarily have to be global, right? I am not sure whether I am >> missing something, but could you elaborate why we should choose to >> reclaim from >> the global? >> > > Once try_charge is done and returns zero that means the cgroup usage is > charged and it's not over usage limit. So you really can't reclaim from > that cgroup if allocation failed. The only thing you can do is to > reclaim globally. Sorry I still cannot establish the logic here. Let's say the sum of all cgroups are greater than the physical EPC, and elclave(s) in each cgroup could potentially fault w/o reaching cgroup's limit. In this case, when enclave(s) in one cgroup faults, why we cannot reclaim from the current cgroup, but have to reclaim from global? Is there any real downside of the former, or you just want to follow the reclaim logic w/o cgroup at all? IIUC, there's at least one advantage of reclaim from the current group, that faults of enclave(s) in one group won't impact other enclaves in other cgroups. E.g., in this way other enclaves in other groups may never need to trigger faults. Or perhaps I am missing anything?
On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> wrote: > > > On 24/02/2024 6:00 am, Haitao Huang wrote: >> On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> >> wrote: >> >>>> > >>>> Right. When code reaches to here, we already passed reclaim per >>>> cgroup. >>> >>> Yes if try_charge() failed we must do pre-cgroup reclaim. >>> >>>> The cgroup may not at or reach limit but system has run out of >>>> physical >>>> EPC. >>>> >>> >>> But after try_charge() we can still choose to reclaim from the current >>> group, >>> but not necessarily have to be global, right? I am not sure whether I >>> am >>> missing something, but could you elaborate why we should choose to >>> reclaim from >>> the global? >>> >> Once try_charge is done and returns zero that means the cgroup usage >> is charged and it's not over usage limit. So you really can't reclaim >> from that cgroup if allocation failed. The only thing you can do is to >> reclaim globally. > > Sorry I still cannot establish the logic here. > > Let's say the sum of all cgroups are greater than the physical EPC, and > elclave(s) in each cgroup could potentially fault w/o reaching cgroup's > limit. > > In this case, when enclave(s) in one cgroup faults, why we cannot > reclaim from the current cgroup, but have to reclaim from global? > > Is there any real downside of the former, or you just want to follow the > reclaim logic w/o cgroup at all? > > IIUC, there's at least one advantage of reclaim from the current group, > that faults of enclave(s) in one group won't impact other enclaves in > other cgroups. E.g., in this way other enclaves in other groups may > never need to trigger faults. > > Or perhaps I am missing anything? > The use case here is that user knows it's OK for group A to borrow some pages from group B for some time without impact much performance, vice versa. That's why the user is overcomitting so system can run more enclave/groups. Otherwise, if she is concerned about impact of A on B, she could lower limit for A so it never interfere or interfere less with B (assume the lower limit is still high enough to run all enclaves in A), and sacrifice some of A's performance. Or if she does not want any interference between groups, just don't over-comit. So we don't really lose anything here. In case of overcomitting, even if we always reclaim from the same cgroup for each fault, one group may still interfere the other: e.g., consider an extreme case in that group A used up almost all EPC at the time group B has a fault, B has to fail allocation and kill enclaves. Haitao
On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote: > On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> wrote: > > > > > > > On 24/02/2024 6:00 am, Haitao Huang wrote: > > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> > > > wrote: > > > > > > > > > > > > > > Right. When code reaches to here, we already passed reclaim per > > > > > cgroup. > > > > > > > > Yes if try_charge() failed we must do pre-cgroup reclaim. > > > > > > > > > The cgroup may not at or reach limit but system has run out of > > > > > physical > > > > > EPC. > > > > > > > > > > > > > But after try_charge() we can still choose to reclaim from the current > > > > group, > > > > but not necessarily have to be global, right? I am not sure whether I > > > > am > > > > missing something, but could you elaborate why we should choose to > > > > reclaim from > > > > the global? > > > > > > > Once try_charge is done and returns zero that means the cgroup usage > > > is charged and it's not over usage limit. So you really can't reclaim > > > from that cgroup if allocation failed. The only thing you can do is to > > > reclaim globally. > > > > Sorry I still cannot establish the logic here. > > > > Let's say the sum of all cgroups are greater than the physical EPC, and > > elclave(s) in each cgroup could potentially fault w/o reaching cgroup's > > limit. > > > > In this case, when enclave(s) in one cgroup faults, why we cannot > > reclaim from the current cgroup, but have to reclaim from global? > > > > Is there any real downside of the former, or you just want to follow the > > reclaim logic w/o cgroup at all? > > > > IIUC, there's at least one advantage of reclaim from the current group, > > that faults of enclave(s) in one group won't impact other enclaves in > > other cgroups. E.g., in this way other enclaves in other groups may > > never need to trigger faults. > > > > Or perhaps I am missing anything? > > > The use case here is that user knows it's OK for group A to borrow some > pages from group B for some time without impact much performance, vice > versa. That's why the user is overcomitting so system can run more > enclave/groups. Otherwise, if she is concerned about impact of A on B, she > could lower limit for A so it never interfere or interfere less with B > (assume the lower limit is still high enough to run all enclaves in A), > and sacrifice some of A's performance. Or if she does not want any > interference between groups, just don't over-comit. So we don't really > lose anything here. But if we reclaim from the same group, seems we could enable a user case that allows the admin to ensure certain group won't be impacted at all, while allowing other groups to over-commit? E.g., let's say we have 100M physical EPC. And let's say the admin wants to run some performance-critical enclave(s) which costs 50M EPC w/o being impacted. The admin also wants to run other enclaves which could cost 100M EPC in total but EPC swapping among them is acceptable. If we choose to reclaim from the current EPC cgroup, then seems to that the admin can achieve the above by setting up 2 groups with group1 having 50M limit and group2 having 100M limit, and then run performance-critical enclave(s) in group1 and others in group2? Or am I missing anything? If we choose to do global reclaim, then we cannot achieve that. > > In case of overcomitting, even if we always reclaim from the same cgroup > for each fault, one group may still interfere the other: e.g., consider an > extreme case in that group A used up almost all EPC at the time group B > has a fault, B has to fail allocation and kill enclaves. If the admin allows group A to use almost all EPC, to me it's fair to say he/she doesn't want to run anything inside B at all and it is acceptable enclaves in B to be killed.
On 2/26/24 03:36, Huang, Kai wrote: >> In case of overcomitting, even if we always reclaim from the same cgroup >> for each fault, one group may still interfere the other: e.g., consider an >> extreme case in that group A used up almost all EPC at the time group B >> has a fault, B has to fail allocation and kill enclaves. > If the admin allows group A to use almost all EPC, to me it's fair to say he/she > doesn't want to run anything inside B at all and it is acceptable enclaves in B > to be killed. Folks, I'm having a really hard time following this thread. It sounds like there's disagreement about when to do system-wide reclaim. Could someone remind me of the choices that we have? (A proposed patch would go a _long_ way to helping me understand) Also, what does the core mm memcg code do? Last, what is the simplest (least amount of code) thing that the SGX cgroup controller could implement here?
On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai <kai.huang@intel.com> wrote: > On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote: >> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> >> wrote: >> >> > >> > >> > On 24/02/2024 6:00 am, Haitao Huang wrote: >> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> >> > > wrote: >> > > >> > > > > > >> > > > > Right. When code reaches to here, we already passed reclaim per >> > > > > cgroup. >> > > > >> > > > Yes if try_charge() failed we must do pre-cgroup reclaim. >> > > > >> > > > > The cgroup may not at or reach limit but system has run out of >> > > > > physical >> > > > > EPC. >> > > > > >> > > > >> > > > But after try_charge() we can still choose to reclaim from the >> current >> > > > group, >> > > > but not necessarily have to be global, right? I am not sure >> whether I >> > > > am >> > > > missing something, but could you elaborate why we should choose to >> > > > reclaim from >> > > > the global? >> > > > >> > > Once try_charge is done and returns zero that means the cgroup >> usage >> > > is charged and it's not over usage limit. So you really can't >> reclaim >> > > from that cgroup if allocation failed. The only thing you can do >> is to >> > > reclaim globally. >> > >> > Sorry I still cannot establish the logic here. >> > >> > Let's say the sum of all cgroups are greater than the physical EPC, >> and >> > elclave(s) in each cgroup could potentially fault w/o reaching >> cgroup's >> > limit. >> > >> > In this case, when enclave(s) in one cgroup faults, why we cannot >> > reclaim from the current cgroup, but have to reclaim from global? >> > >> > Is there any real downside of the former, or you just want to follow >> the >> > reclaim logic w/o cgroup at all? >> > >> > IIUC, there's at least one advantage of reclaim from the current >> group, >> > that faults of enclave(s) in one group won't impact other enclaves in >> > other cgroups. E.g., in this way other enclaves in other groups may >> > never need to trigger faults. >> > >> > Or perhaps I am missing anything? >> > >> The use case here is that user knows it's OK for group A to borrow some >> pages from group B for some time without impact much performance, vice >> versa. That's why the user is overcomitting so system can run more >> enclave/groups. Otherwise, if she is concerned about impact of A on B, >> she >> could lower limit for A so it never interfere or interfere less with B >> (assume the lower limit is still high enough to run all enclaves in A), >> and sacrifice some of A's performance. Or if she does not want any >> interference between groups, just don't over-comit. So we don't really >> lose anything here. > > But if we reclaim from the same group, seems we could enable a user case > that > allows the admin to ensure certain group won't be impacted at all, while > allowing other groups to over-commit? > > E.g., let's say we have 100M physical EPC. And let's say the admin > wants to run > some performance-critical enclave(s) which costs 50M EPC w/o being > impacted. > The admin also wants to run other enclaves which could cost 100M EPC in > total > but EPC swapping among them is acceptable. > > If we choose to reclaim from the current EPC cgroup, then seems to that > the > admin can achieve the above by setting up 2 groups with group1 having > 50M limit > and group2 having 100M limit, and then run performance-critical > enclave(s) in > group1 and others in group2? Or am I missing anything? > The more important groups should have limits higher than or equal to peak usage to ensure no impact. The less important groups should have lower limits than its peak usage to avoid impacting higher priority groups. The limit is the maximum usage allowed. By setting group2 limit to 100M, you are allowing it to use 100M. So as soon as it gets up and consume 100M, group1 can not even load any enclave if we only reclaim per-cgroup and do not do global reclaim. > If we choose to do global reclaim, then we cannot achieve that. You can achieve this by setting group 2 limit to 50M. No need to overcommiting to the system. Group 2 will swap as soon as it hits 50M, which is the maximum it can consume so no impact to group 1. > >> >> In case of overcomitting, even if we always reclaim from the same cgroup >> for each fault, one group may still interfere the other: e.g., consider >> an >> extreme case in that group A used up almost all EPC at the time group B >> has a fault, B has to fail allocation and kill enclaves. > > If the admin allows group A to use almost all EPC, to me it's fair to > say he/she > doesn't want to run anything inside B at all and it is acceptable > enclaves in B > to be killed. > > I don't think so. The user just knows group A + B peak usages higher than system capacity. And she is OK for them to share some of the pages dynamically. So kernel should allow one borrow from the other at a particular instance when one group has higher demand. And later doing the opposite. IOW, the favor goes both ways. Haitao
Hi Dave, On Mon, 26 Feb 2024 08:04:54 -0600, Dave Hansen <dave.hansen@intel.com> wrote: > On 2/26/24 03:36, Huang, Kai wrote: >>> In case of overcomitting, even if we always reclaim from the same >>> cgroup >>> for each fault, one group may still interfere the other: e.g., >>> consider an >>> extreme case in that group A used up almost all EPC at the time group B >>> has a fault, B has to fail allocation and kill enclaves. >> If the admin allows group A to use almost all EPC, to me it's fair to >> say he/she >> doesn't want to run anything inside B at all and it is acceptable >> enclaves in B >> to be killed. > > Folks, I'm having a really hard time following this thread. It sounds > like there's disagreement about when to do system-wide reclaim. Could > someone remind me of the choices that we have? (A proposed patch would > go a _long_ way to helping me understand) > In case of overcomitting, i.e., sum of limits greater than the EPC capacity, if one group has a fault, and its usage is not above its own limit (try_charge() passes), yet total usage of the system has exceeded the capacity, whether we do global reclaim or just reclaim pages in the current faulting group. > Also, what does the core mm memcg code do? > I'm not sure. I'll try to find out but it'd be appreciated if someone more knowledgeable can comment on this. memcg also has the protection mechanism (i.e., min, low settings) to guarantee some allocation per group so its approach might not be applicable to misc controller here. > Last, what is the simplest (least amount of code) thing that the SGX > cgroup controller could implement here? > > I still think the current approach of doing global reclaim is reasonable and simple: try_charge() checks cgroup limit and reclaim within the group if needed, then do EPC page allocation, reclaim globally if allocation fails due to global usage reaches the capacity. I'm not sure how not doing global reclaiming in this case would bring any benefit. Please see my response to Kai's example cases. Thanks Haitao
On 2/26/24 13:48, Haitao Huang wrote: > In case of overcomitting, i.e., sum of limits greater than the EPC > capacity, if one group has a fault, and its usage is not above its own > limit (try_charge() passes), yet total usage of the system has exceeded > the capacity, whether we do global reclaim or just reclaim pages in the > current faulting group. I don't see _any_ reason to limit reclaim to the current faulting cgroup. >> Last, what is the simplest (least amount of code) thing that the SGX >> cgroup controller could implement here? > > I still think the current approach of doing global reclaim is reasonable > and simple: try_charge() checks cgroup limit and reclaim within the > group if needed, then do EPC page allocation, reclaim globally if > allocation fails due to global usage reaches the capacity. > > I'm not sure how not doing global reclaiming in this case would bring > any benefit. I tend to agree. Kai, I think your examples sound a little bit contrived. Have actual users expressed a strong intent for doing anything with this series other than limiting bad actors from eating all the EPC?
On 27/02/2024 10:18 am, Haitao Huang wrote: > On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai <kai.huang@intel.com> wrote: > >> On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote: >>> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> >>> wrote: >>> >>> > >>> > >>> > On 24/02/2024 6:00 am, Haitao Huang wrote: >>> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> >>> > > wrote: >>> > > >>> > > > > > >>> > > > > Right. When code reaches to here, we already passed reclaim per >>> > > > > cgroup. >>> > > > >>> > > > Yes if try_charge() failed we must do pre-cgroup reclaim. >>> > > > >>> > > > > The cgroup may not at or reach limit but system has run out of >>> > > > > physical >>> > > > > EPC. >>> > > > > >>> > > > >>> > > > But after try_charge() we can still choose to reclaim from the >>> current >>> > > > group, >>> > > > but not necessarily have to be global, right? I am not sure >>> whether I >>> > > > am >>> > > > missing something, but could you elaborate why we should choose to >>> > > > reclaim from >>> > > > the global? >>> > > > >>> > > Once try_charge is done and returns zero that means the cgroup >>> usage >>> > > is charged and it's not over usage limit. So you really can't >>> reclaim >>> > > from that cgroup if allocation failed. The only thing you can do >>> is to >>> > > reclaim globally. >>> > >>> > Sorry I still cannot establish the logic here. >>> > >>> > Let's say the sum of all cgroups are greater than the physical EPC, >>> and >>> > elclave(s) in each cgroup could potentially fault w/o reaching >>> cgroup's >>> > limit. >>> > >>> > In this case, when enclave(s) in one cgroup faults, why we cannot >>> > reclaim from the current cgroup, but have to reclaim from global? >>> > >>> > Is there any real downside of the former, or you just want to >>> follow the >>> > reclaim logic w/o cgroup at all? >>> > >>> > IIUC, there's at least one advantage of reclaim from the current >>> group, >>> > that faults of enclave(s) in one group won't impact other enclaves in >>> > other cgroups. E.g., in this way other enclaves in other groups may >>> > never need to trigger faults. >>> > >>> > Or perhaps I am missing anything? >>> > >>> The use case here is that user knows it's OK for group A to borrow some >>> pages from group B for some time without impact much performance, vice >>> versa. That's why the user is overcomitting so system can run more >>> enclave/groups. Otherwise, if she is concerned about impact of A on >>> B, she >>> could lower limit for A so it never interfere or interfere less with B >>> (assume the lower limit is still high enough to run all enclaves in A), >>> and sacrifice some of A's performance. Or if she does not want any >>> interference between groups, just don't over-comit. So we don't really >>> lose anything here. >> >> But if we reclaim from the same group, seems we could enable a user >> case that >> allows the admin to ensure certain group won't be impacted at all, while >> allowing other groups to over-commit? >> >> E.g., let's say we have 100M physical EPC. And let's say the admin >> wants to run >> some performance-critical enclave(s) which costs 50M EPC w/o being >> impacted. >> The admin also wants to run other enclaves which could cost 100M EPC >> in total >> but EPC swapping among them is acceptable. >> >> If we choose to reclaim from the current EPC cgroup, then seems to >> that the >> admin can achieve the above by setting up 2 groups with group1 having >> 50M limit >> and group2 having 100M limit, and then run performance-critical >> enclave(s) in >> group1 and others in group2? Or am I missing anything? >> > > The more important groups should have limits higher than or equal to > peak usage to ensure no impact. Yes. But if you do global reclaim there's no guarantee of this regardless of the limit setting. It depends on setting of limits of other groups. > The less important groups should have lower limits than its peak usage > to avoid impacting higher priority groups. Yeah, but depending on how low the limit is, the try_charge() can still succeed but physical EPC is already running out. Are you saying we should always expect the admin to set limits of groups not exceeding the physical EPC? > The limit is the maximum usage allowed. > > By setting group2 limit to 100M, you are allowing it to use 100M. So as > soon as it gets up and consume 100M, group1 can not even load any > enclave if we only reclaim per-cgroup and do not do global reclaim. I kinda forgot, but I think SGX supports swapping out EPC of an enclave before EINIT? Also, with SGX2 the initial enclave can take less EPC to be loaded. > >> If we choose to do global reclaim, then we cannot achieve that. > > > You can achieve this by setting group 2 limit to 50M. No need to > overcommiting to the system. > Group 2 will swap as soon as it hits 50M, which is the maximum it can > consume so no impact to group 1. Right. We can achieve this by doing so. But as said above, you are depending on setting up the limit to do per-cgorup reclaim. So, back to the question: What is the downside of doing per-group reclaim when try_charge() succeeds for the enclave but failed to allocate EPC page? Could you give an complete answer why you choose to use global reclaim for the above case?
On 2/26/24 14:24, Huang, Kai wrote: > What is the downside of doing per-group reclaim when try_charge() > succeeds for the enclave but failed to allocate EPC page? > > Could you give an complete answer why you choose to use global reclaim > for the above case? There are literally two different limits at play. There's the limit that the cgroup imposes and then the actual physical limit. Hitting the cgroup limit induces cgroup reclaim. Hitting the physical limit induces global reclaim. Maybe I'm just being dense, but I fail to understand why you would want to entangle those two different concepts more than absolutely necessary.
> > Kai, I think your examples sound a little bit contrived. Have actual > users expressed a strong intent for doing anything with this series > other than limiting bad actors from eating all the EPC? I am not sure about this. I am also trying to get a full picture. I asked because I didn't quite like the duplicated code change in try_charge() in this patch and in sgx_alloc_epc_page(). I think using per-group reclaim we can unify the code (I have even started to write the code) and I don't see the downside of doing so. So I am trying to get the actual downside of doing per-cgroup reclaim or the full reason that we choose global reclaim.
On 2/26/24 14:34, Huang, Kai wrote: > So I am trying to get the actual downside of doing per-cgroup reclaim or > the full reason that we choose global reclaim. Take the most extreme example: while (hit_global_sgx_limit()) reclaim_from_this(cgroup); You eventually end up with all of 'cgroup's pages gone and handed out to other users on the system who stole them all. Other users might cause you to go over the global limit. *They* should be paying part of the cost, not just you and your cgroup.
On 27/02/2024 11:31 am, Dave Hansen wrote: > On 2/26/24 14:24, Huang, Kai wrote: >> What is the downside of doing per-group reclaim when try_charge() >> succeeds for the enclave but failed to allocate EPC page? >> >> Could you give an complete answer why you choose to use global reclaim >> for the above case? > > There are literally two different limits at play. There's the limit > that the cgroup imposes and then the actual physical limit. > > Hitting the cgroup limit induces cgroup reclaim. > > Hitting the physical limit induces global reclaim. > > Maybe I'm just being dense, but I fail to understand why you would want > to entangle those two different concepts more than absolutely necessary. OK. Yes I agree doing per-cgroup reclaim when hitting physical limit would bring another layer of consideration of when to do global reclaim, which is not necessary now.
On 27/02/2024 11:38 am, Dave Hansen wrote: > On 2/26/24 14:34, Huang, Kai wrote: >> So I am trying to get the actual downside of doing per-cgroup reclaim or >> the full reason that we choose global reclaim. > > Take the most extreme example: > > while (hit_global_sgx_limit()) > reclaim_from_this(cgroup); > > You eventually end up with all of 'cgroup's pages gone and handed out to > other users on the system who stole them all. Other users might cause > you to go over the global limit. *They* should be paying part of the > cost, not just you and your cgroup. Yeah likely we will need another layer of logic to decide when to do global reclaim. I agree that is downside and is unnecessary for this patchset. Thanks for the comments.
Hello. On Mon, Feb 26, 2024 at 03:48:18PM -0600, Haitao Huang <haitao.huang@linux.intel.com> wrote: > In case of overcomitting, i.e., sum of limits greater than the EPC capacity, > if one group has a fault, and its usage is not above its own limit > (try_charge() passes), yet total usage of the system has exceeded the > capacity, whether we do global reclaim or just reclaim pages in the current > faulting group. > > > Also, what does the core mm memcg code do? > > > I'm not sure. I'll try to find out but it'd be appreciated if someone more > knowledgeable can comment on this. memcg also has the protection mechanism > (i.e., min, low settings) to guarantee some allocation per group so its > approach might not be applicable to misc controller here. I only follow the discussion superficially but it'd be nice to have analogous mechanisms in memcg and sgx controller. The memory limits are rather simple -- when allocating new memory, the tightest limit of ancestor applies and reclaim is applied to whole subtree of such an ancestor (not necessearily the originating cgroup of the allocation). Overcommit is admited, competition among siblings is resolved on the first comes, first served basis. The memory protections are an additional (and in a sense orthogoal) mechanism. They can be interpretted as limits that are enforced not at the time of allocation but at the time of reclaim (and reclaim is triggered independetly, either global or with the limits above). The consequence is that the protection code must do some normalization to evaluate overcommited values sensibly. (Historically, there was also memory.soft_limit_in_bytes, which combined "protection" and global reclaim but it turned out broken. I suggest reading Documentation/admin-guide/cgroup-v2.rst section Controller Issues and Remedies/Memory for more details and as cautionary example.) HTH, Michal
On Mon Feb 26, 2024 at 11:56 PM EET, Dave Hansen wrote: > On 2/26/24 13:48, Haitao Huang wrote: > > In case of overcomitting, i.e., sum of limits greater than the EPC > > capacity, if one group has a fault, and its usage is not above its own > > limit (try_charge() passes), yet total usage of the system has exceeded > > the capacity, whether we do global reclaim or just reclaim pages in the > > current faulting group. > > I don't see _any_ reason to limit reclaim to the current faulting cgroup. > > >> Last, what is the simplest (least amount of code) thing that the SGX > >> cgroup controller could implement here? > > > > I still think the current approach of doing global reclaim is reasonable > > and simple: try_charge() checks cgroup limit and reclaim within the > > group if needed, then do EPC page allocation, reclaim globally if > > allocation fails due to global usage reaches the capacity. > > > > I'm not sure how not doing global reclaiming in this case would bring > > any benefit. > I tend to agree. > > Kai, I think your examples sound a little bit contrived. Have actual > users expressed a strong intent for doing anything with this series > other than limiting bad actors from eating all the EPC? I'd consider this from the viewpoint is there anything in the user space visible portion of the patch set that would limit tuning the performance later on, if required let's say by a workload that acts sub-optimally. If not, then most of performance related issues can be only identified by actual use of the code. BR, Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index d399fda2b55e..abf74fdb12b4 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) /** * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page * @epc_cg: The EPC cgroup to be charged for the page. + * @reclaim: Whether or not synchronous reclaim is allowed * Return: * * %0 - If successfully charged. * * -errno - for failures. */ -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); + for (;;) { + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, + PAGE_SIZE)) + break; + + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; + + if (signal_pending(current)) + return -ERESTARTSYS; + + if (!reclaim) { + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); + return -EBUSY; + } + + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) + /* All pages were too young to reclaim, try again a little later */ + schedule(); + } + + return 0; } /** diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index e3c6a08f0ee8..d061cd807b45 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -23,7 +23,7 @@ static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { } -static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { return 0; } @@ -66,7 +66,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) put_misc_cg(epc_cg->cg); } -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg); +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim); void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); void sgx_epc_cgroup_init(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 51904f191b97..2279ae967707 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -588,7 +588,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) int ret; epc_cg = sgx_get_current_epc_cg(); - ret = sgx_epc_cgroup_try_charge(epc_cg); + ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim); if (ret) { sgx_put_epc_cg(epc_cg); return ERR_PTR(ret);