Message ID | 20231030182013.40086-10-haitao.huang@linux.intel.com
---|---
State | New
Headers |
From: Haitao Huang <haitao.huang@linux.intel.com>
To: jarkko@kernel.org, dave.hansen@linux.intel.com, tj@kernel.org, mkoutny@suse.com, linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org, x86@kernel.org, cgroups@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, sohil.mehta@intel.com
Cc: zhiquan1.li@intel.com, kristen@linux.intel.com, seanjc@google.com, zhanb@microsoft.com, anakrish@microsoft.com, mikko.ylinen@linux.intel.com, yangjie@microsoft.com, Sean Christopherson <sean.j.christopherson@intel.com>, Haitao Huang <haitao.huang@linux.intel.com>
Subject: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
Date: Mon, 30 Oct 2023 11:20:10 -0700
Message-Id: <20231030182013.40086-10-haitao.huang@linux.intel.com>
In-Reply-To: <20231030182013.40086-1-haitao.huang@linux.intel.com>
References: <20231030182013.40086-1-haitao.huang@linux.intel.com>
Series | Add Cgroup support for SGX EPC memory
Commit Message
Haitao Huang
Oct. 30, 2023, 6:20 p.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

To prepare for per-cgroup reclamation, separate the top-level reclaim function, sgx_reclaim_epc_pages(), into two separate functions:

- sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given LRU list.
- sgx_do_epc_reclamation() performs the actual reclamation of the already-isolated pages.

Create a new function, sgx_reclaim_epc_pages_global(), which calls those two in succession, to replace the original sgx_reclaim_epc_pages(). The two new functions will serve as building blocks for the reclamation flows in the later EPC cgroup implementation.

sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC cgroup will use the result to track reclamation progress.

sgx_isolate_epc_pages() returns the additional number of pages to scan for the current epoch of reclamation. The EPC cgroup will use the result to determine whether more scanning needs to be done in the LRUs of its children groups.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V6:
- Restructure patches to make it easier to review. (Kai)
- Fix unused nr_to_scan (Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 97 ++++++++++++++++++++++------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  8 +++
 2 files changed, 72 insertions(+), 33 deletions(-)
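The split described in the changelog can be sketched as a simplified userspace model. This is not the kernel code from the patch: the list handling, the `accessed` second-chance logic, and the batch size are stand-ins for the kernel's sgx_epc_page LRU machinery, kept only to illustrate the isolate/reclaim separation.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_TO_SCAN 16

/* Hypothetical stand-in for an EPC page on a reclaimer LRU. */
struct epc_page {
	int id;
	int accessed;		/* touched since the last scan */
	struct epc_page *next;
};

struct lru {
	struct epc_page *head;	/* head = oldest */
};

static void lru_add_tail(struct lru *lru, struct epc_page *page)
{
	struct epc_page **pp = &lru->head;

	while (*pp)
		pp = &(*pp)->next;
	page->next = NULL;
	*pp = page;
}

/*
 * Stand-in for sgx_isolate_epc_pages(): move up to @nr_to_scan pages
 * from the LRU onto the @iso list.  Returns the number of additional
 * pages the caller may still want to scan elsewhere (nr_to_scan minus
 * pages actually isolated), mirroring the changelog's semantics.
 */
static int isolate_epc_pages(struct lru *lru, int nr_to_scan,
			     struct epc_page **iso)
{
	int isolated = 0;

	while (lru->head && isolated < nr_to_scan) {
		struct epc_page *page = lru->head;

		lru->head = page->next;
		page->next = *iso;
		*iso = page;
		isolated++;
	}
	return nr_to_scan - isolated;
}

/*
 * Stand-in for sgx_do_epc_reclamation(): "reclaim" the isolated pages,
 * giving pages accessed since the last scan a second chance by moving
 * them back to the LRU tail.  Returns the number of pages reclaimed.
 */
static int do_epc_reclamation(struct lru *lru, struct epc_page **iso)
{
	int reclaimed = 0;

	while (*iso) {
		struct epc_page *page = *iso;

		*iso = page->next;
		if (page->accessed) {
			page->accessed = 0;
			lru_add_tail(lru, page);	/* second chance */
		} else {
			free(page);			/* "EWB" to backing store */
			reclaimed++;
		}
	}
	return reclaimed;
}

/* Stand-in for sgx_reclaim_epc_pages_global(): the two steps in succession. */
static int reclaim_epc_pages_global(struct lru *lru)
{
	struct epc_page *iso = NULL;

	isolate_epc_pages(lru, NR_TO_SCAN, &iso);
	return do_epc_reclamation(lru, &iso);
}
```

The point of the split is that the isolate step can later be called once per cgroup LRU, while the reclaim step still runs once over the combined batch.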
Comments
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> To prepare for per-cgroup reclamation, separate the top-level reclaim
> function, sgx_reclaim_epc_pages(), into two separate functions:
>
> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
>   given LRU list.
> - sgx_do_epc_reclamation() performs the real reclamation for the
>   already isolated pages.
>
> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> in succession, to replace the original sgx_reclaim_epc_pages(). The
> above two functions will serve as building blocks for the reclamation
> flows in later EPC cgroup implementation.
>
> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> cgroup will use the result to track reclaiming progress.
>
> sgx_isolate_epc_pages() returns the additional number of pages to scan
> for current epoch of reclamation. The EPC cgroup will use the result to
> determine if more scanning to be done in LRUs in its children groups.

This changelog says nothing about "why", but only mentions the
"implementation".

For instance, assuming we need to reclaim @npages_to_reclaim from the
@epc_cgrp_to_reclaim and its descendants, why cannot we do:

	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
		if (npages_to_reclaim <= 0)
			return;

		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
					npages_to_reclaim);
	}

Is there any difference to have "isolate" + "reclaim"?

> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> [...]
> +/**
> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> + * @iso:	List of isolated pages for reclamation
> + *
> + * Take a list of EPC pages and reclaim them to the enclave's private shmem files. Do not
> + * reclaim the pages that have been accessed since the last scan, and move each of those pages
> + * to the tail of its tracking LRU list.
> + *
> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
> + * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> + * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
> + * the lock contention too much, which would halt forward progress.

This is kinda optimization, correct? Is there any real performance data to
justify this?

If this optimization is useful, shouldn't we bring this optimization to the
current sgx_reclaim_pages() instead, e.g., just increase SGX_NR_TO_SCAN (16)
to SGX_NR_TO_SCAN_MAX (32)?
On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com> wrote:
> This changelog says nothing about "why", but only mentions the
> "implementation".
> [...]
> Is there any difference to have "isolate" + "reclaim"?

This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
Here we just follow the same design as ksgxd for each reclamation cycle.

> > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX
> > + * per call in order to degrade amount of IPI's and ETRACK's
> > + * potentially required. [...] Reclaiming one page at a time would
> > + * also be problematic as it would increase the lock contention too
> > + * much, which would halt forward progress.
>
> This is kinda optimization, correct? Is there any real performance data
> to justify this?

The above sentences were there originally. This optimization was justified.

> If this optimization is useful, shouldn't we bring this optimization to
> the current sgx_reclaim_pages() instead, e.g., just increase
> SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?

SGX_NR_TO_SCAN_MAX might have been designed earlier for other reasons I
don't know. Currently it is really the buffer size allocated in
sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages at a time. Maybe
we should just use SGX_NR_TO_SCAN; no _MAX needed. The point was to batch
reclamation up to a certain number to minimize impact on the EWB pipeline.
16 was the original design.

Thanks
Haitao
On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> [...]
> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> Here we just follow the same design as ksgxd for each reclamation cycle.

I don't see how did you "follow" ksgxd. If I am guessing correctly, you are
afraid of there might be less than 16 pages in a given EPC cgroup, thus w/o
splitting into "isolate" + "reclaim" you might feed the "reclaim" less than
16 pages, which might cause some performance degrade?

But is this a common case? Should we even worry about this?

I suppose for such new feature we should bring functionality first and then
optimization if you have real performance data to show.

> > This is kinda optimization, correct? Is there any real performance data
> > to justify this?
>
> The above sentences were there originally. This optimization was
> justified.

I am talking about 16 -> 32.

> > If this optimization is useful, shouldn't we bring this optimization to
> > the current sgx_reclaim_pages() instead, e.g., just increase
> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
>
> SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't
> know. [...] 16 was the original design.

Please don't leave why you are trying to do this to the reviewers. If you
don't know, then just drop this.
Hi Kai

On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> I don't see how did you "follow" ksgxd. If I am guessing correctly, you
> are afraid of there might be less than 16 pages in a given EPC cgroup,
> thus w/o splitting into "isolate" + "reclaim" you might feed the
> "reclaim" less than 16 pages, which might cause some performance degrade?
>
> But is this a common case? Should we even worry about this?
>
> I suppose for such new feature we should bring functionality first and
> then optimization if you have real performance data to show.

The concern is not about "reclaim less than 16". I mean this is just
refactoring with exactly the same design of ksgxd preserved, in that we
first isolate as many candidate EPC pages as possible (up to 16; ignore the
unneeded SGX_NR_TO_SCAN_MAX for now), then do the EWB in one shot without
anything else done in between. As described in the original comments for
sgx_reclaim_pages() and sgx_encl_ewb(), this is to finish all EWBs quickly
while minimizing the impact of IPIs.

The way you proposed will work, but it alters the current design and
behavior when cgroups are enabled and the EPC pages of an enclave are
tracked across multiple LRUs within the descendant cgroups, in that you
will have the isolation loop, backing-store allocation loop, and EBLOCK
loop interleaved with the EWB loop.

> I am talking about 16 -> 32.
> [...]
> Please don't leave why you are trying to do this to the reviewers. If
> you don't know, then just drop this.

Fair enough. This was my oversight when doing all the changes and rebase.
Will drop it.

Thanks
Haitao
On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
> [...]
> The concern is not about "reclaim less than 16". I mean this is just
> refactoring with exactly the same design of ksgxd preserved,

I literally have no idea what you are talking about here. ksgxd() just
calls sgx_reclaim_pages(), which tries to reclaim 16 pages at once.

> in that we first isolate as many candidate EPC pages (up to 16, ignore
> the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in one shot
> without anything else done in between.

Assuming you are referring the implementation of sgx_reclaim_pages(), and
assuming the "isolate" you mean removing EPC pages from the list (which is
exactly what the sgx_isolate_epc_pages() in this patch does), what happens
to the loops of "backing store allocation" and "EBLOCK", before the loop of
EWB? Eaten by you?

> As described in original comments for the function sgx_reclaim_pages and
> sgx_encl_ewb, this is to finish all ewb quickly while minimizing impact
> of IPI.
>
> The way you proposed will work but alters the current design and behavior
> if cgroups is enabled and EPCs of an enclave are tracked across multiple
> LRUs within the descendant cgroups, in that you will have isolation loop,
> backing store allocation loop, eblock loop interleaved with the ewb loop.

I have no idea what you are talking about.

The point is, with or w/o this patch, you can only reclaim 16 EPC pages in
one function call (as you have said you are going to remove
SGX_NR_TO_SCAN_MAX, which is a cipher to both of us). The only difference I
can see is, with this patch, you can have multiple calls of "isolate" and
then call the "do_reclaim" once.

But what's the good of having the "isolate" if the "do_reclaim" can only
reclaim 16 pages anyway?

Back to my last reply, are you afraid of any LRU has less than 16 pages to
"isolate", therefore you need to loop LRUs of descendants to get 16? Cause
I really cannot think of any other reason why you are doing this.
On Wed, 13 Dec 2023 05:17:11 -0600, Huang, Kai <kai.huang@intel.com> wrote: > On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote: >> Hi Kai >> >> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <kai.huang@intel.com> >> wrote: >> >> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote: >> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com> >> > > wrote: >> > > >> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote: >> > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> >> > > > > >> > > > > To prepare for per-cgroup reclamation, separate the top-level >> > > reclaim >> > > > > function, sgx_reclaim_epc_pages(), into two separate functions: >> > > > > >> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages >> from >> > > a >> > > > > given LRU list. >> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the >> > > > > already isolated pages. >> > > > > >> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling >> > > those two >> > > > > in succession, to replace the original sgx_reclaim_epc_pages(). >> The >> > > > > above two functions will serve as building blocks for the >> > > reclamation >> > > > > flows in later EPC cgroup implementation. >> > > > > >> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages. >> The >> > > EPC >> > > > > cgroup will use the result to track reclaiming progress. >> > > > > >> > > > > sgx_isolate_epc_pages() returns the additional number of pages >> to >> > > scan >> > > > > for current epoch of reclamation. The EPC cgroup will use the >> > > result to >> > > > > determine if more scanning to be done in LRUs in its children >> > > groups. >> > > > >> > > > This changelog says nothing about "why", but only mentions the >> > > > "implementation". 
>> > > > >> > > > For instance, assuming we need to reclaim @npages_to_reclaim from >> the >> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do: >> > > > >> > > > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) >> { >> > > > if (npages_to_reclaim <= 0) >> > > > return; >> > > > >> > > > npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru, >> > > > npages_to_reclaim); >> > > > } >> > > > >> > > > Is there any difference to have "isolate" + "reclaim"? >> > > > >> > > >> > > This is to optimize "reclaim". See how etrack was done in >> sgx_encl_ewb. >> > > Here we just follow the same design as ksgxd for each reclamation >> cycle. >> > >> > I don't see how did you "follow" ksgxd. If I am guessing correctly, >> you >> > are >> > afraid of there might be less than 16 pages in a given EPC cgroup, >> thus >> > w/o >> > splitting into "isolate" + "reclaim" you might feed the "reclaim" less >> > than 16 >> > pages, which might cause some performance degrade? >> > >> > But is this a common case? Should we even worry about this? >> > >> > I suppose for such new feature we should bring functionality first and >> > then >> > optimization if you have real performance data to show. >> > >> The concern is not about "reclaim less than 16". >> I mean this is just refactoring with exactly the same design of ksgxd >> preserved, > > I literally have no idea what you are talking about here. ksgxd() just > calls > sgx_reclaim_pages(), which tries to reclaim 16 pages at once. > >> in that we first isolate as many candidate EPC pages (up to >> 16, ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb >> in >> one shot without anything else done in between. 
>
> Assuming you are referring the implementation of sgx_reclaim_pages(), and
> assuming the "isolate" you mean removing EPC pages from the list (which is
> exactly what the sgx_isolate_epc_pages() in this patch does), what happens to
> the loops of "backing store allocation" and "EBLOCK", before the loop of
> EWB? Eaten by you?
>

I skipped those as what really matters is to keep the EWB loop separate and
run it in one shot for each reclaiming cycle, not dependent on the number of
LRUs. All the loops in the original sgx_reclaim_pages() except the "isolate"
loop do not deal with the multiple LRUs of cgroups added later. That's the
reason to refactor out only the "isolate" part and loop it through the cgroup
LRUs in later patches.

>> As described in original
>> comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to
>> finish all ewb quickly while minimizing impact of IPI.
>>
>> The way you proposed will work but alters the current design and behavior
>> if cgroups is enabled and EPCs of an enclave are tracked across multiple
>> LRUs within the descendant cgroups, in that you will have isolation loop,
>> backing store allocation loop, eblock loop interleaved with the ewb loop.
>>
>
> I have no idea what you are talking about.
>
> The point is, with or w/o this patch, you can only reclaim 16 EPC pages in
> one function call (as you have said you are going to remove
> SGX_NR_TO_SCAN_MAX, which is a cipher to both of us). The only difference I
> can see is, with this patch, you can have multiple calls of "isolate" and
> then call the "do_reclaim" once.
>
> But what's the good of having the "isolate" if the "do_reclaim" can only
> reclaim 16 pages anyway?
>
> Back to my last reply, are you afraid of any LRU has less than 16 pages to
> "isolate", therefore you need to loop LRUs of descendants to get 16? Cause I
> really cannot think of any other reason why you are doing this.
>

I think I see your point.
By capping the pages reclaimed per cycle at 16, there is not much difference
even if those 16 pages are spread across separate LRUs. The difference only
becomes significant if we ever raise that cap. To preserve the current
behavior, in which the EWB loop runs independently of the number of LRUs
looped through in each reclaiming cycle, regardless of the exact value of the
page cap, I would still consider the current approach in the patch a
reasonable choice. What do you think?

Thanks

Haitao
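[Editor's illustration] The "isolate across many LRUs, then reclaim the whole batch once" design under debate can be sketched in plain userspace C. This is not kernel code: the structure names, the singly linked list layout, and the function names are simplified assumptions standing in for sgx_isolate_epc_pages()/sgx_do_epc_reclamation() and the per-cgroup LRU walk.

```c
#include <stddef.h>

#define SGX_NR_TO_SCAN 16	/* per-cycle reclaim cap discussed in the thread */

/* Minimal stand-ins for the kernel structures: an LRU is a singly
 * linked list of reclaimable pages. */
struct epc_page { struct epc_page *next; };
struct epc_lru  { struct epc_page *head; };

/* Phase 1 ("isolate"): detach up to @nr_to_scan pages from @lru onto the
 * private @dst batch list.  Returns the unused scan budget, mirroring how
 * the patch's sgx_isolate_epc_pages() reports remaining pages to scan so a
 * caller can continue with the next (descendant cgroup's) LRU. */
static unsigned int isolate_pages(struct epc_lru *lru, unsigned int nr_to_scan,
                                  struct epc_page **dst)
{
	for (; nr_to_scan > 0; --nr_to_scan) {
		struct epc_page *p = lru->head;
		if (!p)
			break;
		lru->head = p->next;	/* take from head of LRU */
		p->next = *dst;		/* push onto isolated batch */
		*dst = p;
	}
	return nr_to_scan;
}

/* Phase 2 ("do_reclaim"): process the whole isolated batch in one shot,
 * the way a single EWB loop would handle all isolated pages regardless of
 * how many LRUs they came from.  Returns the number "reclaimed". */
static unsigned int reclaim_batch(struct epc_page **iso)
{
	unsigned int n = 0;

	while (*iso) {
		*iso = (*iso)->next;	/* a real kernel would EWB here */
		n++;
	}
	return n;
}

/* Composition: isolate from each LRU until the scan budget is spent,
 * then reclaim the combined batch exactly once. */
static unsigned int reclaim_from_lrus(struct epc_lru **lrus, size_t nr_lrus)
{
	struct epc_page *iso = NULL;
	unsigned int budget = SGX_NR_TO_SCAN;
	size_t i;

	for (i = 0; i < nr_lrus && budget > 0; i++)
		budget = isolate_pages(lrus[i], budget, &iso);

	return reclaim_batch(&iso);
}
```

With two LRUs of 10 pages each and a budget of 16, the first LRU is drained (10 pages), 6 more come from the second, and the batch of 16 is then reclaimed in one pass, which is the behavior Haitao argues the split preserves.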
> > > > The point is, with or w/o this patch, you can only reclaim 16 EPC pages > > in one > > function call (as you have said you are going to remove > > SGX_NR_TO_SCAN_MAX, > > which is a cipher to both of us). The only difference I can see is, > > with this > > patch, you can have multiple calls of "isolate" and then call the > > "do_reclaim" > > once. > > > > But what's the good of having the "isolate" if the "do_reclaim" can only > > reclaim > > 16 pages anyway? > > > > Back to my last reply, are you afraid of any LRU has less than 16 pages > > to > > "isolate", therefore you need to loop LRUs of descendants to get 16? > > Cause I > > really cannot think of any other reason why you are doing this. > > > > > > I think I see your point. By capping pages reclaimed per cycle to 16, > there is not much difference even if those 16 pages are spread in separate > LRUs . The difference is only significant when we ever raise that cap. To > preserve the current behavior of ewb loops independent on number of LRUs > to loop through for each reclaiming cycle, regardless of the exact value > of the page cap, I would still think current approach in the patch is > reasonable choice. What do you think? To me I won't bother to do that. Having less than 16 pages in one LRU is *extremely rare* that should never happen in practice. It's pointless to make such code adjustment at this stage. Let's focus on enabling functionality first. When you have some real performance issue that is related to this, we can come back then. Btw, I think you need to step back even further. IIUC the whole multiple LRU thing isn't mandatory in this initial support. Please (again) take a look at the comments from Dave and Michal: https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
On Mon, Dec 18, 2023 at 01:44:56AM +0000, Huang, Kai wrote: > > Let's focus on enabling functionality first. When you have some real > performance issue that is related to this, we can come back then. > > Btw, I think you need to step back even further. IIUC the whole multiple LRU > thing isn't mandatory in this initial support. > > Please (again) take a look at the comments from Dave and Michal: > > https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t > https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/ I don't think setting a hard limit without any reclaiming is preferred. I'd rather see this similar to what the "sgx_epc.high" was in the RFC patchset: misc.max for sgx_epc becomes the max value for EPC usage but enclaves larger than the limit would still run OK. Per-cgroup reclaiming allows additional controls via memory.high/max in the same cgroup. If this reclaim flexibily was not there, the sgx_epc limit would always have to be set based on some "peak" EPC consumption which may not even be known at the time the limit is set. From a container runtime perspective (which is what I'm working for Kubernetes) the current proposal seems best to me: a container is guaranteed at most the amount of EPC set as the limit and no other container gets to use it. Also, each container gets charged for reclaiming independently if a low max value is used (which might be desirable to get more containers to run on the same node/system). In this model, the sum of containers' max values would be the capacity. -- Mikko
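[Editor's illustration] The container-runtime usage Mikko describes maps onto the standard misc-controller files (misc.capacity, misc.current, misc.max) in cgroup v2. A rough sketch follows; the `sgx_epc` resource name comes from this patch series, and the example paths and byte values are made up.

```shell
# cgroup v2 is typically mounted here
cd /sys/fs/cgroup

# Total capacities of registered misc resources, e.g. total EPC bytes
cat misc.capacity          # e.g. "sgx_epc 268435456"

# Create a cgroup for one container and cap its EPC usage at 64 MiB.
# With per-cgroup reclaim, enclaves exceeding the cap get their EPC
# pages swapped out rather than failing or being killed.
mkdir container1
echo "sgx_epc 67108864" > container1/misc.max

# Per-cgroup usage is tracked separately
cat container1/misc.current
```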
On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote: > >> > >> > The point is, with or w/o this patch, you can only reclaim 16 EPC >> pages >> > in one >> > function call (as you have said you are going to remove >> > SGX_NR_TO_SCAN_MAX, >> > which is a cipher to both of us). The only difference I can see is, >> > with this >> > patch, you can have multiple calls of "isolate" and then call the >> > "do_reclaim" >> > once. >> > >> > But what's the good of having the "isolate" if the "do_reclaim" can >> only >> > reclaim >> > 16 pages anyway? >> > >> > Back to my last reply, are you afraid of any LRU has less than 16 >> pages >> > to >> > "isolate", therefore you need to loop LRUs of descendants to get 16? >> > Cause I >> > really cannot think of any other reason why you are doing this. >> > >> > >> >> I think I see your point. By capping pages reclaimed per cycle to 16, >> there is not much difference even if those 16 pages are spread in >> separate >> LRUs . The difference is only significant when we ever raise that cap. >> To >> preserve the current behavior of ewb loops independent on number of LRUs >> to loop through for each reclaiming cycle, regardless of the exact value >> of the page cap, I would still think current approach in the patch is >> reasonable choice. What do you think? > > To me I won't bother to do that. Having less than 16 pages in one LRU is > *extremely rare* that should never happen in practice. It's pointless > to make > such code adjustment at this stage. > > Let's focus on enabling functionality first. When you have some real > performance issue that is related to this, we can come back then. > > Btw, I think you need to step back even further. IIUC the whole > multiple LRU > thing isn't mandatory in this initial support. 
> > Please (again) take a look at the comments from Dave and Michal: > > https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t > https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/ Thanks for raising this. Actually my understanding the above discussion was mainly about not doing reclaiming by killing enclaves, i.e., I assumed "reclaiming" within that context only meant for that particular kind. As Mikko pointed out, without reclaiming per-cgroup, the max limit of each cgroup needs to accommodate the peak usage of enclaves within that cgroup. That may be inconvenient for practical usage and limits could be forced to be set larger than necessary to run enclaves performantly. For example, we can observe following undesired consequences comparing a system with current kernel loaded with enclaves whose total peak usage is greater than the EPC capacity. 1) If a user wants to load the same exact enclaves but in separate cgroups, then the sum of cgroup limits must be higher than the capacity and the system will end up doing the same old global reclaiming as it is currently doing. Cgroup is not useful at all for isolating EPC consumptions. 2) To isolate impact of usage of each cgroup on other cgroups and yet still being able to load each enclave, the user basically has to carefully plan to ensure the sum of cgroup max limits, i.e., the sum of peak usage of enclaves, is not reaching over the capacity. That means no over-commiting allowed and the same system may not be able to load as many enclaves as with current kernel. @Dave and @Michal, Your thoughts? Or could you confirm we should not do reclaim per cgroup at all? If confirmed as desired, then this series can stop at patch 4. Thanks Haitao
On 12/18/23 13:24, Haitao Huang wrote:
> @Dave and @Michal, Your thoughts? Or could you confirm we should not
> do reclaim per cgroup at all?
What's the benefit of doing reclaim per cgroup? Is that worth the extra
complexity?
The key question here is whether we want the SGX VM to be complex and
more like the real VM or simple when a cgroup hits its limit. Right?
If stopping at patch 5 and having less code is even remotely an option,
why not do _that_?
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang
<haitao.huang@linux.intel.com> wrote:
> Thanks for raising this. Actually my understanding the above discussion was
> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
>
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe following undesired consequences comparing a system with current
> kernel loaded with enclaves whose total peak usage is greater than the EPC
> capacity.
>
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumptions.

That is the use of limits to prevent a runaway cgroup smothering the
system. Overcommitted values in such a config are fine because the more
simultaneous runaways, the less likely they are. The peak consumption comes
at the fair expense of others (some efficiency), and the limit contains the
runaway (hence the isolation).

> 2) To isolate impact of usage of each cgroup on other cgroups and yet still
> being able to load each enclave, the user basically has to carefully plan to
> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-committing
> allowed and the same system may not be able to load as many enclaves as with
> current kernel.

Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation, but it naturally goes against
efficient utilization.

The way other controllers approach this trade-off is with weights (cpu, io)
or protections (memory). I'm afraid the misc controller is not ready for
this.

My opinion is to start with the simple limits (first patches) and think of a
prioritization/guarantee mechanism based on real cases.

HTH,
Michal
Hi Dave,

On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen <dave.hansen@intel.com> wrote:

> On 12/18/23 13:24, Haitao Huang wrote:
>> @Dave and @Michal, Your thoughts? Or could you confirm we should not
>> do reclaim per cgroup at all?
> What's the benefit of doing reclaim per cgroup? Is that worth the extra
> complexity?
>

Without reclaiming per cgroup, we have to always set the limit to an
enclave's peak usage. This may not be efficient utilization, as in many
cases each enclave can perform fine with an EPC limit set below its peak.
Basically, a group cannot give up some pages for the greater good without
dying :-)

Also, with enclaves enabled with EDMM, the peak usage is not static, so it
is hard to determine upfront. Hence it might be an operation/deployment
inconvenience.

In case of over-committing (sum of limits > total capacity), one cgroup at
peak usage may require swapping pages out in a different cgroup if the
system is overloaded at that time.

> The key question here is whether we want the SGX VM to be complex and
> more like the real VM or simple when a cgroup hits its limit. Right?
>

Although it's fair to say the majority of the complexity of this series is
in the support for reclaiming per cgroup, I think it's manageable and much
less than the real VM after we removed the enclave-killing parts: the only
extra effort is to track pages in separate lists and reclaim them
separately, as opposed to tracking them on one global list and reclaiming
them together. The main reclaiming loop code is still pretty much the same
as before.

> If stopping at patch 5 and having less code is even remotely an option,
> why not do _that_?
>

I hope I described the limitations clearly enough above. If those are OK
with users and also make the series acceptable for a quick merge, I'm happy
to do that :-)

Thanks

Haitao
On Thu Jan 4, 2024 at 9:11 PM EET, Haitao Huang wrote: > > The key question here is whether we want the SGX VM to be complex and > > more like the real VM or simple when a cgroup hits its limit. Right? > > > > Although it's fair to say the majority of complexity of this series is in > support for reclaiming per cgroup, I think it's manageable and much less > than real VM after we removed the enclave killing parts: the only extra > effort is to track pages in separate list and reclaim them in separately > as opposed to track in on global list and reclaim together. The main > reclaiming loop code is still pretty much the same as before. I'm not seeing any unmanageable complexity on SGX side, and also cgroups specific changes are somewhat clean to me at least... BR, Jarkko
Hi Michal,

On Thu, 04 Jan 2024 06:38:41 -0600, Michal Koutný <mkoutny@suse.com> wrote:

> Hello.
>
> On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang
> <haitao.huang@linux.intel.com> wrote:
>> Thanks for raising this. Actually my understanding the above discussion
>> was mainly about not doing reclaiming by killing enclaves, i.e., I assumed
>> "reclaiming" within that context only meant for that particular kind.
>>
>> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
>> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
>> That may be inconvenient for practical usage and limits could be forced to
>> be set larger than necessary to run enclaves performantly. For example, we
>> can observe following undesired consequences comparing a system with
>> current kernel loaded with enclaves whose total peak usage is greater than
>> the EPC capacity.
>>
>> 1) If a user wants to load the same exact enclaves but in separate
>> cgroups, then the sum of cgroup limits must be higher than the capacity
>> and the system will end up doing the same old global reclaiming as it is
>> currently doing. Cgroup is not useful at all for isolating EPC
>> consumptions.
>
> That is the use of limits to prevent a runaway cgroup smothering the
> system. Overcommitted values in such a config are fine because the more
> simultaneous runaways, the less likely.
> The peak consumption is on the fair expense of others (some efficiency)
> and the limit contains the runaway (hence the isolation).
>

This makes sense to me in theory. Mikko, Chris Y/Bo Z, your thoughts on
whether this is good enough for your intended usages?

>> 2) To isolate impact of usage of each cgroup on other cgroups and yet
>> still being able to load each enclave, the user basically has to carefully
>> plan to ensure the sum of cgroup max limits, i.e., the sum of peak usage
>> of enclaves, is not reaching over the capacity. That means no
>> over-committing allowed and the same system may not be able to load as
>> many enclaves as with current kernel.
>
> Another "config layout" of limits is to achieve partitioning (sum ==
> capacity). That is perfect isolation but it naturally goes against
> efficient utilization. The way other controllers approach this trade-off
> is with weights (cpu, io) or protections (memory). I'm afraid misc
> controller is not ready for this.
>
> My opinion is to start with the simple limits (first patches) and think
> of prioritization/guarantee mechanism based on real cases.
>

We moved away from using a mem-like custom controller with (low, high, max)
to the misc controller. But if we need to add those down the road, then the
interface needs to be changed. So my concern on this route would be whether
misc would allow any of those extensions. On the other hand, it might turn
out less complex to just do the reclamation per cgroup.

Thanks a lot for your comments; they are really helpful!

Haitao
On 1/4/24 11:11, Haitao Huang wrote: > If those are OK with users and also make it acceptable for merge > quickly, I'm happy to do that 🙂 How about we put some actual numbers behind this? How much complexity are we talking about here? What's the diffstat for the utterly bare-bones implementation and what does it cost on top of that to do the per-cgroup reclaim?
On Thu, 04 Jan 2024 13:27:07 -0600, Dave Hansen <dave.hansen@intel.com> wrote:

> On 1/4/24 11:11, Haitao Huang wrote:
>> If those are OK with users and also make it acceptable for merge
>> quickly, I'm happy to do that 🙂
>
> How about we put some actual numbers behind this? How much complexity
> are we talking about here? What's the diffstat for the utterly
> bare-bones implementation and what does it cost on top of that to do the
> per-cgroup reclaim?
>

For bare-bone:

 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 104 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  39 ++++++++++
 arch/x86/kernel/cpu/sgx/main.c       |  41 +++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h        |   5 +
 include/linux/misc_cgroup.h          |  42 +++++++++++
 kernel/cgroup/misc.c                 |  52 ++++++++-----
 8 files changed, 281 insertions(+), 16 deletions(-)

Additional for per-cgroup reclaim:

 arch/x86/kernel/cpu/sgx/encl.c       |  41 ++++-----
 arch/x86/kernel/cpu/sgx/encl.h       |   3 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 +++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  18 ++-
 arch/x86/kernel/cpu/sgx/main.c       | 226 +++++++++++++++++------------
 arch/x86/kernel/cpu/sgx/sgx.h        |  85 +++++++++++-
 6 files changed, 480 insertions(+), 117 deletions(-)

Thanks

Haitao
On Thu, Jan 04, 2024 at 01:11:15PM -0600, Haitao Huang wrote: > Hi Dave, > > On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen <dave.hansen@intel.com> > wrote: > > > On 12/18/23 13:24, Haitao Huang wrote:> @Dave and @Michal, Your > > thoughts? Or could you confirm we should not > > > do reclaim per cgroup at all? > > What's the benefit of doing reclaim per cgroup? Is that worth the extra > > complexity? > > > > Without reclaiming per cgroup, then we have to always set the limit to > enclave's peak usage. This may not be efficient utilization as in many cases > each enclave can perform fine with EPC limit set less than peak. Basically > each group can not give up some pages for greater good without dying :-) +1. this is exactly my thinking too. The per cgroup reclaiming is important for the containers use case we are working on. I also think it makes the limit more meaningful: the per-container pool of EPC pages to use (which is independent of the enclave size). > > Also with enclaves enabled with EDMM, the peak usage is not static so hard > to determine upfront. Hence it might be an operation/deployment > inconvenience. > > In case of over-committing (sum of limits > total capacity), one cgroup at > peak usage may require swapping pages out in a different cgroup if system is > overloaded at that time. > > > The key question here is whether we want the SGX VM to be complex and > > more like the real VM or simple when a cgroup hits its limit. Right? > > > > Although it's fair to say the majority of complexity of this series is in > support for reclaiming per cgroup, I think it's manageable and much less > than real VM after we removed the enclave killing parts: the only extra > effort is to track pages in separate list and reclaim them in separately as > opposed to track in on global list and reclaim together. The main reclaiming > loop code is still pretty much the same as before. 
> > > > If stopping at patch 5 and having less code is even remotely an option, > > why not do _that_? > > > I hope I described limitations clear enough above. > If those are OK with users and also make it acceptable for merge quickly, You explained the gaps very well already. I don't think the simple version without per-cgroup reclaiming is enough for the container case. Mikko
On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote: > >> > >> > The point is, with or w/o this patch, you can only reclaim 16 EPC >> pages >> > in one >> > function call (as you have said you are going to remove >> > SGX_NR_TO_SCAN_MAX, >> > which is a cipher to both of us). The only difference I can see is, >> > with this >> > patch, you can have multiple calls of "isolate" and then call the >> > "do_reclaim" >> > once. >> > >> > But what's the good of having the "isolate" if the "do_reclaim" can >> only >> > reclaim >> > 16 pages anyway? >> > >> > Back to my last reply, are you afraid of any LRU has less than 16 >> pages >> > to >> > "isolate", therefore you need to loop LRUs of descendants to get 16? >> > Cause I >> > really cannot think of any other reason why you are doing this. >> > >> > >> >> I think I see your point. By capping pages reclaimed per cycle to 16, >> there is not much difference even if those 16 pages are spread in >> separate >> LRUs . The difference is only significant when we ever raise that cap. >> To >> preserve the current behavior of ewb loops independent on number of LRUs >> to loop through for each reclaiming cycle, regardless of the exact value >> of the page cap, I would still think current approach in the patch is >> reasonable choice. What do you think? > > To me I won't bother to do that. Having less than 16 pages in one LRU is > *extremely rare* that should never happen in practice. It's pointless > to make > such code adjustment at this stage. > > Let's focus on enabling functionality first. When you have some real > performance issue that is related to this, we can come back then. > I have done some rethinking about this and realize this does save quite some significant work: without breaking out isolation part from sgx_reclaim_pages(), I can remove the changes to use a list for isolated pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. 
About 1/3 of the changes for per-cgroup reclamation will be gone.

So I think I'll go this route now. The only downside may be performance if
an enclave spreads its pages across different cgroups, and even that is
minimal impact as we limit reclamation to 16 pages at a time. Let me know if
anyone feels strongly that we need to deal with that, or sees some other
potential issues I may have missed.

Thanks

Haitao
On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote: > On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote: > > > > >> > > >> > The point is, with or w/o this patch, you can only reclaim 16 EPC > >> pages > >> > in one > >> > function call (as you have said you are going to remove > >> > SGX_NR_TO_SCAN_MAX, > >> > which is a cipher to both of us). The only difference I can see is, > >> > with this > >> > patch, you can have multiple calls of "isolate" and then call the > >> > "do_reclaim" > >> > once. > >> > > >> > But what's the good of having the "isolate" if the "do_reclaim" can > >> only > >> > reclaim > >> > 16 pages anyway? > >> > > >> > Back to my last reply, are you afraid of any LRU has less than 16 > >> pages > >> > to > >> > "isolate", therefore you need to loop LRUs of descendants to get 16? > >> > Cause I > >> > really cannot think of any other reason why you are doing this. > >> > > >> > > >> > >> I think I see your point. By capping pages reclaimed per cycle to 16, > >> there is not much difference even if those 16 pages are spread in > >> separate > >> LRUs . The difference is only significant when we ever raise that cap. > >> To > >> preserve the current behavior of ewb loops independent on number of LRUs > >> to loop through for each reclaiming cycle, regardless of the exact value > >> of the page cap, I would still think current approach in the patch is > >> reasonable choice. What do you think? > > > > To me I won't bother to do that. Having less than 16 pages in one LRU is > > *extremely rare* that should never happen in practice. It's pointless > > to make > > such code adjustment at this stage. > > > > Let's focus on enabling functionality first. When you have some real > > performance issue that is related to this, we can come back then. 
> >
>
> I have done some rethinking about this and realize this does save quite
> some significant work: without breaking out isolation part from
> sgx_reclaim_pages(), I can remove the changes to use a list for isolated
> pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About
> 1/3 of changes for per-cgroup reclamation will be gone.
>
> So I think I'll go this route now. The only downside may be performance if
> a enclave spreads its pages in different cgroups and even that is minimum
> impact as we limit reclamation to 16 pages a time. Let me know if someone
> feel strongly we need dealt with that and see some other potential issues
> I may have missed.

We could deal with a possible performance regression later on (if there is
need). I mean, there should be a workload first that would show that sort of
stimulus...

> Thanks
>
> Haitao

BR, Jarkko
On Sat Jan 13, 2024 at 11:04 PM EET, Jarkko Sakkinen wrote: > On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote: > > On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote: > > > > > > > >> > > > >> > The point is, with or w/o this patch, you can only reclaim 16 EPC > > >> pages > > >> > in one > > >> > function call (as you have said you are going to remove > > >> > SGX_NR_TO_SCAN_MAX, > > >> > which is a cipher to both of us). The only difference I can see is, > > >> > with this > > >> > patch, you can have multiple calls of "isolate" and then call the > > >> > "do_reclaim" > > >> > once. > > >> > > > >> > But what's the good of having the "isolate" if the "do_reclaim" can > > >> only > > >> > reclaim > > >> > 16 pages anyway? > > >> > > > >> > Back to my last reply, are you afraid of any LRU has less than 16 > > >> pages > > >> > to > > >> > "isolate", therefore you need to loop LRUs of descendants to get 16? > > >> > Cause I > > >> > really cannot think of any other reason why you are doing this. > > >> > > > >> > > > >> > > >> I think I see your point. By capping pages reclaimed per cycle to 16, > > >> there is not much difference even if those 16 pages are spread in > > >> separate > > >> LRUs . The difference is only significant when we ever raise that cap. > > >> To > > >> preserve the current behavior of ewb loops independent on number of LRUs > > >> to loop through for each reclaiming cycle, regardless of the exact value > > >> of the page cap, I would still think current approach in the patch is > > >> reasonable choice. What do you think? > > > > > > To me I won't bother to do that. Having less than 16 pages in one LRU is > > > *extremely rare* that should never happen in practice. It's pointless > > > to make > > > such code adjustment at this stage. > > > > > > Let's focus on enabling functionality first. When you have some real > > > performance issue that is related to this, we can come back then. 
> > > > > > > I have done some rethinking about this and realize this does save quite > > some significant work: without breaking out isolation part from > > sgx_reclaim_pages(), I can remove the changes to use a list for isolated > > pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About > > 1/3 of changes for per-cgroup reclamation will be gone. > > > > So I think I'll go this route now. The only downside may be performance if > > a enclave spreads its pages in different cgroups and even that is minimum > > impact as we limit reclamation to 16 pages a time. Let me know if someone > > feel strongly we need dealt with that and see some other potential issues > > I may have missed. > > We could deal with possible performance regression later on (if there > is need). I mean there should we a workload first that would so that > sort stimulus... I.e. no reason to deal with imaginary workload :-) Go ahead and we'll go through it. BR, Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 33bcba313d40..e8848b493eb7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -281,33 +281,23 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
  *
- * Batch process a chunk of pages (at the moment 16) in order to degrade amount
- * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
- * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
- * + EWB) but not sufficiently. Reclaiming one page at a time would also be
- * problematic as it would increase the lock contention too much, which would
- * halt forward progress.
+ * Return: remaining pages to scan, i.e, @nr_to_scan minus the number of pages scanned.
  */
-static void sgx_reclaim_pages(void)
+unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
+				   struct list_head *dst)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
-	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	pgoff_t page_index;
-	LIST_HEAD(iso);
-	int ret;
-	int i;
+	struct sgx_epc_page *epc_page;
 
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
+	spin_lock(&lru->lock);
+	for (; nr_to_scan > 0; --nr_to_scan) {
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
 
@@ -316,23 +306,53 @@ static void sgx_reclaim_pages(void)
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			list_move_tail(&epc_page->list, &iso);
+			list_move_tail(&epc_page->list, dst);
 		} else
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			sgx_epc_page_reset_state(epc_page);
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
+
+	return nr_to_scan;
+}
+
+/**
+ * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
+ * @iso:	List of isolated pages for reclamation
+ *
+ * Take a list of EPC pages and reclaim them to the enclave's private shmem files. Do not
+ * reclaim the pages that have been accessed since the last scan, and move each of those pages
+ * to the tail of its tracking LRU list.
+ *
+ * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
+ * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
+ * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
+ * the lock contention too much, which would halt forward progress.
+ *
+ * Extra pages in the list beyond the SGX_NR_TO_SCAN_MAX limit are skipped and returned back to
+ * their tracking LRU lists.
+ *
+ * Return: number of pages successfully reclaimed.
+ */
+unsigned int sgx_do_epc_reclamation(struct list_head *iso)
+{
+	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
+	struct sgx_epc_page *epc_page, *tmp;
+	struct sgx_encl_page *encl_page;
+	pgoff_t page_index;
+	size_t ret, i;
 
-	if (list_empty(&iso))
-		return;
+	if (list_empty(iso))
+		return 0;
 
 	i = 0;
-	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
+	list_for_each_entry_safe(epc_page, tmp, iso, list) {
 		encl_page = epc_page->owner;
 
-		if (!sgx_reclaimer_age(epc_page))
+		if (i == SGX_NR_TO_SCAN_MAX || !sgx_reclaimer_age(epc_page))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -358,11 +378,11 @@ static void sgx_reclaim_pages(void)
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 	}
 
-	list_for_each_entry(epc_page, &iso, list)
+	list_for_each_entry(epc_page, iso, list)
 		sgx_reclaimer_block(epc_page);
 
 	i = 0;
-	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
+	list_for_each_entry_safe(epc_page, tmp, iso, list) {
 		encl_page = epc_page->owner;
 		sgx_reclaimer_write(epc_page, &backing[i++]);
 
@@ -371,6 +391,17 @@ static void sgx_reclaim_pages(void)
 
 		sgx_free_epc_page(epc_page);
 	}
+
+	return i;
+}
+
+static void sgx_reclaim_epc_pages_global(void)
+{
+	LIST_HEAD(iso);
+
+	sgx_isolate_epc_pages(&sgx_global_lru, SGX_NR_TO_SCAN, &iso);
+
+	sgx_do_epc_reclamation(&iso);
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,7 +418,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages_global();
 }
 
 static int ksgxd(void *p)
@@ -410,7 +441,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES)); if (sgx_should_reclaim(SGX_NR_HIGH_PAGES)) - sgx_reclaim_pages(); + sgx_reclaim_epc_pages_global(); cond_resched(); } @@ -587,7 +618,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) * Need to do a global reclamation if cgroup was not full but free * physical pages run out, causing __sgx_alloc_epc_page() to fail. */ - sgx_reclaim_pages(); + sgx_reclaim_epc_pages_global(); cond_resched(); } diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h index dd7ab65b5b27..6a40f70ed96f 100644 --- a/arch/x86/kernel/cpu/sgx/sgx.h +++ b/arch/x86/kernel/cpu/sgx/sgx.h @@ -19,6 +19,11 @@ #define SGX_MAX_EPC_SECTIONS 8 #define SGX_EEXTEND_BLOCK_SIZE 256 + +/* + * Maximum number of pages to scan for reclaiming. + */ +#define SGX_NR_TO_SCAN_MAX 32U #define SGX_NR_TO_SCAN 16 #define SGX_NR_LOW_PAGES 32 #define SGX_NR_HIGH_PAGES 64 @@ -162,6 +167,9 @@ void sgx_reclaim_direct(void); void sgx_mark_page_reclaimable(struct sgx_epc_page *page); int sgx_unmark_page_reclaimable(struct sgx_epc_page *page); struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim); +unsigned int sgx_do_epc_reclamation(struct list_head *iso); +unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan, + struct list_head *dst); void sgx_ipi_cb(void *info);
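(Aside, not part of the patch: sgx_isolate_epc_pages() returning the unused scan budget is what lets a future per-cgroup reclaimer walk several LRUs with one overall limit. A userspace sketch of that calling convention, with all names hypothetical and each LRU reduced to a bare page count:)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each LRU is reduced to a count of reclaimable pages. */
struct mini_lru {
	unsigned int nr_reclaimable;
};

/*
 * Models the isolate step: consume up to nr_to_scan pages from one LRU
 * and return the unused scan budget.
 */
unsigned int scan_one(struct mini_lru *lru, unsigned int nr_to_scan,
		      unsigned int *nr_isolated)
{
	unsigned int take = nr_to_scan < lru->nr_reclaimable ?
			    nr_to_scan : lru->nr_reclaimable;

	lru->nr_reclaimable -= take;
	*nr_isolated += take;
	return nr_to_scan - take;
}

/*
 * A per-cgroup style caller: pass the remaining budget on to the next LRU
 * until either the budget or the LRUs run out.
 */
unsigned int scan_hierarchy(struct mini_lru *lrus, size_t n,
			    unsigned int budget)
{
	unsigned int nr_isolated = 0;

	for (size_t i = 0; i < n && budget > 0; i++)
		budget = scan_one(&lrus[i], budget, &nr_isolated);

	return nr_isolated;
}
```

(With this shape, the global path in the patch is just the one-LRU special case: a single scan with budget SGX_NR_TO_SCAN followed by reclamation of whatever was isolated.)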