Message ID | 20230923030657.16148-13-haitao.huang@linux.intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org>
From: Haitao Huang <haitao.huang@linux.intel.com>
To: jarkko@kernel.org, dave.hansen@linux.intel.com, tj@kernel.org, linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org, x86@kernel.org, cgroups@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, sohil.mehta@intel.com
Cc: zhiquan1.li@intel.com, kristen@linux.intel.com, seanjc@google.com, zhanb@microsoft.com, anakrish@microsoft.com, mikko.ylinen@linux.intel.com, yangjie@microsoft.com
Subject: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
Date: Fri, 22 Sep 2023 20:06:51 -0700
Message-Id: <20230923030657.16148-13-haitao.huang@linux.intel.com>
In-Reply-To: <20230923030657.16148-1-haitao.huang@linux.intel.com>
References: <20230923030657.16148-1-haitao.huang@linux.intel.com> |
Series |
Add Cgroup support for SGX EPC memory
|
|
Commit Message
Haitao Huang
Sept. 23, 2023, 3:06 a.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce the OOM path for killing an enclave with a reclaimer that is no longer able to reclaim enough EPC pages. Find a victim enclave, which will be an enclave with only "unreclaimable" EPC pages left in the cgroup LRU lists. Once a victim is identified, mark the enclave as OOM, zap the enclave's entire page range, and drain all mm references in encl->mm_list. Block allocating any EPC pages in the #PF handler, reloading any pages in all paths, and creating any new mappings.

The OOM killing path may race with the reclaimers: in some cases the victim enclave is in the process of reclaiming its last EPC pages when OOM happens, that is, all pages other than the SECS and VA pages are in the RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to the enclave backing, the VA pages, and the SECS. So the OOM killer does not directly release those enclave resources; instead, it lets all in-progress reclaiming finish and relies (as currently done) on kref_put on encl->refcount to trigger sgx_encl_release() to do the final cleanup.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY

V4:
- Updates for patch reordering and typo fixes.

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and reloading when the enclave is marked for OOM. Do not release any enclave resources other than draining mm_list entries, and let pages in RECLAIMING_IN_PROGRESS be reaped by reclaimers.
- Due to the above changes, also removed the no-longer-needed encl->lock in the OOM path, which was causing deadlocks reported by the lock prover.
---
 arch/x86/kernel/cpu/sgx/driver.c |  27 +-----
 arch/x86/kernel/cpu/sgx/encl.c   |  48 ++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |   2 +
 arch/x86/kernel/cpu/sgx/ioctl.c  |   9 ++
 arch/x86/kernel/cpu/sgx/main.c   | 140 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |   1 +
 6 files changed, 200 insertions(+), 27 deletions(-)
Comments
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > From: Sean Christopherson <sean.j.christopherson@intel.com> > > Introduce the OOM path for killing an enclave with a reclaimer that is no > longer able to reclaim enough EPC pages. Find a victim enclave, which > will be an enclave with only "unreclaimable" EPC pages left in the > cgroup LRU lists. Once a victim is identified, mark the enclave as OOM > and zap the enclave's entire page range, and drain all mm references in > encl->mm_list. Block allocating any EPC pages in #PF handler, or > reloading any pages in all paths, or creating any new mappings. > > The OOM killing path may race with the reclaimers: in some cases, the > victim enclave is in the process of reclaiming the last EPC pages when > OOM happens, that is, all pages other than SECS and VA pages are in > RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to > the enclave backing, VA pages as well as SECS. So the OOM killer does > not directly release those enclave resources, instead, it lets all > reclaiming in progress to finish, and relies (as currently done) on > kref_put on encl->refcount to trigger sgx_encl_release() to do the > final cleanup. > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > Cc: Sean Christopherson <seanjc@google.com> > --- > V5: > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY > > V4: > - Updates for patch reordering and typo fixes. > > V3: > - Rebased to use the new VMA_ITERATOR to zap VMAs. > - Fixed the racing cases by blocking new page allocation/mapping and > reloading when enclave is marked for OOM. 
> And do not release any enclave
> resources other than draining mm_list entries, and let pages in
> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> - Due to above changes, also removed the no-longer needed encl->lock in
> the OOM path which was causing deadlocks reported by the lock prover.
>

[...]

> +
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru: LRU that is low
> + *
> + * Return: %true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> +	struct sgx_epc_page *victim;
> +
> +	spin_lock(&lru->lock);
> +	victim = sgx_oom_get_victim(lru);
> +	spin_unlock(&lru->lock);
> +
> +	if (!victim)
> +		return false;
> +
> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> +		return sgx_oom_encl_page(victim->encl_page);
> +
> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> +		return sgx_oom_encl(victim->encl);

I hate to bring this up, at least at this stage, but I am wondering why we need to put VA and SECS pages on the unreclaimable list, but cannot keep an "enclave_list" instead?

So by looking at the patch ("x86/sgx: Limit process EPC usage with misc cgroup controller"), if I am not missing anything, the whole "unreclaimable" list is just used to find the victim enclave when OOM needs to be done. Thus, I don't see why an "enclave_list" cannot be used to achieve this.

The reason I am asking is that using an "enclave_list" seems to simplify the code. At least the patches related to tracking VA/SECS pages, and the SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing, could be eliminated completely. Using an "enclave_list", I guess you just need to put the enclave on the current EPC cgroup when the SECS page is allocated.

In fact, putting an "unreclaimable" list in the LRU itself is a little bit confusing because: 1) you cannot really reclaim anything from the list; 2) VA/SECS pages don't have the concept of "young" at all, so it makes no sense to annotate them as LRU.

Thus putting VA/SECS pages on an "unreclaimable" list, instead of keeping an "enclave_list", seems to have no benefit and only makes the code more complicated. Or am I missing anything?
On Mon, Oct 09, 2023, Kai Huang wrote: > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > > +/** > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU > > + * @lru: LRU that is low > > + * > > + * Return: %true if a victim was found and kicked. > > + */ > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) > > +{ > > + struct sgx_epc_page *victim; > > + > > + spin_lock(&lru->lock); > > + victim = sgx_oom_get_victim(lru); > > + spin_unlock(&lru->lock); > > + > > + if (!victim) > > + return false; > > + > > + if (victim->flags & SGX_EPC_OWNER_PAGE) > > + return sgx_oom_encl_page(victim->encl_page); > > + > > + if (victim->flags & SGX_EPC_OWNER_ENCL) > > + return sgx_oom_encl(victim->encl); > > I hate to bring this up, at least at this stage, but I am wondering why we need > to put VA and SECS pages to the unreclaimable list, but cannot keep an > "enclave_list" instead? The motivation for tracking EPC pages instead of enclaves was so that the EPC OOM-killer could "kill" VMs as well as host-owned enclaves. The virtual EPC code didn't actually kill the VM process, it instead just freed all of the EPC pages and abused the SGX architecture to effectively make the guest recreate all its enclaves (IIRC, QEMU does the same thing to "support" live migration). Looks like y'all punted on that with: The EPC pages allocated for KVM guests by the virtual EPC driver are not reclaimable by the host kernel [5]. Therefore they are not tracked by any LRU lists for reclaiming purposes in this implementation, but they are charged toward the cgroup of the user processs (e.g., QEMU) launching the guest. And when the cgroup EPC usage reaches its limit, the virtual EPC driver will stop allocating more EPC for the VM, and return SIGBUS to the user process which would abort the VM launch. which IMO is a hack, unless returning SIGBUS is actually enforced somehow. Relying on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup enforcement. E.g. 
if the hard limit for an EPC cgroup is lowered, userspace running enclaves in a VM could continue on and refuse to give up its EPC, and thus run above its limit in perpetuity.

I can see userspace wanting to explicitly terminate the VM instead of "silently" killing the VM's enclaves, but that seems like it should be a knob in the virtual EPC code.
On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote: > On Mon, Oct 09, 2023, Kai Huang wrote: > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > > > +/** > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU > > > + * @lru: LRU that is low > > > + * > > > + * Return: %true if a victim was found and kicked. > > > + */ > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) > > > +{ > > > + struct sgx_epc_page *victim; > > > + > > > + spin_lock(&lru->lock); > > > + victim = sgx_oom_get_victim(lru); > > > + spin_unlock(&lru->lock); > > > + > > > + if (!victim) > > > + return false; > > > + > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) > > > + return sgx_oom_encl_page(victim->encl_page); > > > + > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) > > > + return sgx_oom_encl(victim->encl); > > > > I hate to bring this up, at least at this stage, but I am wondering why we need > > to put VA and SECS pages to the unreclaimable list, but cannot keep an > > "enclave_list" instead? > > The motivation for tracking EPC pages instead of enclaves was so that the EPC > OOM-killer could "kill" VMs as well as host-owned enclaves. > Ah this seems a fair argument. :-) > The virtual EPC code > didn't actually kill the VM process, it instead just freed all of the EPC pages > and abused the SGX architecture to effectively make the guest recreate all its > enclaves (IIRC, QEMU does the same thing to "support" live migration). It returns SIGBUS. SGX VM live migration also requires enough EPC being able to be allocated on the destination machine to work AFAICT. > > Looks like y'all punted on that with: > > The EPC pages allocated for KVM guests by the virtual EPC driver are not > reclaimable by the host kernel [5]. Therefore they are not tracked by any > LRU lists for reclaiming purposes in this implementation, but they are > charged toward the cgroup of the user processs (e.g., QEMU) launching the > guest. 
> And when the cgroup EPC
> driver will stop allocating more EPC for the VM, and return SIGBUS to the
> user process which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.
>

What do you mean by "enforced"?

Currently sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate an EPC page. When this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(), which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). And Qemu then kills the VM with some nonsense message:

error: kvm run failed Bad address
<dump guest registers nonsense>

> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> enforcement. E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> its limit in perpetuity.

Agreed. But this looks like it cannot be resolved until we can reclaim EPC pages from a VM. Alternatively, in the EPC cgroup code we can refuse to set a maximum which cannot be supported, e.g., one less than the total virtual EPC size. I guess the second option is acceptable for now?

> I can see userspace wanting to explicitly terminate the VM instead of "silently"
> the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> code.

See above for the second option.
On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: >> From: Sean Christopherson <sean.j.christopherson@intel.com> >> >> Introduce the OOM path for killing an enclave with a reclaimer that is >> no >> longer able to reclaim enough EPC pages. Find a victim enclave, which >> will be an enclave with only "unreclaimable" EPC pages left in the >> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM >> and zap the enclave's entire page range, and drain all mm references in >> encl->mm_list. Block allocating any EPC pages in #PF handler, or >> reloading any pages in all paths, or creating any new mappings. >> >> The OOM killing path may race with the reclaimers: in some cases, the >> victim enclave is in the process of reclaiming the last EPC pages when >> OOM happens, that is, all pages other than SECS and VA pages are in >> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to >> the enclave backing, VA pages as well as SECS. So the OOM killer does >> not directly release those enclave resources, instead, it lets all >> reclaiming in progress to finish, and relies (as currently done) on >> kref_put on encl->refcount to trigger sgx_encl_release() to do the >> final cleanup. >> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> >> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> >> Cc: Sean Christopherson <seanjc@google.com> >> --- >> V5: >> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY >> >> V4: >> - Updates for patch reordering and typo fixes. >> >> V3: >> - Rebased to use the new VMA_ITERATOR to zap VMAs. >> - Fixed the racing cases by blocking new page allocation/mapping and >> reloading when enclave is marked for OOM. 
>> And do not release any enclave
>> resources other than draining mm_list entries, and let pages in
>> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
>> - Due to above changes, also removed the no-longer needed encl->lock in
>> the OOM path which was causing deadlocks reported by the lock prover.
>>
>
> [...]
>
>> +
>> +/**
>> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> + * @lru: LRU that is low
>> + *
>> + * Return: %true if a victim was found and kicked.
>> + */
>> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> +{
>> +	struct sgx_epc_page *victim;
>> +
>> +	spin_lock(&lru->lock);
>> +	victim = sgx_oom_get_victim(lru);
>> +	spin_unlock(&lru->lock);
>> +
>> +	if (!victim)
>> +		return false;
>> +
>> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> +		return sgx_oom_encl_page(victim->encl_page);
>> +
>> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> +		return sgx_oom_encl(victim->encl);
>
> I hate to bring this up, at least at this stage, but I am wondering why
> we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?
>
> So by looking the patch (" x86/sgx: Limit process EPC usage with misc
> cgroup
> controller"), if I am not missing anything, the whole "unreclaimable"
> list is
> just used to find the victim enclave when OOM needs to be done. Thus, I
> don't
> see why "enclave_list" cannot be used to achieve this.
>
> The reason that I am asking is because it seems using "enclave_list" we
> can
> simplify the code. At least the patches related to track VA/SECS pages,
> and the
> SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated
> completely.
> Using "enclave_list", I guess you just need to put the enclave to the
> current
> EPC cgroup when SECS page is allocated.
>

Later, the hosting process could be migrated/reassigned to another cgroup? What to do when the new cgroup is OOM?

Thanks
Haitao
On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote: > On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> wrote: > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > > Introduce the OOM path for killing an enclave with a reclaimer that is > > > no > > > longer able to reclaim enough EPC pages. Find a victim enclave, which > > > will be an enclave with only "unreclaimable" EPC pages left in the > > > cgroup LRU lists. Once a victim is identified, mark the enclave as OOM > > > and zap the enclave's entire page range, and drain all mm references in > > > encl->mm_list. Block allocating any EPC pages in #PF handler, or > > > reloading any pages in all paths, or creating any new mappings. > > > > > > The OOM killing path may race with the reclaimers: in some cases, the > > > victim enclave is in the process of reclaiming the last EPC pages when > > > OOM happens, that is, all pages other than SECS and VA pages are in > > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to > > > the enclave backing, VA pages as well as SECS. So the OOM killer does > > > not directly release those enclave resources, instead, it lets all > > > reclaiming in progress to finish, and relies (as currently done) on > > > kref_put on encl->refcount to trigger sgx_encl_release() to do the > > > final cleanup. > > > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com> > > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> > > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> > > > Cc: Sean Christopherson <seanjc@google.com> > > > --- > > > V5: > > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY > > > > > > V4: > > > - Updates for patch reordering and typo fixes. 
> > > > > > V3: > > > - Rebased to use the new VMA_ITERATOR to zap VMAs. > > > - Fixed the racing cases by blocking new page allocation/mapping and > > > reloading when enclave is marked for OOM. And do not release any enclave > > > resources other than draining mm_list entries, and let pages in > > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers. > > > - Due to above changes, also removed the no-longer needed encl->lock in > > > the OOM path which was causing deadlocks reported by the lock prover. > > > > > > > [...] > > > > > + > > > +/** > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU > > > + * @lru: LRU that is low > > > + * > > > + * Return: %true if a victim was found and kicked. > > > + */ > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) > > > +{ > > > + struct sgx_epc_page *victim; > > > + > > > + spin_lock(&lru->lock); > > > + victim = sgx_oom_get_victim(lru); > > > + spin_unlock(&lru->lock); > > > + > > > + if (!victim) > > > + return false; > > > + > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) > > > + return sgx_oom_encl_page(victim->encl_page); > > > + > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) > > > + return sgx_oom_encl(victim->encl); > > > > I hate to bring this up, at least at this stage, but I am wondering why > > we need > > to put VA and SECS pages to the unreclaimable list, but cannot keep an > > "enclave_list" instead? > > > > So by looking the patch (" x86/sgx: Limit process EPC usage with misc > > cgroup > > controller"), if I am not missing anything, the whole "unreclaimable" > > list is > > just used to find the victim enclave when OOM needs to be done. Thus, I > > don't > > see why "enclave_list" cannot be used to achieve this. > > > > The reason that I am asking is because it seems using "enclave_list" we > > can > > simplify the code. At least the patches related to track VA/SECS pages, > > and the > > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated > > completely. 
> > Using "enclave_list", I guess you just need to put the enclave to the > > current > > EPC cgroup when SECS page is allocated. > > > Later the hosting process could migrated/reassigned to another cgroup? > What to do when the new cgroup is OOM? > You addressed in the documentation, no? +Migration +--------- + +Once an EPC page is charged to a cgroup (during allocation), it +remains charged to the original cgroup until the page is released +or reclaimed. Migrating a process to a different cgroup doesn't +move the EPC charges that it incurred while in the previous cgroup +to its new cgroup.
On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote: > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote: > > On Mon, Oct 09, 2023, Kai Huang wrote: > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > > > > +/** > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU > > > > + * @lru: LRU that is low > > > > + * > > > > + * Return: %true if a victim was found and kicked. > > > > + */ > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) > > > > +{ > > > > + struct sgx_epc_page *victim; > > > > + > > > > + spin_lock(&lru->lock); > > > > + victim = sgx_oom_get_victim(lru); > > > > + spin_unlock(&lru->lock); > > > > + > > > > + if (!victim) > > > > + return false; > > > > + > > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) > > > > + return sgx_oom_encl_page(victim->encl_page); > > > > + > > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) > > > > + return sgx_oom_encl(victim->encl); > > > > > > I hate to bring this up, at least at this stage, but I am wondering why we need > > > to put VA and SECS pages to the unreclaimable list, but cannot keep an > > > "enclave_list" instead? > > > > The motivation for tracking EPC pages instead of enclaves was so that the EPC > > OOM-killer could "kill" VMs as well as host-owned enclaves. > > > > Ah this seems a fair argument. :-) > > > The virtual EPC code > > didn't actually kill the VM process, it instead just freed all of the EPC pages > > and abused the SGX architecture to effectively make the guest recreate all its > > enclaves (IIRC, QEMU does the same thing to "support" live migration). > > It returns SIGBUS. SGX VM live migration also requires enough EPC being able to > be allocated on the destination machine to work AFAICT. > > > > > Looks like y'all punted on that with: > > > > The EPC pages allocated for KVM guests by the virtual EPC driver are not > > reclaimable by the host kernel [5]. 
> > Therefore they are not tracked by any
> > LRU lists for reclaiming purposes in this implementation, but they are
> > charged toward the cgroup of the user processs (e.g., QEMU) launching the
> > guest. And when the cgroup EPC usage reaches its limit, the virtual EPC
> > driver will stop allocating more EPC for the VM, and return SIGBUS to the
> > user process which would abort the VM launch.
> >
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.
> >
>
> "enforced" do you mean?
>
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page. And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run().
> And Qemu then kills the VM with some nonsense message:
>
> error: kvm run failed Bad address
> <dump guest registers nonsense>
>
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> > enforcement. E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> > its limit in perpetuity.
>
> > I can see userspace wanting to explicitly terminate the VM instead of "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> > code.

I guess I slightly misunderstood your words. You mean we want to kill the VM when the limit is set lower than the virtual EPC size. This patch adds SGX_ENCL_NO_MEMORY. I guess we can use it for virtual EPC too? In sgx_vepc_fault(), we check this flag early and return SIGBUS if it is set.

But this also requires keeping virtual EPC pages in some list and handling them in sgx_epc_oom() too. And for virtual EPC pages, I guess the "young" logic can be applied, so it's probably better to keep the actual virtual EPC pages on a (separate?) list instead of keeping the virtual EPC instance.

struct sgx_epc_lru {
	struct list_head reclaimable;
	struct sgx_encl *enclaves;
	struct list_head vepc_pages;
};

Or still tracking VA/SECS and virtual EPC pages in a single unreclaimable list? I don't know :-)
On Mon, 09 Oct 2023 20:18:00 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote: >> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> >> wrote: >> >> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: >> > > From: Sean Christopherson <sean.j.christopherson@intel.com> >> > > >> > > Introduce the OOM path for killing an enclave with a reclaimer that >> is >> > > no >> > > longer able to reclaim enough EPC pages. Find a victim enclave, >> which >> > > will be an enclave with only "unreclaimable" EPC pages left in the >> > > cgroup LRU lists. Once a victim is identified, mark the enclave as >> OOM >> > > and zap the enclave's entire page range, and drain all mm >> references in >> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or >> > > reloading any pages in all paths, or creating any new mappings. >> > > >> > > The OOM killing path may race with the reclaimers: in some cases, >> the >> > > victim enclave is in the process of reclaiming the last EPC pages >> when >> > > OOM happens, that is, all pages other than SECS and VA pages are in >> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires >> access to >> > > the enclave backing, VA pages as well as SECS. So the OOM killer >> does >> > > not directly release those enclave resources, instead, it lets all >> > > reclaiming in progress to finish, and relies (as currently done) on >> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the >> > > final cleanup. 
>> > > >> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> >> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com> >> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> >> > > Cc: Sean Christopherson <seanjc@google.com> >> > > --- >> > > V5: >> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY >> > > >> > > V4: >> > > - Updates for patch reordering and typo fixes. >> > > >> > > V3: >> > > - Rebased to use the new VMA_ITERATOR to zap VMAs. >> > > - Fixed the racing cases by blocking new page allocation/mapping and >> > > reloading when enclave is marked for OOM. And do not release any >> enclave >> > > resources other than draining mm_list entries, and let pages in >> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers. >> > > - Due to above changes, also removed the no-longer needed >> encl->lock in >> > > the OOM path which was causing deadlocks reported by the lock >> prover. >> > > >> > >> > [...] >> > >> > > + >> > > +/** >> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU >> > > + * @lru: LRU that is low >> > > + * >> > > + * Return: %true if a victim was found and kicked. 
>> > > + */ >> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) >> > > +{ >> > > + struct sgx_epc_page *victim; >> > > + >> > > + spin_lock(&lru->lock); >> > > + victim = sgx_oom_get_victim(lru); >> > > + spin_unlock(&lru->lock); >> > > + >> > > + if (!victim) >> > > + return false; >> > > + >> > > + if (victim->flags & SGX_EPC_OWNER_PAGE) >> > > + return sgx_oom_encl_page(victim->encl_page); >> > > + >> > > + if (victim->flags & SGX_EPC_OWNER_ENCL) >> > > + return sgx_oom_encl(victim->encl); >> > >> > I hate to bring this up, at least at this stage, but I am wondering >> why >> > we need >> > to put VA and SECS pages to the unreclaimable list, but cannot keep an >> > "enclave_list" instead? >> > >> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc >> > cgroup >> > controller"), if I am not missing anything, the whole "unreclaimable" >> > list is >> > just used to find the victim enclave when OOM needs to be done. >> Thus, I >> > don't >> > see why "enclave_list" cannot be used to achieve this. >> > >> > The reason that I am asking is because it seems using "enclave_list" >> we >> > can >> > simplify the code. At least the patches related to track VA/SECS >> pages, >> > and the >> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated >> > completely. >> > Using "enclave_list", I guess you just need to put the enclave to the >> > current >> > EPC cgroup when SECS page is allocated. >> > >> Later the hosting process could migrated/reassigned to another cgroup? >> What to do when the new cgroup is OOM? >> > > You addressed in the documentation, no? > > +Migration > +--------- > + > +Once an EPC page is charged to a cgroup (during allocation), it > +remains charged to the original cgroup until the page is released > +or reclaimed. Migrating a process to a different cgroup doesn't > +move the EPC charges that it incurred while in the previous cgroup > +to its new cgroup. 
Should we kill the enclave though because some VA pages may be in the new group? Haitao
Hi Sean On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson <seanjc@google.com> wrote: > On Mon, Oct 09, 2023, Kai Huang wrote: >> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: >> > +/** >> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU >> > + * @lru: LRU that is low >> > + * >> > + * Return: %true if a victim was found and kicked. >> > + */ >> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) >> > +{ >> > + struct sgx_epc_page *victim; >> > + >> > + spin_lock(&lru->lock); >> > + victim = sgx_oom_get_victim(lru); >> > + spin_unlock(&lru->lock); >> > + >> > + if (!victim) >> > + return false; >> > + >> > + if (victim->flags & SGX_EPC_OWNER_PAGE) >> > + return sgx_oom_encl_page(victim->encl_page); >> > + >> > + if (victim->flags & SGX_EPC_OWNER_ENCL) >> > + return sgx_oom_encl(victim->encl); >> >> I hate to bring this up, at least at this stage, but I am wondering why >> we need >> to put VA and SECS pages to the unreclaimable list, but cannot keep an >> "enclave_list" instead? > > The motivation for tracking EPC pages instead of enclaves was so that > the EPC > OOM-killer could "kill" VMs as well as host-owned enclaves. The virtual > EPC code > didn't actually kill the VM process, it instead just freed all of the > EPC pages > and abused the SGX architecture to effectively make the guest recreate > all its > enclaves (IIRC, QEMU does the same thing to "support" live migration). > > Looks like y'all punted on that with: > > The EPC pages allocated for KVM guests by the virtual EPC driver are > not > reclaimable by the host kernel [5]. Therefore they are not tracked by > any > LRU lists for reclaiming purposes in this implementation, but they are > charged toward the cgroup of the user processs (e.g., QEMU) launching > the > guest. And when the cgroup EPC usage reaches its limit, the virtual > EPC > driver will stop allocating more EPC for the VM, and return SIGBUS to > the > user process which would abort the VM launch. 
> > which IMO is a hack, unless returning SIGBUS is actually enforced > somehow. Relying > on userspace to be kind enough to kill its VMs kinda defeats the purpose > of cgroup > enforcement. E.g. if the hard limit for a EPC cgroup is lowered, > userspace running > enclaves in a VM could continue on and refuse to give up its EPC, and > thus run above > its limit in perpetuity. > Cgroup would refuse to allocate more when the limit is reached, so VMs cannot run above the limit. IIRC VMs only support a static EPC size right now, so reaching the limit at launch means the EPC size given on the command line for QEMU is not appropriate. So the VM should not launch, hence the current behavior. [All EPC pages in the guest are allocated on page faults caused by the sanitization process in the guest kernel during init, which is part of the VM launch process. So SIGBUS will turn into a failed VM launch.] Once it is launched, the guest kernel would have the 'total capacity' given by the static value from the QEMU option. It would start paging when that is used up, and would never ask the host for more. In the future, with dynamic EPC for running guests, QEMU could handle the allocation failure and pass SIGBUS to the running guest kernel. Is that a correct understanding? > I can see userspace wanting to explicitly terminate the VM instead of > "silently" > the VM's enclaves, but that seems like it should be a knob in the > virtual EPC > code. If my understanding above is correct, and I'm understanding your statement above correctly, then I don't see that we really need a separate knob for the vEPC code. A running guest reaching a cgroup limit (assuming dynamic allocation is implemented) should not automatically translate into killing the VM. Instead, it's user space's job to work with the guest to handle the allocation failure. The guest could page out and kill enclaves. Haitao
> > > > > > > Later the hosting process could be migrated/reassigned to another cgroup? > > > What to do when the new cgroup is OOM? > > > > > > > You addressed in the documentation, no? > > > > +Migration > > +--------- > > + > > +Once an EPC page is charged to a cgroup (during allocation), it > > +remains charged to the original cgroup until the page is released > > +or reclaimed. Migrating a process to a different cgroup doesn't > > +move the EPC charges that it incurred while in the previous cgroup > > +to its new cgroup. > > Should we kill the enclave though because some VA pages may be in the new > group? > I guess acceptable? And is there any difference if you keep VA/SECS on the unreclaimable list? If you migrate one enclave to another cgroup, the old EPC pages stay in the old cgroup while the new one is charged to the new group IIUC. I am not a cgroup expert, but by searching some old threads it appears this isn't a supported model: https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote: > Hi Sean > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson > <seanjc@google.com> wrote: > > > On Mon, Oct 09, 2023, Kai Huang wrote: > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > > > > +/** > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU > > > > + * @lru: LRU that is low > > > > + * > > > > + * Return: %true if a victim was found and kicked. > > > > + */ > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) > > > > +{ > > > > + struct sgx_epc_page *victim; > > > > + > > > > + spin_lock(&lru->lock); > > > > + victim = sgx_oom_get_victim(lru); > > > > + spin_unlock(&lru->lock); > > > > + > > > > + if (!victim) > > > > + return false; > > > > + > > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) > > > > + return sgx_oom_encl_page(victim->encl_page); > > > > + > > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) > > > > + return sgx_oom_encl(victim->encl); > > > > > > I hate to bring this up, at least at this stage, but I am wondering why > > > we need > > > to put VA and SECS pages to the unreclaimable list, but cannot keep an > > > "enclave_list" instead? > > > > The motivation for tracking EPC pages instead of enclaves was so that > > the EPC > > OOM-killer could "kill" VMs as well as host-owned enclaves. The virtual > > EPC code > > didn't actually kill the VM process, it instead just freed all of the > > EPC pages > > and abused the SGX architecture to effectively make the guest recreate > > all its > > enclaves (IIRC, QEMU does the same thing to "support" live migration). > > > > Looks like y'all punted on that with: > > > > The EPC pages allocated for KVM guests by the virtual EPC driver are > > not > > reclaimable by the host kernel [5]. Therefore they are not tracked by > > any > > LRU lists for reclaiming purposes in this implementation, but they are > > charged toward the cgroup of the user processs (e.g., QEMU) launching > > the > > guest. 
And when the cgroup EPC usage reaches its limit, the virtual > > EPC > > driver will stop allocating more EPC for the VM, and return SIGBUS to > > the > > user process which would abort the VM launch. > > > > which IMO is a hack, unless returning SIGBUS is actually enforced > > somehow. Relying > > on userspace to be kind enough to kill its VMs kinda defeats the purpose > > of cgroup > > enforcement. E.g. if the hard limit for a EPC cgroup is lowered, > > userspace running > > encalves in a VM could continue on and refuse to give up its EPC, and > > thus run above > > its limit in perpetuity. > > > Cgroup would refuse to allocate more when limit is reached so VMs can not > run above limit. > > IIRC VMs only support static EPC size right now, reaching limit at launch > means the EPC size given in command line for QEMU is not appropriate. So > VM should not launch, hence the current behavior. > > [all EPC pages in guest are allocated on page fault caused by the > sensitization process in guest kernel during init, which is part of the VM > Launch process. So SIGNBUS will turn into failed VM launch.] > > Once it is launched, guest kernel would have 'total capacity' given by the > static value from QEMU option. And it would start paging when it is used > up, never would ask for more from host. > > For future with dynamic EPC for running guests, QEMU could handle > allocation failure and pass SIGBUS to the running guest kernel. Is that > correct understanding? > > > > I can see userspace wanting to explicitly terminate the VM instead of > > "silently" > > the VM's enclaves, but that seems like it should be a knob in the > > virtual EPC > > code. > > If my understanding above is correct and understanding your statement > above correctly, then don't see we really need separate knob for vEPC > code. Reaching a cgroup limit by a running guest (assuming dynamic > allocation implemented) should not translate automatically killing the VM. 
> Instead, it's user space job to work with guest to handle allocation > failure. Guest could page and kill enclaves. > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs: 1) misc.max = 100M 2) Launch VMs with total virtual EPC size = 100M <- success 3) misc.max = 50M 3) will also succeed, but nothing will happen: the VMs will still be holding 100M EPC. You need to somehow track virtual EPC and kill the VM instead. (Or somehow fail to do 3), if that is also an acceptable option.)
On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote: >> Hi Sean >> >> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson >> <seanjc@google.com> wrote: >> >> > On Mon, Oct 09, 2023, Kai Huang wrote: >> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: >> > > > +/** >> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target >> LRU >> > > > + * @lru: LRU that is low >> > > > + * >> > > > + * Return: %true if a victim was found and kicked. >> > > > + */ >> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) >> > > > +{ >> > > > + struct sgx_epc_page *victim; >> > > > + >> > > > + spin_lock(&lru->lock); >> > > > + victim = sgx_oom_get_victim(lru); >> > > > + spin_unlock(&lru->lock); >> > > > + >> > > > + if (!victim) >> > > > + return false; >> > > > + >> > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) >> > > > + return sgx_oom_encl_page(victim->encl_page); >> > > > + >> > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) >> > > > + return sgx_oom_encl(victim->encl); >> > > >> > > I hate to bring this up, at least at this stage, but I am wondering >> why >> > > we need >> > > to put VA and SECS pages to the unreclaimable list, but cannot keep >> an >> > > "enclave_list" instead? >> > >> > The motivation for tracking EPC pages instead of enclaves was so that >> > the EPC >> > OOM-killer could "kill" VMs as well as host-owned enclaves. The >> virtual >> > EPC code >> > didn't actually kill the VM process, it instead just freed all of the >> > EPC pages >> > and abused the SGX architecture to effectively make the guest recreate >> > all its >> > enclaves (IIRC, QEMU does the same thing to "support" live migration). >> > >> > Looks like y'all punted on that with: >> > >> > The EPC pages allocated for KVM guests by the virtual EPC driver are >> > not >> > reclaimable by the host kernel [5]. 
Therefore they are not tracked >> by >> > any >> > LRU lists for reclaiming purposes in this implementation, but they >> are >> > charged toward the cgroup of the user processs (e.g., QEMU) >> launching >> > the >> > guest. And when the cgroup EPC usage reaches its limit, the virtual >> > EPC >> > driver will stop allocating more EPC for the VM, and return SIGBUS >> to >> > the >> > user process which would abort the VM launch. >> > >> > which IMO is a hack, unless returning SIGBUS is actually enforced >> > somehow. Relying >> > on userspace to be kind enough to kill its VMs kinda defeats the >> purpose >> > of cgroup >> > enforcement. E.g. if the hard limit for a EPC cgroup is lowered, >> > userspace running >> > encalves in a VM could continue on and refuse to give up its EPC, and >> > thus run above >> > its limit in perpetuity. >> > >> Cgroup would refuse to allocate more when limit is reached so VMs can >> not >> run above limit. >> >> IIRC VMs only support static EPC size right now, reaching limit at >> launch >> means the EPC size given in command line for QEMU is not appropriate. So >> VM should not launch, hence the current behavior. >> >> [all EPC pages in guest are allocated on page fault caused by the >> sensitization process in guest kernel during init, which is part of the >> VM >> Launch process. So SIGNBUS will turn into failed VM launch.] >> >> Once it is launched, guest kernel would have 'total capacity' given by >> the >> static value from QEMU option. And it would start paging when it is used >> up, never would ask for more from host. >> >> For future with dynamic EPC for running guests, QEMU could handle >> allocation failure and pass SIGBUS to the running guest kernel. Is that >> correct understanding? >> >> >> > I can see userspace wanting to explicitly terminate the VM instead of >> > "silently" >> > the VM's enclaves, but that seems like it should be a knob in the >> > virtual EPC >> > code. 
>> >> If my understanding above is correct and understanding your statement >> above correctly, then don't see we really need separate knob for vEPC >> code. Reaching a cgroup limit by a running guest (assuming dynamic >> allocation implemented) should not translate automatically killing the >> VM. >> Instead, it's user space job to work with guest to handle allocation >> failure. Guest could page and kill enclaves. >> > > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs: > > 1) misc.max = 100M > 2) Launch VMs with total virtual EPC size = 100M <- success > 3) misc.max = 50M > > 3) will also succeed, but nothing will happen, the VMs will be still > holding > 100M EPC. > > You need to somehow track virtual EPC and kill VM instead. > > (or somehow fail to do 3) if it is also an acceptable option.) > Thanks for explaining it. There is an error code to return from max_write. I can add that to the callback definition too, and fail the write when it can't be enforced for any reason. I would like some community feedback on whether this is acceptable though. I think to solve it ultimately, we need to be able to adjust the 'capacity' of VMs, not just kill them, which is basically the same as dynamic allocation support for VMs (being able to increase/decrease EPC size while the VM is running). For now, we only have static allocation, so the max can't be enforced once a VM is launched. Haitao
On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote: >> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote: >> > On Mon, Oct 09, 2023, Kai Huang wrote: >> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: >> > > > +/** >> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target >> LRU >> > > > + * @lru: LRU that is low >> > > > + * >> > > > + * Return: %true if a victim was found and kicked. >> > > > + */ >> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) >> > > > +{ >> > > > + struct sgx_epc_page *victim; >> > > > + >> > > > + spin_lock(&lru->lock); >> > > > + victim = sgx_oom_get_victim(lru); >> > > > + spin_unlock(&lru->lock); >> > > > + >> > > > + if (!victim) >> > > > + return false; >> > > > + >> > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) >> > > > + return sgx_oom_encl_page(victim->encl_page); >> > > > + >> > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) >> > > > + return sgx_oom_encl(victim->encl); >> > > >> > > I hate to bring this up, at least at this stage, but I am wondering >> why we need >> > > to put VA and SECS pages to the unreclaimable list, but cannot keep >> an >> > > "enclave_list" instead? >> > >> > The motivation for tracking EPC pages instead of enclaves was so that >> the EPC >> > OOM-killer could "kill" VMs as well as host-owned enclaves. > >> >> Ah this seems a fair argument. :-) >> >> > The virtual EPC code >> > didn't actually kill the VM process, it instead just freed all of the >> EPC pages >> > and abused the SGX architecture to effectively make the guest >> recreate all its >> > enclaves (IIRC, QEMU does the same thing to "support" live migration). >> >> It returns SIGBUS. SGX VM live migration also requires enough EPC >> being able to >> be allocated on the destination machine to work AFAICT. 
>> >> > >> > Looks like y'all punted on that with: >> > >> > The EPC pages allocated for KVM guests by the virtual EPC driver >> are not >> > reclaimable by the host kernel [5]. Therefore they are not tracked >> by any >> > LRU lists for reclaiming purposes in this implementation, but they >> are >> > charged toward the cgroup of the user processs (e.g., QEMU) >> launching the >> > guest. And when the cgroup EPC usage reaches its limit, the >> virtual EPC >> > driver will stop allocating more EPC for the VM, and return SIGBUS >> to the >> > user process which would abort the VM launch. >> > >> > which IMO is a hack, unless returning SIGBUS is actually enforced >> somehow. > >> >> "enforced" do you mean? >> >> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot >> allocate >> EPC page. And when this happens, KVM returns KVM_PFN_ERR_FAULT in >> hva_to_pfn(), >> which eventually results in KVM returning -EFAULT to userspace in >> vcpu_run(). >> And Qemu then kills the VM with some nonsense message: >> >> error: kvm run failed Bad address >> <dump guest registers nonsense> >> >> > Relying >> > on userspace to be kind enough to kill its VMs kinda defeats the >> purpose of cgroup >> > enforcement. E.g. if the hard limit for a EPC cgroup is lowered, >> userspace running >> > encalves in a VM could continue on and refuse to give up its EPC, and >> thus run above >> > its limit in perpetuity. >> >> > >> > I can see userspace wanting to explicitly terminate the VM instead of >> "silently" >> > the VM's enclaves, but that seems like it should be a knob in the >> virtual EPC >> > code. > > I guess I slightly misunderstood your words. > > You mean we want to kill VM when the limit is set to be lower than > virtual EPC > size. > > This patch adds SGX_ENCL_NO_MEMORY. I guess we can use it for virtual > EPC too? > That flag is set for enclaves, do you mean we set similar flag in vepc struct? 
> In the sgx_vepc_fault(), we check this flag at early time and return > SIGBUS if > it is set. > > But this also requires keeping virtual EPC pages in some list, and > handles them > in sgx_epc_oom() too. > > And for virtual EPC pages, I guess the "young" logic can be applied thus > probably it's better to keep the actual virtual EPC pages to a > (separate?) list > instead of keeping the virtual EPC instance. > > struct sgx_epc_lru { > struct list_head reclaimable; > struct sgx_encl *enclaves; > struct list_head vepc_pages; > } > > Or still tracking VA/SECS and virtual EPC pages in a single > unreclaimable list? > One LRU should be OK, as we only need the relative order in which they were loaded? If a VA page is in front of vEPC, we just kill the host-side enclave first before disrupting VMs in the same group. As the group is not in a good situation anyway, the kernel just picks something reasonable to force-kill. Also, after rereading the sentences "The virtual EPC code didn't actually kill the VM process, it instead just freed all of the EPC pages and abused the SGX architecture to effectively make the guest recreate all its enclaves...": maybe by "kill" VM, Sean means EREMOVE the vEPC pages in the unreclaimable LRU, which effectively makes enclaves in the guest receive an "EPC lost" error, and those enclaves are forced to be reloaded (all reasonable user space implementations should already handle that). Not sure about freeing *all* of the EPC pages though; we should just EREMOVE enough to bring down the usage, and disable new allocation in sgx_vepc_fault as Kai outlined above. It also means user space needs to inject/pass the SIGBUS to the guest (I'm not really familiar with this part, or maybe it's already there?). @Sean, is that what you mean? Sorry I've been slow on understanding this. If this is the case, some may still think it's too disruptive to the guest, because the guest did not have a chance to page out less active enclave pages.
But it's the user's limit to set, so it is justifiable as long as we document this behavior. Thanks to both of you for the great insights. Haitao
On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> wrote: > >> > > > >> > > Later the hosting process could be migrated/reassigned to another >> cgroup? >> > > What to do when the new cgroup is OOM? >> > > >> > >> > You addressed in the documentation, no? >> > >> > +Migration >> > +--------- >> > + >> > +Once an EPC page is charged to a cgroup (during allocation), it >> > +remains charged to the original cgroup until the page is released >> > +or reclaimed. Migrating a process to a different cgroup doesn't >> > +move the EPC charges that it incurred while in the previous cgroup >> > +to its new cgroup. >> >> Should we kill the enclave though because some VA pages may be in the >> new >> group? >> > > I guess acceptable? > > And any difference if you keep VA/SECS to unreclaimable list? Tracking VA/SECS allows any cgroup in which an enclave has an allocation to identify the enclave by following the back pointer and kill it as needed. > If you migrate one > enclave to another cgroup, the old EPC pages stay in the old cgroup > while the > new one is charged to the new group IIUC. > > I am not cgroup expert, but by searching some old thread it appears this > isn't a > supported model: > > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/ > IIUC it's a different problem here. If we don't track the allocated VAs in the new group, then the enclave that spans the two groups can't be killed by the new group. If so, some enclave could just hide in some small group and never get killed, while it keeps allocating in a different group? Thanks Haitao
On Tue, Oct 10, 2023, Haitao Huang wrote: > On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> wrote: > > > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote: > > > Hi Sean > > > > > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson > > > <seanjc@google.com> wrote: > > > > I can see userspace wanting to explicitly terminate the VM instead of > > > > "silently" > > > > the VM's enclaves, but that seems like it should be a knob in the > > > > virtual EPC > > > > code. > > > > > > If my understanding above is correct and understanding your statement > > > above correctly, then don't see we really need separate knob for vEPC > > > code. Reaching a cgroup limit by a running guest (assuming dynamic > > > allocation implemented) should not translate automatically killing > > > the VM. > > > Instead, it's user space job to work with guest to handle allocation > > > failure. Guest could page and kill enclaves. > > > > > > > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs: > > > > 1) misc.max = 100M > > 2) Launch VMs with total virtual EPC size = 100M <- success > > 3) misc.max = 50M > > > > 3) will also succeed, but nothing will happen, the VMs will be still > > holding 100M EPC. > > > > You need to somehow track virtual EPC and kill VM instead. > > > > (or somehow fail to do 3) if it is also an acceptable option.) > > > Thanks for explaining it. > > There is an error code to return from max_write. I can add that too to the > callback definition and fail it when it can't be enforced for any reason. > Would like some community feedback if this is acceptable though. That likely isn't acceptable. E.g. create a cgroup with both a host enclave and virtual EPC, set the hard limit to 100MiB. Virtual EPC consumes 50MiB, and the host enclave consumes 50MiB. Userspace lowers the limit to 49MiB. The cgroup code would reclaim all of the enclave's reclaimable EPC, and then kill the enclave because it's still over the limit. 
And then fail the max_write because the cgroup is *still* over the limit. So in addition to burning a lot of cycles, from userspace's perspective its enclave was killed for no reason, as the new limit wasn't actually set. > I think to solve it ultimately, we need be able to adjust 'capacity' of VMs > not to just kill them, which is basically the same as dynamic allocation > support for VMs (being able to increase/decrease epc size when it is > running). For now, we only have static allocation so max can't be enforced > once it is launched. No, reclaiming virtual EPC is not a requirement. VMM EPC oversubscription is insanely complex, and I highly doubt any users actually want to oversubscribe VMs. There are use cases for cgroups beyond oversubscribing/swapping, e.g. privileged userspace may set limits on a container to ensure the container doesn't *accidentally* consume more EPC than it was allotted, e.g. due to a configuration bug that created a VM with more EPC than it was supposed to have. My comments on virtual EPC vs. cgroups are much more about having sane, well-defined behavior, not about saying the kernel actually needs to support oversubscribing EPC for KVM guests.
On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote: > On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> wrote: > > > > > > > > > > > > > > Later the hosting process could migrated/reassigned to another > > > cgroup? > > > > > What to do when the new cgroup is OOM? > > > > > > > > > > > > > You addressed in the documentation, no? > > > > > > > > +Migration > > > > +--------- > > > > + > > > > +Once an EPC page is charged to a cgroup (during allocation), it > > > > +remains charged to the original cgroup until the page is released > > > > +or reclaimed. Migrating a process to a different cgroup doesn't > > > > +move the EPC charges that it incurred while in the previous cgroup > > > > +to its new cgroup. > > > > > > Should we kill the enclave though because some VA pages may be in the > > > new > > > group? > > > > > > > I guess acceptable? > > > > And any difference if you keep VA/SECS to unreclaimabe list? > > Tracking VA/SECS allows all cgroups, in which an enclave has allocation, > to identify the enclave following the back pointer and kill it as needed. > > > If you migrate one > > enclave to another cgroup, the old EPC pages stay in the old cgroup > > while the > > new one is charged to the new group IIUC. > > > > I am not cgroup expert, but by searching some old thread it appears this > > isn't a > > supported model: > > > > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/ > > > > IIUC it's a different problem here. If we don't track the allocated VAs in > the new group, then the enclave that spans the two groups can't be killed > by the new group. If so, some enclave could just hide in some small group > and never gets killed but keeps allocating in a different group? > I mean from the link above IIUC migrating enclave among different cgroups simply isn't a supported model, thus any bad behaviour isn't a big concern in terms of decision making.
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote: > On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@intel.com> wrote: > > > On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote: > > > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote: > > > > On Mon, Oct 09, 2023, Kai Huang wrote: > > > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote: > > > > > > +/** > > > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target > > > LRU > > > > > > + * @lru: LRU that is low > > > > > > + * > > > > > > + * Return: %true if a victim was found and kicked. > > > > > > + */ > > > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru) > > > > > > +{ > > > > > > + struct sgx_epc_page *victim; > > > > > > + > > > > > > + spin_lock(&lru->lock); > > > > > > + victim = sgx_oom_get_victim(lru); > > > > > > + spin_unlock(&lru->lock); > > > > > > + > > > > > > + if (!victim) > > > > > > + return false; > > > > > > + > > > > > > + if (victim->flags & SGX_EPC_OWNER_PAGE) > > > > > > + return sgx_oom_encl_page(victim->encl_page); > > > > > > + > > > > > > + if (victim->flags & SGX_EPC_OWNER_ENCL) > > > > > > + return sgx_oom_encl(victim->encl); > > > > > > > > > > I hate to bring this up, at least at this stage, but I am wondering > > > why we need > > > > > to put VA and SECS pages to the unreclaimable list, but cannot keep > > > an > > > > > "enclave_list" instead? > > > > > > > > The motivation for tracking EPC pages instead of enclaves was so that > > > the EPC > > > > OOM-killer could "kill" VMs as well as host-owned enclaves. > > > > > > > Ah this seems a fair argument. :-) > > > > > > > The virtual EPC code > > > > didn't actually kill the VM process, it instead just freed all of the > > > EPC pages > > > > and abused the SGX architecture to effectively make the guest > > > recreate all its > > > > enclaves (IIRC, QEMU does the same thing to "support" live migration). > > > > > > It returns SIGBUS. 
SGX VM live migration also requires enough EPC being able to be allocated on the destination machine to work AFAICT. > > > > Looks like y'all punted on that with: > > > > The EPC pages allocated for KVM guests by the virtual EPC driver are not reclaimable by the host kernel [5]. Therefore they are not tracked by any LRU lists for reclaiming purposes in this implementation, but they are charged toward the cgroup of the user process (e.g., QEMU) launching the guest. And when the cgroup EPC usage reaches its limit, the virtual EPC driver will stop allocating more EPC for the VM, and return SIGBUS to the user process which would abort the VM launch. > > > > which IMO is a hack, unless returning SIGBUS is actually enforced somehow. > > > What do you mean by "enforced"? > > > Currently sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate an EPC page. And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(), which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). And Qemu then kills the VM with some nonsense message: error: kvm run failed Bad address <dump guest registers nonsense> > > > > Relying on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup enforcement. E.g. if the hard limit for an EPC cgroup is lowered, userspace running enclaves in a VM could continue on and refuse to give up its EPC, and thus run above its limit in perpetuity. > > > > I can see userspace wanting to explicitly terminate the VM instead of "silently" killing the VM's enclaves, but that seems like it should be a knob in the virtual EPC code. I guess I slightly misunderstood your words.
> > > > You mean we want to kill the VM when the limit is set lower than the virtual EPC size. > > > > This patch adds SGX_ENCL_NO_MEMORY. I guess we can use it for virtual EPC too? > > > > That flag is set for enclaves, do you mean we set a similar flag in the vepc struct? > > > In sgx_vepc_fault(), we check this flag early and return SIGBUS if it is set. > > > > But this also requires keeping virtual EPC pages in some list, and handling them in sgx_epc_oom() too. > > > > And for virtual EPC pages, I guess the "young" logic can be applied, thus probably it's better to keep the actual virtual EPC pages in a (separate?) list instead of keeping the virtual EPC instance. > > > > struct sgx_epc_lru { > > struct list_head reclaimable; > > struct sgx_encl *enclaves; > > struct list_head vepc_pages; > > } > > > > Or still tracking VA/SECS and virtual EPC pages in a single unreclaimable list? > > > > One LRU should be OK as we only need the relative order in which they are loaded? It's one way to do it, but not the only way. The disadvantage of using one unreclaimable list is that for VA/SECS pages you don't have the concept of "young/age". The only purpose of getting the page is to get the owner enclave and kill it. On the other hand, for virtual EPC pages we do have the concept of "young/age", although I think it's acceptable if we don't implement this in the first version of EPC cgroup support. From this point of view, perhaps it's better to maintain VA/SECS pages (or enclaves) separately from virtual EPC pages. But it is not mandatory. Another pro of maintaining them separately is that you don't need the flags that distinguish them in the single list, SGX_EPC_OWNER_{ENCL|PAGE|VEPC}, which I dislike. (btw, even if you track VA/SECS pages in the unreclaimable list, given they both have 'enclave' as the owner, do you still need SGX_EPC_OWNER_ENCL and SGX_EPC_OWNER_PAGE?)
That being said, personally I'd slightly prefer to keep them in separate lists but I can see it also depends on how you want to implement the algorithm of selecting a victim. So I am not going to insist for now but will leave to you to decide. Just remember to give justification of choosing so. [...] > It also means user > space needs to inject/pass the SIGBUS to guest (I'm not really familiar > with this part, or maybe it's already there?). @sean is that what you > mean? Sorry I've been slow on understanding this. I don't think we need to do this for now. Killing the VM is an acceptable start to me. Just make sure we can kill some VM.
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote: > > > > This patch adds SGX_ENCL_NO_MEMORY. I guess we can use it for virtual > > EPC too? > > > > That flag is set for enclaves, do you mean we set similar flag in vepc > struct? Yes.
On Tue, 10 Oct 2023 19:01:25 -0500, Sean Christopherson <seanjc@google.com> wrote: > On Tue, Oct 10, 2023, Haitao Huang wrote: >> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> >> wrote: >> >> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote: >> > > Hi Sean >> > > >> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson >> > > <seanjc@google.com> wrote: >> > > > I can see userspace wanting to explicitly terminate the VM >> instead of >> > > > "silently" >> > > > the VM's enclaves, but that seems like it should be a knob in the >> > > > virtual EPC >> > > > code. >> > > >> > > If my understanding above is correct and understanding your >> statement >> > > above correctly, then don't see we really need separate knob for >> vEPC >> > > code. Reaching a cgroup limit by a running guest (assuming dynamic >> > > allocation implemented) should not translate automatically killing >> > > the VM. >> > > Instead, it's user space job to work with guest to handle allocation >> > > failure. Guest could page and kill enclaves. >> > > >> > >> > IIUC Sean was talking about changing misc.max _after_ you launch SGX >> VMs: >> > >> > 1) misc.max = 100M >> > 2) Launch VMs with total virtual EPC size = 100M <- success >> > 3) misc.max = 50M >> > >> > 3) will also succeed, but nothing will happen, the VMs will be still >> > holding 100M EPC. >> > >> > You need to somehow track virtual EPC and kill VM instead. >> > >> > (or somehow fail to do 3) if it is also an acceptable option.) >> > >> Thanks for explaining it. >> >> There is an error code to return from max_write. I can add that too to >> the >> callback definition and fail it when it can't be enforced for any >> reason. >> Would like some community feedback if this is acceptable though. > > That likely isn't acceptable. E.g. create a cgroup with both a host > enclave and > virtual EPC, set the hard limit to 100MiB. Virtual EPC consumes 50MiB, > and the > host enclave consumes 50MiB. 
Userspace lowers the limit to 49MiB. The cgroup code would reclaim all of the enclave's reclaimable EPC, and then kill the enclave because it's still over the limit. And then fail the max_write because the cgroup is *still* over the limit. So in addition to burning a lot of cycles, from userspace's perspective its enclave was killed for no reason, as the new limit wasn't actually set. I was thinking that before reclaiming enclave pages, if we know the untracked vepc pages (current minus tracked) exceed the limit, we just return an error without enforcing the limit. That way the user also knows something is wrong. But I get that we want to be able to kill VMs to enforce the new, lower limit. I assume we can't optimize the efficiency/priority of killing: enclave pages will be reclaimed first no matter what, just because they are reclaimable; there are no good rules to choose a victim between enclaves and VMs in your example, so enclaves could still be killed before VMs. >> I think to solve it ultimately, we need to be able to adjust the 'capacity' of VMs, not just kill them, which is basically the same as dynamic allocation support for VMs (being able to increase/decrease EPC size while running). For now, we only have static allocation so the max can't be enforced once a VM is launched. > No, reclaiming virtual EPC is not a requirement. VMM EPC oversubscription is insanely complex, and I highly doubt any users actually want to oversubscribe VMs. > There are use cases for cgroups beyond oversubscribing/swapping, e.g. privileged userspace may set limits on a container to ensure the container doesn't *accidentally* consume more EPC than it was allotted, e.g. due to a configuration bug that created a VM with more EPC than it was supposed to have. > My comments on virtual EPC vs. cgroups are much more about having sane, well-defined behavior, not about saying the kernel actually needs to support oversubscribing EPC for KVM guests.
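For concreteness, the limit-lowering sequence discussed above can be written out against the misc controller interface. The resource name `sgx_epc` comes from this series; the paths and byte values are illustrative, and what happens at the last step is exactly the open question in this thread:

```shell
# Sketch only: assumes cgroup v2 mounted at /sys/fs/cgroup and this
# patch series applied.  Values are bytes (100 MiB, then 50 MiB).
cd /sys/fs/cgroup
mkdir sgx_vms
echo "+misc" > cgroup.subtree_control

# 1) misc.max = 100M
echo "sgx_epc 104857600" > sgx_vms/misc.max

# 2) Launch VMs whose total virtual EPC is 100M (succeeds), then
#    inspect usage:
cat sgx_vms/misc.current

# 3) misc.max = 50M: the write itself succeeds, but the running VMs
#    still hold 100M of EPC -- without killing or reclaiming from the
#    VMs, nothing brings the group below the new limit.
echo "sgx_epc 52428800" > sgx_vms/misc.max
```

Step 3 is the scenario where, as discussed, failing the write is problematic because reclaimable enclave EPC may already have been sacrificed by the time the failure is reported.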
So here are the steps I see now; let me know if this is aligned with your thinking or I'm off track. 0) When all that is left in a cgroup are enclave VA/SECS pages and vEPC pages, the OOM kill process starts. 1) The cgroup identifies a victim vepc for OOM kill (this could be before or after enclaves are killed), and sets a new flag, VEPC_NO_MEMORY, in the vepc. 2) Call sgx_vepc_remove_all(), ignoring the failure counts returned. This performs a best-effort EREMOVE of all normal pages used by the vepc. 3) The guest may trigger faults again; return SIGBUS in sgx_vepc_fault() when VEPC_NO_MEMORY is set. Do the same for mmap. 4) That would eventually cause sgx_vepc_release() before the VM dies or is killed by user space, which does the final cleanup. Q: should we reset the VEPC_NO_MEMORY flag if #4 never happens and cgroup usage falls below the limit? I suppose we can do it, but I'm not sure it is sane because the VM would try to load as many pages as originally configured. I'm still thinking about using one unreclaimable list; the justification is simplicity and the lack of rules for selecting items from separate lists, but I may change my mind if I find it inconvenient. I'm not sure how age/youngness can be accounted for guest pages, though Kai indicated we don't need that for the first version. So I assume we can deal with it later and add a separate list for vEPC if needed for that reason. Thanks a lot for your input. Haitao
On Tue, 10 Oct 2023 19:31:19 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote: >> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> >> wrote: >> >> > >> > > > > > >> > > > > Later the hosting process could migrated/reassigned to another >> > > cgroup? >> > > > > What to do when the new cgroup is OOM? >> > > > > >> > > > >> > > > You addressed in the documentation, no? >> > > > >> > > > +Migration >> > > > +--------- >> > > > + >> > > > +Once an EPC page is charged to a cgroup (during allocation), it >> > > > +remains charged to the original cgroup until the page is released >> > > > +or reclaimed. Migrating a process to a different cgroup doesn't >> > > > +move the EPC charges that it incurred while in the previous >> cgroup >> > > > +to its new cgroup. >> > > >> > > Should we kill the enclave though because some VA pages may be in >> the >> > > new >> > > group? >> > > >> > >> > I guess acceptable? >> > >> > And any difference if you keep VA/SECS to unreclaimabe list? >> >> Tracking VA/SECS allows all cgroups, in which an enclave has allocation, >> to identify the enclave following the back pointer and kill it as >> needed. >> >> > If you migrate one >> > enclave to another cgroup, the old EPC pages stay in the old cgroup >> > while the >> > new one is charged to the new group IIUC. >> > >> > I am not cgroup expert, but by searching some old thread it appears >> this >> > isn't a >> > supported model: >> > >> > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/ >> > >> >> IIUC it's a different problem here. If we don't track the allocated VAs >> in >> the new group, then the enclave that spans the two groups can't be >> killed >> by the new group. If so, some enclave could just hide in some small >> group >> and never gets killed but keeps allocating in a different group? 
>> > > I mean from the link above IIUC migrating an enclave among different cgroups simply isn't a supported model, thus any bad behaviour isn't a big concern in terms of decision making. If we leave some pages in a cgroup unkillable, we are in the same situation of not being able to enforce a cgroup limit as we would be if we don't kill VMs when limits are lowered. I think not supporting migration of pages between cgroups should not leave an enforcement gap, just like we don't want an enforcement gap from letting VMs hold on to pages indefinitely once they are launched. Haitao
On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> wrote: [...] > (btw, even if you track VA/SECS pages in the unreclaimable list, given they both have > 'enclave' as the owner, do you still need SGX_EPC_OWNER_ENCL and > SGX_EPC_OWNER_PAGE ?) Let me think about it; there might also be a way to just track encl objects, not unreclaimable pages. I still don't get why we need to kill the VM rather than just removing enough pages. Is it because the static allocation can't be reclaimed? If we always remove all vEPC pages/kill the VM, then we should not need to track individual vepc pages. Thanks Haitao
On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote: > On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> wrote: > [...] > > (btw, even if you track VA/SECS pages in the unreclaimable list, given they both have > > 'enclave' as the owner, do you still need SGX_EPC_OWNER_ENCL and > > SGX_EPC_OWNER_PAGE ?) > > Let me think about it; there might also be a way to just track encl objects, not unreclaimable pages. > > I still don't get why we need to kill the VM rather than just removing enough pages. > Is it because the static allocation can't be reclaimed? We can choose to "just remove enough EPC pages". The VM may or may not be killed when it wants the EPC pages back, depending on whether the current EPC cgroup can provide enough EPC pages or not. And this depends on how we implement the cgroup algorithm to reclaim EPC pages. One problem could be: for an EPC cgroup that only has SGX VMs, you may end up moving EPC pages from one VM to another and then vice versa endlessly, because you never actually mark any VM as dead, the way OOM does to normal enclaves. From this point of view, you still need a way to kill a VM, IIUC. I think the key point of virtual EPC vs cgroup, as quoted from Sean, should be "having sane, well-defined behavior". Does "just remove enough EPC pages" meet this? If the problem mentioned above can be avoided, I suppose so? So if there's an easy way to achieve it, I guess it can be an option too. But for the initial support, IMO we are not looking for a perfect yet complicated solution. I would say, from a review point of view, it's preferred to have a simple implementation that achieves a not-perfect, but consistent, well-defined behaviour. So to me it looks like killing the VM when the cgroup cannot reclaim any more EPC pages is a simple option. But I might have missed something, especially since the middle of last week I have been having a fever and headache :-) So as mentioned above, you can try other alternatives, but please avoid complicated ones.
Also, I guess it will be helpful if we can understand the typical SGX app and/or SGX VM deployment in the EPC cgroup use case. This may help us justify why the EPC cgroup algorithm for selecting a victim is reasonable.
On Wed, 2023-10-11 at 01:14 +0000, Huang, Kai wrote: > On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote: > > > > > > This patch adds SGX_ENCL_NO_MEMORY. I guess we can use it for virtual > > > EPC too? > > > > > > > That flag is set for enclaves, do you mean we set similar flag in vepc > > struct? > > Yes. I missed the "ENCL" part but only noted the "NO_MEMORY" part, so I guess it should not be used directly for vEPC. So if it is needed, SGX_VEPC_NO_MEMORY, or a simple 'bool dead' or similar in 'struct sgx_vepc' is more suitable. As I said I was fighting with fever and headache last week :-)
On Mon, 16 Oct 2023 05:57:36 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote: >> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> >> wrote: >> [...] >> > (btw, even you track VA/SECS pages in unreclaimable list, given they >> > both have >> > 'enclave' as the owner, do you still need SGX_EPC_OWNER_ENCL and >> > SGX_EPC_OWNER_PAGE ?) >> >> Let me think about it, there might be also a way just track encl objects >> not unreclaimable pages. >> >> I still not get why we need kill the VM not just remove just enough >> pages. >> Is it due to the static allocation not able to reclaim? > > We can choose to "just remove enough EPC pages". The VM may or may not > be > killed when it wants the EPC pages back, depending on whether the > current EPC > cgroup can provide enough EPC pages or not. And this depends on how we > implement the cgroup algorithm to reclaim EPC pages. > > One problem could be: for a EPC cgroup only has SGX VMs, you may end up > with > moving EPC pages from one VM to another and then vice versa endlessly, This could be a valid use case though if people intend to share EPCs between two VMs. Understand no one would be able to use VMs this way currently with the static allocation. > because > you never really actually mark any VM to be dead just like OOM does to > the > normal enclaves. > > From this point, you still need a way to kill a VM, IIUC. > > I think the key point of virtual EPC vs cgroup, as quoted from Sean, > should be > "having sane, well-defined behavior". > > Does "just remove enough EPC pages" meet this? If the problem mentioned > above > can be avoided, I suppose so? So if there's an easy way to achieve, I > guess it > can be an option too. > > But for the initial support, IMO we are not looking for a perfect but yet > complicated solution. 
I would say, from a review point of view, it's > preferred to have a simple implementation that achieves a not-perfect, but consistent, well-defined behaviour. > > So to me it looks like killing the VM when the cgroup cannot reclaim any more EPC pages is a simple option. > > But I might have missed something, especially since the middle of last week I have been having a fever and headache :-) > > So as mentioned above, you can try other alternatives, but please avoid complicated ones. > > Also, I guess it will be helpful if we can understand the typical SGX app and/or SGX VM deployment in the EPC cgroup use case. This may help us justify why the EPC cgroup algorithm for selecting a victim is reasonable. > From this perspective, I think the current implementation is "well-defined": EPC cgroup limits for VMs are only enforced at VM launch time, not runtime. In practice, an SGX VM can be launched only with a fixed EPC size, and all of that EPC is fully committed to the VM once launched. Because of that, I imagine people are using VMs primarily to partition the physical EPC, i.e., the static size itself is the 'limit' for the workload of a single VM, with no expectation of EPC being taken away at runtime. So killing does not really add much value for the existing usages IIUC. That said, I don't anticipate that adding enforcement by killing VMs at runtime would break such usages, as the admin/user can simply choose to set the limit equal to the static size to launch the VM and forget about it. Given that, I'll propose an add-on patch to this series as an RFC and get some feedback from the community before we decide whether it needs to be included in the first version or we can skip it until we have EPC reclaiming for VMs. Thanks Haitao
> > > From this perspective, I think the current implementation is > "well-defined": EPC cgroup limits for VMs are only enforced at VM launch time, not runtime. In practice, an SGX VM can be launched only with a fixed EPC size, and all of that EPC is fully committed to the VM once launched. > Because of that, I imagine people are using VMs primarily to partition the physical EPC, i.e., the static size itself is the 'limit' for the workload of a single VM, with no expectation of EPC being taken away at runtime. > > So killing does not really add much value for the existing usages IIUC. It's not about adding value to the existing usages, it's about fixing the issue when we lower the EPC limit to a point that is less than the total virtual EPC size. It's a design issue, or simply a bug in the current implementation, that we need to fix. > > That said, I don't anticipate adding the enforcement of killing VMs at runtime would break such usages, as the admin/user can simply choose to set the limit equal to the static size to launch the VM and forget about it. > > Given that, I'll propose an add-on patch to this series as an RFC and get some feedback from the community before we decide whether it needs to be included in the first version or we can skip it until we have EPC reclaiming for VMs. I don't understand what "add-on" patch you are talking about. I mentioned the "typical deployment" thing because it can help us understand which algorithm is the better way to select the victim. But no matter what we choose, we still need to fix the bug mentioned above. I really think you should just go this simple way: When you want to take EPC back from a VM, kill the VM. Another bad thing about "just removing EPC pages from the VM" is that the enclaves in the VM would suffer a "sudden loss of EPC", or even worse, suffer it at high frequency. Although we depend on that for supporting SGX VM live migration, it needs to be avoided if possible.
On Mon, Oct 16, 2023, Haitao Huang wrote: > From this perspective, I think the current implementation is "well-defined": > EPC cgroup limits for VMs are only enforced at VM launch time, not runtime. > In practice, SGX VM can be launched only with fixed EPC size and all those > EPCs are fully committed to the VM once launched. Fully committed doesn't mean those numbers are reflected in the cgroup. A VM scheduler can easily "commit" EPC to a guest, but allocate EPC on demand, i.e. when the guest attempts to actually access a page. Preallocating memory isn't free, e.g. it can slow down guest boot, so it's entirely reasonable to have virtual EPC be allocated on-demand. Enforcing at launch time doesn't work for such setups, because from the cgroup's perspective, the VM is using 0 pages of EPC at launch. > Because of that, I imagine people are using VMs to primarily partition the > physical EPCs, i.e, the static size itself is the 'limit' for the workload of > a single VM and not expecting EPCs taken away at runtime. If everything goes exactly as planned, sure. But it's not hard to imagine some configuration change way up the stack resulting in the hard limit for an EPC cgroup being lowered. > So killing does not really add much value for the existing usages IIUC. As I said earlier, the behavior doesn't have to result in terminating a VM, e.g. the virtual EPC code could provide a knob to send a signal/notification if the owning cgroup has gone above the limit and the VM is targeted for forced reclaim. > That said, I don't anticipate adding the enforcement of killing VMs at > runtime would break such usages as admin/user can simply choose to set the > limit equal to the static size to launch the VM and forget about it. > > Given that, I'll propose an add-on patch to this series as RFC and have some > feedback from community before we decide if that needs be included in first > version or we can skip it until we have EPC reclaiming for VMs. 
Gracefully *swapping* virtual EPC isn't required for oversubscribing virtual EPC. Think of it like airlines overselling tickets. The airline sells more tickets than they have seats, and banks on some passengers canceling. If too many people show up, the airline doesn't swap passengers to the cargo bay, they just shunt them to a different plane. The same could easily be done for hosts and virtual EPC. E.g. if every VM *might* use 1GiB, but in practice 99% of VMs only consume 128MiB, then it's not too crazy to advertise 1GiB to each VM, but only actually carve out 256MiB per VM in order to pack more VMs on a host. If the host needs to free up EPC, then the most problematic VMs can be migrated to a different host. Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs? AFAIK, these days, SGX is primarily targeted at cloud. I assume virtual EPC is the primary use case for an EPC cgroup. I don't have any skin in the game beyond my name being attached to some of the patches, i.e. I certainly won't stand in the way. I just don't understand why you would go through all the effort of adding an EPC cgroup and then not go the extra few steps to enforce limits for virtual EPC. Compared to the complexity of the rest of the series, that little bit seems quite trivial.
Hi Sean On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson <seanjc@google.com> wrote: > On Mon, Oct 16, 2023, Haitao Huang wrote: >> From this perspective, I think the current implementation is >> "well-defined": EPC cgroup limits for VMs are only enforced at VM launch time, not runtime. >> In practice, SGX VM can be launched only with fixed EPC size and all those >> EPCs are fully committed to the VM once launched. > Fully committed doesn't mean those numbers are reflected in the cgroup. A VM scheduler can easily "commit" EPC to a guest, but allocate EPC on demand, i.e. when the guest attempts to actually access a page. Preallocating memory isn't free, e.g. it can slow down guest boot, so it's entirely reasonable to have virtual EPC be allocated on-demand. Enforcing at launch time doesn't work for such setups, because from the cgroup's perspective, the VM is using 0 pages of EPC at launch. Maybe I understood the current implementation wrong. From what I see, it is impossible for vEPC not to be fully committed at launch time. The guest would EREMOVE all pages during initialization, resulting in #PFs and all pages being allocated. This essentially makes "prealloc=off" the same as "prealloc=on". Unless you are talking about some custom OS or kernel other than upstream Linux here? Thanks Haitao
On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote: [...] > still need to fix the bug mentioned above here. > > I really think you should just go this simple way: > > When you want to take EPC back from a VM, kill the VM. My only concern is that this is a compromise due to a current limitation (there is no other sane way to take EPC from VMs). If we define this behavior and it becomes a contract with user space, then we can't change it in the future. On the other hand, my understanding is that the reason you want this behavior is to enforce the EPC limit at runtime. I'm just not sure how important that is, and whether it is a real usage, given all the limitations of SGX VMs we have (static EPC size, no migration). Thanks Haitao
On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote: > On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote: > [...] > > still need to fix the bug mentioned above here. > > > > I really think you should just go this simple way: > > > > When you want to take EPC back from a VM, kill the VM. > My only concern is that this is a compromise due to a current limitation (no other sane way to take EPC from VMs). If we define this behavior and it becomes a contract with user space, then we can't change it in the future. Why do we need to "define such behaviour"? This isn't some kind of kernel/userspace ABI IMHO, but only kernel internal implementation. Here a VM is similar to normal host enclaves. You limit the resource, and some host enclaves could be killed. Similarly, a VM could be killed too. And supporting VMM EPC oversubscription doesn't mean a VM won't be killed. The VM can still be a target to kill after all of the VM's EPC pages have been swapped out. > On the other hand, my understanding is that the reason you want this behavior is to enforce the EPC limit at runtime. No, I just thought this is a bug/issue that needs to be fixed. If anyone believes this is not a bug/issue then it's a separate discussion.
On Mon, Oct 16, 2023 at 02:32:31PM -0700, Sean Christopherson wrote: > Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs? People who work with containers: [1], [2]. > AFAIK, these days, SGX is primarily targeted at cloud. I assume virtual EPC is > the primary use case for an EPC cgroup. The common setup is that a cloud VM instance with vEPC is created, and then several SGX enclave containers are run simultaneously on that instance. EPC cgroups are used to ensure that each container gets its own share of EPC (and any attempt to go beyond the limit is reclaimed and charged to the container's memcg). The same containers-with-enclaves use case is applicable to bare metal too, though. As far as Kubernetes-orchestrated containers are concerned, "in-place" resource scaling is still in very early stages, which means that the cgroup values are adjusted by *re-creating* the container. The hierarchies are also built such that there's no mix of VMs w/ vEPC and enclaves in the same tree. Mikko [1] https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kristen@linux.intel.com/T/#m6d1c895534b4c0636f47c2d1620016b4c362bb9b [2] https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kristen@linux.intel.com/T/#m37600e457b832feee6e8346aa74dcff8f21965f8
On Mon, 16 Oct 2023 20:34:57 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote: >> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote: >> [...] >> > still need to fix the bug mentioned above here. >> > >> > I really think you should just go this simple way: >> > >> > When you want to take EPC back from a VM, kill the VM. >> My only concern is that this is a compromise due to a current limitation (no other sane way to take EPC from VMs). If we define this behavior and it becomes a contract with user space, then we can't change it in the future. > Why do we need to "define such behaviour"? > This isn't some kind of kernel/userspace ABI IMHO, but only kernel internal implementation. Here a VM is similar to normal host enclaves. You limit the resource, and some host enclaves could be killed. Similarly, a VM could be killed too. > And supporting VMM EPC oversubscription doesn't mean a VM won't be killed. The VM can still be a target to kill after all of the VM's EPC pages have been swapped out. >> On the other hand, my understanding is that the reason you want this behavior is to enforce the EPC limit at runtime. > No, I just thought this is a bug/issue that needs to be fixed. If anyone believes this is not a bug/issue then it's a separate discussion. AFAIK, before we introduced the max_write() callback in this series, no misc controller would enforce the limit when misc.max is reduced. E.g., I don't think CVMs would be killed when the ASID limit is reduced while the cgroup was full before the limit was reduced. I think EPC pages given to VMs could have the same behavior: once they are given to a guest, they are never taken back by the host. For enclaves on the host side, pages are reclaimable, which allows us to enforce limits in a way similar to memcg. Thanks Haitao
On Mon, Oct 16, 2023, Haitao Huang wrote: > Hi Sean > > On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson <seanjc@google.com> > wrote: > > > On Mon, Oct 16, 2023, Haitao Huang wrote: > > > From this perspective, I think the current implementation is > > > "well-defined": > > > EPC cgroup limits for VMs are only enforced at VM launch time, not > > > runtime. In practice, SGX VM can be launched only with fixed EPC size > > > and all those EPCs are fully committed to the VM once launched. > > > > Fully committed doesn't mean those numbers are reflected in the cgroup. A > > VM scheduler can easily "commit" EPC to a guest, but allocate EPC on > > demand, i.e. when the guest attempts to actually access a page. > > Preallocating memory isn't free, e.g. it can slow down guest boot, so it's > > entirely reasonable to have virtual EPC be allocated on-demand. Enforcing > > at launch time doesn't work for such setups, because from the cgroup's > > perspective, the VM is using 0 pages of EPC at launch. > > > Maybe I understood the current implementation wrong. From what I see, vEPC > is impossible not fully commit at launch time. The guest would EREMOVE all > pages during initialization resulting #PF and all pages allocated. This > essentially makes "prealloc=off" the same as "prealloc=on". > Unless you are talking about some custom OS or kernel other than upstream > Linux here? Yes, a customer could be running an older kernel, something other than Linux, a custom kernel, an out-of-tree SGX driver, etc. The host should never assume anything about the guest kernel when it comes to correctness (unless the guest kernel is controlled by the host). Doing EREMOVE on all pages is definitely not mandatory, especially if the kernel detects a hypervisor, i.e. knows its running as a guest.
Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> AFAIK, before we introduced the max_write() callback in this series, no
> misc controller would enforce the limit when misc.max is reduced. For
> example, I don't think CVMs would be killed when the ASID limit is
> reduced and the cgroup was full before the limit was reduced.

Yes, the misc controller was meant to be simple; current >= max serves
to prevent new allocations.

FTR, at some point in time memory.max was considered for reclaim control
of regular pages, but it turned out to be too coarse (and OOM-killing
processes if the amount was not sensed correctly), and this eventually
evolved into the specific mechanism of memory.reclaim. I'm mentioning
this in case that would be an interface with better semantics for your
use case (and misc.max writes can then remain non-preemptive).

One more note -- I was quite confused when I read in the rest of the
series about OOM and _kill_ing but then found no such measure in the
code implementation. So I would suggest two terminological changes:

- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better suiting name
  than OOM (out of memory) condition,
- killing -- unless you have an intention to implement process
  termination later, my current interpretation is that it is rather some
  aggressive unmapping within the address space, so a less confusing
  name for that would be "reclaim".

> I think EPC pages given to VMs could have the same behavior: once they
> are given to a guest, they are never taken back by the host. For
> enclaves on the host side, pages are reclaimable, which allows us to
> enforce the limit in a way similar to memcg.

Is this distinction between preemptability of EPC pages mandated by the
HW implementation? (host/"process" enclaves vs VM enclaves) Or do users
have an option to lock certain pages in memory that yields this
difference?

Regards,
Michal
On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do users
> have an option to lock certain pages in memory that yields this
> difference?

(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types warrant also two types of misc resource? (To
deal with each in its own way.)

Thanks,
Michal
Hi Michal,

On Tue, 17 Oct 2023 13:54:46 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> Hello Haitao.
>
> On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang
> <haitao.huang@linux.intel.com> wrote:
>> AFAIK, before we introduced the max_write() callback in this series,
>> no misc controller would enforce the limit when misc.max is reduced.
>> For example, I don't think CVMs would be killed when the ASID limit is
>> reduced and the cgroup was full before the limit was reduced.
>
> Yes, the misc controller was meant to be simple; current >= max serves
> to prevent new allocations.
>

Thanks for confirming. Another alternative is to just keep max_write
non-preemptive and not add the max_write() callback at all. The EPC
controller would then only trigger reclaiming on new allocations, or
return ENOMEM if there is nothing left to reclaim. Reclaiming here
includes both normal EPC page reclaiming and killing enclaves in
out-of-EPC cases. vEPC pages assigned to guests are basically carved out
and never reclaimable by the host.

Since we would no longer enforce the limit when max is written with a
lower value, users should not expect the cgroup to forcibly reclaim
pages from an enclave or kill VMs/enclaves as a result of reducing the
limit 'in place'. Users should always create cgroups, set limits, and
then launch enclaves/VMs into the groups created.

> FTR, at some point in time memory.max was considered for reclaim control
> of regular pages, but it turned out to be too coarse (and OOM-killing
> processes if the amount was not sensed correctly), and this eventually
> evolved into the specific mechanism of memory.reclaim. I'm mentioning
> this in case that would be an interface with better semantics for your
> use case (and misc.max writes can then remain non-preemptive).
>

Yes, we can introduce misc.reclaim to give users a knob to forcefully
reduce usage if that is really needed in practice. That semantics would
make force-killing VMs explicit to the user.
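[The non-preemptive scheme described above can be modeled in a few lines of
userspace C. This is a hypothetical, simplified sketch -- the type and
function names are invented for illustration, not the series' actual code --
showing the two properties discussed: reclaim happens only at allocation
time, and lowering max below current usage by itself reclaims and kills
nothing.]

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of a non-preemptive EPC cgroup. "reclaimable"
 * stands in for host-enclave pages that could be swapped out; vEPC
 * pages given to guests would simply never be counted there.
 */
struct epc_cgroup {
	long current_pages;	/* pages charged to this group */
	long max_pages;		/* misc.max equivalent */
	long reclaimable;	/* host enclave pages eligible for reclaim */
};

/* Charge one page: reclaim on allocation, else fail with ENOMEM. */
static int epc_try_charge(struct epc_cgroup *cg)
{
	if (cg->current_pages < cg->max_pages) {
		cg->current_pages++;
		return 0;
	}
	if (cg->reclaimable > 0) {
		/*
		 * A victim page is swapped out and uncharged; the new
		 * page takes over the freed charge, so usage stays at
		 * the limit.
		 */
		cg->reclaimable--;
		return 0;
	}
	return -ENOMEM;	/* nothing reclaimable, e.g. all pages are vEPC */
}

/* Non-preemptive: lowering max never reclaims or kills by itself. */
static void epc_set_max(struct epc_cgroup *cg, long max)
{
	cg->max_pages = max;
}
```

A usage sequence under this model: charges succeed up to the limit, a
further charge succeeds only by reclaiming a host-enclave page, and
writing a lower max leaves usage above the limit while new allocations
fail.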
> One more note -- I was quite confused when I read in the rest of the
> series about OOM and _kill_ing but then found no such measure in the
> code implementation. So I would suggest two terminological changes:
>
> - the basic premise of the series (00/18) is that EPC pages are a
>   different resource than memory, hence choose a better suiting name
>   than OOM (out of memory) condition,

I couldn't come up with a good name. Out of EPC (OOEPC), maybe? I feel
OOEPC would be hard to read in code though. OOM was relatable as the
condition is similar to normal OOM, just for a special kind of memory
:-) I'm open to any better suggestions.

> - killing -- unless you have an intention to implement process
>   termination later, my current interpretation is that it is rather
>   some aggressive unmapping within the address space, so a less
>   confusing name for that would be "reclaim".
>

Yes. Killing here refers to killing an enclave, analogous to killing a
process, so it is not just 'reclaim' though. I can change to always use
'killing an enclave' explicitly.

>
>> I think EPC pages given to VMs could have the same behavior: once
>> they are given to a guest, they are never taken back by the host. For
>> enclaves on the host side, pages are reclaimable, which allows us to
>> enforce the limit in a way similar to memcg.
>
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do users
> have an option to lock certain pages in memory that yields this
> difference?
>

The difference is really a result of the current vEPC implementation.
Because enclave pages in use contain confidential content, they need a
special process to reclaim. So it is complex to implement the host
gracefully reclaiming guest EPC pages.

Thanks
Haitao
On Tue, 17 Oct 2023 14:13:22 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný
> <mkoutny@suse.com> wrote:
>> Is this distinction between preemptability of EPC pages mandated by
>> the HW implementation? (host/"process" enclaves vs VM enclaves) Or do
>> users have an option to lock certain pages in memory that yields this
>> difference?
>
> (After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")
>
> Or would these two types warrant also two types of misc resource? (To
> deal with each in its own way.)

They come from the same bucket of HW resource, so I think it's more
suitable to keep them as one resource type. Otherwise we would need a
policy for dividing the capacity between the two, etc. And it is still
possible that vEPC becomes reclaimable in the future.

My current thinking is that we can probably get away with a
non-preemptive max_write for enclaves too. See my other reply.

Thanks
Haitao
On 10/17/23 21:37, Haitao Huang wrote:
> Yes, we can introduce misc.reclaim to give users a knob to forcefully
> reduce usage if that is really needed in practice. That semantics would
> make force-killing VMs explicit to the user.

Do any other controllers do something like this?  It seems odd.
On Wed, 18 Oct 2023 08:55:12 -0500, Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/17/23 21:37, Haitao Huang wrote:
>> Yes, we can introduce misc.reclaim to give users a knob to forcefully
>> reduce usage if that is really needed in practice. That semantics
>> would make force-killing VMs explicit to the user.
>
> Do any other controllers do something like this?  It seems odd.

Maybe not in the sense of killing something. My understanding is that
memory.reclaim does not necessarily invoke the OOM killer. What I really
intend to say is that we can have a separate knob for users to express
the need to reduce the current usage explicitly, and keep the
non-preemptive semantics of 'misc.max' intact. When we implement that
new knob, we can then define what kind of reclaim it performs. Depending
on the vEPC implementation, it may or may not involve killing VMs, but
at least the semantics will be explicit for the user.

Thanks
Haitao
On 10/18/23 08:26, Haitao Huang wrote:
> Maybe not in the sense of killing something. My understanding is that
> memory.reclaim does not necessarily invoke the OOM killer. What I
> really intend to say is that we can have a separate knob for users to
> express the need to reduce the current usage explicitly, and keep the
> non-preemptive semantics of 'misc.max' intact. When we implement that
> new knob, we can then define what kind of reclaim it performs.
> Depending on the vEPC implementation, it may or may not involve killing
> VMs, but at least the semantics will be explicit for the user.

I'm really worried that you're going for "perfect" semantics here. This
is SGX. It's *NOT* getting heavy use, and even fewer folks will ever
apply cgroup controls to it.

Can we please stick to simple, easily-coded rules here? I honestly don't
think these corner cases matter all that much, and there's been *WAY*
too much traffic in this thread for what is ultimately not that
complicated.

Focus on *ONE* thing:

 1. Admin sets a limit
 2. Enclave is created
 3. Enclave hits limit, allocation fails

Nothing else matters.

What if the admin lowers the limit on an already-created enclave? Nobody
cares. Seriously. What about inducing reclaim? Nobody cares. What about
vEPC? Doesn't matter; an enclave page is an enclave page.
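[The three rules above reduce to a hard cap checked only at allocation
time. A minimal userspace sketch, with hypothetical names (not the series'
actual code), of exactly that and nothing more -- no reclaim, no killing:]

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical hard-cap model: limit is consulted only on allocation. */
struct epc_limit {
	long used;	/* EPC pages charged so far */
	long max;	/* rule 1: the limit the admin set */
};

static int epc_alloc_page(struct epc_limit *lim)
{
	/* Rule 3: at the limit, the allocation simply fails. */
	if (lim->used >= lim->max)
		return -ENOMEM;
	lim->used++;	/* rule 2: the enclave grows page by page */
	return 0;
}
```

Under this model, lowering max below current usage changes nothing for
existing pages; only subsequent allocations observe the new limit.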
On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen <dave.hansen@intel.com> wrote:
>  1. Admin sets a limit
>  2. Enclave is created
>  3. Enclave hits limit, allocation fails

I was actually about to suggest reorganizing the series into a part
implementing this simple limiting and a subsequent part with the reclaim
stuff, for easier digestibility.

> Nothing else matters.

If the latter part is an unnecessary overkill, it's even better.

Thanks,
Michal
On Wed, 18 Oct 2023 10:52:23 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen
> <dave.hansen@intel.com> wrote:
>>  1. Admin sets a limit
>>  2. Enclave is created
>>  3. Enclave hits limit, allocation fails
>
> I was actually about to suggest reorganizing the series into a part
> implementing this simple limiting and a subsequent part with the
> reclaim stuff, for easier digestibility.
>
>> Nothing else matters.
>
> If the latter part is an unnecessary overkill, it's even better.
>

OK. I'll take out the max_write() callback and only implement a
non-preemptive misc.max for EPC. I can also separate out the out-of-EPC
enclave killing, which is not needed if we only block allocation at the
limit; there is no need to kill one enclave to make space for another.
This will simplify things a lot.

Thanks to all for your input!

Haitao
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 262f5fb18d74..ff42d649c7b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -44,7 +44,6 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_mm *encl_mm;
 
 	/*
 	 * Drain the remaining mm_list entries. At this point the list contains
@@ -52,31 +51,7 @@ static int sgx_release(struct inode *inode, struct file *file)
 	 * not exited yet. The processes, which have exited, are gone from the
 	 * list by sgx_mmu_notifier_release().
 	 */
-	for ( ; ; ) {
-		spin_lock(&encl->mm_lock);
-
-		if (list_empty(&encl->mm_list)) {
-			encl_mm = NULL;
-		} else {
-			encl_mm = list_first_entry(&encl->mm_list,
-						   struct sgx_encl_mm, list);
-			list_del_rcu(&encl_mm->list);
-		}
-
-		spin_unlock(&encl->mm_lock);
-
-		/* The enclave is no longer mapped by any mm. */
-		if (!encl_mm)
-			break;
-
-		synchronize_srcu(&encl->srcu);
-		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
-		kfree(encl_mm);
-
-		/* 'encl_mm' is gone, put encl_mm->encl reference: */
-		kref_put(&encl->refcount, sgx_encl_release);
-	}
-
+	sgx_encl_mm_drain(encl);
 	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index a8617e6a4b4e..3c91a705e720 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -451,6 +451,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 	if (unlikely(!encl))
 		return VM_FAULT_SIGBUS;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The page_array keeps track of all enclave pages, whether they
 	 * are swapped out or not. If there is no entry for this page and
@@ -649,7 +652,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 	if (!encl)
 		return -EFAULT;
 
-	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+	    test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
@@ -774,6 +778,45 @@ void sgx_encl_release(struct kref *ref)
 	kfree(encl);
 }
 
+/**
+ * sgx_encl_mm_drain - drain all mm_list entries
+ * @encl:	address of the sgx_encl to drain
+ *
+ * Used during oom kill to empty the mm_list entries after they have been
+ * zapped. Or used by sgx_release to drain the remaining mm_list entries when
+ * the enclave fd is closing. After this call, sgx_encl_release will be called
+ * with kref_put.
+ */
+void sgx_encl_mm_drain(struct sgx_encl *encl)
+{
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; ) {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+
+		/* 'encl_mm' is gone, put encl_mm->encl reference: */
+		kref_put(&encl->refcount, sgx_encl_release);
+	}
+}
+
 /*
  * 'mm' is exiting and no longer needs mmu notifications.
  */
@@ -845,6 +888,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	/*
 	 * Even though a single enclave may be mapped into an mm more than once,
 	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..cdb57ecb05c8 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@ enum sgx_encl_flags {
 	SGX_ENCL_DEBUG		= BIT(1),
 	SGX_ENCL_CREATED	= BIT(2),
 	SGX_ENCL_INITIALIZED	= BIT(3),
+	SGX_ENCL_NO_MEMORY	= BIT(4),
 };
 
 struct sgx_encl_mm {
@@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
 struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_mm_drain(struct sgx_encl *encl);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 50ddd8988452..e1209e2cf6a3 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -420,6 +420,9 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
 		return -EFAULT;
 
@@ -605,6 +608,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
 		return -EFAULT;
 
@@ -681,6 +687,9 @@ static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
 	if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f3a3ed894616..3b875ab4dcd0 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -621,6 +621,146 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl *encl;
+
+	if (epc_page->flags & SGX_EPC_OWNER_PAGE)
+		encl = epc_page->encl_page->encl;
+	else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
+		encl = epc_page->encl;
+	else
+		return false;
+
+	return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *epc_page, *tmp;
+
+	if (list_empty(&lru->unreclaimable))
+		return NULL;
+
+	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+		list_del_init(&epc_page->list);
+
+		if (sgx_oom_get_ref(epc_page))
+			return epc_page;
+	}
+
+	return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+			    unsigned long end, const struct vm_operations_struct *ops)
+{
+	VMA_ITERATOR(vmi, mm, start);
+	struct vm_area_struct *vma;
+
+	/*
+	 * Use end because start can be zero and not mapped into
+	 * enclave even if encl->base = 0.
+	 */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_ops == ops && vma->vm_private_data == owner &&
+		    vma->vm_start < end) {
+			zap_vma_pages(vma);
+		}
+	}
+}
+
+static bool sgx_oom_encl(struct sgx_encl *encl)
+{
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = false;
+	int idx;
+
+	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+		goto out_put;
+
+	/* Done OOM on this enclave previously, do not redo it.
+	 * This may happen when the SECS page is still UNRECLAIMABLE because
+	 * another page is in RECLAIM_IN_PROGRESS. Still return true so OOM
+	 * killer can wait until the reclaimer done with the hold-up page and
+	 * SECS before it move on to find another victim.
+	 */
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		goto out;
+
+	set_bit(SGX_ENCL_NO_MEMORY, &encl->flags);
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+					encl->base + encl->size, &sgx_vm_ops);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+	sgx_encl_mm_drain(encl);
+out:
+	ret = true;
+
+out_put:
+	/*
+	 * This puts the refcount we took when we identified this enclave as
+	 * an OOM victim.
+	 */
+	kref_put(&encl->refcount, sgx_encl_release);
+	return ret;
+}
+
+static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+	return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:	LRU that is low
+ *
+ * Return:	%true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *victim;
+
+	spin_lock(&lru->lock);
+	victim = sgx_oom_get_victim(lru);
+	spin_unlock(&lru->lock);
+
+	if (!victim)
+		return false;
+
+	if (victim->flags & SGX_EPC_OWNER_PAGE)
+		return sgx_oom_encl_page(victim->encl_page);
+
+	if (victim->flags & SGX_EPC_OWNER_ENCL)
+		return sgx_oom_encl(victim->encl);
+
+	/* Will never happen unless we add more owner types in future */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 					 unsigned long index,
 					 struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 337747bef7c2..6c0bfdc209c0 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -178,6 +178,7 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 
 void sgx_ipi_cb(void *info);
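[The victim-selection step in sgx_epc_oom() above -- walk the unreclaimable
list, unlink each page, and take the first one whose owning enclave can
still be pinned -- can be modeled in userspace. This is a hypothetical
sketch with invented types; get_unless_zero() stands in for
kref_get_unless_zero(), which fails once the enclave's refcount has already
dropped to zero, i.e. the enclave is being released and must be skipped.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's refcounted enclave and EPC page. */
struct model_encl {
	int refcount;
};

struct model_page {
	struct model_encl *encl;
	struct model_page *next;
};

/* Model of kref_get_unless_zero(): pin only a still-live owner. */
static int get_unless_zero(struct model_encl *encl)
{
	if (encl->refcount == 0)
		return 0;	/* owner already being released */
	encl->refcount++;
	return 1;
}

/*
 * Model of sgx_oom_get_victim(): pop entries off the list until one
 * with a pinnable owner is found; dead owners are simply dropped.
 */
static struct model_page *pick_victim(struct model_page **list)
{
	struct model_page *p;

	while ((p = *list) != NULL) {
		*list = p->next;	/* list_del_init() equivalent */
		if (get_unless_zero(p->encl))
			return p;	/* first page with a live owner */
	}
	return NULL;
}
```

The caller then owns a reference on the victim's enclave, mirroring how
sgx_oom_encl() ends with a kref_put() to drop the pin taken here.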