Message ID | 20231030182013.40086-1-haitao.huang@linux.intel.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> From: Haitao Huang <haitao.huang@linux.intel.com> To: jarkko@kernel.org, dave.hansen@linux.intel.com, tj@kernel.org, mkoutny@suse.com, linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org, x86@kernel.org, cgroups@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, sohil.mehta@intel.com Cc: zhiquan1.li@intel.com, kristen@linux.intel.com, seanjc@google.com, zhanb@microsoft.com, anakrish@microsoft.com, mikko.ylinen@linux.intel.com, yangjie@microsoft.com, Haitao Huang <haitao.huang@linux.intel.com> Subject: [PATCH v6 00/12] Add Cgroup support for SGX EPC memory Date: Mon, 30 Oct 2023 11:20:01 -0700 Message-Id: <20231030182013.40086-1-haitao.huang@linux.intel.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
Add Cgroup support for SGX EPC memory
|
|
Message
Haitao Huang
Oct. 30, 2023, 6:20 p.m. UTC
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod level control in a Kubernetes cluster on a VM or baremetal host [1,2].

This patchset implements the support for sgx_epc memory within the misc cgroup controller. The user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup as well as the total system capacity.

With the EPC misc controller enabled, every EPC page allocation is accounted for a cgroup's usage, reflected in the 'sgx_epc' entry in the 'misc.current' interface file of the cgroup. Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store (normal system memory allocated via shmem, accounted by the memory controller). When the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts a reclamation process to swap out some EPC pages within the same cgroup and its descendants to their backing store. Although the SGX architecture supports swapping for all pages, to avoid extra complexities, this implementation does not support swapping for certain page types, e.g. Version Array (VA) pages, and treats them as unreclaimable pages. When the limit is reached but nothing is left in the cgroup for reclamation, i.e., only unreclaimable pages are left, any new EPC allocation in the cgroup will result in an ENOMEM error.

The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [5]. Therefore they are also treated as unreclaimable from the cgroup's point of view. And the virtual EPC driver translates an ENOMEM error resulting from an EPC allocation request into a SIGBUS to the user process.

This work was originally authored by Sean Christopherson a few years ago, and previously modified by Kristen C. Accardi to utilize the misc cgroup controller rather than a custom controller. I have been updating the patches based on review comments since V2 [3, 4, 10], simplified the implementation/design and fixed some stability issues found from testing.

The patches are organized as follows:
- Patches 1-3 are prerequisite misc cgroup changes for adding new APIs, structs, resource types.
- Patch 4 implements basic misc controller for EPC without reclamation.
- Patches 5-9 prepare for per-cgroup reclamation.
  * Separate out the existing infrastructure of tracking reclaimable pages from the global reclaimer (ksgxd) to a newly created LRU list struct.
  * Separate out reusable top-level functions for reclamation.
- Patch 10 adds support for per-cgroup reclamation.
- Patch 11 adds documentation for the EPC cgroup.
- Patch 12 adds test scripts.

I appreciate your review and providing tags if appropriate.

---
V6:
- Dropped OOM killing path, only implement non-preemptive enforcement of max limit (Dave, Michal)
- Simplified reclamation flow by taking out sgx_epc_reclaim_control, forced reclamation by ignoring 'age'.
- Restructured patches: split misc API + resource types patch and the big EPC cgroup patch (Kai, Michal)
- Dropped some Tested-by/Reviewed-by tags due to significant changes
- Added more selftests

v5:
- Replace the manual test script with a selftest script.
- Restore the "From" tag for some patches to Sean (Kai)
- Style fixes (Jarkko)

v4:
- Collected "Tested-by" from Mikko. I kept it for now as no functional changes in v4.
- Rebased onto v6.6-rc1 and reordered patches as described above.
- Separated out the bug fixes [7,8,9]. This series depends on those patches. (Dave, Jarkko)
- Added comments in commit message to give more preview of what's to come next. (Jarkko)
- Fixed some documentation error, gap, style (Mikko, Randy)
- Fixed some comments, typo, style in code (Mikko, Kai)
- Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
- Fixed typo (Pavel)
- Exclude the previous fixes/enhancements for self-tests. Patch 18 now depends on series [6]
- Use the same To list for cover and all patches. (Solo)

v3:
- Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
- Unrolled wrappers for cond_resched, list (Dave)
- Separate patches for adding reclaimable and unreclaimable lists. (Dave)
- Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
- Simplified the cgroup tree walking with plain css_for_each_descendant_pre.
- Fixed race conditions and crashes.
- OOM killer to wait for the victim enclave pages being reclaimed.
- Unblock the user by handling misc_max_write callback asynchronously.
- Rebased onto 6.4 and no longer base this series on the MCA patchset.
- Fix an overflow in misc_try_charge.
- Fix a NULL pointer in SGX PF handler.
- Updated and included the SGX selftest patches previously reviewed. Those patches fix issues triggered in high EPC pressure required for cgroup testing.
- Added test scripts to help set up and test SGX EPC cgroups.

[1] https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2] https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3] https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
[4] https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@linux.intel.com/
[5] Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[6] https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko@kernel.org/
[7] https://lore.kernel.org/linux-sgx/ZLcXmvDKheCRYOjG@slm.duckdns.org/
[8] https://lore.kernel.org/linux-sgx/20230721120231.13916-1-haitao.huang@linux.intel.com/
[9] https://lore.kernel.org/linux-sgx/20230728051024.33063-1-haitao.huang@linux.intel.com/
[10] https://lore.kernel.org/all/20230923030657.16148-1-haitao.huang@linux.intel.com/

Haitao Huang (2):
  x86/sgx: Introduce EPC page states
  selftests/sgx: Add scripts for EPC cgroup testing

Kristen Carlson Accardi (5):
  cgroup/misc: Add per resource callbacks for CSS events
  cgroup/misc: Export APIs for SGX driver
  cgroup/misc: Add SGX EPC resource type
  x86/sgx: Implement basic EPC misc cgroup functionality
  x86/sgx: Implement EPC reclamation for cgroup

Sean Christopherson (5):
  x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
  x86/sgx: Use sgx_epc_lru_list for existing active page list
  x86/sgx: Use a list to track to-be-reclaimed pages
  x86/sgx: Restructure top-level EPC reclaim function
  Docs/x86/sgx: Add description for cgroup support

 Documentation/arch/x86/sgx.rst                |  74 ++++
 arch/x86/Kconfig                              |  13 +
 arch/x86/kernel/cpu/sgx/Makefile              |   1 +
 arch/x86/kernel/cpu/sgx/encl.c                |   2 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 319 ++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h          |  49 +++
 arch/x86/kernel/cpu/sgx/main.c                | 245 +++++++++-----
 arch/x86/kernel/cpu/sgx/sgx.h                 |  88 ++++-
 include/linux/misc_cgroup.h                   |  42 +++
 kernel/cgroup/misc.c                          |  52 ++-
 .../selftests/sgx/run_epc_cg_selftests.sh     | 196 +++++++++++
 .../selftests/sgx/watch_misc_for_tests.sh     |  13 +
 12 files changed, 996 insertions(+), 98 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
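
The enforcement behavior described in the cover letter (charge every allocation, reclaim within the cgroup once the limit is reached, fail with ENOMEM when only unreclaimable pages remain) can be illustrated with a small user-space toy model. This is only a sketch for intuition, not the kernel implementation: the class `EpcCgroup`, its fields, and the one-page-at-a-time reclaim are all hypothetical simplifications, and it ignores descendants, LRU ordering, and asynchronous reclaim.

```python
import errno


class EpcCgroup:
    """Toy model of one cgroup's EPC accounting (hypothetical, not kernel code)."""

    def __init__(self, limit_pages):
        self.limit = limit_pages   # stands in for the 'sgx_epc' entry in misc.max
        self.reclaimable = 0       # pages that can be swapped to backing store
        self.unreclaimable = 0     # e.g. VA pages or virtual EPC pages
        self.swapped = 0           # pages moved out to the backing store

    @property
    def current(self):
        # Stands in for the 'sgx_epc' entry in misc.current.
        return self.reclaimable + self.unreclaimable

    def alloc(self, reclaimable=True):
        """Charge one page, reclaiming one first if the limit is reached."""
        if self.current >= self.limit:
            if self.reclaimable == 0:
                # Only unreclaimable pages are left: the allocation fails.
                raise OSError(errno.ENOMEM, "EPC limit reached, nothing to reclaim")
            # Swap one reclaimable page out to make room.
            self.reclaimable -= 1
            self.swapped += 1
        if reclaimable:
            self.reclaimable += 1
        else:
            self.unreclaimable += 1


if __name__ == "__main__":
    cg = EpcCgroup(limit_pages=2)
    cg.alloc(reclaimable=False)
    cg.alloc()
    cg.alloc()                   # at the limit: one page is swapped out first
    cg.alloc(reclaimable=False)  # swaps out the last reclaimable page
    try:
        cg.alloc()
    except OSError as e:
        print("allocation failed:", errno.errorcode[e.errno])
```

In this model usage never exceeds the limit: an allocation at the limit either displaces a reclaimable page or fails, which mirrors the "non-preemptive enforcement of max limit" noted in the V6 changelog.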
Comments
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> [...]

Is this expected to work on NUC7?

Planning to test this next week (no time this week).

BR, Jarkko
On Sun, 05 Nov 2023 21:26:44 -0600, Jarkko Sakkinen <jarkko@kernel.org> wrote:
> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> [...]
>
> Is this expected to work on NUC7?
>
> Planning to test this next week (no time this week).
>
> BR, Jarkko

I don't see a reason why it would not work on a NUC. I'll try to get access to one and test it too. The only difference I can think of is that you may have lower capacity. The selftest scripts try not to hard-code the limit numbers, so I expect they should also work. Also, the changes do not depend on EDMM, so any NUC should be fine.

BTW, I'll also send a fixup patch for an issue that Mikko found in his testing: memcg is not charged for cgroup workqueue reclamation. Please apply that on top of the series when testing.

Thanks
Haitao
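
One way to avoid hard-coding limit numbers on machines with different EPC sizes is to derive each limit from the reported capacity. The sketch below is not taken from the selftest scripts; the helper names `parse_misc_file` and `limit_from_capacity` are hypothetical. It assumes the misc controller's usual flat format, one `<resource> <value>` pair per line in `misc.capacity`, `misc.current`, and `misc.max`, where `max` means no limit.

```python
def parse_misc_file(text):
    """Parse '<resource> <value>' lines; the value 'max' is mapped to None."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        entries[name] = None if value == "max" else int(value)
    return entries


def limit_from_capacity(capacity_text, percent, resource="sgx_epc"):
    """Build the string to write to misc.max as a percentage of capacity."""
    capacity = parse_misc_file(capacity_text)[resource]
    return f"{resource} {capacity * percent // 100}"


if __name__ == "__main__":
    # Example content; on a real system this would be read from the root
    # cgroup's misc.capacity file.
    sample = "sgx_epc 4096\n"
    print(limit_from_capacity(sample, 50))
```

The returned string would then be written to the target cgroup's `misc.max` file, so the same test scales down automatically on a machine with a smaller EPC.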
On Mon, 06 Nov 2023 09:48:36 -0600, Haitao Huang <haitao.huang@linux.intel.com> wrote:

> On Sun, 05 Nov 2023 21:26:44 -0600, Jarkko Sakkinen <jarkko@kernel.org>
> wrote:
>
>> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>>> SGX Enclave Page Cache (EPC) memory allocations are separate from
>>> normal RAM allocations, and are managed solely by the SGX subsystem.
>>> The existing cgroup memory controller cannot be used to limit or
>>> account for SGX EPC memory, which is a desirable feature in some
>>> environments, e.g., support for pod level control in a Kubernetes
>>> cluster on a VM or baremetal host [1,2].
>>>
>>> This patchset implements the support for sgx_epc memory within the
>>> misc cgroup controller. The user can use the misc cgroup controller
>>> to set and enforce a max limit on total EPC usage per cgroup. The
>>> implementation reports current usage and events of reaching the limit
>>> per cgroup as well as the total system capacity.
>>>
>>> With the EPC misc controller enabled, every EPC page allocation is
>>> accounted for a cgroup's usage, reflected in the 'sgx_epc' entry in
>>> the 'misc.current' interface file of the cgroup.
>>>
>>> Much like normal system memory, EPC memory can be overcommitted via
>>> virtual memory techniques and pages can be swapped out of the EPC to
>>> their backing store (normal system memory allocated via shmem,
>>> accounted by the memory controller). When the EPC usage of a cgroup
>>> reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the
>>> cgroup starts a reclamation process to swap out some EPC pages within
>>> the same cgroup and its descendants to their backing store. Although
>>> the SGX architecture supports swapping for all pages, to avoid extra
>>> complexities, this implementation does not support swapping for
>>> certain page types, e.g. Version Array (VA) pages, and treats them as
>>> unreclaimable pages. When the limit is reached but nothing is left in
>>> the cgroup for reclamation, i.e., only unreclaimable pages are left,
>>> any new EPC allocation in the cgroup will result in an ENOMEM error.
>>>
>>> The EPC pages allocated for guest VMs by the virtual EPC driver are
>>> not reclaimable by the host kernel [5]. Therefore they are also
>>> treated as unreclaimable from the cgroup's point of view. And the
>>> virtual EPC driver translates an ENOMEM error resulting from an EPC
>>> allocation request into a SIGBUS to the user process.
>>>
>>> This work was originally authored by Sean Christopherson a few years
>>> ago, and previously modified by Kristen C. Accardi to utilize the misc
>>> cgroup controller rather than a custom controller. I have been
>>> updating the patches based on review comments since V2 [3, 4, 10],
>>> simplified the implementation/design and fixed some stability issues
>>> found from testing.
>>>
>>> The patches are organized as follows:
>>> - Patches 1-3 are prerequisite misc cgroup changes for adding new
>>>   APIs, structs, resource types.
>>> - Patch 4 implements basic misc controller for EPC without reclamation.
>>> - Patches 5-9 prepare for per-cgroup reclamation.
>>>     * Separate out the existing infrastructure of tracking reclaimable
>>>       pages from the global reclaimer (ksgxd) to a newly created LRU
>>>       list struct.
>>>     * Separate out reusable top-level functions for reclamation.
>>> - Patch 10 adds support for per-cgroup reclamation.
>>> - Patch 11 adds documentation for the EPC cgroup.
>>> - Patch 12 adds test scripts.
>>>
>>> I appreciate your review and providing tags if appropriate.
>>>
>>> ---
>>> V6:
>>> - Dropped OOM killing path, only implement non-preemptive enforcement
>>>   of max limit (Dave, Michal)
>>> - Simplified reclamation flow by taking out sgx_epc_reclaim_control,
>>>   forced reclamation by ignoring 'age'.
>>> - Restructured patches: split misc API + resource types patch and the
>>>   big EPC cgroup patch (Kai, Michal)
>>> - Dropped some Tested-by/Reviewed-by tags due to significant changes
>>> - Added more selftests
>>>
>>> v5:
>>> - Replace the manual test script with a selftest script.
>>> - Restore the "From" tag for some patches to Sean (Kai)
>>> - Style fixes (Jarkko)
>>>
>>> v4:
>>> - Collected "Tested-by" from Mikko. I kept it for now as no functional
>>>   changes in v4.
>>> - Rebased on to v6.6_rc1 and reordered patches as described above.
>>> - Separated out the bug fixes [7,8,9]. This series depends on those
>>>   patches. (Dave, Jarkko)
>>> - Added comments in commit message to give more preview what's to come
>>>   next. (Jarkko)
>>> - Fixed some documentation error, gap, style (Mikko, Randy)
>>> - Fixed some comments, typo, style in code (Mikko, Kai)
>>> - Patch format and background for reclaimable vs unreclaimable (Kai,
>>>   Jarkko)
>>> - Fixed typo (Pavel)
>>> - Exclude the previous fixes/enhancements for self-tests. Patch 18 now
>>>   depends on series [6]
>>> - Use the same to list for cover and all patches. (Solo)
>>>
>>> v3:
>>> - Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
>>> - Unrolled wrappers for cond_resched, list (Dave)
>>> - Separate patches for adding reclaimable and unreclaimable lists.
>>>   (Dave)
>>> - Other improvements on patch flow, commit messages, styles. (Dave,
>>>   Jarkko)
>>> - Simplified the cgroup tree walking with plain
>>>   css_for_each_descendant_pre.
>>> - Fixed race conditions and crashes.
>>> - OOM killer to wait for the victim enclave pages being reclaimed.
>>> - Unblock the user by handling misc_max_write callback asynchronously.
>>> - Rebased onto 6.4 and no longer base this series on the MCA patchset.
>>> - Fix an overflow in misc_try_charge.
>>> - Fix a NULL pointer in SGX PF handler.
>>> - Updated and included the SGX selftest patches previously reviewed.
>>>   Those patches fix issues triggered in high EPC pressure required for
>>>   cgroup testing.
>>> - Added test scripts to help setup and test SGX EPC cgroups.
>>>
>>> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
>>> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
>>> [3]https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
>>> [4]https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@linux.intel.com/
>>> [5]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
>>> [6]https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko@kernel.org/
>>> [7]https://lore.kernel.org/linux-sgx/ZLcXmvDKheCRYOjG@slm.duckdns.org/
>>> [8]https://lore.kernel.org/linux-sgx/20230721120231.13916-1-haitao.huang@linux.intel.com/
>>> [9]https://lore.kernel.org/linux-sgx/20230728051024.33063-1-haitao.huang@linux.intel.com/
>>> [10]https://lore.kernel.org/all/20230923030657.16148-1-haitao.huang@linux.intel.com/
>>>
>>> Haitao Huang (2):
>>>   x86/sgx: Introduce EPC page states
>>>   selftests/sgx: Add scripts for EPC cgroup testing
>>>
>>> Kristen Carlson Accardi (5):
>>>   cgroup/misc: Add per resource callbacks for CSS events
>>>   cgroup/misc: Export APIs for SGX driver
>>>   cgroup/misc: Add SGX EPC resource type
>>>   x86/sgx: Implement basic EPC misc cgroup functionality
>>>   x86/sgx: Implement EPC reclamation for cgroup
>>>
>>> Sean Christopherson (5):
>>>   x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
>>>   x86/sgx: Use sgx_epc_lru_list for existing active page list
>>>   x86/sgx: Use a list to track to-be-reclaimed pages
>>>   x86/sgx: Restructure top-level EPC reclaim function
>>>   Docs/x86/sgx: Add description for cgroup support
>>>
>>> Documentation/arch/x86/sgx.rst | 74 ++++
>>> arch/x86/Kconfig | 13 +
>>> arch/x86/kernel/cpu/sgx/Makefile | 1 +
>>> arch/x86/kernel/cpu/sgx/encl.c | 2 +-
>>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 319 ++++++++++++++++++
>>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 49 +++
>>> arch/x86/kernel/cpu/sgx/main.c | 245 +++++++++-----
>>> arch/x86/kernel/cpu/sgx/sgx.h | 88 ++++-
>>> include/linux/misc_cgroup.h | 42 +++
>>> kernel/cgroup/misc.c | 52 ++-
>>> .../selftests/sgx/run_epc_cg_selftests.sh | 196 +++++++++++
>>> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
>>> 12 files changed, 996 insertions(+), 98 deletions(-)
>>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>>> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>>>
>>
>> Is this expected to work on NUC7?
>>
>> Planning to test this next week (no time this week).
>>
>> BR, Jarkko
>
> I don't see a reason why it would not be working on a NUC. I'll try to
> get access to one and test it too.

Tried on a NUC with about 90M EPC. The selftests worked fine.

BR
Haitao
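For readers following the thread, the enforcement behaviour described in the quoted cover letter (charge every page, reclaim from the same cgroup once the 'misc.max' limit is hit, fail with ENOMEM when only unreclaimable pages remain) can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel code from the series: the class name `EpcCgroupModel`, its fields, and the `max_events` counter are all hypothetical, and the model ignores descendant cgroups, waiting for reclaim, and the SIGBUS translation done by the virtual EPC driver.

```python
import errno
from collections import deque


class EpcCgroupModel:
    """Toy user-space model of one cgroup's EPC accounting (not kernel code)."""

    def __init__(self, limit_pages):
        self.limit = limit_pages      # stands in for 'sgx_epc' in misc.max
        self.reclaimable = deque()    # LRU order: oldest page on the left
        self.unreclaimable = 0        # e.g. VA pages or virtual EPC pages
        self.swapped_out = []         # pages written to the backing store
        self.max_events = 0           # hypothetical count of limit hits

    @property
    def current(self):
        # Stands in for the 'sgx_epc' entry in misc.current.
        return len(self.reclaimable) + self.unreclaimable

    def alloc(self, page_id, reclaimable=True):
        """Charge one page, reclaiming from this cgroup if it is at its limit."""
        if self.current >= self.limit:
            self.max_events += 1
            if not self.reclaimable:
                # Only unreclaimable pages are left: the allocation fails.
                raise OSError(errno.ENOMEM, "EPC limit reached")
            # Swap the least recently used page out to the backing store.
            self.swapped_out.append(self.reclaimable.popleft())
        if reclaimable:
            self.reclaimable.append(page_id)
        else:
            self.unreclaimable += 1
```

With a limit of two pages, allocating "a", "b" and then "c" pushes "a" out to the backing store and leaves usage at the limit; a cgroup whose limit is consumed entirely by unreclaimable pages gets ENOMEM on its next allocation.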
There's very little about how the LRU design came to be in this cover
letter. Let's add some details.

How's this?

Writing this up, I'm a lot more convinced that this series is, in
general, taking the right approach. I honestly don't see any other
alternatives. As much as I'd love to do something stupidly simple like
just killing enclaves at the moment they hit the limit, that would be a
horrid experience for users _and_ a departure from what the existing
reclaim support does.

That said, there's still a lot of work to do to refactor this series.
It's in need of some love to make it more clear what is going on and to
make the eventual switch over to per-cgroup LRUs more gradual. Each
patch in the series is still doing way too much, _especially_ in patch
10.

==

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the
same reasons: it's preferable to wait for memory to become available and
let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list
(and the existing scanning algorithm) is essentially useless for doing
reclaim when a cgroup hits its limit because the cgroup's pages are
scattered around that LRU. It is unspeakably inefficient to scan a
linked list with millions of entries for what could be dozens of pages
from a cgroup that needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It
would either need to hold the list locks for unreasonable amounts of
time, or be taught to scan the list in pieces, which has its own
challenges.

tl;dr: A cgroup hitting its limit should be as similar as possible to
the system running out of EPC memory. The only two choices to implement
that are nasty changes to the existing LRU scanning algorithm, or to add
new LRUs. The result: Add a new LRU for each cgroup and scan those
instead. Replace the existing global LRU with the root cgroup's LRU
(only when this new support is compiled in, obviously).
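The cost argument above can be illustrated with a rough user-space model that counts how many list entries each layout has to examine to reclaim a few pages for one cgroup. It is a sketch under assumptions, not the series' implementation: the function names are hypothetical, `OrderedDict` stands in for the kernel's linked lists, locking is ignored, and the layout built by `build()` puts the small cgroup's pages at the tail of the shared list, which is the worst case rather than the "scattered" case.

```python
from collections import OrderedDict, defaultdict


def reclaim_from_global(global_lru, cgroup, nr_pages):
    """Scan one shared LRU for pages owned by `cgroup`.

    Returns (reclaimed page ids, number of entries examined).
    """
    reclaimed, examined = [], 0
    for page_id, owner in list(global_lru.items()):
        if len(reclaimed) == nr_pages:
            break
        examined += 1
        if owner == cgroup:
            del global_lru[page_id]
            reclaimed.append(page_id)
    return reclaimed, examined


def reclaim_from_cgroup(per_cgroup_lru, cgroup, nr_pages):
    """Pop pages straight off the LRU that belongs to `cgroup`."""
    lru = per_cgroup_lru[cgroup]
    reclaimed = []
    while lru and len(reclaimed) < nr_pages:
        page_id, _ = lru.popitem(last=False)  # oldest entry first
        reclaimed.append(page_id)
    return reclaimed, len(reclaimed)


def build(nr_big, nr_small):
    """Populate both layouts with the same pages; "small" pages come last."""
    global_lru = OrderedDict()
    per_cgroup_lru = defaultdict(OrderedDict)
    for i in range(nr_big):
        global_lru[i] = "big"
        per_cgroup_lru["big"][i] = "big"
    for i in range(nr_big, nr_big + nr_small):
        global_lru[i] = "small"
        per_cgroup_lru["small"][i] = "small"
    return global_lru, per_cgroup_lru
```

Calling `build(1000, 10)` and then reclaiming four pages for "small" returns the same pages from both functions, but the shared-list scan walks past every "big" entry first while the per-cgroup LRU touches only the entries it returns.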
On Fri, 05 Jan 2024 12:29:05 -0600, Dave Hansen <dave.hansen@intel.com> wrote:

> There's very little about how the LRU design came to be in this cover
> letter. Let's add some details.
>
> How's this?
>
> Writing this up, I'm a lot more convinced that this series is, in
> general, taking the right approach. I honestly don't see any other
> alternatives. As much as I'd love to do something stupidly simple like
> just killing enclaves at the moment they hit the limit, that would be a
> horrid experience for users _and_ a departure from what the existing
> reclaim support does.
>
> That said, there's still a lot of work to do to refactor this series.
> It's in need of some love to make it more clear what is going on and to
> make the eventual switch over to per-cgroup LRUs more gradual. Each
> patch in the series is still doing way too much, _especially_ in patch
> 10.
>
> ==
>
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>
> This series attempts to implement that same logic for cgroups, for the
> same reasons: it's preferable to wait for memory to become available and
> let reclaim happen than to do things that are fatal to enclaves.
>
> There is currently a global reclaimable page SGX LRU list. That list
> (and the existing scanning algorithm) is essentially useless for doing
> reclaim when a cgroup hits its limit because the cgroup's pages are
> scattered around that LRU. It is unspeakably inefficient to scan a
> linked list with millions of entries for what could be dozens of pages
> from a cgroup that needs reclaim.
>
> Even if unspeakably slow reclaim was accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It
> would either need to hold the list locks for unreasonable amounts of
> time, or be taught to scan the list in pieces, which has its own
> challenges.
>
> tl;dr: A cgroup hitting its limit should be as similar as possible to
> the system running out of EPC memory. The only two choices to implement
> that are nasty changes to the existing LRU scanning algorithm, or to add
> new LRUs. The result: Add a new LRU for each cgroup and scan those
> instead. Replace the existing global LRU with the root cgroup's LRU
> (only when this new support is compiled in, obviously).
>

I'll add this to the cover letter as a section justifying the LRU design
for per-cgroup reclaiming.

Thank you very much.
Haitao