[v6,09/12] x86/sgx: Restructure top-level EPC reclaim function

Message ID 20231030182013.40086-10-haitao.huang@linux.intel.com
State New
Series Add Cgroup support for SGX EPC memory

Commit Message

Haitao Huang Oct. 30, 2023, 6:20 p.m. UTC
  From: Sean Christopherson <sean.j.christopherson@intel.com>

To prepare for per-cgroup reclamation, separate the top-level reclaim
function, sgx_reclaim_epc_pages(), into two separate functions:

- sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given LRU list.
- sgx_do_epc_reclamation() performs the real reclamation for the already isolated pages.

Create a new function, sgx_reclaim_epc_pages_global(), calling those two
in succession, to replace the original sgx_reclaim_epc_pages(). The
above two functions will serve as building blocks for the reclamation
flows in the later EPC cgroup implementation.

sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
cgroup will use the result to track reclaiming progress.

sgx_isolate_epc_pages() returns the additional number of pages to scan
for the current epoch of reclamation. The EPC cgroup will use the
result to determine whether more scanning needs to be done in the LRUs
of its child cgroups.
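
As a rough illustration (not the literal patch hunk; the exact
signatures in the patch may differ), the global path composes the two
building blocks roughly like this:

	static size_t sgx_reclaim_epc_pages_global(void)
	{
		LIST_HEAD(iso);

		/* Scan the global LRU and move candidates onto @iso. */
		sgx_isolate_epc_pages(&sgx_global_lru, SGX_NR_TO_SCAN, &iso);

		/* EBLOCK/EWB the isolated pages; returns pages reclaimed. */
		return sgx_do_epc_reclamation(&iso);
	}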

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V6:
- Restructure patches to make it easier to review. (Kai)
- Fix unused nr_to_scan (Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 97 ++++++++++++++++++++++------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  8 +++
 2 files changed, 72 insertions(+), 33 deletions(-)
  

Comments

Kai Huang Nov. 20, 2023, 3:45 a.m. UTC | #1
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> To prepare for per-cgroup reclamation, separate the top-level reclaim
> function, sgx_reclaim_epc_pages(), into two separate functions:
> 
> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given LRU list.
> - sgx_do_epc_reclamation() performs the real reclamation for the already isolated pages.
> 
> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> in succession, to replace the original sgx_reclaim_epc_pages(). The
> above two functions will serve as building blocks for the reclamation
> flows in later EPC cgroup implementation.
> 
> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> cgroup will use the result to track reclaiming progress.
> 
> sgx_isolate_epc_pages() returns the additional number of pages to scan
> for current epoch of reclamation. The EPC cgroup will use the result to
> determine if more scanning to be done in LRUs in its children groups.

This changelog says nothing about "why", but only mentions the "implementation".

For instance, assuming we need to reclaim @npages_to_reclaim from the
@epc_cgrp_to_reclaim and its descendants, why cannot we do:

	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
		if (npages_to_reclaim <= 0)
			return;

		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
					npages_to_reclaim);
	}

Is there any difference to have "isolate" + "reclaim"?

> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> 

[...]

> +/**
> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> + * @iso:		List of isolated pages for reclamation
> + *
> + * Take a list of EPC pages and reclaim them to the enclave's private shmem files.  Do not
> + * reclaim the pages that have been accessed since the last scan, and move each of those pages
> + * to the tail of its tracking LRU list.
> + *
> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
> + * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> + * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
> + * the lock contention too much, which would halt forward progress.

This is kinda optimization, correct?  Is there any real performance data to
justify this?  If this optimization is useful, shouldn't we bring this
optimization to the current sgx_reclaim_pages() instead, e.g., just increase
SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
  
Haitao Huang Nov. 26, 2023, 4:27 p.m. UTC | #2
On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> To prepare for per-cgroup reclamation, separate the top-level reclaim
>> function, sgx_reclaim_epc_pages(), into two separate functions:
>>
>> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a  
>> given LRU list.
>> - sgx_do_epc_reclamation() performs the real reclamation for the  
>> already isolated pages.
>>
>> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
>> in succession, to replace the original sgx_reclaim_epc_pages(). The
>> above two functions will serve as building blocks for the reclamation
>> flows in later EPC cgroup implementation.
>>
>> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
>> cgroup will use the result to track reclaiming progress.
>>
>> sgx_isolate_epc_pages() returns the additional number of pages to scan
>> for current epoch of reclamation. The EPC cgroup will use the result to
>> determine if more scanning to be done in LRUs in its children groups.
>
> This changelog says nothing about "why", but only mentions the  
> "implementation".
>
> For instance, assuming we need to reclaim @npages_to_reclaim from the
> @epc_cgrp_to_reclaim and its descendants, why cannot we do:
>
> 	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> 		if (npages_to_reclaim <= 0)
> 			return;
>
> 		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> 					npages_to_reclaim);
> 	}
>
> Is there any difference to have "isolate" + "reclaim"?
>

This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
Here we just follow the same design as ksgxd for each reclamation cycle.
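
For reference, this is the three-stage back-off in the existing
sgx_encl_ewb() that the batching is meant to serve (a simplified
sketch of the current reclaimer code, error handling omitted):

	ret = __sgx_encl_ewb(epc_page, va_slot, backing);	/* 1: plain EWB */
	if (ret == SGX_NOT_TRACKED) {
		/* 2: ETRACK the enclave, then retry EWB */
		__etrack(sgx_get_epc_virt_addr(encl->secs.epc_page));
		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
		if (ret == SGX_NOT_TRACKED) {
			/* 3: kick HW threads out of the enclave (IPIs), retry EWB */
			on_each_cpu_mask(sgx_encl_cpumask(encl), sgx_ipi_cb, NULL, 1);
			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
		}
	}

Batching the isolated pages amortizes those IPIs and ETRACKs over the
whole chunk instead of paying them per page.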

>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> ---
>>
>
> [...]
>
>> +/**
>> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC  
>> pages.
>> + * @iso:		List of isolated pages for reclamation
>> + *
>> + * Take a list of EPC pages and reclaim them to the enclave's private  
>> shmem files.  Do not
>> + * reclaim the pages that have been accessed since the last scan, and  
>> move each of those pages
>> + * to the tail of its tracking LRU list.
>> + *
>> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX  
>> per call in order to
>> + * degrade amount of IPI's and ETRACK's potentially required.  
>> sgx_encl_ewb() does degrade a bit
>> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +  
>> EWB and IPI + EWB) but not
>> + * sufficiently. Reclaiming one page at a time would also be  
>> problematic as it would increase
>> + * the lock contention too much, which would halt forward progress.
>
> This is kinda optimization, correct?  Is there any real performance data  
> to
> justify this?

The above sentences were there originally. This optimization was justified.

> If this optimization is useful, shouldn't we bring this
> optimization to the current sgx_reclaim_pages() instead, e.g., just  
> increase
> SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
>

SGX_NR_TO_SCAN_MAX might have been designed earlier for other reasons I
don't know. Currently it is really just the buffer size allocated in
sgx_reclaim_pages(). Both the cgroup and ksgxd scan 16 pages at a time.
Maybe we should just use SGX_NR_TO_SCAN; no _MAX needed. The point was
to batch reclamation to a certain number to minimize the impact on the
EWB pipeline. 16 was the original design.

Thanks
Haitao
  
Kai Huang Nov. 27, 2023, 9:57 a.m. UTC | #3
On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > 
> > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > 
> > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a  
> > > given LRU list.
> > > - sgx_do_epc_reclamation() performs the real reclamation for the  
> > > already isolated pages.
> > > 
> > > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > above two functions will serve as building blocks for the reclamation
> > > flows in later EPC cgroup implementation.
> > > 
> > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> > > cgroup will use the result to track reclaiming progress.
> > > 
> > > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > > for current epoch of reclamation. The EPC cgroup will use the result to
> > > determine if more scanning to be done in LRUs in its children groups.
> > 
> > This changelog says nothing about "why", but only mentions the  
> > "implementation".
> > 
> > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > 
> > 	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > 		if (npages_to_reclaim <= 0)
> > 			return;
> > 
> > 		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > 					npages_to_reclaim);
> > 	}
> > 
> > Is there any difference to have "isolate" + "reclaim"?
> > 
> 
> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> Here we just follow the same design as ksgxd for each reclamation cycle.

I don't see how you "followed" ksgxd.  If I am guessing correctly, you
are afraid that there might be fewer than 16 pages in a given EPC
cgroup, thus w/o splitting into "isolate" + "reclaim" you might feed
the "reclaim" fewer than 16 pages, which might cause some performance
degradation?

But is this a common case?  Should we even worry about this?

I suppose for such new feature we should bring functionality first and then
optimization if you have real performance data to show.

> 
> > > 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Cc: Sean Christopherson <seanjc@google.com>
> > > ---
> > > 
> > 
> > [...]
> > 
> > > +/**
> > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC  
> > > pages.
> > > + * @iso:		List of isolated pages for reclamation
> > > + *
> > > + * Take a list of EPC pages and reclaim them to the enclave's private  
> > > shmem files.  Do not
> > > + * reclaim the pages that have been accessed since the last scan, and  
> > > move each of those pages
> > > + * to the tail of its tracking LRU list.
> > > + *
> > > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX  
> > > per call in order to
> > > + * degrade amount of IPI's and ETRACK's potentially required.  
> > > sgx_encl_ewb() does degrade a bit
> > > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +  
> > > EWB and IPI + EWB) but not
> > > + * sufficiently. Reclaiming one page at a time would also be  
> > > problematic as it would increase
> > > + * the lock contention too much, which would halt forward progress.
> > 
> > This is kinda optimization, correct?  Is there any real performance data  
> > to
> > justify this?
> 
> The above sentences were there originally. This optimization was justified.

I am talking about 16 -> 32.

> 
> > If this optimization is useful, shouldn't we bring this
> > optimization to the current sgx_reclaim_pages() instead, e.g., just  
> > increase
> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
> > 
> 
> SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't  
> know. Currently it is really the buffer size allocated in  
> sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time.Maybe we  
> should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch  
> reclamation to certain number to minimize impact of EWB pipeline. 16 was  
> the original design.
> 

Please don't leave it to the reviewers to figure out why you are trying
to do this.  If you don't know, then just drop this.
  
Haitao Huang Dec. 12, 2023, 4:04 a.m. UTC | #4
Hi Kai

On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
>> On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
>> > >
>> > > To prepare for per-cgroup reclamation, separate the top-level  
>> reclaim
>> > > function, sgx_reclaim_epc_pages(), into two separate functions:
>> > >
>> > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from  
>> a
>> > > given LRU list.
>> > > - sgx_do_epc_reclamation() performs the real reclamation for the
>> > > already isolated pages.
>> > >
>> > > Create a new function, sgx_reclaim_epc_pages_global(), calling  
>> those two
>> > > in succession, to replace the original sgx_reclaim_epc_pages(). The
>> > > above two functions will serve as building blocks for the  
>> reclamation
>> > > flows in later EPC cgroup implementation.
>> > >
>> > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The  
>> EPC
>> > > cgroup will use the result to track reclaiming progress.
>> > >
>> > > sgx_isolate_epc_pages() returns the additional number of pages to  
>> scan
>> > > for current epoch of reclamation. The EPC cgroup will use the  
>> result to
>> > > determine if more scanning to be done in LRUs in its children  
>> groups.
>> >
>> > This changelog says nothing about "why", but only mentions the
>> > "implementation".
>> >
>> > For instance, assuming we need to reclaim @npages_to_reclaim from the
>> > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
>> >
>> > 	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
>> > 		if (npages_to_reclaim <= 0)
>> > 			return;
>> >
>> > 		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
>> > 					npages_to_reclaim);
>> > 	}
>> >
>> > Is there any difference to have "isolate" + "reclaim"?
>> >
>>
>> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
>> Here we just follow the same design as ksgxd for each reclamation cycle.
>
> I don't see how did you "follow" ksgxd.  If I am guessing correctly, you  
> are
> afraid of there might be less than 16 pages in a given EPC cgroup, thus  
> w/o
> splitting into "isolate" + "reclaim" you might feed the "reclaim" less  
> than 16
> pages, which might cause some performance degrade?
>
> But is this a common case?  Should we even worry about this?
>
> I suppose for such new feature we should bring functionality first and  
> then
> optimization if you have real performance data to show.
>
The concern is not about "reclaim less than 16".
I mean this is just refactoring with exactly the same design of ksgxd
preserved, in that we first isolate as many candidate EPC pages as
possible (up to 16; ignore the unneeded SGX_NR_TO_SCAN_MAX for now),
then do the EWB in one shot without anything else done in between. As
described in the original comments for sgx_reclaim_pages() and
sgx_encl_ewb(), this is to finish all EWBs quickly while minimizing the
impact of IPIs.

The way you proposed will work, but it alters the current design and
behavior when cgroups are enabled and the EPC pages of an enclave are
tracked across multiple LRUs within the descendant cgroups, in that you
will have the isolation loop, backing store allocation loop and EBLOCK
loop interleaved with the EWB loop.
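
In pseudocode, the difference is roughly the following (the
for_each_lru_in_subtree() helper here is made up for illustration):

	/* Flow this patch preserves per reclamation cycle: */
	for_each_lru_in_subtree(epc_cgrp_to_reclaim, lru)
		sgx_isolate_epc_pages(lru, nr_to_scan, &iso);

	sgx_do_epc_reclamation(&iso);	/* one backing-store/EBLOCK/EWB pass */

	/*
	 * vs. a monolithic per-LRU reclaim, which would run the backing
	 * store allocation, EBLOCK and EWB stages once per LRU in every
	 * cycle:
	 */
	for_each_lru_in_subtree(epc_cgrp_to_reclaim, lru)
		sgx_reclaim_pages_lru(lru, nr_to_reclaim);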

>>
>> > >
>> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Cc: Sean Christopherson <seanjc@google.com>
>> > > ---
>> > >
>> >
>> > [...]
>> >
>> > > +/**
>> > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC
>> > > pages.
>> > > + * @iso:		List of isolated pages for reclamation
>> > > + *
>> > > + * Take a list of EPC pages and reclaim them to the enclave's  
>> private
>> > > shmem files.  Do not
>> > > + * reclaim the pages that have been accessed since the last scan,  
>> and
>> > > move each of those pages
>> > > + * to the tail of its tracking LRU list.
>> > > + *
>> > > + * Limit the number of pages to be processed up to  
>> SGX_NR_TO_SCAN_MAX
>> > > per call in order to
>> > > + * degrade amount of IPI's and ETRACK's potentially required.
>> > > sgx_encl_ewb() does degrade a bit
>> > > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK  
>> +
>> > > EWB and IPI + EWB) but not
>> > > + * sufficiently. Reclaiming one page at a time would also be
>> > > problematic as it would increase
>> > > + * the lock contention too much, which would halt forward progress.
>> >
>> > This is kinda optimization, correct?  Is there any real performance  
>> data
>> > to
>> > justify this?
>>
>> The above sentences were there originally. This optimization was  
>> justified.
>
> I am talking about 16 -> 32.
>
>>
>> > If this optimization is useful, shouldn't we bring this
>> > optimization to the current sgx_reclaim_pages() instead, e.g., just
>> > increase
>> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
>> >
>>
>> SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't
>> know. Currently it is really the buffer size allocated in
>> sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time.Maybe we
>> should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch
>> reclamation to certain number to minimize impact of EWB pipeline. 16 was
>> the original design.
>>
>
> Please don't leave why you are trying to do this to the reviewers.  If  
> you don't
> know, then just drop this.
>

Fair enough. This was my oversight when doing all the changes and rebase.  
Will drop it.

Thanks
Haitao
  
Kai Huang Dec. 13, 2023, 11:17 a.m. UTC | #5
On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
> Hi Kai
> 
> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > 
> > > > > To prepare for per-cgroup reclamation, separate the top-level  
> > > reclaim
> > > > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > > > 
> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from  
> > > a
> > > > > given LRU list.
> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > > > already isolated pages.
> > > > > 
> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling  
> > > those two
> > > > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > > > above two functions will serve as building blocks for the  
> > > reclamation
> > > > > flows in later EPC cgroup implementation.
> > > > > 
> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The  
> > > EPC
> > > > > cgroup will use the result to track reclaiming progress.
> > > > > 
> > > > > sgx_isolate_epc_pages() returns the additional number of pages to  
> > > scan
> > > > > for current epoch of reclamation. The EPC cgroup will use the  
> > > result to
> > > > > determine if more scanning to be done in LRUs in its children  
> > > groups.
> > > > 
> > > > This changelog says nothing about "why", but only mentions the
> > > > "implementation".
> > > > 
> > > > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > > > 
> > > > 	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > > > 		if (npages_to_reclaim <= 0)
> > > > 			return;
> > > > 
> > > > 		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > > > 					npages_to_reclaim);
> > > > 	}
> > > > 
> > > > Is there any difference to have "isolate" + "reclaim"?
> > > > 
> > > 
> > > This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> > > Here we just follow the same design as ksgxd for each reclamation cycle.
> > 
> > I don't see how did you "follow" ksgxd.  If I am guessing correctly, you  
> > are
> > afraid of there might be less than 16 pages in a given EPC cgroup, thus  
> > w/o
> > splitting into "isolate" + "reclaim" you might feed the "reclaim" less  
> > than 16
> > pages, which might cause some performance degrade?
> > 
> > But is this a common case?  Should we even worry about this?
> > 
> > I suppose for such new feature we should bring functionality first and  
> > then
> > optimization if you have real performance data to show.
> > 
> The concern is not about "reclaim less than 16".
> I mean this is just refactoring with exactly the same design of ksgxd  
> preserved, 
> 

I literally have no idea what you are talking about here.  ksgxd() just calls
sgx_reclaim_pages(), which tries to reclaim 16 pages at once.
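
For reference, a simplified sketch of the current reclaimer thread
(initial page sanitization and freezer details omitted):

	static int ksgxd(void *p)
	{
		set_freezable();

		while (!kthread_should_stop()) {
			/* Sleep until reclaim is needed or the thread should stop. */
			wait_event_freezable(ksgxd_waitq,
					     kthread_should_stop() ||
					     sgx_should_reclaim(SGX_NR_HIGH_PAGES));

			if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
				sgx_reclaim_pages();	/* one 16-page batch */

			cond_resched();
		}

		return 0;
	}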

> in that we first isolate as many candidate EPC pages (up  to  
> 16, ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in  
> one shot without anything else done in between. 
> 

Assuming you are referring to the implementation of sgx_reclaim_pages(),
and assuming by "isolate" you mean removing EPC pages from the list
(which is exactly what the sgx_isolate_epc_pages() in this patch does),
what happens to the loops of "backing store allocation" and "EBLOCK"
before the loop of EWB?  Eaten by you?


> As described in original  
> comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to  
> finish all ewb quickly while minimizing impact of IPI.
> 
> The way you proposed will work but alters the current design and behavior  
> if cgroups is enabled and EPCs of an enclave are tracked across multiple  
> LRUs within the descendant cgroups, in that you will have isolation loop,  
> backing store allocation loop, eblock loop interleaved with the ewb loop.
> 

I have no idea what you are talking about.

The point is, with or w/o this patch, you can only reclaim 16 EPC pages in one
function call (as you have said you are going to remove SGX_NR_TO_SCAN_MAX,
which is a cipher to both of us).  The only difference I can see is, with this
patch, you can have multiple calls of "isolate" and then call the "do_reclaim"
once.

But what's the good of having the "isolate" if the "do_reclaim" can only reclaim
16 pages anyway?

Back to my last reply, are you afraid that an LRU may have fewer than 16
pages to "isolate", and therefore you need to loop over the LRUs of the
descendants to get 16?  Because I really cannot think of any other
reason why you are doing this.


> >
  
Haitao Huang Dec. 15, 2023, 7:49 p.m. UTC | #6
On Wed, 13 Dec 2023 05:17:11 -0600, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
>> Hi Kai
>>
>> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
>> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <kai.huang@intel.com>
>> > > wrote:
>> > >
>> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
>> > > > >
>> > > > > To prepare for per-cgroup reclamation, separate the top-level
>> > > reclaim
>> > > > > function, sgx_reclaim_epc_pages(), into two separate functions:
>> > > > >
>> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages  
>> from
>> > > a
>> > > > > given LRU list.
>> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
>> > > > > already isolated pages.
>> > > > >
>> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling
>> > > those two
>> > > > > in succession, to replace the original sgx_reclaim_epc_pages().  
>> The
>> > > > > above two functions will serve as building blocks for the
>> > > reclamation
>> > > > > flows in later EPC cgroup implementation.
>> > > > >
>> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages.  
>> The
>> > > EPC
>> > > > > cgroup will use the result to track reclaiming progress.
>> > > > >
>> > > > > sgx_isolate_epc_pages() returns the additional number of pages  
>> to
>> > > scan
>> > > > > for current epoch of reclamation. The EPC cgroup will use the
>> > > result to
>> > > > > determine if more scanning to be done in LRUs in its children
>> > > groups.
>> > > >
>> > > > This changelog says nothing about "why", but only mentions the
>> > > > "implementation".
>> > > >
>> > > > For instance, assuming we need to reclaim @npages_to_reclaim from  
>> the
>> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
>> > > >
>> > > > 	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp)  
>> {
>> > > > 		if (npages_to_reclaim <= 0)
>> > > > 			return;
>> > > >
>> > > > 		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
>> > > > 					npages_to_reclaim);
>> > > > 	}
>> > > >
>> > > > Is there any difference to have "isolate" + "reclaim"?
>> > > >
>> > >
>> > > This is to optimize "reclaim". See how etrack was done in  
>> sgx_encl_ewb.
>> > > Here we just follow the same design as ksgxd for each reclamation  
>> cycle.
>> >
>> > I don't see how did you "follow" ksgxd.  If I am guessing correctly,  
>> you
>> > are
>> > afraid of there might be less than 16 pages in a given EPC cgroup,  
>> thus
>> > w/o
>> > splitting into "isolate" + "reclaim" you might feed the "reclaim" less
>> > than 16
>> > pages, which might cause some performance degrade?
>> >
>> > But is this a common case?  Should we even worry about this?
>> >
>> > I suppose for such new feature we should bring functionality first and
>> > then
>> > optimization if you have real performance data to show.
>> >
>> The concern is not about "reclaim less than 16".
>> I mean this is just refactoring with exactly the same design of ksgxd
>> preserved,
>
> I literally have no idea what you are talking about here.  ksgxd() just  
> calls
> sgx_reclaim_pages(), which tries to reclaim 16 pages at once.
>
>> in that we first isolate as many candidate EPC pages (up  to
>> 16, ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb  
>> in
>> one shot without anything else done in between.
>
> Assuming you are referring the implementation of sgx_reclaim_pages(), and
> assuming the "isolate" you mean removing EPC pages from the list (which  
> is
> exactly what the sgx_isolate_epc_pages() in this patch does), what  
> happens to
> the loops of "backing store allocation" and "EBLOCK", before the loop of  
> EWB?Eaten by you?
>

I skipped those because what really matters is to keep the EWB loop
separate and run it in one shot for each reclaiming cycle, independent
of the number of LRUs.  All the loops in the original
sgx_reclaim_pages() except the "isolate" loop do not deal with the
multiple LRUs of cgroups later. That's the reason to refactor out only
the "isolate" part and loop it through the cgroup LRUs in later patches.
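
For context, the phases of the existing sgx_reclaim_pages() look
roughly like this (simplified; lock and list names elided):

	/* 1) "isolate": pop up to SGX_NR_TO_SCAN pages off the LRU, under its lock */
	/* 2) age each page and allocate backing storage for it                     */
	/* 3) EBLOCK each page (sgx_reclaimer_block())                              */
	/* 4) EWB each page (sgx_reclaimer_write(), with the ETRACK/IPI back-off)   */

Only phase 1 touches the LRU list; phases 2-4 operate on the already
isolated chunk, which is why only phase 1 needs to become per-LRU for
cgroups.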

>
>> As described in original
>> comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to
>> finish all ewb quickly while minimizing impact of IPI.
>>
>> The way you proposed will work but alters the current design and  
>> behavior
>> if cgroups is enabled and EPCs of an enclave are tracked across multiple
>> LRUs within the descendant cgroups, in that you will have isolation  
>> loop,
>> backing store allocation loop, eblock loop interleaved with the ewb  
>> loop.
>>
>
> I have no idea what you are talking about.
>
> The point is, with or w/o this patch, you can only reclaim 16 EPC pages  
> in one
> function call (as you have said you are going to remove  
> SGX_NR_TO_SCAN_MAX,
> which is a cipher to both of us).  The only difference I can see is,  
> with this
> patch, you can have multiple calls of "isolate" and then call the  
> "do_reclaim"
> once.
>
> But what's the good of having the "isolate" if the "do_reclaim" can only  
> reclaim
> 16 pages anyway?
>
> Back to my last reply, are you afraid of any LRU has less than 16 pages  
> to
> "isolate", therefore you need to loop LRUs of descendants to get 16?   
> Cause I
> really cannot think of any other reason why you are doing this.
>
>

I think I see your point. By capping the pages reclaimed per cycle to
16, there is not much difference even if those 16 pages are spread
across separate LRUs. The difference only becomes significant if we
ever raise that cap. To preserve the current behavior of the EWB loop
being independent of the number of LRUs looped through in each
reclaiming cycle, regardless of the exact value of the page cap, I
would still think the current approach in the patch is a reasonable
choice.  What do you think?

Thanks
Haitao
  
Kai Huang Dec. 18, 2023, 1:44 a.m. UTC | #7
> > 
> > The point is, with or w/o this patch, you can only reclaim 16 EPC pages  
> > in one
> > function call (as you have said you are going to remove  
> > SGX_NR_TO_SCAN_MAX,
> > which is a cipher to both of us).  The only difference I can see is,  
> > with this
> > patch, you can have multiple calls of "isolate" and then call the  
> > "do_reclaim"
> > once.
> > 
> > But what's the good of having the "isolate" if the "do_reclaim" can only  
> > reclaim
> > 16 pages anyway?
> > 
> > Back to my last reply, are you afraid of any LRU has less than 16 pages  
> > to
> > "isolate", therefore you need to loop LRUs of descendants to get 16?   
> > Cause I
> > really cannot think of any other reason why you are doing this.
> > 
> > 
> 
> I think I see your point. By capping pages reclaimed per cycle to 16,  
> there is not much difference even if those 16 pages are spread in separate  
> LRUs . The difference is only significant when we ever raise that cap. To  
> preserve the current behavior of ewb loops independent on number of LRUs  
> to loop through for each reclaiming cycle, regardless of the exact value  
> of the page cap, I would still think current approach in the patch is  
> reasonable choice.  What do you think?

To me, I won't bother to do that.  Having fewer than 16 pages in one LRU
is *extremely rare* and should never happen in practice.  It's pointless
to make such a code adjustment at this stage.

Let's focus on enabling functionality first.  When you have some real
performance issue that is related to this, we can come back then.

Btw, I think you need to step back even further.  IIUC the whole multiple LRU
thing isn't mandatory in this initial support.

Please (again) take a look at the comments from Dave and Michal:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
  
Mikko Ylinen Dec. 18, 2023, 5:32 p.m. UTC | #8
On Mon, Dec 18, 2023 at 01:44:56AM +0000, Huang, Kai wrote:
> 
> Let's focus on enabling functionality first.  When you have some real
> performance issue that is related to this, we can come back then.
> 
> Btw, I think you need to step back even further.  IIUC the whole multiple LRU
> thing isn't mandatory in this initial support.
> 
> Please (again) take a look at the comments from Dave and Michal:
> 
> https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

I don't think setting a hard limit without any reclaiming is preferred.

I'd rather see this similar to what the "sgx_epc.high" was in the RFC
patchset: misc.max for sgx_epc becomes the max value for EPC usage but
enclaves larger than the limit would still run OK. Per-cgroup reclaiming
allows additional controls via memory.high/max in the same cgroup.

If this reclaim flexibility was not there, the sgx_epc limit would always
have to be set based on some "peak" EPC consumption which may not even
be known at the time the limit is set.

From a container runtime perspective (which is what I'm working on for Kubernetes),
the current proposal seems best to me: a container is guaranteed at most
the amount of EPC set as the limit and no other container gets to use it.
Also, each container gets charged for reclaiming independently if a low
max value is used (which might be desirable to get more containers to run on the
same node/system). In this model, the sum of containers' max values would be
the capacity.
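
For illustration, a per-container limit could then be set via the misc
controller interface, e.g. (hypothetical paths, and assuming "sgx_epc"
as the resource name proposed in this series):

	# cap the container's cgroup at 64 MiB of EPC
	echo "sgx_epc 67108864" > /sys/fs/cgroup/<container>/misc.max

	# check how much EPC the cgroup currently uses
	grep sgx_epc /sys/fs/cgroup/<container>/misc.current

The sum of these per-container max values across a node would then add
up to (or, if over-committing, beyond) the EPC capacity.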

-- Mikko
  
Haitao Huang Dec. 18, 2023, 9:24 p.m. UTC | #9
On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote:

>
>> >
>> > The point is, with or w/o this patch, you can only reclaim 16 EPC  
>> pages
>> > in one
>> > function call (as you have said you are going to remove
>> > SGX_NR_TO_SCAN_MAX,
>> > which is a cipher to both of us).  The only difference I can see is,
>> > with this
>> > patch, you can have multiple calls of "isolate" and then call the
>> > "do_reclaim"
>> > once.
>> >
>> > But what's the good of having the "isolate" if the "do_reclaim" can  
>> only
>> > reclaim
>> > 16 pages anyway?
>> >
>> > Back to my last reply, are you afraid of any LRU has less than 16  
>> pages
>> > to
>> > "isolate", therefore you need to loop LRUs of descendants to get 16?
>> > Cause I
>> > really cannot think of any other reason why you are doing this.
>> >
>> >
>>
>> I think I see your point. By capping pages reclaimed per cycle to 16,
>> there is not much difference even if those 16 pages are spread in  
>> separate
>> LRUs . The difference is only significant when we ever raise that cap.  
>> To
>> preserve the current behavior of ewb loops independent on number of LRUs
>> to loop through for each reclaiming cycle, regardless of the exact value
>> of the page cap, I would still think current approach in the patch is
>> reasonable choice.  What do you think?
>
> To me I won't bother to do that.  Having less than 16 pages in one LRU is
> *extremely rare* that should never happen in practice.  It's pointless  
> to make
> such code adjustment at this stage.
>
> Let's focus on enabling functionality first.  When you have some real
> performance issue that is related to this, we can come back then.
>
> Btw, I think you need to step back even further.  IIUC the whole  
> multiple LRU
> thing isn't mandatory in this initial support.
>
> Please (again) take a look at the comments from Dave and Michal:
>
> https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

Thanks for raising this. Actually, my understanding is that the above
discussion was mainly about not doing reclaiming by killing enclaves,
i.e., I assumed "reclaiming" within that context only referred to that
particular kind.

As Mikko pointed out, without reclaiming per cgroup, the max limit of
each cgroup needs to accommodate the peak usage of the enclaves within
that cgroup. That may be inconvenient for practical usage, and limits
could be forced to be set larger than necessary to run the enclaves
performantly. For example, we can observe the following undesired
consequences compared to a system with the current kernel loaded with
enclaves whose total peak usage is greater than the EPC capacity.

1) If a user wants to load the exact same enclaves but in separate
cgroups, then the sum of the cgroup limits must be higher than the
capacity, and the system will end up doing the same old global
reclaiming it is currently doing. Cgroups are not useful at all for
isolating EPC consumption.

2) To isolate the impact of each cgroup's usage on other cgroups and
yet still be able to load each enclave, the user basically has to
carefully plan to ensure the sum of the cgroup max limits, i.e., the
sum of the peak usage of the enclaves, does not reach over the
capacity. That means no over-committing allowed, and the same system
may not be able to load as many enclaves as with the current kernel.

@Dave and @Michal, Your thoughts? Or could you confirm we should not do  
reclaim per cgroup at
all?

If confirmed as desired, then this series can stop at patch 4.

Thanks
Haitao
  
Dave Hansen Jan. 3, 2024, 4:37 p.m. UTC | #10
On 12/18/23 13:24, Haitao Huang wrote:
> @Dave and @Michal, Your thoughts? Or could you confirm we should not
> do reclaim per cgroup at all?
What's the benefit of doing reclaim per cgroup?  Is that worth the extra
complexity?

The key question here is whether we want the SGX VM to be complex and
more like the real VM or simple when a cgroup hits its limit.  Right?

If stopping at patch 5 and having less code is even remotely an option,
why not do _that_?
  
Michal Koutný Jan. 4, 2024, 12:38 p.m. UTC | #11
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> Thanks for raising this. Actually my understanding the above discussion was
> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
> 
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe following undesired consequences comparing a system with current
> kernel loaded with enclaves whose total peak usage is greater than the EPC
> capacity.
> 
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumptions.

That is the use of limits to prevent a runaway cgroup smothering the
system. Overcommitted values in such a config are fine because the more
simultaneous runaways, the less likely.
The peak consumption is on the fair expense of others (some efficiency)
and the limit contains the runaway (hence the isolation).

> 2) To isolate impact of usage of each cgroup on other cgroups and yet still
> being able to load each enclave, the user basically has to carefully plan to
> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-commiting
> allowed and the same system may not be able to load as many enclaves as with
> current kernel.

Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation but it naturally goes against
efficient utilization. The way other controllers approach this trade-off
is with weights (cpu, io) or protections (memory). I'm afraid misc
controller is not ready for this.

My opinion is to start with the simple limits (first patches) and think
of prioritization/guarantee mechanism based on real cases.

HTH,
Michal
  
Haitao Huang Jan. 4, 2024, 7:11 p.m. UTC | #12
Hi Dave,

On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 12/18/23 13:24, Haitao Huang wrote:
>> @Dave and @Michal, Your thoughts? Or could you confirm we should not
>> do reclaim per cgroup at all?
> What's the benefit of doing reclaim per cgroup?  Is that worth the extra
> complexity?
>

Without reclaiming per cgroup, we have to always set the limit to the
enclave's peak usage. This may not be efficient utilization, as in many
cases each enclave can perform fine with an EPC limit set below its
peak. Basically each group cannot give up some pages for the greater
good without dying :-)

Also, with enclaves enabled with EDMM, the peak usage is not static, so
it is hard to determine upfront. Hence it might be an
operational/deployment inconvenience.

In case of over-committing (sum of limits > total capacity), one cgroup
at peak usage may require swapping pages out in a different cgroup if
the system is overloaded at that time.

> The key question here is whether we want the SGX VM to be complex and
> more like the real VM or simple when a cgroup hits its limit.  Right?
>

Although it's fair to say the majority of the complexity of this series
is in the support for reclaiming per cgroup, I think it's manageable
and much less than the real VM after we removed the enclave-killing
parts: the only extra effort is to track pages in separate lists and
reclaim them separately, as opposed to tracking them on one global list
and reclaiming them together. The main reclaiming loop code is still
pretty much the same as before.


> If stopping at patch 5 and having less code is even remotely an option,
> why not do _that_?
>
I hope I described the limitations clearly enough above.
If those are OK with users and also make this acceptable to merge
quickly, I'm happy to do that :-)

Thanks
Haitao
  
Jarkko Sakkinen Jan. 4, 2024, 7:19 p.m. UTC | #13
On Thu Jan 4, 2024 at 9:11 PM EET, Haitao Huang wrote:
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit.  Right?
> >
>
> Although it's fair to say the majority of complexity of this series is in  
> support for reclaiming per cgroup, I think it's manageable and much less  
> than real VM after we removed the enclave killing parts: the only extra  
> effort is to track pages in separate list and reclaim them in separately  
> as opposed to track in on global list and reclaim together. The main  
> reclaiming loop code is still pretty much the same as before.

I'm not seeing any unmanageable complexity on SGX side, and also
cgroups specific changes are somewhat clean to me at least...

BR, Jarkko
  
Haitao Huang Jan. 4, 2024, 7:20 p.m. UTC | #14
Hi Michal,

On Thu, 04 Jan 2024 06:38:41 -0600, Michal Koutný <mkoutny@suse.com> wrote:

> Hello.
>
> On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang  
> <haitao.huang@linux.intel.com> wrote:
>> Thanks for raising this. Actually my understanding the above discussion  
>> was
>> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
>> "reclaiming" within that context only meant for that particular kind.
>>
>> As Mikko pointed out, without reclaiming per-cgroup, the max limit of  
>> each
>> cgroup needs to accommodate the peak usage of enclaves within that  
>> cgroup.
>> That may be inconvenient for practical usage and limits could be forced  
>> to
>> be set larger than necessary to run enclaves performantly. For example,  
>> we
>> can observe following undesired consequences comparing a system with  
>> current
>> kernel loaded with enclaves whose total peak usage is greater than the  
>> EPC
>> capacity.
>>
>> 1) If a user wants to load the same exact enclaves but in separate  
>> cgroups,
>> then the sum of cgroup limits must be higher than the capacity and the
>> system will end up doing the same old global reclaiming as it is  
>> currently
>> doing. Cgroup is not useful at all for isolating EPC consumptions.
>
> That is the use of limits to prevent a runaway cgroup smothering the
> system. Overcommited values in such a config are fine because the more
> simultaneous runaways, the less likely.
> The peak consumption is on the fair expense of others (some efficiency)
> and the limit contains the runaway (hence the isolation).
>

This makes sense to me in theory. Mikko, Chris Y/Bo Z, your thoughts on  
whether this is good enough for your intended usages?

>> 2) To isolate impact of usage of each cgroup on other cgroups and yet  
>> still
>> being able to load each enclave, the user basically has to carefully  
>> plan to
>> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
>> enclaves, is not reaching over the capacity. That means no  
>> over-commiting
>> allowed and the same system may not be able to load as many enclaves as  
>> with
>> current kernel.
>
> Another "config layout" of limits is to achieve partitioning (sum ==
> capacity). That is perfect isolation but it naturally goes against
> efficient utilization. The way other controllers approach this trade-off
> is with weights (cpu, io) or protections (memory). I'm afraid misc
> controller is not ready for this.
>
> My opinion is to start with the simple limits (first patches) and think
> of prioritization/guarantee mechanism based on real cases.
>

We moved away from using a mem-like custom controller with (low, high,
max) to the misc controller. But if we need to add those down the road,
then the interface needs to be changed. So my concern with this route
would be whether misc would allow any of those extensions. On the other
hand, it might turn out to be less complex to just do the reclamation
per cgroup.

Thanks a lot for your comments and they are really helpful!

Haitao
  
Dave Hansen Jan. 4, 2024, 7:27 p.m. UTC | #15
On 1/4/24 11:11, Haitao Huang wrote:
> If those are OK with users and also make it acceptable for merge
> quickly, I'm happy to do that 🙂

How about we put some actual numbers behind this?  How much complexity
are we talking about here?  What's the diffstat for the utterly
bare-bones implementation and what does it cost on top of that to do the
per-cgroup reclaim?
  
Haitao Huang Jan. 4, 2024, 9:01 p.m. UTC | #16
On Thu, 04 Jan 2024 13:27:07 -0600, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 1/4/24 11:11, Haitao Huang wrote:
>> If those are OK with users and also make it acceptable for merge
>> quickly, I'm happy to do that 🙂
>
> How about we put some actual numbers behind this?  How much complexity
> are we talking about here?  What's the diffstat for the utterly
> bare-bones implementation and what does it cost on top of that to do the
> per-cgroup reclaim?
>

For bare-bone:

  arch/x86/Kconfig                     |  13 ++++++++++++
  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  39 +++++++++++++++++++++++++++++++++++
  arch/x86/kernel/cpu/sgx/main.c       |  41 ++++++++++++++++++++++++++++++++++++
  arch/x86/kernel/cpu/sgx/sgx.h        |   5 +++++
  include/linux/misc_cgroup.h          |  42 +++++++++++++++++++++++++++++++++++++
  kernel/cgroup/misc.c                 |  52 +++++++++++++++++++++++++++++++---------------
  8 files changed, 281 insertions(+), 16 deletions(-)

Additional for per-cgroup reclaim:

  arch/x86/kernel/cpu/sgx/encl.c       |  41 +++++++++--------
  arch/x86/kernel/cpu/sgx/encl.h       |   3 +-
  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  18 ++++--
  arch/x86/kernel/cpu/sgx/main.c       | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------
  arch/x86/kernel/cpu/sgx/sgx.h        |  85 +++++++++++++++++++++++++++++++++--
  6 files changed, 480 insertions(+), 117 deletions(-)


Thanks
Haitao
  
Mikko Ylinen Jan. 5, 2024, 2:43 p.m. UTC | #17
On Thu, Jan 04, 2024 at 01:11:15PM -0600, Haitao Huang wrote:
> Hi Dave,
> 
> On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen <dave.hansen@intel.com>
> wrote:
> 
> > On 12/18/23 13:24, Haitao Huang wrote:
> > > @Dave and @Michal, Your thoughts? Or could you confirm we should not
> > > do reclaim per cgroup at all?
> > What's the benefit of doing reclaim per cgroup?  Is that worth the extra
> > complexity?
> > 
> 
> Without reclaiming per cgroup, then we have to always set the limit to
> enclave's peak usage. This may not be efficient utilization as in many cases
> each enclave can perform fine with EPC limit set less than peak. Basically
> each group can not give up some pages for greater good without dying :-)

+1. This is exactly my thinking too. The per-cgroup reclaiming is
important for the containers use case we are working on. I also think
it makes the limit more meaningful: the per-container pool of EPC pages
to use (which is independent of the enclave size).

> 
> Also with enclaves enabled with EDMM, the peak usage is not static so hard
> to determine upfront. Hence it might be an operation/deployment
> inconvenience.
> 
> In case of over-committing (sum of limits > total capacity), one cgroup at
> peak usage may require swapping pages out in a different cgroup if system is
> overloaded at that time.
> 
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit.  Right?
> > 
> 
> Although it's fair to say the majority of complexity of this series is in
> support for reclaiming per cgroup, I think it's manageable and much less
> than real VM after we removed the enclave killing parts: the only extra
> effort is to track pages in separate list and reclaim them in separately as
> opposed to track in on global list and reclaim together. The main reclaiming
> loop code is still pretty much the same as before.
> 
> 
> > If stopping at patch 5 and having less code is even remotely an option,
> > why not do _that_?
> > 
> I hope I described limitations clear enough above.
> If those are OK with users and also make it acceptable for merge quickly,

You explained the gaps very well already. I don't think the simple
version without per-cgroup reclaiming is enough for the container case.

Mikko
  
Haitao Huang Jan. 12, 2024, 5:07 p.m. UTC | #18
On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote:

>
>> >
>> > The point is, with or w/o this patch, you can only reclaim 16 EPC  
>> pages
>> > in one
>> > function call (as you have said you are going to remove
>> > SGX_NR_TO_SCAN_MAX,
>> > which is a cipher to both of us).  The only difference I can see is,
>> > with this
>> > patch, you can have multiple calls of "isolate" and then call the
>> > "do_reclaim"
>> > once.
>> >
>> > But what's the good of having the "isolate" if the "do_reclaim" can  
>> only
>> > reclaim
>> > 16 pages anyway?
>> >
>> > Back to my last reply, are you afraid of any LRU has less than 16  
>> pages
>> > to
>> > "isolate", therefore you need to loop LRUs of descendants to get 16?
>> > Cause I
>> > really cannot think of any other reason why you are doing this.
>> >
>> >
>>
>> I think I see your point. By capping pages reclaimed per cycle to 16,
>> there is not much difference even if those 16 pages are spread in  
>> separate
>> LRUs . The difference is only significant when we ever raise that cap.  
>> To
>> preserve the current behavior of ewb loops independent on number of LRUs
>> to loop through for each reclaiming cycle, regardless of the exact value
>> of the page cap, I would still think current approach in the patch is
>> reasonable choice.  What do you think?
>
> To me I won't bother to do that.  Having less than 16 pages in one LRU is
> *extremely rare* that should never happen in practice.  It's pointless  
> to make
> such code adjustment at this stage.
>
> Let's focus on enabling functionality first.  When you have some real
> performance issue that is related to this, we can come back then.
>

I have done some rethinking about this and realized it does save quite
a bit of significant work: without breaking the isolation part out of
sgx_reclaim_pages(), I can remove the changes that use a list for
isolated pages, and there is no need to introduce a "state" such as
RECLAIM_IN_PROGRESS. About 1/3 of the changes for per-cgroup
reclamation will be gone.

So I think I'll go this route now. The only downside may be performance
if an enclave spreads its pages across different cgroups, and even that
has minimal impact as we limit reclamation to 16 pages at a time. Let
me know if someone feels strongly that we need to deal with that, or
sees some other potential issues I may have missed.

Thanks

Haitao
  
Jarkko Sakkinen Jan. 13, 2024, 9:04 p.m. UTC | #19
On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote:
> On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote:
>
> >
> >> >
> >> > The point is, with or w/o this patch, you can only reclaim 16 EPC  
> >> pages
> >> > in one
> >> > function call (as you have said you are going to remove
> >> > SGX_NR_TO_SCAN_MAX,
> >> > which is a cipher to both of us).  The only difference I can see is,
> >> > with this
> >> > patch, you can have multiple calls of "isolate" and then call the
> >> > "do_reclaim"
> >> > once.
> >> >
> >> > But what's the good of having the "isolate" if the "do_reclaim" can  
> >> only
> >> > reclaim
> >> > 16 pages anyway?
> >> >
> >> > Back to my last reply, are you afraid of any LRU has less than 16  
> >> pages
> >> > to
> >> > "isolate", therefore you need to loop LRUs of descendants to get 16?
> >> > Cause I
> >> > really cannot think of any other reason why you are doing this.
> >> >
> >> >
> >>
> >> I think I see your point. By capping pages reclaimed per cycle to 16,
> >> there is not much difference even if those 16 pages are spread in  
> >> separate
> >> LRUs . The difference is only significant when we ever raise that cap.  
> >> To
> >> preserve the current behavior of ewb loops independent on number of LRUs
> >> to loop through for each reclaiming cycle, regardless of the exact value
> >> of the page cap, I would still think current approach in the patch is
> >> reasonable choice.  What do you think?
> >
> > To me I won't bother to do that.  Having less than 16 pages in one LRU is
> > *extremely rare* that should never happen in practice.  It's pointless  
> > to make
> > such code adjustment at this stage.
> >
> > Let's focus on enabling functionality first.  When you have some real
> > performance issue that is related to this, we can come back then.
> >
>
> I have done some rethinking about this and realize this does save quite  
> some significant work: without breaking out isolation part from  
> sgx_reclaim_pages(), I can remove the changes to use a list for isolated  
> pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About  
> 1/3 of changes for per-cgroup reclamation will be gone.
>
> So I think I'll go this route now. The only downside may be performance if  
> a enclave spreads its pages in different cgroups and even that is minimum  
> impact as we limit reclamation to 16 pages a time. Let me know if someone  
> feel strongly we need dealt with that and see some other potential issues  
> I may have missed.

We could deal with a possible performance regression later on (if there
is a need). I mean, there should be a workload first that would provide
that sort of stimulus...

> Thanks
>
> Haitao

BR, Jarkko
  
Jarkko Sakkinen Jan. 13, 2024, 9:08 p.m. UTC | #20
On Sat Jan 13, 2024 at 11:04 PM EET, Jarkko Sakkinen wrote:
> On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote:
> > On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> >
> > >
> > >> >
> > >> > The point is, with or w/o this patch, you can only reclaim 16 EPC  
> > >> pages
> > >> > in one
> > >> > function call (as you have said you are going to remove
> > >> > SGX_NR_TO_SCAN_MAX,
> > >> > which is a cipher to both of us).  The only difference I can see is,
> > >> > with this
> > >> > patch, you can have multiple calls of "isolate" and then call the
> > >> > "do_reclaim"
> > >> > once.
> > >> >
> > >> > But what's the good of having the "isolate" if the "do_reclaim" can  
> > >> only
> > >> > reclaim
> > >> > 16 pages anyway?
> > >> >
> > >> > Back to my last reply, are you afraid of any LRU has less than 16  
> > >> pages
> > >> > to
> > >> > "isolate", therefore you need to loop LRUs of descendants to get 16?
> > >> > Cause I
> > >> > really cannot think of any other reason why you are doing this.
> > >> >
> > >> >
> > >>
> > >> I think I see your point. By capping pages reclaimed per cycle to 16,
> > >> there is not much difference even if those 16 pages are spread across
> > >> separate LRUs. The difference only becomes significant if we ever raise
> > >> that cap. To preserve the current behavior of the EWB loops being
> > >> independent of the number of LRUs to loop through in each reclaiming
> > >> cycle, regardless of the exact value of the page cap, I would still
> > >> think the current approach in the patch is a reasonable choice. What do
> > >> you think?
> > >
> > > To me, I won't bother to do that.  Having less than 16 pages in one LRU
> > > is *extremely rare* and should never happen in practice.  It's pointless
> > > to make such a code adjustment at this stage.
> > >
> > > Let's focus on enabling functionality first.  When you have some real
> > > performance issue that is related to this, we can come back then.
> > >
> >
> > I have done some rethinking about this and realized it saves quite a bit
> > of work: without breaking out the isolation part from sgx_reclaim_pages(),
> > I can remove the changes to use a list for isolated pages, and there is no
> > need to introduce a "state" such as RECLAIM_IN_PROGRESS. About 1/3 of the
> > changes for per-cgroup reclamation will be gone.
> >
> > So I think I'll go this route now. The only downside may be performance if
> > an enclave spreads its pages across different cgroups, and even that has
> > minimal impact as we limit reclamation to 16 pages at a time. Let me know
> > if anyone feels strongly that we need to deal with that, or sees other
> > potential issues I may have missed.
>
> We could deal with a possible performance regression later on (if there
> is a need). I mean there should be a workload first that would provide that
> sort of stimulus...

I.e. no reason to deal with an imaginary workload :-) Go ahead and we'll
go through it.

BR, Jarkko
  

Patch

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 33bcba313d40..e8848b493eb7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -281,33 +281,23 @@  static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
  *
- * Batch process a chunk of pages (at the moment 16) in order to degrade amount
- * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
- * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
- * + EWB) but not sufficiently. Reclaiming one page at a time would also be
- * problematic as it would increase the lock contention too much, which would
- * halt forward progress.
+ * Return: remaining pages to scan, i.e., @nr_to_scan minus the number of pages scanned.
  */
-static void sgx_reclaim_pages(void)
+unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
+				   struct list_head *dst)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
-	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	pgoff_t page_index;
-	LIST_HEAD(iso);
-	int ret;
-	int i;
+	struct sgx_epc_page *epc_page;
 
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
+	spin_lock(&lru->lock);
+	for (; nr_to_scan > 0; --nr_to_scan) {
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
 
@@ -316,23 +306,53 @@  static void sgx_reclaim_pages(void)
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			list_move_tail(&epc_page->list, &iso);
+			list_move_tail(&epc_page->list, dst);
 		} else
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			sgx_epc_page_reset_state(epc_page);
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
+
+	return nr_to_scan;
+}
+
+/**
+ * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
+ * @iso:		List of isolated pages for reclamation
+ *
+ * Take a list of EPC pages and reclaim them to the enclave's private shmem files.  Do not
+ * reclaim the pages that have been accessed since the last scan, and move each of those pages
+ * to the tail of its tracking LRU list.
+ *
+ * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
+ * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
+ * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
+ * the lock contention too much, which would halt forward progress.
+ *
+ * Extra pages in the list beyond the SGX_NR_TO_SCAN_MAX limit are skipped and returned back to
+ * their tracking LRU lists.
+ *
+ * Return: number of pages successfully reclaimed.
+ */
+unsigned int sgx_do_epc_reclamation(struct list_head *iso)
+{
+	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
+	struct sgx_epc_page *epc_page, *tmp;
+	struct sgx_encl_page *encl_page;
+	pgoff_t page_index;
+	size_t ret, i;
 
-	if (list_empty(&iso))
-		return;
+	if (list_empty(iso))
+		return 0;
 
 	i = 0;
-	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
+	list_for_each_entry_safe(epc_page, tmp, iso, list) {
 		encl_page = epc_page->owner;
 
-		if (!sgx_reclaimer_age(epc_page))
+		if (i == SGX_NR_TO_SCAN_MAX || !sgx_reclaimer_age(epc_page))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -358,11 +378,11 @@  static void sgx_reclaim_pages(void)
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 	}
 
-	list_for_each_entry(epc_page, &iso, list)
+	list_for_each_entry(epc_page, iso, list)
 		sgx_reclaimer_block(epc_page);
 
 	i = 0;
-	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
+	list_for_each_entry_safe(epc_page, tmp, iso, list) {
 		encl_page = epc_page->owner;
 		sgx_reclaimer_write(epc_page, &backing[i++]);
 
@@ -371,6 +391,17 @@  static void sgx_reclaim_pages(void)
 
 		sgx_free_epc_page(epc_page);
 	}
+
+	return i;
+}
+
+static void sgx_reclaim_epc_pages_global(void)
+{
+	LIST_HEAD(iso);
+
+	sgx_isolate_epc_pages(&sgx_global_lru, SGX_NR_TO_SCAN, &iso);
+
+	sgx_do_epc_reclamation(&iso);
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,7 +418,7 @@  static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages_global();
 }
 
 static int ksgxd(void *p)
@@ -410,7 +441,7 @@  static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages();
+			sgx_reclaim_epc_pages_global();
 
 		cond_resched();
 	}
@@ -587,7 +618,7 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 		 * Need to do a global reclamation if cgroup was not full but free
 		 * physical pages run out, causing __sgx_alloc_epc_page() to fail.
 		 */
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages_global();
 		cond_resched();
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index dd7ab65b5b27..6a40f70ed96f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -19,6 +19,11 @@ 
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX		32U
 #define SGX_NR_TO_SCAN			16
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
@@ -162,6 +167,9 @@  void sgx_reclaim_direct(void);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+unsigned int sgx_do_epc_reclamation(struct list_head *iso);
+unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
+				   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
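
For context, one way the later per-cgroup reclaim code is expected to combine
the two new helpers is sketched below. This is illustration only: struct
sgx_epc_cgroup, its lru member and for_each_descendant_epc_cgroup() are
assumed names, not part of this patch; only the sgx_isolate_epc_pages() and
sgx_do_epc_reclamation() signatures come from the diff above.

/*
 * Illustration only: a per-cgroup reclaimer chaining the two helpers.
 * struct sgx_epc_cgroup, epc_cg->lru and for_each_descendant_epc_cgroup()
 * are hypothetical names, not defined by this patch.
 */
static unsigned int sgx_epc_cgroup_reclaim_pages(struct sgx_epc_cgroup *root)
{
	unsigned int nr_to_scan = SGX_NR_TO_SCAN;
	struct sgx_epc_cgroup *epc_cg;
	LIST_HEAD(iso);

	/*
	 * Walk the subtree, isolating pages until the scan budget is used
	 * up.  sgx_isolate_epc_pages() returns the remaining budget, which
	 * tells the caller whether more LRUs still need to be scanned.
	 */
	for_each_descendant_epc_cgroup(epc_cg, root) {
		nr_to_scan = sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, &iso);
		if (!nr_to_scan)
			break;
	}

	/* A single EWB pass over everything isolated above. */
	return sgx_do_epc_reclamation(&iso);
}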