[v5,09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages

Message ID 20230923030657.16148-10-haitao.huang@linux.intel.com
State New
Headers
Series Add Cgroup support for SGX EPC memory |

Commit Message

Haitao Huang Sept. 23, 2023, 3:06 a.m. UTC
  From: Sean Christopherson <sean.j.christopherson@intel.com>

In a later patch, when a cgroup has exceeded the max capacity for EPC
pages, it may need to identify and OOM kill a less active enclave to
make room for other enclaves within the same group. Such a victim
enclave would have no active pages other than the unreclaimable Version
Array (VA) and SECS pages.  Therefore, the cgroup needs examine its
unreclaimable page list, and finding an enclave given a SECS page or a
VA page. This will require a backpointer from a page to an enclave,
which is not available for VA pages.

Because struct sgx_epc_page instances of VA pages are not owned by an
sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
which will store this value in the owner field of the struct
sgx_epc_page.  In a later patch, VA pages will be placed in an
unreclaimable queue that can be examined by the cgroup to select the OOM
killed enclave.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Fixed some comments in code (Jarkko)

V4:
- Changes needed for patch reordering
- Revised commit messages (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  |  5 +++--
 arch/x86/kernel/cpu/sgx/encl.h  |  2 +-
 arch/x86/kernel/cpu/sgx/ioctl.c |  2 +-
 arch/x86/kernel/cpu/sgx/main.c  | 20 ++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  7 ++++++-
 5 files changed, 21 insertions(+), 15 deletions(-)
  

Comments

Kai Huang Sept. 27, 2023, 11:14 a.m. UTC | #1
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> In a later patch, when a cgroup has exceeded the max capacity for EPC
> pages, it may need to identify and OOM kill a less active enclave to
> make room for other enclaves within the same group. Such a victim
> enclave would have no active pages other than the unreclaimable Version
> Array (VA) and SECS pages.  Therefore, the cgroup needs examine its
> unreclaimable page list, and finding an enclave given a SECS page or a
> VA page. This will require a backpointer from a page to an enclave,
> which is not available for VA pages.
> 
> Because struct sgx_epc_page instances of VA pages are not owned by an
> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
> which will store this value in the owner field of the struct
> sgx_epc_page.  In a later patch, VA pages will be placed in an
> unreclaimable queue that can be examined by the cgroup to select the OOM
> killed enclave.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> 

[...]

> @@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
>  		if (!IS_ERR(page)) {
> -			page->owner = owner;
> +			page->encl_page = owner;

Looks using 'encl_page' is arbitrary.

Also actually for virtual EPC page the owner is set to the 'sgx_vepc' instance.

>  			break;
>  		}
>  
> @@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  
>  	spin_lock(&node->lock);
>  
> -	page->owner = NULL;
> +	page->encl_page = NULL;

Ditto.

>  	if (page->poison)
>  		list_add(&page->list, &node->sgx_poison_page_list);
>  	else
> @@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  	for (i = 0; i < nr_pages; i++) {
>  		section->pages[i].section = index;
>  		section->pages[i].flags = 0;
> -		section->pages[i].owner = NULL;
> +		section->pages[i].encl_page = NULL;
>  		section->pages[i].poison = 0;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 764cec23f4e5..5110dd433b80 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -68,7 +68,12 @@ struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
>  	u16 poison;
> -	struct sgx_encl_page *owner;
> +
> +	/* Possible owner types */
> +	union {
> +		struct sgx_encl_page *encl_page;
> +		struct sgx_encl *encl;
> +	};

Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance it
belongs to.

Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page, perhaps we
should do below?

	union {
		struct sgx_encl_page *encl_page;
		struct sgx_encl *encl;
		struct sgx_vepc *vepc;
		void *owner;
	};

And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
  
Kai Huang Sept. 27, 2023, 11:35 a.m. UTC | #2
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> In a later patch, when a cgroup has exceeded the max capacity for EPC
> pages, it may need to identify and OOM kill a less active enclave to
> make room for other enclaves within the same group. Such a victim
> enclave would have no active pages other than the unreclaimable Version
> Array (VA) and SECS pages.  
> 

What does "no active pages" mean?

A "less active enclave" doesn't necessarily mean it has "no active pages"?


> Therefore, the cgroup needs examine its
			^
			needs to

> unreclaimable page list, and finding an enclave given a SECS page or a
				^
				find

> VA page. This will require a backpointer from a page to an enclave,
> which is not available for VA pages.
> 
> Because struct sgx_epc_page instances of VA pages are not owned by an
> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
> which will store this value in the owner field of the struct
> sgx_epc_page.  
> 

IMHO this paragraph is hard to understand and can be more concise:

One VA page can be shared by multiple enclave pages thus cannot be associated
with any 'struct sgx_encl_page' instance.  Set the owner of VA page to the
enclave instead.


> In a later patch, VA pages will be placed in an
> unreclaimable queue that can be examined by the cgroup to select the OOM
> killed enclave.

The code to "place the VA page to unreclaimable queue" has been done in earlier
patch ("x86/sgx: Introduce EPC page states").  Just the unreclaimable list isn't
introduced yet.  I think you should just introduce it first then you can get rid
of those "in a later patch" staff.

And nit: please use "unreclaimable list" consistently (not queue).


Btw, probably a dumb question:

Theoretically if you only need to find a victim enclave you don't need to put VA
pages to the unreclaimable list, because those VA pages will be freed anyway
when enclave is killed.  So keeping VA pages in the list is for accounting all
the pages that the cgroup is having?
  
Haitao Huang Sept. 27, 2023, 3:35 p.m. UTC | #3
Hi Kai,

On Wed, 27 Sep 2023 06:14:20 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> In a later patch, when a cgroup has exceeded the max capacity for EPC
>> pages, it may need to identify and OOM kill a less active enclave to
>> make room for other enclaves within the same group. Such a victim
>> enclave would have no active pages other than the unreclaimable Version
>> Array (VA) and SECS pages.  Therefore, the cgroup needs examine its
>> unreclaimable page list, and finding an enclave given a SECS page or a
>> VA page. This will require a backpointer from a page to an enclave,
>> which is not available for VA pages.
>>
>> Because struct sgx_epc_page instances of VA pages are not owned by an
>> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
>> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
>> which will store this value in the owner field of the struct
>> sgx_epc_page.  In a later patch, VA pages will be placed in an
>> unreclaimable queue that can be examined by the cgroup to select the OOM
>> killed enclave.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>>
>
> [...]
>
>> @@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void  
>> *owner, bool reclaim)
>>  	for ( ; ; ) {
>>  		page = __sgx_alloc_epc_page();
>>  		if (!IS_ERR(page)) {
>> -			page->owner = owner;
>> +			page->encl_page = owner;
>
> Looks using 'encl_page' is arbitrary.
>
> Also actually for virtual EPC page the owner is set to the 'sgx_vepc'  
> instance.
>
>>  			break;
>>  		}
>>
>> @@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>>
>>  	spin_lock(&node->lock);
>>
>> -	page->owner = NULL;
>> +	page->encl_page = NULL;
>
> Ditto.
>
>>  	if (page->poison)
>>  		list_add(&page->list, &node->sgx_poison_page_list);
>>  	else
>> @@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64  
>> phys_addr, u64 size,
>>  	for (i = 0; i < nr_pages; i++) {
>>  		section->pages[i].section = index;
>>  		section->pages[i].flags = 0;
>> -		section->pages[i].owner = NULL;
>> +		section->pages[i].encl_page = NULL;
>>  		section->pages[i].poison = 0;
>>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>>  	}
>> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h  
>> b/arch/x86/kernel/cpu/sgx/sgx.h
>> index 764cec23f4e5..5110dd433b80 100644
>> --- a/arch/x86/kernel/cpu/sgx/sgx.h
>> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
>> @@ -68,7 +68,12 @@ struct sgx_epc_page {
>>  	unsigned int section;
>>  	u16 flags;
>>  	u16 poison;
>> -	struct sgx_encl_page *owner;
>> +
>> +	/* Possible owner types */
>> +	union {
>> +		struct sgx_encl_page *encl_page;
>> +		struct sgx_encl *encl;
>> +	};
>
> Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance it
> belongs to.
>
> Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page,  
> perhaps we
> should do below?
>
> 	union {
> 		struct sgx_encl_page *encl_page;
> 		struct sgx_encl *encl;
> 		struct sgx_vepc *vepc;
> 		void *owner;
> 	};
>
> And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
>

As I mentioned in cover letter and change log in 11/18, this series does  
not track virtual EPC.
We can add vepc field into the union in future if such tracking is needed.  
Don't think "void *owner" is needed though.

Thanks
Haitao
  
Kai Huang Sept. 27, 2023, 9:21 p.m. UTC | #4
On Wed, 2023-09-27 at 10:35 -0500, Haitao Huang wrote:
> > > +
> > > +	/* Possible owner types */
> > > +	union {
> > > +		struct sgx_encl_page *encl_page;
> > > +		struct sgx_encl *encl;
> > > +	};
> > 
> > Sadly for virtual EPC page the owner is set to the 'sgx_vepc' instance it
> > belongs to.
> > 
> > Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page,  
> > perhaps we
> > should do below?
> > 
> >  	union {
> >  		struct sgx_encl_page *encl_page;
> >  		struct sgx_encl *encl;
> >  		struct sgx_vepc *vepc;
> >  		void *owner;
> >  	};
> > 
> > And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
> > 
> 
> As I mentioned in cover letter and change log in 11/18, this series does  
> not track virtual EPC.

It's not about how does the cover letter says.  We cannot ignore the fact that
currently virtual EPC uses owner too.

But given the virtual EPC code currently doesn't use the owner, I can live with
not having the 'vepc' member in the union now.

> We can add vepc field into the union in future if such tracking is needed.  
> Don't think "void *owner" is needed though.

As mentioned, using 'encl_page' arbitrarily in sgx_alloc_epc_page() doesn't look
nice.  Do you have example in the current kernel code to prove it is acceptable?
  
Haitao Huang Sept. 29, 2023, 3:06 p.m. UTC | #5
On Wed, 27 Sep 2023 16:21:19 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Wed, 2023-09-27 at 10:35 -0500, Haitao Huang wrote:
>> > > +
>> > > +	/* Possible owner types */
>> > > +	union {
>> > > +		struct sgx_encl_page *encl_page;
>> > > +		struct sgx_encl *encl;
>> > > +	};
>> >
>> > Sadly for virtual EPC page the owner is set to the 'sgx_vepc'  
>> instance it
>> > belongs to.
>> >
>> > Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page,>  
>> perhaps we
>> > should do below?
>> >
>> >  	union {
>> >  		struct sgx_encl_page *encl_page;
>> >  		struct sgx_encl *encl;
>> >  		struct sgx_vepc *vepc;
>> >  		void *owner;
>> >  	};
>> >
>> > And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
>> >
>>
>> As I mentioned in cover letter and change log in 11/18, this series does 
>> not track virtual EPC.
>
> It's not about how does the cover letter says.  We cannot ignore the  
> fact that
> currently virtual EPC uses owner too.
>
> But given the virtual EPC code currently doesn't use the owner, I can  
> live with
> not having the 'vepc' member in the union now.

Ah, I forgot even though we don't use the owner field assigned by vepc, it  
is still passed into sgx_alloc/free_epc_page().

Will add back "void* owner" and use it for assignment inside  
sgx_alloc/free_epc_page().

Thanks

Haitao
  
Kai Huang Oct. 2, 2023, 11:05 a.m. UTC | #6
On Fri, 2023-09-29 at 10:06 -0500, Haitao Huang wrote:
> On Wed, 27 Sep 2023 16:21:19 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Wed, 2023-09-27 at 10:35 -0500, Haitao Huang wrote:
> > > > > +
> > > > > +	/* Possible owner types */
> > > > > +	union {
> > > > > +		struct sgx_encl_page *encl_page;
> > > > > +		struct sgx_encl *encl;
> > > > > +	};
> > > > 
> > > > Sadly for virtual EPC page the owner is set to the 'sgx_vepc'  
> > > instance it
> > > > belongs to.
> > > > 
> > > > Given how sgx_{alloc|free}_epc_page() arbitrarily uses encl_page,>  
> > > perhaps we
> > > > should do below?
> > > > 
> > > >  	union {
> > > >  		struct sgx_encl_page *encl_page;
> > > >  		struct sgx_encl *encl;
> > > >  		struct sgx_vepc *vepc;
> > > >  		void *owner;
> > > >  	};
> > > > 
> > > > And in sgx_{alloc|free}_epc_page() we can use 'owner' instead.
> > > > 
> > > 
> > > As I mentioned in cover letter and change log in 11/18, this series does 
> > > not track virtual EPC.
> > 
> > It's not about how does the cover letter says.  We cannot ignore the  
> > fact that
> > currently virtual EPC uses owner too.
> > 
> > But given the virtual EPC code currently doesn't use the owner, I can  
> > live with
> > not having the 'vepc' member in the union now.
> 
> Ah, I forgot even though we don't use the owner field assigned by vepc, it  
> is still passed into sgx_alloc/free_epc_page().
> 
> Will add back "void* owner" and use it for assignment inside  
> sgx_alloc/free_epc_page().
> 
> 

And also sgx_setup_epc_section().
  
Haitao Huang Oct. 3, 2023, 6:45 a.m. UTC | #7
On Wed, 27 Sep 2023 06:35:57 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> In a later patch, when a cgroup has exceeded the max capacity for EPC
>> pages, it may need to identify and OOM kill a less active enclave to
>> make room for other enclaves within the same group. Such a victim
>> enclave would have no active pages other than the unreclaimable Version
>> Array (VA) and SECS pages.
>
> What does "no active pages" mean?
>

EPC pages in use.

> A "less active enclave" doesn't necessarily mean it has "no active  
> pages"?
>

I'll rephrase the above sentences

>
>> Therefore, the cgroup needs examine its
> 			^
> 			needs to
>
>> unreclaimable page list, and finding an enclave given a SECS page or a
> 				^
> 				find
>
>> VA page. This will require a backpointer from a page to an enclave,
>> which is not available for VA pages.
>>
>> Because struct sgx_epc_page instances of VA pages are not owned by an
>> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
>> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
>> which will store this value in the owner field of the struct
>> sgx_epc_page.
>
> IMHO this paragraph is hard to understand and can be more concise:
>
> One VA page can be shared by multiple enclave pages thus cannot be  
> associated
> with any 'struct sgx_encl_page' instance.  Set the owner of VA page to  
> the
> enclave instead.
>
>

Agreed

>> In a later patch, VA pages will be placed in an
>> unreclaimable queue that can be examined by the cgroup to select the OOM
>> killed enclave.
>
> The code to "place the VA page to unreclaimable queue" has been done in  
> earlier
> patch ("x86/sgx: Introduce EPC page states").  Just the unreclaimable  
> list isn't
> introduced yet.  I think you should just introduce it first then you can  
> get rid
> of those "in a later patch" staff.
>

I hope I was able to clarify to you in other threads that VA pages are not  
placed in any queue/list until [PATCH v5 11/18] x86/sgx: store  
unreclaimable pages in LRU lists.

This patch is the first one to implement tracking for unreclaimable pages.  
I'll add that as a transition hint.

> And nit: please use "unreclaimable list" consistently (not queue).
>

Yes will do

>
> Btw, probably a dumb question:
>
> Theoretically if you only need to find a victim enclave you don't need  
> to put VA
> pages to the unreclaimable list, because those VA pages will be freed  
> anyway
> when enclave is killed.  So keeping VA pages in the list is for  
> accounting all
> the pages that the cgroup is having?

Yes basically tracking them in cgroups as they are allocated.

VAs and SECS may also come and go as swapping/unswapping happens. But if a  
cgroup is OOM, and all reclaimables are gone (swapped out), it'd have to  
reclaim VAs/SECs in the same cgroup starting from the front of the LRU  
list. To reclaim a VA/SECS, it identifies the enclave from the owner of  
the VA/SECS page and kills it, as killing enclave is the only way to  
reclaim VA/SECS pages.
  
Kai Huang Oct. 3, 2023, 8:07 p.m. UTC | #8
On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > 
> > Btw, probably a dumb question:
> > 
> > Theoretically if you only need to find a victim enclave you don't need  
> > to put VA
> > pages to the unreclaimable list, because those VA pages will be freed  
> > anyway
> > when enclave is killed.  So keeping VA pages in the list is for  
> > accounting all
> > the pages that the cgroup is having?
> 
> Yes basically tracking them in cgroups as they are allocated.
> 
> VAs and SECS may also come and go as swapping/unswapping happens. But if a  
> cgroup is OOM, and all reclaimables are gone (swapped out), it'd have to  
> reclaim VAs/SECs in the same cgroup starting from the front of the LRU  
> list. To reclaim a VA/SECS, it identifies the enclave from the owner of  
> the VA/SECS page and kills it, as killing enclave is the only way to  
> reclaim VA/SECS pages.

To kill enclave you just need to track SECS in  the unreclaimable list.  

Only when you want to account the total EPC pages via some list you _probably_
need to track VA as well.  But I am not quite sure about this either.
  
Haitao Huang Oct. 4, 2023, 3:03 p.m. UTC | #9
On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
>> >
>> > Btw, probably a dumb question:
>> >
>> > Theoretically if you only need to find a victim enclave you don't need 
>> > to put VA
>> > pages to the unreclaimable list, because those VA pages will be freed 
>> > anyway
>> > when enclave is killed.  So keeping VA pages in the list is for>  
>> accounting all
>> > the pages that the cgroup is having?
>>
>> Yes basically tracking them in cgroups as they are allocated.
>>
>> VAs and SECS may also come and go as swapping/unswapping happens. But  
>> if acgroup is OOM, and all reclaimables are gone (swapped out), it'd  
>> have toreclaim VAs/SECs in the same cgroup starting from the front of  
>> the LRUlist. To reclaim a VA/SECS, it identifies the enclave from the  
>> owner ofthe VA/SECS page and kills it, as killing enclave is the only  
>> way toreclaim VA/SECS pages.
>
> To kill enclave you just need to track SECS in  the unreclaimable list.  
> Only when you want to account the total EPC pages via some list you  
> _probably_
> need to track VA as well.  But I am not quite sure about this either.

There is a case where even SECS is paged out for an enclave with all  
reclaimables out. So cgroup needs to track each page used by an enclave  
and kill enclave when cgroup needs to lower usage by evicting an VA or  
SECS page.
There were some discussion on paging out VAs without killing enclaves but  
it'd be complicated and not implemented yet.

BTW, I need clarify tracking pages which is done by LRUs vs usage  
accounting which is done by charge/uncharge to misc. To me tracking is for  
reclaiming not accounting. Also vEPCs not tracked at all but they are  
accounted for.

Haitao
  
Kai Huang Oct. 4, 2023, 9:13 p.m. UTC | #10
On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
> On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > > > 
> > > > Btw, probably a dumb question:
> > > > 
> > > > Theoretically if you only need to find a victim enclave you don't need 
> > > > to put VA
> > > > pages to the unreclaimable list, because those VA pages will be freed 
> > > > anyway
> > > > when enclave is killed.  So keeping VA pages in the list is for>  
> > > accounting all
> > > > the pages that the cgroup is having?
> > > 
> > > Yes basically tracking them in cgroups as they are allocated.
> > > 
> > > VAs and SECS may also come and go as swapping/unswapping happens. But  
> > > if acgroup is OOM, and all reclaimables are gone (swapped out), it'd  
> > > have toreclaim VAs/SECs in the same cgroup starting from the front of  
> > > the LRUlist. To reclaim a VA/SECS, it identifies the enclave from the  
> > > owner ofthe VA/SECS page and kills it, as killing enclave is the only  
> > > way toreclaim VA/SECS pages.
> > 
> > To kill enclave you just need to track SECS in  the unreclaimable list.  
> > Only when you want to account the total EPC pages via some list you  
> > _probably_
> > need to track VA as well.  But I am not quite sure about this either.
> 
> There is a case where even SECS is paged out for an enclave with all  
> reclaimables out. 
> 

Yes.  But this essentially means these enclaves are not active, thus shouldn't
be the victim of OOM?

> So cgroup needs to track each page used by an enclave  
> and kill enclave when cgroup needs to lower usage by evicting an VA or  
> SECS page.

Let's discuss more on tracking SECS on unreclaimable list only.

Could we assume that when the OOM wants to pick up a victim to serve the new
enclave, there must be at least another one *active* enclave which still has the
SECS page in EPC?

If yes, that enclave will be selected as victim.

If not, then no other enclave will be selected as victim.  Instead, only the new
enclave which is requesting more EPC will be selected, because it's SECS is on
the unreclaimable list.

Somehow this is unacceptable, thus we need to track VA pages too in order to
kill other inactive enclave?

> There were some discussion on paging out VAs without killing enclaves but  
> it'd be complicated and not implemented yet.

No we don't involve swapping VA pages now.  It's a separate topic.

> 
> BTW, I need clarify tracking pages which is done by LRUs vs usage  
> accounting which is done by charge/uncharge to misc. To me tracking is for  
> reclaiming not accounting. Also vEPCs not tracked at all but they are  
> accounted for.

I'll review the rest patches.  Thanks.
  
Haitao Huang Oct. 5, 2023, 4:22 a.m. UTC | #11
On Wed, 04 Oct 2023 16:13:41 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
>> On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
>> > > >
>> > > > Btw, probably a dumb question:
>> > > >
>> > > > Theoretically if you only need to find a victim enclave you don't  
>> need
>> > > > to put VA
>> > > > pages to the unreclaimable list, because those VA pages will be  
>> freed
>> > > > anyway
>> > > > when enclave is killed.  So keeping VA pages in the list is for>
>> > > accounting all
>> > > > the pages that the cgroup is having?
>> > >
>> > > Yes basically tracking them in cgroups as they are allocated.
>> > >
>> > > VAs and SECS may also come and go as swapping/unswapping happens.  
>> But
>> > > if acgroup is OOM, and all reclaimables are gone (swapped out), it'd
>> > > have toreclaim VAs/SECs in the same cgroup starting from the front  
>> of
>> > > the LRUlist. To reclaim a VA/SECS, it identifies the enclave from  
>> the
>> > > owner ofthe VA/SECS page and kills it, as killing enclave is the  
>> only
>> > > way toreclaim VA/SECS pages.
>> >
>> > To kill enclave you just need to track SECS in  the unreclaimable  
>> list.
>> > Only when you want to account the total EPC pages via some list you
>> > _probably_
>> > need to track VA as well.  But I am not quite sure about this either.
>>
>> There is a case where even SECS is paged out for an enclave with all
>> reclaimables out.
>
> Yes.  But this essentially means these enclaves are not active, thus  
> shouldn't
> be the victim of OOM?
>

But there are VA pages for the enclave at that moment. So it can be  
candidate for OOM victim.

>> So cgroup needs to track each page used by an enclave
>> and kill enclave when cgroup needs to lower usage by evicting an VA or
>> SECS page.
>
> Let's discuss more on tracking SECS on unreclaimable list only.
>
> Could we assume that when the OOM wants to pick up a victim to serve the  
> new
> enclave, there must be at least another one *active* enclave which still  
> has the
> SECS page in EPC?
>
No, at a given instant when OOM happens, "active" enclave's SECS may not  
be in EPC, but lots of VAs.

OOM := "no reclaimable pages left in the cgroup to reclaim and total usage  
is still at/near limit".



> If yes, that enclave will be selected as victim.
>
> If not, then no other enclave will be selected as victim.  Instead, only  
> the new
> enclave which is requesting more EPC will be selected, because it's SECS  
> is on
> the unreclaimable list.
>

You can't assume the requesting enclave's SECS is in unreclaimable list  
either. Think the request is from #PF in the scenario we fixed the NULL  
pointer of SECS by reloading it.

> Somehow this is unacceptable, thus we need to track VA pages too in  
> order to
> kill other inactive enclave?
>

If we know for sure SECS will always be in EPC, thus tracked in  
unreclaimables, then we probably can do it (see below).
I hope the reason given above is clear.

>> There were some discussion on paging out VAs without killing enclaves  
>> but
>> it'd be complicated and not implemented yet.
>
> No we don't involve swapping VA pages now.  It's a separate topic.
>
Only mentioned it as a kind of constraints impacting current design.

Another potential alternative: we don't reclaim SECS either until OOM and  
only track SECS pages for cgroups. But that would change current behavior.  
And I'm not sure about other consequences, e.g., enclaves theoretically  
can allocate pages (including VA pages) in different cgroups/processes, so  
we may still end up tracking all VA pages for cgroups or we track SECS  
page in all cgroups in which enclave allocated any pages. Let me know your  
thoughts.

>>
>> BTW, I need clarify tracking pages which is done by LRUs vs usage
>> accounting which is done by charge/uncharge to misc. To me tracking is  
>> for
>> reclaiming not accounting. Also vEPCs not tracked at all but they are
>> accounted for.
>
> I'll review the rest patches.  Thanks.


Thank you!
Haitao
  
Kai Huang Oct. 5, 2023, 6:49 a.m. UTC | #12
On Wed, 2023-10-04 at 23:22 -0500, Haitao Huang wrote:
> On Wed, 04 Oct 2023 16:13:41 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
> > > On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > > > > > 
> > > > > > Btw, probably a dumb question:
> > > > > > 
> > > > > > Theoretically if you only need to find a victim enclave you don't  
> > > need
> > > > > > to put VA
> > > > > > pages to the unreclaimable list, because those VA pages will be  
> > > freed
> > > > > > anyway
> > > > > > when enclave is killed.  So keeping VA pages in the list is for>
> > > > > accounting all
> > > > > > the pages that the cgroup is having?
> > > > > 
> > > > > Yes basically tracking them in cgroups as they are allocated.
> > > > > 
> > > > > VAs and SECS may also come and go as swapping/unswapping happens.  
> > > But
> > > > > if acgroup is OOM, and all reclaimables are gone (swapped out), it'd
> > > > > have toreclaim VAs/SECs in the same cgroup starting from the front  
> > > of
> > > > > the LRUlist. To reclaim a VA/SECS, it identifies the enclave from  
> > > the
> > > > > owner ofthe VA/SECS page and kills it, as killing enclave is the  
> > > only
> > > > > way toreclaim VA/SECS pages.
> > > > 
> > > > To kill enclave you just need to track SECS in  the unreclaimable  
> > > list.
> > > > Only when you want to account the total EPC pages via some list you
> > > > _probably_
> > > > need to track VA as well.  But I am not quite sure about this either.
> > > 
> > > There is a case where even SECS is paged out for an enclave with all
> > > reclaimables out.
> > 
> > Yes.  But this essentially means these enclaves are not active, thus  
> > shouldn't
> > be the victim of OOM?
> > 
> 
> But there are VA pages for the enclave at that moment. So it can be  
> candidate for OOM victim.

Yes.  I am not familiar with how does OOM choose victim, but it seems choosing
inactive enclaves seems more reasonable.


[...]

> > > There were some discussion on paging out VAs without killing enclaves  
> > > but
> > > it'd be complicated and not implemented yet.
> > 
> > No we don't involve swapping VA pages now.  It's a separate topic.
> > 
> Only mentioned it as a kind of constraints impacting current design.
> 
> Another potential alternative: we don't reclaim SECS either until OOM and  
> only track SECS pages for cgroups. But that would change current behavior.  
> And I'm not sure about other consequences, e.g., enclaves theoretically  
> can allocate pages (including VA pages) in different cgroups/processes, so  
> we may still end up tracking all VA pages for cgroups or we track SECS  
> page in all cgroups in which enclave allocated any pages. Let me know your  
> thoughts.

Let's not change current behaviour.  I seriously doubt that is needed.

So it seems to me that what we need is just some way to let the OOM find some
victim enclave.  I am not sure whether "tracking EPC pages in some lists" has
anything to do with cgroup accounting EPC pages, so will take a look the rest of
the patches.
  

Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f5afc8d65e22..ec3402d41b63 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1236,6 +1236,7 @@  void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ * @encl:    The new owner of the page.
  * @reclaim: Reclaim EPC pages directly if none available. Enclave
  *           mutex should not be held if this is set.
  *
@@ -1245,12 +1246,12 @@  void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 {
 	struct sgx_epc_page *epc_page;
 	int ret;
 
-	epc_page = sgx_alloc_epc_page(NULL, reclaim);
+	epc_page = sgx_alloc_epc_page(encl, reclaim);
 	if (IS_ERR(epc_page))
 		return ERR_CAST(epc_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..831d63f80f5a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,7 +116,7 @@  struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 					  unsigned long offset,
 					  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 9a32bf5a1070..164256ea18d0 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -30,7 +30,7 @@  struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
 		if (!va_page)
 			return ERR_PTR(-ENOMEM);
 
-		va_page->epc_page = sgx_alloc_va_page(reclaim);
+		va_page->epc_page = sgx_alloc_va_page(encl, reclaim);
 		if (IS_ERR(va_page->epc_page)) {
 			err = ERR_CAST(va_page->epc_page);
 			kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index fba06dc5abfe..ed813288af44 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -107,7 +107,7 @@  static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	struct sgx_encl *encl = page->encl;
 	struct sgx_encl_mm *encl_mm;
 	bool ret = true;
@@ -139,7 +139,7 @@  static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 
 static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	unsigned long addr = page->desc & PAGE_MASK;
 	struct sgx_encl *encl = page->encl;
 	int ret;
@@ -196,7 +196,7 @@  void sgx_ipi_cb(void *info)
 static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 			 struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_va_page *va_page;
 	unsigned int va_offset;
@@ -249,7 +249,7 @@  static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 				struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_backing secs_backing;
 	int ret;
@@ -309,7 +309,7 @@  static void sgx_reclaim_pages(void)
 			break;
 
 		list_del_init(&epc_page->list);
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
@@ -329,7 +329,7 @@  static void sgx_reclaim_pages(void)
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (!sgx_reclaimer_age(epc_page))
 			goto skip;
@@ -362,7 +362,7 @@  static void sgx_reclaim_pages(void)
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -562,7 +562,7 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
 		if (!IS_ERR(page)) {
-			page->owner = owner;
+			page->encl_page = owner;
 			break;
 		}
 
@@ -607,7 +607,7 @@  void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	page->owner = NULL;
+	page->encl_page = NULL;
 	if (page->poison)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
@@ -642,7 +642,7 @@  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
 		section->pages[i].flags = 0;
-		section->pages[i].owner = NULL;
+		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 764cec23f4e5..5110dd433b80 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -68,7 +68,12 @@  struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
-	struct sgx_encl_page *owner;
+
+	/* Possible owner types */
+	union {
+		struct sgx_encl_page *encl_page;
+		struct sgx_encl *encl;
+	};
 	struct list_head list;
 };