x86/sgx: Reduce delay and interference of enclave release

Message ID 06a5f478d3bfaa57954954c82dd5d4040450171d.1666130846.git.reinette.chatre@intel.com
State New
Headers
Series x86/sgx: Reduce delay and interference of enclave release |

Commit Message

Reinette Chatre Oct. 18, 2022, 10:42 p.m. UTC
  commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
releasing large enclaves") introduced a cond_resched() during enclave
release where the EREMOVE instruction is applied to every 4k enclave
page. Giving other tasks an opportunity to run while tearing down a
large enclave placates the soft lockup detector but Iqbal found
that the fix causes a 25% performance degradation of a workload
run using Gramine.

Gramine maintains a 1:1 mapping between processes and SGX enclaves.
That means if a workload in an enclave creates a subprocess then
Gramine creates a duplicate enclave for that subprocess to run in.
The consequence is that the release of the enclave used to run
the subprocess can impact the performance of the workload that is
run in the original enclave, especially in large enclaves when
SGX2 is not in use.

The workload run by Iqbal behaves as follows:
Create enclave (enclave "A")
/* Initialize workload in enclave "A" */
Create enclave (enclave "B")
/* Run subprocess in enclave "B" and send result to enclave "A" */
Release enclave (enclave "B")
/* Run workload in enclave "A" */
Release enclave (enclave "A")

The performance impact of releasing enclave "B" in the above scenario
is amplified when there is a lot of SGX memory and the enclave size
matches the SGX memory. When there is 128GB SGX memory and an enclave
size of 128GB, from the time enclave "B" starts the 128GB SGX memory
is oversubscribed with a combined demand for 256GB from the two
enclaves.

Before commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
releasing large enclaves") enclave release was done in a tight loop
without giving other tasks a chance to run. Even though the system
experienced soft lockups the workload (run in enclave "A") obtained
good performance numbers because when the workload started running
there was no interference.

Commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
releasing large enclaves") gave other tasks opportunity to run while an
enclave is released. The impact of this in this scenario is that while
enclave "B" is released and needing to access each page that belongs
to it in order to run the SGX EREMOVE instruction on it, enclave "A"
is attempting to run the workload needing to access the enclave
pages that belong to it. This causes a lot of swapping due to the
demand for the oversubscribed SGX memory. Longer latencies are
experienced by the workload in enclave "A" while enclave "B" is
released.

Improve the performance of enclave release while still avoiding the
soft lockup detector with two enhancements:
- Only call cond_resched() after XA_CHECK_SCHED iterations.
- Use the xarray advanced API to keep the xarray locked for
  XA_CHECK_SCHED iterations instead of locking and unlocking
  at every iteration.

This batching solution is copied from sgx_encl_may_map() that
also iterates through all enclave pages using this technique.

With this enhancement the workload experiences a 5%
performance degradation when compared to a kernel without
commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
releasing large enclaves"), an improvement to the reported 25%
degradation, while still placating the soft lockup detector.

Scenarios with poor performance are still possible even with these
enhancements. For example, short workloads creating sub processes
while running in large enclaves. Further performance improvements
are pursued in user space through avoiding to create duplicate enclaves
for certain sub processes, and using SGX2 that will do lazy allocation
of pages as needed so enclaves created for sub processes start quickly
and release quickly.

Fixes: 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
Reported-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
Tested-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---

I do not know if this qualifies as stable material.

 arch/x86/kernel/cpu/sgx/encl.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)
  

Comments

Jarkko Sakkinen Oct. 23, 2022, 8:06 p.m. UTC | #1
On Tue, Oct 18, 2022 at 03:42:47PM -0700, Reinette Chatre wrote:
> commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
> releasing large enclaves") introduced a cond_resched() during enclave
> release where the EREMOVE instruction is applied to every 4k enclave
> page. Giving other tasks an opportunity to run while tearing down a
> large enclave placates the soft lockup detector but Iqbal found
> that the fix causes a 25% performance degradation of a workload
> run using Gramine.
> 
> Gramine maintains a 1:1 mapping between processes and SGX enclaves.
> That means if a workload in an enclave creates a subprocess then
> Gramine creates a duplicate enclave for that subprocess to run in.
> The consequence is that the release of the enclave used to run
> the subprocess can impact the performance of the workload that is
> run in the original enclave, especially in large enclaves when
> SGX2 is not in use.
> 
> The workload run by Iqbal behaves as follows:
> Create enclave (enclave "A")
> /* Initialize workload in enclave "A" */
> Create enclave (enclave "B")
> /* Run subprocess in enclave "B" and send result to enclave "A" */
> Release enclave (enclave "B")
> /* Run workload in enclave "A" */
> Release enclave (enclave "A")
> 
> The performance impact of releasing enclave "B" in the above scenario
> is amplified when there is a lot of SGX memory and the enclave size
> matches the SGX memory. When there is 128GB SGX memory and an enclave
> size of 128GB, from the time enclave "B" starts the 128GB SGX memory
> is oversubscribed with a combined demand for 256GB from the two
> enclaves.
> 
> Before commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
> releasing large enclaves") enclave release was done in a tight loop
> without giving other tasks a chance to run. Even though the system
> experienced soft lockups the workload (run in enclave "A") obtained
> good performance numbers because when the workload started running
> there was no interference.
> 
> Commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
> releasing large enclaves") gave other tasks opportunity to run while an
> enclave is released. The impact of this in this scenario is that while
> enclave "B" is released and needing to access each page that belongs
> to it in order to run the SGX EREMOVE instruction on it, enclave "A"
> is attempting to run the workload needing to access the enclave
> pages that belong to it. This causes a lot of swapping due to the
> demand for the oversubscribed SGX memory. Longer latencies are
> experienced by the workload in enclave "A" while enclave "B" is
> released.
> 
> Improve the performance of enclave release while still avoiding the
> soft lockup detector with two enhancements:
> - Only call cond_resched() after XA_CHECK_SCHED iterations.
> - Use the xarray advanced API to keep the xarray locked for
>   XA_CHECK_SCHED iterations instead of locking and unlocking
>   at every iteration.
> 
> This batching solution is copied from sgx_encl_may_map() that
> also iterates through all enclave pages using this technique.
> 
> With this enhancement the workload experiences a 5%
> performance degradation when compared to a kernel without
> commit 8795359e35bc ("x86/sgx: Silence softlockup detection when
> releasing large enclaves"), an improvement to the reported 25%
> degradation, while still placating the soft lockup detector.
> 
> Scenarios with poor performance are still possible even with these
> enhancements. For example, short workloads creating sub processes
> while running in large enclaves. Further performance improvements
> are pursued in user space through avoiding to create duplicate enclaves
> for certain sub processes, and using SGX2 that will do lazy allocation
> of pages as needed so enclaves created for sub processes start quickly
> and release quickly.
> 
> Fixes: 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
> Reported-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
> Tested-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> 
> I do not know if this qualifies as stable material.
> 
>  arch/x86/kernel/cpu/sgx/encl.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 1ec20807de1e..f7365c278525 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -682,9 +682,12 @@ void sgx_encl_release(struct kref *ref)
>  	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
>  	struct sgx_va_page *va_page;
>  	struct sgx_encl_page *entry;
> -	unsigned long index;
> +	unsigned long count = 0;
> +
> +	XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
>  
> -	xa_for_each(&encl->page_array, index, entry) {
> +	xas_lock(&xas);
> +	xas_for_each(&xas, entry, PFN_DOWN(encl->base + encl->size  - 1)) {

I would add to declarations:

unsigned long nr_pages = PFN_DOWN(encl->base + encl->size  - 1);

Makes this more readable.

>  		if (entry->epc_page) {
>  			/*
>  			 * The page and its radix tree entry cannot be freed
> @@ -699,9 +702,20 @@ void sgx_encl_release(struct kref *ref)
>  		}
>  
>  		kfree(entry);
> -		/* Invoke scheduler to prevent soft lockups. */
> -		cond_resched();
> +		/*
> +		 * Invoke scheduler on every XA_CHECK_SCHED iteration
> +		 * to prevent soft lockups.
> +		 */
> +		if (!(++count % XA_CHECK_SCHED)) {
> +			xas_pause(&xas);
> +			xas_unlock(&xas);
> +
> +			cond_resched();
> +
> +			xas_lock(&xas);
> +		}
>  	}

        WARN_ON(count != nr_pages);

> +	xas_unlock(&xas);
>  
>  	xa_destroy(&encl->page_array);
>  
> -- 
> 2.34.1
> 

BR, Jarkko
  
Reinette Chatre Oct. 24, 2022, 6:56 p.m. UTC | #2
Hi Jarkko,

On 10/23/2022 1:06 PM, Jarkko Sakkinen wrote:
> On Tue, Oct 18, 2022 at 03:42:47PM -0700, Reinette Chatre wrote:

...

>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>> index 1ec20807de1e..f7365c278525 100644
>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>> @@ -682,9 +682,12 @@ void sgx_encl_release(struct kref *ref)
>>  	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
>>  	struct sgx_va_page *va_page;
>>  	struct sgx_encl_page *entry;
>> -	unsigned long index;
>> +	unsigned long count = 0;
>> +
>> +	XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
>>  
>> -	xa_for_each(&encl->page_array, index, entry) {
>> +	xas_lock(&xas);
>> +	xas_for_each(&xas, entry, PFN_DOWN(encl->base + encl->size  - 1)) {
> 
> I would add to declarations:
> 
> unsigned long nr_pages = PFN_DOWN(encl->base + encl->size  - 1);
> 
> Makes this more readable.

Will do, but I prefer to name it "max_page_index" or something related instead.
"nr_pages" implies "number of pages" to me, which is not what
PFN_DOWN(encl->base + encl->size - 1) represents. What is represented is the
highest possible index of a page in page_array, where an index is the
pfn of a page.

> 
>>  		if (entry->epc_page) {
>>  			/*
>>  			 * The page and its radix tree entry cannot be freed
>> @@ -699,9 +702,20 @@ void sgx_encl_release(struct kref *ref)
>>  		}
>>  
>>  		kfree(entry);
>> -		/* Invoke scheduler to prevent soft lockups. */
>> -		cond_resched();
>> +		/*
>> +		 * Invoke scheduler on every XA_CHECK_SCHED iteration
>> +		 * to prevent soft lockups.
>> +		 */
>> +		if (!(++count % XA_CHECK_SCHED)) {
>> +			xas_pause(&xas);
>> +			xas_unlock(&xas);
>> +
>> +			cond_resched();
>> +
>> +			xas_lock(&xas);
>> +		}
>>  	}
> 
>         WARN_ON(count != nr_pages);
> 

nr_pages as assigned in your example does not represent a count of the
enclave pages but instead a pfn index into the page_array. Comparing it
to count, the number of removed enclave pages that are not being held
by reclaimer, is not appropriate.

This check would be problematic even if we create a "nr_pages" from
the range of possible indices. This is because of how enclave sizes are
required to be power-of-two that makes it likely for there to be indices
without pages associated with it.

>> +	xas_unlock(&xas);
>>  
>>  	xa_destroy(&encl->page_array);
>>  
>> -- 
>> 2.34.1
>>

Reinette
  
Jarkko Sakkinen Nov. 1, 2022, 12:52 a.m. UTC | #3
On Mon, Oct 24, 2022 at 11:56:39AM -0700, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 10/23/2022 1:06 PM, Jarkko Sakkinen wrote:
> > On Tue, Oct 18, 2022 at 03:42:47PM -0700, Reinette Chatre wrote:
> 
> ...
> 
> >> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> >> index 1ec20807de1e..f7365c278525 100644
> >> --- a/arch/x86/kernel/cpu/sgx/encl.c
> >> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> >> @@ -682,9 +682,12 @@ void sgx_encl_release(struct kref *ref)
> >>  	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
> >>  	struct sgx_va_page *va_page;
> >>  	struct sgx_encl_page *entry;
> >> -	unsigned long index;
> >> +	unsigned long count = 0;
> >> +
> >> +	XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
> >>  
> >> -	xa_for_each(&encl->page_array, index, entry) {
> >> +	xas_lock(&xas);
> >> +	xas_for_each(&xas, entry, PFN_DOWN(encl->base + encl->size  - 1)) {
> > 
> > I would add to declarations:
> > 
> > unsigned long nr_pages = PFN_DOWN(encl->base + encl->size  - 1);
> > 
> > Makes this more readable.
> 
> Will do, but I prefer to name it "max_page_index" or something related instead.
> "nr_pages" implies "number of pages" to me, which is not what
> PFN_DOWN(encl->base + encl->size - 1) represents. What is represented is the
> highest possible index of a page in page_array, where an index is the
> pfn of a page.

Yeah, makes sense.

> 
> > 
> >>  		if (entry->epc_page) {
> >>  			/*
> >>  			 * The page and its radix tree entry cannot be freed
> >> @@ -699,9 +702,20 @@ void sgx_encl_release(struct kref *ref)
> >>  		}
> >>  
> >>  		kfree(entry);
> >> -		/* Invoke scheduler to prevent soft lockups. */
> >> -		cond_resched();
> >> +		/*
> >> +		 * Invoke scheduler on every XA_CHECK_SCHED iteration
> >> +		 * to prevent soft lockups.
> >> +		 */
> >> +		if (!(++count % XA_CHECK_SCHED)) {
> >> +			xas_pause(&xas);
> >> +			xas_unlock(&xas);
> >> +
> >> +			cond_resched();
> >> +
> >> +			xas_lock(&xas);
> >> +		}
> >>  	}
> > 
> >         WARN_ON(count != nr_pages);
> > 
> 
> nr_pages as assigned in your example does not represent a count of the
> enclave pages but instead a pfn index into the page_array. Comparing it
> to count, the number of removed enclave pages that are not being held
> by reclaimer, is not appropriate.
> 
> This check would be problematic even if we create a "nr_pages" from
> the range of possible indices. This is because of how enclave sizes are
> required to be power-of-two that makes it likely for there to be indices
> without pages associated with it.

Ok.

> 
> >> +	xas_unlock(&xas);
> >>  
> >>  	xa_destroy(&encl->page_array);
> >>  
> >> -- 
> >> 2.34.1
> >>
> 
> Reinette

BR, Jarkko
  

Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 1ec20807de1e..f7365c278525 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -682,9 +682,12 @@  void sgx_encl_release(struct kref *ref)
 	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
 	struct sgx_va_page *va_page;
 	struct sgx_encl_page *entry;
-	unsigned long index;
+	unsigned long count = 0;
+
+	XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
 
-	xa_for_each(&encl->page_array, index, entry) {
+	xas_lock(&xas);
+	xas_for_each(&xas, entry, PFN_DOWN(encl->base + encl->size  - 1)) {
 		if (entry->epc_page) {
 			/*
 			 * The page and its radix tree entry cannot be freed
@@ -699,9 +702,20 @@  void sgx_encl_release(struct kref *ref)
 		}
 
 		kfree(entry);
-		/* Invoke scheduler to prevent soft lockups. */
-		cond_resched();
+		/*
+		 * Invoke scheduler on every XA_CHECK_SCHED iteration
+		 * to prevent soft lockups.
+		 */
+		if (!(++count % XA_CHECK_SCHED)) {
+			xas_pause(&xas);
+			xas_unlock(&xas);
+
+			cond_resched();
+
+			xas_lock(&xas);
+		}
 	}
+	xas_unlock(&xas);
 
 	xa_destroy(&encl->page_array);