[v5,12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

Message ID 20230923030657.16148-13-haitao.huang@linux.intel.com
State New
Series Add Cgroup support for SGX EPC memory

Commit Message

Haitao Huang Sept. 23, 2023, 3:06 a.m. UTC
  From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce the OOM path for killing an enclave when a reclaimer is no
longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with only "unreclaimable" EPC pages left in the
cgroup LRU lists. Once a victim is identified, mark the enclave as OOM,
zap the enclave's entire page range, and drain all mm references in
encl->mm_list. Block allocating any EPC pages in the #PF handler,
reloading any pages on all paths, and creating any new mappings.

The OOM killing path may race with the reclaimers: in some cases, the
victim enclave is in the process of reclaiming its last EPC pages when
OOM happens, that is, all pages other than SECS and VA pages are in the
RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
the enclave backing storage, the VA pages, as well as the SECS. So the
OOM killer does not directly release those enclave resources; instead,
it lets all reclaiming in progress finish, and relies (as currently
done) on kref_put() on encl->refcount to trigger sgx_encl_release() to
do the final cleanup.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
V5:
- Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY

V4:
- Updates for patch reordering and typo fixes.

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and
reloading when the enclave is marked for OOM. Do not release any enclave
resources other than draining mm_list entries, and let pages in
RECLAIMING_IN_PROGRESS state be reaped by reclaimers.
- Due to the above changes, also removed the no-longer-needed encl->lock
in the OOM path, which was causing deadlocks reported by the lock prover.
---
 arch/x86/kernel/cpu/sgx/driver.c |  27 +-----
 arch/x86/kernel/cpu/sgx/encl.c   |  48 ++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |   2 +
 arch/x86/kernel/cpu/sgx/ioctl.c  |   9 ++
 arch/x86/kernel/cpu/sgx/main.c   | 140 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |   1 +
 6 files changed, 200 insertions(+), 27 deletions(-)
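
For readers following the flow described in the commit message, below is an
editor-added sketch of the per-enclave OOM path, for illustration only and not
taken from the patch: sgx_oom_zap_enclave_range() is a hypothetical stand-in
for the VMA_ITERATOR-based zapping this patch introduces, SGX_ENCL_NO_MEMORY
comes from this series, and error handling plus the mm_list draining are
simplified away.

/* Hypothetical helper standing in for the patch's VMA_ITERATOR-based zapping. */
static void sgx_oom_zap_enclave_range(struct mm_struct *mm, unsigned long start,
				      unsigned long size);

/*
 * Editor's sketch (not the patch): mark the victim enclave as OOM, zap its
 * mappings in every mm, and let the final kref_put() drive the real cleanup.
 */
static bool sgx_oom_encl_sketch(struct sgx_encl *encl)
{
	struct sgx_encl_mm *encl_mm;
	bool ret = false;
	int idx;

	if (!kref_get_unless_zero(&encl->refcount))
		return false;

	/* Block EPC allocation in #PF, page reloads, and new mappings. */
	set_bit(SGX_ENCL_NO_MEMORY, &encl->flags);

	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
		goto out;

	/* Zap the enclave's entire page range in each mm that maps it. */
	idx = srcu_read_lock(&encl->srcu);
	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
		if (!mmget_not_zero(encl_mm->mm))
			continue;

		mmap_read_lock(encl_mm->mm);
		sgx_oom_zap_enclave_range(encl_mm->mm, encl->base, encl->size);
		mmap_read_unlock(encl_mm->mm);

		mmput_async(encl_mm->mm);
	}
	srcu_read_unlock(&encl->srcu, idx);

	ret = true;
out:
	/*
	 * Pages still in RECLAIMING_IN_PROGRESS are left to the reclaimers;
	 * sgx_encl_release() does the final cleanup once the last reference
	 * is dropped.  The real patch also drains encl->mm_list here.
	 */
	kref_put(&encl->refcount, sgx_encl_release);
	return ret;
}

The key design point, per the commit message, is that nothing owned by in-flight
reclaim is freed here; the last kref_put() defers the real teardown to
sgx_encl_release().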
  

Comments

Kai Huang Oct. 9, 2023, 11:45 p.m. UTC | #1
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Introduce the OOM path for killing an enclave with a reclaimer that is no
> longer able to reclaim enough EPC pages. Find a victim enclave, which
> will be an enclave with only "unreclaimable" EPC pages left in the
> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> and zap the enclave's entire page range, and drain all mm references in
> encl->mm_list. Block allocating any EPC pages in #PF handler, or
> reloading any pages in all paths, or creating any new mappings.
> 
> The OOM killing path may race with the reclaimers: in some cases, the
> victim enclave is in the process of reclaiming the last EPC pages when
> OOM happens, that is, all pages other than SECS and VA pages are in
> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> the enclave backing, VA pages as well as SECS. So the OOM killer does
> not directly release those enclave resources, instead, it lets all
> reclaiming in progress to finish, and relies (as currently done) on
> kref_put on encl->refcount to trigger sgx_encl_release() to do the
> final cleanup.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V5:
> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> 
> V4:
> - Updates for patch reordering and typo fixes.
> 
> V3:
> - Rebased to use the new VMA_ITERATOR to zap VMAs.
> - Fixed the racing cases by blocking new page allocation/mapping and
> reloading when enclave is marked for OOM. And do not release any enclave
> resources other than draining mm_list entries, and let pages in
> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> - Due to above changes, also removed the no-longer needed encl->lock in
> the OOM path which was causing deadlocks reported by the lock prover.
> 

[...]

> +
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru:	LRU that is low
> + *
> + * Return:	%true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> +	struct sgx_epc_page *victim;
> +
> +	spin_lock(&lru->lock);
> +	victim = sgx_oom_get_victim(lru);
> +	spin_unlock(&lru->lock);
> +
> +	if (!victim)
> +		return false;
> +
> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> +		return sgx_oom_encl_page(victim->encl_page);
> +
> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> +		return sgx_oom_encl(victim->encl);

I hate to bring this up, at least at this stage, but I am wondering why we need
to put VA and SECS pages on the unreclaimable list, but cannot keep an
"enclave_list" instead?

So by looking at the patch ("x86/sgx: Limit process EPC usage with misc cgroup
controller"), if I am not missing anything, the whole "unreclaimable" list is
just used to find the victim enclave when OOM needs to be done.  Thus, I don't
see why an "enclave_list" cannot be used to achieve this.

The reason I am asking is that using an "enclave_list" seems to simplify the
code.  At least the patches related to tracking VA/SECS pages, and the
SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing, could be eliminated completely.

Using an "enclave_list", I guess you just need to put the enclave into the
current EPC cgroup when the SECS page is allocated.

In fact, putting the "unreclaimable" list into the LRU itself is a little bit
confusing because: 1) you cannot really reclaim anything from the list; 2)
VA/SECS pages don't have the concept of "young" at all, thus it makes no sense
to treat them as LRU entries.

Thus putting VA/SECS pages on an "unreclaimable" list, instead of keeping an
"enclave_list", seems to have no benefit and only makes the code more
complicated.

Or am I missing anything?
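
For concreteness, here is an editor-added sketch of what the suggested
alternative could look like; the per-cgroup lock, the "enclave_list" member on
struct sgx_epc_cgroup, and the "encl_list" linkage on struct sgx_encl are all
assumptions for illustration, not code from this series.

/* Editor's sketch: pick an OOM victim from a per-cgroup list of enclaves. */
static struct sgx_encl *sgx_oom_pick_enclave(struct sgx_epc_cgroup *epc_cg)
{
	struct sgx_encl *encl = NULL;
	struct sgx_encl *tmp;

	spin_lock(&epc_cg->lock);
	list_for_each_entry(tmp, &epc_cg->enclave_list, encl_list) {
		/* Skip enclaves that are already being released. */
		if (kref_get_unless_zero(&tmp->refcount)) {
			encl = tmp;
			break;
		}
	}
	spin_unlock(&epc_cg->lock);

	/* Caller would then mark @encl as OOM and zap it, as this patch does. */
	return encl;
}

In this scheme the enclave would be added to the list when its SECS page is
allocated and removed in sgx_encl_release(), which is the simplification being
argued for above.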
  
Sean Christopherson Oct. 10, 2023, 12:23 a.m. UTC | #2
On Mon, Oct 09, 2023, Kai Huang wrote:
> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > +/**
> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > + * @lru:	LRU that is low
> > + *
> > + * Return:	%true if a victim was found and kicked.
> > + */
> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > +{
> > +	struct sgx_epc_page *victim;
> > +
> > +	spin_lock(&lru->lock);
> > +	victim = sgx_oom_get_victim(lru);
> > +	spin_unlock(&lru->lock);
> > +
> > +	if (!victim)
> > +		return false;
> > +
> > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > +		return sgx_oom_encl_page(victim->encl_page);
> > +
> > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > +		return sgx_oom_encl(victim->encl);
> 
> I hate to bring this up, at least at this stage, but I am wondering why we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?

The motivation for tracking EPC pages instead of enclaves was so that the EPC
OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual EPC code
didn't actually kill the VM process, it instead just freed all of the EPC pages
and abused the SGX architecture to effectively make the guest recreate all its
enclaves (IIRC, QEMU does the same thing to "support" live migration).

Looks like y'all punted on that with:

  The EPC pages allocated for KVM guests by the virtual EPC driver are not
  reclaimable by the host kernel [5]. Therefore they are not tracked by any
  LRU lists for reclaiming purposes in this implementation, but they are
  charged toward the cgroup of the user processs (e.g., QEMU) launching the
  guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
  driver will stop allocating more EPC for the VM, and return SIGBUS to the
  user process which would abort the VM launch.

which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  Relying
on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
enforcement.  E.g. if the hard limit for an EPC cgroup is lowered, userspace running
enclaves in a VM could continue on and refuse to give up its EPC, and thus run above
its limit in perpetuity.

I can see userspace wanting to explicitly terminate the VM instead of "silently"
killing the VM's enclaves, but that seems like it should be a knob in the virtual
EPC code.
  
Kai Huang Oct. 10, 2023, 12:50 a.m. UTC | #3
On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru:	LRU that is low
> > > + *
> > > + * Return:	%true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +	struct sgx_epc_page *victim;
> > > +
> > > +	spin_lock(&lru->lock);
> > > +	victim = sgx_oom_get_victim(lru);
> > > +	spin_unlock(&lru->lock);
> > > +
> > > +	if (!victim)
> > > +		return false;
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +		return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +		return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> 
> The motivation for tracking EPC pages instead of enclaves was so that the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  
> 

Ah this seems a fair argument. :-)

> The virtual EPC code
> didn't actually kill the VM process, it instead just freed all of the EPC pages
> and abused the SGX architecture to effectively make the guest recreate all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).

It returns SIGBUS.  SGX VM live migration also requires that enough EPC can be
allocated on the destination machine to work, AFAICT.
 
> 
> Looks like y'all punted on that with:
> 
>   The EPC pages allocated for KVM guests by the virtual EPC driver are not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user processs (e.g., QEMU) launching the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to the
>   user process which would abort the VM launch.
> 
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> 

"enforced" do you mean?

Currently sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate an
EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
which eventually results in KVM returning -EFAULT to userspace in vcpu_run().
And Qemu then kills the VM with some nonsense message:

        error: kvm run failed Bad address
        <dump guest registers nonsense>

> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> its limit in perpetuity.

Agreed.  But it looks like this cannot be resolved until we can reclaim EPC
pages from the VM.

Or, in the EPC cgroup code, we can refuse to set a maximum which cannot be
supported, e.g., less than the total virtual EPC size.

I guess the second is acceptable for now?

> 
> I can see userspace wanting to explicitly terminate the VM instead of "silently"
> the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> code.

See above for the second option.
  
Haitao Huang Oct. 10, 2023, 1:04 a.m. UTC | #4
On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Introduce the OOM path for killing an enclave with a reclaimer that is  
>> no
>> longer able to reclaim enough EPC pages. Find a victim enclave, which
>> will be an enclave with only "unreclaimable" EPC pages left in the
>> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
>> and zap the enclave's entire page range, and drain all mm references in
>> encl->mm_list. Block allocating any EPC pages in #PF handler, or
>> reloading any pages in all paths, or creating any new mappings.
>>
>> The OOM killing path may race with the reclaimers: in some cases, the
>> victim enclave is in the process of reclaiming the last EPC pages when
>> OOM happens, that is, all pages other than SECS and VA pages are in
>> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
>> the enclave backing, VA pages as well as SECS. So the OOM killer does
>> not directly release those enclave resources, instead, it lets all
>> reclaiming in progress to finish, and relies (as currently done) on
>> kref_put on encl->refcount to trigger sgx_encl_release() to do the
>> final cleanup.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> ---
>> V5:
>> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
>>
>> V4:
>> - Updates for patch reordering and typo fixes.
>>
>> V3:
>> - Rebased to use the new VMA_ITERATOR to zap VMAs.
>> - Fixed the racing cases by blocking new page allocation/mapping and
>> reloading when enclave is marked for OOM. And do not release any enclave
>> resources other than draining mm_list entries, and let pages in
>> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
>> - Due to above changes, also removed the no-longer needed encl->lock in
>> the OOM path which was causing deadlocks reported by the lock prover.
>>
>
> [...]
>
>> +
>> +/**
>> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> + * @lru:	LRU that is low
>> + *
>> + * Return:	%true if a victim was found and kicked.
>> + */
>> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> +{
>> +	struct sgx_epc_page *victim;
>> +
>> +	spin_lock(&lru->lock);
>> +	victim = sgx_oom_get_victim(lru);
>> +	spin_unlock(&lru->lock);
>> +
>> +	if (!victim)
>> +		return false;
>> +
>> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> +		return sgx_oom_encl_page(victim->encl_page);
>> +
>> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> +		return sgx_oom_encl(victim->encl);
>
> I hate to bring this up, at least at this stage, but I am wondering why  
> we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?
>
> So by looking the patch (" x86/sgx: Limit process EPC usage with misc  
> cgroup
> controller"), if I am not missing anything, the whole "unreclaimable"  
> list is
> just used to find the victim enclave when OOM needs to be done.  Thus, I  
> don't
> see why "enclave_list" cannot be used to achieve this.
>
> The reason that I am asking is because it seems using "enclave_list" we  
> can
> simplify the code.  At least the patches related to track VA/SECS pages,  
> and the
> SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
> completely.  
> Using "enclave_list", I guess you just need to put the enclave to the  
> current
> EPC cgroup when SECS page is allocated.
>
Later, the hosting process could be migrated/reassigned to another cgroup?
What to do when the new cgroup is OOM?

Thanks
Haitao
  
Kai Huang Oct. 10, 2023, 1:18 a.m. UTC | #5
On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > 
> > > Introduce the OOM path for killing an enclave with a reclaimer that is  
> > > no
> > > longer able to reclaim enough EPC pages. Find a victim enclave, which
> > > will be an enclave with only "unreclaimable" EPC pages left in the
> > > cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> > > and zap the enclave's entire page range, and drain all mm references in
> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or
> > > reloading any pages in all paths, or creating any new mappings.
> > > 
> > > The OOM killing path may race with the reclaimers: in some cases, the
> > > victim enclave is in the process of reclaiming the last EPC pages when
> > > OOM happens, that is, all pages other than SECS and VA pages are in
> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> > > the enclave backing, VA pages as well as SECS. So the OOM killer does
> > > not directly release those enclave resources, instead, it lets all
> > > reclaiming in progress to finish, and relies (as currently done) on
> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the
> > > final cleanup.
> > > 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Cc: Sean Christopherson <seanjc@google.com>
> > > ---
> > > V5:
> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> > > 
> > > V4:
> > > - Updates for patch reordering and typo fixes.
> > > 
> > > V3:
> > > - Rebased to use the new VMA_ITERATOR to zap VMAs.
> > > - Fixed the racing cases by blocking new page allocation/mapping and
> > > reloading when enclave is marked for OOM. And do not release any enclave
> > > resources other than draining mm_list entries, and let pages in
> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> > > - Due to above changes, also removed the no-longer needed encl->lock in
> > > the OOM path which was causing deadlocks reported by the lock prover.
> > > 
> > 
> > [...]
> > 
> > > +
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru:	LRU that is low
> > > + *
> > > + * Return:	%true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +	struct sgx_epc_page *victim;
> > > +
> > > +	spin_lock(&lru->lock);
> > > +	victim = sgx_oom_get_victim(lru);
> > > +	spin_unlock(&lru->lock);
> > > +
> > > +	if (!victim)
> > > +		return false;
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +		return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +		return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why  
> > we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> > 
> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc  
> > cgroup
> > controller"), if I am not missing anything, the whole "unreclaimable"  
> > list is
> > just used to find the victim enclave when OOM needs to be done.  Thus, I  
> > don't
> > see why "enclave_list" cannot be used to achieve this.
> > 
> > The reason that I am asking is because it seems using "enclave_list" we  
> > can
> > simplify the code.  At least the patches related to track VA/SECS pages,  
> > and the
> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
> > completely.  
> > Using "enclave_list", I guess you just need to put the enclave to the  
> > current
> > EPC cgroup when SECS page is allocated.
> > 
> Later the hosting process could migrated/reassigned to another cgroup?
> What to do when the new cgroup is OOM?
> 

You addressed this in the documentation, no?

+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.
  
Kai Huang Oct. 10, 2023, 1:34 a.m. UTC | #6
On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:	LRU that is low
> > > > + *
> > > > + * Return:	%true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +	struct sgx_epc_page *victim;
> > > > +
> > > > +	spin_lock(&lru->lock);
> > > > +	victim = sgx_oom_get_victim(lru);
> > > > +	spin_unlock(&lru->lock);
> > > > +
> > > > +	if (!victim)
> > > > +		return false;
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +		return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  
> > 
> 
> Ah this seems a fair argument. :-)
> 
> > The virtual EPC code
> > didn't actually kill the VM process, it instead just freed all of the EPC pages
> > and abused the SGX architecture to effectively make the guest recreate all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> 
> It returns SIGBUS.  SGX VM live migration also requires enough EPC being able to
> be allocated on the destination machine to work AFAICT.
>  
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> > 
> 
> "enforced" do you mean?
> 
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
> And Qemu then kills the VM with some nonsense message:
> 
>         error: kvm run failed Bad address
>         <dump guest registers nonsense>
> 
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> > its limit in perpetuity.
> 
> > 
> > I can see userspace wanting to explicitly terminate the VM instead of "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> > code.

I guess I slightly misunderstood your words.

You mean we want to kill the VM when the limit is set lower than the virtual
EPC size.

This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual EPC too?

In sgx_vepc_fault(), we could check this flag early and return SIGBUS if it is
set.

But this also requires keeping virtual EPC pages on some list, and handling
them in sgx_epc_oom() too.

And for virtual EPC pages, I guess the "young" logic can be applied, thus it's
probably better to keep the actual virtual EPC pages on a (separate?) list
instead of tracking the virtual EPC instance.

	struct sgx_epc_lru {
		struct list_head reclaimable;
		struct sgx_encl *enclaves;
		struct list_head vepc_pages;
	}

Or still track VA/SECS and virtual EPC pages in a single unreclaimable list?

I don't know :-)
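
As an editor-added illustration of the early-SIGBUS idea above, here is a
sketch of a modified vEPC fault handler. The "flags" field and the
SGX_VEPC_NO_MEMORY bit on struct sgx_vepc are hypothetical, and the conversion
of __sgx_vepc_fault()'s return value is simplified compared to the real handler
in arch/x86/kernel/cpu/sgx/virt.c.

static vm_fault_t sgx_vepc_fault_sketch(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct sgx_vepc *vepc = vma->vm_private_data;
	int ret;

	/* Fail fast once this vEPC instance was picked as an OOM victim. */
	if (test_bit(SGX_VEPC_NO_MEMORY, &vepc->flags))
		return VM_FAULT_SIGBUS;

	mutex_lock(&vepc->lock);
	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
	mutex_unlock(&vepc->lock);

	return ret ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
}

The open question in the thread is where such a flag would be set, i.e. whether
virtual EPC pages (or instances) should also live on an unreclaimable list so
sgx_epc_oom() can find them.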
  
Haitao Huang Oct. 10, 2023, 1:38 a.m. UTC | #7
On Mon, 09 Oct 2023 20:18:00 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
>> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
>> > >
>> > > Introduce the OOM path for killing an enclave with a reclaimer that  
>> is
>> > > no
>> > > longer able to reclaim enough EPC pages. Find a victim enclave,  
>> which
>> > > will be an enclave with only "unreclaimable" EPC pages left in the
>> > > cgroup LRU lists. Once a victim is identified, mark the enclave as  
>> OOM
>> > > and zap the enclave's entire page range, and drain all mm  
>> references in
>> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or
>> > > reloading any pages in all paths, or creating any new mappings.
>> > >
>> > > The OOM killing path may race with the reclaimers: in some cases,  
>> the
>> > > victim enclave is in the process of reclaiming the last EPC pages  
>> when
>> > > OOM happens, that is, all pages other than SECS and VA pages are in
>> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires  
>> access to
>> > > the enclave backing, VA pages as well as SECS. So the OOM killer  
>> does
>> > > not directly release those enclave resources, instead, it lets all
>> > > reclaiming in progress to finish, and relies (as currently done) on
>> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the
>> > > final cleanup.
>> > >
>> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> > > Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Cc: Sean Christopherson <seanjc@google.com>
>> > > ---
>> > > V5:
>> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
>> > >
>> > > V4:
>> > > - Updates for patch reordering and typo fixes.
>> > >
>> > > V3:
>> > > - Rebased to use the new VMA_ITERATOR to zap VMAs.
>> > > - Fixed the racing cases by blocking new page allocation/mapping and
>> > > reloading when enclave is marked for OOM. And do not release any  
>> enclave
>> > > resources other than draining mm_list entries, and let pages in
>> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
>> > > - Due to above changes, also removed the no-longer needed  
>> encl->lock in
>> > > the OOM path which was causing deadlocks reported by the lock  
>> prover.
>> > >
>> >
>> > [...]
>> >
>> > > +
>> > > +/**
>> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> > > + * @lru:	LRU that is low
>> > > + *
>> > > + * Return:	%true if a victim was found and kicked.
>> > > + */
>> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > > +{
>> > > +	struct sgx_epc_page *victim;
>> > > +
>> > > +	spin_lock(&lru->lock);
>> > > +	victim = sgx_oom_get_victim(lru);
>> > > +	spin_unlock(&lru->lock);
>> > > +
>> > > +	if (!victim)
>> > > +		return false;
>> > > +
>> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > > +		return sgx_oom_encl_page(victim->encl_page);
>> > > +
>> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > > +		return sgx_oom_encl(victim->encl);
>> >
>> > I hate to bring this up, at least at this stage, but I am wondering  
>> why
>> > we need
>> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
>> > "enclave_list" instead?
>> >
>> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc
>> > cgroup
>> > controller"), if I am not missing anything, the whole "unreclaimable"
>> > list is
>> > just used to find the victim enclave when OOM needs to be done.   
>> Thus, I
>> > don't
>> > see why "enclave_list" cannot be used to achieve this.
>> >
>> > The reason that I am asking is because it seems using "enclave_list"  
>> we
>> > can
>> > simplify the code.  At least the patches related to track VA/SECS  
>> pages,
>> > and the
>> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated
>> > completely.
>> > Using "enclave_list", I guess you just need to put the enclave to the
>> > current
>> > EPC cgroup when SECS page is allocated.
>> >
>> Later the hosting process could migrated/reassigned to another cgroup?
>> What to do when the new cgroup is OOM?
>>
>
> You addressed in the documentation, no?
>
> +Migration
> +---------
> +
> +Once an EPC page is charged to a cgroup (during allocation), it
> +remains charged to the original cgroup until the page is released
> +or reclaimed.  Migrating a process to a different cgroup doesn't
> +move the EPC charges that it incurred while in the previous cgroup
> +to its new cgroup.

Should we kill the enclave, though, because some VA pages may be in the new
group?

Haitao
  
Haitao Huang Oct. 10, 2023, 1:42 a.m. UTC | #8
Hi Sean

On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
<seanjc@google.com> wrote:

> On Mon, Oct 09, 2023, Kai Huang wrote:
>> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > +/**
>> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> > + * @lru:	LRU that is low
>> > + *
>> > + * Return:	%true if a victim was found and kicked.
>> > + */
>> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > +{
>> > +	struct sgx_epc_page *victim;
>> > +
>> > +	spin_lock(&lru->lock);
>> > +	victim = sgx_oom_get_victim(lru);
>> > +	spin_unlock(&lru->lock);
>> > +
>> > +	if (!victim)
>> > +		return false;
>> > +
>> > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > +		return sgx_oom_encl_page(victim->encl_page);
>> > +
>> > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > +		return sgx_oom_encl(victim->encl);
>>
>> I hate to bring this up, at least at this stage, but I am wondering why  
>> we need
>> to put VA and SECS pages to the unreclaimable list, but cannot keep an
>> "enclave_list" instead?
>
> The motivation for tracking EPC pages instead of enclaves was so that  
> the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
> EPC code
> didn't actually kill the VM process, it instead just freed all of the  
> EPC pages
> and abused the SGX architecture to effectively make the guest recreate  
> all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).
>
> Looks like y'all punted on that with:
>
>   The EPC pages allocated for KVM guests by the virtual EPC driver are  
> not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by  
> any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user processs (e.g., QEMU) launching  
> the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual  
> EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to  
> the
>   user process which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced  
> somehow.  Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose  
> of cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> userspace running
> encalves in a VM could continue on and refuse to give up its EPC, and  
> thus run above
> its limit in perpetuity.
>
The cgroup would refuse to allocate more when the limit is reached, so VMs
cannot run above the limit.

IIRC VMs only support a static EPC size right now; reaching the limit at launch
means the EPC size given on the QEMU command line is not appropriate. So the VM
should not launch, hence the current behavior.

[All EPC pages in the guest are allocated on page faults caused by the
sanitization process in the guest kernel during init, which is part of the VM
launch process. So SIGBUS will turn into a failed VM launch.]

Once it is launched, the guest kernel has a 'total capacity' given by the
static value from the QEMU option. It would start paging when that is used up,
and would never ask the host for more.

In the future, with dynamic EPC for running guests, QEMU could handle
allocation failure and pass SIGBUS to the running guest kernel.  Is that a
correct understanding?


> I can see userspace wanting to explicitly terminate the VM instead of  
> "silently"
> the VM's enclaves, but that seems like it should be a knob in the  
> virtual EPC
> code.

If my understanding above is correct, and I understand your statement
correctly, then I don't see that we really need a separate knob for the vEPC
code. A running guest reaching a cgroup limit (assuming dynamic allocation is
implemented) should not automatically translate into killing the VM. Instead,
it's user space's job to work with the guest to handle allocation failure. The
guest could page out and kill enclaves.

Haitao
  
Kai Huang Oct. 10, 2023, 2:12 a.m. UTC | #9
> > > > 
> > > Later the hosting process could migrated/reassigned to another cgroup?
> > > What to do when the new cgroup is OOM?
> > > 
> > 
> > You addressed in the documentation, no?
> > 
> > +Migration
> > +---------
> > +
> > +Once an EPC page is charged to a cgroup (during allocation), it
> > +remains charged to the original cgroup until the page is released
> > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > +move the EPC charges that it incurred while in the previous cgroup
> > +to its new cgroup.
> 
> Should we kill the enclave though because some VA pages may be in the new  
> group?
> 

I guess that's acceptable?

And is there any difference if you keep VA/SECS pages on the unreclaimable
list? If you migrate one enclave to another cgroup, the old EPC pages stay in
the old cgroup while new ones are charged to the new group, IIUC.

I am not a cgroup expert, but from searching some old threads it appears this
isn't a supported model:

https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
  
Kai Huang Oct. 10, 2023, 2:23 a.m. UTC | #10
On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
> <seanjc@google.com> wrote:
> 
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:	LRU that is low
> > > > + *
> > > > + * Return:	%true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +	struct sgx_epc_page *victim;
> > > > +
> > > > +	spin_lock(&lru->lock);
> > > > +	victim = sgx_oom_get_victim(lru);
> > > > +	spin_unlock(&lru->lock);
> > > > +
> > > > +	if (!victim)
> > > > +		return false;
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +		return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why  
> > > we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that  
> > the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
> > EPC code
> > didn't actually kill the VM process, it instead just freed all of the  
> > EPC pages
> > and abused the SGX architecture to effectively make the guest recreate  
> > all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are  
> > not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by  
> > any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching  
> > the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual  
> > EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to  
> > the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > somehow.  Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose  
> > of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and  
> > thus run above
> > its limit in perpetuity.
> > 
> Cgroup would refuse to allocate more when limit is reached so VMs can not  
> run above limit.
> 
> IIRC VMs only support static EPC size right now, reaching limit at launch  
> means the EPC size given in command line for QEMU is not appropriate. So  
> VM should not launch, hence the current behavior.
> 
> [all EPC pages in guest are allocated on page fault caused by the  
> sensitization process in guest kernel during init, which is part of the VM  
> Launch process. So SIGNBUS will turn into failed VM launch.]
> 
> Once it is launched, guest kernel would have 'total capacity' given by the  
> static value from QEMU option. And it would start paging when it is used  
> up, never would ask for more from host.
> 
> For future with dynamic EPC for running guests, QEMU could handle  
> allocation failure and pass SIGBUS to the running guest kernel.  Is that  
> correct understanding?
> 
> 
> > I can see userspace wanting to explicitly terminate the VM instead of  
> > "silently"
> > the VM's enclaves, but that seems like it should be a knob in the  
> > virtual EPC
> > code.
> 
> If my understanding above is correct and understanding your statement  
> above correctly, then don't see we really need separate knob for vEPC  
> code. Reaching a cgroup limit by a running guest (assuming dynamic  
> allocation implemented) should not translate automatically killing the VM.  
> Instead, it's user space job to work with guest to handle allocation  
> failure. Guest could page and kill enclaves.
> 

IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:

1) misc.max = 100M
2) Launch VMs with total virtual EPC size = 100M	<- success
3) misc.max = 50M

Step 3) will also succeed, but nothing will happen; the VMs will still be
holding 100M of EPC.

You need to somehow track virtual EPC and kill the VM instead.

(Or somehow fail step 3), if that is also an acceptable option.)
  
Haitao Huang Oct. 10, 2023, 1:26 p.m. UTC | #11
On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
>> Hi Sean
>>
>> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
>> <seanjc@google.com> wrote:
>>
>> > On Mon, Oct 09, 2023, Kai Huang wrote:
>> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > > +/**
>> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
>> LRU
>> > > > + * @lru:	LRU that is low
>> > > > + *
>> > > > + * Return:	%true if a victim was found and kicked.
>> > > > + */
>> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > > > +{
>> > > > +	struct sgx_epc_page *victim;
>> > > > +
>> > > > +	spin_lock(&lru->lock);
>> > > > +	victim = sgx_oom_get_victim(lru);
>> > > > +	spin_unlock(&lru->lock);
>> > > > +
>> > > > +	if (!victim)
>> > > > +		return false;
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > > > +		return sgx_oom_encl_page(victim->encl_page);
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > > > +		return sgx_oom_encl(victim->encl);
>> > >
>> > > I hate to bring this up, at least at this stage, but I am wondering  
>> why
>> > > we need
>> > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
>> an
>> > > "enclave_list" instead?
>> >
>> > The motivation for tracking EPC pages instead of enclaves was so that
>> > the EPC
>> > OOM-killer could "kill" VMs as well as host-owned enclaves.  The  
>> virtual
>> > EPC code
>> > didn't actually kill the VM process, it instead just freed all of the
>> > EPC pages
>> > and abused the SGX architecture to effectively make the guest recreate
>> > all its
>> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
>> >
>> > Looks like y'all punted on that with:
>> >
>> >   The EPC pages allocated for KVM guests by the virtual EPC driver are
>> > not
>> >   reclaimable by the host kernel [5]. Therefore they are not tracked  
>> by
>> > any
>> >   LRU lists for reclaiming purposes in this implementation, but they  
>> are
>> >   charged toward the cgroup of the user processs (e.g., QEMU)  
>> launching
>> > the
>> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual
>> > EPC
>> >   driver will stop allocating more EPC for the VM, and return SIGBUS  
>> to
>> > the
>> >   user process which would abort the VM launch.
>> >
>> > which IMO is a hack, unless returning SIGBUS is actually enforced
>> > somehow.  Relying
>> > on userspace to be kind enough to kill its VMs kinda defeats the  
>> purpose
>> > of cgroup
>> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,
>> > userspace running
>> > encalves in a VM could continue on and refuse to give up its EPC, and
>> > thus run above
>> > its limit in perpetuity.
>> >
>> Cgroup would refuse to allocate more when limit is reached so VMs can  
>> not
>> run above limit.
>>
>> IIRC VMs only support static EPC size right now, reaching limit at  
>> launch
>> means the EPC size given in command line for QEMU is not appropriate. So
>> VM should not launch, hence the current behavior.
>>
>> [all EPC pages in guest are allocated on page fault caused by the
>> sensitization process in guest kernel during init, which is part of the  
>> VM
>> Launch process. So SIGNBUS will turn into failed VM launch.]
>>
>> Once it is launched, guest kernel would have 'total capacity' given by  
>> the
>> static value from QEMU option. And it would start paging when it is used
>> up, never would ask for more from host.
>>
>> For future with dynamic EPC for running guests, QEMU could handle
>> allocation failure and pass SIGBUS to the running guest kernel.  Is that
>> correct understanding?
>>
>>
>> > I can see userspace wanting to explicitly terminate the VM instead of
>> > "silently"
>> > the VM's enclaves, but that seems like it should be a knob in the
>> > virtual EPC
>> > code.
>>
>> If my understanding above is correct and understanding your statement
>> above correctly, then don't see we really need separate knob for vEPC
>> code. Reaching a cgroup limit by a running guest (assuming dynamic
>> allocation implemented) should not translate automatically killing the  
>> VM.
>> Instead, it's user space job to work with guest to handle allocation
>> failure. Guest could page and kill enclaves.
>>
>
> IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:
>
> 1) misc.max = 100M
> 2) Launch VMs with total virtual EPC size = 100M	<- success
> 3) misc.max = 50M
>
> 3) will also succeed, but nothing will happen, the VMs will be still  
> holding
> 100M EPC.
>
> You need to somehow track virtual EPC and kill VM instead.
>
> (or somehow fail to do 3) if it is also an acceptable option.)
>
Thanks for explaining it.

There is an error code to return from max_write. I can add that to the
callback definition too, and fail the write when the limit can't be enforced
for any reason.
Would like some community feedback if this is acceptable though.

I think to solve it ultimately, we need to be able to adjust the 'capacity' of
VMs, not just kill them, which is basically the same as dynamic allocation
support for VMs (being able to increase/decrease EPC size while they are
running). For now, we only have static allocation, so the max can't be enforced
once a VM is launched.

Haitao
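
An editor-added sketch of the idea of failing an unenforceable limit write
follows; everything in it is an assumption for illustration: the callback name,
the vepc page counter on struct sgx_epc_cgroup, and the expectation that the
misc cgroup core would propagate the error back to the misc.max writer.

/*
 * Editor's sketch: reject lowering the limit below what statically allocated
 * virtual EPC in this cgroup already holds, since that cannot be reclaimed.
 */
static int sgx_epc_cgroup_max_write_sketch(struct sgx_epc_cgroup *epc_cg,
					   u64 new_max)
{
	u64 committed = atomic64_read(&epc_cg->vepc_pages) * PAGE_SIZE;

	if (new_max < committed)
		return -EBUSY;

	return 0;
}

Whether returning an error such as -EBUSY here is acceptable is exactly the
community feedback being asked for above.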
  
Haitao Huang Oct. 10, 2023, 4:49 p.m. UTC | #12
On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
>> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
>> > On Mon, Oct 09, 2023, Kai Huang wrote:
>> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > > > +/**
>> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
>> LRU
>> > > > + * @lru:	LRU that is low
>> > > > + *
>> > > > + * Return:	%true if a victim was found and kicked.
>> > > > + */
>> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > > > +{
>> > > > +	struct sgx_epc_page *victim;
>> > > > +
>> > > > +	spin_lock(&lru->lock);
>> > > > +	victim = sgx_oom_get_victim(lru);
>> > > > +	spin_unlock(&lru->lock);
>> > > > +
>> > > > +	if (!victim)
>> > > > +		return false;
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > > > +		return sgx_oom_encl_page(victim->encl_page);
>> > > > +
>> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > > > +		return sgx_oom_encl(victim->encl);
>> > >
>> > > I hate to bring this up, at least at this stage, but I am wondering  
>> why we need
>> > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
>> an
>> > > "enclave_list" instead?
>> >
>> > The motivation for tracking EPC pages instead of enclaves was so that  
>> the EPC
>> > OOM-killer could "kill" VMs as well as host-owned enclaves. >
>>
>> Ah this seems a fair argument. :-)
>>
>> > The virtual EPC code
>> > didn't actually kill the VM process, it instead just freed all of the  
>> EPC pages
>> > and abused the SGX architecture to effectively make the guest  
>> recreate all its
>> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
>>
>> It returns SIGBUS.  SGX VM live migration also requires enough EPC  
>> being able to
>> be allocated on the destination machine to work AFAICT.
>>
>> >
>> > Looks like y'all punted on that with:
>> >
>> >   The EPC pages allocated for KVM guests by the virtual EPC driver  
>> are not
>> >   reclaimable by the host kernel [5]. Therefore they are not tracked  
>> by any
>> >   LRU lists for reclaiming purposes in this implementation, but they  
>> are
>> >   charged toward the cgroup of the user processs (e.g., QEMU)  
>> launching the
>> >   guest.  And when the cgroup EPC usage reaches its limit, the  
>> virtual EPC
>> >   driver will stop allocating more EPC for the VM, and return SIGBUS  
>> to the
>> >   user process which would abort the VM launch.
>> >
>> > which IMO is a hack, unless returning SIGBUS is actually enforced  
>> somehow. >
>>
>> "enforced" do you mean?
>>
>> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
>> allocate
>> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
>> hva_to_pfn(),
>> which eventually results in KVM returning -EFAULT to userspace in  
>> vcpu_run().
>> And Qemu then kills the VM with some nonsense message:
>>
>>         error: kvm run failed Bad address
>>         <dump guest registers nonsense>
>>
>> > Relying
>> > on userspace to be kind enough to kill its VMs kinda defeats the  
>> purpose of cgroup
>> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
>> userspace running
>> > encalves in a VM could continue on and refuse to give up its EPC, and  
>> thus run above
>> > its limit in perpetuity.
>>
>> >
>> > I can see userspace wanting to explicitly terminate the VM instead of  
>> "silently"
>> > the VM's enclaves, but that seems like it should be a knob in the  
>> virtual EPC
>> > code.
>
> I guess I slightly misunderstood your words.
>
> You mean we want to kill VM when the limit is set to be lower than  
> virtual EPC
> size.
>
> This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> EPC too?
>

That flag is set for enclaves; do you mean we set a similar flag in the vepc
struct?

> In the sgx_vepc_fault(), we check this flag at early time and return  
> SIGBUS if
> it is set.
>
> But this also requires keeping virtual EPC pages in some list, and  
> handles them
> in sgx_epc_oom() too.
>
> And for virtual EPC pages, I guess the "young" logic can be applied thus
> probably it's better to keep the actual virtual EPC pages to a  
> (separate?) list
> instead of keeping the virtual EPC instance.
>
> 	struct sgx_epc_lru {
> 		struct list_head reclaimable;
> 		struct sgx_encl *enclaves;
> 		struct list_head vepc_pages;
> 	}
>
> Or still tracking VA/SECS and virtual EPC pages in a single  
> unrecliamable list?
>

One LRU should be OK, as we only need the relative order in which they were
loaded? If a VA page is in front of a vEPC page, we just kill the host-side
enclave first before disrupting VMs in the same group. The group is not in a
good situation anyway, so the kernel just picks something reasonable to
force-kill.

Also, after rereading the sentences "The virtual EPC code didn't actually kill
the VM process, it instead just freed all of the EPC pages and abused the SGX
architecture to effectively make the guest recreate all its enclaves..."

Maybe by "kill" vm, Sean means EREMOVE the vepc pages in the unreclaimable  
LRU, which effectively make enclaves in guest receiving "EPC lost" error  
and those enclaves are forced to be reloaded (all reasonable user space  
impl should already handle that). Not sure about free *all* of EPC pages  
though. we should just EREMOVE enough to bring down the usage. And disable  
new allocation in sgx_vepc_fault as kai outlined above. It also means user  
space needs to inject/pass the SIGBUS to guest (I'm not really familiar  
with this part, or maybe it's already there?). @sean is that what you  
mean? Sorry I've been slow on understanding this.

If this is the case, some may still think it's too disruptive to the guest
because the guest did not have a chance to page out less active enclave pages.
But it's the user's limit to set, so it is justifiable as long as we document
this behavior.

Thanks to both of you for great insights.

Haitao
  
Haitao Huang Oct. 10, 2023, 5:05 p.m. UTC | #13
On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
>> > > >
>> > > Later the hosting process could migrated/reassigned to another  
>> cgroup?
>> > > What to do when the new cgroup is OOM?
>> > >
>> >
>> > You addressed in the documentation, no?
>> >
>> > +Migration
>> > +---------
>> > +
>> > +Once an EPC page is charged to a cgroup (during allocation), it
>> > +remains charged to the original cgroup until the page is released
>> > +or reclaimed.  Migrating a process to a different cgroup doesn't
>> > +move the EPC charges that it incurred while in the previous cgroup
>> > +to its new cgroup.
>>
>> Should we kill the enclave though because some VA pages may be in the  
>> new
>> group?
>>
>
> I guess acceptable?
>
> And any difference if you keep VA/SECS to unreclaimabe list?

Tracking VA/SECS pages allows every cgroup in which an enclave has allocations
to identify the enclave by following the back pointer and kill it as needed.

> If you migrate one
> enclave to another cgroup, the old EPC pages stay in the old cgroup  
> while the
> new one is charged to the new group IIUC.
>
> I am not cgroup expert, but by searching some old thread it appears this  
> isn't a
> supported model:
>
> https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
>

IIUC it's a different problem here. If we don't track the allocated VA pages
in the new group, then an enclave that spans the two groups can't be killed by
the new group. If so, some enclave could just hide in a small group and never
get killed, while it keeps allocating in a different group?

Thanks
Haitao
  
Sean Christopherson Oct. 11, 2023, 12:01 a.m. UTC | #14
On Tue, Oct 10, 2023, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> > > Hi Sean
> > > 
> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
> > > <seanjc@google.com> wrote:
> > > > I can see userspace wanting to explicitly terminate the VM instead of
> > > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the
> > > > virtual EPC
> > > > code.
> > > 
> > > If my understanding above is correct and understanding your statement
> > > above correctly, then don't see we really need separate knob for vEPC
> > > code. Reaching a cgroup limit by a running guest (assuming dynamic
> > > allocation implemented) should not translate automatically killing
> > > the VM.
> > > Instead, it's user space job to work with guest to handle allocation
> > > failure. Guest could page and kill enclaves.
> > > 
> > 
> > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:
> > 
> > 1) misc.max = 100M
> > 2) Launch VMs with total virtual EPC size = 100M	<- success
> > 3) misc.max = 50M
> > 
> > 3) will also succeed, but nothing will happen, the VMs will be still
> > holding 100M EPC.
> > 
> > You need to somehow track virtual EPC and kill VM instead.
> > 
> > (or somehow fail to do 3) if it is also an acceptable option.)
> > 
> Thanks for explaining it.
> 
> There is an error code to return from max_write. I can add that too to the
> callback definition and fail it when it can't be enforced for any reason.
> Would like some community feedback if this is acceptable though.

That likely isn't acceptable.  E.g. create a cgroup with both a host enclave and
virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB, and the
host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The cgroup
code would reclaim all of the enclave's reclaimable EPC, and then kill the enclave
because it's still over the limit.  And then fail the max_write because the cgroup
is *still* over the limit.  So in addition to burning a lot of cycles, from
userspace's perspective its enclave was killed for no reason, as the new limit
wasn't actually set.

> I think to solve it ultimately, we need be able to adjust 'capacity' of VMs
> not to just kill them, which is basically the same as dynamic allocation
> support for VMs (being able to increase/decrease epc size when it is
> running). For now, we only have static allocation so max can't be enforced
> once it is launched.

No, reclaiming virtual EPC is not a requirement.  VMM EPC oversubscription is
insanely complex, and I highly doubt any users actually want to oversubscribe VMs.

There are use cases for cgroups beyond oversubscribing/swapping, e.g. privileged
userspace may set limits on a container to ensure the container doesn't *accidentally*
consume more EPC than it was allotted, e.g. due to a configuration bug that created
a VM with more EPC than it was supposed to have.  

My comments on virtual EPC vs. cgroups is much more about having sane, well-defined
behavior, not about saying the kernel actually needs to support oversubscribing EPC
for KVM guests.
  
Kai Huang Oct. 11, 2023, 12:31 a.m. UTC | #15
On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > > > > > 
> > > > > Later the hosting process could migrated/reassigned to another  
> > > cgroup?
> > > > > What to do when the new cgroup is OOM?
> > > > > 
> > > > 
> > > > You addressed in the documentation, no?
> > > > 
> > > > +Migration
> > > > +---------
> > > > +
> > > > +Once an EPC page is charged to a cgroup (during allocation), it
> > > > +remains charged to the original cgroup until the page is released
> > > > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > > > +move the EPC charges that it incurred while in the previous cgroup
> > > > +to its new cgroup.
> > > 
> > > Should we kill the enclave though because some VA pages may be in the  
> > > new
> > > group?
> > > 
> > 
> > I guess acceptable?
> > 
> > And any difference if you keep VA/SECS to unreclaimabe list?
> 
> Tracking VA/SECS allows all cgroups, in which an enclave has allocation,  
> to identify the enclave following the back pointer and kill it as needed.
> 
> > If you migrate one
> > enclave to another cgroup, the old EPC pages stay in the old cgroup  
> > while the
> > new one is charged to the new group IIUC.
> > 
> > I am not cgroup expert, but by searching some old thread it appears this  
> > isn't a
> > supported model:
> > 
> > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
> > 
> 
> IIUC it's a different problem here. If we don't track the allocated VAs in  
> the new group, then the enclave that spans the two groups can't be killed  
> by the new group. If so, some enclave could just hide in some small group  
> and never gets killed but keeps allocating in a different group?
> 

I mean that, from the link above, IIUC migrating an enclave among different cgroups
simply isn't a supported model, thus any bad behaviour isn't a big concern in terms
of decision making.
  
Kai Huang Oct. 11, 2023, 12:51 a.m. UTC | #16
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> > > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > > > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > > +/**
> > > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
> > > LRU
> > > > > > + * @lru:	LRU that is low
> > > > > > + *
> > > > > > + * Return:	%true if a victim was found and kicked.
> > > > > > + */
> > > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > > > +{
> > > > > > +	struct sgx_epc_page *victim;
> > > > > > +
> > > > > > +	spin_lock(&lru->lock);
> > > > > > +	victim = sgx_oom_get_victim(lru);
> > > > > > +	spin_unlock(&lru->lock);
> > > > > > +
> > > > > > +	if (!victim)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > > > +
> > > > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > > > +		return sgx_oom_encl(victim->encl);
> > > > > 
> > > > > I hate to bring this up, at least at this stage, but I am wondering  
> > > why we need
> > > > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
> > > an
> > > > > "enclave_list" instead?
> > > > 
> > > > The motivation for tracking EPC pages instead of enclaves was so that  
> > > the EPC
> > > > OOM-killer could "kill" VMs as well as host-owned enclaves. >
> > > 
> > > Ah this seems a fair argument. :-)
> > > 
> > > > The virtual EPC code
> > > > didn't actually kill the VM process, it instead just freed all of the  
> > > EPC pages
> > > > and abused the SGX architecture to effectively make the guest  
> > > recreate all its
> > > > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > > 
> > > It returns SIGBUS.  SGX VM live migration also requires enough EPC  
> > > being able to
> > > be allocated on the destination machine to work AFAICT.
> > > 
> > > > 
> > > > Looks like y'all punted on that with:
> > > > 
> > > >   The EPC pages allocated for KVM guests by the virtual EPC driver  
> > > are not
> > > >   reclaimable by the host kernel [5]. Therefore they are not tracked  
> > > by any
> > > >   LRU lists for reclaiming purposes in this implementation, but they  
> > > are
> > > >   charged toward the cgroup of the user processs (e.g., QEMU)  
> > > launching the
> > > >   guest.  And when the cgroup EPC usage reaches its limit, the  
> > > virtual EPC
> > > >   driver will stop allocating more EPC for the VM, and return SIGBUS  
> > > to the
> > > >   user process which would abort the VM launch.
> > > > 
> > > > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > > somehow. >
> > > 
> > > "enforced" do you mean?
> > > 
> > > Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
> > > allocate
> > > EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
> > > hva_to_pfn(),
> > > which eventually results in KVM returning -EFAULT to userspace in  
> > > vcpu_run().
> > > And Qemu then kills the VM with some nonsense message:
> > > 
> > >         error: kvm run failed Bad address
> > >         <dump guest registers nonsense>
> > > 
> > > > Relying
> > > > on userspace to be kind enough to kill its VMs kinda defeats the  
> > > purpose of cgroup
> > > > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > > userspace running
> > > > encalves in a VM could continue on and refuse to give up its EPC, and  
> > > thus run above
> > > > its limit in perpetuity.
> > > 
> > > > 
> > > > I can see userspace wanting to explicitly terminate the VM instead of  
> > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the  
> > > virtual EPC
> > > > code.
> > 
> > I guess I slightly misunderstood your words.
> > 
> > You mean we want to kill VM when the limit is set to be lower than  
> > virtual EPC
> > size.
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?
> 
> > In the sgx_vepc_fault(), we check this flag at early time and return  
> > SIGBUS if
> > it is set.
> > 
> > But this also requires keeping virtual EPC pages in some list, and  
> > handles them
> > in sgx_epc_oom() too.
> > 
> > And for virtual EPC pages, I guess the "young" logic can be applied thus
> > probably it's better to keep the actual virtual EPC pages to a  
> > (separate?) list
> > instead of keeping the virtual EPC instance.
> > 
> > 	struct sgx_epc_lru {
> > 		struct list_head reclaimable;
> > 		struct sgx_encl *enclaves;
> > 		struct list_head vepc_pages;
> > 	}
> > 
> > Or still tracking VA/SECS and virtual EPC pages in a single  
> > unrecliamable list?
> > 
> 
> One LRU should be OK as we only need relative order in which they are  
> loaded?

It's one way to do it, but not the only way.

The disadvantage of using one unreclaimable list is that for VA/SECS pages you don't
have the concept of "young/age".  The only purpose of getting such a page is to get
the owner enclave and kill it.

On the other hand, for virtual EPC pages we do have the concept of "young/age",
although I think it's acceptable that we don't implement this in the first version of
EPC cgroup support.

From this point of view, perhaps it's better to maintain VA/SECS (or enclaves)
separately from virtual EPC pages.  But it is not mandatory.

Another pro of maintaining them separately is that you don't need these flags to
distinguish them within a single list: SGX_EPC_OWNER_{ENCL|PAGE|VEPC}, which I
dislike.

(Btw, even if you track VA/SECS pages in the unreclaimable list, given they both have
'enclave' as the owner, do you still need SGX_EPC_OWNER_ENCL and
SGX_EPC_OWNER_PAGE?)

That being said, personally I'd slightly prefer to keep them in separate lists,
but I can see it also depends on how you want to implement the algorithm for
selecting a victim.  So I am not going to insist for now but will leave it to you
to decide.

Just remember to justify whichever choice you make.
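
For concreteness, a rough sketch of the "separate lists" layout I have in mind;
the lock and the reclaimable/unreclaimable split follow what the series already
has, while the two split unreclaimable lists and their names are illustrative
only:

	struct sgx_epc_lru_lists {
		spinlock_t lock;
		struct list_head reclaimable;
		/*
		 * Illustrative: keep host VA/SECS pages (or whole enclaves)
		 * and guest virtual EPC pages apart, so no SGX_EPC_OWNER_*
		 * flag is needed to tell them apart when picking a victim.
		 */
		struct list_head unreclaimable_encl;
		struct list_head unreclaimable_vepc;
	};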

[...]

> It also means user  
> space needs to inject/pass the SIGBUS to guest (I'm not really familiar  
> with this part, or maybe it's already there?). @sean is that what you  
> mean? Sorry I've been slow on understanding this.

I don't think we need to do this for now.  Killing the VM is an acceptable start
to me.  Just make sure we can kill some VM.
  
Kai Huang Oct. 11, 2023, 1:14 a.m. UTC | #17
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?

Yes.
  
Haitao Huang Oct. 11, 2023, 3:02 p.m. UTC | #18
On Tue, 10 Oct 2023 19:01:25 -0500, Sean Christopherson  
<seanjc@google.com> wrote:

> On Tue, Oct 10, 2023, Haitao Huang wrote:
>> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
>> > > Hi Sean
>> > >
>> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
>> > > <seanjc@google.com> wrote:
>> > > > I can see userspace wanting to explicitly terminate the VM  
>> instead of
>> > > > "silently"
>> > > > the VM's enclaves, but that seems like it should be a knob in the
>> > > > virtual EPC
>> > > > code.
>> > >
>> > > If my understanding above is correct and understanding your  
>> statement
>> > > above correctly, then don't see we really need separate knob for  
>> vEPC
>> > > code. Reaching a cgroup limit by a running guest (assuming dynamic
>> > > allocation implemented) should not translate automatically killing
>> > > the VM.
>> > > Instead, it's user space job to work with guest to handle allocation
>> > > failure. Guest could page and kill enclaves.
>> > >
>> >
>> > IIUC Sean was talking about changing misc.max _after_ you launch SGX  
>> VMs:
>> >
>> > 1) misc.max = 100M
>> > 2) Launch VMs with total virtual EPC size = 100M	<- success
>> > 3) misc.max = 50M
>> >
>> > 3) will also succeed, but nothing will happen, the VMs will be still
>> > holding 100M EPC.
>> >
>> > You need to somehow track virtual EPC and kill VM instead.
>> >
>> > (or somehow fail to do 3) if it is also an acceptable option.)
>> >
>> Thanks for explaining it.
>>
>> There is an error code to return from max_write. I can add that too to  
>> the
>> callback definition and fail it when it can't be enforced for any  
>> reason.
>> Would like some community feedback if this is acceptable though.
>
> That likely isn't acceptable.  E.g. create a cgroup with both a host  
> enclave and
> virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB,  
> and the
> host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The  
> cgroup
> code would reclaim all of the enclave's reclaimable EPC, and then kill  
> the enclave
> because it's still over the limit.  And then fail the max_write because  
> the cgroup
> is *still* over the limit.  So in addition to burning a lot of cycles,  
> from
> userspace's perspective its enclave was killed for no reason, as the new  
> limit
> wasn't actually set.

I was thinking that, before reclaiming enclave pages, if we know the untracked
vEPC usage (current minus tracked) is already larger than the limit, we just
return an error without enforcing the limit. That way the user also knows
something is wrong.

But I get that we want to be able to kill VMs to enforce the new, lower limit.
I assume we can't optimize the efficiency/priority of killing: enclave pages
will be reclaimed first no matter what, just because they are reclaimable;
there are no good rules for choosing a victim between enclaves and VMs in
your example, so enclaves could still be killed before VMs.

>> I think to solve it ultimately, we need be able to adjust 'capacity' of  
>> VMs
>> not to just kill them, which is basically the same as dynamic allocation
>> support for VMs (being able to increase/decrease epc size when it is
>> running). For now, we only have static allocation so max can't be  
>> enforced
>> once it is launched.
>
> No, reclaiming virtual EPC is not a requirement.  VMM EPC  
> oversubscription is
> insanely complex, and I highly doubt any users actually want to  
> oversubcribe VMs.
>
> There are use cases for cgroups beyond oversubscribing/swapping, e.g.  
> privileged
> userspace may set limits on a container to ensure the container doesn't  
> *accidentally*
> consume more EPC than it was allotted, e.g. due to a configuration bug  
> that created
> a VM with more EPC than it was supposed to have.
>
> My comments on virtual EPC vs. cgroups is much more about having sane,  
> well-defined
> behavior, not about saying the kernel actually needs to support  
> oversubscribing EPC
> for KVM guests.

So here are the steps I see now; let me know if this is aligned with your
thinking or whether I'm off track.

0) When all that is left in a cgroup are enclave VA pages, SECS pages and
vEPC, the OOM kill process starts.
1) The cgroup identifies a victim vEPC for OOM kill (this could be before
or after enclaves are killed) and sets a new flag, VEPC_NO_MEMORY, in the
vEPC.
2) Call sgx_vepc_remove_all(), ignoring the failure count returned. This
performs a best effort to EREMOVE all normal pages used by the vEPC.
3) The guest may trigger faults again; return SIGBUS in sgx_vepc_fault when
VEPC_NO_MEMORY is set. Do the same for mmap (see the sketch below).
4) That would eventually lead to sgx_vepc_release() once the VM dies or is
killed by user space, and that does the final cleanup.

Q: should we reset the VEPC_NO_MEMORY flag if #4 never happens and the cgroup
usage drops below the limit? I suppose we can do it, but I'm not sure it is
sane because the VM would try to load as many pages as were configured
originally.

I'm still thinking about using one unreclaimable list; the justification is
simplicity and the lack of rules for selecting items from separate lists, but
I may change my mind if I find it inconvenient.

I'm not sure how age/youngness can be accounted for guest pages, though Kai
indicated we don't need that for the first version. So I assume we can deal
with it later and add a separate list for vEPC if needed for that reason.

Thanks a lot for your input.
Haitao
  
Haitao Huang Oct. 11, 2023, 4:04 p.m. UTC | #19
On Tue, 10 Oct 2023 19:31:19 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
>> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> >
>> > > > > >
>> > > > > Later the hosting process could migrated/reassigned to another
>> > > cgroup?
>> > > > > What to do when the new cgroup is OOM?
>> > > > >
>> > > >
>> > > > You addressed in the documentation, no?
>> > > >
>> > > > +Migration
>> > > > +---------
>> > > > +
>> > > > +Once an EPC page is charged to a cgroup (during allocation), it
>> > > > +remains charged to the original cgroup until the page is released
>> > > > +or reclaimed.  Migrating a process to a different cgroup doesn't
>> > > > +move the EPC charges that it incurred while in the previous  
>> cgroup
>> > > > +to its new cgroup.
>> > >
>> > > Should we kill the enclave though because some VA pages may be in  
>> the
>> > > new
>> > > group?
>> > >
>> >
>> > I guess acceptable?
>> >
>> > And any difference if you keep VA/SECS to unreclaimabe list?
>>
>> Tracking VA/SECS allows all cgroups, in which an enclave has allocation,
>> to identify the enclave following the back pointer and kill it as  
>> needed.
>>
>> > If you migrate one
>> > enclave to another cgroup, the old EPC pages stay in the old cgroup
>> > while the
>> > new one is charged to the new group IIUC.
>> >
>> > I am not cgroup expert, but by searching some old thread it appears  
>> this
>> > isn't a
>> > supported model:
>> >
>> > https://lore.kernel.org/lkml/YEyR9181Qgzt+Ps9@mtj.duckdns.org/
>> >
>>
>> IIUC it's a different problem here. If we don't track the allocated VAs  
>> in
>> the new group, then the enclave that spans the two groups can't be  
>> killed
>> by the new group. If so, some enclave could just hide in some small  
>> group
>> and never gets killed but keeps allocating in a different group?
>>
>
> I mean from the link above IIUC migrating enclave among different  
> cgroups simply
> isn't a supported model, thus any bad behaviour isn't a big concern in  
> terms of
> decision making.

If we leave some pages in a cgroup unkillable, we are in the same situation of
not being able to enforce a cgroup limit as we are if we don't kill VMs for
lower limits.

I think not supporting migration of pages between cgroups should not leave a
gap in enforcement, just like we don't want an enforcement gap if we let VMs
hold on to pages once they are launched.

Haitao
  
Haitao Huang Oct. 12, 2023, 1:27 p.m. UTC | #20
On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> wrote:
[...]
> (btw, even you track VA/SECS pages in unreclaimable list, given they  
> both have
> 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> SGX_EPC_OWNER_PAGE ?)

Let me think about it; there might also be a way to just track encl objects
instead of unreclaimable pages.

I still don't get why we need to kill the VM rather than just removing enough
pages. Is it because the static allocation cannot be reclaimed?


If we always remove all vEPC pages/kill the VM, then we should not need to track
individual vEPC pages.

Thanks

Haitao
  
Kai Huang Oct. 16, 2023, 10:57 a.m. UTC | #21
On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> [...]
> > (btw, even you track VA/SECS pages in unreclaimable list, given they  
> > both have
> > 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> > SGX_EPC_OWNER_PAGE ?)
> 
> Let me think about it, there might be also a way just track encl objects  
> not unreclaimable pages.
> 
> I still not get why we need kill the VM not just remove just enough pages.  
> Is it due to the static allocation not able to reclaim?

We can choose to "just remove enough EPC pages".  The VM may or may not be
killed when it wants the EPC pages back, depending on whether the current EPC
cgroup can provide enough EPC pages or not.  And this depends on how we
implement the cgroup algorithm to reclaim EPC pages.

One problem could be: for an EPC cgroup that only has SGX VMs, you may end up
moving EPC pages from one VM to another and then vice versa endlessly, because
you never actually mark any VM as dead, the way OOM does to normal enclaves.

From this point, you still need a way to kill a VM, IIUC.

I think the key point of virtual EPC vs cgroup, as quoted from Sean, should be
"having sane, well-defined behavior".

Does "just remove enough EPC pages" meet this?  If the problem mentioned above
can be avoided, I suppose so?  So if there's an easy way to achieve, I guess it
can be an option too.

But for the initial support, IMO we are not looking for a perfect yet complicated
solution.  I would say, from a review point of view, it's preferable to have a
simple implementation that achieves a not-perfect, but consistent, well-defined
behaviour.

So to me it looks like killing the VM when the cgroup cannot reclaim any more EPC
pages is a simple option.

But I might have missed something, especially since middle of last week I have
been having fever and headache :-)

So as mentioned above, you can try other alternatives, but please avoid
complicated ones.

Also, I guess it would be helpful if we could understand the typical SGX app and/or
SGX VM deployment in the EPC cgroup use case.  This may help us justify why the
EPC cgroup's algorithm for selecting a victim is reasonable.
  
Kai Huang Oct. 16, 2023, 11:02 a.m. UTC | #22
On Wed, 2023-10-11 at 01:14 +0000, Huang, Kai wrote:
> On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > > 
> > > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > > EPC too?
> > > 
> > 
> > That flag is set for enclaves, do you mean we set similar flag in vepc  
> > struct?
> 
> Yes.

I missed the "ENCL" part but only noted the "NO_MEMORY" part, so I guess it
should not be used directly for vEPC.  So if it is needed, SGX_VEPC_NO_MEMORY,
or a simple 'bool dead' or similar in 'struct sgx_vepc' is more suitable.

As I said I was fighting with fever and headache last week :-)
  
Haitao Huang Oct. 16, 2023, 7:52 p.m. UTC | #23
On Mon, 16 Oct 2023 05:57:36 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
>> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>> [...]
>> > (btw, even you track VA/SECS pages in unreclaimable list, given they
>> > both have
>> > 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
>> > SGX_EPC_OWNER_PAGE ?)
>>
>> Let me think about it, there might be also a way just track encl objects
>> not unreclaimable pages.
>>
>> I still not get why we need kill the VM not just remove just enough  
>> pages.
>> Is it due to the static allocation not able to reclaim?
>
> We can choose to "just remove enough EPC pages".  The VM may or may not  
> be
> killed when it wants the EPC pages back, depending on whether the  
> current EPC
> cgroup can provide enough EPC pages or not.  And this depends on how we
> implement the cgroup algorithm to reclaim EPC pages.
>
> One problem could be: for a EPC cgroup only has SGX VMs, you may end up  
> with
> moving EPC pages from one VM to another and then vice versa endlessly,

This could be a valid use case though if people intend to share EPCs  
between two VMs. Understand no one would be able to use VMs this way  
currently with the static allocation.

> because
> you never really actually mark any VM to be dead just like OOM does to  
> the
> normal enclaves.
>
> From this point, you still need a way to kill a VM, IIUC.
>
> I think the key point of virtual EPC vs cgroup, as quoted from Sean,  
> should be
> "having sane, well-defined behavior".
>
> Does "just remove enough EPC pages" meet this?  If the problem mentioned  
> above
> can be avoided, I suppose so?  So if there's an easy way to achieve, I  
> guess it
> can be an option too.
>
> But for the initial support, IMO we are not looking for a perfect but yet
> complicated solution.  I would say, from review's point of view, it's  
> preferred
> to have a simple implementation to achieve a not-prefect, but  
> consistent, well-
> defined behaviour.
>
> So to me looks killing the VM when cgroup cannot reclaim any more EPC  
> pages is a
> simple option.
>
> But I might have missed something, especially since middle of last week  
> I have
> been having fever and headache :-)
>
> So as mentioned above, you can try other alternatives, but please avoid
> complicated ones.
>
> Also, I guess it will be helpful if we can understand the typical SGX  
> app and/or
> SGX VM deployment under EPC cgroup use case.  This may help us on  
> justifying why
> the EPC cgroup algorithm to select victim is reasonable.
>


From this perspective, I think the current implementation is
"well-defined": EPC cgroup limits for VMs are only enforced at VM launch
time, not at runtime. In practice, an SGX VM can be launched only with a fixed
EPC size, and all of that EPC is fully committed to the VM once launched.
Because of that, I imagine people are using VMs primarily to partition the
physical EPC, i.e., the static size itself is the 'limit' for the workload
of a single VM, and they do not expect EPC to be taken away at runtime.

So killing does not really add much value for the existing usages IIUC.

That said, I don't anticipate that adding enforcement by killing VMs at
runtime would break such usages, as the admin/user can simply choose to set the
limit equal to the static size, launch the VM, and forget about it.

Given that, I'll propose an add-on patch to this series as an RFC and gather
some feedback from the community before we decide whether it needs to be
included in the first version, or whether we can skip it until we have EPC
reclaiming for VMs.

Thanks
Haitao
  
Kai Huang Oct. 16, 2023, 9:09 p.m. UTC | #24
> 
> 
>  From this perspective, I think the current implementation is  
> "well-defined": EPC cgroup limits for VMs are only enforced at VM launch  
> time, not runtime. In practice,  SGX VM can be launched only with fixed  
> EPC size and all those EPCs are fully committed to the VM once launched.  
> Because of that, I imagine people are using VMs to primarily partition the  
> physical EPCs, i.e, the static size itself is the 'limit' for the workload  
> of a single VM and not expecting EPCs taken away at runtime.
> 
> So killing does not really add much value for the existing usages IIUC.

It's not about adding value to the existing usages, it's about fixing the issue
when we lower the EPC limit to a point that is less than total virtual EPC size.

It's a design issue, or simply a bug in the current implementation we need to
fix.

> 
> That said, I don't anticipate adding the enforcement of killing VMs at  
> runtime would break such usages as admin/user can simply choose to set the  
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have  
> some feedback from community before we decide if that needs be included in  
> first version or we can skip it until we have EPC reclaiming for VMs.

I don't understand what is the "add-on" patch you are talking about.

I mentioned the "typical deployment thing" is that can help us understand which
algorithm is better way to select the victim.  But no matter what we choose, we
still need to fix the bug mentioned above here.

I really think you should just go this simple way:

When you want to take EPC back from a VM, kill the VM.

Another bad thing about "just removing EPC pages from the VM" is that the enclaves
in the VM would suffer a "sudden loss of EPC", or even worse, suffer it at a high
frequency.  Although we depend on that for supporting SGX VM live migration, it
needs to be avoided if possible.
  
Sean Christopherson Oct. 16, 2023, 9:32 p.m. UTC | #25
On Mon, Oct 16, 2023, Haitao Huang wrote:
> From this perspective, I think the current implementation is "well-defined":
> EPC cgroup limits for VMs are only enforced at VM launch time, not runtime.
> In practice,  SGX VM can be launched only with fixed EPC size and all those
> EPCs are fully committed to the VM once launched.

Fully committed doesn't mean those numbers are reflected in the cgroup.  A VM
scheduler can easily "commit" EPC to a guest, but allocate EPC on demand, i.e.
when the guest attempts to actually access a page.  Preallocating memory isn't
free, e.g. it can slow down guest boot, so it's entirely reasonable to have virtual
EPC be allocated on-demand.  Enforcing at launch time doesn't work for such setups,
because from the cgroup's perspective, the VM is using 0 pages of EPC at launch.
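
For reference, a simplified sketch of that on-demand path, loosely based on the
existing sgx_vepc_fault()/__sgx_vepc_fault() (locking, error handling and the
EREMOVE-retry logic are omitted).  The allocation, and thus any cgroup charge,
would only happen when the guest first touches the page:

static vm_fault_t vepc_fault_sketch(struct vm_fault *vmf)
{
	struct sgx_vepc *vepc = vmf->vma->vm_private_data;
	unsigned long index = vmf->pgoff;
	struct sgx_epc_page *epc_page;

	/* Nothing was allocated at mmap()/launch; this is the first touch. */
	epc_page = sgx_alloc_epc_page(vepc, false);
	if (IS_ERR(epc_page))
		return VM_FAULT_SIGBUS;	/* surfaces as -EFAULT via KVM */

	if (xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL))) {
		sgx_free_epc_page(epc_page);
		return VM_FAULT_SIGBUS;
	}

	return vmf_insert_pfn(vmf->vma, vmf->address,
			      PFN_DOWN(sgx_get_epc_phys_addr(epc_page)));
}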

> Because of that, I imagine people are using VMs to primarily partition the
> physical EPCs, i.e, the static size itself is the 'limit' for the workload of
> a single VM and not expecting EPCs taken away at runtime.

If everything goes exactly as planned, sure.  But it's not hard to imagine some
configuration change way up the stack resulting in the hard limit for an EPC cgroup
being lowered.

> So killing does not really add much value for the existing usages IIUC.

As I said earlier, the behavior doesn't have to result in terminating a VM, e.g.
the virtual EPC code could provide a knob to send a signal/notification if the
owning cgroup has gone above the limit and the VM is targeted for forced reclaim.

> That said, I don't anticipate adding the enforcement of killing VMs at
> runtime would break such usages as admin/user can simply choose to set the
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have some
> feedback from community before we decide if that needs be included in first
> version or we can skip it until we have EPC reclaiming for VMs.

Gracefully *swapping* virtual EPC isn't required for oversubscribing virtual EPC.
Think of it like airlines overselling tickets.  The airline sells more tickets
than they have seats, and banks on some passengers canceling.  If too many people
show up, the airline doesn't swap passengers to the cargo bay, they just shunt them
to a different plane.

The same could be easily be done for hosts and virtual EPC.  E.g. if every VM
*might* use 1GiB, but in practice 99% of VMs only consume 128MiB, then it's not
too crazy to advertise 1GiB to each VM, but only actually carve out 256MiB per VM
in order to pack more VMs on a host.  If the host needs to free up EPC, then the
most problematic VMs can be migrated to a different host.

Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs?
AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC is
the primary use case for an EPC cgroup.

I don't have any skin in the game beyond my name being attached to some of the
patches, i.e. I certainly won't stand in the way.  I just don't understand why
you would go through all the effort of adding an EPC cgroup and then not go the
extra few steps to enforce limits for virtual EPC.  Compared to the complexity
of the rest of the series, that little bit seems quite trivial.
  
Haitao Huang Oct. 17, 2023, 12:09 a.m. UTC | #26
Hi Sean

On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson  
<seanjc@google.com> wrote:

> On Mon, Oct 16, 2023, Haitao Huang wrote:
>> From this perspective, I think the current implementation is  
>> "well-defined":
>> EPC cgroup limits for VMs are only enforced at VM launch time, not  
>> runtime.
>> In practice,  SGX VM can be launched only with fixed EPC size and all  
>> those
>> EPCs are fully committed to the VM once launched.
>
> Fully committed doesn't mean those numbers are reflected in the cgroup.   
> A VM
> scheduler can easily "commit" EPC to a guest, but allocate EPC on  
> demand, i.e.
> when the guest attempts to actually access a page.  Preallocating memory  
> isn't
> free, e.g. it can slow down guest boot, so it's entirely reasonable to  
> have virtual
> EPC be allocated on-demand.  Enforcing at launch time doesn't work for  
> such setups,
> because from the cgroup's perspective, the VM is using 0 pages of EPC at  
> launch.
>
Maybe I understood the current implementation wrong. From what I see, it is
impossible for vEPC not to be fully committed at launch time. The guest would
EREMOVE all pages during initialization, resulting in #PFs and all pages being
allocated. This essentially makes "prealloc=off" the same as "prealloc=on".
Unless you are talking about some custom OS or kernel other than upstream
Linux here?
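
(For reference, a rough sketch of the guest-side sanitization I'm referring to,
loosely based on the upstream __sgx_sanitize_pages() loop: every EPC page is
EREMOVEd once at boot, so every page gets touched and faulted in on the host no
matter what the prealloc setting is.)

/* Rough sketch of the upstream guest's boot-time EPC sanitization. */
static void sanitize_epc_sketch(struct list_head *dirty_page_list)
{
	struct sgx_epc_page *page, *tmp;

	list_for_each_entry_safe(page, tmp, dirty_page_list, list) {
		/* Touching the page here triggers the host's vEPC #PF path. */
		int ret = __eremove(sgx_get_epc_virt_addr(page));

		if (!ret)
			list_del(&page->list);
		/* The real code retries failures (e.g. SECS with children). */
	}
}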

Thanks
Haitao
  
Haitao Huang Oct. 17, 2023, 12:10 a.m. UTC | #27
On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote:
[...]

> still need to fix the bug mentioned above here.
>
> I really think you should just go this simple way:
>
> When you want to take EPC back from VM, kill the VM.
>

My only concern is that this is a compromise due to a current limitation (no
other sane way to take EPC from VMs). If we define this behavior and it
becomes a contract with user space, then we can't change it in the future.

On the other hand, my understanding is that the reason you want this behavior
is to enforce the EPC limit at runtime. I'm just not sure how important that is
and whether it is a real usage, given all the limitations of SGX VMs we have
(static EPC size, no migration).

Thanks

Haitao
  
Kai Huang Oct. 17, 2023, 1:34 a.m. UTC | #28
On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> [...]
> 
> > still need to fix the bug mentioned above here.
> > 
> > I really think you should just go this simple way:
> > 
> > When you want to take EPC back from VM, kill the VM.
> > 
> 
> My only concern is that this is a compromise due to current limitation (no  
> other sane way to take EPC from VMs). If we define this behavior and it  
> becomes a contract to user space, then we can't change in future.

Why do we need to "define such behaviour"?

This isn't some kind of kernel/userspace ABI IMHO, but only a kernel-internal
implementation detail.  Here a VM is similar to normal host enclaves.  You limit
the resource, and some host enclaves could be killed.  Similarly, a VM could be
killed too.

And supporting VMM EPC oversubscription doesn't mean a VM won't be killed.  The VM
can still be a target to kill after all of the VM's EPC pages have been swapped out.

> 
> On the other hand, my understanding the reason you want this behavior is  
> to enforce EPC limit at runtime. 
> 

No, I just thought this is a bug/issue that needs to be fixed.  If anyone believes
this is not a bug/issue then that's a separate discussion.
  
Mikko Ylinen Oct. 17, 2023, 11:49 a.m. UTC | #29
On Mon, Oct 16, 2023 at 02:32:31PM -0700, Sean Christopherson wrote:
> Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs?

People who work with containers: [1], [2]. 

> AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC is
> the primary use case for an EPC cgroup.

The common setup is that a cloud VM instance with vEPC is created and then
several SGX enclave containers are run simultaneously on that instance. EPC
cgroups are used to ensure that each container gets its own share of EPC
(and any attempt to go beyond the limit is reclaimed and charged to
the container's memcg). The same containers-with-enclaves use case is
applicable to bare metal also, though.

As far as Kubernetes-orchestrated containers are concerned, "in-place" resource
scaling is still in very early stages, which means that the cgroup values are
adjusted by *re-creating* the container. The hierarchies are also built
such that there's no mix of VMs w/ vEPC and enclaves in the same tree.

Mikko

[1] https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kristen@linux.intel.com/T/#m6d1c895534b4c0636f47c2d1620016b4c362bb9b
[2] https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kristen@linux.intel.com/T/#m37600e457b832feee6e8346aa74dcff8f21965f8
  
Haitao Huang Oct. 17, 2023, 12:58 p.m. UTC | #30
On Mon, 16 Oct 2023 20:34:57 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
>> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>> [...]
>>
>> > still need to fix the bug mentioned above here.
>> >
>> > I really think you should just go this simple way:
>> >
>> > When you want to take EPC back from VM, kill the VM.
>> >
>>
>> My only concern is that this is a compromise due to current limitation  
>> (no
>> other sane way to take EPC from VMs). If we define this behavior and it
>> becomes a contract to user space, then we can't change in future.
>
> Why do we need to "define such behaviour"?
>
> This isn't some kinda of kernel/userspace ABI IMHO, but only kernel  
> internal
> implementation.  Here VM is similar to normal host enclaves.  You limit  
> the
> resource, some host enclaves could be killed.  Similarly, VM could also  
> be
> killed too.
>
> And supporting VMM EPC oversubscription doesn't mean VM won't be  
> killed.  The VM
> can still be a target to kill after VM's all EPC pages have been swapped  
> out.
>
>>
>> On the other hand, my understanding the reason you want this behavior is
>> to enforce EPC limit at runtime.
>
> No I just thought this is a bug/issue needs to be fixed.  If anyone  
> believes
> this is not a bug/issue then it's a separate discussion.
>

AFAIK, before we introduced the max_write() callback in this series, no misc
controller would enforce the limit when misc.max is reduced, e.g., I don't
think CVMs would be killed when the ASID limit is reduced and the cgroup was
already full before the limit was reduced.

I think EPC pages given to VMs could have the same behavior: once they are given
to a guest, they are never taken back by the host. For enclaves on the host side,
pages are reclaimable, and that allows us to enforce the limit in a way similar
to memcg.

Thanks
Haitao
  
Sean Christopherson Oct. 17, 2023, 3:43 p.m. UTC | #31
On Mon, Oct 16, 2023, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson <seanjc@google.com>
> wrote:
> 
> > On Mon, Oct 16, 2023, Haitao Huang wrote:
> > > From this perspective, I think the current implementation is
> > > "well-defined":
> > > EPC cgroup limits for VMs are only enforced at VM launch time, not
> > > runtime.  In practice,  SGX VM can be launched only with fixed EPC size
> > > and all those EPCs are fully committed to the VM once launched.
> > 
> > Fully committed doesn't mean those numbers are reflected in the cgroup.  A
> > VM scheduler can easily "commit" EPC to a guest, but allocate EPC on
> > demand, i.e.  when the guest attempts to actually access a page.
> > Preallocating memory isn't free, e.g. it can slow down guest boot, so it's
> > entirely reasonable to have virtual EPC be allocated on-demand.  Enforcing
> > at launch time doesn't work for such setups, because from the cgroup's
> > perspective, the VM is using 0 pages of EPC at launch.
> > 
> Maybe I understood the current implementation wrong. From what I see, vEPC
> is impossible not fully commit at launch time. The guest would EREMOVE all
> pages during initialization resulting #PF and all pages allocated. This
> essentially makes "prealloc=off" the same as "prealloc=on".
> Unless you are talking about some custom OS or kernel other than upstream
> Linux here?

Yes, a customer could be running an older kernel, something other than Linux, a
custom kernel, an out-of-tree SGX driver, etc.  The host should never assume
anything about the guest kernel when it comes to correctness (unless the guest
kernel is controlled by the host).

Doing EREMOVE on all pages is definitely not mandatory, especially if the kernel
detects a hypervisor, i.e. knows its running as a guest.
  
Michal Koutný Oct. 17, 2023, 6:54 p.m. UTC | #32
Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> AFAIK, before we introducing max_write() callback in this series, no misc
> controller would possibly enforce the limit when misc.max is reduced. e.g. I
> don't think CVMs be killed when ASID limit is reduced and the cgroup was
> full before limit is reduced.

Yes, the misc controller was meant to be simple; current >= max serves to
prevent new allocations.

FTR, at some point in time memory.max was considered for reclaim control
of regular pages, but it turned out to be too coarse (OOM-killing
processes if the amount was not sensed correctly), and this eventually
evolved into the specific mechanism of memory.reclaim.
So I'm mentioning this in case that would be an interface with better
semantics for your use case (and misc.max writes can remain non-preemptive).

One more note -- I was quite confused when I read about OOM and _kill_ing in
the rest of the series but then found no such measure in the code
implementation. So I would suggest two terminological changes:

- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better-suited name
  than OOM (out of memory) condition,
- killing -- (unless you have an intention to implement process
  termination later) my current interpretation is that it is rather some
  aggressive unmapping within the address space, so a less confusing name
  for that would be "reclaim".


> I think EPC pages to VMs could have the same behavior, once they are given
> to a guest, never taken back by the host. For enclaves on host side, pages
> are reclaimable, that allows us to enforce in a similar way to memcg.

Is this distinction in the preemptability of EPC pages (host/"process"
enclaves vs VM enclaves) mandated by the HW implementation? Or do users
have an option to lock certain pages in memory that yields this
difference?

Regards,
Michal
  
Michal Koutný Oct. 17, 2023, 7:13 p.m. UTC | #33
On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?

(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types also warrant two types of misc resource? (To
deal with each in its own way.)

Thanks,
Michal
  
Haitao Huang Oct. 18, 2023, 4:37 a.m. UTC | #34
Hi Michal,

On Tue, 17 Oct 2023 13:54:46 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> Hello Haitao.
>
> On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang  
> <haitao.huang@linux.intel.com> wrote:
>> AFAIK, before we introducing max_write() callback in this series, no  
>> misc
>> controller would possibly enforce the limit when misc.max is reduced.  
>> e.g. I
>> don't think CVMs be killed when ASID limit is reduced and the cgroup was
>> full before limit is reduced.
>
> Yes, misccontroller was meant to be simple, current >= max serves to
> prevent new allocations.
>
Thanks for confirming. Maybe another alternative is that we just keep max_write
non-preemptive, with no need to add the max_write() callback.

The EPC controller would only trigger reclaiming on new allocations, or return
ENOMEM if there is nothing more to reclaim. Reclaiming here includes normal EPC
page reclaiming and killing enclaves in out-of-EPC cases. vEPC assigned to
guests is basically carved out and never reclaimable by the host.

As we would no longer enforce the limit when max_write lowers the value, users
should not expect the cgroup to forcibly reclaim pages from enclaves or kill
VMs/enclaves as a result of reducing the limit 'in place'. Users should always
create cgroups, set limits, and then launch enclaves/VMs into the groups created.

> FTR, at some point in time memory.max was considered for reclaim control
> of regular pages but it turned out to be too coarse (and OOM killing
> processes if amount was not sensed correctly) and this eventually
> evolved into specific mechanism of memory.reclaim.
> So I'm mentioning this should that be an interface with better semantic
> for your use case (and misc.max writes can remain non-preemptive).
>

Yes, we can introduce misc.reclaim to give the user a knob to forcefully
reduce usage if that is really needed in practice. The semantics would make
force-killing VMs explicit to the user.

> One more note -- I was quite confused when I read in the rest of the
> series about OOM and _kill_ing but then I found no such measure in the
> code implementation. So I would suggest two terminological changes:
>
> - the basic premise of the series (00/18) is that EPC pages are a
>   different resource than memory, hence choose a better suiting name
>   than OOM (out of memory) condition,

I couldn't come up with a good name. Out of EPC (OOEPC), maybe? I feel OOEPC
would be hard to read in code though. OOM was relatable as it is similar
to a normal OOM, just for a special kind of memory :-) I'm open to any better
suggestions.

> - killing -- (unless you have an intention to implement process
>   termination later) My current interpretation that it is rather some
>   aggressive unmapping within address space, so less confusing name for
>   that would be "reclaim".
>

Yes. Killing here refers to killing an enclave, analogous to killing a process,
not just 'reclaim' though. I can change it to always say 'killing the enclave'
explicitly.

>
>> I think EPC pages to VMs could have the same behavior, once they are  
>> given
>> to a guest, never taken back by the host. For enclaves on host side,  
>> pages
>> are reclaimable, that allows us to enforce in a similar way to memcg.
>
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?
>

The difference is really a result of the current vEPC implementation. Because
enclave pages, once in use, contain confidential content, they need a special
process to reclaim. So it's complex for the host to reclaim guest EPC
gracefully.

Thanks
Haitao
  
Haitao Huang Oct. 18, 2023, 4:39 a.m. UTC | #35
On Tue, 17 Oct 2023 14:13:22 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný  
> <mkoutny@suse.com> wrote:
>> Is this distinction between preemptability of EPC pages mandated by the
>> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
>> users an option to lock certain pages in memory that yields this
>> difference?
>
> (After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")
>
> Or would these two types warrant also two types of miscresource? (To
> deal with each in own way.)

They are from the same bucket of HW resources, so I think it's more suitable
to have one resource type. Otherwise we would need a policy for dividing the
capacity, etc. And it is still possible that vEPC becomes reclaimable in the
future.

My current thinking is that we can probably get away with non-preemptive
max_write for enclaves too. See my other reply.

Thanks
Haitao
  
Dave Hansen Oct. 18, 2023, 1:55 p.m. UTC | #36
On 10/17/23 21:37, Haitao Huang wrote:
> Yes we can introduce misc.reclaim to give user a knob to forcefully 
> reducing usage if that is really needed in real usage. The semantics
> would make force-kill VMs explicit to user.

Do any other controllers do something like this?  It seems odd.
  
Haitao Huang Oct. 18, 2023, 3:26 p.m. UTC | #37
On Wed, 18 Oct 2023 08:55:12 -0500, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 10/17/23 21:37, Haitao Huang wrote:
>> Yes we can introduce misc.reclaim to give user a knob to forcefully
>> reducing usage if that is really needed in real usage. The semantics
>> would make force-kill VMs explicit to user.
>
> Do any other controllers do something like this?  It seems odd.

Maybe not in the sense of killing something. My understanding is that
memory.reclaim does not necessarily invoke the OOM killer. But what I really
intend to say is that we can have a separate knob for the user to explicitly
express the need to reduce the current usage, and keep the non-preemptive
semantics of 'misc.max' intact. When we implement that new knob, we can define
what kind of reclaim it performs. Depending on the vEPC implementation, it may
or may not involve killing VMs. But at least the semantics will be explicit for
the user.

Thanks
Haitao
  
Dave Hansen Oct. 18, 2023, 3:37 p.m. UTC | #38
On 10/18/23 08:26, Haitao Huang wrote:
> Maybe not in sense of killing something. My understanding memory.reclaim
> does not necessarily invoke the OOM killer. But what I really intend to
> say is we can have a separate knob for user to express the need for
> reducing the current usage explicitly and keep "misc.max' non-preemptive
> semantics intact. When we implement that new knob, then we can define
> what kind of reclaim for that. Depending on vEPC implementation, it may
> or may not involve killing VMs. But at least that semantics will be
> explicit for user.

I'm really worried that you're going for "perfect" semantics here.  This
is SGX.  It's *NOT* getting heavy use and even fewer folks will ever
apply cgroup controls to it.

Can we please stick to simple, easily-coded rules here?  I honestly
don't think these corner cases matter all that much and there's been
*WAY* too much traffic in this thread for what is ultimately not that
complicated.  Focus on *ONE* thing:

1. Admin sets a limit
2. Enclave is created
3. Enclave hits limit, allocation fails

Nothing else matters.  What if the admin lowers the limit on an
already-created enclave?  Nobody cares.  Seriously.  What about inducing
reclaim?  Nobody cares.  What about vEPC?  Doesn't matter, an enclave
page is an enclave page.
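
And the enforcement for that one thing is tiny.  A minimal sketch, assuming
the series' MISC_CG_RES_SGX_EPC misc resource and byte accounting; the wrapper
name here is made up:

/* Charge one EPC page at allocation time; fail once the cgroup is at misc.max. */
static int sgx_epc_try_charge_sketch(struct misc_cg *cg)
{
	if (misc_cg_try_charge(MISC_CG_RES_SGX_EPC, cg, PAGE_SIZE))
		return -ENOMEM;

	return 0;
}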
  
Michal Koutný Oct. 18, 2023, 3:52 p.m. UTC | #39
On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen <dave.hansen@intel.com> wrote:
> 1. Admin sets a limit
> 2. Enclave is created
> 3. Enclave hits limit, allocation fails

I was actually about to suggest reorganizing the series into a part
implementing this simple limiting and a subsequent part with the reclaim
stuff, for easier digestibility.

> Nothing else matters.

If the latter part is unnecessary overkill, even better.

Thanks,
Michal
  
Haitao Huang Oct. 18, 2023, 4:25 p.m. UTC | #40
On Wed, 18 Oct 2023 10:52:23 -0500, Michal Koutný <mkoutny@suse.com> wrote:

> On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen  
> <dave.hansen@intel.com> wrote:
>> 1. Admin sets a limit
>> 2. Enclave is created
>> 3. Enclave hits limit, allocation fails
>
> I was actually about to suggest reorganizing the series to a part
> implementing this simple limiting and a subsequent part with the reclaim
> stuff for easier digestability.
>
>> Nothing else matters.
>
> If the latter part is an unncessary overkill, it's even better.
>

OK. I'll take out the max_write() callback and only implement non-preemptive
misc.max for EPC.
I can also separate out the OOEPC killing of enclaves, which is not needed if
we only block allocation at the limit; there is no need to kill one enclave to
make space for another. This will simplify things a lot.

Thanks to all for your input!

Haitao
  

Patch

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 262f5fb18d74..ff42d649c7b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -44,7 +44,6 @@  static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_mm *encl_mm;
 
 	/*
 	 * Drain the remaining mm_list entries. At this point the list contains
@@ -52,31 +51,7 @@  static int sgx_release(struct inode *inode, struct file *file)
 	 * not exited yet. The processes, which have exited, are gone from the
 	 * list by sgx_mmu_notifier_release().
 	 */
-	for ( ; ; )  {
-		spin_lock(&encl->mm_lock);
-
-		if (list_empty(&encl->mm_list)) {
-			encl_mm = NULL;
-		} else {
-			encl_mm = list_first_entry(&encl->mm_list,
-						   struct sgx_encl_mm, list);
-			list_del_rcu(&encl_mm->list);
-		}
-
-		spin_unlock(&encl->mm_lock);
-
-		/* The enclave is no longer mapped by any mm. */
-		if (!encl_mm)
-			break;
-
-		synchronize_srcu(&encl->srcu);
-		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
-		kfree(encl_mm);
-
-		/* 'encl_mm' is gone, put encl_mm->encl reference: */
-		kref_put(&encl->refcount, sgx_encl_release);
-	}
-
+	sgx_encl_mm_drain(encl);
 	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index a8617e6a4b4e..3c91a705e720 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -451,6 +451,9 @@  static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 	if (unlikely(!encl))
 		return VM_FAULT_SIGBUS;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The page_array keeps track of all enclave pages, whether they
 	 * are swapped out or not. If there is no entry for this page and
@@ -649,7 +652,8 @@  static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 	if (!encl)
 		return -EFAULT;
 
-	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+	    test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
@@ -774,6 +778,45 @@  void sgx_encl_release(struct kref *ref)
 	kfree(encl);
 }
 
+/**
+ * sgx_encl_mm_drain - drain all mm_list entries
+ * @encl:	address of the sgx_encl to drain
+ *
+ * Used during OOM kill to empty the mm_list entries after they have been
+ * zapped, or used by sgx_release() to drain the remaining mm_list entries when
+ * the enclave fd is closing. After this call, sgx_encl_release() will be
+ * called via kref_put().
+ */
+void sgx_encl_mm_drain(struct sgx_encl *encl)
+{
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+
+		/* 'encl_mm' is gone, put encl_mm->encl reference: */
+		kref_put(&encl->refcount, sgx_encl_release);
+	}
+}
+
 /*
  * 'mm' is exiting and no longer needs mmu notifications.
  */
@@ -845,6 +888,9 @@  int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	/*
 	 * Even though a single enclave may be mapped into an mm more than once,
 	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..cdb57ecb05c8 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@  enum sgx_encl_flags {
 	SGX_ENCL_DEBUG		= BIT(1),
 	SGX_ENCL_CREATED	= BIT(2),
 	SGX_ENCL_INITIALIZED	= BIT(3),
+	SGX_ENCL_NO_MEMORY	= BIT(4),
 };
 
 struct sgx_encl_mm {
@@ -125,5 +126,6 @@  struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
 struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_mm_drain(struct sgx_encl *encl);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 50ddd8988452..e1209e2cf6a3 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -420,6 +420,9 @@  static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
 		return -EFAULT;
 
@@ -605,6 +608,9 @@  static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
 		return -EFAULT;
 
@@ -681,6 +687,9 @@  static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
 	if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		return -ENOMEM;
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f3a3ed894616..3b875ab4dcd0 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -621,6 +621,146 @@  void sgx_free_epc_page(struct sgx_epc_page *page)
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl *encl;
+
+	if (epc_page->flags & SGX_EPC_OWNER_PAGE)
+		encl = epc_page->encl_page->encl;
+	else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
+		encl = epc_page->encl;
+	else
+		return false;
+
+	return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *epc_page, *tmp;
+
+	if (list_empty(&lru->unreclaimable))
+		return NULL;
+
+	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+		list_del_init(&epc_page->list);
+
+		if (sgx_oom_get_ref(epc_page))
+			return epc_page;
+	}
+	return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+			    unsigned long end, const struct vm_operations_struct *ops)
+{
+	VMA_ITERATOR(vmi, mm, start);
+	struct vm_area_struct *vma;
+
+	/*
+	 * Use 'end' because 'start' can be zero and yet not be mapped
+	 * into the enclave even if encl->base == 0.
+	 */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_ops == ops && vma->vm_private_data == owner &&
+		    vma->vm_start < end) {
+			zap_vma_pages(vma);
+		}
+	}
+}
+
+static bool sgx_oom_encl(struct sgx_encl *encl)
+{
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = false;
+	int idx;
+
+	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+		goto out_put;
+
+	/* OOM was already done on this enclave previously; do not redo it.
+	 * This may happen when the SECS page is still UNRECLAIMABLE because
+	 * another page is in RECLAIM_IN_PROGRESS. Still return true so the
+	 * OOM killer waits until the reclaimer is done with the held-up page
+	 * and SECS before it moves on to find another victim.
+	 */
+	if (test_bit(SGX_ENCL_NO_MEMORY, &encl->flags))
+		goto out;
+
+	set_bit(SGX_ENCL_NO_MEMORY, &encl->flags);
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+					encl->base + encl->size, &sgx_vm_ops);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+	sgx_encl_mm_drain(encl);
+out:
+	ret = true;
+
+out_put:
+	/*
+	 * This puts the refcount we took when we identified this enclave as
+	 * an OOM victim.
+	 */
+	kref_put(&encl->refcount, sgx_encl_release);
+	return ret;
+}
+
+static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+	return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:	LRU that is low
+ *
+ * Return:	%true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *victim;
+
+	spin_lock(&lru->lock);
+	victim = sgx_oom_get_victim(lru);
+	spin_unlock(&lru->lock);
+
+	if (!victim)
+		return false;
+
+	if (victim->flags & SGX_EPC_OWNER_PAGE)
+		return sgx_oom_encl_page(victim->encl_page);
+
+	if (victim->flags & SGX_EPC_OWNER_ENCL)
+		return sgx_oom_encl(victim->encl);
+
+	/* Will never happen unless we add more owner types in the future */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 					 unsigned long index,
 					 struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 337747bef7c2..6c0bfdc209c0 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -178,6 +178,7 @@  void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 
 void sgx_ipi_cb(void *info);