[v10,48/50] KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event

Message ID 20231016132819.1002933-49-michael.roth@amd.com
State New
Headers
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support |

Commit Message

Michael Roth Oct. 16, 2023, 1:28 p.m. UTC
  From: Brijesh Singh <brijesh.singh@amd.com>

Version 2 of GHCB specification added the support for two SNP Guest
Request Message NAE events. The events allows for an SEV-SNP guest to
make request to the SEV-SNP firmware through hypervisor using the
SNP_GUEST_REQUEST API define in the SEV-SNP firmware specification.

The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
difference of an additional certificate blob that can be passed through
the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
provides snp_guest_ext_guest_request() that is used by the KVM to get
both the report and certificate data at once.

Co-developed-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
[mdr: ensure FW command failures are indicated to guest]
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/kvm/svm/sev.c       | 176 +++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h       |   1 +
 drivers/crypto/ccp/sev-dev.c |  15 +++
 include/linux/psp-sev.h      |   1 +
 4 files changed, 193 insertions(+)
  

Comments

Dionna Amalie Glaze Oct. 16, 2023, 11:18 p.m. UTC | #1
> +
> +       /*
> +        * If a VMM-specific certificate blob hasn't been provided, grab the
> +        * host-wide one.
> +        */
> +       snp_certs = sev_snp_certs_get(sev->snp_certs);
> +       if (!snp_certs)
> +               snp_certs = sev_snp_global_certs_get();
> +

This is where the generation I suggested adding would get checked. If
the instance certs' generation is not the global generation, then I
think we need a way to return to the VMM to make that right before
continuing to provide outdated certificates.
This might be an unreasonable request, but the fact that the certs and
reported_tcb can be set while a VM is running makes this an issue.
  
Sean Christopherson Oct. 17, 2023, 4:27 p.m. UTC | #2
On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
> > +
> > +       /*
> > +        * If a VMM-specific certificate blob hasn't been provided, grab the
> > +        * host-wide one.
> > +        */
> > +       snp_certs = sev_snp_certs_get(sev->snp_certs);
> > +       if (!snp_certs)
> > +               snp_certs = sev_snp_global_certs_get();
> > +
> 
> This is where the generation I suggested adding would get checked. If
> the instance certs' generation is not the global generation, then I
> think we need a way to return to the VMM to make that right before
> continuing to provide outdated certificates.
> This might be an unreasonable request, but the fact that the certs and
> reported_tcb can be set while a VM is running makes this an issue.

Before we get that far, the changelogs need to explain why the kernel is storing
userspace blobs in the first place.  The whole thing is a bit of a mess.

sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU
bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs
while sev_snp_global_certs_get() is running.  If the compiler reloads snp_certs
between bumping the refcount and grabbing the pointer, KVM will end up leaking a
refcount and consuming a pointer without a refcount.

	if (!kref_get_unless_zero(&certs->kref))
		return NULL;

	return certs;

If allocating memory for the certs fails, the kernel will have set the config
but not store the corresponding certs.

	ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
		if (ret)
			goto e_free;

		memcpy(&sev->snp_config, &config, sizeof(config));
	}

	/*
	 * If the new certs are passed then cache it else free the old certs.
	 */
	if (input.certs_len) {
		snp_certs = sev_snp_certs_new(certs, input.certs_len);
		if (!snp_certs) {
			ret = -ENOMEM;
			goto e_free;
		}
	}

Reasoning about ordering is also difficult, e.g. what is KVM's contract with
userspace in terms of recognizing new global certs?

I don't understand why the kernel needs to manage the certs.  AFAICT the so called
global certs aren't an input to SEV_CMD_SNP_CONFIG, i.e. SNP_SET_EXT_CONFIG is
purely a software defined thing.

The easiest solution I can think of is to have KVM provide a chunk of memory in
kvm_sev_info for SNP guests that userspace can mmap(), a la vcpu->run.

	struct sev_snp_certs {
		u8 data[KVM_MAX_SEV_SNP_CERT_SIZE];
		u32 size;
		u8 pad[<size to make the struct page aligned>];
	};

When the guest requests the certs, KVM does something like:

	certs_size = READ_ONCE(sev->snp_certs->size);
	if (certs_size > sizeof(sev->snp_certs->data) ||
	    !IS_ALIGNED(certs_size, PAGE_SIZE))
		certs_size = 0;

	if (certs_size && (data_npages << PAGE_SHIFT) < certs_size) {
		vcpu->arch.regs[VCPU_REGS_RBX] = certs_size >> PAGE_SHIFT;
		exitcode = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN);
		goto cleanup;
	}

	...

	if (certs_size &&
	    kvm_write_guest(kvm, data_gpa, sev->snp_certs->data, certs_size))
		exitcode = SEV_RET_INVALID_ADDRESS;

If userspace wants to provide garbage to the guest, so be it, not KVM's problem.
That way, whether the VM gets the global cert or a per-VM cert is purely a userspace
concern.

If userspace needs to *stall* cert requests, e.g. while the certs are being updated,
then that's a different issue entirely.  If the GHCB allows telling the guest to
retry the request, then it should be trivially easy to solve, e.g. add a flag in
sev_snp_certs.  If KVM must "immediately" handle the request, then we'll need more
elaborate uAPI.
  
Alexey Kardashevskiy Oct. 18, 2023, 2:28 a.m. UTC | #3
On 18/10/23 03:27, Sean Christopherson wrote:
> On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
>>> +
>>> +       /*
>>> +        * If a VMM-specific certificate blob hasn't been provided, grab the
>>> +        * host-wide one.
>>> +        */
>>> +       snp_certs = sev_snp_certs_get(sev->snp_certs);
>>> +       if (!snp_certs)
>>> +               snp_certs = sev_snp_global_certs_get();
>>> +
>>
>> This is where the generation I suggested adding would get checked. If
>> the instance certs' generation is not the global generation, then I
>> think we need a way to return to the VMM to make that right before
>> continuing to provide outdated certificates.
>> This might be an unreasonable request, but the fact that the certs and
>> reported_tcb can be set while a VM is running makes this an issue.
> 
> Before we get that far, the changelogs need to explain why the kernel is storing
> userspace blobs in the first place.  The whole thing is a bit of a mess.
> 
> sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU
> bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs
> while sev_snp_global_certs_get() is running.  If the compiler reloads snp_certs
> between bumping the refcount and grabbing the pointer, KVM will end up leaking a
> refcount and consuming a pointer without a refcount.
> 
> 	if (!kref_get_unless_zero(&certs->kref))
> 		return NULL;
> 
> 	return certs;

I'm missing something here. The @certs pointer is on the stack, if it is 
being released elsewhere - kref_get_unless_zero() is going to fail and 
return NULL. How can this @certs not have the refcount incremented?


> If allocating memory for the certs fails, the kernel will have set the config
> but not store the corresponding certs.


Ah true.

> 	ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
> 		if (ret)
> 			goto e_free;
> 
> 		memcpy(&sev->snp_config, &config, sizeof(config));
> 	}
> 
> 	/*
> 	 * If the new certs are passed then cache it else free the old certs.
> 	 */
> 	if (input.certs_len) {
> 		snp_certs = sev_snp_certs_new(certs, input.certs_len);
> 		if (!snp_certs) {
> 			ret = -ENOMEM;
> 			goto e_free;
> 		}
> 	}
> 
> Reasoning about ordering is also difficult, e.g. what is KVM's contract with
> userspace in terms of recognizing new global certs?
 >
> I don't understand why the kernel needs to manage the certs.  AFAICT the so called
> global certs aren't an input to SEV_CMD_SNP_CONFIG, i.e. SNP_SET_EXT_CONFIG is
> purely a software defined thing.
> > The easiest solution I can think of is to have KVM provide a chunk of 
memory in
> kvm_sev_info for SNP guests that userspace can mmap(), a la vcpu->run.
> 
> 	struct sev_snp_certs {
> 		u8 data[KVM_MAX_SEV_SNP_CERT_SIZE];
> 		u32 size;
> 		u8 pad[<size to make the struct page aligned>];
> 	};
> 
> When the guest requests the certs, KVM does something like:
> 
> 	certs_size = READ_ONCE(sev->snp_certs->size);
> 	if (certs_size > sizeof(sev->snp_certs->data) ||
> 	    !IS_ALIGNED(certs_size, PAGE_SIZE))
> 		certs_size = 0;
> 
> 	if (certs_size && (data_npages << PAGE_SHIFT) < certs_size) {
> 		vcpu->arch.regs[VCPU_REGS_RBX] = certs_size >> PAGE_SHIFT;
> 		exitcode = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN);
> 		goto cleanup;
> 	}
> 
> 	...
> 
> 	if (certs_size &&
> 	    kvm_write_guest(kvm, data_gpa, sev->snp_certs->data, certs_size))
> 		exitcode = SEV_RET_INVALID_ADDRESS;
> 
> If userspace wants to provide garbage to the guest, so be it, not KVM's problem.
> That way, whether the VM gets the global cert or a per-VM cert is purely a userspace
> concern.

The global cert lives in CCP (/dev/sev), the per VM cert lives in 
kvmvm_fd. "A la vcpu->run" is fine for the latter but for the former we 
need something else. And there is scenario when one global certs blob is 
what is needed and copying it over multiple VMs seems suboptimal.

> If userspace needs to *stall* cert requests, e.g. while the certs are being updated,

afaik it does not need to.

> then that's a different issue entirely.  If the GHCB allows telling the guest to
> retry the request, then it should be trivially easy to solve, e.g. add a flag in
> sev_snp_certs.  If KVM must "immediately" handle the request, then we'll need more
> elaborate uAPI.
  
Sean Christopherson Oct. 18, 2023, 1:48 p.m. UTC | #4
On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote:
> 
> On 18/10/23 03:27, Sean Christopherson wrote:
> > On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
> > > > +
> > > > +       /*
> > > > +        * If a VMM-specific certificate blob hasn't been provided, grab the
> > > > +        * host-wide one.
> > > > +        */
> > > > +       snp_certs = sev_snp_certs_get(sev->snp_certs);
> > > > +       if (!snp_certs)
> > > > +               snp_certs = sev_snp_global_certs_get();
> > > > +
> > > 
> > > This is where the generation I suggested adding would get checked. If
> > > the instance certs' generation is not the global generation, then I
> > > think we need a way to return to the VMM to make that right before
> > > continuing to provide outdated certificates.
> > > This might be an unreasonable request, but the fact that the certs and
> > > reported_tcb can be set while a VM is running makes this an issue.
> > 
> > Before we get that far, the changelogs need to explain why the kernel is storing
> > userspace blobs in the first place.  The whole thing is a bit of a mess.
> > 
> > sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU
> > bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs
> > while sev_snp_global_certs_get() is running.  If the compiler reloads snp_certs
> > between bumping the refcount and grabbing the pointer, KVM will end up leaking a
> > refcount and consuming a pointer without a refcount.
> > 
> > 	if (!kref_get_unless_zero(&certs->kref))
> > 		return NULL;
> > 
> > 	return certs;
> 
> I'm missing something here. The @certs pointer is on the stack,

No, nothing guarantees that @certs is on the stack and will never be reloaded.
sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely
possible that it can be inlined.  Then you end up with:

	struct sev_device *sev;

	if (!psp_master || !psp_master->sev_data)
		return NULL;

	sev = psp_master->sev_data;
	if (!sev->snp_initialized)
		return NULL;

	if (!sev->snp_certs)
		return NULL;

	if (!kref_get_unless_zero(&sev->snp_certs->kref))
		return NULL;

	return sev->snp_certs;

At which point the compiler could choose to omit a local variable entirely, it
could store @certs in a register and reload after kref_get_unless_zero(), etc.
If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen.

That atomic operation in kref_get_unless_zero() might prevent a reload between
getting the kref and the return, but it wouldn't prevent a reload between the
!NULL check and kref_get_unless_zero().

> > If userspace wants to provide garbage to the guest, so be it, not KVM's problem.
> > That way, whether the VM gets the global cert or a per-VM cert is purely a userspace
> > concern.
> 
> The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd.
> "A la vcpu->run" is fine for the latter but for the former we need something
> else.

Why?  The cert ultimately comes from userspace, no?  Make userspace deal with it.

> And there is scenario when one global certs blob is what is needed and
> copying it over multiple VMs seems suboptimal.

That's a solvable problem.  I'm not sure I like the most obvious solution, but it
is a solution: let userspace define a KVM-wide blob pointer, either via .mmap()
or via an ioctl().

FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the
userspace pointer would suffice.  The benefit of a kernel controlled pointer is
that it doesn't require copying to a kernel buffer (or special code to copy from
userspace into guest).

Actually, looking at the flow again, AFAICT there's nothing special about the
target DATA_PAGE.  It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e.
KVM doesn't need to do conversions, there's no kernel priveleges required, etc.
And the GHCB doesn't dictate ordering between storing the certificates and doing
the request.  That means the certificate stuff can be punted entirely to usersepace.

Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0'
for non-SNP guests:

	unsigned long exitcode = 0;
	u64 data_gpa;
	int err, rc;

	if (!sev_snp_guest(vcpu->kvm)) {
		rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode"
		goto e_fail;
	}

e_fail:
	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode);

Which really highlights that we need to get test infrastructure up and running
for SEV-ES, SNP, and TDX.

Anyways, back to punting to userspace.  Here's a rough sketch.  The only new uAPI
is the definition of KVM_HC_SNP_GET_CERTS and its arguments.

static void snp_handle_guest_request(struct vcpu_svm *svm)
{
	struct vmcb_control_area *control = &svm->vmcb->control;
	struct sev_data_snp_guest_request data = {0};
	struct kvm_vcpu *vcpu = &svm->vcpu;
	struct kvm *kvm = vcpu->kvm;
	struct kvm_sev_info *sev;
	gpa_t req_gpa = control->exit_info_1;
	gpa_t resp_gpa = control->exit_info_2;
	unsigned long rc;
	int err;

	if (!sev_snp_guest(vcpu->kvm)) {
		rc = SEV_RET_INVALID_GUEST;
		goto e_fail;
	}

	sev = &to_kvm_svm(kvm)->sev_info;

	mutex_lock(&sev->guest_req_lock);

	rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
	if (rc)
		goto unlock;

	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
	if (rc)
		/* Ensure an error value is returned to guest. */
		rc = err ? err : SEV_RET_INVALID_ADDRESS;

	snp_cleanup_guest_buf(&data, &rc);

unlock:
	mutex_unlock(&sev->guest_req_lock);

e_fail:
	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
}

static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu)
{
	u64 certs_exitcode = vcpu->run->hypercall.args[2];
	struct vcpu_svm *svm = to_svm(vcpu);

	if (certs_exitcode)
		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode);
	else
		snp_handle_guest_request(svm);
	return 1;
}

static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
{
	struct kvm_vcpu *vcpu = &svm->vcpu;
	struct kvm *kvm = vcpu->kvm;
	struct kvm_sev_info *sev;
	unsigned long exitcode;
	u64 data_gpa;

	if (!sev_snp_guest(vcpu->kvm)) {
		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
		return 1;
	}

	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
		return 1;
	}

	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
	vcpu->run->hypercall.args[0]	 = data_gpa;
	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
	return 0;
}
  
Kalra, Ashish Oct. 18, 2023, 8:27 p.m. UTC | #5
On 10/18/2023 8:48 AM, Sean Christopherson wrote:
> On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote:
>>
>> On 18/10/23 03:27, Sean Christopherson wrote:
>>> On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
>>>>> +
>>>>> +       /*
>>>>> +        * If a VMM-specific certificate blob hasn't been provided, grab the
>>>>> +        * host-wide one.
>>>>> +        */
>>>>> +       snp_certs = sev_snp_certs_get(sev->snp_certs);
>>>>> +       if (!snp_certs)
>>>>> +               snp_certs = sev_snp_global_certs_get();
>>>>> +
>>>>
>>>> This is where the generation I suggested adding would get checked. If
>>>> the instance certs' generation is not the global generation, then I
>>>> think we need a way to return to the VMM to make that right before
>>>> continuing to provide outdated certificates.
>>>> This might be an unreasonable request, but the fact that the certs and
>>>> reported_tcb can be set while a VM is running makes this an issue.
>>>
>>> Before we get that far, the changelogs need to explain why the kernel is storing
>>> userspace blobs in the first place.  The whole thing is a bit of a mess.
>>>
>>> sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU
>>> bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs
>>> while sev_snp_global_certs_get() is running.  If the compiler reloads snp_certs
>>> between bumping the refcount and grabbing the pointer, KVM will end up leaking a
>>> refcount and consuming a pointer without a refcount.
>>>
>>> 	if (!kref_get_unless_zero(&certs->kref))
>>> 		return NULL;
>>>
>>> 	return certs;
>>
>> I'm missing something here. The @certs pointer is on the stack,
> 
> No, nothing guarantees that @certs is on the stack and will never be reloaded.
> sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely
> possible that it can be inlined.  Then you end up with:
> 
> 	struct sev_device *sev;
> 
> 	if (!psp_master || !psp_master->sev_data)
> 		return NULL;
> 
> 	sev = psp_master->sev_data;
> 	if (!sev->snp_initialized)
> 		return NULL;
> 
> 	if (!sev->snp_certs)
> 		return NULL;
> 
> 	if (!kref_get_unless_zero(&sev->snp_certs->kref))
> 		return NULL;
> 
> 	return sev->snp_certs;
> 
> At which point the compiler could choose to omit a local variable entirely, it
> could store @certs in a register and reload after kref_get_unless_zero(), etc.
> If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen.
> 
> That atomic operation in kref_get_unless_zero() might prevent a reload between
> getting the kref and the return, but it wouldn't prevent a reload between the
> !NULL check and kref_get_unless_zero().
> 
>>> If userspace wants to provide garbage to the guest, so be it, not KVM's problem.
>>> That way, whether the VM gets the global cert or a per-VM cert is purely a userspace
>>> concern.
>>
>> The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd.
>> "A la vcpu->run" is fine for the latter but for the former we need something
>> else.
> 
> Why?  The cert ultimately comes from userspace, no?  Make userspace deal with it.
> 
>> And there is scenario when one global certs blob is what is needed and
>> copying it over multiple VMs seems suboptimal.
> 
> That's a solvable problem.  I'm not sure I like the most obvious solution, but it
> is a solution: let userspace define a KVM-wide blob pointer, either via .mmap()
> or via an ioctl().
> 
> FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the
> userspace pointer would suffice.  The benefit of a kernel controlled pointer is
> that it doesn't require copying to a kernel buffer (or special code to copy from
> userspace into guest).
> 
> Actually, looking at the flow again, AFAICT there's nothing special about the
> target DATA_PAGE.  It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e.
> KVM doesn't need to do conversions, there's no kernel priveleges required, etc.
> And the GHCB doesn't dictate ordering between storing the certificates and doing
> the request.  

That's true.

>That means the certificate stuff can be punted entirely to usersepace.

> 
> Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0'
> for non-SNP guests:
> 
> 	unsigned long exitcode = 0;
> 	u64 data_gpa;
> 	int err, rc;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode"
> 		goto e_fail;
> 	}
> 
> e_fail:
> 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode);
> 
> Which really highlights that we need to get test infrastructure up and running
> for SEV-ES, SNP, and TDX.
> 
> Anyways, back to punting to userspace.  Here's a rough sketch.  The only new uAPI
> is the definition of KVM_HC_SNP_GET_CERTS and its arguments.
> 
> static void snp_handle_guest_request(struct vcpu_svm *svm)
> {
> 	struct vmcb_control_area *control = &svm->vmcb->control;
> 	struct sev_data_snp_guest_request data = {0};
> 	struct kvm_vcpu *vcpu = &svm->vcpu;
> 	struct kvm *kvm = vcpu->kvm;
> 	struct kvm_sev_info *sev;
> 	gpa_t req_gpa = control->exit_info_1;
> 	gpa_t resp_gpa = control->exit_info_2;
> 	unsigned long rc;
> 	int err;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		rc = SEV_RET_INVALID_GUEST;
> 		goto e_fail;
> 	}
> 
> 	sev = &to_kvm_svm(kvm)->sev_info;
> 
> 	mutex_lock(&sev->guest_req_lock);
> 
> 	rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
> 	if (rc)
> 		goto unlock;
> 
> 	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
> 	if (rc)
> 		/* Ensure an error value is returned to guest. */
> 		rc = err ? err : SEV_RET_INVALID_ADDRESS;
> 
> 	snp_cleanup_guest_buf(&data, &rc);
> 
> unlock:
> 	mutex_unlock(&sev->guest_req_lock);
> 
> e_fail:
> 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
> }
> 
> static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu)
> {
> 	u64 certs_exitcode = vcpu->run->hypercall.args[2];
> 	struct vcpu_svm *svm = to_svm(vcpu);
> 
> 	if (certs_exitcode)
> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode);
> 	else
> 		snp_handle_guest_request(svm);
> 	return 1;
> }
> 
> static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
> {
> 	struct kvm_vcpu *vcpu = &svm->vcpu;
> 	struct kvm *kvm = vcpu->kvm;
> 	struct kvm_sev_info *sev;
> 	unsigned long exitcode;
> 	u64 data_gpa;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
> 		return 1;
> 	}
> 
> 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
> 		return 1;
> 	}
> 
> 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
> 	vcpu->run->hypercall.args[0]	 = data_gpa;
> 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
> 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
> 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
> 	return 0;
> }
> 

IIRC, the important consideration here is to ensure that getting the 
attestation report and retrieving the certificates appears atomic to the 
guest. When SNP live migration is supported we don't want a case where 
the guest could have migrated between the call to obtain the 
certificates and obtaining the attestation report, which can potentially 
cause failure of validation of the attestation report.

Thanks,
Ashish
  
Sean Christopherson Oct. 18, 2023, 8:38 p.m. UTC | #6
On Wed, Oct 18, 2023, Ashish Kalra wrote:
> > static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
> > {
> > 	struct kvm_vcpu *vcpu = &svm->vcpu;
> > 	struct kvm *kvm = vcpu->kvm;
> > 	struct kvm_sev_info *sev;
> > 	unsigned long exitcode;
> > 	u64 data_gpa;
> > 
> > 	if (!sev_snp_guest(vcpu->kvm)) {
> > 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
> > 		return 1;
> > 	}
> > 
> > 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> > 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> > 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
> > 		return 1;
> > 	}
> > 
> > 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
> > 	vcpu->run->hypercall.args[0]	 = data_gpa;
> > 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
> > 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
> > 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
> > 	return 0;
> > }
> > 
> 
> IIRC, the important consideration here is to ensure that getting the
> attestation report and retrieving the certificates appears atomic to the
> guest. When SNP live migration is supported we don't want a case where the
> guest could have migrated between the call to obtain the certificates and
> obtaining the attestation report, which can potentially cause failure of
> validation of the attestation report.

Where does "obtaining the attestation report" happen?  I see the guest request
and the certificate stuff, I don't see anything about attestation reports (though
I'm not looking very closely).
  
Kalra, Ashish Oct. 18, 2023, 9:27 p.m. UTC | #7
On 10/18/2023 3:38 PM, Sean Christopherson wrote:
> On Wed, Oct 18, 2023, Ashish Kalra wrote:
>>> static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
>>> {
>>> 	struct kvm_vcpu *vcpu = &svm->vcpu;
>>> 	struct kvm *kvm = vcpu->kvm;
>>> 	struct kvm_sev_info *sev;
>>> 	unsigned long exitcode;
>>> 	u64 data_gpa;
>>>
>>> 	if (!sev_snp_guest(vcpu->kvm)) {
>>> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
>>> 		return 1;
>>> 	}
>>>
>>> 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
>>> 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
>>> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
>>> 		return 1;
>>> 	}
>>>
>>> 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
>>> 	vcpu->run->hypercall.args[0]	 = data_gpa;
>>> 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
>>> 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
>>> 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
>>> 	return 0;
>>> }
>>>
>>
>> IIRC, the important consideration here is to ensure that getting the
>> attestation report and retrieving the certificates appears atomic to the
>> guest. When SNP live migration is supported we don't want a case where the
>> guest could have migrated between the call to obtain the certificates and
>> obtaining the attestation report, which can potentially cause failure of
>> validation of the attestation report.
> 
> Where does "obtaining the attestation report" happen?  I see the guest request
> and the certificate stuff, I don't see anything about attestation reports (though
> I'm not looking very closely).
> 

The guest requests that the firmware construct an attestation report via 
the SNP_GUEST_REQUEST command. The certificates are piggy-backed to the 
guest along with the attestation report (retrieved from the FW via the 
SNP_GUEST_REQUEST command) as part of the SNP Extended Guest Request NAE 
handling.

Thanks,
Ashish
  
Sean Christopherson Oct. 18, 2023, 9:43 p.m. UTC | #8
On Wed, Oct 18, 2023, Ashish Kalra wrote:
> 
> On 10/18/2023 3:38 PM, Sean Christopherson wrote:
> > On Wed, Oct 18, 2023, Ashish Kalra wrote:
> > > > static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
> > > > {
> > > > 	struct kvm_vcpu *vcpu = &svm->vcpu;
> > > > 	struct kvm *kvm = vcpu->kvm;
> > > > 	struct kvm_sev_info *sev;
> > > > 	unsigned long exitcode;
> > > > 	u64 data_gpa;
> > > > 
> > > > 	if (!sev_snp_guest(vcpu->kvm)) {
> > > > 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
> > > > 		return 1;
> > > > 	}
> > > > 
> > > > 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> > > > 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> > > > 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
> > > > 		return 1;
> > > > 	}
> > > > 

Doh, I forget to set

		vcpu->run->exit_reason        = KVM_EXIT_HYPERCALL;

> > > > 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
> > > > 	vcpu->run->hypercall.args[0]	 = data_gpa;
> > > > 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
> > > > 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
> > > > 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
> > > > 	return 0;
> > > > }
> > > > 
> > > 
> > > IIRC, the important consideration here is to ensure that getting the
> > > attestation report and retrieving the certificates appears atomic to the
> > > guest. When SNP live migration is supported we don't want a case where the
> > > guest could have migrated between the call to obtain the certificates and
> > > obtaining the attestation report, which can potentially cause failure of
> > > validation of the attestation report.
> > 
> > Where does "obtaining the attestation report" happen?  I see the guest request
> > and the certificate stuff, I don't see anything about attestation reports (though
> > I'm not looking very closely).
> > 
> 
> The guest requests that the firmware construct an attestation report via the
> SNP_GUEST_REQUEST command. The certificates are piggy-backed to the guest
> along with the attestation report (retrieved from the FW via the
> SNP_GUEST_REQUEST command) as part of the SNP Extended Guest Request NAE
> handling.

Ah, thanks!

In that case, my proposal should more or less Just Work™, we simply need to define
KVM's ABI to be that userspace is responsible for doing KVM_RUN with
vcpu->run->immediate_exit set before migrating if the previous exit was
KVM_EXIT_HYPERCALL with KVM_HC_SNP_GET_CERTS.  This is standard operating procedure
for userspace exits where KVM needs to "complete" the VM-Exit, e.g. for MMIO, I/O,
etc. that are punted to userspace.
  
Alexey Kardashevskiy Oct. 19, 2023, 2:48 a.m. UTC | #9
On 19/10/23 00:48, Sean Christopherson wrote:
> On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote:
>>
>> On 18/10/23 03:27, Sean Christopherson wrote:
>>> On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
>>>>> +
>>>>> +       /*
>>>>> +        * If a VMM-specific certificate blob hasn't been provided, grab the
>>>>> +        * host-wide one.
>>>>> +        */
>>>>> +       snp_certs = sev_snp_certs_get(sev->snp_certs);
>>>>> +       if (!snp_certs)
>>>>> +               snp_certs = sev_snp_global_certs_get();
>>>>> +
>>>>
>>>> This is where the generation I suggested adding would get checked. If
>>>> the instance certs' generation is not the global generation, then I
>>>> think we need a way to return to the VMM to make that right before
>>>> continuing to provide outdated certificates.
>>>> This might be an unreasonable request, but the fact that the certs and
>>>> reported_tcb can be set while a VM is running makes this an issue.
>>>
>>> Before we get that far, the changelogs need to explain why the kernel is storing
>>> userspace blobs in the first place.  The whole thing is a bit of a mess.
>>>
>>> sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU
>>> bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs
>>> while sev_snp_global_certs_get() is running.  If the compiler reloads snp_certs
>>> between bumping the refcount and grabbing the pointer, KVM will end up leaking a
>>> refcount and consuming a pointer without a refcount.
>>>
>>> 	if (!kref_get_unless_zero(&certs->kref))
>>> 		return NULL;
>>>
>>> 	return certs;
>>
>> I'm missing something here. The @certs pointer is on the stack,
> 
> No, nothing guarantees that @certs is on the stack and will never be reloaded.
> sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely
> possible that it can be inlined.  Then you end up with:
> 
> 	struct sev_device *sev;
> 
> 	if (!psp_master || !psp_master->sev_data)
> 		return NULL;
> 
> 	sev = psp_master->sev_data;
> 	if (!sev->snp_initialized)
> 		return NULL;
> 
> 	if (!sev->snp_certs)
> 		return NULL;
> 
> 	if (!kref_get_unless_zero(&sev->snp_certs->kref))
> 		return NULL;
> 
> 	return sev->snp_certs;
> 
> At which point the compiler could choose to omit a local variable entirely, it
> could store @certs in a register and reload after kref_get_unless_zero(), etc.
> If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen.
> 
> That atomic operation in kref_get_unless_zero() might prevent a reload between
> getting the kref and the return, but it wouldn't prevent a reload between the
> !NULL check and kref_get_unless_zero().

Oh. The function is exported so I thought gcc would not go that far but 
yeah it is possible. So this needs an explicit READ_ONCE barrier.


>>> If userspace wants to provide garbage to the guest, so be it, not KVM's problem.
>>> That way, whether the VM gets the global cert or a per-VM cert is purely a userspace
>>> concern.
>>
>> The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd.
>> "A la vcpu->run" is fine for the latter but for the former we need something
>> else.
> 
> Why?  The cert ultimately comes from userspace, no?  Make userspace deal with it.
>
>> And there is scenario when one global certs blob is what is needed and
>> copying it over multiple VMs seems suboptimal.
> 
> That's a solvable problem.  I'm not sure I like the most obvious solution, but it
> is a solution: let userspace define a KVM-wide blob pointer, either via .mmap()
> or via an ioctl().
> 
> FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the
> userspace pointer would suffice.  The benefit of a kernel controlled pointer is
> that it doesn't require copying to a kernel buffer (or special code to copy from
> userspace into guest).

Just to clarify - like, a small userspace non-qemu program which just 
holds a pointer with the certs blob, or embed it into libvirt or systemd?


> Actually, looking at the flow again, AFAICT there's nothing special about the
> target DATA_PAGE.  It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e.
> KVM doesn't need to do conversions, there's no kernel priveleges required, etc.
> And the GHCB doesn't dictate ordering between storing the certificates and doing
> the request.  That means the certificate stuff can be punted entirely to usersepace.

All true.

> Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0'
> for non-SNP guests:
> 
> 	unsigned long exitcode = 0;
> 	u64 data_gpa;
> 	int err, rc;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode"
> 		goto e_fail;
> 	}
> 
> e_fail:
> 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode);
> 
> Which really highlights that we need to get test infrastructure up and running
> for SEV-ES, SNP, and TDX.
> 
> Anyways, back to punting to userspace.  Here's a rough sketch.  The only new uAPI
> is the definition of KVM_HC_SNP_GET_CERTS and its arguments.
> 
> static void snp_handle_guest_request(struct vcpu_svm *svm)
> {
> 	struct vmcb_control_area *control = &svm->vmcb->control;
> 	struct sev_data_snp_guest_request data = {0};
> 	struct kvm_vcpu *vcpu = &svm->vcpu;
> 	struct kvm *kvm = vcpu->kvm;
> 	struct kvm_sev_info *sev;
> 	gpa_t req_gpa = control->exit_info_1;
> 	gpa_t resp_gpa = control->exit_info_2;
> 	unsigned long rc;
> 	int err;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		rc = SEV_RET_INVALID_GUEST;
> 		goto e_fail;
> 	}
> 
> 	sev = &to_kvm_svm(kvm)->sev_info;
> 
> 	mutex_lock(&sev->guest_req_lock);
> 
> 	rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
> 	if (rc)
> 		goto unlock;
> 
> 	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
> 	if (rc)
> 		/* Ensure an error value is returned to guest. */
> 		rc = err ? err : SEV_RET_INVALID_ADDRESS;
> 
> 	snp_cleanup_guest_buf(&data, &rc);
> 
> unlock:
> 	mutex_unlock(&sev->guest_req_lock);
> 
> e_fail:
> 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
> }
> 
> static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu)
> {
> 	u64 certs_exitcode = vcpu->run->hypercall.args[2];
> 	struct vcpu_svm *svm = to_svm(vcpu);
> 
> 	if (certs_exitcode)
> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode);
> 	else
> 		snp_handle_guest_request(svm);
> 	return 1;
> }
> 
> static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
> {
> 	struct kvm_vcpu *vcpu = &svm->vcpu;
> 	struct kvm *kvm = vcpu->kvm;
> 	struct kvm_sev_info *sev;
> 	unsigned long exitcode;
> 	u64 data_gpa;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
> 		return 1;
> 	}
> 
> 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
> 		return 1;
> 	}
> 
> 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
> 	vcpu->run->hypercall.args[0]	 = data_gpa;
> 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
> 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;

btw why is it _LONG_MODE and not just _64? :)

> 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
> 	return 0;
> }

This should work the KVM stored certs nicely but not for the global 
certs. Although I am not all convinced that global certs is all that 
valuable but I do not know the history of that, happened before I joined 
so I let others to comment on that. Thanks,
  
Sean Christopherson Oct. 19, 2023, 2:57 p.m. UTC | #10
On Thu, Oct 19, 2023, Alexey Kardashevskiy wrote:
> 
> On 19/10/23 00:48, Sean Christopherson wrote:
> > static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
> > {
> > 	struct kvm_vcpu *vcpu = &svm->vcpu;
> > 	struct kvm *kvm = vcpu->kvm;
> > 	struct kvm_sev_info *sev;
> > 	unsigned long exitcode;
> > 	u64 data_gpa;
> > 
> > 	if (!sev_snp_guest(vcpu->kvm)) {
> > 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
> > 		return 1;
> > 	}
> > 
> > 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> > 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> > 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
> > 		return 1;
> > 	}
> > 
> > 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
> > 	vcpu->run->hypercall.args[0]	 = data_gpa;
> > 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
> > 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
> 
> btw why is it _LONG_MODE and not just _64? :)

I'm pretty sure it got copied from Xen when KVM started adding supporting for
emulating Xen's hypercalls.  I assume Xen PV actually has a need for identifying
long mode as opposed to just 64-bit mode, but KVM, not so much.

> > 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
> > 	return 0;
> > }
> 
> This should work the KVM stored certs nicely but not for the global certs.
> Although I am not all convinced that global certs is all that valuable but I
> do not know the history of that, happened before I joined so I let others to
> comment on that. Thanks,

Aren't the global certs provided by userspace too though?  If all certs are
ultimately controlled by userspace, I don't see any reason to make the kernel a
middle-man.
  
Alexey Kardashevskiy Oct. 19, 2023, 11:55 p.m. UTC | #11
On 20/10/23 01:57, Sean Christopherson wrote:
> On Thu, Oct 19, 2023, Alexey Kardashevskiy wrote:
>>
>> On 19/10/23 00:48, Sean Christopherson wrote:
>>> static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
>>> {
>>> 	struct kvm_vcpu *vcpu = &svm->vcpu;
>>> 	struct kvm *kvm = vcpu->kvm;
>>> 	struct kvm_sev_info *sev;
>>> 	unsigned long exitcode;
>>> 	u64 data_gpa;
>>>
>>> 	if (!sev_snp_guest(vcpu->kvm)) {
>>> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
>>> 		return 1;
>>> 	}
>>>
>>> 	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
>>> 	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
>>> 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
>>> 		return 1;
>>> 	}
>>>
>>> 	vcpu->run->hypercall.nr		 = KVM_HC_SNP_GET_CERTS;
>>> 	vcpu->run->hypercall.args[0]	 = data_gpa;
>>> 	vcpu->run->hypercall.args[1]	 = vcpu->arch.regs[VCPU_REGS_RBX];
>>> 	vcpu->run->hypercall.flags	 = KVM_EXIT_HYPERCALL_LONG_MODE;
>>
>> btw why is it _LONG_MODE and not just _64? :)
> 
> I'm pretty sure it got copied from Xen when KVM started adding supporting for
> emulating Xen's hypercalls.  I assume Xen PV actually has a need for identifying
> long mode as opposed to just 64-bit mode, but KVM, not so much.
> 
>>> 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
>>> 	return 0;
>>> }
>>
>> This should work the KVM stored certs nicely but not for the global certs.
>> Although I am not all convinced that global certs is all that valuable but I
>> do not know the history of that, happened before I joined so I let others to
>> comment on that. Thanks,
> 
> Aren't the global certs provided by userspace too though?  If all certs are
> ultimately controlled by userspace, I don't see any reason to make the kernel a
> middle-man.

The max blob size is 32KB or so and for 200 VMs it is:
- 6.5MB, all in the userspace so swappable  vs
- 32KB but in the kernel so not swappable.
Sure, a box capable of running 200 VMs must have plenty of RAM but still :)
Plus, GHCB now has to go via the userspace before talking to the PSP 
which was not the case so far (though I cannot think of immediate 
implication right now).
  
Sean Christopherson Oct. 20, 2023, 12:13 a.m. UTC | #12
On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote:
> 
> On 20/10/23 01:57, Sean Christopherson wrote:
> > On Thu, Oct 19, 2023, Alexey Kardashevskiy wrote:
> > > > 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
> > > > 	return 0;
> > > > }
> > > 
> > > This should work the KVM stored certs nicely but not for the global certs.
> > > Although I am not all convinced that global certs is all that valuable but I
> > > do not know the history of that, happened before I joined so I let others to
> > > comment on that. Thanks,
> > 
> > Aren't the global certs provided by userspace too though?  If all certs are
> > ultimately controlled by userspace, I don't see any reason to make the kernel a
> > middle-man.
> 
> The max blob size is 32KB or so and for 200 VMs it is:

Not according to include/linux/psp-sev.h:

#define SEV_FW_BLOB_MAX_SIZE	0x4000	/* 16KB */

Ugh, and I see in another patch:

  Also increase the SEV_FW_BLOB_MAX_SIZE another 4K page to allow space
  for an extra certificate.

-#define SEV_FW_BLOB_MAX_SIZE   0x4000  /* 16KB */
+#define SEV_FW_BLOB_MAX_SIZE   0x5000  /* 20KB */

That's gross and just asking for ABI problems, because then there's this:

+::
+
+       struct kvm_sev_snp_set_certs {
+               __u64 certs_uaddr;
+               __u64 certs_len
+       };
+
+The certs_len field may not exceed SEV_FW_BLOB_MAX_SIZE.

> - 6.5MB, all in the userspace so swappable  vs
> - 32KB but in the kernel so not swappable.
> Sure, a box capable of running 200 VMs must have plenty of RAM but still :)

That's making quite a few assumptions.

  1) That the global cert will be 32KiB (which clearly isn't the case today).
  2) That every VM will want the global cert.
  3) That userspace can't figure out a way to share the global cert.

Even in that absolutely worst case scenario, I am not remotely convinced that it
justifies taking on the necessary complexity to manage certs in-kernel.

> Plus, GHCB now has to go via the userspace before talking to the PSP which
> was not the case so far (though I cannot think of immediate implication
> right now).

Any argument along the lines of "because that's how we've always done it" is going
to fall on deaf ears.  If there's a real performance bottleneck with kicking out
to userspace, then I'll happily work to figure out a solution.  If.
  
Alexey Kardashevskiy Oct. 20, 2023, 12:43 a.m. UTC | #13
On 20/10/23 11:13, Sean Christopherson wrote:
> On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote:
>>
>> On 20/10/23 01:57, Sean Christopherson wrote:
>>> On Thu, Oct 19, 2023, Alexey Kardashevskiy wrote:
>>>>> 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
>>>>> 	return 0;
>>>>> }
>>>>
>>>> This should work the KVM stored certs nicely but not for the global certs.
>>>> Although I am not all convinced that global certs is all that valuable but I
>>>> do not know the history of that, happened before I joined so I let others to
>>>> comment on that. Thanks,
>>>
>>> Aren't the global certs provided by userspace too though?  If all certs are
>>> ultimately controlled by userspace, I don't see any reason to make the kernel a
>>> middle-man.
>>
>> The max blob size is 32KB or so and for 200 VMs it is:
> 
> Not according to include/linux/psp-sev.h:
> 
> #define SEV_FW_BLOB_MAX_SIZE	0x4000	/* 16KB */
> 
> Ugh, and I see in another patch:
> 
>    Also increase the SEV_FW_BLOB_MAX_SIZE another 4K page to allow space
>    for an extra certificate.
> 
> -#define SEV_FW_BLOB_MAX_SIZE   0x4000  /* 16KB */
> +#define SEV_FW_BLOB_MAX_SIZE   0x5000  /* 20KB */
> 
> That's gross and just asking for ABI problems, because then there's this:
> 
> +::
> +
> +       struct kvm_sev_snp_set_certs {
> +               __u64 certs_uaddr;
> +               __u64 certs_len
> +       };
> +
> +The certs_len field may not exceed SEV_FW_BLOB_MAX_SIZE.
> 
>> - 6.5MB, all in the userspace so swappable  vs
>> - 32KB but in the kernel so not swappable.
>> Sure, a box capable of running 200 VMs must have plenty of RAM but still :)
> 
> That's making quite a few assumptions.
> 
>    1) That the global cert will be 32KiB (which clearly isn't the case today).
>    2) That every VM will want the global cert.
>    3) That userspace can't figure out a way to share the global cert.
> 
> Even in that absolutely worst case scenario, I am not remotely convinced that it
> justifies taking on the necessary complexity to manage certs in-kernel.
> 
>> Plus, GHCB now has to go via the userspace before talking to the PSP which
>> was not the case so far (though I cannot think of immediate implication
>> right now).
> 
> Any argument along the lines of "because that's how we've always done it" is going
> to fall on deaf ears.  If there's a real performance bottleneck with kicking out
> to userspace, then I'll happily work to figure out a solution.  If.

No, not performance, I was trying to imagine what can go wrong if 
multiple vcpus are making this call, all exiting to QEMU, in a loop, 
racing, something like this.
  
Sean Christopherson Oct. 20, 2023, 3:13 p.m. UTC | #14
On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote:
> 
> On 20/10/23 11:13, Sean Christopherson wrote:
> > On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote:
> > > Plus, GHCB now has to go via the userspace before talking to the PSP which
> > > was not the case so far (though I cannot think of immediate implication
> > > right now).
> > 
> > Any argument along the lines of "because that's how we've always done it" is going
> > to fall on deaf ears.  If there's a real performance bottleneck with kicking out
> > to userspace, then I'll happily work to figure out a solution.  If.
> 
> No, not performance, I was trying to imagine what can go wrong if multiple
> vcpus are making this call, all exiting to QEMU, in a loop, racing,
> something like this.

I am not at all concerned about userspace being able to handle parallel requests
to get a certificate.  Per-vCPU exits that access global/shared resources might
not be super common, but they're certainly not rare.  E.g. a guest access to an
option ROM can trigger memslot updates in QEMU, which requires at least taking a
mutex to guard KVM_SET_USER_MEMORY_REGION, and IIRC QEMU also uses RCU to protect
QEMU accesses to address spaces.

Given that we know there will be scenarios where certificates are changed/updated,
I wouldn't be at all surprised if handling this in userspace is actually easier
as it will give userspace more control and options, and make it easier to reason
about the resulting behavior.  E.g. userspace could choose between a lockless
scheme and a r/w lock if there's a need to ensure per-VM and global certs are
updated atomically from the guest's perspective.
  
Tom Lendacky Oct. 20, 2023, 6:37 p.m. UTC | #15
On 10/18/23 21:48, Alexey Kardashevskiy wrote:
> 
> On 19/10/23 00:48, Sean Christopherson wrote:
>> On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote:
>>>
>>> On 18/10/23 03:27, Sean Christopherson wrote:
>>>> On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
>>>>>> +
>>>>>> +       /*
>>>>>> +        * If a VMM-specific certificate blob hasn't been provided, 
>>>>>> grab the
>>>>>> +        * host-wide one.
>>>>>> +        */
>>>>>> +       snp_certs = sev_snp_certs_get(sev->snp_certs);
>>>>>> +       if (!snp_certs)
>>>>>> +               snp_certs = sev_snp_global_certs_get();
>>>>>> +
>>>>>
>>>>> This is where the generation I suggested adding would get checked. If
>>>>> the instance certs' generation is not the global generation, then I
>>>>> think we need a way to return to the VMM to make that right before
>>>>> continuing to provide outdated certificates.
>>>>> This might be an unreasonable request, but the fact that the certs and
>>>>> reported_tcb can be set while a VM is running makes this an issue.
>>>>
>>>> Before we get that far, the changelogs need to explain why the kernel 
>>>> is storing
>>>> userspace blobs in the first place.  The whole thing is a bit of a mess.
>>>>
>>>> sev_snp_global_certs_get() has data races that could lead to 
>>>> variations of TOCTOU
>>>> bugs: sev_ioctl_snp_set_config() can overwrite 
>>>> psp_master->sev_data->snp_certs
>>>> while sev_snp_global_certs_get() is running.  If the compiler reloads 
>>>> snp_certs
>>>> between bumping the refcount and grabbing the pointer, KVM will end up 
>>>> leaking a
>>>> refcount and consuming a pointer without a refcount.
>>>>
>>>>     if (!kref_get_unless_zero(&certs->kref))
>>>>         return NULL;
>>>>
>>>>     return certs;
>>>
>>> I'm missing something here. The @certs pointer is on the stack,
>>
>> No, nothing guarantees that @certs is on the stack and will never be 
>> reloaded.
>> sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so 
>> it's entirely
>> possible that it can be inlined.  Then you end up with:
>>
>>     struct sev_device *sev;
>>
>>     if (!psp_master || !psp_master->sev_data)
>>         return NULL;
>>
>>     sev = psp_master->sev_data;
>>     if (!sev->snp_initialized)
>>         return NULL;
>>
>>     if (!sev->snp_certs)
>>         return NULL;
>>
>>     if (!kref_get_unless_zero(&sev->snp_certs->kref))
>>         return NULL;
>>
>>     return sev->snp_certs;
>>
>> At which point the compiler could choose to omit a local variable 
>> entirely, it
>> could store @certs in a register and reload after 
>> kref_get_unless_zero(), etc.
>> If psp_master->sev_data->snp_certs is changed at any point, odd thing 
>> can happen.
>>
>> That atomic operation in kref_get_unless_zero() might prevent a reload 
>> between
>> getting the kref and the return, but it wouldn't prevent a reload 
>> between the
>> !NULL check and kref_get_unless_zero().
> 
> Oh. The function is exported so I thought gcc would not go that far but 
> yeah it is possible. So this needs an explicit READ_ONCE barrier.
> 
> 
>>>> If userspace wants to provide garbage to the guest, so be it, not 
>>>> KVM's problem.
>>>> That way, whether the VM gets the global cert or a per-VM cert is 
>>>> purely a userspace
>>>> concern.
>>>
>>> The global cert lives in CCP (/dev/sev), the per VM cert lives in 
>>> kvmvm_fd.
>>> "A la vcpu->run" is fine for the latter but for the former we need 
>>> something
>>> else.
>>
>> Why?  The cert ultimately comes from userspace, no?  Make userspace deal 
>> with it.
>>
>>> And there is scenario when one global certs blob is what is needed and
>>> copying it over multiple VMs seems suboptimal.
>>
>> That's a solvable problem.  I'm not sure I like the most obvious 
>> solution, but it
>> is a solution: let userspace define a KVM-wide blob pointer, either via 
>> .mmap()
>> or via an ioctl().
>>
>> FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the
>> userspace pointer would suffice.  The benefit of a kernel controlled 
>> pointer is
>> that it doesn't require copying to a kernel buffer (or special code to 
>> copy from
>> userspace into guest).
> 
> Just to clarify - like, a small userspace non-qemu program which just 
> holds a pointer with the certs blob, or embed it into libvirt or systemd?
> 
> 
>> Actually, looking at the flow again, AFAICT there's nothing special 
>> about the
>> target DATA_PAGE.  It must be SHARED *before* 
>> SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e.
>> KVM doesn't need to do conversions, there's no kernel priveleges 
>> required, etc.
>> And the GHCB doesn't dictate ordering between storing the certificates 
>> and doing
>> the request.  That means the certificate stuff can be punted entirely to 
>> usersepace.
> 
> All true.
> 
>> Heh, typing up the below, there's another bug: KVM will incorrectly 
>> "return" '0'
>> for non-SNP guests:
>>
>>     unsigned long exitcode = 0;
>>     u64 data_gpa;
>>     int err, rc;
>>
>>     if (!sev_snp_guest(vcpu->kvm)) {
>>         rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode"
>>         goto e_fail;
>>     }
>>
>> e_fail:
>>     ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode);
>>
>> Which really highlights that we need to get test infrastructure up and 
>> running
>> for SEV-ES, SNP, and TDX.
>>
>> Anyways, back to punting to userspace.  Here's a rough sketch.  The only 
>> new uAPI
>> is the definition of KVM_HC_SNP_GET_CERTS and its arguments.
>>
>> static void snp_handle_guest_request(struct vcpu_svm *svm)
>> {
>>     struct vmcb_control_area *control = &svm->vmcb->control;
>>     struct sev_data_snp_guest_request data = {0};
>>     struct kvm_vcpu *vcpu = &svm->vcpu;
>>     struct kvm *kvm = vcpu->kvm;
>>     struct kvm_sev_info *sev;
>>     gpa_t req_gpa = control->exit_info_1;
>>     gpa_t resp_gpa = control->exit_info_2;
>>     unsigned long rc;
>>     int err;
>>
>>     if (!sev_snp_guest(vcpu->kvm)) {
>>         rc = SEV_RET_INVALID_GUEST;
>>         goto e_fail;
>>     }
>>
>>     sev = &to_kvm_svm(kvm)->sev_info;
>>
>>     mutex_lock(&sev->guest_req_lock);
>>
>>     rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
>>     if (rc)
>>         goto unlock;
>>
>>     rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
>>     if (rc)
>>         /* Ensure an error value is returned to guest. */
>>         rc = err ? err : SEV_RET_INVALID_ADDRESS;
>>
>>     snp_cleanup_guest_buf(&data, &rc);
>>
>> unlock:
>>     mutex_unlock(&sev->guest_req_lock);
>>
>> e_fail:
>>     ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
>> }
>>
>> static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu)
>> {
>>     u64 certs_exitcode = vcpu->run->hypercall.args[2];
>>     struct vcpu_svm *svm = to_svm(vcpu);
>>
>>     if (certs_exitcode)
>>         ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode);
>>     else
>>         snp_handle_guest_request(svm);
>>     return 1;
>> }
>>
>> static int snp_handle_ext_guest_request(struct vcpu_svm *svm)
>> {
>>     struct kvm_vcpu *vcpu = &svm->vcpu;
>>     struct kvm *kvm = vcpu->kvm;
>>     struct kvm_sev_info *sev;
>>     unsigned long exitcode;
>>     u64 data_gpa;
>>
>>     if (!sev_snp_guest(vcpu->kvm)) {
>>         ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST);
>>         return 1;
>>     }
>>
>>     data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
>>     if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
>>         ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS);
>>         return 1;
>>     }
>>
>>     vcpu->run->hypercall.nr         = KVM_HC_SNP_GET_CERTS;
>>     vcpu->run->hypercall.args[0]     = data_gpa;
>>     vcpu->run->hypercall.args[1]     = vcpu->arch.regs[VCPU_REGS_RBX];
>>     vcpu->run->hypercall.flags     = KVM_EXIT_HYPERCALL_LONG_MODE;
> 
> btw why is it _LONG_MODE and not just _64? :)
> 
>>     vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request;
>>     return 0;
>> }
> 
> This should work the KVM stored certs nicely but not for the global certs. 
> Although I am not all convinced that global certs is all that valuable but 
> I do not know the history of that, happened before I joined so I let 
> others to comment on that. Thanks,

Global certs was the original implementation because it was intended to 
provide the VCEK, ASK, and ARK. These will be the same for all SNP guests 
that are launched. The original intention was also to not make the kernel 
have to manage multiple certificates and instead just treat the data as a 
blob provided from user-space.

The per-VM change was added to allow a per-VM certificates. If a provider 
has no need to use this, then only the global certs blob is needed which 
reduces the amount of memory needed for the VM.

Thanks,
Tom

> 
>
  
Michael Roth Nov. 10, 2023, 10:07 p.m. UTC | #16
On Wed, Oct 18, 2023 at 06:48:59AM -0700, Sean Christopherson wrote:
> On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote:
> > 
> > On 18/10/23 03:27, Sean Christopherson wrote:
> > > On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote:
> > > > > +
> > > > > +       /*
> > > > > +        * If a VMM-specific certificate blob hasn't been provided, grab the
> > > > > +        * host-wide one.
> > > > > +        */
> > > > > +       snp_certs = sev_snp_certs_get(sev->snp_certs);
> > > > > +       if (!snp_certs)
> > > > > +               snp_certs = sev_snp_global_certs_get();
> > > > > +
> > > > 
> > > > This is where the generation I suggested adding would get checked. If
> > > > the instance certs' generation is not the global generation, then I
> > > > think we need a way to return to the VMM to make that right before
> > > > continuing to provide outdated certificates.
> > > > This might be an unreasonable request, but the fact that the certs and
> > > > reported_tcb can be set while a VM is running makes this an issue.
> > > 
> > > Before we get that far, the changelogs need to explain why the kernel is storing
> > > userspace blobs in the first place.  The whole thing is a bit of a mess.
> > > 
> > > sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU
> > > bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs
> > > while sev_snp_global_certs_get() is running.  If the compiler reloads snp_certs
> > > between bumping the refcount and grabbing the pointer, KVM will end up leaking a
> > > refcount and consuming a pointer without a refcount.
> > > 
> > > 	if (!kref_get_unless_zero(&certs->kref))
> > > 		return NULL;
> > > 
> > > 	return certs;
> > 
> > I'm missing something here. The @certs pointer is on the stack,
> 
> No, nothing guarantees that @certs is on the stack and will never be reloaded.
> sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely
> possible that it can be inlined.  Then you end up with:
> 
> 	struct sev_device *sev;
> 
> 	if (!psp_master || !psp_master->sev_data)
> 		return NULL;
> 
> 	sev = psp_master->sev_data;
> 	if (!sev->snp_initialized)
> 		return NULL;
> 
> 	if (!sev->snp_certs)
> 		return NULL;
> 
> 	if (!kref_get_unless_zero(&sev->snp_certs->kref))
> 		return NULL;
> 
> 	return sev->snp_certs;
> 
> At which point the compiler could choose to omit a local variable entirely, it
> could store @certs in a register and reload after kref_get_unless_zero(), etc.
> If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen.
> 
> That atomic operation in kref_get_unless_zero() might prevent a reload between
> getting the kref and the return, but it wouldn't prevent a reload between the
> !NULL check and kref_get_unless_zero().
> 
> > > If userspace wants to provide garbage to the guest, so be it, not KVM's problem.
> > > That way, whether the VM gets the global cert or a per-VM cert is purely a userspace
> > > concern.
> > 
> > The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd.
> > "A la vcpu->run" is fine for the latter but for the former we need something
> > else.
> 
> Why?  The cert ultimately comes from userspace, no?  Make userspace deal with it.
> 
> > And there is scenario when one global certs blob is what is needed and
> > copying it over multiple VMs seems suboptimal.
> 
> That's a solvable problem.  I'm not sure I like the most obvious solution, but it
> is a solution: let userspace define a KVM-wide blob pointer, either via .mmap()
> or via an ioctl().
> 
> FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the
> userspace pointer would suffice.  The benefit of a kernel controlled pointer is
> that it doesn't require copying to a kernel buffer (or special code to copy from
> userspace into guest).
> 
> Actually, looking at the flow again, AFAICT there's nothing special about the
> target DATA_PAGE.  It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e.
> KVM doesn't need to do conversions, there's no kernel priveleges required, etc.
> And the GHCB doesn't dictate ordering between storing the certificates and doing
> the request.  That means the certificate stuff can be punted entirely to usersepace.
> 
> Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0'
> for non-SNP guests:

Thanks for the catch, will fix that up.

> 
> 	unsigned long exitcode = 0;
> 	u64 data_gpa;
> 	int err, rc;
> 
> 	if (!sev_snp_guest(vcpu->kvm)) {
> 		rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode"
> 		goto e_fail;
> 	}
> 
> e_fail:
> 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode);
> 
> Which really highlights that we need to get test infrastructure up and running
> for SEV-ES, SNP, and TDX.
> 
> Anyways, back to punting to userspace.  Here's a rough sketch.  The only new uAPI
> is the definition of KVM_HC_SNP_GET_CERTS and its arguments.

This sketch seems like a good, flexible way to handle per-VM certs, but
it does complicate things from a userspace perspective. As a basic
requirement, all userspaces will need to provide a way to specify the
initial blob (either a very verbose base64-encoded userspace cmdline param,
or a filepatch that needs additional management to store and handle
permissions/etc.), and also a means to update it (e.g. a HMP/QMP command
for QEMU, some libvirt wrappers, etc.).

That's all well and good if you want to make use of per-VM certs, but we
don't necessarily expect that most deployments will necessarily want to deal
with per-VM certs, and would be happy with a system-wide one where they could
simply issue the /dev/sev ioctl to inject one automatically for all guests.

So we're sort of complicating the more common case to support a more niche
one (as far as userspace is concerned anyway; as far as kernel goes, your
approach is certainly simplest :)).

Instead, maybe a compromise is warranted so the requirements on userspace
side are less complicated for a more basic deployment:

  1) If /dev/sev is used to set a global certificate, then that will be
     used unconditionally by KVM, protected by simple dumb mutex during
     usage/update.
  2) If /dev/sev is not used to set the global certificate is the value
     is NULL, we assume userspace wants full responsibility for managing
     certificates and exit to userspace to request the certs in the manner
     you suggested.

Sean, Dionna, would this cover your concerns and address the certificate
update use-case?

-Mike
  
Sean Christopherson Nov. 10, 2023, 10:47 p.m. UTC | #17
On Fri, Nov 10, 2023, Michael Roth wrote:
> On Wed, Oct 18, 2023 at 06:48:59AM -0700, Sean Christopherson wrote:
> > On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote:
> > Anyways, back to punting to userspace.  Here's a rough sketch.  The only new uAPI
> > is the definition of KVM_HC_SNP_GET_CERTS and its arguments.
> 
> This sketch seems like a good, flexible way to handle per-VM certs, but
> it does complicate things from a userspace perspective. As a basic
> requirement, all userspaces will need to provide a way to specify the
> initial blob (either a very verbose base64-encoded userspace cmdline param,
> or a filepatch that needs additional management to store and handle
> permissions/etc.), and also a means to update it (e.g. a HMP/QMP command
> for QEMU, some libvirt wrappers, etc.).
>
> That's all well and good if you want to make use of per-VM certs, but we
> don't necessarily expect that most deployments will necessarily want to deal
> with per-VM certs, and would be happy with a system-wide one where they could
> simply issue the /dev/sev ioctl to inject one automatically for all guests.
> 
> So we're sort of complicating the more common case to support a more niche
> one (as far as userspace is concerned anyway; as far as kernel goes, your
> approach is certainly simplest :)).
> 
> Instead, maybe a compromise is warranted so the requirements on userspace
> side are less complicated for a more basic deployment:
> 
>   1) If /dev/sev is used to set a global certificate, then that will be
>      used unconditionally by KVM, protected by simple dumb mutex during
>      usage/update.
>   2) If /dev/sev is not used to set the global certificate is the value
>      is NULL, we assume userspace wants full responsibility for managing
>      certificates and exit to userspace to request the certs in the manner
>      you suggested.
> 
> Sean, Dionna, would this cover your concerns and address the certificate
> update use-case?

Honestly, no.  I see zero reason for the kernel to be involved.  IIUC, there's no
privileged operations that require kernel intervention, which means that shoving
a global cert into /dev/sev is using the CCP driver as middleman.  Just use a
userspace daemon.  I have a very hard time believing that passing around large-ish
blobs of data in userspace isn't already a solved problem.
  
Dionna Amalie Glaze Nov. 16, 2023, 5:31 a.m. UTC | #18
> > So we're sort of complicating the more common case to support a more niche
> > one (as far as userspace is concerned anyway; as far as kernel goes, your
> > approach is certainly simplest :)).
> >
> > Instead, maybe a compromise is warranted so the requirements on userspace
> > side are less complicated for a more basic deployment:
> >
> >   1) If /dev/sev is used to set a global certificate, then that will be
> >      used unconditionally by KVM, protected by simple dumb mutex during
> >      usage/update.
> >   2) If /dev/sev is not used to set the global certificate is the value
> >      is NULL, we assume userspace wants full responsibility for managing
> >      certificates and exit to userspace to request the certs in the manner
> >      you suggested.
> >
> > Sean, Dionna, would this cover your concerns and address the certificate
> > update use-case?
>
> Honestly, no.  I see zero reason for the kernel to be involved.  IIUC, there's no
> privileged operations that require kernel intervention, which means that shoving
> a global cert into /dev/sev is using the CCP driver as middleman.  Just use a
> userspace daemon.  I have a very hard time believing that passing around large-ish
> blobs of data in userspace isn't already a solved problem.

ping sathyanarayanan.kuppuswamy@linux.intel.com and +Dan Williams

I think for a uniform experience for all coco technologies, we need
someone from Intel to weigh in on supporting auxblob through a similar
vmexit. Whereas the quoting enclave gets its PCK cert installed by the
host, something like the firmware's SBOM [1] could be delivered in
auxblob. The proposal to embed the compressed SBOM binary in a coff
section of the UEFI doesn't get it communicated to user space, so this
is a good place to get that info about the expected TDMR in. The SBOM
proposal itself would need additional modeling in the coRIM profile to
have extra coco-specific measurements or we need to find some other
method of getting this info bundled with the attestation report.

My own plan for SEV-SNP was to have a bespoke signed measurement of
the UEFI in the GUID table, but that doesn't extend to TDX. If we're
looking more at an industry alignment on coRIM for SBOM formats (yes
please), then it'd be great to start getting that kind of info plumbed
to the user in a uniform way that doesn't have to rely on servers
providing the endorsements.

[1] https://uefi.org/blog/firmware-sbom-proposal
  
Dan Williams Dec. 5, 2023, 12:30 a.m. UTC | #19
[ add Dan Middleton for his awareness ]

Dionna Amalie Glaze wrote:
> > > So we're sort of complicating the more common case to support a more niche
> > > one (as far as userspace is concerned anyway; as far as kernel goes, your
> > > approach is certainly simplest :)).
> > >
> > > Instead, maybe a compromise is warranted so the requirements on userspace
> > > side are less complicated for a more basic deployment:
> > >
> > >   1) If /dev/sev is used to set a global certificate, then that will be
> > >      used unconditionally by KVM, protected by simple dumb mutex during
> > >      usage/update.
> > >   2) If /dev/sev is not used to set the global certificate is the value
> > >      is NULL, we assume userspace wants full responsibility for managing
> > >      certificates and exit to userspace to request the certs in the manner
> > >      you suggested.
> > >
> > > Sean, Dionna, would this cover your concerns and address the certificate
> > > update use-case?
> >
> > Honestly, no.  I see zero reason for the kernel to be involved.  IIUC, there's no
> > privileged operations that require kernel intervention, which means that shoving
> > a global cert into /dev/sev is using the CCP driver as middleman.  Just use a
> > userspace daemon.  I have a very hard time believing that passing around large-ish
> > blobs of data in userspace isn't already a solved problem.
> 
> ping sathyanarayanan.kuppuswamy@linux.intel.com and +Dan Williams

Apologies Dionna, I missed this earlier. 

> 
> I think for a uniform experience for all coco technologies, we need
> someone from Intel to weigh in on supporting auxblob through a similar
> vmexit. Whereas the quoting enclave gets its PCK cert installed by the
> host, something like the firmware's SBOM [1] could be delivered in
> auxblob. The proposal to embed the compressed SBOM binary in a coff
> section of the UEFI doesn't get it communicated to user space, so this
> is a good place to get that info about the expected TDMR in. The SBOM
> proposal itself would need additional modeling in the coRIM profile to
> have extra coco-specific measurements or we need to find some other
> method of getting this info bundled with the attestation report.

SBOM looks different than the SEV use case of @auxblob to convey a
certificate chain.

Are you asking for @auxblob to be SBOM on TDX and a certchain on SEV, or
unifying the @auxblob format on SBOM?

> My own plan for SEV-SNP was to have a bespoke signed measurement of
> the UEFI in the GUID table, but that doesn't extend to TDX. If we're
> looking more at an industry alignment on coRIM for SBOM formats (yes
> please), then it'd be great to start getting that kind of info plumbed
> to the user in a uniform way that doesn't have to rely on servers
> providing the endorsements.
> 
> [1] https://uefi.org/blog/firmware-sbom-proposal

Honestly my first reaction for this ABI would be for a new file under
/sys/firmware/efi/efivars or similar.
  
Dionna Amalie Glaze Dec. 5, 2023, 12:48 a.m. UTC | #20
On Mon, Dec 4, 2023 at 4:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> [ add Dan Middleton for his awareness ]
>
> Dionna Amalie Glaze wrote:
> > > > So we're sort of complicating the more common case to support a more niche
> > > > one (as far as userspace is concerned anyway; as far as kernel goes, your
> > > > approach is certainly simplest :)).
> > > >
> > > > Instead, maybe a compromise is warranted so the requirements on userspace
> > > > side are less complicated for a more basic deployment:
> > > >
> > > >   1) If /dev/sev is used to set a global certificate, then that will be
> > > >      used unconditionally by KVM, protected by simple dumb mutex during
> > > >      usage/update.
> > > >   2) If /dev/sev is not used to set the global certificate is the value
> > > >      is NULL, we assume userspace wants full responsibility for managing
> > > >      certificates and exit to userspace to request the certs in the manner
> > > >      you suggested.
> > > >
> > > > Sean, Dionna, would this cover your concerns and address the certificate
> > > > update use-case?
> > >
> > > Honestly, no.  I see zero reason for the kernel to be involved.  IIUC, there's no
> > > privileged operations that require kernel intervention, which means that shoving
> > > a global cert into /dev/sev is using the CCP driver as middleman.  Just use a
> > > userspace daemon.  I have a very hard time believing that passing around large-ish
> > > blobs of data in userspace isn't already a solved problem.
> >
> > ping sathyanarayanan.kuppuswamy@linux.intel.com and +Dan Williams
>
> Apologies Dionna, I missed this earlier.
>

No worries, I've been sick anyway.

> >
> > I think for a uniform experience for all coco technologies, we need
> > someone from Intel to weigh in on supporting auxblob through a similar
> > vmexit. Whereas the quoting enclave gets its PCK cert installed by the
> > host, something like the firmware's SBOM [1] could be delivered in
> > auxblob. The proposal to embed the compressed SBOM binary in a coff
> > section of the UEFI doesn't get it communicated to user space, so this
> > is a good place to get that info about the expected TDMR in. The SBOM
> > proposal itself would need additional modeling in the coRIM profile to
> > have extra coco-specific measurements or we need to find some other
> > method of getting this info bundled with the attestation report.
>
> SBOM looks different than the SEV use case of @auxblob to convey a
> certificate chain.

The SEV use case has a GUID table in which we're allowed to provide
more than just the VCEK certificate chain. I'm using it to deliver a
UEFI endorsement document as well.

>
> Are you asking for @auxblob to be SBOM on TDX and a certchain on SEV, or
> unifying the @auxblob format on SBOM?

Given SEV is both certchain and SBOM and TDX doesn't need a certchain
in auxblob, I'd just be looking at delivering SBOM in auxblob for TDX.
It's probably better to have something extensible though, like SEV's
GUID table format. We may want to provide cached TDI RIMs too, for
example.

>
> > My own plan for SEV-SNP was to have a bespoke signed measurement of
> > the UEFI in the GUID table, but that doesn't extend to TDX. If we're
> > looking more at an industry alignment on coRIM for SBOM formats (yes
> > please), then it'd be great to start getting that kind of info plumbed
> > to the user in a uniform way that doesn't have to rely on servers
> > providing the endorsements.
> >
> > [1] https://uefi.org/blog/firmware-sbom-proposal
>
> Honestly my first reaction for this ABI would be for a new file under
> /sys/firmware/efi/efivars or similar.

For UEFI specifically that could make sense, yes. Not everyone has
been mounting efivars, so it's been a bit of an uphill battle for that
one. Still there's the matter of cached TDI RIMs. NVIDIA would have
everyone send attestation requests to their servers every quote
request in the NRAS architecture, but we're looking at other ways to
provide reliable attestation without a third party service, albeit
with slightly different security properties.
  
Dan Williams Dec. 5, 2023, 8:06 p.m. UTC | #21
[ add Ard for the SBOM sysfs ABI commentary ]

Dionna Amalie Glaze wrote:
[..]
> > > My own plan for SEV-SNP was to have a bespoke signed measurement of
> > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're
> > > looking more at an industry alignment on coRIM for SBOM formats (yes
> > > please), then it'd be great to start getting that kind of info plumbed
> > > to the user in a uniform way that doesn't have to rely on servers
> > > providing the endorsements.
> > >
> > > [1] https://uefi.org/blog/firmware-sbom-proposal
> >
> > Honestly my first reaction for this ABI would be for a new file under
> > /sys/firmware/efi/efivars or similar.
> 
> For UEFI specifically that could make sense, yes. Not everyone has
> been mounting efivars, so it's been a bit of an uphill battle for that
> one.

I wonder what the concern is with mounting efivarfs vs configfs? In any
event this seems distinct enough to be its own /sys/firmware/efi/sbom
file. I would defer to Ard, but I think SBOM is a generally useful
concept that would be out of place as a blob returned from configfs-tsm.

> Still there's the matter of cached TDI RIMs. NVIDIA would have

I am not immediatly sure what a "TDI RIM" is?

> everyone send attestation requests to their servers every quote
> request in the NRAS architecture, but we're looking at other ways to

"NRAS" does not parse for me either.

> provide reliable attestation without a third party service, albeit
> with slightly different security properties.

Setting the above confusion aside, I would just say that in general yes,
the kernel needs to understand its role in an end-to-end attestation
architecture that is not beholden to a single vendor, but also allows
the kernel to enforce ABI stability / mitigate regressions based on
binary format changes.
  
Dionna Amalie Glaze Dec. 5, 2023, 10:04 p.m. UTC | #22
On Tue, Dec 5, 2023 at 12:06 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> [ add Ard for the SBOM sysfs ABI commentary ]
>
> Dionna Amalie Glaze wrote:
> [..]
> > > > My own plan for SEV-SNP was to have a bespoke signed measurement of
> > > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're
> > > > looking more at an industry alignment on coRIM for SBOM formats (yes
> > > > please), then it'd be great to start getting that kind of info plumbed
> > > > to the user in a uniform way that doesn't have to rely on servers
> > > > providing the endorsements.
> > > >
> > > > [1] https://uefi.org/blog/firmware-sbom-proposal
> > >
> > > Honestly my first reaction for this ABI would be for a new file under
> > > /sys/firmware/efi/efivars or similar.
> >
> > For UEFI specifically that could make sense, yes. Not everyone has
> > been mounting efivars, so it's been a bit of an uphill battle for that
> > one.
>
> I wonder what the concern is with mounting efivarfs vs configfs? In any
> event this seems distinct enough to be its own /sys/firmware/efi/sbom
> file. I would defer to Ard, but I think SBOM is a generally useful
> concept that would be out of place as a blob returned from configfs-tsm.
>
> > Still there's the matter of cached TDI RIMs. NVIDIA would have
>
> I am not immediatly sure what a "TDI RIM" is?
>

I might just be making up terms. Any trusted hardware device that has
its own attestation will (hopefully) have signed reference
measurements, or a Reference Integrity Manifest as TCG calls them.

> > everyone send attestation requests to their servers every quote
> > request in the NRAS architecture, but we're looking at other ways to
>
> "NRAS" does not parse for me either.
>

That would be this https://docs.attestation.nvidia.com/api-docs/nras.html

> > provide reliable attestation without a third party service, albeit
> > with slightly different security properties.
>
> Setting the above confusion aside, I would just say that in general yes,
> the kernel needs to understand its role in an end-to-end attestation
> architecture that is not beholden to a single vendor, but also allows
> the kernel to enforce ABI stability / mitigate regressions based on
> binary format changes.
>

I'm mainly holding on to hope that I don't have to introduce a new
runtime dependency on a service that gives a source of truth about the
software that's running in the VM.
If we can have a GUID table with a flexible size that the host can
request of the guest, then we can version ABI changes with new GUID
entries.
It's a big enough value space without vanity naming opportunities that
we can pretty easily make changes without incurring any guest kernel
changes.
  
Dan Williams Dec. 5, 2023, 11:11 p.m. UTC | #23
Dionna Amalie Glaze wrote:
> On Tue, Dec 5, 2023 at 12:06 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > [ add Ard for the SBOM sysfs ABI commentary ]
> >
> > Dionna Amalie Glaze wrote:
> > [..]
> > > > > My own plan for SEV-SNP was to have a bespoke signed measurement of
> > > > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're
> > > > > looking more at an industry alignment on coRIM for SBOM formats (yes
> > > > > please), then it'd be great to start getting that kind of info plumbed
> > > > > to the user in a uniform way that doesn't have to rely on servers
> > > > > providing the endorsements.
> > > > >
> > > > > [1] https://uefi.org/blog/firmware-sbom-proposal
> > > >
> > > > Honestly my first reaction for this ABI would be for a new file under
> > > > /sys/firmware/efi/efivars or similar.
> > >
> > > For UEFI specifically that could make sense, yes. Not everyone has
> > > been mounting efivars, so it's been a bit of an uphill battle for that
> > > one.
> >
> > I wonder what the concern is with mounting efivarfs vs configfs? In any
> > event this seems distinct enough to be its own /sys/firmware/efi/sbom
> > file. I would defer to Ard, but I think SBOM is a generally useful
> > concept that would be out of place as a blob returned from configfs-tsm.
> >
> > > Still there's the matter of cached TDI RIMs. NVIDIA would have
> >
> > I am not immediatly sure what a "TDI RIM" is?
> >
> 
> I might just be making up terms. Any trusted hardware device that has
> its own attestation will (hopefully) have signed reference
> measurements, or a Reference Integrity Manifest as TCG calls them.

Ah, ok.

> 
> > > everyone send attestation requests to their servers every quote
> > > request in the NRAS architecture, but we're looking at other ways to
> >
> > "NRAS" does not parse for me either.
> >
> 
> That would be this https://docs.attestation.nvidia.com/api-docs/nras.html

Thanks!

> > > provide reliable attestation without a third party service, albeit
> > > with slightly different security properties.
> >
> > Setting the above confusion aside, I would just say that in general yes,
> > the kernel needs to understand its role in an end-to-end attestation
> > architecture that is not beholden to a single vendor, but also allows
> > the kernel to enforce ABI stability / mitigate regressions based on
> > binary format changes.
> >
> 
> I'm mainly holding on to hope that I don't have to introduce a new
> runtime dependency on a service that gives a source of truth about the
> software that's running in the VM.
> If we can have a GUID table with a flexible size that the host can
> request of the guest, then we can version ABI changes with new GUID
> entries.
> It's a big enough value space without vanity naming opportunities that
> we can pretty easily make changes without incurring any guest kernel
> changes.

So it's not only SBOM that you are concerned about, but instead want to
have a one stop shop for auxiliary evidence and get the vendors agree on
following the same GUID+blob precedent that is already there for the AMD
cert chain? That sounds reasonable, but I still feel it should be
limited to things that do not fit into an existing ABI namespace.

...unless its evidence / material that only a TVM would ever need.
  
Dionna Amalie Glaze Dec. 6, 2023, 12:43 a.m. UTC | #24
> So it's not only SBOM that you are concerned about, but instead want to
> have a one stop shop for auxiliary evidence and get the vendors agree on
> following the same GUID+blob precedent that is already there for the AMD
> cert chain? That sounds reasonable, but I still feel it should be
> limited to things that do not fit into an existing ABI namespace.
>

The tl;dr is I want something simple for a hard problem and I'll
probably lose this argument.

The unfortunate state of affairs here is that it is "vendor dependent"
whatever pathway they choose to provide reference material for
attestations. Even with the RATS framework, "reference value provider"
is just a bubble in a diagram and not a federated service protocol
that we've all agreed on. TCG's recommendation for delivering the RIM
in the efi volume doesn't quite work in the cloud setting, since
images own that and not us as the firmware provider. There's no
standard for how to inform a user where to get a RIM other than that
:(

> ...unless its evidence / material that only a TVM would ever need.

There's the rub. Evidence may be provided to the TVM to forward to
verifiers, but really it's the verifiers that are tasked with
gathering all this attestation collateral. The TVM doesn't have to do
anything, even provide machine-specific certificates. That's pretty
dreadful system design though, since you end up with global services
in your hot path rather than getting cached data from the machine
you're getting an attestation from. My first stab at it is to just
have a storage bucket with objects named based on expected measurement
values, so you just construct a URL and download the endorsement if
you need it. I'd really rather just make that part of what the guest
can get from auxblob since they pay the bandwidth of forwarding it to
verifiers rather than my cost center paying for the storage bucket
bandwidth (is this the real reason I'm pushing on this? I'm unsure).

If you're instead asking if this information would need to get piped
to a non-TVM (say, a non-confidential VM with a virtual TPM), then the
answer is maa~aaa~aaybe but probably not. PCR0 in the cloud really
needs its own profile, since the TCG platform firmware profile (PFP)
is unsuitable. There's not a whole lot of point delivering a signed
endorsement of a firmware measurement that we don't include in the
event log anyway for stability reasons, even if that's against the PFP
spec. So probably not. I think we're pretty clear that host-cached
data would only need to be for TVMs.

If you ask the folks I've been talking to at Intel or NVIDIA, they'd
probably say to put a service in your dependencies and be done with
it; it's the vendor's responsibility to provide an available enough
service to provide all the evidence that an attestation verifier may
want. That's quite unfriendly to the smaller players in the field, but
maybe it's easy to integrate with something like Veraison's
endorsement provisioning API [1] or Intel's Trust Authority (née
Project Amber). I haven't done it.

[1] https://github.com/veraison/docs/tree/main/api/endorsement-provisioning
  

Patch

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 602aaf82eef3..d71ec257debb 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -19,6 +19,7 @@ 
 #include <linux/misc_cgroup.h>
 #include <linux/processor.h>
 #include <linux/trace_events.h>
+#include <uapi/linux/sev-guest.h>
 
 #include <asm/pkru.h>
 #include <asm/trapnr.h>
@@ -339,6 +340,8 @@  static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		ret = verify_snp_init_flags(kvm, argp);
 		if (ret)
 			goto e_free;
+
+		mutex_init(&sev->guest_req_lock);
 	}
 
 	ret = sev_platform_init(&argp->error);
@@ -2345,8 +2348,10 @@  static int snp_get_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 static void snp_replace_certs(struct kvm_sev_info *sev, struct sev_snp_certs *snp_certs)
 {
+	mutex_lock(&sev->guest_req_lock);
 	sev_snp_certs_put(sev->snp_certs);
 	sev->snp_certs = snp_certs;
+	mutex_unlock(&sev->guest_req_lock);
 }
 
 static int snp_set_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
@@ -3218,6 +3223,8 @@  static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_HV_FEATURES:
 	case SVM_VMGEXIT_PSC:
 	case SVM_VMGEXIT_TERM_REQUEST:
+	case SVM_VMGEXIT_GUEST_REQUEST:
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
 		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
@@ -3627,6 +3634,163 @@  static int sev_snp_ap_creation(struct vcpu_svm *svm)
 	return ret;
 }
 
+static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm,
+					 struct sev_data_snp_guest_request *data,
+					 gpa_t req_gpa, gpa_t resp_gpa)
+{
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	kvm_pfn_t req_pfn, resp_pfn;
+	struct kvm_sev_info *sev;
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE))
+		return SEV_RET_INVALID_PARAM;
+
+	req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
+	if (is_error_noslot_pfn(req_pfn))
+		return SEV_RET_INVALID_ADDRESS;
+
+	resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
+	if (is_error_noslot_pfn(resp_pfn))
+		return SEV_RET_INVALID_ADDRESS;
+
+	if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
+		return SEV_RET_INVALID_ADDRESS;
+
+	data->gctx_paddr = __psp_pa(sev->snp_context);
+	data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
+	data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
+
+	return 0;
+}
+
+static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data, unsigned long *rc)
+{
+	u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
+	int ret;
+
+	ret = snp_page_reclaim(pfn);
+	if (ret)
+		*rc = SEV_RET_INVALID_ADDRESS;
+
+	ret = rmp_make_shared(pfn, PG_LEVEL_4K);
+	if (ret)
+		*rc = SEV_RET_INVALID_ADDRESS;
+}
+
+static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+	struct sev_data_snp_guest_request data = {0};
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_sev_info *sev;
+	unsigned long rc;
+	int err;
+
+	if (!sev_snp_guest(vcpu->kvm)) {
+		rc = SEV_RET_INVALID_GUEST;
+		goto e_fail;
+	}
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	mutex_lock(&sev->guest_req_lock);
+
+	rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
+	if (rc)
+		goto unlock;
+
+	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
+	if (rc)
+		/* Ensure an error value is returned to guest. */
+		rc = err ? err : SEV_RET_INVALID_ADDRESS;
+
+	snp_cleanup_guest_buf(&data, &rc);
+
+unlock:
+	mutex_unlock(&sev->guest_req_lock);
+
+e_fail:
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
+}
+
+static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+	struct sev_data_snp_guest_request req = {0};
+	struct sev_snp_certs *snp_certs = NULL;
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long data_npages;
+	struct kvm_sev_info *sev;
+	unsigned long exitcode = 0;
+	u64 data_gpa;
+	int err, rc;
+
+	if (!sev_snp_guest(vcpu->kvm)) {
+		rc = SEV_RET_INVALID_GUEST;
+		goto e_fail;
+	}
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+	data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
+		exitcode = SEV_RET_INVALID_ADDRESS;
+		goto e_fail;
+	}
+
+	mutex_lock(&sev->guest_req_lock);
+
+	rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
+	if (rc)
+		goto unlock;
+
+	/*
+	 * If a VMM-specific certificate blob hasn't been provided, grab the
+	 * host-wide one.
+	 */
+	snp_certs = sev_snp_certs_get(sev->snp_certs);
+	if (!snp_certs)
+		snp_certs = sev_snp_global_certs_get();
+
+	/*
+	 * If there is a host-wide or VMM-specific certificate blob available,
+	 * make sure the guest has allocated enough space to store it.
+	 * Otherwise, inform the guest how much space is needed.
+	 */
+	if (snp_certs && (data_npages << PAGE_SHIFT) < snp_certs->len) {
+		vcpu->arch.regs[VCPU_REGS_RBX] = snp_certs->len >> PAGE_SHIFT;
+		exitcode = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN);
+		goto cleanup;
+	}
+
+	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &req, &err);
+	if (rc) {
+		/* pass the firmware error code */
+		exitcode = err;
+		goto cleanup;
+	}
+
+	/* Copy the certificate blob in the guest memory */
+	if (snp_certs &&
+	    kvm_write_guest(kvm, data_gpa, snp_certs->data, snp_certs->len))
+		exitcode = SEV_RET_INVALID_ADDRESS;
+
+cleanup:
+	sev_snp_certs_put(snp_certs);
+	snp_cleanup_guest_buf(&req, &exitcode);
+
+unlock:
+	mutex_unlock(&sev->guest_req_lock);
+
+e_fail:
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode);
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3894,6 +4058,18 @@  int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		vcpu->run->system_event.ndata = 1;
 		vcpu->run->system_event.data[0] = control->ghcb_gpa;
 		break;
+	case SVM_VMGEXIT_GUEST_REQUEST:
+		snp_handle_guest_request(svm, control->exit_info_1, control->exit_info_2);
+
+		ret = 1;
+		break;
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+		snp_handle_ext_guest_request(svm,
+					     control->exit_info_1,
+					     control->exit_info_2);
+
+		ret = 1;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bdf792ba06e1..3673a6e4e22e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -98,6 +98,7 @@  struct kvm_sev_info {
 	void *snp_context;      /* SNP guest context page */
 	u64 sev_features;	/* Features set at VMSA creation */
 	struct sev_snp_certs *snp_certs;
+	struct mutex guest_req_lock; /* Lock for guest request handling */
 };
 
 struct kvm_svm {
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 4807ddd6ec52..f9c75c561c4e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -2109,6 +2109,21 @@  void sev_snp_certs_put(struct sev_snp_certs *certs)
 }
 EXPORT_SYMBOL_GPL(sev_snp_certs_put);
 
+struct sev_snp_certs *sev_snp_global_certs_get(void)
+{
+	struct sev_device *sev;
+
+	if (!psp_master || !psp_master->sev_data)
+		return NULL;
+
+	sev = psp_master->sev_data;
+	if (!sev->snp_initialized)
+		return NULL;
+
+	return sev_snp_certs_get(sev->snp_certs);
+}
+EXPORT_SYMBOL_GPL(sev_snp_global_certs_get);
+
 static void sev_exit(struct kref *ref)
 {
 	misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 722e26d28d2f..3b294ccbbec9 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -25,6 +25,7 @@  struct sev_snp_certs {
 struct sev_snp_certs *sev_snp_certs_new(void *data, u32 len);
 struct sev_snp_certs *sev_snp_certs_get(struct sev_snp_certs *certs);
 void sev_snp_certs_put(struct sev_snp_certs *certs);
+struct sev_snp_certs *sev_snp_global_certs_get(void);
 
 /**
  * SEV platform state