[RFC,0/33] KVM: x86: hyperv: Introduce VSM support

Message ID	20231108111806.92604-1-nsaenz@amazon.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; From: Nicolas Saenz Julienne <nsaenz@amazon.com> To: <kvm@vger.kernel.org> CC: <linux-kernel@vger.kernel.org>, <linux-hyperv@vger.kernel.org>, <pbonzini@redhat.com>, <seanjc@google.com>, <vkuznets@redhat.com>, <anelkz@amazon.com>, <graf@amazon.com>, <dwmw@amazon.co.uk>, <jgowans@amazon.com>, <corbert@lwn.net>, <kys@microsoft.com>, <haiyangz@microsoft.com>, <decui@microsoft.com>, <x86@kernel.org>, <linux-doc@vger.kernel.org> Subject: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Date: Wed, 8 Nov 2023 11:17:33 +0000 Message-ID: <20231108111806.92604-1-nsaenz@amazon.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk
Series	KVM: x86: hyperv: Introduce VSM support \| [RFC,0/33] KVM: x86: hyperv: Introduce VSM support [RFC,01/33] KVM: x86: Decouple lapic.h from hyperv.h [RFC,02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS [RFC,03/33] KVM: x86: hyper-v: Introduce XMM output support [RFC,04/33] KVM: x86: hyper-v: Move hypercall page handling into separate function [RFC,05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page [RFC,06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs [RFC,07/33] KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_VSM [RFC,08/33] KVM: x86: Don't use hv_timer if CAP_HYPERV_VSM enabled [RFC,09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers [RFC,10/33] KVM: x86: hyper-v: Introduce KVM_HV_GET_VSM_STATE [RFC,11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space [RFC,12/33] KVM: x86: hyper-v: Handle VSM hcalls in user-space [RFC,13/33] KVM: Allow polling vCPUs for events [RFC,14/33] KVM: x86: Add VTL to the MMU role [RFC,15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults [RFC,16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits [RFC,17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled [RFC,18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array [RFC,19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() from struct kvm's mem_attr_array [RFC,20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() from struct kvm's mem_attr_array [RFC,21/33] KVM: Pass memory attribute array as a MMU notifier argument [RFC,23/33] KVM: Expose memory attribute helper functions unanimously [RFC,24/33] KVM: x86: hyper-v: Introduce KVM VTL device [RFC,25/33] KVM: Introduce a set of new memory attributes [RFC,26/33] KVM: x86: hyper-vsm: Allow setting per-VTL memory attributes [RFC,27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots [RFC,28/33] x86/hyper-v: Introduce memory intercept message structure [RFC,29/33] KVM: VMX: Save instruction length on EPT violation [RFC,30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request [RFC,32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS [RFC,33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM"

Message ID

20231108111806.92604-1-nsaenz@amazon.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:7 as permitted sender)
 client-ip=2620:137:e000::3:7;
From: Nicolas Saenz Julienne <nsaenz@amazon.com>
To: <kvm@vger.kernel.org>
CC: <linux-kernel@vger.kernel.org>, <linux-hyperv@vger.kernel.org>,
        <pbonzini@redhat.com>, <seanjc@google.com>, <vkuznets@redhat.com>,
        <anelkz@amazon.com>, <graf@amazon.com>, <dwmw@amazon.co.uk>,
        <jgowans@amazon.com>, <corbert@lwn.net>, <kys@microsoft.com>,
        <haiyangz@microsoft.com>, <decui@microsoft.com>, <x86@kernel.org>,
        <linux-doc@vger.kernel.org>
Subject: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
Date: Wed, 8 Nov 2023 11:17:33 +0000
Message-ID: <20231108111806.92604-1-nsaenz@amazon.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
Precedence: bulk

Series

KVM: x86: hyperv: Introduce VSM support |

Message

Nicolas Saenz Julienne Nov. 8, 2023, 11:17 a.m. UTC

Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature
that leverages the hypervisor to create secure execution environments
within a guest. VSM is documented as part of Microsoft's Hypervisor Top
Level Functional Specification [1]. Security features that build upon
VSM, like Windows Credential Guard, are enabled by default on Windows 11,
and are becoming a prerequisite in some industries.

This RFC series introduces the necessary infrastructure to emulate VSM
enabled guests. It is a snapshot of the progress we made so far, and its
main goal is to gather design feedback. Specifically on the KVM APIs we
introduce. For a high level design overview, see the documentation in
patch 33.

Additionally, this topic will be discussed as part of the KVM
Micro-conference, in this year's Linux Plumbers Conference [2].

The series is accompanied by two repositories:
- A PoC QEMU implementation of VSM [3].
- VSM kvm-unit-tests [4].

Note that this isn't a full VSM implementation. For now it only supports
2 VTLs, and only runs on uniprocessor guests. It is capable of booting
Windows Sever 2016/2019, but is unstable during runtime.

The series is based on the v6.6 kernel release, and depends on the
introduction of KVM memory attributes, which is being worked on
independently in "KVM: guest_memfd() and per-page attributes" [5]. A full
Linux tree is also made available [6].

Series rundown:
- Patch 2 introduces the concept of APIC ID groups.
- Patches 3-12 introduce the VSM capability and basic VTL awareness into
Hyper-V emulation.
- Patch 13 introduces vCPU polling support.
- Patches 14-31 use KVM's memory attributes to implement VTL memory
protections. Introduces the VTL KMV device and secure memory
intercepts.
- Patch 32 is a temporary implementation of
HVCALL_TRANSLATE_VIRTUAL_ADDRESS necessary to boot Windows 2019.
- Patch 33 introduces documentation.

Our intention is to integrate feedback gathered in the RFC and LPC while
we finish the VSM implementation. In the future, we will split the series
into distinct feature patch sets and upstream these independently.

Thanks,
Nicolas

[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
[2] https://lpc.events/event/17/sessions/166/#20231114
[3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
[4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
[5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
[6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1.
There are also two small dependencies with
https://marc.info/?l=kvm&m=167887543028109&w=2 and
https://lkml.org/lkml/2023/10/17/972

Comments

Alexander Graf Nov. 8, 2023, 11:40 a.m. UTC | #1

Hey Nicolas,

On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature
> that leverages the hypervisor to create secure execution environments
> within a guest. VSM is documented as part of Microsoft's Hypervisor Top
> Level Functional Specification [1]. Security features that build upon
> VSM, like Windows Credential Guard, are enabled by default on Windows 11,
> and are becoming a prerequisite in some industries.
>
> This RFC series introduces the necessary infrastructure to emulate VSM
> enabled guests. It is a snapshot of the progress we made so far, and its
> main goal is to gather design feedback. Specifically on the KVM APIs we
> introduce. For a high level design overview, see the documentation in
> patch 33.
>
> Additionally, this topic will be discussed as part of the KVM
> Micro-conference, in this year's Linux Plumbers Conference [2].


Awesome, looking forward to the session! :)


> The series is accompanied by two repositories:
>   - A PoC QEMU implementation of VSM [3].
>   - VSM kvm-unit-tests [4].
>
> Note that this isn't a full VSM implementation. For now it only supports
> 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> Windows Sever 2016/2019, but is unstable during runtime.


How much of these limitations are inherent in the current set of 
patches? What is missing to go beyond 2 VTLs and into SMP land? Anything 
that will require API changes?


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

Nicolas Saenz Julienne Nov. 8, 2023, 2:41 p.m. UTC | #2

On Wed Nov 8, 2023 at 11:40 AM UTC, Alexander Graf wrote:
> Hey Nicolas,

[...]

> > The series is accompanied by two repositories:
> >   - A PoC QEMU implementation of VSM [3].
> >   - VSM kvm-unit-tests [4].
> >
> > Note that this isn't a full VSM implementation. For now it only supports
> > 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> > Windows Sever 2016/2019, but is unstable during runtime.
>
> How much of these limitations are inherent in the current set of 
> patches? What is missing to go beyond 2 VTLs and into SMP land? Anything 
> that will require API changes?

The main KVM concepts introduced by this series are ready to deal with
any number of VTLs (APIC ID groups, VTL KVM device).

KVM_HV_GET_VSM_STATE should provide a copy of 'vsm_code_page_offsets'
per-VTL, since the hypercall page is partition wide but per-VTL.
Attaching that information as a VTL KVM device attribute fits that
requirement nicely. I'd prefer going that way especially if the VTL KVM
device has a decent reception. Also, the secure memory intecepts and
HVCALL_TRANSLATE_VIRTUAL_ADDRESS take some VTL related shortcuts, but
those are going away. Otherwise, I don't see any necessary in-kernel
changes.

When virtualizing Windows with VSM I've never seen usages that go beyond
VTL1. So enabling VTL > 1 will be mostly a kvm-unit-tests effort. As for
SMP, it just a matter of work. Notably HvStartVirtualProcessor and
HvGetVpIndexFromApicId need to be implemented, and making sure the QEMU
VTL scheduling code holds up.

Nicolas

Sean Christopherson Nov. 8, 2023, 4:55 p.m. UTC | #3

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> This RFC series introduces the necessary infrastructure to emulate VSM
> enabled guests. It is a snapshot of the progress we made so far, and its
> main goal is to gather design feedback.

Heh, then please provide an overview of the design, and ideally context and/or
justification for various design decisions.  It doesn't need to be a proper design
doc, and you can certainly point at other documentation for explaining VSM/VTLs,
but a few paragraphs and/or verbose bullet points would go a long way.

The documentation in patch 33 provides an explanation of VSM itself, and a little
insight into how userspace can utilize the KVM implementation.  But the documentation
provides no explanation of the mechanics that KVM *developers* care about, e.g.
the use of memory attributes, how memory attributes are enforced, whether or not
an in-kernel local APIC is required, etc.

Nor does the documentation explain *why*, e.g. why store a separate set of memory
attributes per VTL "device", which by the by is broken and unnecessary.

> Specifically on the KVM APIs we introduce. For a high level design overview,
> see the documentation in patch 33.
> 
> Additionally, this topic will be discussed as part of the KVM
> Micro-conference, in this year's Linux Plumbers Conference [2].
> 
> The series is accompanied by two repositories:
>  - A PoC QEMU implementation of VSM [3].
>  - VSM kvm-unit-tests [4].
> 
> Note that this isn't a full VSM implementation. For now it only supports
> 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> Windows Sever 2016/2019, but is unstable during runtime.
> 
> The series is based on the v6.6 kernel release, and depends on the
> introduction of KVM memory attributes, which is being worked on
> independently in "KVM: guest_memfd() and per-page attributes" [5].

This doesn't actually apply on 6.6 with v14 of guest_memfd, because v14 of
guest_memfd is based on kvm-6.7-1.  Ah, and looking at your github repo, this
isn't based on v14 at all, it's based on v12.

That's totally fine, but the cover letter needs to explicitly, clearly, and
*accurately* state the dependencies.  I can obviously grab the full branch from
github, but that's not foolproof, e.g. if you accidentally delete or force push
to that branch.  And I also prefer to know that what I'm replying to on list is
the exact same code that I am looking at.

> A full Linux tree is also made available [6].
> 
> Series rundown:
>  - Patch 2 introduces the concept of APIC ID groups.
>  - Patches 3-12 introduce the VSM capability and basic VTL awareness into
>    Hyper-V emulation.
>  - Patch 13 introduces vCPU polling support.
>  - Patches 14-31 use KVM's memory attributes to implement VTL memory
>    protections. Introduces the VTL KMV device and secure memory
>    intercepts.
>  - Patch 32 is a temporary implementation of
>    HVCALL_TRANSLATE_VIRTUAL_ADDRESS necessary to boot Windows 2019.
>  - Patch 33 introduces documentation.
> 
> Our intention is to integrate feedback gathered in the RFC and LPC while
> we finish the VSM implementation. In the future, we will split the series
> into distinct feature patch sets and upstream these independently.
> 
> Thanks,
> Nicolas
> 
> [1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
> [2] https://lpc.events/event/17/sessions/166/#20231114
> [3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
> [4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
> [5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
> [6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1. 

When providing github links, my preference is to format the pointers as:

  <repo> <branch>

or
  <repo> tags/<tag>

e.g.

  https://github.com/vianpl/linux vsm-rfc-v1

so that readers can copy+paste the full thing directly into `git fetch`.  It's a
minor thing, but AFAIK no one actually does review by clicking through github's
webview.

>     There are also two small dependencies with
>     https://marc.info/?l=kvm&m=167887543028109&w=2 and
>     https://lkml.org/lkml/2023/10/17/972

Please use lore links, there's zero reason to use anything else these days.  For
those of us that use b4, lore links make life much easier.

Sean Christopherson Nov. 8, 2023, 6:33 p.m. UTC | #4

On Wed, Nov 08, 2023, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM
> > enabled guests. It is a snapshot of the progress we made so far, and its
> > main goal is to gather design feedback.
> 
> Heh, then please provide an overview of the design, and ideally context and/or
> justification for various design decisions.  It doesn't need to be a proper design
> doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> but a few paragraphs and/or verbose bullet points would go a long way.
> 
> The documentation in patch 33 provides an explanation of VSM itself, and a little
> insight into how userspace can utilize the KVM implementation.  But the documentation
> provides no explanation of the mechanics that KVM *developers* care about, e.g.
> the use of memory attributes, how memory attributes are enforced, whether or not
> an in-kernel local APIC is required, etc.
> 
> Nor does the documentation explain *why*, e.g. why store a separate set of memory
> attributes per VTL "device", which by the by is broken and unnecessary.

After speed reading the series..  An overview of the design, why you made certain
choices, and the tradeoffs between various options is definitely needed.

A few questions off the top of my head:

 - What is the split between userspace and KVM?  How did you arrive at that split?

 - How much *needs* to be in KVM?  I.e. how much can be pushed to userspace while
   maintaininly good performance?
   
 - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
   in a VTL device and APIC ID groups, why not modify "struct kvm" to support
   replicating state that needs to be tracked per-VTL?  Because of how memory
   attributes affect hugepages, duplicating *memslots* might actually be easier
   than teaching memslots to be VTL-aware.

 - Is "struct kvm_vcpu" the best representation of an execution context (if I'm
   getting the terminology right)?  E.g. if 90% of the state is guaranteed to be
   identical for a given vCPU across execution contexts, then modeling that with
   separate kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
   but it might be quite high depending on how much the TFLS restricts the state
   of the vCPU, e.g. if it's 64-bit only.

The more info you can provide before LPC, the better, e.g. so that we can spend
time discussing options instead of you getting peppered with questions about the
requirements and whatnot.

Nicolas Saenz Julienne Nov. 10, 2023, 5:56 p.m. UTC | #5

Hi Sean,
Thanks for taking the time to review the series. I took note of your
comments across the series, and will incorporate them into the LPC
discussion.

On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Sean Christopherson wrote:
> > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > > This RFC series introduces the necessary infrastructure to emulate VSM
> > > enabled guests. It is a snapshot of the progress we made so far, and its
> > > main goal is to gather design feedback.
> >
> > Heh, then please provide an overview of the design, and ideally context and/or
> > justification for various design decisions.  It doesn't need to be a proper design
> > doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> > but a few paragraphs and/or verbose bullet points would go a long way.
> >
> > The documentation in patch 33 provides an explanation of VSM itself, and a little
> > insight into how userspace can utilize the KVM implementation.  But the documentation
> > provides no explanation of the mechanics that KVM *developers* care about, e.g.
> > the use of memory attributes, how memory attributes are enforced, whether or not
> > an in-kernel local APIC is required, etc.
> >
> > Nor does the documentation explain *why*, e.g. why store a separate set of memory
> > attributes per VTL "device", which by the by is broken and unnecessary.
>
> After speed reading the series..  An overview of the design, why you made certain
> choices, and the tradeoffs between various options is definitely needed.
>
> A few questions off the top of my head:
>
>  - What is the split between userspace and KVM?  How did you arrive at that split?

Our original design, which we discussed in the KVM forum 2023 [1] and is
public [2], implemented most of VSM in-kernel. Notably we introduced VTL
awareness in struct kvm_vcpu. This turned out to be way too complex: now
vcpus have multiple CPU architectural states, events, apics, mmu, etc.
First of all, the code turned out to be very intrusive, for example,
most apic APIs had to be reworked one way or another to accommodate
the fact there are multiple apics available. Also, we were forced to
introduce VSM-specific semantics in x86 emulation code. But more
importantly, the biggest pain has been dealing with state changes, they
may be triggered remotely through requests, and some are already fairly
delicate as-is. They involve a multitude of corner cases that almost
never apply for a VTL aware kvm_vcpu. Especially if you factor in
features like live migration. It's been a continuous source of
regressions.

Memory protections were implemented by using memory slot modifications.
We introduced a downstream API that allows updating memory slots
concurrently with vCPUs running. I think there was a similar proposal
upstream from Red Hat some time ago. The result is complex, hard to
generalize and slow.

So we decided to move all this complexity outside of struct kvm_vcpu
and, as much as possible, out of the kernel. We figures out the basic
kernel building blocks that are absolutely necessary, and let user-space
deal with the rest.

>  - How much *needs* to be in KVM?  I.e. how much can be pushed to userspace while
>    maintaininly good performance?

As I said above, the aim of the current design is to keep it as light as
possible. The biggest move we made was moving VTL switching into
user-space. We don't see any indication that performance is affected in
a major way. But we will know for sure once we finish the implementation
and test it under real use-cases.

>  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
>    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
>    replicating state that needs to be tracked per-VTL?  Because of how memory
>    attributes affect hugepages, duplicating *memslots* might actually be easier
>    than teaching memslots to be VTL-aware.

I do agree that we need to introduce some level VTL awareness into
memslots. There's the hugepages issues you pointed out. But it'll be
also necessary once we look at how to implement overlay pages that are
per-VTL. (A topic I didn't mention in the series as I though I had
managed to solve memory protections while avoiding the need for multiple
slots). What I have in mind is introducing a memory slot address space
per-VTL, similar to how we do things with SMM.

It's important to note that requirements for overlay pages and memory
protections are very different. Overlay pages are scarce, and are setup
once and never change (AFAICT), so we think stopping all vCPUs, updating
slots, and resuming execution will provide good enough performance.
Memory protections happen very frequently, generally with page
granularity, and may be short-lived.

>  - Is "struct kvm_vcpu" the best representation of an execution context (if I'm
>    getting the terminology right)?

Let's forget I ever mentioned execution contexts. I used it in the hopes
of making the VTL concept a little more understandable for non-VSM aware
people. It's meant to be interchangeable with VTL. But I see how it
creates confusion.

>    E.g. if 90% of the state is guaranteed to be identical for a given
>    vCPU across execution contexts, then modeling that with separate
>    kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
>    but it might be quite high depending on how much the TFLS restricts
>    the state of the vCPU, e.g. if it's 64-bit only.

For the record here's the private VTL state (TLFS 15.11.1):

"In general, each VTL has its own control registers, RIP register, RSP
 register, and MSRs:

 SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK,
 EFER, PAT, KERNEL_GSBASE, FS.BASE, GS.BASE, TSC_AUX
 HV_X64_MSR_HYPERCALL
 HV_X64_MSR_GUEST_OS_ID
 HV_X64_MSR_REFERENCE_TSC
 HV_X64_MSR_APIC_FREQUENCY
 HV_X64_MSR_EOI
 HV_X64_MSR_ICR
 HV_X64_MSR_TPR
 HV_X64_MSR_APIC_ASSIST_PAGE
 HV_X64_MSR_NPIEP_CONFIG
 HV_X64_MSR_SIRBP
 HV_X64_MSR_SCONTROL
 HV_X64_MSR_SVERSION
 HV_X64_MSR_SIEFP
 HV_X64_MSR_SIMP
 HV_X64_MSR_EOM
 HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15
 HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT
 Local APIC registers (including CR8/TPR)
"

The rest is shared.

Note that we've observed that during normal operation, VTL switches
don't happen that often. The boot process is the most affected by any
performance impact VSM might introduce, which issues 100000s (mostly
memory protections).

Nicolas

[1] https://kvm-forum.qemu.org/2023/talk/TK7YGD/
[2] Partial rebase of our original implementation:
    https://github.com/vianpl/linux vsm

Nicolas Saenz Julienne Nov. 10, 2023, 7:04 p.m. UTC | #6

On Wed Nov 8, 2023 at 4:55 PM UTC, Sean Christopherson wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM
> > enabled guests. It is a snapshot of the progress we made so far, and its
> > main goal is to gather design feedback.
>
> Heh, then please provide an overview of the design, and ideally context and/or
> justification for various design decisions.  It doesn't need to be a proper design
> doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> but a few paragraphs and/or verbose bullet points would go a long way.
>
> The documentation in patch 33 provides an explanation of VSM itself, and a little
> insight into how userspace can utilize the KVM implementation.  But the documentation
> provides no explanation of the mechanics that KVM *developers* care about, e.g.
> the use of memory attributes, how memory attributes are enforced, whether or not
> an in-kernel local APIC is required, etc.

Noted, I'll provide a design review on the next submission.

> Nor does the documentation explain *why*, e.g. why store a separate set of memory
> attributes per VTL "device", which by the by is broken and unnecessary.

It's clear to me how the current implementation of VTL devices is
broken. But unncessary? That made me think we could inject the VTL In
the memory attribute key, for ex. with 'gfn | vtl << 58'. And then use
generic API and a single xarray.

Nicolas

Sean Christopherson Nov. 10, 2023, 7:32 p.m. UTC | #7

On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> >  - What is the split between userspace and KVM?  How did you arrive at that split?
> 
> Our original design, which we discussed in the KVM forum 2023 [1] and is
> public [2], implemented most of VSM in-kernel. Notably we introduced VTL
> awareness in struct kvm_vcpu.

...

> So we decided to move all this complexity outside of struct kvm_vcpu
> and, as much as possible, out of the kernel. We figures out the basic
> kernel building blocks that are absolutely necessary, and let user-space
> deal with the rest.

Sorry, I should have been more clear.  What's the split in terms of responsibilities,
i.e. what will KVM's ABI look like?  E.g. if the vCPU=>VTLs setup is nonsensical,
does KVM care?

My general preference is for KVM to be as permissive as possible, i.e. let
userspace do whatever it wants so long as it doesn't place undue burden on KVM.
But at the same time I don't to end up in a similar boat as many of the paravirt
features, where things just stop working if userspace or the guest makes a goof.

> >  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
> >    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
> >    replicating state that needs to be tracked per-VTL?  Because of how memory
> >    attributes affect hugepages, duplicating *memslots* might actually be easier
> >    than teaching memslots to be VTL-aware.
> 
> I do agree that we need to introduce some level VTL awareness into
> memslots. There's the hugepages issues you pointed out. But it'll be
> also necessary once we look at how to implement overlay pages that are
> per-VTL. (A topic I didn't mention in the series as I though I had
> managed to solve memory protections while avoiding the need for multiple
> slots). What I have in mind is introducing a memory slot address space
> per-VTL, similar to how we do things with SMM.

Noooooooo (I hate memslot address spaces :-) )

Why not represent each VTL with a separate "struct kvm" instance?  That would
naturally provide per-VTL behavior for:

  - APIC groups
  - memslot overlays
  - memory attributes (and their impact on hugepages)
  - MMU pages

The only (obvious) issue with that approach would be cross-VTL operations.  IIUC,
sending IPIs across VTLs isn't allowed, but even if it were, that should be easy
enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

GVA=>GPA translation would be trickier, but that patch suggests you want to handle
that in userspace anyways.  And if translation is a rare/slow path, maybe it could
simply be punted to userspace?

  NOTE: The hypercall implementation is incomplete and only shared for
  completion. Additionally we'd like to move the VTL aware parts to
  user-space.

Ewww, and looking at what it would take to support cross-VM translations shows
another problem with using vCPUs to model VTLs.  Nothing prevents userspace from
running a virtual CPU at multiple VTLs concurrently, which means that anything
that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be
modified while kvm_hv_xlate_va_walk() is running.

I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something
holds the other kvm_vcpu/VTL's mutex.  Though ideally, KVM would punt all cross-VTL
operations to userspace.  :-)

If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't
make the locking and concurrency problems meaningfully easier or harder to solve.
E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single
virtual machine, to belong to the same process.  That'd avoid headaches with
mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a
different kvm would need special handling?

Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken, the
in_vtl field is consumed before send_ipi is read from userspace.

	union hv_input_vtl *in_vtl;
	u64 valid_bank_mask;
	u32 vector;
	bool all_cpus;
	u8 vtl;

	/* VTL is at the same offset on both IPI types */
	in_vtl = &send_ipi.in_vtl;
	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);

> >    E.g. if 90% of the state is guaranteed to be identical for a given
> >    vCPU across execution contexts, then modeling that with separate
> >    kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
> >    but it might be quite high depending on how much the TFLS restricts
> >    the state of the vCPU, e.g. if it's 64-bit only.
> 
> For the record here's the private VTL state (TLFS 15.11.1):
> 
> "In general, each VTL has its own control registers, RIP register, RSP
>  register, and MSRs:
> 
>  SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK,
>  EFER, PAT, KERNEL_GSBASE, FS.BASE, GS.BASE, TSC_AUX
>  HV_X64_MSR_HYPERCALL
>  HV_X64_MSR_GUEST_OS_ID
>  HV_X64_MSR_REFERENCE_TSC
>  HV_X64_MSR_APIC_FREQUENCY
>  HV_X64_MSR_EOI
>  HV_X64_MSR_ICR
>  HV_X64_MSR_TPR
>  HV_X64_MSR_APIC_ASSIST_PAGE
>  HV_X64_MSR_NPIEP_CONFIG
>  HV_X64_MSR_SIRBP
>  HV_X64_MSR_SCONTROL
>  HV_X64_MSR_SVERSION
>  HV_X64_MSR_SIEFP
>  HV_X64_MSR_SIMP
>  HV_X64_MSR_EOM
>  HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15
>  HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT
>  Local APIC registers (including CR8/TPR)

Ugh, the APIC state is quite the killer.  And I gotta image things like CET and
FRED are only going to increase that list.

Nicolas Saenz Julienne Nov. 11, 2023, 11:55 a.m. UTC | #8

On Fri Nov 10, 2023 at 7:32 PM UTC, Sean Christopherson wrote:
> On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> > On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> > >  - What is the split between userspace and KVM?  How did you arrive at that split?
> >
> > Our original design, which we discussed in the KVM forum 2023 [1] and is
> > public [2], implemented most of VSM in-kernel. Notably we introduced VTL
> > awareness in struct kvm_vcpu.
>
> ...
>
> > So we decided to move all this complexity outside of struct kvm_vcpu
> > and, as much as possible, out of the kernel. We figures out the basic
> > kernel building blocks that are absolutely necessary, and let user-space
> > deal with the rest.
>
> Sorry, I should have been more clear.  What's the split in terms of responsibilities,
> i.e. what will KVM's ABI look like?  E.g. if the vCPU=>VTLs setup is nonsensical,
> does KVM care?
>
> My general preference is for KVM to be as permissive as possible, i.e. let
> userspace do whatever it wants so long as it doesn't place undue burden on KVM.
> But at the same time I don't to end up in a similar boat as many of the paravirt
> features, where things just stop working if userspace or the guest makes a goof.

I'll make sure to formalize this for whenever I post a full series, I
need to go over every hcall and think from that perspective.

There are some rules it might make sense to enforce. But it really
depends on the abstractions we settle on. KVM might not have the
necessary introspection to enforce them. IMO ideally it wouldn't, VTLs
should remain a user-space concept. My approach so far has been trusting
QEMU is doing the right thing.

Some high level examples come to mind:
 - Only one VTL vCPU might run at all times.
 - Privileged VTL interrupts have precedence over lower VTL execution.
 - lAPICs can only access their VTL. (Cross VTL IPIs happen through the
   PV interface).
 - Lower VTL state should be up to date when accessed from privileged
   VTLs (through the GET/SET_VP_REGSITER hcall).

> > >  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
> > >    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
> > >    replicating state that needs to be tracked per-VTL?  Because of how memory
> > >    attributes affect hugepages, duplicating *memslots* might actually be easier
> > >    than teaching memslots to be VTL-aware.
> >
> > I do agree that we need to introduce some level VTL awareness into
> > memslots. There's the hugepages issues you pointed out. But it'll be
> > also necessary once we look at how to implement overlay pages that are
> > per-VTL. (A topic I didn't mention in the series as I though I had
> > managed to solve memory protections while avoiding the need for multiple
> > slots). What I have in mind is introducing a memory slot address space
> > per-VTL, similar to how we do things with SMM.
>
> Noooooooo (I hate memslot address spaces :-) )
>
> Why not represent each VTL with a separate "struct kvm" instance?  That would
> naturally provide per-VTL behavior for:
>
>   - APIC groups
>   - memslot overlays
>   - memory attributes (and their impact on hugepages)
>   - MMU pages

Very interesting idea! I'll spend some time researching it, it sure
solves a lot of issues.

> The only (obvious) issue with that approach would be cross-VTL operations.  IIUC,
> sending IPIs across VTLs isn't allowed, but even if it were, that should be easy
> enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

Correct. Only through kvm_hv_send_ipi(), but from experience it happens
very rarely, so performance shouldn't be critical.

> GVA=>GPA translation would be trickier, but that patch suggests you want to handle
> that in userspace anyways.  And if translation is a rare/slow path, maybe it could
> simply be punted to userspace?
>
>   NOTE: The hypercall implementation is incomplete and only shared for
>   completion. Additionally we'd like to move the VTL aware parts to
>   user-space.
>
> Ewww, and looking at what it would take to support cross-VM translations shows
> another problem with using vCPUs to model VTLs.  Nothing prevents userspace from
> running a virtual CPU at multiple VTLs concurrently, which means that anything
> that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be
> modified while kvm_hv_xlate_va_walk() is running.
>
> I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something
> holds the other kvm_vcpu/VTL's mutex.  Though ideally, KVM would punt all cross-VTL
> operations to userspace.  :-)
>
> If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't
> make the locking and concurrency problems meaningfully easier or harder to solve.
> E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single
> virtual machine, to belong to the same process.  That'd avoid headaches with
> mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a
> different kvm would need special handling?

I'll look into it.

> Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken, the
> in_vtl field is consumed before send_ipi is read from userspace.

ugh, that's a tired last minute "cleanup" that went south... It's been
working as intended for a while otherwise. I'll implement a
kvm-unit-test to redeem myself. :)

Nicolas