Message ID: 20231108111806.92604-1-nsaenz@amazon.com
From: Nicolas Saenz Julienne <nsaenz@amazon.com>
To: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com, vkuznets@redhat.com, anelkz@amazon.com, graf@amazon.com, dwmw@amazon.co.uk, jgowans@amazon.com, corbert@lwn.net, kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com, x86@kernel.org, linux-doc@vger.kernel.org
Subject: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
Date: Wed, 8 Nov 2023 11:17:33 +0000
Series: KVM: x86: hyperv: Introduce VSM support
Message
Nicolas Saenz Julienne
Nov. 8, 2023, 11:17 a.m. UTC
Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature that leverages the hypervisor to create secure execution environments within a guest. VSM is documented as part of Microsoft's Hypervisor Top Level Functional Specification [1]. Security features that build upon VSM, like Windows Credential Guard, are enabled by default on Windows 11, and are becoming a prerequisite in some industries.

This RFC series introduces the necessary infrastructure to emulate VSM-enabled guests. It is a snapshot of the progress we have made so far, and its main goal is to gather design feedback, specifically on the KVM APIs we introduce. For a high-level design overview, see the documentation in patch 33.

Additionally, this topic will be discussed as part of the KVM Micro-conference at this year's Linux Plumbers Conference [2].

The series is accompanied by two repositories:
 - A PoC QEMU implementation of VSM [3].
 - VSM kvm-unit-tests [4].

Note that this isn't a full VSM implementation. For now it only supports 2 VTLs, and only runs on uniprocessor guests. It is capable of booting Windows Server 2016/2019, but is unstable during runtime.

The series is based on the v6.6 kernel release, and depends on the introduction of KVM memory attributes, which is being worked on independently in "KVM: guest_memfd() and per-page attributes" [5]. A full Linux tree is also made available [6].

Series rundown:
 - Patch 2 introduces the concept of APIC ID groups.
 - Patches 3-12 introduce the VSM capability and basic VTL awareness into Hyper-V emulation.
 - Patch 13 introduces vCPU polling support.
 - Patches 14-31 use KVM's memory attributes to implement VTL memory protections. They introduce the VTL KVM device and secure memory intercepts.
 - Patch 32 is a temporary implementation of HVCALL_TRANSLATE_VIRTUAL_ADDRESS, necessary to boot Windows Server 2019.
 - Patch 33 introduces documentation.

Our intention is to integrate the feedback gathered in the RFC and at LPC while we finish the VSM implementation. In the future, we will split the series into distinct feature patch sets and upstream these independently.

Thanks,
Nicolas

[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
[2] https://lpc.events/event/17/sessions/166/#20231114
[3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
[4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
[5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/
[6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1. There are also two small dependencies with https://marc.info/?l=kvm&m=167887543028109&w=2 and https://lkml.org/lkml/2023/10/17/972
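As a rough userspace illustration of the VTL KVM device mentioned in the series rundown: the generic KVM device API (KVM_CREATE_DEVICE, struct kvm_create_device) already exists, so instantiating one device per VTL could look like the sketch below. The KVM_DEV_TYPE_HV_VTL constant is a placeholder invented for this example, not the ABI the series actually defines.

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder device type: not real UAPI, chosen only for this sketch. */
#define KVM_DEV_TYPE_HV_VTL	0xdead

static int create_vtl_device(int vm_fd)
{
	struct kvm_create_device cd = {
		.type = KVM_DEV_TYPE_HV_VTL,
		.flags = 0,
	};

	/* KVM fills cd.fd with the new device fd on success. */
	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0) {
		perror("KVM_CREATE_DEVICE");
		return -1;
	}
	return cd.fd;
}

The returned fd would then be the natural place to hang VTL-scoped attributes and state.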
Comments
Hey Nicolas,

On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature that leverages the hypervisor to create secure execution environments within a guest. VSM is documented as part of Microsoft's Hypervisor Top Level Functional Specification [1]. Security features that build upon VSM, like Windows Credential Guard, are enabled by default on Windows 11, and are becoming a prerequisite in some industries.
>
> This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback, specifically on the KVM APIs we introduce. For a high level design overview, see the documentation in patch 33.
>
> Additionally, this topic will be discussed as part of the KVM Micro-conference, in this year's Linux Plumbers Conference [2].

Awesome, looking forward to the session! :)

> The series is accompanied by two repositories:
>  - A PoC QEMU implementation of VSM [3].
>  - VSM kvm-unit-tests [4].
>
> Note that this isn't a full VSM implementation. For now it only supports 2 VTLs, and only runs on uniprocessor guests. It is capable of booting Windows Server 2016/2019, but is unstable during runtime.

How much of these limitations are inherent in the current set of patches? What is missing to go beyond 2 VTLs and into SMP land? Anything that will require API changes?


Alex


Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed Nov 8, 2023 at 11:40 AM UTC, Alexander Graf wrote:
> Hey Nicolas,

[...]

> > The series is accompanied by two repositories:
> >  - A PoC QEMU implementation of VSM [3].
> >  - VSM kvm-unit-tests [4].
> >
> > Note that this isn't a full VSM implementation. For now it only supports 2 VTLs, and only runs on uniprocessor guests. It is capable of booting Windows Server 2016/2019, but is unstable during runtime.
>
> How much of these limitations are inherent in the current set of patches? What is missing to go beyond 2 VTLs and into SMP land? Anything that will require API changes?

The main KVM concepts introduced by this series are ready to deal with any number of VTLs (APIC ID groups, VTL KVM device). KVM_HV_GET_VSM_STATE should provide a copy of 'vsm_code_page_offsets' per-VTL, since the hypercall page is partition-wide but 'vsm_code_page_offsets' is per-VTL. Attaching that information as a VTL KVM device attribute fits that requirement nicely; I'd prefer going that way, especially if the VTL KVM device has a decent reception. Also, the secure memory intercepts and HVCALL_TRANSLATE_VIRTUAL_ADDRESS take some VTL-related shortcuts, but those are going away. Otherwise, I don't see any necessary in-kernel changes.

When virtualizing Windows with VSM I've never seen usages that go beyond VTL1, so enabling VTL > 1 will be mostly a kvm-unit-tests effort.

As for SMP, it's just a matter of work. Notably, HvStartVirtualProcessor and HvGetVpIndexFromApicId need to be implemented, and we need to make sure the QEMU VTL scheduling code holds up.

Nicolas
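To illustrate the device-attribute idea above, here is a minimal sketch using the existing KVM_GET_DEVICE_ATTR ioctl and struct kvm_device_attr; the group/attribute names and the assumption that 'vsm_code_page_offsets' fits in a single u64 are made up for the example and are not the series' ABI.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical attribute names, for illustration only. */
#define KVM_DEV_HV_VTL_GROUP			0
#define KVM_DEV_HV_VTL_CODE_PAGE_OFFSETS	0

static int get_vsm_code_page_offsets(int vtl_dev_fd, uint64_t *offsets)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_HV_VTL_GROUP,
		.attr  = KVM_DEV_HV_VTL_CODE_PAGE_OFFSETS,
		.addr  = (uint64_t)(uintptr_t)offsets,	/* kernel copies the value here */
	};

	return ioctl(vtl_dev_fd, KVM_GET_DEVICE_ATTR, &attr);
}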
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback.

Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way.

The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc.

Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary.

> Specifically on the KVM APIs we introduce. For a high level design overview, see the documentation in patch 33.
>
> Additionally, this topic will be discussed as part of the KVM Micro-conference, in this year's Linux Plumbers Conference [2].
>
> The series is accompanied by two repositories:
>  - A PoC QEMU implementation of VSM [3].
>  - VSM kvm-unit-tests [4].
>
> Note that this isn't a full VSM implementation. For now it only supports 2 VTLs, and only runs on uniprocessor guests. It is capable of booting Windows Server 2016/2019, but is unstable during runtime.
>
> The series is based on the v6.6 kernel release, and depends on the introduction of KVM memory attributes, which is being worked on independently in "KVM: guest_memfd() and per-page attributes" [5].

This doesn't actually apply on 6.6 with v14 of guest_memfd, because v14 of guest_memfd is based on kvm-6.7-1. Ah, and looking at your github repo, this isn't based on v14 at all, it's based on v12.

That's totally fine, but the cover letter needs to explicitly, clearly, and *accurately* state the dependencies. I can obviously grab the full branch from github, but that's not foolproof, e.g. if you accidentally delete or force push to that branch. And I also prefer to know that what I'm replying to on list is the exact same code that I am looking at.

> A full Linux tree is also made available [6].
>
> Series rundown:
>  - Patch 2 introduces the concept of APIC ID groups.
>  - Patches 3-12 introduce the VSM capability and basic VTL awareness into Hyper-V emulation.
>  - Patch 13 introduces vCPU polling support.
>  - Patches 14-31 use KVM's memory attributes to implement VTL memory protections. Introduces the VTL KVM device and secure memory intercepts.
>  - Patch 32 is a temporary implementation of HVCALL_TRANSLATE_VIRTUAL_ADDRESS necessary to boot Windows 2019.
>  - Patch 33 introduces documentation.
>
> Our intention is to integrate feedback gathered in the RFC and LPC while we finish the VSM implementation. In the future, we will split the series into distinct feature patch sets and upstream these independently.
>
> Thanks,
> Nicolas
>
> [1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
> [2] https://lpc.events/event/17/sessions/166/#20231114
> [3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
> [4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
> [5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
> [6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1.

When providing github links, my preference is to format the pointers as:

  <repo> <branch>

or

  <repo> tags/<tag>

e.g.

  https://github.com/vianpl/linux vsm-rfc-v1

so that readers can copy+paste the full thing directly into `git fetch`. It's a minor thing, but AFAIK no one actually does review by clicking through github's webview.

> There are also two small dependencies with https://marc.info/?l=kvm&m=167887543028109&w=2 and https://lkml.org/lkml/2023/10/17/972

Please use lore links, there's zero reason to use anything else these days. For those of us that use b4, lore links make life much easier.
On Wed, Nov 08, 2023, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback.
>
> Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way.
>
> The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc.
>
> Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary.

After speed reading the series... An overview of the design, why you made certain choices, and the tradeoffs between various options is definitely needed.

A few questions off the top of my head:

 - What is the split between userspace and KVM? How did you arrive at that split?

 - How much *needs* to be in KVM? I.e. how much can be pushed to userspace while maintaining good performance?

 - Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.

 - Is "struct kvm_vcpu" the best representation of an execution context (if I'm getting the terminology right)? E.g. if 90% of the state is guaranteed to be identical for a given vCPU across execution contexts, then modeling that with separate kvm_vcpu structures is very inefficient. I highly doubt it's 90%, but it might be quite high depending on how much the TLFS restricts the state of the vCPU, e.g. if it's 64-bit only.

The more info you can provide before LPC, the better, e.g. so that we can spend time discussing options instead of you getting peppered with questions about the requirements and whatnot.
Hi Sean,

Thanks for taking the time to review the series. I took note of your comments across the series, and will incorporate them into the LPC discussion.

On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Sean Christopherson wrote:
> > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > > This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback.
> >
> > Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way.
> >
> > The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc.
> >
> > Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary.
>
> After speed reading the series... An overview of the design, why you made certain choices, and the tradeoffs between various options is definitely needed.
>
> A few questions off the top of my head:
>
>  - What is the split between userspace and KVM? How did you arrive at that split?

Our original design, which we discussed at the KVM Forum 2023 [1] and is public [2], implemented most of VSM in-kernel. Notably, we introduced VTL awareness into struct kvm_vcpu. This turned out to be way too complex: vCPUs now have multiple CPU architectural states, events, APICs, MMUs, etc. First of all, the code turned out to be very intrusive; for example, most APIC APIs had to be reworked one way or another to accommodate the fact that there are multiple APICs available. Also, we were forced to introduce VSM-specific semantics into the x86 emulation code. But more importantly, the biggest pain has been dealing with state changes: they may be triggered remotely through requests, and some are already fairly delicate as-is. They involve a multitude of corner cases that almost never apply for a VTL-aware kvm_vcpu, especially if you factor in features like live migration. It's been a continuous source of regressions.

Memory protections were implemented by using memory slot modifications. We introduced a downstream API that allows updating memory slots concurrently with vCPUs running. I think there was a similar proposal upstream from Red Hat some time ago. The result is complex, hard to generalize, and slow.

So we decided to move all this complexity outside of struct kvm_vcpu and, as much as possible, out of the kernel. We figured out the basic kernel building blocks that are absolutely necessary, and let user-space deal with the rest.

>  - How much *needs* to be in KVM? I.e. how much can be pushed to userspace while maintaining good performance?

As I said above, the aim of the current design is to keep it as light as possible. The biggest move we made was moving VTL switching into user-space. We don't see any indication that performance is affected in a major way, but we will know for sure once we finish the implementation and test it under real use-cases.

>  - Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.

I do agree that we need to introduce some level of VTL awareness into memslots. There's the hugepages issue you pointed out, but it'll also be necessary once we look at how to implement overlay pages that are per-VTL. (A topic I didn't mention in the series as I thought I had managed to solve memory protections while avoiding the need for multiple slots.) What I have in mind is introducing a memory slot address space per-VTL, similar to how we do things with SMM.

It's important to note that the requirements for overlay pages and memory protections are very different. Overlay pages are scarce, and are set up once and never change (AFAICT), so we think stopping all vCPUs, updating slots, and resuming execution will provide good enough performance. Memory protections happen very frequently, generally with page granularity, and may be short-lived.

>  - Is "struct kvm_vcpu" the best representation of an execution context (if I'm getting the terminology right)?

Let's forget I ever mentioned execution contexts. I used it in the hopes of making the VTL concept a little more understandable for non-VSM aware people. It's meant to be interchangeable with VTL, but I see how it creates confusion.

> E.g. if 90% of the state is guaranteed to be identical for a given vCPU across execution contexts, then modeling that with separate kvm_vcpu structures is very inefficient. I highly doubt it's 90%, but it might be quite high depending on how much the TLFS restricts the state of the vCPU, e.g. if it's 64-bit only.

For the record, here's the private VTL state (TLFS 15.11.1):

"In general, each VTL has its own control registers, RIP register, RSP register, and MSRs:

  SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK, EFER, PAT, KERNEL_GSBASE, FS.BASE, GS.BASE, TSC_AUX
  HV_X64_MSR_HYPERCALL
  HV_X64_MSR_GUEST_OS_ID
  HV_X64_MSR_REFERENCE_TSC
  HV_X64_MSR_APIC_FREQUENCY
  HV_X64_MSR_EOI
  HV_X64_MSR_ICR
  HV_X64_MSR_TPR
  HV_X64_MSR_APIC_ASSIST_PAGE
  HV_X64_MSR_NPIEP_CONFIG
  HV_X64_MSR_SIRBP
  HV_X64_MSR_SCONTROL
  HV_X64_MSR_SVERSION
  HV_X64_MSR_SIEFP
  HV_X64_MSR_SIMP
  HV_X64_MSR_EOM
  HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15
  HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT
  Local APIC registers (including CR8/TPR)"

The rest is shared. Note that we've observed that during normal operation, VTL switches don't happen that often. The boot process is the most affected by any performance impact VSM might introduce, as it issues 100,000s of them (mostly for memory protections).

Nicolas

[1] https://kvm-forum.qemu.org/2023/talk/TK7YGD/
[2] Partial rebase of our original implementation: https://github.com/vianpl/linux vsm
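For context on the per-VTL memslot address space idea above: KVM's existing multi-address-space support (used today for SMM) encodes the address space id in the upper 16 bits of the slot number passed to KVM_SET_USER_MEMORY_REGION. A sketch of what registering a per-VTL view of guest memory could look like is shown below, assuming "address space N == VTL N", which is an illustration and not something the series defines.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_vtl_memslot(int vm_fd, uint16_t vtl, uint16_t slot,
			   uint64_t gpa, uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		/* Address space id (here: the VTL) lives in bits 16-31 of slot. */
		.slot = ((uint32_t)vtl << 16) | slot,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (uint64_t)(uintptr_t)hva,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}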
On Wed Nov 8, 2023 at 4:55 PM UTC, Sean Christopherson wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback.
>
> Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way.
>
> The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc.

Noted, I'll provide a design overview on the next submission.

> Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary.

It's clear to me how the current implementation of VTL devices is broken. But unnecessary? That made me think: we could inject the VTL in the memory attribute key, for example with 'gfn | vtl << 58', and then use the generic API and a single xarray.

Nicolas
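A minimal sketch of that key encoding, using the 'gfn | vtl << 58' layout floated above; the helper names are illustrative and not existing KVM code.

#include <linux/bits.h>
#include <linux/types.h>

#define KVM_VTL_KEY_SHIFT	58

/* Fold the VTL into the top bits of the gfn-indexed attribute key. */
static inline u64 kvm_vtl_attr_key(u64 gfn, u8 vtl)
{
	return gfn | ((u64)vtl << KVM_VTL_KEY_SHIFT);
}

static inline u64 kvm_vtl_attr_gfn(u64 key)
{
	return key & (BIT_ULL(KVM_VTL_KEY_SHIFT) - 1);
}

static inline u8 kvm_vtl_attr_vtl(u64 key)
{
	return key >> KVM_VTL_KEY_SHIFT;
}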
On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> > - What is the split between userspace and KVM? How did you arrive at that split?
>
> Our original design, which we discussed at the KVM Forum 2023 [1] and is public [2], implemented most of VSM in-kernel. Notably, we introduced VTL awareness into struct kvm_vcpu.

...

> So we decided to move all this complexity outside of struct kvm_vcpu and, as much as possible, out of the kernel. We figured out the basic kernel building blocks that are absolutely necessary, and let user-space deal with the rest.

Sorry, I should have been more clear. What's the split in terms of responsibilities, i.e. what will KVM's ABI look like? E.g. if the vCPU=>VTLs setup is nonsensical, does KVM care?

My general preference is for KVM to be as permissive as possible, i.e. let userspace do whatever it wants so long as it doesn't place undue burden on KVM. But at the same time I don't want to end up in a similar boat as many of the paravirt features, where things just stop working if userspace or the guest makes a goof.

> > - Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.
>
> I do agree that we need to introduce some level of VTL awareness into memslots. There's the hugepages issue you pointed out, but it'll also be necessary once we look at how to implement overlay pages that are per-VTL. (A topic I didn't mention in the series as I thought I had managed to solve memory protections while avoiding the need for multiple slots.) What I have in mind is introducing a memory slot address space per-VTL, similar to how we do things with SMM.

Noooooooo (I hate memslot address spaces :-) )

Why not represent each VTL with a separate "struct kvm" instance? That would naturally provide per-VTL behavior for:

 - APIC groups
 - memslot overlays
 - memory attributes (and their impact on hugepages)
 - MMU pages

The only (obvious) issue with that approach would be cross-VTL operations. IIUC, sending IPIs across VTLs isn't allowed, but even if it were, that should be easy enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

GVA=>GPA translation would be trickier, but that patch suggests you want to handle that in userspace anyways. And if translation is a rare/slow path, maybe it could simply be punted to userspace?

  NOTE: The hypercall implementation is incomplete and only shared for
  completion. Additionally we'd like to move the VTL aware parts to
  user-space.

Ewww, and looking at what it would take to support cross-VM translations shows another problem with using vCPUs to model VTLs. Nothing prevents userspace from running a virtual CPU at multiple VTLs concurrently, which means that anything that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be modified while kvm_hv_xlate_va_walk() is running.

I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something holds the other kvm_vcpu/VTL's mutex. Though ideally, KVM would punt all cross-VTL operations to userspace. :-)

If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't make the locking and concurrency problems meaningfully easier or harder to solve. E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single virtual machine, to belong to the same process. That'd avoid headaches with mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a different kvm would need special handling?

Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken: the in_vtl field is consumed before send_ipi is read from userspace.

	union hv_input_vtl *in_vtl;
	u64 valid_bank_mask;
	u32 vector;
	bool all_cpus;
	u8 vtl;

	/* VTL is at the same offset on both IPI types */
	in_vtl = &send_ipi.in_vtl;
	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl :
	      kvm_hv_get_active_vtl(vcpu);

> > E.g. if 90% of the state is guaranteed to be identical for a given vCPU across execution contexts, then modeling that with separate kvm_vcpu structures is very inefficient. I highly doubt it's 90%, but it might be quite high depending on how much the TLFS restricts the state of the vCPU, e.g. if it's 64-bit only.
>
> For the record, here's the private VTL state (TLFS 15.11.1):
>
> "In general, each VTL has its own control registers, RIP register, RSP register, and MSRs:
>
>   SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK, EFER, PAT, KERNEL_GSBASE, FS.BASE, GS.BASE, TSC_AUX
>   HV_X64_MSR_HYPERCALL
>   HV_X64_MSR_GUEST_OS_ID
>   HV_X64_MSR_REFERENCE_TSC
>   HV_X64_MSR_APIC_FREQUENCY
>   HV_X64_MSR_EOI
>   HV_X64_MSR_ICR
>   HV_X64_MSR_TPR
>   HV_X64_MSR_APIC_ASSIST_PAGE
>   HV_X64_MSR_NPIEP_CONFIG
>   HV_X64_MSR_SIRBP
>   HV_X64_MSR_SCONTROL
>   HV_X64_MSR_SVERSION
>   HV_X64_MSR_SIEFP
>   HV_X64_MSR_SIMP
>   HV_X64_MSR_EOM
>   HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15
>   HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT
>   Local APIC registers (including CR8/TPR)"

Ugh, the APIC state is quite the killer. And I gotta imagine things like CET and FRED are only going to increase that list.
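To make the "separate struct kvm per VTL" proposal above more concrete, here is a minimal userspace sketch using only existing KVM UAPI (KVM_CREATE_VM, KVM_CREATE_VCPU): each VTL gets its own VM fd, and the VMM pairs up the vCPU fds that back the same virtual processor. The pairing structure and scheduling policy are assumptions left to the VMM, not part of any proposed ABI.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

struct vtl_vcpu_pair {
	int vtl0_vcpu_fd;
	int vtl1_vcpu_fd;	/* only one of the two runs at any given time */
};

static int create_vtl_vm(int kvm_fd)
{
	/* Each VTL gets its own VM fd, hence its own memslots, MMU, etc. */
	return ioctl(kvm_fd, KVM_CREATE_VM, 0);
}

static void create_vp(int vtl0_vm_fd, int vtl1_vm_fd, int vp_index,
		      struct vtl_vcpu_pair *pair)
{
	/* The same VP index is used in both VMs to model one virtual processor. */
	pair->vtl0_vcpu_fd = ioctl(vtl0_vm_fd, KVM_CREATE_VCPU, vp_index);
	pair->vtl1_vcpu_fd = ioctl(vtl1_vm_fd, KVM_CREATE_VCPU, vp_index);
}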
On Fri Nov 10, 2023 at 7:32 PM UTC, Sean Christopherson wrote:
> On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> > On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> > > - What is the split between userspace and KVM? How did you arrive at that split?
> >
> > Our original design, which we discussed at the KVM Forum 2023 [1] and is public [2], implemented most of VSM in-kernel. Notably, we introduced VTL awareness into struct kvm_vcpu.
>
> ...
>
> > So we decided to move all this complexity outside of struct kvm_vcpu and, as much as possible, out of the kernel. We figured out the basic kernel building blocks that are absolutely necessary, and let user-space deal with the rest.
>
> Sorry, I should have been more clear. What's the split in terms of responsibilities, i.e. what will KVM's ABI look like? E.g. if the vCPU=>VTLs setup is nonsensical, does KVM care?
>
> My general preference is for KVM to be as permissive as possible, i.e. let userspace do whatever it wants so long as it doesn't place undue burden on KVM. But at the same time I don't want to end up in a similar boat as many of the paravirt features, where things just stop working if userspace or the guest makes a goof.

I'll make sure to formalize this for whenever I post a full series; I need to go over every hcall and think from that perspective. There are some rules it might make sense to enforce, but it really depends on the abstractions we settle on. KVM might not have the necessary introspection to enforce them, and IMO ideally it wouldn't: VTLs should remain a user-space concept. My approach so far has been trusting that QEMU is doing the right thing.

Some high level examples come to mind:
 - Only one VTL vCPU may run at any given time.
 - Privileged VTL interrupts have precedence over lower VTL execution.
 - lAPICs can only access their VTL. (Cross-VTL IPIs happen through the PV interface.)
 - Lower VTL state should be up to date when accessed from privileged VTLs (through the GET/SET_VP_REGISTER hcall).

> > > - Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.
> >
> > I do agree that we need to introduce some level of VTL awareness into memslots. There's the hugepages issue you pointed out, but it'll also be necessary once we look at how to implement overlay pages that are per-VTL. (A topic I didn't mention in the series as I thought I had managed to solve memory protections while avoiding the need for multiple slots.) What I have in mind is introducing a memory slot address space per-VTL, similar to how we do things with SMM.
>
> Noooooooo (I hate memslot address spaces :-) )
>
> Why not represent each VTL with a separate "struct kvm" instance? That would naturally provide per-VTL behavior for:
>
>  - APIC groups
>  - memslot overlays
>  - memory attributes (and their impact on hugepages)
>  - MMU pages

Very interesting idea! I'll spend some time researching it, it sure solves a lot of issues.

> The only (obvious) issue with that approach would be cross-VTL operations. IIUC, sending IPIs across VTLs isn't allowed, but even if it were, that should be easy enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

Correct. They only happen through kvm_hv_send_ipi(), but from experience it happens very rarely, so performance shouldn't be critical.

> GVA=>GPA translation would be trickier, but that patch suggests you want to handle that in userspace anyways. And if translation is a rare/slow path, maybe it could simply be punted to userspace?
>
>   NOTE: The hypercall implementation is incomplete and only shared for
>   completion. Additionally we'd like to move the VTL aware parts to
>   user-space.
>
> Ewww, and looking at what it would take to support cross-VM translations shows another problem with using vCPUs to model VTLs. Nothing prevents userspace from running a virtual CPU at multiple VTLs concurrently, which means that anything that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be modified while kvm_hv_xlate_va_walk() is running.
>
> I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something holds the other kvm_vcpu/VTL's mutex. Though ideally, KVM would punt all cross-VTL operations to userspace. :-)
>
> If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't make the locking and concurrency problems meaningfully easier or harder to solve. E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single virtual machine, to belong to the same process. That'd avoid headaches with mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a different kvm would need special handling?

I'll look into it.

> Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken: the in_vtl field is consumed before send_ipi is read from userspace.

Ugh, that's a tired last-minute "cleanup" that went south... It's been working as intended for a while otherwise. I'll implement a kvm-unit-test to redeem myself. :)

Nicolas