Message ID | 20221025151344.3784230-6-chao.p.peng@linux.intel.com |
---|---|
State | New |
Headers |
From: Chao Peng <chao.p.peng@linux.intel.com>
Subject: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
Date: Tue, 25 Oct 2022 23:13:41 +0800
In-Reply-To: <20221025151344.3784230-1-chao.p.peng@linux.intel.com> |
Series | KVM: mm: fd-based approach for supporting KVM |
Commit Message
Chao Peng
Oct. 25, 2022, 3:13 p.m. UTC
Introduce generic private memory register/unregister by reusing the existing
SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. This differs from the SEV case
by treating the address in the region as a gpa instead of an hva. Which path
these ioctls take is determined by kvm_arch_has_private_mem(); architectures
that support KVM_PRIVATE_MEM should override this function.
KVM internally defaults all guest memory to private and maintains the
shared memory in 'mem_attr_array'. The above ioctls operate on this
field and unmap existing mappings, if any.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
Documentation/virt/kvm/api.rst | 17 ++-
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 10 +-
virt/kvm/Kconfig | 4 +
virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++--------
5 files changed, 198 insertions(+), 61 deletions(-)
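As a concrete illustration of the uAPI described above, here is a minimal userspace sketch of how a VMM might flip a guest-physical range between private and shared with the reused ioctls. Only struct kvm_enc_region and the KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls come from the existing uAPI; the helper name and any fd/GPA values are illustrative.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /*
     * Hypothetical helper: with KVM_GENERIC_PRIVATE_MEM the addr/size pair is
     * interpreted as a guest-physical range (not an HVA range as for SEV).
     * REG_REGION converts the range back to the default private state,
     * UNREG_REGION marks it shared.
     */
    static int kvm_set_gpa_private(int vm_fd, uint64_t gpa, uint64_t size,
                                   int make_private)
    {
            struct kvm_enc_region region = {
                    .addr = gpa,
                    .size = size,
            };

            return ioctl(vm_fd, make_private ? KVM_MEMORY_ENCRYPT_REG_REGION
                                             : KVM_MEMORY_ENCRYPT_UNREG_REGION,
                         &region);
    }

A guest-requested conversion then becomes a single call, e.g. kvm_set_gpa_private(vm_fd, 0x100000, 0x1000, 0) to share one page; the ioctl path in this patch zaps any existing KVM mappings that fall into the range.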
Comments
Hi, On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote: > > Introduce generic private memory register/unregister by reusing existing > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case > by treating address in the region as gpa instead of hva. Which cases > should these ioctls go is determined by the kvm_arch_has_private_mem(). > Architecture which supports KVM_PRIVATE_MEM should override this function. > > KVM internally defaults all guest memory as private memory and maintain > the shared memory in 'mem_attr_array'. The above ioctls operate on this > field and unmap existing mappings if any. > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- Reviewed-by: Fuad Tabba <tabba@google.com> Cheers, /fuad > Documentation/virt/kvm/api.rst | 17 ++- > arch/x86/kvm/Kconfig | 1 + > include/linux/kvm_host.h | 10 +- > virt/kvm/Kconfig | 4 + > virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++-------- > 5 files changed, 198 insertions(+), 61 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 975688912b8c..08253cf498d1 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst. > This ioctl can be used to register a guest memory region which may > contain encrypted data (e.g. guest RAM, SMRAM etc). > > -It is used in the SEV-enabled guest. When encryption is enabled, a guest > -memory region may contain encrypted data. The SEV memory encryption > -engine uses a tweak such that two identical plaintext pages, each at > -different locations will have differing ciphertexts. So swapping or > +Currently this ioctl supports registering memory regions for two usages: > +private memory and SEV-encrypted memory. > + > +When private memory is enabled, this ioctl is used to register guest private > +memory region and the addr/size of kvm_enc_region represents guest physical > +address (GPA). In this usage, this ioctl zaps the existing guest memory > +mappings in KVM that fallen into the region. > + > +When SEV-encrypted memory is enabled, this ioctl is used to register guest > +memory region which may contain encrypted data for a SEV-enabled guest. The > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV > +memory encryption engine uses a tweak such that two identical plaintext pages, > +each at different locations will have differing ciphertexts. So swapping or > moving ciphertext of those pages will not result in plaintext being > swapped. So relocating (or migrating) physical backing pages for the SEV > guest will require some additional steps. > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 8d2bd455c0cd..73fdfa429b20 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -51,6 +51,7 @@ config KVM > select HAVE_KVM_PM_NOTIFIER if PM > select HAVE_KVM_RESTRICTED_MEM if X86_64 > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM > + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM > help > Support hosting fully virtualized guest machines using hardware > virtualization extensions. 
You will need a fairly recent > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 79e5cbc35fcf..4ce98fa0153c 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > #endif > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > + > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) > struct kvm_gfn_range { > struct kvm_memory_slot *slot; > gfn_t start; > @@ -254,6 +255,9 @@ struct kvm_gfn_range { > bool may_block; > }; > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > +#endif > + > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > @@ -794,6 +798,9 @@ struct kvm { > struct notifier_block pm_notifier; > #endif > char stats_id[KVM_STATS_NAME_SIZE]; > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + struct xarray mem_attr_array; > +#endif > }; > > #define kvm_err(fmt, ...) \ > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); > int kvm_arch_post_init_vm(struct kvm *kvm); > void kvm_arch_pre_destroy_vm(struct kvm *kvm); > int kvm_arch_create_vm_debugfs(struct kvm *kvm); > +bool kvm_arch_has_private_mem(struct kvm *kvm); > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC > /* > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > index 9ff164c7e0cc..69ca59e82149 100644 > --- a/virt/kvm/Kconfig > +++ b/virt/kvm/Kconfig > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER > > config HAVE_KVM_RESTRICTED_MEM > bool > + > +config KVM_GENERIC_PRIVATE_MEM > + bool > + depends on HAVE_KVM_RESTRICTED_MEM > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 09c9cdeb773c..fc3835826ace 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm) > } > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); > > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > + gfn_t end) > +{ > + if (likely(kvm->mmu_invalidate_in_progress == 1)) { > + kvm->mmu_invalidate_range_start = start; > + kvm->mmu_invalidate_range_end = end; > + } else { > + /* > + * Fully tracking multiple concurrent ranges has diminishing > + * returns. Keep things simple and just find the minimal range > + * which includes the current and new ranges. As there won't be > + * enough information to subtract a range after its invalidate > + * completes, any ranges invalidated concurrently will > + * accumulate and persist until all outstanding invalidates > + * complete. > + */ > + kvm->mmu_invalidate_range_start = > + min(kvm->mmu_invalidate_range_start, start); > + kvm->mmu_invalidate_range_end = > + max(kvm->mmu_invalidate_range_end, end); > + } > +} > + > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + /* > + * The count increase must become visible at unlock time as no > + * spte can be established without taking the mmu_lock and > + * count is also read inside the mmu_lock critical section. 
> + */ > + kvm->mmu_invalidate_in_progress++; > +} > + > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + mark_invalidate_in_progress(kvm, start, end); > + update_invalidate_range(kvm, start, end); > +} > + > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + /* > + * This sequence increase will notify the kvm page fault that > + * the page that is going to be mapped in the spte could have > + * been freed. > + */ > + kvm->mmu_invalidate_seq++; > + smp_wmb(); > + /* > + * The above sequence increase must be visible before the > + * below count decrease, which is ensured by the smp_wmb above > + * in conjunction with the smp_rmb in mmu_invalidate_retry(). > + */ > + kvm->mmu_invalidate_in_progress--; > +} > + > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > { > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); > } > > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > - gfn_t end) > -{ > - if (likely(kvm->mmu_invalidate_in_progress == 1)) { > - kvm->mmu_invalidate_range_start = start; > - kvm->mmu_invalidate_range_end = end; > - } else { > - /* > - * Fully tracking multiple concurrent ranges has diminishing > - * returns. Keep things simple and just find the minimal range > - * which includes the current and new ranges. As there won't be > - * enough information to subtract a range after its invalidate > - * completes, any ranges invalidated concurrently will > - * accumulate and persist until all outstanding invalidates > - * complete. > - */ > - kvm->mmu_invalidate_range_start = > - min(kvm->mmu_invalidate_range_start, start); > - kvm->mmu_invalidate_range_end = > - max(kvm->mmu_invalidate_range_end, end); > - } > -} > - > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > -{ > - /* > - * The count increase must become visible at unlock time as no > - * spte can be established without taking the mmu_lock and > - * count is also read inside the mmu_lock critical section. > - */ > - kvm->mmu_invalidate_in_progress++; > -} > - > static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > { > update_invalidate_range(kvm, range->start, range->end); > return kvm_unmap_gfn_range(kvm, range); > } > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > -{ > - mark_invalidate_in_progress(kvm, start, end); > - update_invalidate_range(kvm, start, end); > -} > - > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > const struct mmu_notifier_range *range) > { > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > return 0; > } > > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > -{ > - /* > - * This sequence increase will notify the kvm page fault that > - * the page that is going to be mapped in the spte could have > - * been freed. > - */ > - kvm->mmu_invalidate_seq++; > - smp_wmb(); > - /* > - * The above sequence increase must be visible before the > - * below count decrease, which is ensured by the smp_wmb above > - * in conjunction with the smp_rmb in mmu_invalidate_retry(). 
> - */ > - kvm->mmu_invalidate_in_progress--; > -} > - > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > const struct mmu_notifier_range *range) > { > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) > > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + struct kvm_gfn_range gfn_range; > + struct kvm_memory_slot *slot; > + struct kvm_memslots *slots; > + struct kvm_memslot_iter iter; > + int i; > + int r = 0; > + > + gfn_range.pte = __pte(0); > + gfn_range.may_block = true; > + > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + slots = __kvm_memslots(kvm, i); > + > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) { > + slot = iter.slot; > + gfn_range.start = max(start, slot->base_gfn); > + gfn_range.end = min(end, slot->base_gfn + slot->npages); > + if (gfn_range.start >= gfn_range.end) > + continue; > + gfn_range.slot = slot; > + > + r |= kvm_unmap_gfn_range(kvm, &gfn_range); > + } > + } > + > + if (r) > + kvm_flush_remote_tlbs(kvm); > +} > + > +#define KVM_MEM_ATTR_SHARED 0x0001 > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > + bool is_private) > +{ > + gfn_t start, end; > + unsigned long i; > + void *entry; > + int idx; > + int r = 0; > + > + if (size == 0 || gpa + size < gpa) > + return -EINVAL; > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) > + return -EINVAL; > + > + start = gpa >> PAGE_SHIFT; > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; > + > + /* > + * Guest memory defaults to private, kvm->mem_attr_array only stores > + * shared memory. > + */ > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); > + > + idx = srcu_read_lock(&kvm->srcu); > + KVM_MMU_LOCK(kvm); > + kvm_mmu_invalidate_begin(kvm, start, end); > + > + for (i = start; i < end; i++) { > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, > + GFP_KERNEL_ACCOUNT)); > + if (r) > + goto err; > + } > + > + kvm_unmap_mem_range(kvm, start, end); > + > + goto ret; > +err: > + for (; i > start; i--) > + xa_erase(&kvm->mem_attr_array, i); > +ret: > + kvm_mmu_invalidate_end(kvm, start, end); > + KVM_MMU_UNLOCK(kvm); > + srcu_read_unlock(&kvm->srcu, idx); > + > + return r; > +} > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ > + > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER > static int kvm_pm_notifier_call(struct notifier_block *bl, > unsigned long state, > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > spin_lock_init(&kvm->mn_invalidate_lock); > rcuwait_init(&kvm->mn_memslots_update_rcuwait); > xa_init(&kvm->vcpu_array); > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + xa_init(&kvm->mem_attr_array); > +#endif > > INIT_LIST_HEAD(&kvm->gpc_list); > spin_lock_init(&kvm->gpc_lock); > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm) > kvm_free_memslots(kvm, &kvm->__memslots[i][0]); > kvm_free_memslots(kvm, &kvm->__memslots[i][1]); > } > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + xa_destroy(&kvm->mem_attr_array); > +#endif > cleanup_srcu_struct(&kvm->irq_srcu); > cleanup_srcu_struct(&kvm->srcu); > kvm_arch_free_vm(kvm); > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm, > } > } > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm) > +{ > + return false; > +} > + > static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > { > u32 valid_flags = 
KVM_MEM_LOG_DIRTY_PAGES; > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > break; > } > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + case KVM_MEMORY_ENCRYPT_REG_REGION: > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > + struct kvm_enc_region region; > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > + > + if (!kvm_arch_has_private_mem(kvm)) > + goto arch_vm_ioctl; > + > + r = -EFAULT; > + if (copy_from_user(®ion, argp, sizeof(region))) > + goto out; > + > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > + region.size, set); > + break; > + } > +#endif > case KVM_GET_DIRTY_LOG: { > struct kvm_dirty_log log; > > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp, > r = kvm_vm_ioctl_get_stats_fd(kvm); > break; > default: > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > +arch_vm_ioctl: > +#endif > r = kvm_arch_vm_ioctl(filp, ioctl, arg); > } > out: > -- > 2.25.1 >
On Tue, Oct 25, 2022, Chao Peng wrote: > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > break; > } > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + case KVM_MEMORY_ENCRYPT_REG_REGION: > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside from the fact that restricted/protected memory may not be encrypted, there are other potential use cases for per-page memory attributes[*], e.g. to make memory read-only (or no-exec, or exec-only, etc...) without having to modify memslots. Any paravirt use case where the attributes of a page are effectively dictated by the guest is going to run into the exact same performance problems with memslots, which isn't suprising in hindsight since shared vs. private is really just an attribute, albeit with extra special semantics. And if we go with a brand new ioctl(), maybe someday in the very distant future we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big of a wrench into things. Something like: KVM_SET_MEMORY_ATTRIBUTES struct kvm_memory_attributes { __u64 address; __u64 size; __u64 flags; } [*] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com > + struct kvm_enc_region region; > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > + > + if (!kvm_arch_has_private_mem(kvm)) > + goto arch_vm_ioctl; > + > + r = -EFAULT; > + if (copy_from_user(®ion, argp, sizeof(region))) > + goto out; > + > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > + region.size, set); > + break; > + } > +#endif
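For readers following along, a userspace-side sketch of what the proposed KVM_SET_MEMORY_ATTRIBUTES call could look like. The structure mirrors Sean's proposal quoted above; the _EXAMPLE suffixes and the ioctl number are invented here purely for illustration and are not part of any existing uAPI.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Proposal from this thread, not an existing uAPI; the number is made up. */
    struct kvm_memory_attributes_example {
            uint64_t address;
            uint64_t size;
            uint64_t flags;
    };

    #define KVM_SET_MEMORY_ATTRIBUTES_EXAMPLE \
            _IOW(KVMIO, 0xd2, struct kvm_memory_attributes_example)

    static int set_memory_attributes(int vm_fd, uint64_t gpa, uint64_t size,
                                     uint64_t flags)
    {
            struct kvm_memory_attributes_example attrs = {
                    .address = gpa,
                    .size    = size,
                    .flags   = flags,
            };

            return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES_EXAMPLE, &attrs);
    }

Compared with reusing KVM_MEMORY_ENCRYPT_(UN)REG_REGION, the only VMM-side change is the request number and the struct layout, which supports the point that switching to a new ioctl() is a minor change.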
On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote: > On Tue, Oct 25, 2022, Chao Peng wrote: > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > break; > > } > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > + case KVM_MEMORY_ENCRYPT_REG_REGION: > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside > from the fact that restricted/protected memory may not be encrypted, there are > other potential use cases for per-page memory attributes[*], e.g. to make memory > read-only (or no-exec, or exec-only, etc...) without having to modify memslots. > > Any paravirt use case where the attributes of a page are effectively dictated by > the guest is going to run into the exact same performance problems with memslots, > which isn't suprising in hindsight since shared vs. private is really just an > attribute, albeit with extra special semantics. > > And if we go with a brand new ioctl(), maybe someday in the very distant future > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION. > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big > of a wrench into things. > > Something like: > > KVM_SET_MEMORY_ATTRIBUTES > > struct kvm_memory_attributes { > __u64 address; > __u64 size; > __u64 flags; > } I like the idea of adding a new ioctl(). But putting all attributes into a flags in uAPI sounds not good to me, e.g. forcing userspace to set all attributes in one call can cause pain for userspace, probably for KVM implementation as well. For private<->shared memory conversion, we actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit, but we force userspace to set other irrelevant bits as well if use this API. I looked at kvm_device_attr, sounds we can do similar: KVM_SET_MEMORY_ATTR struct kvm_memory_attr { __u64 address; __u64 size; #define KVM_MEM_ATTR_SHARED BIT(0) #define KVM_MEM_ATTR_READONLY BIT(1) #define KVM_MEM_ATTR_NOEXEC BIT(2) __u32 attr; __u32 pad; } I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well, but sounds like we need a KVM_UNSET_MEMORY_ATTR. Since we are exposing the attribute directly to userspace I also think we'd better treat shared memory as the default, so even when the private memory is not used, the bit can still be meaningful. So define BIT(0) as KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED. Thanks, Chao > > [*] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com > > > + struct kvm_enc_region region; > > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > > + > > + if (!kvm_arch_has_private_mem(kvm)) > > + goto arch_vm_ioctl; > > + > > + r = -EFAULT; > > + if (copy_from_user(®ion, argp, sizeof(region))) > > + goto out; > > + > > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > > + region.size, set); > > + break; > > + } > > +#endif
Paolo, any thoughts before I lead things further astray? On Fri, Nov 04, 2022, Chao Peng wrote: > On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote: > > On Tue, Oct 25, 2022, Chao Peng wrote: > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > > break; > > > } > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > + case KVM_MEMORY_ENCRYPT_REG_REGION: > > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > > > > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside > > from the fact that restricted/protected memory may not be encrypted, there are > > other potential use cases for per-page memory attributes[*], e.g. to make memory > > read-only (or no-exec, or exec-only, etc...) without having to modify memslots. > > > > Any paravirt use case where the attributes of a page are effectively dictated by > > the guest is going to run into the exact same performance problems with memslots, > > which isn't suprising in hindsight since shared vs. private is really just an > > attribute, albeit with extra special semantics. > > > > And if we go with a brand new ioctl(), maybe someday in the very distant future > > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION. > > > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big > > of a wrench into things. > > > > Something like: > > > > KVM_SET_MEMORY_ATTRIBUTES > > > > struct kvm_memory_attributes { > > __u64 address; > > __u64 size; > > __u64 flags; Oh, this is half-baked. I lost track of which flags were which. What I intended was a separate, initially-unused flags, e.g. struct kvm_memory_attributes { __u64 address; __u64 size; __u64 attributes; __u64 flags; } so that KVM can tweak behavior and/or extend the effective size of the struct. > I like the idea of adding a new ioctl(). But putting all attributes into > a flags in uAPI sounds not good to me, e.g. forcing userspace to set all > attributes in one call can cause pain for userspace, probably for KVM > implementation as well. For private<->shared memory conversion, we > actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit, Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory while it's accessible from the host. And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping WX permissions, would also be a common operation. Hmm, typing that out makes me think that if we do end up supporting other "attributes", i.e. protections, we should go straight to full RWX protections instead of doing things piecemeal, i.e. add individual protections instead of combinations like NO_EXEC and READ_ONLY. The protections would have to be inverted for backwards compatibility, but that's easy enough to handle. The semantics could be like protection keys, which also have inverted persmissions, where the final protections are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made writable via attributes. E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access to memory without needing to delete the memslot. KVM would need to disallow unsupported combinations, e.g. 
disallowed effective protections would be: - W or WX [unless there's an arch that supports write-only memory] - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware] - X [until KVM plumbs through support for exec-only, or it's unsupported in hardware] Anyways, that's all future work... > but we force userspace to set other irrelevant bits as well if use this > API. They aren't irrelevant though, as the memory attributes are all describing the allowed protections for a given page. If there's a use case where userspace "can't" keep track of the attributes for whatever reason, then userspace could do a RMW to set/clear attributes. Alternatively, the ioctl() could take an "operation" and support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the above to be: struct kvm_memory_attributes { __u64 address; __u64 size; __u64 attributes; __u32 operation; __u32 flags; } > I looked at kvm_device_attr, sounds we can do similar: The device attributes deal with isolated, arbitrary values, whereas memory attributes are flags, i.e. devices are 1:1 whereas memory is 1:MANY. There is no "unset" for device attributes, because they aren't flags. Device attributes vs. memory attributes really are two very different things that just happen to use a common name. If it helped clarify things without creating naming problems, we could even use PROTECTIONS instead of ATTRIBUTES. > KVM_SET_MEMORY_ATTR > > struct kvm_memory_attr { > __u64 address; > __u64 size; > #define KVM_MEM_ATTR_SHARED BIT(0) > #define KVM_MEM_ATTR_READONLY BIT(1) > #define KVM_MEM_ATTR_NOEXEC BIT(2) > __u32 attr; As above, letting userspace set only a single attribute would prevent setting (or clearing) multiple attributes in a single ioctl(). > __u32 pad; > } > > I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well, Definitely would need to communicate to userspace that various attributes are supported. That doesn't necessarily require a common ioctl(), but I don't see any reason not to add a common helper, and adding a common helper would mean KVM_CAP_PRIVATE_MEM can go away. But it should return a bitmask so that userspace can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES. As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an API, e.g. we could hold off until someone came along with a RMW use case (as above). That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES, so it's probably best to add it straightway. > but sounds like we need a KVM_UNSET_MEMORY_ATTR. No need if the setter operates on all attributes. > Since we are exposing the attribute directly to userspace I also think > we'd better treat shared memory as the default, so even when the private > memory is not used, the bit can still be meaningful. So define BIT(0) as > KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED. Ah, right.
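To make the inverted-permission reasoning above concrete, below is a small self-contained sketch of how effective protections could be derived from memslot protections plus per-page "NO_*" attributes, and how the disallowed combinations listed above might be rejected. All bit names and values here are hypothetical, not proposed uAPI.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical inverted attribute bits: a set bit removes a permission. */
    #define ATTR_NO_READ   (1u << 0)
    #define ATTR_NO_WRITE  (1u << 1)
    #define ATTR_NO_EXEC   (1u << 2)

    /* Hypothetical effective protection bits. */
    #define PROT_R (1u << 0)
    #define PROT_W (1u << 1)
    #define PROT_X (1u << 2)

    /* Attributes can only restrict: a read-only memslot can never be made
     * writable via attributes, matching the protection-key-like semantics
     * described above. */
    static unsigned int effective_prot(unsigned int slot_prot, unsigned int attr)
    {
            unsigned int prot = slot_prot;

            if (attr & ATTR_NO_READ)
                    prot &= ~PROT_R;
            if (attr & ATTR_NO_WRITE)
                    prot &= ~PROT_W;
            if (attr & ATTR_NO_EXEC)
                    prot &= ~PROT_X;
            return prot;
    }

    /* Reject the combinations listed above: W/WX always, and R/RW or
     * exec-only until no-exec / exec-only support is plumbed through. */
    static bool prot_allowed(unsigned int prot)
    {
            if ((prot & PROT_W) && !(prot & PROT_R))
                    return false;           /* W or WX */
            if ((prot & PROT_R) && !(prot & PROT_X))
                    return false;           /* R or RW */
            if (prot == PROT_X)
                    return false;           /* X only */
            return true;
    }

    int main(void)
    {
            unsigned int prot = effective_prot(PROT_R | PROT_W | PROT_X,
                                               ATTR_NO_WRITE);

            printf("effective=%#x allowed=%d\n", prot, prot_allowed(prot));
            return 0;
    }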
On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote: > Introduce generic private memory register/unregister by reusing existing > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case > by treating address in the region as gpa instead of hva. Which cases > should these ioctls go is determined by the kvm_arch_has_private_mem(). > Architecture which supports KVM_PRIVATE_MEM should override this function. > > KVM internally defaults all guest memory as private memory and maintain > the shared memory in 'mem_attr_array'. The above ioctls operate on this > field and unmap existing mappings if any. > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- > Documentation/virt/kvm/api.rst | 17 ++- > arch/x86/kvm/Kconfig | 1 + > include/linux/kvm_host.h | 10 +- > virt/kvm/Kconfig | 4 + > virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++-------- > 5 files changed, 198 insertions(+), 61 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 975688912b8c..08253cf498d1 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst. > This ioctl can be used to register a guest memory region which may > contain encrypted data (e.g. guest RAM, SMRAM etc). > > -It is used in the SEV-enabled guest. When encryption is enabled, a guest > -memory region may contain encrypted data. The SEV memory encryption > -engine uses a tweak such that two identical plaintext pages, each at > -different locations will have differing ciphertexts. So swapping or > +Currently this ioctl supports registering memory regions for two usages: > +private memory and SEV-encrypted memory. > + > +When private memory is enabled, this ioctl is used to register guest private > +memory region and the addr/size of kvm_enc_region represents guest physical > +address (GPA). In this usage, this ioctl zaps the existing guest memory > +mappings in KVM that fallen into the region. > + > +When SEV-encrypted memory is enabled, this ioctl is used to register guest > +memory region which may contain encrypted data for a SEV-enabled guest. The > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV > +memory encryption engine uses a tweak such that two identical plaintext pages, > +each at different locations will have differing ciphertexts. So swapping or > moving ciphertext of those pages will not result in plaintext being > swapped. So relocating (or migrating) physical backing pages for the SEV > guest will require some additional steps. > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 8d2bd455c0cd..73fdfa429b20 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -51,6 +51,7 @@ config KVM > select HAVE_KVM_PM_NOTIFIER if PM > select HAVE_KVM_RESTRICTED_MEM if X86_64 > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM > + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM > help > Support hosting fully virtualized guest machines using hardware > virtualization extensions. 
You will need a fairly recent > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 79e5cbc35fcf..4ce98fa0153c 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > #endif > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > + > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) > struct kvm_gfn_range { > struct kvm_memory_slot *slot; > gfn_t start; > @@ -254,6 +255,9 @@ struct kvm_gfn_range { > bool may_block; > }; > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > +#endif > + > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > @@ -794,6 +798,9 @@ struct kvm { > struct notifier_block pm_notifier; > #endif > char stats_id[KVM_STATS_NAME_SIZE]; > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + struct xarray mem_attr_array; > +#endif > }; > > #define kvm_err(fmt, ...) \ > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); > int kvm_arch_post_init_vm(struct kvm *kvm); > void kvm_arch_pre_destroy_vm(struct kvm *kvm); > int kvm_arch_create_vm_debugfs(struct kvm *kvm); > +bool kvm_arch_has_private_mem(struct kvm *kvm); > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC > /* > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > index 9ff164c7e0cc..69ca59e82149 100644 > --- a/virt/kvm/Kconfig > +++ b/virt/kvm/Kconfig > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER > > config HAVE_KVM_RESTRICTED_MEM > bool > + > +config KVM_GENERIC_PRIVATE_MEM > + bool > + depends on HAVE_KVM_RESTRICTED_MEM > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 09c9cdeb773c..fc3835826ace 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm) > } > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); > > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > + gfn_t end) > +{ > + if (likely(kvm->mmu_invalidate_in_progress == 1)) { > + kvm->mmu_invalidate_range_start = start; > + kvm->mmu_invalidate_range_end = end; > + } else { > + /* > + * Fully tracking multiple concurrent ranges has diminishing > + * returns. Keep things simple and just find the minimal range > + * which includes the current and new ranges. As there won't be > + * enough information to subtract a range after its invalidate > + * completes, any ranges invalidated concurrently will > + * accumulate and persist until all outstanding invalidates > + * complete. > + */ > + kvm->mmu_invalidate_range_start = > + min(kvm->mmu_invalidate_range_start, start); > + kvm->mmu_invalidate_range_end = > + max(kvm->mmu_invalidate_range_end, end); > + } > +} > + > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + /* > + * The count increase must become visible at unlock time as no > + * spte can be established without taking the mmu_lock and > + * count is also read inside the mmu_lock critical section. 
> + */ > + kvm->mmu_invalidate_in_progress++; > +} > + > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + mark_invalidate_in_progress(kvm, start, end); > + update_invalidate_range(kvm, start, end); > +} > + > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + /* > + * This sequence increase will notify the kvm page fault that > + * the page that is going to be mapped in the spte could have > + * been freed. > + */ > + kvm->mmu_invalidate_seq++; > + smp_wmb(); > + /* > + * The above sequence increase must be visible before the > + * below count decrease, which is ensured by the smp_wmb above > + * in conjunction with the smp_rmb in mmu_invalidate_retry(). > + */ > + kvm->mmu_invalidate_in_progress--; > +} > + > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > { > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); > } > > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > - gfn_t end) > -{ > - if (likely(kvm->mmu_invalidate_in_progress == 1)) { > - kvm->mmu_invalidate_range_start = start; > - kvm->mmu_invalidate_range_end = end; > - } else { > - /* > - * Fully tracking multiple concurrent ranges has diminishing > - * returns. Keep things simple and just find the minimal range > - * which includes the current and new ranges. As there won't be > - * enough information to subtract a range after its invalidate > - * completes, any ranges invalidated concurrently will > - * accumulate and persist until all outstanding invalidates > - * complete. > - */ > - kvm->mmu_invalidate_range_start = > - min(kvm->mmu_invalidate_range_start, start); > - kvm->mmu_invalidate_range_end = > - max(kvm->mmu_invalidate_range_end, end); > - } > -} > - > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > -{ > - /* > - * The count increase must become visible at unlock time as no > - * spte can be established without taking the mmu_lock and > - * count is also read inside the mmu_lock critical section. > - */ > - kvm->mmu_invalidate_in_progress++; > -} > - > static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > { > update_invalidate_range(kvm, range->start, range->end); > return kvm_unmap_gfn_range(kvm, range); > } > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > -{ > - mark_invalidate_in_progress(kvm, start, end); > - update_invalidate_range(kvm, start, end); > -} > - > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > const struct mmu_notifier_range *range) > { > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > return 0; > } > > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > -{ > - /* > - * This sequence increase will notify the kvm page fault that > - * the page that is going to be mapped in the spte could have > - * been freed. > - */ > - kvm->mmu_invalidate_seq++; > - smp_wmb(); > - /* > - * The above sequence increase must be visible before the > - * below count decrease, which is ensured by the smp_wmb above > - * in conjunction with the smp_rmb in mmu_invalidate_retry(). 
> - */ > - kvm->mmu_invalidate_in_progress--; > -} > - > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > const struct mmu_notifier_range *range) > { > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) > > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end) > +{ > + struct kvm_gfn_range gfn_range; > + struct kvm_memory_slot *slot; > + struct kvm_memslots *slots; > + struct kvm_memslot_iter iter; > + int i; > + int r = 0; > + > + gfn_range.pte = __pte(0); > + gfn_range.may_block = true; > + > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + slots = __kvm_memslots(kvm, i); > + > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) { > + slot = iter.slot; > + gfn_range.start = max(start, slot->base_gfn); > + gfn_range.end = min(end, slot->base_gfn + slot->npages); > + if (gfn_range.start >= gfn_range.end) > + continue; > + gfn_range.slot = slot; > + > + r |= kvm_unmap_gfn_range(kvm, &gfn_range); > + } > + } > + > + if (r) > + kvm_flush_remote_tlbs(kvm); > +} > + > +#define KVM_MEM_ATTR_SHARED 0x0001 > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > + bool is_private) > +{ > + gfn_t start, end; > + unsigned long i; > + void *entry; > + int idx; > + int r = 0; > + > + if (size == 0 || gpa + size < gpa) > + return -EINVAL; > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) > + return -EINVAL; > + > + start = gpa >> PAGE_SHIFT; > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; > + > + /* > + * Guest memory defaults to private, kvm->mem_attr_array only stores > + * shared memory. > + */ > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); > + > + idx = srcu_read_lock(&kvm->srcu); > + KVM_MMU_LOCK(kvm); > + kvm_mmu_invalidate_begin(kvm, start, end); > + > + for (i = start; i < end; i++) { > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, > + GFP_KERNEL_ACCOUNT)); > + if (r) > + goto err; > + } > + > + kvm_unmap_mem_range(kvm, start, end); lock is hold by KVM_MMU_LOCK() so how about do kvm_mmu_invalidate_begin() after changing xarray: kvm_mmu_invalidate_begin(kvm, start, end); kvm_unmap_mem_range(kvm, start, end); kvm_mmu_invalidate_end(kvm, start, end); Also the error handling path doesn't need to care it yet. > + > + goto ret; > +err: > + for (; i > start; i--) > + xa_erase(&kvm->mem_attr_array, i); the start should be covered yet, consider the i is unsigned long and case of start is 0, may need another variable j for this. 
> +ret: > + kvm_mmu_invalidate_end(kvm, start, end); > + KVM_MMU_UNLOCK(kvm); > + srcu_read_unlock(&kvm->srcu, idx); > + > + return r; > +} > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ > + > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER > static int kvm_pm_notifier_call(struct notifier_block *bl, > unsigned long state, > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > spin_lock_init(&kvm->mn_invalidate_lock); > rcuwait_init(&kvm->mn_memslots_update_rcuwait); > xa_init(&kvm->vcpu_array); > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + xa_init(&kvm->mem_attr_array); > +#endif > > INIT_LIST_HEAD(&kvm->gpc_list); > spin_lock_init(&kvm->gpc_lock); > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm) > kvm_free_memslots(kvm, &kvm->__memslots[i][0]); > kvm_free_memslots(kvm, &kvm->__memslots[i][1]); > } > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + xa_destroy(&kvm->mem_attr_array); > +#endif > cleanup_srcu_struct(&kvm->irq_srcu); > cleanup_srcu_struct(&kvm->srcu); > kvm_arch_free_vm(kvm); > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm, > } > } > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm) > +{ > + return false; > +} > + > static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > { > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > break; > } > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > + case KVM_MEMORY_ENCRYPT_REG_REGION: > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > + struct kvm_enc_region region; > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > + > + if (!kvm_arch_has_private_mem(kvm)) > + goto arch_vm_ioctl; > + > + r = -EFAULT; > + if (copy_from_user(®ion, argp, sizeof(region))) > + goto out; > + > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > + region.size, set); > + break; > + } > +#endif > case KVM_GET_DIRTY_LOG: { > struct kvm_dirty_log log; > > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp, > r = kvm_vm_ioctl_get_stats_fd(kvm); > break; > default: > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > +arch_vm_ioctl: > +#endif > r = kvm_arch_vm_ioctl(filp, ioctl, arg); > } > out: > -- > 2.25.1 > >
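A small standalone sketch of the error-path fix suggested in the review above: a "while (i-- > start)" rollback erases the entry at 'start' as well and cannot underflow when start is 0, so no extra index variable is needed. xa_erase() is modeled with a printf() here.

    #include <stdio.h>

    /* Toy model of the rollback: entries [start, fail) were stored before a
     * failure at index 'fail' and must all be erased again. */
    static void rollback(unsigned long start, unsigned long fail)
    {
            unsigned long i = fail;

            /* Visits fail-1 down to start inclusive; the comparison is
             * against 'start', never 'start - 1', so start == 0 is safe. */
            while (i-- > start)
                    printf("erase %lu\n", i);   /* xa_erase() in the real code */
    }

    int main(void)
    {
            rollback(0, 3);   /* erases 2, 1, 0 -- including index 'start' */
            return 0;
    }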
On Fri, Nov 04, 2022 at 09:19:31PM +0000, Sean Christopherson wrote: > Paolo, any thoughts before I lead things further astray? > > On Fri, Nov 04, 2022, Chao Peng wrote: > > On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote: > > > On Tue, Oct 25, 2022, Chao Peng wrote: > > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > > > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > > > break; > > > > } > > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > > + case KVM_MEMORY_ENCRYPT_REG_REGION: > > > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > > > > > > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside > > > from the fact that restricted/protected memory may not be encrypted, there are > > > other potential use cases for per-page memory attributes[*], e.g. to make memory > > > read-only (or no-exec, or exec-only, etc...) without having to modify memslots. > > > > > > Any paravirt use case where the attributes of a page are effectively dictated by > > > the guest is going to run into the exact same performance problems with memslots, > > > which isn't suprising in hindsight since shared vs. private is really just an > > > attribute, albeit with extra special semantics. > > > > > > And if we go with a brand new ioctl(), maybe someday in the very distant future > > > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION. > > > > > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big > > > of a wrench into things. > > > > > > Something like: > > > > > > KVM_SET_MEMORY_ATTRIBUTES > > > > > > struct kvm_memory_attributes { > > > __u64 address; > > > __u64 size; > > > __u64 flags; > > Oh, this is half-baked. I lost track of which flags were which. What I intended > was a separate, initially-unused flags, e.g. That makes sense. > > struct kvm_memory_attributes { > __u64 address; > __u64 size; > __u64 attributes; > __u64 flags; > } > > so that KVM can tweak behavior and/or extend the effective size of the struct. > > > I like the idea of adding a new ioctl(). But putting all attributes into > > a flags in uAPI sounds not good to me, e.g. forcing userspace to set all > > attributes in one call can cause pain for userspace, probably for KVM > > implementation as well. For private<->shared memory conversion, we > > actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit, > > Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED > or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory > while it's accessible from the host. > > And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping > WX permissions, would also be a common operation. > > Hmm, typing that out makes me think that if we do end up supporting other "attributes", > i.e. protections, we should go straight to full RWX protections instead of doing > things piecemeal, i.e. add individual protections instead of combinations like > NO_EXEC and READ_ONLY. The protections would have to be inverted for backwards > compatibility, but that's easy enough to handle. The semantics could be like > protection keys, which also have inverted persmissions, where the final protections > are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made > writable via attributes. > > E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access > to memory without needing to delete the memslot. 
KVM would need to disallow > unsupported combinations, e.g. disallowed effective protections would be: > > - W or WX [unless there's an arch that supports write-only memory] > - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware] > - X [until KVM plumbs through support for exec-only, or it's unsupported in hardware] > > Anyways, that's all future work... > > > but we force userspace to set other irrelevant bits as well if use this > > API. > > They aren't irrelevant though, as the memory attributes are all describing the > allowed protections for a given page. The 'allowed' protections seems answer my concern. But after we enabled "NO_READ | NO_WRITE | NO_EXEC", are we going to check "NO_READ | NO_WRITE | NO_EXEC" are also set together with the PRIVATE bit? I just can't imagine what the semantic would be if we have the PRIVATE bit set but other bits indicate it's actually can READ/WRITE/EXEC from usrspace. > If there's a use case where userspace "can't" > keep track of the attributes for whatever reason, then userspace could do a RMW > to set/clear attributes. Alternatively, the ioctl() could take an "operation" and > support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the > above to be: A getter would be good, it might also be needed for live migration. > > struct kvm_memory_attributes { > __u64 address; > __u64 size; > __u64 attributes; > __u32 operation; > __u32 flags; > } > > > I looked at kvm_device_attr, sounds we can do similar: > > The device attributes deal with isolated, arbitrary values, whereas memory attributes > are flags, i.e. devices are 1:1 whereas memory is 1:MANY. There is no "unset" for > device attributes, because they aren't flags. Device attributes vs. memory attributes > really are two very different things that just happen to use a common name. > > If it helped clarify things without creating naming problems, we could even use > PROTECTIONS instead of ATTRIBUTES. > > > KVM_SET_MEMORY_ATTR > > > > struct kvm_memory_attr { > > __u64 address; > > __u64 size; > > #define KVM_MEM_ATTR_SHARED BIT(0) > > #define KVM_MEM_ATTR_READONLY BIT(1) > > #define KVM_MEM_ATTR_NOEXEC BIT(2) > > __u32 attr; > > As above, letting userspace set only a single attribute would prevent setting > (or clearing) multiple attributes in a single ioctl(). > > > __u32 pad; > > } > > > > I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well, > > Definitely would need to communicate to userspace that various attributes are > supported. That doesn't necessarily require a common ioctl(), but I don't see > any reason not to add a common helper, and adding a common helper would mean > KVM_CAP_PRIVATE_MEM can go away. But it should return a bitmask so that userspace > can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES. Do you have preference on using a new ioctl or just keep it as a cap? E.g. KVM_CAP_MEMORY_ATTIBUTES can also returns a mask. > > As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an > API, e.g. we could hold off until someone came along with a RMW use case (as above). > That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES, > so it's probably best to add it straightway. Dive into the implementation a bit, for KVM_GET_MEMORY_ATTRIBUTES we can have different attributes for different pages in the same user-provided range, in that case we will have to either return a list or just a error number. 
Or we only support per-page attributes for the getter? Chao > > > but sounds like we need a KVM_UNSET_MEMORY_ATTR. > > No need if the setter operates on all attributes. > > > Since we are exposing the attribute directly to userspace I also think > > we'd better treat shared memory as the default, so even when the private > > memory is not used, the bit can still be meaningful. So define BIT(0) as > > KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED. > > Ah, right.
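As a sketch of the capability option mentioned above: KVM_CHECK_EXTENSION already returns an int, so a capability could report the supported-attributes bitmask directly, as an alternative to a dedicated KVM_SUPPORTED_MEMORY_ATTRIBUTES ioctl. The capability name and number below are placeholders, not existing uAPI values.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Placeholder standing in for the KVM_CAP_MEMORY_ATTRIBUTES idea above. */
    #define KVM_CAP_MEMORY_ATTRIBUTES_EXAMPLE 999

    /* Returns a bitmask of supported attributes, or 0 if unsupported. */
    static uint32_t supported_memory_attributes(int vm_fd)
    {
            int r = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                          KVM_CAP_MEMORY_ATTRIBUTES_EXAMPLE);

            return r > 0 ? (uint32_t)r : 0;
    }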
On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote: > On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote: > > Introduce generic private memory register/unregister by reusing existing > > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case > > by treating address in the region as gpa instead of hva. Which cases > > should these ioctls go is determined by the kvm_arch_has_private_mem(). > > Architecture which supports KVM_PRIVATE_MEM should override this function. > > > > KVM internally defaults all guest memory as private memory and maintain > > the shared memory in 'mem_attr_array'. The above ioctls operate on this > > field and unmap existing mappings if any. > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > --- > > Documentation/virt/kvm/api.rst | 17 ++- > > arch/x86/kvm/Kconfig | 1 + > > include/linux/kvm_host.h | 10 +- > > virt/kvm/Kconfig | 4 + > > virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++-------- > > 5 files changed, 198 insertions(+), 61 deletions(-) > > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > > index 975688912b8c..08253cf498d1 100644 > > --- a/Documentation/virt/kvm/api.rst > > +++ b/Documentation/virt/kvm/api.rst > > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst. > > This ioctl can be used to register a guest memory region which may > > contain encrypted data (e.g. guest RAM, SMRAM etc). > > > > -It is used in the SEV-enabled guest. When encryption is enabled, a guest > > -memory region may contain encrypted data. The SEV memory encryption > > -engine uses a tweak such that two identical plaintext pages, each at > > -different locations will have differing ciphertexts. So swapping or > > +Currently this ioctl supports registering memory regions for two usages: > > +private memory and SEV-encrypted memory. > > + > > +When private memory is enabled, this ioctl is used to register guest private > > +memory region and the addr/size of kvm_enc_region represents guest physical > > +address (GPA). In this usage, this ioctl zaps the existing guest memory > > +mappings in KVM that fallen into the region. > > + > > +When SEV-encrypted memory is enabled, this ioctl is used to register guest > > +memory region which may contain encrypted data for a SEV-enabled guest. The > > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV > > +memory encryption engine uses a tweak such that two identical plaintext pages, > > +each at different locations will have differing ciphertexts. So swapping or > > moving ciphertext of those pages will not result in plaintext being > > swapped. So relocating (or migrating) physical backing pages for the SEV > > guest will require some additional steps. > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > > index 8d2bd455c0cd..73fdfa429b20 100644 > > --- a/arch/x86/kvm/Kconfig > > +++ b/arch/x86/kvm/Kconfig > > @@ -51,6 +51,7 @@ config KVM > > select HAVE_KVM_PM_NOTIFIER if PM > > select HAVE_KVM_RESTRICTED_MEM if X86_64 > > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM > > + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM > > help > > Support hosting fully virtualized guest machines using hardware > > virtualization extensions. 
You will need a fairly recent > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 79e5cbc35fcf..4ce98fa0153c 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > > #endif > > > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > > + > > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) > > struct kvm_gfn_range { > > struct kvm_memory_slot *slot; > > gfn_t start; > > @@ -254,6 +255,9 @@ struct kvm_gfn_range { > > bool may_block; > > }; > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > > +#endif > > + > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > @@ -794,6 +798,9 @@ struct kvm { > > struct notifier_block pm_notifier; > > #endif > > char stats_id[KVM_STATS_NAME_SIZE]; > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > + struct xarray mem_attr_array; > > +#endif > > }; > > > > #define kvm_err(fmt, ...) \ > > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); > > int kvm_arch_post_init_vm(struct kvm *kvm); > > void kvm_arch_pre_destroy_vm(struct kvm *kvm); > > int kvm_arch_create_vm_debugfs(struct kvm *kvm); > > +bool kvm_arch_has_private_mem(struct kvm *kvm); > > > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC > > /* > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > > index 9ff164c7e0cc..69ca59e82149 100644 > > --- a/virt/kvm/Kconfig > > +++ b/virt/kvm/Kconfig > > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER > > > > config HAVE_KVM_RESTRICTED_MEM > > bool > > + > > +config KVM_GENERIC_PRIVATE_MEM > > + bool > > + depends on HAVE_KVM_RESTRICTED_MEM > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 09c9cdeb773c..fc3835826ace 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm) > > } > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); > > > > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > > + gfn_t end) > > +{ > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) { > > + kvm->mmu_invalidate_range_start = start; > > + kvm->mmu_invalidate_range_end = end; > > + } else { > > + /* > > + * Fully tracking multiple concurrent ranges has diminishing > > + * returns. Keep things simple and just find the minimal range > > + * which includes the current and new ranges. As there won't be > > + * enough information to subtract a range after its invalidate > > + * completes, any ranges invalidated concurrently will > > + * accumulate and persist until all outstanding invalidates > > + * complete. > > + */ > > + kvm->mmu_invalidate_range_start = > > + min(kvm->mmu_invalidate_range_start, start); > > + kvm->mmu_invalidate_range_end = > > + max(kvm->mmu_invalidate_range_end, end); > > + } > > +} > > + > > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > > +{ > > + /* > > + * The count increase must become visible at unlock time as no > > + * spte can be established without taking the mmu_lock and > > + * count is also read inside the mmu_lock critical section. 
> > + */ > > + kvm->mmu_invalidate_in_progress++; > > +} > > + > > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > > +{ > > + mark_invalidate_in_progress(kvm, start, end); > > + update_invalidate_range(kvm, start, end); > > +} > > + > > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > > +{ > > + /* > > + * This sequence increase will notify the kvm page fault that > > + * the page that is going to be mapped in the spte could have > > + * been freed. > > + */ > > + kvm->mmu_invalidate_seq++; > > + smp_wmb(); > > + /* > > + * The above sequence increase must be visible before the > > + * below count decrease, which is ensured by the smp_wmb above > > + * in conjunction with the smp_rmb in mmu_invalidate_retry(). > > + */ > > + kvm->mmu_invalidate_in_progress--; > > +} > > + > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > > { > > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); > > } > > > > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > > - gfn_t end) > > -{ > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) { > > - kvm->mmu_invalidate_range_start = start; > > - kvm->mmu_invalidate_range_end = end; > > - } else { > > - /* > > - * Fully tracking multiple concurrent ranges has diminishing > > - * returns. Keep things simple and just find the minimal range > > - * which includes the current and new ranges. As there won't be > > - * enough information to subtract a range after its invalidate > > - * completes, any ranges invalidated concurrently will > > - * accumulate and persist until all outstanding invalidates > > - * complete. > > - */ > > - kvm->mmu_invalidate_range_start = > > - min(kvm->mmu_invalidate_range_start, start); > > - kvm->mmu_invalidate_range_end = > > - max(kvm->mmu_invalidate_range_end, end); > > - } > > -} > > - > > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > > -{ > > - /* > > - * The count increase must become visible at unlock time as no > > - * spte can be established without taking the mmu_lock and > > - * count is also read inside the mmu_lock critical section. > > - */ > > - kvm->mmu_invalidate_in_progress++; > > -} > > - > > static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > > { > > update_invalidate_range(kvm, range->start, range->end); > > return kvm_unmap_gfn_range(kvm, range); > > } > > > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > > -{ > > - mark_invalidate_in_progress(kvm, start, end); > > - update_invalidate_range(kvm, start, end); > > -} > > - > > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > const struct mmu_notifier_range *range) > > { > > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > return 0; > > } > > > > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > > -{ > > - /* > > - * This sequence increase will notify the kvm page fault that > > - * the page that is going to be mapped in the spte could have > > - * been freed. 
> > - */ > > - kvm->mmu_invalidate_seq++; > > - smp_wmb(); > > - /* > > - * The above sequence increase must be visible before the > > - * below count decrease, which is ensured by the smp_wmb above > > - * in conjunction with the smp_rmb in mmu_invalidate_retry(). > > - */ > > - kvm->mmu_invalidate_in_progress--; > > -} > > - > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > > const struct mmu_notifier_range *range) > > { > > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) > > > > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ > > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > + > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end) > > +{ > > + struct kvm_gfn_range gfn_range; > > + struct kvm_memory_slot *slot; > > + struct kvm_memslots *slots; > > + struct kvm_memslot_iter iter; > > + int i; > > + int r = 0; > > + > > + gfn_range.pte = __pte(0); > > + gfn_range.may_block = true; > > + > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > > + slots = __kvm_memslots(kvm, i); > > + > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) { > > + slot = iter.slot; > > + gfn_range.start = max(start, slot->base_gfn); > > + gfn_range.end = min(end, slot->base_gfn + slot->npages); > > + if (gfn_range.start >= gfn_range.end) > > + continue; > > + gfn_range.slot = slot; > > + > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range); > > + } > > + } > > + > > + if (r) > > + kvm_flush_remote_tlbs(kvm); > > +} > > + > > +#define KVM_MEM_ATTR_SHARED 0x0001 > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > > + bool is_private) > > +{ > > + gfn_t start, end; > > + unsigned long i; > > + void *entry; > > + int idx; > > + int r = 0; > > + > > + if (size == 0 || gpa + size < gpa) > > + return -EINVAL; > > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) > > + return -EINVAL; > > + > > + start = gpa >> PAGE_SHIFT; > > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; > > + > > + /* > > + * Guest memory defaults to private, kvm->mem_attr_array only stores > > + * shared memory. > > + */ > > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); > > + > > + idx = srcu_read_lock(&kvm->srcu); > > + KVM_MMU_LOCK(kvm); > > + kvm_mmu_invalidate_begin(kvm, start, end); > > + > > + for (i = start; i < end; i++) { > > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, > > + GFP_KERNEL_ACCOUNT)); > > + if (r) > > + goto err; > > + } > > + > > + kvm_unmap_mem_range(kvm, start, end); > > lock is hold by KVM_MMU_LOCK() so how about do > kvm_mmu_invalidate_begin() after changing xarray: > > kvm_mmu_invalidate_begin(kvm, start, end); > kvm_unmap_mem_range(kvm, start, end); > kvm_mmu_invalidate_end(kvm, start, end); > > Also the error handling path doesn't need to care it yet. The mem_attr_array is consumed in the page fault handler(i.e. kvm_mem_is_private() in patch 08) so it should also be protected by kvm_mmu_invalidate_begin/end(). E.g. if we change the mem_attr_arry here after the page fault handler has read the mem_attr_array, the mmu_invalidate_retry_gfn() should return 1 to let the page fault handler to retry the fault. > > > + > > + goto ret; > > +err: > > + for (; i > start; i--) > > + xa_erase(&kvm->mem_attr_array, i); > > the start should be covered yet, consider the i is > unsigned long and case of start is 0, may need another > variable j for this. Ah, right! 
Thanks, Chao > > > +ret: > > + kvm_mmu_invalidate_end(kvm, start, end); > > + KVM_MMU_UNLOCK(kvm); > > + srcu_read_unlock(&kvm->srcu, idx); > > + > > + return r; > > +} > > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ > > + > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER > > static int kvm_pm_notifier_call(struct notifier_block *bl, > > unsigned long state, > > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > > spin_lock_init(&kvm->mn_invalidate_lock); > > rcuwait_init(&kvm->mn_memslots_update_rcuwait); > > xa_init(&kvm->vcpu_array); > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > + xa_init(&kvm->mem_attr_array); > > +#endif > > > > INIT_LIST_HEAD(&kvm->gpc_list); > > spin_lock_init(&kvm->gpc_lock); > > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm) > > kvm_free_memslots(kvm, &kvm->__memslots[i][0]); > > kvm_free_memslots(kvm, &kvm->__memslots[i][1]); > > } > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > + xa_destroy(&kvm->mem_attr_array); > > +#endif > > cleanup_srcu_struct(&kvm->irq_srcu); > > cleanup_srcu_struct(&kvm->srcu); > > kvm_arch_free_vm(kvm); > > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm, > > } > > } > > > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm) > > +{ > > + return false; > > +} > > + > > static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > > { > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > break; > > } > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > + case KVM_MEMORY_ENCRYPT_REG_REGION: > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > > + struct kvm_enc_region region; > > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > > + > > + if (!kvm_arch_has_private_mem(kvm)) > > + goto arch_vm_ioctl; > > + > > + r = -EFAULT; > > + if (copy_from_user(®ion, argp, sizeof(region))) > > + goto out; > > + > > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > > + region.size, set); > > + break; > > + } > > +#endif > > case KVM_GET_DIRTY_LOG: { > > struct kvm_dirty_log log; > > > > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp, > > r = kvm_vm_ioctl_get_stats_fd(kvm); > > break; > > default: > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > +arch_vm_ioctl: > > +#endif > > r = kvm_arch_vm_ioctl(filp, ioctl, arg); > > } > > out: > > -- > > 2.25.1 > > > >
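[Editorial note] A small sketch of the fix Yuan asks for above: walk a second, forward-moving index so the unwind also erases index 'start' without decrementing an unsigned long past zero. Illustrative only, and note that later in the thread the suggestion is to drop the in-place unwind altogether.

	/* In kvm_vm_ioctl_set_mem_attr(), with 'unsigned long j' declared: */
err:
	/*
	 * Entries in [start, i) were written before xa_store() failed at
	 * index i, so erase exactly those. Walking j forward also covers
	 * 'start' itself and cannot wrap, unlike decrementing an unsigned
	 * long when start == 0.
	 */
	for (j = start; j < i; j++)
		xa_erase(&kvm->mem_attr_array, j);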
On Tue, Nov 08, 2022 at 05:41:41PM +0800, Chao Peng wrote: > On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote: > > On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote: > > > Introduce generic private memory register/unregister by reusing existing > > > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case > > > by treating address in the region as gpa instead of hva. Which cases > > > should these ioctls go is determined by the kvm_arch_has_private_mem(). > > > Architecture which supports KVM_PRIVATE_MEM should override this function. > > > > > > KVM internally defaults all guest memory as private memory and maintain > > > the shared memory in 'mem_attr_array'. The above ioctls operate on this > > > field and unmap existing mappings if any. > > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > > --- > > > Documentation/virt/kvm/api.rst | 17 ++- > > > arch/x86/kvm/Kconfig | 1 + > > > include/linux/kvm_host.h | 10 +- > > > virt/kvm/Kconfig | 4 + > > > virt/kvm/kvm_main.c | 227 +++++++++++++++++++++++++-------- > > > 5 files changed, 198 insertions(+), 61 deletions(-) > > > > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > > > index 975688912b8c..08253cf498d1 100644 > > > --- a/Documentation/virt/kvm/api.rst > > > +++ b/Documentation/virt/kvm/api.rst > > > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst. > > > This ioctl can be used to register a guest memory region which may > > > contain encrypted data (e.g. guest RAM, SMRAM etc). > > > > > > -It is used in the SEV-enabled guest. When encryption is enabled, a guest > > > -memory region may contain encrypted data. The SEV memory encryption > > > -engine uses a tweak such that two identical plaintext pages, each at > > > -different locations will have differing ciphertexts. So swapping or > > > +Currently this ioctl supports registering memory regions for two usages: > > > +private memory and SEV-encrypted memory. > > > + > > > +When private memory is enabled, this ioctl is used to register guest private > > > +memory region and the addr/size of kvm_enc_region represents guest physical > > > +address (GPA). In this usage, this ioctl zaps the existing guest memory > > > +mappings in KVM that fallen into the region. > > > + > > > +When SEV-encrypted memory is enabled, this ioctl is used to register guest > > > +memory region which may contain encrypted data for a SEV-enabled guest. The > > > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV > > > +memory encryption engine uses a tweak such that two identical plaintext pages, > > > +each at different locations will have differing ciphertexts. So swapping or > > > moving ciphertext of those pages will not result in plaintext being > > > swapped. So relocating (or migrating) physical backing pages for the SEV > > > guest will require some additional steps. > > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > > > index 8d2bd455c0cd..73fdfa429b20 100644 > > > --- a/arch/x86/kvm/Kconfig > > > +++ b/arch/x86/kvm/Kconfig > > > @@ -51,6 +51,7 @@ config KVM > > > select HAVE_KVM_PM_NOTIFIER if PM > > > select HAVE_KVM_RESTRICTED_MEM if X86_64 > > > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM > > > + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM > > > help > > > Support hosting fully virtualized guest machines using hardware > > > virtualization extensions. 
You will need a fairly recent > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index 79e5cbc35fcf..4ce98fa0153c 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > > > #endif > > > > > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > > > + > > > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) > > > struct kvm_gfn_range { > > > struct kvm_memory_slot *slot; > > > gfn_t start; > > > @@ -254,6 +255,9 @@ struct kvm_gfn_range { > > > bool may_block; > > > }; > > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > > > +#endif > > > + > > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER > > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > > @@ -794,6 +798,9 @@ struct kvm { > > > struct notifier_block pm_notifier; > > > #endif > > > char stats_id[KVM_STATS_NAME_SIZE]; > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > + struct xarray mem_attr_array; > > > +#endif > > > }; > > > > > > #define kvm_err(fmt, ...) \ > > > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); > > > int kvm_arch_post_init_vm(struct kvm *kvm); > > > void kvm_arch_pre_destroy_vm(struct kvm *kvm); > > > int kvm_arch_create_vm_debugfs(struct kvm *kvm); > > > +bool kvm_arch_has_private_mem(struct kvm *kvm); > > > > > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC > > > /* > > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > > > index 9ff164c7e0cc..69ca59e82149 100644 > > > --- a/virt/kvm/Kconfig > > > +++ b/virt/kvm/Kconfig > > > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER > > > > > > config HAVE_KVM_RESTRICTED_MEM > > > bool > > > + > > > +config KVM_GENERIC_PRIVATE_MEM > > > + bool > > > + depends on HAVE_KVM_RESTRICTED_MEM > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 09c9cdeb773c..fc3835826ace 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm) > > > } > > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); > > > > > > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > > > + gfn_t end) > > > +{ > > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) { > > > + kvm->mmu_invalidate_range_start = start; > > > + kvm->mmu_invalidate_range_end = end; > > > + } else { > > > + /* > > > + * Fully tracking multiple concurrent ranges has diminishing > > > + * returns. Keep things simple and just find the minimal range > > > + * which includes the current and new ranges. As there won't be > > > + * enough information to subtract a range after its invalidate > > > + * completes, any ranges invalidated concurrently will > > > + * accumulate and persist until all outstanding invalidates > > > + * complete. 
> > > + */ > > > + kvm->mmu_invalidate_range_start = > > > + min(kvm->mmu_invalidate_range_start, start); > > > + kvm->mmu_invalidate_range_end = > > > + max(kvm->mmu_invalidate_range_end, end); > > > + } > > > +} > > > + > > > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > > > +{ > > > + /* > > > + * The count increase must become visible at unlock time as no > > > + * spte can be established without taking the mmu_lock and > > > + * count is also read inside the mmu_lock critical section. > > > + */ > > > + kvm->mmu_invalidate_in_progress++; > > > +} > > > + > > > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > > > +{ > > > + mark_invalidate_in_progress(kvm, start, end); > > > + update_invalidate_range(kvm, start, end); > > > +} > > > + > > > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > > > +{ > > > + /* > > > + * This sequence increase will notify the kvm page fault that > > > + * the page that is going to be mapped in the spte could have > > > + * been freed. > > > + */ > > > + kvm->mmu_invalidate_seq++; > > > + smp_wmb(); > > > + /* > > > + * The above sequence increase must be visible before the > > > + * below count decrease, which is ensured by the smp_wmb above > > > + * in conjunction with the smp_rmb in mmu_invalidate_retry(). > > > + */ > > > + kvm->mmu_invalidate_in_progress--; > > > +} > > > + > > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) > > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > > > { > > > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); > > > } > > > > > > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, > > > - gfn_t end) > > > -{ > > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) { > > > - kvm->mmu_invalidate_range_start = start; > > > - kvm->mmu_invalidate_range_end = end; > > > - } else { > > > - /* > > > - * Fully tracking multiple concurrent ranges has diminishing > > > - * returns. Keep things simple and just find the minimal range > > > - * which includes the current and new ranges. As there won't be > > > - * enough information to subtract a range after its invalidate > > > - * completes, any ranges invalidated concurrently will > > > - * accumulate and persist until all outstanding invalidates > > > - * complete. > > > - */ > > > - kvm->mmu_invalidate_range_start = > > > - min(kvm->mmu_invalidate_range_start, start); > > > - kvm->mmu_invalidate_range_end = > > > - max(kvm->mmu_invalidate_range_end, end); > > > - } > > > -} > > > - > > > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) > > > -{ > > > - /* > > > - * The count increase must become visible at unlock time as no > > > - * spte can be established without taking the mmu_lock and > > > - * count is also read inside the mmu_lock critical section. 
> > > - */ > > > - kvm->mmu_invalidate_in_progress++; > > > -} > > > - > > > static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > > > { > > > update_invalidate_range(kvm, range->start, range->end); > > > return kvm_unmap_gfn_range(kvm, range); > > > } > > > > > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) > > > -{ > > > - mark_invalidate_in_progress(kvm, start, end); > > > - update_invalidate_range(kvm, start, end); > > > -} > > > - > > > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > > const struct mmu_notifier_range *range) > > > { > > > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > > return 0; > > > } > > > > > > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) > > > -{ > > > - /* > > > - * This sequence increase will notify the kvm page fault that > > > - * the page that is going to be mapped in the spte could have > > > - * been freed. > > > - */ > > > - kvm->mmu_invalidate_seq++; > > > - smp_wmb(); > > > - /* > > > - * The above sequence increase must be visible before the > > > - * below count decrease, which is ensured by the smp_wmb above > > > - * in conjunction with the smp_rmb in mmu_invalidate_retry(). > > > - */ > > > - kvm->mmu_invalidate_in_progress--; > > > -} > > > - > > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > > > const struct mmu_notifier_range *range) > > > { > > > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) > > > > > > #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ > > > > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > + > > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end) > > > +{ > > > + struct kvm_gfn_range gfn_range; > > > + struct kvm_memory_slot *slot; > > > + struct kvm_memslots *slots; > > > + struct kvm_memslot_iter iter; > > > + int i; > > > + int r = 0; > > > + > > > + gfn_range.pte = __pte(0); > > > + gfn_range.may_block = true; > > > + > > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > > > + slots = __kvm_memslots(kvm, i); > > > + > > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) { > > > + slot = iter.slot; > > > + gfn_range.start = max(start, slot->base_gfn); > > > + gfn_range.end = min(end, slot->base_gfn + slot->npages); > > > + if (gfn_range.start >= gfn_range.end) > > > + continue; > > > + gfn_range.slot = slot; > > > + > > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range); > > > + } > > > + } > > > + > > > + if (r) > > > + kvm_flush_remote_tlbs(kvm); > > > +} > > > + > > > +#define KVM_MEM_ATTR_SHARED 0x0001 > > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > > > + bool is_private) > > > +{ > > > + gfn_t start, end; > > > + unsigned long i; > > > + void *entry; > > > + int idx; > > > + int r = 0; > > > + > > > + if (size == 0 || gpa + size < gpa) > > > + return -EINVAL; > > > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) > > > + return -EINVAL; > > > + > > > + start = gpa >> PAGE_SHIFT; > > > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; > > > + > > > + /* > > > + * Guest memory defaults to private, kvm->mem_attr_array only stores > > > + * shared memory. > > > + */ > > > + entry = is_private ? 
NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); > > > + > > > + idx = srcu_read_lock(&kvm->srcu); > > > + KVM_MMU_LOCK(kvm); > > > + kvm_mmu_invalidate_begin(kvm, start, end); > > > + > > > + for (i = start; i < end; i++) { > > > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, > > > + GFP_KERNEL_ACCOUNT)); > > > + if (r) > > > + goto err; > > > + } > > > + > > > + kvm_unmap_mem_range(kvm, start, end); > > > > lock is hold by KVM_MMU_LOCK() so how about do > > kvm_mmu_invalidate_begin() after changing xarray: > > > > kvm_mmu_invalidate_begin(kvm, start, end); > > kvm_unmap_mem_range(kvm, start, end); > > kvm_mmu_invalidate_end(kvm, start, end); > > > > Also the error handling path doesn't need to care it yet. > > The mem_attr_array is consumed in the page fault handler(i.e. > kvm_mem_is_private() in patch 08) so it should also be protected by > kvm_mmu_invalidate_begin/end(). E.g. if we change the mem_attr_arry here > after the page fault handler has read the mem_attr_array, the > mmu_invalidate_retry_gfn() should return 1 to let the page fault handler > to retry the fault. You're right! Even the changes are undo by error handling path, we still need to sure that user of mem_attr_arry retry the fault, due to the user may get some "stale" data (they're "stale" becaus the xarray is recovered in error case). > > > > > > + > > > + goto ret; > > > +err: > > > + for (; i > start; i--) > > > + xa_erase(&kvm->mem_attr_array, i); > > > > the start should be covered yet, consider the i is > > unsigned long and case of start is 0, may need another > > variable j for this. > > Ah, right! > > Thanks, > Chao > > > > > +ret: > > > + kvm_mmu_invalidate_end(kvm, start, end); > > > + KVM_MMU_UNLOCK(kvm); > > > + srcu_read_unlock(&kvm->srcu, idx); > > > + > > > + return r; > > > +} > > > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ > > > + > > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER > > > static int kvm_pm_notifier_call(struct notifier_block *bl, > > > unsigned long state, > > > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > > > spin_lock_init(&kvm->mn_invalidate_lock); > > > rcuwait_init(&kvm->mn_memslots_update_rcuwait); > > > xa_init(&kvm->vcpu_array); > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > + xa_init(&kvm->mem_attr_array); > > > +#endif > > > > > > INIT_LIST_HEAD(&kvm->gpc_list); > > > spin_lock_init(&kvm->gpc_lock); > > > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm) > > > kvm_free_memslots(kvm, &kvm->__memslots[i][0]); > > > kvm_free_memslots(kvm, &kvm->__memslots[i][1]); > > > } > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > + xa_destroy(&kvm->mem_attr_array); > > > +#endif > > > cleanup_srcu_struct(&kvm->irq_srcu); > > > cleanup_srcu_struct(&kvm->srcu); > > > kvm_arch_free_vm(kvm); > > > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm, > > > } > > > } > > > > > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm) > > > +{ > > > + return false; > > > +} > > > + > > > static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > > > { > > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > > break; > > > } > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > + case KVM_MEMORY_ENCRYPT_REG_REGION: > > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > > > + struct kvm_enc_region region; > > > + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > > > + > > > + if 
(!kvm_arch_has_private_mem(kvm)) > > > + goto arch_vm_ioctl; > > > + > > > + r = -EFAULT; > > > + if (copy_from_user(®ion, argp, sizeof(region))) > > > + goto out; > > > + > > > + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > > > + region.size, set); > > > + break; > > > + } > > > +#endif > > > case KVM_GET_DIRTY_LOG: { > > > struct kvm_dirty_log log; > > > > > > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp, > > > r = kvm_vm_ioctl_get_stats_fd(kvm); > > > break; > > > default: > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > +arch_vm_ioctl: > > > +#endif > > > r = kvm_arch_vm_ioctl(filp, ioctl, arg); > > > } > > > out: > > > -- > > > 2.25.1 > > > > > >
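[Editorial note] The protection both replies describe is KVM's usual invalidation/retry handshake: a reader samples mmu_invalidate_seq before consulting mem_attr_array outside mmu_lock, then re-checks the sequence under mmu_lock before installing anything. The consumer-side sketch below is an assumption-laden illustration: kvm_mem_is_private() is the helper referenced from patch 08 and its signature here is guessed, and upstream's mmu_invalidate_retry() stands in for the series' mmu_invalidate_retry_gfn().

#include <linux/kvm_host.h>

static int sketch_map_gfn(struct kvm *kvm, gfn_t gfn, bool *is_private)
{
	unsigned long mmu_seq;
	int r = 0;

	mmu_seq = kvm->mmu_invalidate_seq;
	/* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
	smp_rmb();

	/* The attribute lookup happens outside mmu_lock... */
	*is_private = kvm_mem_is_private(kvm, gfn);

	KVM_MMU_LOCK(kvm);
	/*
	 * ...so if a conversion raced with the lookup, *is_private may be
	 * stale; the caller must retry the fault instead of installing a
	 * mapping based on it.
	 */
	if (mmu_invalidate_retry(kvm, mmu_seq))
		r = -EAGAIN;
	KVM_MMU_UNLOCK(kvm);

	return r;
}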
On Tue, Oct 25, 2022, Chao Peng wrote: > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > + bool is_private) > +{ > + gfn_t start, end; > + unsigned long i; > + void *entry; > + int idx; > + int r = 0; > + > + if (size == 0 || gpa + size < gpa) > + return -EINVAL; > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) > + return -EINVAL; > + > + start = gpa >> PAGE_SHIFT; > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; > + > + /* > + * Guest memory defaults to private, kvm->mem_attr_array only stores > + * shared memory. > + */ > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); > + > + idx = srcu_read_lock(&kvm->srcu); > + KVM_MMU_LOCK(kvm); > + kvm_mmu_invalidate_begin(kvm, start, end); > + > + for (i = start; i < end; i++) { > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, > + GFP_KERNEL_ACCOUNT)); > + if (r) > + goto err; > + } > + > + kvm_unmap_mem_range(kvm, start, end); > + > + goto ret; > +err: > + for (; i > start; i--) > + xa_erase(&kvm->mem_attr_array, i); I don't think deleting previous entries is correct. To unwind, the correct thing to do is restore the original values. E.g. if userspace space is mapping a large range as shared, and some of the previous entries were shared, deleting them would incorrectly "convert" those entries to private. Tracking the previous state likely isn't the best approach, e.g. it would require speculatively allocating extra memory for a rare condition that is likely going to lead to OOM anyways. Instead of trying to unwind, what about updating the ioctl() params such that retrying with the updated addr+size would Just Work? E.g. diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 55b07aae67cc..f1de592a1a06 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, kvm_unmap_mem_range(kvm, start, end, attr); - goto ret; -err: - for (; i > start; i--) - xa_erase(&kvm->mem_attr_array, i); -ret: kvm_mmu_invalidate_end(kvm, start, end); KVM_MMU_UNLOCK(kvm); srcu_read_unlock(&kvm->srcu, idx); + <update gpa and size> + return r; } #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ @@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, region.size, set); + if (copy_to_user(argp, ®ion, sizeof(region)) && !r) + r = -EFAULT break; } #endif
On Wed, Nov 16, 2022 at 10:24:11PM +0000, Sean Christopherson wrote: > On Tue, Oct 25, 2022, Chao Peng wrote: > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > > + bool is_private) > > +{ > > + gfn_t start, end; > > + unsigned long i; > > + void *entry; > > + int idx; > > + int r = 0; > > + > > + if (size == 0 || gpa + size < gpa) > > + return -EINVAL; > > + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) > > + return -EINVAL; > > + > > + start = gpa >> PAGE_SHIFT; > > + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; > > + > > + /* > > + * Guest memory defaults to private, kvm->mem_attr_array only stores > > + * shared memory. > > + */ > > + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); > > + > > + idx = srcu_read_lock(&kvm->srcu); > > + KVM_MMU_LOCK(kvm); > > + kvm_mmu_invalidate_begin(kvm, start, end); > > + > > + for (i = start; i < end; i++) { > > + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, > > + GFP_KERNEL_ACCOUNT)); > > + if (r) > > + goto err; > > + } > > + > > + kvm_unmap_mem_range(kvm, start, end); > > + > > + goto ret; > > +err: > > + for (; i > start; i--) > > + xa_erase(&kvm->mem_attr_array, i); > > I don't think deleting previous entries is correct. To unwind, the correct thing > to do is restore the original values. E.g. if userspace space is mapping a large > range as shared, and some of the previous entries were shared, deleting them would > incorrectly "convert" those entries to private. Ah, right! > > Tracking the previous state likely isn't the best approach, e.g. it would require > speculatively allocating extra memory for a rare condition that is likely going to > lead to OOM anyways. Agree. > > Instead of trying to unwind, what about updating the ioctl() params such that > retrying with the updated addr+size would Just Work? E.g. Looks good to me. Thanks! Chao > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 55b07aae67cc..f1de592a1a06 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, > > kvm_unmap_mem_range(kvm, start, end, attr); > > - goto ret; > -err: > - for (; i > start; i--) > - xa_erase(&kvm->mem_attr_array, i); > -ret: > kvm_mmu_invalidate_end(kvm, start, end); > KVM_MMU_UNLOCK(kvm); > srcu_read_unlock(&kvm->srcu, idx); > > + <update gpa and size> > + > return r; > } > #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ > @@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp, > > r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, > region.size, set); > + if (copy_to_user(argp, ®ion, sizeof(region)) && !r) > + r = -EFAULT > break; > } > #endif
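[Editorial note] Assuming the ioctl is reworked as suggested, so that on failure it rewrites region.addr/region.size to cover only the unprocessed remainder, userspace can resubmit the very same struct. A hypothetical sketch; retrying only on ENOMEM is an assumption, not something the thread pins down.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical caller of the reworked ioctl: on error, 'region' already
 * describes the pages KVM did not manage to convert, so it can be passed
 * back in unchanged until the call succeeds or the range shrinks to zero. */
static int convert_range_to_private(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_enc_region region = { .addr = gpa, .size = size };
	int r;

	do {
		r = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
	} while (r < 0 && errno == ENOMEM && region.size);

	return r;
}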
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 975688912b8c..08253cf498d1 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst. This ioctl can be used to register a guest memory region which may contain encrypted data (e.g. guest RAM, SMRAM etc). -It is used in the SEV-enabled guest. When encryption is enabled, a guest -memory region may contain encrypted data. The SEV memory encryption -engine uses a tweak such that two identical plaintext pages, each at -different locations will have differing ciphertexts. So swapping or +Currently this ioctl supports registering memory regions for two usages: +private memory and SEV-encrypted memory. + +When private memory is enabled, this ioctl is used to register guest private +memory region and the addr/size of kvm_enc_region represents guest physical +address (GPA). In this usage, this ioctl zaps the existing guest memory +mappings in KVM that fallen into the region. + +When SEV-encrypted memory is enabled, this ioctl is used to register guest +memory region which may contain encrypted data for a SEV-enabled guest. The +addr/size of kvm_enc_region represents userspace address (HVA). The SEV +memory encryption engine uses a tweak such that two identical plaintext pages, +each at different locations will have differing ciphertexts. So swapping or moving ciphertext of those pages will not result in plaintext being swapped. So relocating (or migrating) physical backing pages for the SEV guest will require some additional steps. diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 8d2bd455c0cd..73fdfa429b20 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -51,6 +51,7 @@ config KVM select HAVE_KVM_PM_NOTIFIER if PM select HAVE_KVM_RESTRICTED_MEM if X86_64 select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM + select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 79e5cbc35fcf..4ce98fa0153c 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #endif -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER + +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM) struct kvm_gfn_range { struct kvm_memory_slot *slot; gfn_t start; @@ -254,6 +255,9 @@ struct kvm_gfn_range { bool may_block; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); +#endif + +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); @@ -794,6 +798,9 @@ struct kvm { struct notifier_block pm_notifier; #endif char stats_id[KVM_STATS_NAME_SIZE]; +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM + struct xarray mem_attr_array; +#endif }; #define kvm_err(fmt, ...) 
\ @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); +bool kvm_arch_has_private_mem(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 9ff164c7e0cc..69ca59e82149 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER config HAVE_KVM_RESTRICTED_MEM bool + +config KVM_GENERIC_PRIVATE_MEM + bool + depends on HAVE_KVM_RESTRICTED_MEM diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 09c9cdeb773c..fc3835826ace 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm) } EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, + gfn_t end) +{ + if (likely(kvm->mmu_invalidate_in_progress == 1)) { + kvm->mmu_invalidate_range_start = start; + kvm->mmu_invalidate_range_end = end; + } else { + /* + * Fully tracking multiple concurrent ranges has diminishing + * returns. Keep things simple and just find the minimal range + * which includes the current and new ranges. As there won't be + * enough information to subtract a range after its invalidate + * completes, any ranges invalidated concurrently will + * accumulate and persist until all outstanding invalidates + * complete. + */ + kvm->mmu_invalidate_range_start = + min(kvm->mmu_invalidate_range_start, start); + kvm->mmu_invalidate_range_end = + max(kvm->mmu_invalidate_range_end, end); + } +} + +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) +{ + /* + * The count increase must become visible at unlock time as no + * spte can be established without taking the mmu_lock and + * count is also read inside the mmu_lock critical section. + */ + kvm->mmu_invalidate_in_progress++; +} + +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) +{ + mark_invalidate_in_progress(kvm, start, end); + update_invalidate_range(kvm, start, end); +} + +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) +{ + /* + * This sequence increase will notify the kvm page fault that + * the page that is going to be mapped in the spte could have + * been freed. + */ + kvm->mmu_invalidate_seq++; + smp_wmb(); + /* + * The above sequence increase must be visible before the + * below count decrease, which is ensured by the smp_wmb above + * in conjunction with the smp_rmb in mmu_invalidate_retry(). + */ + kvm->mmu_invalidate_in_progress--; +} + #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) { @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); } -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, - gfn_t end) -{ - if (likely(kvm->mmu_invalidate_in_progress == 1)) { - kvm->mmu_invalidate_range_start = start; - kvm->mmu_invalidate_range_end = end; - } else { - /* - * Fully tracking multiple concurrent ranges has diminishing - * returns. Keep things simple and just find the minimal range - * which includes the current and new ranges. 
As there won't be - * enough information to subtract a range after its invalidate - * completes, any ranges invalidated concurrently will - * accumulate and persist until all outstanding invalidates - * complete. - */ - kvm->mmu_invalidate_range_start = - min(kvm->mmu_invalidate_range_start, start); - kvm->mmu_invalidate_range_end = - max(kvm->mmu_invalidate_range_end, end); - } -} - -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end) -{ - /* - * The count increase must become visible at unlock time as no - * spte can be established without taking the mmu_lock and - * count is also read inside the mmu_lock critical section. - */ - kvm->mmu_invalidate_in_progress++; -} - static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) { update_invalidate_range(kvm, range->start, range->end); return kvm_unmap_gfn_range(kvm, range); } -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end) -{ - mark_invalidate_in_progress(kvm, start, end); - update_invalidate_range(kvm, start, end); -} - static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, const struct mmu_notifier_range *range) { @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, return 0; } -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end) -{ - /* - * This sequence increase will notify the kvm page fault that - * the page that is going to be mapped in the spte could have - * been freed. - */ - kvm->mmu_invalidate_seq++; - smp_wmb(); - /* - * The above sequence increase must be visible before the - * below count decrease, which is ensured by the smp_wmb above - * in conjunction with the smp_rmb in mmu_invalidate_retry(). - */ - kvm->mmu_invalidate_in_progress--; -} - static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, const struct mmu_notifier_range *range) { @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM + +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end) +{ + struct kvm_gfn_range gfn_range; + struct kvm_memory_slot *slot; + struct kvm_memslots *slots; + struct kvm_memslot_iter iter; + int i; + int r = 0; + + gfn_range.pte = __pte(0); + gfn_range.may_block = true; + + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { + slots = __kvm_memslots(kvm, i); + + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) { + slot = iter.slot; + gfn_range.start = max(start, slot->base_gfn); + gfn_range.end = min(end, slot->base_gfn + slot->npages); + if (gfn_range.start >= gfn_range.end) + continue; + gfn_range.slot = slot; + + r |= kvm_unmap_gfn_range(kvm, &gfn_range); + } + } + + if (r) + kvm_flush_remote_tlbs(kvm); +} + +#define KVM_MEM_ATTR_SHARED 0x0001 +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, + bool is_private) +{ + gfn_t start, end; + unsigned long i; + void *entry; + int idx; + int r = 0; + + if (size == 0 || gpa + size < gpa) + return -EINVAL; + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1)) + return -EINVAL; + + start = gpa >> PAGE_SHIFT; + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT; + + /* + * Guest memory defaults to private, kvm->mem_attr_array only stores + * shared memory. + */ + entry = is_private ? 
NULL : xa_mk_value(KVM_MEM_ATTR_SHARED); + + idx = srcu_read_lock(&kvm->srcu); + KVM_MMU_LOCK(kvm); + kvm_mmu_invalidate_begin(kvm, start, end); + + for (i = start; i < end; i++) { + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, + GFP_KERNEL_ACCOUNT)); + if (r) + goto err; + } + + kvm_unmap_mem_range(kvm, start, end); + + goto ret; +err: + for (; i > start; i--) + xa_erase(&kvm->mem_attr_array, i); +ret: + kvm_mmu_invalidate_end(kvm, start, end); + KVM_MMU_UNLOCK(kvm); + srcu_read_unlock(&kvm->srcu, idx); + + return r; +} +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */ + #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) spin_lock_init(&kvm->mn_invalidate_lock); rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM + xa_init(&kvm->mem_attr_array); +#endif INIT_LIST_HEAD(&kvm->gpc_list); spin_lock_init(&kvm->gpc_lock); @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm) kvm_free_memslots(kvm, &kvm->__memslots[i][0]); kvm_free_memslots(kvm, &kvm->__memslots[i][1]); } +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM + xa_destroy(&kvm->mem_attr_array); +#endif cleanup_srcu_struct(&kvm->irq_srcu); cleanup_srcu_struct(&kvm->srcu); kvm_arch_free_vm(kvm); @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm, } } +bool __weak kvm_arch_has_private_mem(struct kvm *kvm) +{ + return false; +} + static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM + case KVM_MEMORY_ENCRYPT_REG_REGION: + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { + struct kvm_enc_region region; + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; + + if (!kvm_arch_has_private_mem(kvm)) + goto arch_vm_ioctl; + + r = -EFAULT; + if (copy_from_user(®ion, argp, sizeof(region))) + goto out; + + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr, + region.size, set); + break; + } +#endif case KVM_GET_DIRTY_LOG: { struct kvm_dirty_log log; @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_get_stats_fd(kvm); break; default: +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM +arch_vm_ioctl: +#endif r = kvm_arch_vm_ioctl(filp, ioctl, arg); } out:
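[Editorial note] From userspace's point of view, on a VM where kvm_arch_has_private_mem() returns true, KVM_MEMORY_ENCRYPT_REG_REGION marks a page-aligned GPA range private and KVM_MEMORY_ENCRYPT_UNREG_REGION converts it back to shared, zapping existing mappings in the range. A minimal sketch of such a caller; the helper name and the surrounding vm_fd handling are assumptions.

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Minimal sketch: addr/size describe a guest-physical range here, unlike the
 * SEV usage where they are a userspace (HVA) range. Unaligned or empty
 * ranges are rejected by KVM with EINVAL.
 */
static int kvm_set_range_private(int vm_fd, __u64 gpa, __u64 size, bool private)
{
	struct kvm_enc_region region = {
		.addr = gpa,	/* GPA, per the updated documentation above */
		.size = size,
	};

	return ioctl(vm_fd, private ? KVM_MEMORY_ENCRYPT_REG_REGION
				    : KVM_MEMORY_ENCRYPT_UNREG_REGION,
		     &region);
}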