Message ID | 20221108084416.11447-1-yan.y.zhao@intel.com |
---|---|
State | New |
Headers |
Series | KVM: move memslot invalidation later than possible failures |
Commit Message
Yan Zhao
Nov. 8, 2022, 8:44 a.m. UTC
For memslot delete and move, kvm_invalidate_memslot() is required before
the real changes are committed.

Besides swapping to an inactive slot, kvm_invalidate_memslot() also calls
kvm_arch_flush_shadow_memslot(), and further kvm_page_track_flush_slot()
on x86.

According to the definition of kvm_page_track_notifier_node, users may
drop write-protection for the pages in the memory slot on receiving
.track_flush_slot.

However, if kvm_prepare_memory_region() subsequently fails, the following
kvm_activate_memslot() only swaps the original slot back in, leaving the
previously dropped write protection unrestored.

This may not be a problem for KVM itself as a page track user, but it can
cause problems for other page track users, e.g. kvmgt, whose
write-protected pages are removed from its write-protected list and never
added back.

So call kvm_prepare_memory_region() first, to prepare the metadata before
the slot invalidation, and thereby avoid the failure-and-revert path.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
virt/kvm/kvm_main.c | 40 +++++++++++++++-------------------------
1 file changed, 15 insertions(+), 25 deletions(-)
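For orientation, the ordering being changed can be summarized by the following condensed sketch of the pre-patch kvm_set_memslot() flow; it is a paraphrase for illustration, not the verbatim kernel source (the real diff is at the bottom of the page).

/* Condensed, paraphrased sketch of the pre-patch kvm_set_memslot() flow. */
static int kvm_set_memslot_sketch(struct kvm *kvm,
                                  struct kvm_memory_slot *old,
                                  struct kvm_memory_slot *new,
                                  enum kvm_mr_change change)
{
        struct kvm_memory_slot *invalid_slot = NULL;
        int r;

        mutex_lock(&kvm->slots_arch_lock);

        if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
                invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
                if (!invalid_slot) {
                        mutex_unlock(&kvm->slots_arch_lock);
                        return -ENOMEM;
                }
                /*
                 * Flushes shadow pages and, on x86, ends up in
                 * kvm_page_track_flush_slot(), so external page-track users
                 * (e.g. kvmgt) drop their write protection at this point.
                 */
                kvm_invalidate_memslot(kvm, old, invalid_slot);
        }

        r = kvm_prepare_memory_region(kvm, old, new, change);
        if (r) {
                /*
                 * Only the original slot is swapped back in; nothing
                 * re-establishes the write protection dropped above, which
                 * is the problem described in the changelog.
                 */
                if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
                        kvm_activate_memslot(kvm, invalid_slot, old);
                        kfree(invalid_slot);
                } else {
                        mutex_unlock(&kvm->slots_arch_lock);
                }
                return r;
        }

        /* ... commit the change and activate the new slot ... */
        return 0;
}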
Comments
On Tue, Nov 08, 2022, Yan Zhao wrote:
> For memslot delete and move, kvm_invalidate_memslot() is required before
> the real changes committed.
> Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> arch x86.
> And according to the definition in kvm_page_track_notifier_node, users can
> drop write-protection for the pages in the memory slot on receiving
> .track_flush_slot.

Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is
forced to grab its own references to KVM and also needs to manually acquire/release
mmu_lock in some flows but not others.

Anyways, this is a flaw in the page track API that should be fixed. Flushing a
slot should not be overloaded to imply "this slot is gone", it should be a flush
command, no more, no less.

AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
of adding another hook that's invoked when the memory region change is committed.

That would allow KVMGT to fix another bug. If a memory region is moved and the
new region partially overlaps the old region, KVMGT technically probably wants to
retain its write-protection scheme. Though that's probably not worth supporting,
might be better to say "don't move memory regions if KVMGT is enabled", because
AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
broken for years and no one noticed).

Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
region is moved to, maybe we can get away with rejecting MOVE if an external
page tracker cares about the slot in question.

> However, if kvm_prepare_memory_region() fails, the later
> kvm_activate_memslot() will only swap back the original slot, leaving
> previous write protection not recovered.
>
> This may not be a problem for kvm itself as a page tracker user, but may
> cause problem to other page tracker users, e.g. kvmgt, whose
> write-protected pages are removed from the write-protected list and not
> added back.
>
> So call kvm_prepare_memory_region first for meta data preparation before
> the slot invalidation so as to avoid failure and recovery.

IIUC, this is purely a theoretical bug and practically speaking can't be a problem
in practice, at least not without completely breaking KVMGT.

On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
protected VM case is the only scenario where DELETE can fail on any architecture).
The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
kvm_arch_prepare_memory_region().

And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
the old memslot was not, and so KVM needs to allocate new metadata due to the new
memslot needed larger arrays.

As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
that supports KVMGT and moves relevant memory regions, let alove does something
that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.

Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
the existing dirty bitmap but doesn't shift the dirty bits. This is likely
another combination that KVM can simply reject.

Assuming this is indeed purely theoretical, that should be called out in the
changelog. Or as above, simply disallow MOVE in this case, which probably has
my vote.

> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/kvm_main.c | 40 +++++++++++++++-------------------------
>  1 file changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 25d7872b29c1..5f29011f432d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1827,45 +1827,35 @@ static int kvm_set_memslot(struct kvm *kvm,
>  	 */
>  	mutex_lock(&kvm->slots_arch_lock);
>
> -	/*
> -	 * Invalidate the old slot if it's being deleted or moved. This is
> -	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> -	 * continue running by ensuring there are no mappings or shadow pages
> -	 * for the memslot when it is deleted/moved. Without pre-invalidation
> -	 * (and without a lock), a window would exist between effecting the
> -	 * delete/move and committing the changes in arch code where KVM or a
> -	 * guest could access a non-existent memslot.
> -	 *
> -	 * Modifications are done on a temporary, unreachable slot. The old
> -	 * slot needs to be preserved in case a later step fails and the
> -	 * invalidation needs to be reverted.
> -	 */
>  	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>  		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
>  		if (!invalid_slot) {
>  			mutex_unlock(&kvm->slots_arch_lock);
>  			return -ENOMEM;
>  		}
> -		kvm_invalidate_memslot(kvm, old, invalid_slot);
>  	}
>
>  	r = kvm_prepare_memory_region(kvm, old, new, change);
>  	if (r) {
> -		/*
> -		 * For DELETE/MOVE, revert the above INVALID change. No
> -		 * modifications required since the original slot was preserved
> -		 * in the inactive slots. Changing the active memslots also
> -		 * release slots_arch_lock.
> -		 */
> -		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> -			kvm_activate_memslot(kvm, invalid_slot, old);
> +		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
>  			kfree(invalid_slot);
> -		} else {
> -			mutex_unlock(&kvm->slots_arch_lock);
> -		}
> +
> +		mutex_unlock(&kvm->slots_arch_lock);
>  		return r;
>  	}
>
> +	/*
> +	 * Invalidate the old slot if it's being deleted or moved. This is
> +	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> +	 * continue running by ensuring there are no mappings or shadow pages
> +	 * for the memslot when it is deleted/moved. Without pre-invalidation
> +	 * (and without a lock), a window would exist between effecting the
> +	 * delete/move and committing the changes in arch code where KVM or a
> +	 * guest could access a non-existent memslot.
> +	 */
> +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> +		kvm_invalidate_memslot(kvm, old, invalid_slot);

I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
correct. It might be, but there are so many subtleties in this code that I don't
want to change this ordering unless absolutely necessary, or at least not without
an in-depth analysis and a pile of tests. E.g. it's possible one or more
architectures relies on unmapping, flushing, and invalidating the old region
prior to preparing the new region, e.g. to reuse arch memslot data.

> +
>  	/*
>  	 * For DELETE and MOVE, the working slot is now active as the INVALID
>  	 * version of the old slot. MOVE is particularly special as it reuses
> --
> 2.17.1
>
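Sean's suggestion above concerns when a new page-track hook would fire rather than its exact shape. Purely as an illustration, such a hook could look roughly like the sketch below; the name .track_remove_region and its signature are hypothetical, not an existing KVM API.

/* Hypothetical sketch only: a page-track hook fired once a DELETE/MOVE is
 * actually committed, i.e. once the change can no longer fail. */
struct kvm_page_track_notifier_node_sketch {
        /* Existing hook: should mean "flush", not "this slot is gone". */
        void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
                                 struct kvm_page_track_notifier_node *node);
        /* Hypothetical addition: external trackers (e.g. kvmgt) could drop
         * their bookkeeping here without risking a failed/reverted update. */
        void (*track_remove_region)(gfn_t gfn, unsigned long nr_pages,
                                    struct kvm_page_track_notifier_node *node);
};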
On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> On Tue, Nov 08, 2022, Yan Zhao wrote:
> > For memslot delete and move, kvm_invalidate_memslot() is required before
> > the real changes committed.
> > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > arch x86.
> > And according to the definition in kvm_page_track_notifier_node, users can
> > drop write-protection for the pages in the memory slot on receiving
> > .track_flush_slot.
>
> Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is
> forced to grab its own references to KVM and also needs to manually acquire/release
> mmu_lock in some flows but not others.
>
> Anyways, this is a flaw in the page track API that should be fixed. Flushing a
> slot should not be overloaded to imply "this slot is gone", it should be a flush
> command, no more, no less.
hmm, but KVM also registers to the page track
"node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
write protection. But KVM is ok because the dropped write-protection can be
rebuilt during rebuilding the shadow page table.
So, for .track_flush_slot, it's expected that "users can drop write-protection
for the pages in the memory slot", right?

> AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> of adding another hook that's invoked when the memory region change is committed.
>
Do you mean adding a new hook in page track, e.g. .track_slot_change?
Then right before committing slot changes, call this interface to notify slot
DELETE/MOVE?

Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
support below usecase:
1. KVM pre-populated a memslot
2. memslot MOVE happened
3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
So also add a new interface for the MOVE failure?

> That would allow KVMGT to fix another bug. If a memory region is moved and the
> new region partially overlaps the old region, KVMGT technically probably wants to
> retain its write-protection scheme. Though that's probably not worth supporting,
> might be better to say "don't move memory regions if KVMGT is enabled", because
> AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> broken for years and no one noticed).
>
So, could we disable MOVE in KVM at all?

> Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> region is moved to, maybe we can get away with rejecting MOVE if an external
> page tracker cares about the slot in question.
>
> > However, if kvm_prepare_memory_region() fails, the later
> > kvm_activate_memslot() will only swap back the original slot, leaving
> > previous write protection not recovered.
> >
> > This may not be a problem for kvm itself as a page tracker user, but may
> > cause problem to other page tracker users, e.g. kvmgt, whose
> > write-protected pages are removed from the write-protected list and not
> > added back.
> >
> > So call kvm_prepare_memory_region first for meta data preparation before
> > the slot invalidation so as to avoid failure and recovery.
>
> IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> in practice, at least not without completely breaking KVMGT.
>
> On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> protected VM case is the only scenario where DELETE can fail on any architecture).
> The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> kvm_arch_prepare_memory_region().
But as long as with current code sequence, we can't relying on that
kvm_prepare_memory_region() will never fail for DELETE.
Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
possible broken.

> And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> the old memslot was not, and so KVM needs to allocate new metadata due to the new
> memslot needed larger arrays.
kvm_prepare_memory_region() can also fail for MOVE during live migration when
memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
kvm_alloc_memslot_metadata() fails due to -ENOMEM.
BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
and so KVM needs to allocate new metadata due to the new memslot needed
larger arrays".

> As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> that supports KVMGT and moves relevant memory regions, let alove does something
> that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
>
> Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> the existing dirty bitmap but doesn't shift the dirty bits. This is likely
Do you mean, for the new slot in MOVE, the new dirty bitmap should be
cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
are not changed and dirty_bitmap is indexed using rel_gfn
(rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
size to BITS_PER_LONG.

> another combination that KVM can simply reject.
>
> Assuming this is indeed purely theoretical, that should be called out in the
> changelog. Or as above, simply disallow MOVE in this case, which probably has
> my vote.
>
Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
in upstream QEMU.

<...>

> I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> correct. It might be, but there are so many subtleties in this code that I don't
> want to change this ordering unless absolutely necessary, or at least not without
> an in-depth analysis and a pile of tests. E.g. it's possible one or more
> architectures relies on unmapping, flushing, and invalidating the old region
> prior to preparing the new region, e.g. to reuse arch memslot data.
yes. what about just moving kvm_arch_flush_shadow_memslot() and
kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25d7872b29c1..f9d93be2ead2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1716,6 +1716,19 @@ static void kvm_copy_memslot(struct kvm_memory_slot *dest,
 }
 
 static void kvm_invalidate_memslot(struct kvm *kvm,
+				   struct kvm_memory_slot *old)
+{
+	/*
+	 * From this point no new shadow pages pointing to a deleted, or moved,
+	 * memslot will be created. Validation of sp->gfn happens in:
+	 *  - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
+	 *  - kvm_is_visible_gfn (mmu_check_root)
+	 */
+	kvm_arch_flush_shadow_memslot(kvm, old);
+	kvm_arch_guest_memory_reclaimed(kvm);
+}
+
+static void kvm_deactive_memslot(struct kvm *kvm,
 				 struct kvm_memory_slot *old,
 				 struct kvm_memory_slot *invalid_slot)
 {
@@ -1735,15 +1748,6 @@ static void kvm_invalidate_memslot(struct kvm *kvm,
 	 */
 	kvm_swap_active_memslots(kvm, old->as_id);
 
-	/*
-	 * From this point no new shadow pages pointing to a deleted, or moved,
-	 * memslot will be created. Validation of sp->gfn happens in:
-	 *  - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
-	 *  - kvm_is_visible_gfn (mmu_check_root)
-	 */
-	kvm_arch_flush_shadow_memslot(kvm, old);
-	kvm_arch_guest_memory_reclaimed(kvm);
-
 	/* Was released by kvm_swap_active_memslots, reacquire. */
 	mutex_lock(&kvm->slots_arch_lock);
 
@@ -1846,7 +1850,7 @@ static int kvm_set_memslot(struct kvm *kvm,
 			mutex_unlock(&kvm->slots_arch_lock);
 			return -ENOMEM;
 		}
-		kvm_invalidate_memslot(kvm, old, invalid_slot);
+		kvm_deactive_memslot(kvm, old, invalid_slot);
 	}
 
 	r = kvm_prepare_memory_region(kvm, old, new, change);
@@ -1866,6 +1870,9 @@ static int kvm_set_memslot(struct kvm *kvm,
 		return r;
 	}
 
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+		kvm_invalidate_memslot(kvm, old);
+
 	/*
 	 * For DELETE and MOVE, the working slot is now active as the INVALID
 	 * version of the old slot. MOVE is particularly special as it reuses

Thanks
Yan
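As context for the dirty-bitmap point above: the bitmap is indexed with a slot-relative gfn, roughly as in this paraphrased sketch of the mark-dirty path (not a verbatim copy of the kernel code), which is why moving a slot's base_gfn would not by itself require shifting bits.

/* Paraphrased sketch of how a dirty bit is recorded for a guest page. */
static void mark_page_dirty_in_slot_sketch(struct kvm_memory_slot *memslot,
                                           gfn_t gfn)
{
        if (memslot && memslot->dirty_bitmap) {
                /* The bit index is relative to the slot's base gfn... */
                unsigned long rel_gfn = gfn - memslot->base_gfn;

                /*
                 * ...so a MOVE that only changes base_gfn leaves each guest
                 * page mapped to the same bit, and no shifting is needed.
                 */
                set_bit_le(rel_gfn, memslot->dirty_bitmap);
        }
}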
On Wed, Nov 09, 2022, Yan Zhao wrote:
> On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > the real changes committed.
> > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > arch x86.
> > > And according to the definition in kvm_page_track_notifier_node, users can
> > > drop write-protection for the pages in the memory slot on receiving
> > > .track_flush_slot.
> >
> > Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is
> > forced to grab its own references to KVM and also needs to manually acquire/release
> > mmu_lock in some flows but not others.
> >
> > Anyways, this is a flaw in the page track API that should be fixed. Flushing a
> > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > command, no more, no less.
> hmm, but KVM also registers to the page track
> "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> write protection. But KVM is ok because the dropped write-protection can be
> rebuilt during rebuilding the shadow page table.
> So, for .track_flush_slot, it's expected that "users can drop write-protection
> for the pages in the memory slot", right?

No. KVM isn't actually dropping write-protection, because for the internal KVM
case, KVM obliterates all of its page tables.

> > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > of adding another hook that's invoked when the memory region change is committed.
> >
> Do you mean adding a new hook in page track, e.g. .track_slot_change?
> Then right before committing slot changes, call this interface to notify slot
> DELETE/MOVE?
>
> Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> support below usecase:
> 1. KVM pre-populated a memslot
> 2. memslot MOVE happened
> 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> So also add a new interface for the MOVE failure?
>
> > That would allow KVMGT to fix another bug. If a memory region is moved and the
> > new region partially overlaps the old region, KVMGT technically probably wants to
> > retain its write-protection scheme. Though that's probably not worth supporting,
> > might be better to say "don't move memory regions if KVMGT is enabled", because
> > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > broken for years and no one noticed).
> >
> So, could we disable MOVE in KVM at all?

Ideally, yes. Practically? Dunno. It's difficult to know whether or not there
are users out there.

> > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > region is moved to, maybe we can get away with rejecting MOVE if an external
> > page tracker cares about the slot in question.
> >
> > > However, if kvm_prepare_memory_region() fails, the later
> > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > previous write protection not recovered.
> > >
> > > This may not be a problem for kvm itself as a page tracker user, but may
> > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > write-protected pages are removed from the write-protected list and not
> > > added back.
> > >
> > > So call kvm_prepare_memory_region first for meta data preparation before
> > > the slot invalidation so as to avoid failure and recovery.
> >
> > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > in practice, at least not without completely breaking KVMGT.
> >
> > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > protected VM case is the only scenario where DELETE can fail on any architecture).
> > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > kvm_arch_prepare_memory_region().
> But as long as with current code sequence, we can't relying on that
> kvm_prepare_memory_region() will never fail for DELETE.
> Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> possible broken.

Agreed, I just don't want to touch the common memslots code unless it's necessary.

> > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > memslot needed larger arrays.
> kvm_prepare_memory_region() can also fail for MOVE during live migration when
> memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> kvm_alloc_memslot_metadata() fails due to -ENOMEM.
> BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> and so KVM needs to allocate new metadata due to the new memslot needed
> larger arrays".
> >
> > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > that supports KVMGT and moves relevant memory regions, let alove does something
> > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> >
> > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > the existing dirty bitmap but doesn't shift the dirty bits. This is likely
> Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> are not changed and dirty_bitmap is indexed using rel_gfn
> (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> size to BITS_PER_LONG.

Oh, you're right. I forgot that userspace would also be doing a gfn-relative
calculation in the ridiculously theoretically scenario that a memslot is moved
while it's being dirty-logged.

> > another combination that KVM can simply reject.
> >
> > Assuming this is indeed purely theoretical, that should be called out in the
> > changelog. Or as above, simply disallow MOVE in this case, which probably has
> > my vote.
> >
> Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> in upstream QEMU.
>
> <...>
>
> > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > correct. It might be, but there are so many subtleties in this code that I don't
> > want to change this ordering unless absolutely necessary, or at least not without
> > an in-depth analysis and a pile of tests. E.g. it's possible one or more
> > architectures relies on unmapping, flushing, and invalidating the old region
> > prior to preparing the new region, e.g. to reuse arch memslot data.
> yes. what about just moving kvm_arch_flush_shadow_memslot() and
> kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?

I'm not dead set against tweaking the memslots flow, but I don't want to do so to
fix this extremely theoretical bug. In other words, I want to fix this by giving
KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
a poor KVM x86 API.

And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
after cleaning up another pile of ugly: KVM always registers a page-track
notifier because it relies on the track_flush_slot() call to invoke
kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
I'll send patches for this; if/when both land, track_flush_slot() can be deleted
on top.
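The unconditional registration Sean mentions is roughly the following; this is a paraphrased sketch of the x86 MMU init path of that era, with field and function names as recalled, not a verbatim quote.

/* Paraphrased sketch: x86 KVM wiring its own zap handler into the
 * page-track notifier, registered for every VM. */
void kvm_mmu_init_vm_sketch(struct kvm *kvm)
{
        struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;

        node->track_write = kvm_mmu_pte_write;
        /*
         * Memslot removal is funneled through the flush hook, so the
         * notifier must be registered even when nothing is write-tracked --
         * the coupling referred to above.
         */
        node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
        kvm_page_track_register_notifier(kvm, node);
}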
On Thu, Nov 10, 2022 at 12:33:25AM +0000, Sean Christopherson wrote:
> On Wed, Nov 09, 2022, Yan Zhao wrote:
> > On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > > the real changes committed.
> > > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > > arch x86.
> > > > And according to the definition in kvm_page_track_notifier_node, users can
> > > > drop write-protection for the pages in the memory slot on receiving
> > > > .track_flush_slot.
> > >
> > > Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is
> > > forced to grab its own references to KVM and also needs to manually acquire/release
> > > mmu_lock in some flows but not others.
> > >
> > > Anyways, this is a flaw in the page track API that should be fixed. Flushing a
> > > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > > command, no more, no less.
> > hmm, but KVM also registers to the page track
> > "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> > And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> > page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> > write protection. But KVM is ok because the dropped write-protection can be
> > rebuilt during rebuilding the shadow page table.
> > So, for .track_flush_slot, it's expected that "users can drop write-protection
> > for the pages in the memory slot", right?
>
> No. KVM isn't actually dropping write-protection, because for the internal KVM
> case, KVM obliterates all of its page tables.
>
> > > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > > of adding another hook that's invoked when the memory region change is committed.
> > >
> > Do you mean adding a new hook in page track, e.g. .track_slot_change?
> > Then right before committing slot changes, call this interface to notify slot
> > DELETE/MOVE?
> >
> > Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> > support below usecase:
> > 1. KVM pre-populated a memslot
> > 2. memslot MOVE happened
> > 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> > So also add a new interface for the MOVE failure?
> >
> > > That would allow KVMGT to fix another bug. If a memory region is moved and the
> > > new region partially overlaps the old region, KVMGT technically probably wants to
> > > retain its write-protection scheme. Though that's probably not worth supporting,
> > > might be better to say "don't move memory regions if KVMGT is enabled", because
> > > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > > broken for years and no one noticed).
> > >
> > So, could we disable MOVE in KVM at all?
>
> Ideally, yes. Practically? Dunno. It's difficult to know whether or not there
> are users out there.
>
> > > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > > region is moved to, maybe we can get away with rejecting MOVE if an external
> > > page tracker cares about the slot in question.
> > >
> > > > However, if kvm_prepare_memory_region() fails, the later
> > > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > > previous write protection not recovered.
> > > >
> > > > This may not be a problem for kvm itself as a page tracker user, but may
> > > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > > write-protected pages are removed from the write-protected list and not
> > > > added back.
> > > >
> > > > So call kvm_prepare_memory_region first for meta data preparation before
> > > > the slot invalidation so as to avoid failure and recovery.
> > >
> > > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > > in practice, at least not without completely breaking KVMGT.
> > >
> > > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > > protected VM case is the only scenario where DELETE can fail on any architecture).
> > > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > > kvm_arch_prepare_memory_region().
> > But as long as with current code sequence, we can't relying on that
> > kvm_prepare_memory_region() will never fail for DELETE.
> > Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> > possible broken.
>
> Agreed, I just don't want to touch the common memslots code unless it's necessary.

Ok. Let me prepare a patch for it.

>
> > > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > > memslot needed larger arrays.
> > kvm_prepare_memory_region() can also fail for MOVE during live migration when
> > memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> > and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> > kvm_alloc_memslot_metadata() fails due to -ENOMEM.
> > BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> > and so KVM needs to allocate new metadata due to the new memslot needed
> > larger arrays".
> > >
> > > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > > that supports KVMGT and moves relevant memory regions, let alove does something
> > > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> > >
> > > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > > the existing dirty bitmap but doesn't shift the dirty bits. This is likely
> > Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> > cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> > are not changed and dirty_bitmap is indexed using rel_gfn
> > (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> > size to BITS_PER_LONG.
>
> Oh, you're right. I forgot that userspace would also be doing a gfn-relative
> calculation in the ridiculously theoretically scenario that a memslot is moved
> while it's being dirty-logged.
>
> > > another combination that KVM can simply reject.
> > >
> > > Assuming this is indeed purely theoretical, that should be called out in the
> > > changelog. Or as above, simply disallow MOVE in this case, which probably has
> > > my vote.
> > >
> > Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> > in upstream QEMU.
> >
> > <...>
> >
> > > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > > correct. It might be, but there are so many subtleties in this code that I don't
> > > want to change this ordering unless absolutely necessary, or at least not without
> > > an in-depth analysis and a pile of tests. E.g. it's possible one or more
> > > architectures relies on unmapping, flushing, and invalidating the old region
> > > prior to preparing the new region, e.g. to reuse arch memslot data.
> > yes. what about just moving kvm_arch_flush_shadow_memslot() and
> > kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?
>
> I'm not dead set against tweaking the memslots flow, but I don't want to do so to
> fix this extremely theoretical bug. In other words, I want to fix this by giving
> KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
> a poor KVM x86 API.
>
> And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
> after cleaning up clean up another pile of ugly: KVM always registers a page-track
> notifier because it relies on the track_flush_slot() call to invoke
> kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
> I'll send patches for this; if/when both land, track_flush_slot() can be deleted
> on top.

Ok. thanks. I'll watch and decide what to do based on your changes.

Thanks
Yan
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25d7872b29c1..5f29011f432d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1827,45 +1827,35 @@ static int kvm_set_memslot(struct kvm *kvm,
 	 */
 	mutex_lock(&kvm->slots_arch_lock);
 
-	/*
-	 * Invalidate the old slot if it's being deleted or moved. This is
-	 * done prior to actually deleting/moving the memslot to allow vCPUs to
-	 * continue running by ensuring there are no mappings or shadow pages
-	 * for the memslot when it is deleted/moved. Without pre-invalidation
-	 * (and without a lock), a window would exist between effecting the
-	 * delete/move and committing the changes in arch code where KVM or a
-	 * guest could access a non-existent memslot.
-	 *
-	 * Modifications are done on a temporary, unreachable slot. The old
-	 * slot needs to be preserved in case a later step fails and the
-	 * invalidation needs to be reverted.
-	 */
 	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
 		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
 		if (!invalid_slot) {
 			mutex_unlock(&kvm->slots_arch_lock);
 			return -ENOMEM;
 		}
-		kvm_invalidate_memslot(kvm, old, invalid_slot);
 	}
 
 	r = kvm_prepare_memory_region(kvm, old, new, change);
 	if (r) {
-		/*
-		 * For DELETE/MOVE, revert the above INVALID change. No
-		 * modifications required since the original slot was preserved
-		 * in the inactive slots. Changing the active memslots also
-		 * release slots_arch_lock.
-		 */
-		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
-			kvm_activate_memslot(kvm, invalid_slot, old);
+		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
 			kfree(invalid_slot);
-		} else {
-			mutex_unlock(&kvm->slots_arch_lock);
-		}
+
+		mutex_unlock(&kvm->slots_arch_lock);
 		return r;
 	}
 
+	/*
+	 * Invalidate the old slot if it's being deleted or moved. This is
+	 * done prior to actually deleting/moving the memslot to allow vCPUs to
+	 * continue running by ensuring there are no mappings or shadow pages
+	 * for the memslot when it is deleted/moved. Without pre-invalidation
+	 * (and without a lock), a window would exist between effecting the
+	 * delete/move and committing the changes in arch code where KVM or a
+	 * guest could access a non-existent memslot.
+	 */
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+		kvm_invalidate_memslot(kvm, old, invalid_slot);
+
 	/*
 	 * For DELETE and MOVE, the working slot is now active as the INVALID
 	 * version of the old slot. MOVE is particularly special as it reuses