Message ID | 20221111103350.22326-1-yan.y.zhao@intel.com |
---|---|
State | New |
Headers |
From: Yan Zhao <yan.y.zhao@intel.com>
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: pbonzini@redhat.com, seanjc@google.com, intel-gfx@lists.freedesktop.org, intel-gvt-dev@lists.freedesktop.org, zhenyuw@linux.intel.com, Yan Zhao <yan.y.zhao@intel.com>
Subject: [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot
Date: Fri, 11 Nov 2022 18:33:50 +0800
Message-Id: <20221111103350.22326-1-yan.y.zhao@intel.com>
In-Reply-To: <20221111103247.22275-1-yan.y.zhao@intel.com>
References: <20221111103247.22275-1-yan.y.zhao@intel.com>
Series | add track_remove_slot and remove track_flush_slot |
Commit Message
Yan Zhao
Nov. 11, 2022, 10:33 a.m. UTC
Page track hook track_remove_slot is used to notify users that a slot has been removed and is called when a slot DELETE/MOVE is about to be completed. Users of this hook can drop write protections in the removed slot.

Note: since KVM_MR_MOVE currently never actually happens in KVM/QEMU, and has never been properly supported in the external page track user, we just use the hook track_remove_slot to notify users of the old slot being removed.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_page_track.h | 11 +++++++++++
 arch/x86/kvm/mmu/page_track.c         | 26 ++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                    |  3 +++
 3 files changed, 40 insertions(+)
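[Editor's note] For readers unfamiliar with the page-track notifier API, the sketch below shows how an external user such as KVMGT could consume the new callback. The demo_* names and the callback body are illustrative assumptions, not part of this patch; only struct kvm_page_track_notifier_node, kvm_page_track_register_notifier() and the track_remove_slot() member come from KVM and this series.

/*
 * Minimal sketch of an external page-track user hooking the new callback.
 * The demo_* identifiers are hypothetical.
 */
#include <linux/kvm_host.h>
#include <asm/kvm_page_track.h>

static void demo_track_remove_slot(struct kvm *kvm,
				   struct kvm_memory_slot *slot,
				   struct kvm_page_track_notifier_node *node)
{
	/*
	 * The slot (and its write-tracking counts) is about to go away, so
	 * simply drop any private bookkeeping for GFNs in
	 * [slot->base_gfn, slot->base_gfn + slot->npages); there is no need
	 * to call kvm_slot_page_track_remove_page() for each GFN.
	 */
}

static struct kvm_page_track_notifier_node demo_node = {
	.track_remove_slot = demo_track_remove_slot,
};

static void demo_attach(struct kvm *kvm)
{
	kvm_page_track_register_notifier(kvm, &demo_node);
}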
Comments
TL;DR: I'm going to try to add more aggressive patches for this into my series to clean up the KVM side of things, along with many more patches to clean up the page track APIs. I'll post patches next week if things go well (fingers crossed), and if not I'll give an update.

On Fri, Nov 11, 2022, Yan Zhao wrote:
> Page track hook track_remove_slot is used to notify users that a slot
> has been removed and is called when a slot DELETE/MOVE is about to be
> completed.

Phrase this as a command, and explain _why_ the new hook is being added, e.g.

  Add a new page track hook, track_remove_slot(), that is called when a
  memslot DELETE/MOVE operation is about to be committed.  The "remove"
  hook will be used by KVMGT and will effectively replace the existing
  track_flush_slot() altogether now that KVM itself doesn't rely on the
  "flush" hook either.

  The "flush" hook is flawed as it's invoked before the memslot operation
  is guaranteed, i.e. KVM might ultimately keep the existing memslot without
  notifying external page track users, a.k.a. KVMGT.

> Users of this hook can drop write protections in the removed slot.

Hmm, actually, on second thought, after thinking about what KVMGT is doing in response to the memslot update, I think we should be more aggressive and actively prevent MOVE if there are external page trackers, i.e. if KVMGT is attached.

Dropping write protections when a memslot is being deleted is a waste of cycles. The memslot, and thus gfn_track, is literally being deleted immediately after invoking the hook, so updating gfn_track from KVMGT is completely unnecessary. I.e. if we kill off the MOVE path, then KVMGT just needs to delete its hash table entry.

Oooh! Looking at this code again made me realize that the larger page track cleanup that I want to do might actually work. Long story short, I want to stop forcing KVMGT to poke into KVM internals, but I thought there was a lock inversion problem.

But AFAICT, there is no such problem. And the cleanup I want to do will actually fix an existing KVMGT bug: kvmgt_page_track_write() invokes kvmgt_gfn_is_write_protected() without holding mmu_lock, and thus could consume garbage when walking the hash table.

  static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				     const u8 *val, int len,
				     struct kvm_page_track_notifier_node *node)
  {
	struct intel_vgpu *info =
		container_of(node, struct intel_vgpu, track_node);

	if (kvmgt_gfn_is_write_protected(info, gpa_to_gfn(gpa)))
		intel_vgpu_page_track_handler(info, gpa,
					      (void *)val, len);
  }

Acquiring mmu_lock isn't an option as intel_vgpu_page_track_handler() might sleep, e.g. when acquiring vgpu_lock.

Let me see if the cleanup I have in mind will actually work. If it does, I think the end result will be quite nice for both KVM and KVMGT.
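[Editor's note] To make the unlocked-walk hazard above concrete, here is a schematic sketch. It is not the actual kvmgt.c code; the structure, names and choice of a dedicated spinlock are assumptions for illustration only. The point is that a hash-table lookup must be serialized against insertion and removal by *some* lock; if mmu_lock cannot be used, a private lock works.

/* Hypothetical illustration only -- not the KVMGT implementation. */
#include <linux/hashtable.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct tracked_gfn {
	u64 gfn;
	struct hlist_node hnode;
};

static DEFINE_SPINLOCK(gfn_table_lock);	/* serializes add/remove/lookup */
static DECLARE_HASHTABLE(gfn_table, 6);

static bool gfn_is_tracked(u64 gfn)
{
	struct tracked_gfn *p;
	bool found = false;

	/*
	 * Without this lock, a concurrent removal could free an entry while
	 * the walk below is still following its hnode pointer -- the
	 * "consume garbage" scenario described above.
	 */
	spin_lock(&gfn_table_lock);
	hash_for_each_possible(gfn_table, p, hnode, gfn) {
		if (p->gfn == gfn) {
			found = true;
			break;
		}
	}
	spin_unlock(&gfn_table_lock);

	return found;
}

Note that in such a scheme any handler that may sleep would have to run after the spinlock is dropped, which matches the constraint above that mmu_lock cannot be held across intel_vgpu_page_track_handler().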
On Fri, Nov 11, 2022 at 06:19:15PM +0000, Sean Christopherson wrote: > TL;DR: I'm going to try to add more aggressive patches for this into my series to > clean up the KVM side of things, along with many more patches to clean up the page > track APIs. > > I'll post patches next week if things go well (fingers crossed), and if not I'll > give an update > > On Fri, Nov 11, 2022, Yan Zhao wrote: > > Page track hook track_remove_slot is used to notify users that a slot > > has been removed and is called when a slot DELETE/MOVE is about to be > > completed. > > Phrase this as a command, and explain _why_ the new hook is being added, e.g. > > Add a new page track hook, track_remove_slot(), that is called when a > memslot DELETE/MOVE operation is about to be committed. The "remove" > hook will be used by KVMGT and will effectively replace the existing > track_flush_slot() altogether now that KVM itself doesn't rely on the > "flush" hook either. > > The "flush" hook is flawed as it's invoked before the memslot operation > is guaranteed, i.e. KVM might ultimately keep the existing memslot without > notifying external page track users, a.k.a. KVMGT. > > > Users of this hook can drop write protections in the removed slot. > > Hmm, actually, on second thought, after thinking about what KVGT is doing in > response to the memslot update, I think we should be more aggressive and actively > prevent MOVE if there are external page trackers, i.e. if KVMGT is attached. > > Dropping write protections when a memslot is being deleted is a waste of cycles. > The memslot and thus gfn_track is literally being deleted immediately after invoking > the hook, updating gfn_track from KVMGT is completely unecessary. > I.e. if we kill off the MOVE path, then KVMGT just needs to delete its hash table > entry. > > Oooh! Looking at this code again made me realize that the larger page track cleanup > that I want to do might actually work. Long story short, I want to stop forcing > KVMGT to poke into KVM internals, but I thought there was a lock inversion problem. > > But AFAICT, there is no such problem. And the cleanup I want to do will actually > fix an existing KVMGT bug: kvmgt_page_track_write() invokes kvmgt_gfn_is_write_protected() > without holding mmu_lock, and thus could consume garbage when walking the hash > table. > > static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, > const u8 *val, int len, > struct kvm_page_track_notifier_node *node) > { > struct intel_vgpu *info = > container_of(node, struct intel_vgpu, track_node); > > if (kvmgt_gfn_is_write_protected(info, gpa_to_gfn(gpa))) > intel_vgpu_page_track_handler(info, gpa, > (void *)val, len); > } > > Acquiring mmu_lock isn't an option as intel_vgpu_page_track_handler() might sleep, > e.g. when acquiring vgpu_lock. > I totally agree with you and actually had the same feeling as you when examined the code yesterday. But I thought I'd better send this series first and do the cleanup later :) And I'm also not sure if a slots_arch_lock is required for kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). (Though it doesn't matter for a removing slot and we actually needn't to call kvm_slot_page_track_remove_page() in response to a slot removal, the two interfaces are still required elsewhere.) > Let me see if the clean up I have in mind will actually work. If it does, I think > the end result will be quite nice for both KVM and KVMGT. Yes, it would be great. Thanks Yan
On Sat, Nov 12, 2022, Yan Zhao wrote:
> And I'm also not sure if a slots_arch_lock is required for
> kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().

It's not required. slots_arch_lock protects interaction between memslot updates and mmu_first_shadow_root_alloc(). When CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, mmu_first_shadow_root_alloc() doesn't touch the memslots because everything is pre-allocated:

  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
  {
	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
  }

  int kvm_page_track_create_memslot(struct kvm *kvm,
				    struct kvm_memory_slot *slot,
				    unsigned long npages)
  {
	if (!kvm_page_track_write_tracking_enabled(kvm))   <== always true
		return 0;

	return __kvm_page_track_write_tracking_alloc(slot, npages);
  }

Though now that you point it out, it's tempting to #ifdef out some of those hooks so that it's basically impossible for mmu_first_shadow_root_alloc() to cause problems. Not sure the extra #ifdeffery would be worthwhile though.

slots_arch_lock also protects shadow_root_allocated, but that's a KVM-internal detail that isn't relevant to the page-tracking machinery when CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y.
On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote: > On Sat, Nov 12, 2022, Yan Zhao wrote: > > And I'm also not sure if a slots_arch_lock is required for > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). > > It's not required. slots_arch_lock protects interaction between memslot updates In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(), slot->arch.gfn_track[mode][index] is updated in update_gfn_track(), do you know which lock is used to protect it? Thanks Yan > mmu_first_shadow_root_alloc(). When CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, then > the mmu_first_shadow_root_alloc() doesn't touch the memslots because everything > is pre-allocated: > > bool kvm_page_track_write_tracking_enabled(struct kvm *kvm) > { > return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || > !tdp_enabled || kvm_shadow_root_allocated(kvm); > } > > int kvm_page_track_create_memslot(struct kvm *kvm, > struct kvm_memory_slot *slot, > unsigned long npages) > { > if (!kvm_page_track_write_tracking_enabled(kvm)) <== always true > return 0; > > return __kvm_page_track_write_tracking_alloc(slot, npages); > } > > Though now that you point it out, it's tempting to #ifdef out some of those hooks > so that's basically impossible for mmu_first_shadow_root_alloc() to cause problems. > Not sure the extra #ideffery would be worth while though. > > slots_arch_lock also protects shadow_root_allocated, but that's a KVM-internal > detail that isn't relevant to the page-tracking machinery when > CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y.
On Mon, Nov 14, 2022, Yan Zhao wrote:
> On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > And I'm also not sure if a slots_arch_lock is required for
> > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> >
> > It's not required. slots_arch_lock protects interaction between memslot updates
> In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> do you know which lock is used to protect it?

mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated protects the validity of gfn_track, i.e. shadow_root_allocated ensures that KVM allocates gfn_track for all memslots when shadow paging is activated.

The cleanup series I'm prepping adds lockdep assertions for the relevant paths, e.g.

  $ git grep -B 8 -E "update_gfn_write_track.*;"
  arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
  arch/x86/kvm/mmu/page_track.c-			       gfn_t gfn)
  arch/x86/kvm/mmu/page_track.c-{
  arch/x86/kvm/mmu/page_track.c-	lockdep_assert_held_write(&kvm->mmu_lock);
  arch/x86/kvm/mmu/page_track.c-
  arch/x86/kvm/mmu/page_track.c-	if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
  arch/x86/kvm/mmu/page_track.c-		return;
  arch/x86/kvm/mmu/page_track.c-
  arch/x86/kvm/mmu/page_track.c:	update_gfn_write_track(slot, gfn, 1);
  --
  arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm,
  arch/x86/kvm/mmu/page_track.c-				  struct kvm_memory_slot *slot, gfn_t gfn)
  arch/x86/kvm/mmu/page_track.c-{
  arch/x86/kvm/mmu/page_track.c-	lockdep_assert_held_write(&kvm->mmu_lock);
  arch/x86/kvm/mmu/page_track.c-
  arch/x86/kvm/mmu/page_track.c-	if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
  arch/x86/kvm/mmu/page_track.c-		return;
  arch/x86/kvm/mmu/page_track.c-
  arch/x86/kvm/mmu/page_track.c:	update_gfn_write_track(slot, gfn, -1);
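[Editor's note] As a usage sketch of what those assertions enforce (the __kvm_write_track_add_gfn() name is taken from the quoted cleanup series; the demo_* wrapper is hypothetical): the write-tracking count may only be touched while mmu_lock is held for write, e.g.

/* Illustrative caller only; demo_write_track_gfn() is not KVM code. */
static void demo_write_track_gfn(struct kvm *kvm,
				 struct kvm_memory_slot *slot, gfn_t gfn)
{
	write_lock(&kvm->mmu_lock);	/* x86 KVM's mmu_lock is an rwlock */
	__kvm_write_track_add_gfn(kvm, slot, gfn);
	write_unlock(&kvm->mmu_lock);
}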
On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote: > On Mon, Nov 14, 2022, Yan Zhao wrote: > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote: > > > On Sat, Nov 12, 2022, Yan Zhao wrote: > > > > And I'm also not sure if a slots_arch_lock is required for > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). > > > > > > It's not required. slots_arch_lock protects interaction between memslot updates > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(), > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(), > > do you know which lock is used to protect it? > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM > allocates gfn_track for all memslots when shadow paging is activated. Hmm, thanks for the reply. but in direct_page_fault(), if (page_fault_handle_page_track(vcpu, fault)) return RET_PF_EMULATE; slot->arch.gfn_track is read without any mmu_lock is held. > > The cleanup series I'm prepping adds lockdep assertions for the relevant paths, e.g. > > $ git grep -B 8 -E "update_gfn_write_track.*;" > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot, > arch/x86/kvm/mmu/page_track.c- gfn_t gfn) > arch/x86/kvm/mmu/page_track.c-{ > arch/x86/kvm/mmu/page_track.c- lockdep_assert_held_write(&kvm->mmu_lock); > arch/x86/kvm/mmu/page_track.c- > arch/x86/kvm/mmu/page_track.c- if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm)) > arch/x86/kvm/mmu/page_track.c- return; > arch/x86/kvm/mmu/page_track.c- > arch/x86/kvm/mmu/page_track.c: update_gfn_write_track(slot, gfn, 1); > -- > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm, > arch/x86/kvm/mmu/page_track.c- struct kvm_memory_slot *slot, gfn_t gfn) > arch/x86/kvm/mmu/page_track.c-{ > arch/x86/kvm/mmu/page_track.c- lockdep_assert_held_write(&kvm->mmu_lock); > arch/x86/kvm/mmu/page_track.c- > arch/x86/kvm/mmu/page_track.c- if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm)) > arch/x86/kvm/mmu/page_track.c- return; > arch/x86/kvm/mmu/page_track.c- > arch/x86/kvm/mmu/page_track.c: update_gfn_write_track(slot, gfn, -1); yes, it will be helpful. Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to update slot->arch.gfn_track be better? Thanks Yan
On Mon, Nov 14, 2022 at 11:24:16PM +0000, Sean Christopherson wrote: > On Tue, Nov 15, 2022, Yan Zhao wrote: > > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote: > > > On Mon, Nov 14, 2022, Yan Zhao wrote: > > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote: > > > > > On Sat, Nov 12, 2022, Yan Zhao wrote: > > > > > > And I'm also not sure if a slots_arch_lock is required for > > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). > > > > > > > > > > It's not required. slots_arch_lock protects interaction between memslot updates > > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(), > > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(), > > > > do you know which lock is used to protect it? > > > > > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated > > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM > > > allocates gfn_track for all memslots when shadow paging is activated. > > Hmm, thanks for the reply. > > but in direct_page_fault(), > > if (page_fault_handle_page_track(vcpu, fault)) > > return RET_PF_EMULATE; > > > > slot->arch.gfn_track is read without any mmu_lock is held. > > That's a fast path that deliberately reads out of mmu_lock. A false positive > only results in unnecessary emulation, and any false positive is inherently prone > to races anyways, e.g. fault racing with zap. what about false negative? If the fast path read 0 count, no page track write callback will be called and write protection will be removed in the slow path. Thanks Yan > > > > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm, > > > arch/x86/kvm/mmu/page_track.c- struct kvm_memory_slot *slot, gfn_t gfn) > > > arch/x86/kvm/mmu/page_track.c-{ > > > arch/x86/kvm/mmu/page_track.c- lockdep_assert_held_write(&kvm->mmu_lock); > > > arch/x86/kvm/mmu/page_track.c- > > > arch/x86/kvm/mmu/page_track.c- if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm)) > > > arch/x86/kvm/mmu/page_track.c- return; > > > arch/x86/kvm/mmu/page_track.c- > > > arch/x86/kvm/mmu/page_track.c: update_gfn_write_track(slot, gfn, -1); > > yes, it will be helpful. > > > > Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to > > update slot->arch.gfn_track be better? > > WRITE_ONCE() won't suffice, it needs to be atomic. Switching to atomic_inc/dec > isn't worth it so long as KVM's shadow MMU takes mmu_lock for write, i.e. while > the accounting is mutually exclusive for other reasons in both KVM and KVMGT.
On Tue, Nov 15, 2022, Yan Zhao wrote: > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote: > > On Mon, Nov 14, 2022, Yan Zhao wrote: > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote: > > > > On Sat, Nov 12, 2022, Yan Zhao wrote: > > > > > And I'm also not sure if a slots_arch_lock is required for > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). > > > > > > > > It's not required. slots_arch_lock protects interaction between memslot updates > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(), > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(), > > > do you know which lock is used to protect it? > > > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM > > allocates gfn_track for all memslots when shadow paging is activated. > Hmm, thanks for the reply. > but in direct_page_fault(), > if (page_fault_handle_page_track(vcpu, fault)) > return RET_PF_EMULATE; > > slot->arch.gfn_track is read without any mmu_lock is held. That's a fast path that deliberately reads out of mmu_lock. A false positive only results in unnecessary emulation, and any false positive is inherently prone to races anyways, e.g. fault racing with zap. > > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm, > > arch/x86/kvm/mmu/page_track.c- struct kvm_memory_slot *slot, gfn_t gfn) > > arch/x86/kvm/mmu/page_track.c-{ > > arch/x86/kvm/mmu/page_track.c- lockdep_assert_held_write(&kvm->mmu_lock); > > arch/x86/kvm/mmu/page_track.c- > > arch/x86/kvm/mmu/page_track.c- if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm)) > > arch/x86/kvm/mmu/page_track.c- return; > > arch/x86/kvm/mmu/page_track.c- > > arch/x86/kvm/mmu/page_track.c: update_gfn_write_track(slot, gfn, -1); > yes, it will be helpful. > > Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to > update slot->arch.gfn_track be better? WRITE_ONCE() won't suffice, it needs to be atomic. Switching to atomic_inc/dec isn't worth it so long as KVM's shadow MMU takes mmu_lock for write, i.e. while the accounting is mutually exclusive for other reasons in both KVM and KVMGT.
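[Editor's note] A generic sketch (not KVM code) of why WRITE_ONCE() alone wouldn't be enough for the gfn_track counts: the update is a read-modify-write, so two unsynchronized writers can still lose an increment. Only an atomic RMW, or the mutual exclusion that holding mmu_lock for write already provides, closes the race.

/* Hypothetical illustration; not taken from KVM. */
static unsigned short count;	/* gfn_track entries are unsigned shorts */

static void racy_inc(void)
{
	/*
	 * READ_ONCE()/WRITE_ONCE() only prevent tearing and compiler
	 * reordering; two CPUs can still both read N and both store N + 1.
	 */
	WRITE_ONCE(count, READ_ONCE(count) + 1);
}

static atomic_t acount;

static void atomic_inc_example(void)
{
	atomic_inc(&acount);	/* a single atomic read-modify-write */
}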
On Tue, Nov 15, 2022, Yan Zhao wrote: > On Mon, Nov 14, 2022 at 11:24:16PM +0000, Sean Christopherson wrote: > > On Tue, Nov 15, 2022, Yan Zhao wrote: > > > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote: > > > > On Mon, Nov 14, 2022, Yan Zhao wrote: > > > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote: > > > > > > On Sat, Nov 12, 2022, Yan Zhao wrote: > > > > > > > And I'm also not sure if a slots_arch_lock is required for > > > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). > > > > > > > > > > > > It's not required. slots_arch_lock protects interaction between memslot updates > > > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(), > > > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(), > > > > > do you know which lock is used to protect it? > > > > > > > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated > > > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM > > > > allocates gfn_track for all memslots when shadow paging is activated. > > > Hmm, thanks for the reply. > > > but in direct_page_fault(), > > > if (page_fault_handle_page_track(vcpu, fault)) > > > return RET_PF_EMULATE; > > > > > > slot->arch.gfn_track is read without any mmu_lock is held. > > > > That's a fast path that deliberately reads out of mmu_lock. A false positive > > only results in unnecessary emulation, and any false positive is inherently prone > > to races anyways, e.g. fault racing with zap. > what about false negative? > If the fast path read 0 count, no page track write callback will be called and write > protection will be removed in the slow path. No. For a false negative to occur, a different task would have to create a SPTE and write-protect the GFN _while holding mmu_lock_. And then after acquiring mmu_lock, the vCPU that got the false negative would call make_spte(), which would detect that making the SPTE writable is disallowed due to the GFN being write-protected. if (pte_access & ACC_WRITE_MASK) { spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask; /* * Optimization: for pte sync, if spte was writable the hash * lookup is unnecessary (and expensive). Write protection * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots. * Same reasoning can be applied to dirty page accounting. */ if (is_writable_pte(old_spte)) goto out; /* * Unsync shadow pages that are reachable by the new, writable * SPTE. Write-protect the SPTE if the page can't be unsync'd, * e.g. it's write-tracked (upper-level SPs) or has one or more * shadow pages and unsync'ing pages is not allowed. */ if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); wrprot = true; pte_access &= ~ACC_WRITE_MASK; spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask); } } int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, gfn_t gfn, bool can_unsync, bool prefetch) { struct kvm_mmu_page *sp; bool locked = false; /* * Force write-protection if the page is being tracked. Note, the page * track machinery is used to write-protect upper-level shadow pages, * i.e. this guards the role.level == 4K assertion below! */ if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE)) return -EPERM; ... }
On Tue, Nov 15, 2022 at 12:55:42AM +0000, Sean Christopherson wrote: > On Tue, Nov 15, 2022, Yan Zhao wrote: > > On Mon, Nov 14, 2022 at 11:24:16PM +0000, Sean Christopherson wrote: > > > On Tue, Nov 15, 2022, Yan Zhao wrote: > > > > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote: > > > > > On Mon, Nov 14, 2022, Yan Zhao wrote: > > > > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote: > > > > > > > On Sat, Nov 12, 2022, Yan Zhao wrote: > > > > > > > > And I'm also not sure if a slots_arch_lock is required for > > > > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(). > > > > > > > > > > > > > > It's not required. slots_arch_lock protects interaction between memslot updates > > > > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(), > > > > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(), > > > > > > do you know which lock is used to protect it? > > > > > > > > > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated > > > > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM > > > > > allocates gfn_track for all memslots when shadow paging is activated. > > > > Hmm, thanks for the reply. > > > > but in direct_page_fault(), > > > > if (page_fault_handle_page_track(vcpu, fault)) > > > > return RET_PF_EMULATE; > > > > > > > > slot->arch.gfn_track is read without any mmu_lock is held. > > > > > > That's a fast path that deliberately reads out of mmu_lock. A false positive > > > only results in unnecessary emulation, and any false positive is inherently prone > > > to races anyways, e.g. fault racing with zap. > > what about false negative? > > If the fast path read 0 count, no page track write callback will be called and write > > protection will be removed in the slow path. > > No. For a false negative to occur, a different task would have to create a SPTE > and write-protect the GFN _while holding mmu_lock_. And then after acquiring > mmu_lock, the vCPU that got the false negative would call make_spte(), which would > detect that making the SPTE writable is disallowed due to the GFN being write-protected. > > if (pte_access & ACC_WRITE_MASK) { > spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask; > > /* > * Optimization: for pte sync, if spte was writable the hash > * lookup is unnecessary (and expensive). Write protection > * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots. > * Same reasoning can be applied to dirty page accounting. > */ > if (is_writable_pte(old_spte)) > goto out; > > /* > * Unsync shadow pages that are reachable by the new, writable > * SPTE. Write-protect the SPTE if the page can't be unsync'd, > * e.g. it's write-tracked (upper-level SPs) or has one or more > * shadow pages and unsync'ing pages is not allowed. > */ > if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) { > pgprintk("%s: found shadow page for %llx, marking ro\n", > __func__, gfn); > wrprot = true; > pte_access &= ~ACC_WRITE_MASK; > spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask); > } > } > > > > int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, > gfn_t gfn, bool can_unsync, bool prefetch) > { > struct kvm_mmu_page *sp; > bool locked = false; > > /* > * Force write-protection if the page is being tracked. Note, the page > * track machinery is used to write-protect upper-level shadow pages, > * i.e. this guards the role.level == 4K assertion below! 
> */ > if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE)) > return -EPERM; > > ... > } Oh, you are right! I thought mmu_try_to_unsync_pages() is only for the shadow mmu, and overlooked that TDP MMU will also go into it. Thanks for the detailed explanation. Thanks Yan
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..046b024d1813 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -44,6 +44,16 @@ struct kvm_page_track_notifier_node {
 	 */
 	void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
			    struct kvm_page_track_notifier_node *node);
+	/*
+	 * It is called when memory slot is moved or removed
+	 * users can drop write-protection for the pages in that memory slot
+	 *
+	 * @kvm: the kvm where memory slot being moved or removed
+	 * @slot: the memory slot being moved or removed
+	 * @node: this node
+	 */
+	void (*track_remove_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    struct kvm_page_track_notifier_node *node);
 };
 
 int kvm_page_track_init(struct kvm *kvm);
@@ -76,4 +86,5 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 			  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..4d6bab1d61c9 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -300,3 +300,29 @@ void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 		n->track_flush_slot(kvm, slot, n);
 	srcu_read_unlock(&head->track_srcu, idx);
 }
+
+/*
+ * Notify the node that memory slot is removed or moved so that it can
+ * drop write-protection for the pages in the memory slot.
+ *
+ * The node should figure out it has any write-protected pages in this slot
+ * by itself.
+ */
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+
+	head = &kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+				  srcu_read_lock_held(&head->track_srcu))
+		if (n->track_remove_slot)
+			n->track_remove_slot(kvm, slot, n);
+	srcu_read_unlock(&head->track_srcu, idx);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916ebbc81e52..a24a4a2ad1a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12844,6 +12844,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
				   const struct kvm_memory_slot *new,
				   enum kvm_mr_change change)
 {
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+		kvm_page_track_remove_slot(kvm, old);
+
 	if (!kvm->arch.n_requested_mmu_pages &&
 	    (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
 		unsigned long nr_mmu_pages;
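[Editor's note] For context on when the hunk in x86.c fires (assumption: standard KVM userspace API usage, not part of this patch): kvm_arch_commit_memory_region() sees KVM_MR_DELETE when userspace deletes a memslot by setting its size to zero, which with this patch in turn invokes track_remove_slot() for the old slot.

/* Userspace-side sketch; vm_fd is a VM file descriptor from KVM_CREATE_VM. */
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

static int delete_memslot(int vm_fd, __u32 slot_id)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot = slot_id;
	region.memory_size = 0;	/* size 0 deletes the existing slot */

	/* Triggers KVM_MR_DELETE and, with this patch, track_remove_slot(). */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}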