[v1,2/9] KVM: x86/mmu: Add support for prewrite page tracking

Message ID 20230505152046.6575-3-mic@digikod.net
State New
Series Hypervisor-Enforced Kernel Integrity

Commit Message

Mickaël Salaün May 5, 2023, 3:20 p.m. UTC
Add a new page tracking mode that denies page updates and injects a page
fault into the guest.  This is useful for KVM to make some guest pages
non-writable ("non-writable" rather than "read-only" because no execution
restriction is implied); see the following Heki commits.
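
As a rough illustration, a hypothetical consumer of the new mode (the
heki_* names and heki_gpa_is_protected() are made up for this sketch and
are not part of this patch) would register a notifier and track the pages
it wants to protect:

  static bool heki_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa,
  				struct kvm_page_track_notifier_node *node)
  {
  	/* Returning false denies the write and stops the emulation. */
  	return !heki_gpa_is_protected(gpa);
  }

  static struct kvm_page_track_notifier_node heki_node = {
  	.track_prewrite = heki_track_prewrite,
  };

  /* At setup time, with the appropriate KVM locks held: */
  kvm_page_track_register_notifier(kvm, &heki_node);
  kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_PREWRITE);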

Such synthetic kernel page faults need to be handled by the guest, which
is not currently the case, so the guest retries the faulting instruction
over and over.  Guest-side handling will be part of a follow-up patch
series.

Update emulator_read_write_onepage() to handle X86EMUL_CONTINUE and
X86EMUL_PROPAGATE_FAULT.

Update page_fault_handle_page_track() to call
kvm_slot_page_track_is_active() for both KVM_PAGE_TRACK_PREWRITE and
KVM_PAGE_TRACK_WRITE, so that each tracker is checked even if one of them
already returned true.

Invert the return code semantics for read_emulate() and write_emulate():
- from 1=Ok, 0=Error
- to X86EMUL_* return codes (e.g. X86EMUL_CONTINUE == 0)

Imported the prewrite page tracking support part originally written by
Mihai Donțu, Marian Rotariu, and Ștefan Șicleru:
https://lore.kernel.org/r/20211006173113.26445-27-alazar@bitdefender.com
https://lore.kernel.org/r/20211006173113.26445-28-alazar@bitdefender.com
Removed the GVA changes for page tracking, the X86EMUL_RETRY_INSTR case,
and part of the emulation handling for now.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marian Rotariu <marian.c.rotariu@gmail.com>
Cc: Mihai Donțu <mdontu@bitdefender.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Ștefan Șicleru <ssicleru@bitdefender.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20230505152046.6575-3-mic@digikod.net
---
 arch/x86/include/asm/kvm_page_track.h | 12 +++++
 arch/x86/kvm/mmu/mmu.c                | 64 ++++++++++++++++++++++-----
 arch/x86/kvm/mmu/page_track.c         | 33 +++++++++++++-
 arch/x86/kvm/mmu/spte.c               |  6 +++
 arch/x86/kvm/x86.c                    | 27 +++++++----
 5 files changed, 122 insertions(+), 20 deletions(-)
  

Comments

Sean Christopherson May 5, 2023, 4:28 p.m. UTC | #1
On Fri, May 05, 2023, Mickaël Salaün wrote:
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..a7fb4ff888e6 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -3,6 +3,7 @@
>  #define _ASM_X86_KVM_PAGE_TRACK_H
>  
>  enum kvm_page_track_mode {
> +	KVM_PAGE_TRACK_PREWRITE,

Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

  Drop "support" for multiple page-track modes, as there is no evidence
  that array-based and refcounted metadata is the optimal solution for
  other modes, nor is there any evidence that other use cases, e.g. for
  access-tracking, will be a good fit for the page-track machinery in
  general.
  
  E.g. one potential use case of access-tracking would be to prevent guest
  access to poisoned memory (from the guest's perspective).  In that case,
  the number of poisoned pages is likely to be a very small percentage of
  the guest memory, and there is no need to reference count the number of
  access-tracking users, i.e. expanding gfn_track[] for a new mode would be
  grossly inefficient.  And for poisoned memory, host userspace would also
  likely want to trap accesses, e.g. to inject #MC into the guest, and that
  isn't currently supported by the page-track framework.
  
  A better alternative for that poisoned page use case is likely a
  variation of the proposed per-gfn attributes overlay (linked), which
  would allow efficiently tracking the sparse set of poisoned pages, and by
  default would exit to userspace on access.

Of particular relevance:

  - Using the page-track machinery is inefficient because the guest is likely
    going to write-protect a minority of its memory.  And this

      select KVM_EXTERNAL_WRITE_TRACKING if KVM

    is particularly nasty because simply enabling HEKI in the Kconfig will cause
    KVM to allocate rmaps and gfn tracking.

  - There's no need to reference count the protection, i.e. 15 of the 16 bits of
    gfn_track are dead weight.

  - As proposed, adding a second "mode" would double the cost of gfn tracking.

  - Tying the protections to the memslots will create an impossible-to-maintain
    ABI because the protections will be lost if the owning memslot is deleted and
    recreated.

  - The page-track framework provides incomplete protection and will lead to an
    ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
    adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map().

  - The scaling and maintenance issues will only get worse if/when someone tries
    to support dropping read and/or execute permissions, e.g. for execute-only.

  - The code is x86-only, and is likely to stay that way for the foreseeable
    future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs.  This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer.

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound.  And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.

[1] https://lore.kernel.org/all/20230311002258.852397-22-seanjc@google.com
[2] https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
[3] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com
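
To make that alternative a bit more concrete, a rough sketch of what
xarray-backed per-gfn attributes could look like (illustrative only: the
names and layout are not the actual memory-attributes series, and
multi-index entries require CONFIG_XARRAY_MULTI):

  /* One xarray per VM, mapping gfn -> attribute bits stored as xa values. */
  #define GFN_ATTR_NO_WRITE	BIT(0)

  static int gfn_attrs_deny_write(struct xarray *attrs, gfn_t start, gfn_t end)
  {
  	/* A single multi-index entry covers the whole [start, end] range. */
  	return xa_err(xa_store_range(attrs, start, end,
  				     xa_mk_value(GFN_ATTR_NO_WRITE),
  				     GFP_KERNEL_ACCOUNT));
  }

  static bool gfn_attrs_is_writable(struct xarray *attrs, gfn_t gfn)
  {
  	void *entry = xa_load(attrs, gfn);

  	return !xa_is_value(entry) ||
  	       !(xa_to_value(entry) & GFN_ATTR_NO_WRITE);
  }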
  
Mickaël Salaün May 5, 2023, 4:49 p.m. UTC | #2
On 05/05/2023 18:28, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
>> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
>> index eb186bc57f6a..a7fb4ff888e6 100644
>> --- a/arch/x86/include/asm/kvm_page_track.h
>> +++ b/arch/x86/include/asm/kvm_page_track.h
>> @@ -3,6 +3,7 @@
>>   #define _ASM_X86_KVM_PAGE_TRACK_H
>>   
>>   enum kvm_page_track_mode {
>> +	KVM_PAGE_TRACK_PREWRITE,
> 
> Heh, just when I decide to finally kill off support for multiple modes[1] :-)
> 
> My assessment from that changelog still holds true for this case:
> 
>    Drop "support" for multiple page-track modes, as there is no evidence
>    that array-based and refcounted metadata is the optimal solution for
>    other modes, nor is there any evidence that other use cases, e.g. for
>    access-tracking, will be a good fit for the page-track machinery in
>    general.
>    
>    E.g. one potential use case of access-tracking would be to prevent guest
>    access to poisoned memory (from the guest's perspective).  In that case,
>    the number of poisoned pages is likely to be a very small percentage of
>    the guest memory, and there is no need to reference count the number of
>    access-tracking users, i.e. expanding gfn_track[] for a new mode would be
>    grossly inefficient.  And for poisoned memory, host userspace would also
>    likely want to trap accesses, e.g. to inject #MC into the guest, and that
>    isn't currently supported by the page-track framework.
>    
>    A better alternative for that poisoned page use case is likely a
>    variation of the proposed per-gfn attributes overlay (linked), which
>    would allow efficiently tracking the sparse set of poisoned pages, and by
>    default would exit to userspace on access.
> 
> Of particular relevance:
> 
>    - Using the page-track machinery is inefficient because the guest is likely
>      going to write-protect a minority of its memory.  And this
> 
>        select KVM_EXTERNAL_WRITE_TRACKING if KVM
> 
>      is particularly nasty because simply enabling HEKI in the Kconfig will cause
>      KVM to allocate rmaps and gfn tracking.
> 
>    - There's no need to reference count the protection, i.e. 15 of the 16 bits of
>      gfn_track are dead weight.
> 
>    - As proposed, adding a second "mode" would double the cost of gfn tracking.
> 
>    - Tying the protections to the memslots will create an impossible-to-maintain
>      ABI because the protections will be lost if the owning memslot is deleted and
>      recreated.
> 
>    - The page-track framework provides incomplete protection and will lead to an
>      ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
>      adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map().
> 
>    - The scaling and maintenance issues will only get worse if/when someone tries
>      to support dropping read and/or execute permissions, e.g. for execute-only.
> 
>    - The code is x86-only, and is likely to stay that way for the foreseeable
>      future.
> 
> The proposed alternative is to piggyback the memory attributes implementation[2]
> that is being added (if all goes according to plan) for confidential VMs.  This
> use case (dropping permissions) came up not too long ago[3], which is why I have
> a ready-made answer.
> 
> I have no doubt that we'll need to solve performance and scaling issues with the
> memory attributes implementation, e.g. to utilize xarray multi-range support
> instead of storing information on a per-4KiB-page basis, but AFAICT, the core
> idea is sound.  And a very big positive from a maintenance perspective is that
> any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
> benefit the other use case.
> 
> [1] https://lore.kernel.org/all/20230311002258.852397-22-seanjc@google.com
> [2] https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
> [3] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com

I agree, I used this mechanism because it was easier at first to rely on 
a previous work, but while I was working on the MBEC support, I realized 
that it's not the optimal way to do it.

I was thinking about using a new special EPT bit similar to 
EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you 
think?
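
For reference, such a bit could look roughly like the sketch below (purely
illustrative: the name and bit number are made up, and it would have to be
a bit that the hardware ignores and that KVM does not already claim):

  /* Hypothetical software-available bit, akin to EPT_SPTE_HOST_WRITABLE. */
  #define SPTE_HEKI_NO_WRITE	BIT_ULL(60)

  /* make_spte() could then derive the restriction from the SPTE itself: */
  if (old_spte & SPTE_HEKI_NO_WRITE) {
  	spte |= SPTE_HEKI_NO_WRITE;
  	pte_access &= ~ACC_WRITE_MASK;
  }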
  
Sean Christopherson May 5, 2023, 5:31 p.m. UTC | #3
On Fri, May 05, 2023, Mickaël Salaün wrote:
> 
> On 05/05/2023 18:28, Sean Christopherson wrote:
> > I have no doubt that we'll need to solve performance and scaling issues with the
> > memory attributes implementation, e.g. to utilize xarray multi-range support
> > instead of storing information on a per-4KiB-page basis, but AFAICT, the core
> > idea is sound.  And a very big positive from a maintenance perspective is that
> > any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
> > benefit the other use case.
> > 
> > [1] https://lore.kernel.org/all/20230311002258.852397-22-seanjc@google.com
> > [2] https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
> > [3] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com
> 
> I agree, I used this mechanism because it was easier at first to rely on a
> previous work, but while I was working on the MBEC support, I realized that
> it's not the optimal way to do it.
> 
> I was thinking about using a new special EPT bit similar to
> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
> think?

On x86, SPTEs are even more ephemeral than memslots.  E.g. for historical reasons,
KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the guest
is moving around BARs, using option ROMs, etc.

ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
track attributes, but that works only because pKVM is more privileged than the
host kernel, and the shared vs. private memory attribute that pKVM cares about
is very, very restricted in how it can be used and changed.

I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, and
it ended up being a constant battle with the kernel, e.g. page migration, and with
KVM itself, e.g. the above memslot mess.
  
Madhavan T. Venkataraman May 24, 2023, 8:53 p.m. UTC | #4
On 5/5/23 12:31, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
>>
>> On 05/05/2023 18:28, Sean Christopherson wrote:
>>> I have no doubt that we'll need to solve performance and scaling issues with the
>>> memory attributes implementation, e.g. to utilize xarray multi-range support
>>> instead of storing information on a per-4KiB-page basis, but AFAICT, the core
>>> idea is sound.  And a very big positive from a maintenance perspective is that
>>> any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
>>> benefit the other use case.
>>>
>>> [1] https://lore.kernel.org/all/20230311002258.852397-22-seanjc@google.com
>>> [2] https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
>>> [3] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com
>>
>> I agree, I used this mechanism because it was easier at first to rely on a
>> previous work, but while I was working on the MBEC support, I realized that
>> it's not the optimal way to do it.
>>
>> I was thinking about using a new special EPT bit similar to
>> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
>> think?
> 
> On x86, SPTEs are even more ephemeral than memslots.  E.g. for historical reasons,
> KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the guest
> is moving around BARs, using option ROMs, etc.
> 
> ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
> track attributes, but that works only because pKVM is more privileged than the
> host kernel, and the shared vs. private memory attribute that pKVM cares about
> is very, very restricted in how it can be used and changed.
> 
> I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, and
> it ended up being a constant battle with the kernel, e.g. page migration, and with
> KVM itself, e.g. the above memslot mess.

Sorry for the delay in responding to this. I wanted to study the KVM code and fully
understand your comment before responding.

Yes, I quite agree with you. I will make an attempt to address this in the next version.
I am working on it right now.

Thanks.

Madhavan
  

Patch

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..a7fb4ff888e6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,6 +3,7 @@ 
 #define _ASM_X86_KVM_PAGE_TRACK_H
 
 enum kvm_page_track_mode {
+	KVM_PAGE_TRACK_PREWRITE,
 	KVM_PAGE_TRACK_WRITE,
 	KVM_PAGE_TRACK_MAX,
 };
@@ -22,6 +23,16 @@  struct kvm_page_track_notifier_head {
 struct kvm_page_track_notifier_node {
 	struct hlist_node node;
 
+	/*
+	 * It is called when guest is writing the write-tracked page
+	 * and the write emulation hasn't happened yet.
+	 *
+	 * @vcpu: the vcpu where the write access happened
+	 * @gpa: the physical address written by guest
+	 * @node: this node
+	 */
+	bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       struct kvm_page_track_notifier_node *node);
 	/*
 	 * It is called when guest is writing the write-tracked page
 	 * and write emulation is finished at that time.
@@ -73,6 +84,7 @@  kvm_page_track_register_notifier(struct kvm *kvm,
 void
 kvm_page_track_unregister_notifier(struct kvm *kvm,
 				   struct kvm_page_track_notifier_node *n);
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 			  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 835426254e76..e5d1e241ff0f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -793,9 +793,13 @@  static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	slot = __gfn_to_memslot(slots, gfn);
 
 	/* the non-leaf shadow pages are keeping readonly. */
-	if (sp->role.level > PG_LEVEL_4K)
-		return kvm_slot_page_track_add_page(kvm, slot, gfn,
-						    KVM_PAGE_TRACK_WRITE);
+	if (sp->role.level > PG_LEVEL_4K) {
+		kvm_slot_page_track_add_page(kvm, slot, gfn,
+					     KVM_PAGE_TRACK_PREWRITE);
+		kvm_slot_page_track_add_page(kvm, slot, gfn,
+					     KVM_PAGE_TRACK_WRITE);
+		return;
+	}
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
@@ -840,9 +844,13 @@  static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
 	slot = __gfn_to_memslot(slots, gfn);
-	if (sp->role.level > PG_LEVEL_4K)
-		return kvm_slot_page_track_remove_page(kvm, slot, gfn,
-						       KVM_PAGE_TRACK_WRITE);
+	if (sp->role.level > PG_LEVEL_4K) {
+		kvm_slot_page_track_remove_page(kvm, slot, gfn,
+						KVM_PAGE_TRACK_PREWRITE);
+		kvm_slot_page_track_remove_page(kvm, slot, gfn,
+						KVM_PAGE_TRACK_WRITE);
+		return;
+	}
 
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
@@ -2714,7 +2722,10 @@  int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
 	 * track machinery is used to write-protect upper-level shadow pages,
 	 * i.e. this guards the role.level == 4K assertion below!
 	 */
-	if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
+	if (kvm_slot_page_track_is_active(kvm, slot, gfn,
+						KVM_PAGE_TRACK_PREWRITE) ||
+	    kvm_slot_page_track_is_active(kvm, slot, gfn,
+						KVM_PAGE_TRACK_WRITE))
 		return -EPERM;
 
 	/*
@@ -4103,6 +4114,8 @@  static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
 					 struct kvm_page_fault *fault)
 {
+	bool ret = false;
+
 	if (unlikely(fault->rsvd))
 		return false;
 
@@ -4113,10 +4126,14 @@  static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
 	 * guest is writing the page which is write tracked which can
 	 * not be fixed by page fault handler.
 	 */
-	if (kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
-		return true;
+	ret = kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn,
+					    KVM_PAGE_TRACK_PREWRITE) ||
+	      ret;
+	ret = kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn,
+					    KVM_PAGE_TRACK_WRITE) ||
+	      ret;
 
-	return false;
+	return ret;
 }
 
 static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
@@ -5600,6 +5617,33 @@  int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	if (r != RET_PF_EMULATE)
 		return 1;
 
+	if ((error_code & PFERR_WRITE_MASK) &&
+	    !kvm_page_track_prewrite(vcpu, cr2_or_gpa)) {
+		struct x86_exception fault = {
+			.vector = PF_VECTOR,
+			.error_code_valid = true,
+			.error_code = error_code,
+			.nested_page_fault = false,
+			/*
+			 * TODO: This kind of kernel page fault needs to be handled by
+			 * the guest, which is not currently the case, making it try
+			 * again and again.
+			 *
+			 * You may want to test with cr2_or_gva to see the page
+			 * fault caught by the guest kernel (thinking it is a
+			 * user space fault).
+			 */
+			.address = static_call(kvm_x86_fault_gva)(vcpu),
+			.async_page_fault = false,
+		};
+
+		pr_warn_ratelimited(
+			"heki-kvm: Creating write #PF at 0x%016llx\n",
+			fault.address);
+		kvm_inject_page_fault(vcpu, &fault);
+		return RET_PF_INVALID;
+	}
+
 	/*
 	 * Before emulating the instruction, check if the error code
 	 * was due to a RO violation while translating the guest page.
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..2454887cd48b 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -131,9 +131,10 @@  void kvm_slot_page_track_add_page(struct kvm *kvm,
 	 */
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
-	if (mode == KVM_PAGE_TRACK_WRITE)
+	if (mode == KVM_PAGE_TRACK_PREWRITE || mode == KVM_PAGE_TRACK_WRITE) {
 		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
 			kvm_flush_remote_tlbs(kvm);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
 
@@ -248,6 +249,36 @@  kvm_page_track_unregister_notifier(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
 
+/*
+ * Notify the node that a write access is about to happen. Returning false
+ * doesn't stop the other nodes from being called, but it will stop
+ * the emulation.
+ *
+ * The node should figure out if the written page is the one that the node
+ * is interested in by itself.
+ */
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+				  srcu_read_lock_held(&head->track_srcu))
+		if (n->track_prewrite)
+			if (!n->track_prewrite(vcpu, gpa, n))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
 /*
  * Notify the node that write access is intercepted and write emulation is
  * finished at this time.
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c0fd7e049b4e..639f220a1ed5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -144,6 +144,12 @@  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	u64 spte = SPTE_MMU_PRESENT_MASK;
 	bool wrprot = false;
 
+	if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn,
+					  KVM_PAGE_TRACK_PREWRITE) ||
+	    kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn,
+					  KVM_PAGE_TRACK_WRITE))
+		pte_access &= ~ACC_WRITE_MASK;
+
 	WARN_ON_ONCE(!pte_access && !shadow_present_mask);
 
 	if (sp->role.ad_disabled)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a2c299d47e69..fd05f42c9913 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7325,6 +7325,7 @@  static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes
 			r = X86EMUL_IO_NEEDED;
 			goto out;
 		}
+		kvm_page_track_write(vcpu, gpa, data, towrite);
 
 		bytes -= towrite;
 		data += towrite;
@@ -7441,13 +7442,12 @@  static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			const void *val, int bytes)
 {
-	int ret;
-
-	ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
-	if (ret < 0)
-		return 0;
+	if (!kvm_page_track_prewrite(vcpu, gpa))
+		return X86EMUL_PROPAGATE_FAULT;
+	if (kvm_vcpu_write_guest(vcpu, gpa, val, bytes))
+		return X86EMUL_UNHANDLEABLE;
 	kvm_page_track_write(vcpu, gpa, val, bytes);
-	return 1;
+	return X86EMUL_CONTINUE;
 }
 
 struct read_write_emulator_ops {
@@ -7477,7 +7477,9 @@  static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
 static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
 			void *val, int bytes)
 {
-	return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes);
+	if (kvm_vcpu_read_guest(vcpu, gpa, val, bytes))
+		return X86EMUL_UNHANDLEABLE;
+	return X86EMUL_CONTINUE;
 }
 
 static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -7551,8 +7553,12 @@  static int emulator_read_write_onepage(unsigned long addr, void *val,
 			return X86EMUL_PROPAGATE_FAULT;
 	}
 
-	if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes))
-		return X86EMUL_CONTINUE;
+	if (!ret) {
+		ret = ops->read_write_emulate(vcpu, gpa, val, bytes);
+		if (ret != X86EMUL_UNHANDLEABLE)
+			/* Handles X86EMUL_CONTINUE and X86EMUL_PROPAGATE_FAULT. */
+			return ret;
+	}
 
 	/*
 	 * Is this MMIO handled locally?
@@ -7689,6 +7695,9 @@  static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (kvm_is_error_hva(hva))
 		goto emul_write;
 
+	if (!kvm_page_track_prewrite(vcpu, gpa))
+		return X86EMUL_PROPAGATE_FAULT;
+
 	hva += offset_in_page(gpa);
 
 	switch (bytes) {