[0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

Message ID 20230808085056.14644-1-yan.y.zhao@intel.com
Series: KVM: x86/mmu: .change_pte() optimization in TDP MMU

Message

Yan Zhao Aug. 8, 2023, 8:50 a.m. UTC
This series optimizes the KVM mmu_notifier .change_pte() handler in the x86
TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and by
prefetching the notified new PFN into SPTEs directly in the handler.

As noted in [1], .change_pte() has been dead code on x86 for 10+ years.
Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to
prepare for the optimization in patch 2.

Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs
directly in the handler, using the PFN info carried by .change_pte(), so
that a vCPU write that triggers .change_pte() no longer has to go through
two VM-Exits and two TDP page faults.
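
For illustration only, a minimal sketch of what installing an SPTE from the
.change_pte() handler could look like in the TDP MMU.  Helper names follow
the existing x86 MMU code as I understand it, and range->arg.pte assumes
Sean's per-action union patch [2]; the real patch 2 differs, e.g. in how
the new SPTE's permissions are computed and in the sanity checks it does.

static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
			 struct kvm_gfn_range *range)
{
	u64 new_spte;

	/* Only bother with present 4K leaf SPTEs. */
	if (iter->level != PG_LEVEL_4K ||
	    !is_shadow_present_pte(iter->old_spte))
		return false;

	/* Zap the stale mapping of the old PFN first... */
	tdp_mmu_iter_set_spte(kvm, iter, 0);

	/*
	 * ...then install an SPTE for the new PFN reported by .change_pte()
	 * so the vCPU's next access doesn't take another TDP page fault.
	 */
	new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
						pte_pfn(range->arg.pte));
	tdp_mmu_iter_set_spte(kvm, iter, new_spte);

	return true;
}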

base-commit: fdf0eaf11452 + Sean's patch "KVM: Wrap kvm_{gfn,hva}_range.pte
in a per-action union" [2]

[1]: https://lore.kernel.org/lkml/ZMAO6bhan9l6ybQM@google.com/
[2]:
https://lore.kernel.org/lkml/20230729004144.1054885-1-seanjc@google.com/

Yan Zhao (2):
  KVM: x86/mmu: Remove dead code in .change_pte() handler in x86 TDP MMU
  KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte()
    handler

 arch/x86/kvm/mmu/tdp_mmu.c | 101 +++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 33 deletions(-)
  

Comments

Yan Zhao Aug. 17, 2023, midnight UTC | #1
On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Yan Zhao wrote:
> > This series optimizes the KVM mmu_notifier .change_pte() handler in the x86
> > TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and by
> > prefetching the notified new PFN into SPTEs directly in the handler.
> > 
> > As noted in [1], .change_pte() has been dead code on x86 for 10+ years.
> > Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to
> > prepare for the optimization in patch 2.
> 
> If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
> strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
> architectures and remove it from mmu_notifiers.  The only reason I haven't proposed
> such patches is because I didn't want it to backfire and lead to someone trying
> to resurrect the optimizations for KSM.
> 
> > Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs
> > directly in the handler, using the PFN info carried by .change_pte(), so
> > that a vCPU write that triggers .change_pte() no longer has to go through
> > two VM-Exits and two TDP page faults.
> 
> IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by
> using QEMU's prealloc option.  It's more coarse grained, but at a minimum it's
> sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB.
> 
> And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to
> prefault memory not just into the primary MMU, but also into KVM's MMU.  Such an
> ioctl() is basically mandatory for TDX, we just need to morph the support being
> added by TDX into a generic ioctl()[*]
> 
> Prefaulting guest memory as writable into the primary MMU should be able to achieve
> far better performance than hooking .change_pte(), as it will avoid the mmu_notifier
> invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote
> TLB flush(es).  And a KVM ioctl() to prefault into KVM's MMU should eliminate page
> fault VM-Exits entirely.
> 
> Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
> .change_pte() isn't enough to justify carrying the hook and the code in KVM.
> 
> [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com
Hi Sean,
As the cover letter doesn't paint the full picture of patch 2 well, may I
ask you to take a look at patch 2 itself to see whether you like it? (in
case you have only read the cover letter)
What I observed is that each vCPU write to a CoW page in the primary MMU
leads to two TDP page faults.
I then simply update the secondary MMU during the first TDP page fault to
avoid the second one.
It's not a blind prefetch (I check the running vCPU to ensure, as much as
possible, that the notification was triggered by a vCPU operation), and it
can benefit guests that don't explicitly request their memory to be
prefaulted as writable.
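
For illustration, a minimal sketch of the kind of guard I mean (the real
patch checks more conditions than this):

	/*
	 * Only prefetch from .change_pte() when the notification was
	 * triggered by a vCPU of this VM running on this CPU, i.e. when
	 * the CoW break came from KVM's own GUP on behalf of the vCPU.
	 */
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (!vcpu || vcpu->kvm != kvm)
		return false;	/* not vCPU-triggered, don't prefetch */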


Thanks
Yan
  
Sean Christopherson Aug. 18, 2023, 1:46 p.m. UTC | #2
On Fri, Aug 18, 2023, Yan Zhao wrote:
> On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote:
> > And FWIW, removing .change_pte() entirely, even without any other optimizations,
> > will also benefit those guests, as it will remove a source of mmu_lock contention
> > along with all of the overhead of invoking callbacks, walking memslots, etc.  And
> > removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks,
> > i.e. callbacks when memory for the VMM takes a CoW fault.
> >
> If we go with the above "always write_fault = true" solution, I think it's better.

Another option would be to allow a per-mm override of use_zero_page, but I think
I like the KVM memslot route more as it provides better granularity, doesn't
prevent CoW for VMM memory, and works even if THP isn't being used.
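
For context, the "always write_fault = true" idea referenced above could
look roughly like the below in KVM's fault-in path (illustrative only, not
a real patch; the actual plumbing of write_fault would differ):

	/*
	 * Ask the primary MMU for a writable mapping even on read faults to
	 * a writable memslot, so CoW is broken up front and the guest's
	 * later write needs neither a second fault nor .change_pte().
	 */
	bool write_fault = fault->write || !(slot->flags & KVM_MEM_READONLY);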
  
Christoph Hellwig Sept. 8, 2023, 8:18 a.m. UTC | #3
On Wed, Sep 06, 2023 at 05:18:51PM +0100, Robin Murphy wrote:
> Indeed a bunch of work has gone into SWIOTLB recently trying to make it a 
> bit more efficient for such cases where it can't be avoided, so it is 
> definitely still interesting to learn about impacts at other levels like 
> this. Maybe there's a bit of a get-out for confidential VMs though, since 
> presumably there's not much point COW-ing encrypted private memory, so 
> perhaps KVM might end up wanting to optimise that out and thus happen to 
> end up less sensitive to unavoidable SWIOTLB behaviour anyway?

Well, the fix for bounce buffering is to trust the device, and there is
a lot of work going into device authentication and attestation right now
so that will happen.

On the swiotlb side, a new version of the dma_sync_*_device APIs that
specifies both the mapping length and the actual data transfer length
would avoid some of the overhead here.  We decided last time that it was a
good idea, but so far no one has volunteered to implement it.
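
Purely as a hypothetical sketch of that idea (no such API exists today;
the name and signature below are made up for illustration), the extra
argument would let the caller distinguish the mapped length from the bytes
actually transferred:

void dma_sync_single_range_for_device_xfer(struct device *dev,
					   dma_addr_t addr,
					   size_t mapping_len,
					   size_t xfer_len,
					   enum dma_data_direction dir);

A driver could then tell the DMA layer that only xfer_len bytes of the
mapping_len mapping need to be copied to or from the bounce buffer,
instead of bouncing the whole mapping on every sync.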