Message ID | 20230808085056.14644-1-yan.y.zhao@intel.com |
---|---|
Headers | From: Yan Zhao <yan.y.zhao@intel.com> To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: pbonzini@redhat.com, seanjc@google.com, Yan Zhao <yan.y.zhao@intel.com> Subject: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU Date: Tue, 8 Aug 2023 16:50:56 +0800 Message-Id: <20230808085056.14644-1-yan.y.zhao@intel.com> |
Series | KVM: x86/mmu: .change_pte() optimization in TDP MMU |
Message
Yan Zhao
Aug. 8, 2023, 8:50 a.m. UTC
This series optimizes the KVM mmu_notifier .change_pte() handler in the x86 TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and by prefetching the notified new PFN into SPTEs directly in the handler.

As noted in [1], .change_pte() has been dead code on x86 for 10+ years.

Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to prepare for the optimization in patch 2.

Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs directly in the handler, using the PFN info carried by .change_pte(), so that a vCPU write that triggers .change_pte() no longer has to go through two VM-Exits and two TDP page faults.

base-commit: fdf0eaf11452 + Sean's patch "KVM: Wrap kvm_{gfn,hva}_range.pte in a per-action union" [2]

[1]: https://lore.kernel.org/lkml/ZMAO6bhan9l6ybQM@google.com/
[2]: https://lore.kernel.org/lkml/20230729004144.1054885-1-seanjc@google.com/

Yan Zhao (2):
  KVM: x86/mmu: Remove dead code in .change_pte() handler in x86 TDP MMU
  KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte() handler

 arch/x86/kvm/mmu/tdp_mmu.c | 101 +++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 33 deletions(-)
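[Editorial note] For readers unfamiliar with the TDP MMU internals, below is a minimal sketch of the patch-2 idea, not the actual patch. It assumes the pre-removal .change_pte() plumbing and Sean's per-action union (range->arg.pte); the helpers tdp_set_spte() and make_leaf_spte_for_pfn() are hypothetical placeholders for the real TDP MMU helpers.

```c
/*
 * Illustrative sketch only: when .change_pte() reports that the primary MMU
 * now maps a gfn to a new pfn (e.g. after a CoW write), install a matching
 * leaf SPTE right away so the vCPU's next access does not take a TDP page
 * fault.  tdp_set_spte() and make_leaf_spte_for_pfn() are hypothetical names.
 */
static bool change_pte_prefetch_spte(struct kvm *kvm, struct tdp_iter *iter,
				     struct kvm_gfn_range *range)
{
	u64 new_spte;

	/* Only 4KiB leaf SPTEs are updated this way. */
	if (iter->level != PG_LEVEL_4K)
		return false;

	/*
	 * Drop any existing mapping first; the PFN of a present leaf SPTE
	 * must never change in place.
	 */
	if (is_shadow_present_pte(iter->old_spte))
		tdp_set_spte(kvm, iter, 0);

	/* Build and install an SPTE for the new PFN carried by .change_pte(). */
	new_spte = make_leaf_spte_for_pfn(kvm, iter, pte_pfn(range->arg.pte),
					  pte_write(range->arg.pte));
	tdp_set_spte(kvm, iter, new_spte);

	return true;
}
```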
Comments
On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Yan Zhao wrote:
> > This series optimizes the KVM mmu_notifier .change_pte() handler in the x86
> > TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and
> > prefetching the notified new PFN into SPTEs directly in the handler.
> >
> > As in [1], .change_pte() has been dead code on x86 for 10+ years.
> > Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and
> > prepare for the optimization in the TDP MMU in patch 2.
>
> If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
> strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
> architectures and remove it from mmu_notifiers. The only reason I haven't
> proposed such patches is because I didn't want it to backfire and lead to
> someone trying to resurrect the optimizations for KSM.
>
> > Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs in
> > the handler directly with the PFN info contained in .change_pte(), to avoid
> > each vCPU write that triggers .change_pte() having to go through two
> > VM-Exits and two TDP page faults.
>
> IMO, prefaulting guest memory as writable is better handled by userspace, e.g.
> by using QEMU's prealloc option. It's more coarse grained, but at a minimum
> it's sufficient for improving guest boot time, e.g. by preallocating memory
> below 4GiB.
>
> And we can do even better, e.g. by providing a KVM ioctl() to allow userspace
> to prefault memory not just into the primary MMU, but also into KVM's MMU.
> Such an ioctl() is basically mandatory for TDX, we just need to morph the
> support being added by TDX into a generic ioctl() [*]
>
> Prefaulting guest memory as writable into the primary MMU should be able to
> achieve far better performance than hooking .change_pte(), as it will avoid
> the mmu_notifier invalidation, e.g. won't trigger taking mmu_lock for write
> and the resulting remote TLB flush(es). And a KVM ioctl() to prefault into
> KVM's MMU should eliminate page fault VM-Exits entirely.
>
> Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
> .change_pte() isn't enough to justify carrying the hook and the code in KVM.
>
> [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com

Hi Sean,
As I didn't convey the full picture of patch 2 well in the cover letter, may I
ask you to take a look at patch 2 to see if you like it? (In case you have only
read the cover letter.)

What I observed is that each vCPU write to a COW page in the primary MMU leads
to two TDP page faults. So I simply update the secondary MMU during the first
TDP page fault to avoid the second one. It's not a blind prefetch (I check the
vCPU to ensure it's triggered by a vCPU operation as much as possible), and it
can benefit guests that don't explicitly request prefaulting memory as
writable.

Thanks
Yan
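[Editorial note] As a concrete illustration of the userspace prefaulting Sean describes, here is a minimal sketch using madvise(MADV_POPULATE_WRITE) (available since Linux 5.14, and what QEMU's prealloc option can use under the hood). The mapping size and flags are illustrative only.

```c
/*
 * Minimal sketch: write-populate guest RAM in the primary MMU before the
 * vCPUs run, so vCPU writes never hit CoW/zero-page faults there.
 */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23	/* Linux 5.14+; define for older libcs */
#endif

int main(void)
{
	size_t guest_ram = 4UL << 30;	/* e.g. memory below 4GiB */

	void *mem = mmap(NULL, guest_ram, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Populate every page as writable up front instead of on demand. */
	if (madvise(mem, guest_ram, MADV_POPULATE_WRITE)) {
		perror("madvise(MADV_POPULATE_WRITE)");
		return 1;
	}

	/* ... register 'mem' as a KVM memslot via KVM_SET_USER_MEMORY_REGION ... */
	return 0;
}
```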
On Fri, Aug 18, 2023, Yan Zhao wrote:
> On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote:
> > And FWIW, removing .change_pte() entirely, even without any other
> > optimizations, will also benefit those guests, as it will remove a source
> > of mmu_lock contention along with all of the overhead of invoking
> > callbacks, walking memslots, etc. And removing .change_pte() will benefit
> > *all* guests by eliminating unrelated callbacks, i.e. callbacks when memory
> > for the VMM takes a CoW fault.
>
> If it's with the above "always write_fault = true" solution, I think it's
> better.

Another option would be to allow a per-mm override of use_zero_page, but I
think I like the KVM memslot route more as it provides better granularity,
doesn't prevent CoW for VMM memory, and works even if THP isn't being used.
On Wed, Sep 06, 2023 at 05:18:51PM +0100, Robin Murphy wrote:
> Indeed a bunch of work has gone into SWIOTLB recently trying to make it a bit
> more efficient for such cases where it can't be avoided, so it is definitely
> still interesting to learn about impacts at other levels like this. Maybe
> there's a bit of a get-out for confidential VMs though, since presumably
> there's not much point COW-ing encrypted private memory, so perhaps KVM might
> end up wanting to optimise that out and thus happen to end up less sensitive
> to unavoidable SWIOTLB behaviour anyway?

Well, the fix for bounce buffering is to trust the device, and there is a lot
of work going into device authentication and attestation right now, so that
will happen. On the swiotlb side, a new version of the dma_sync_*_device APIs
that specifies both the mapping length and the data transfer length would
avoid some of the overhead here. We decided that it was a good idea last time,
but so far no one has volunteered to implement it.
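[Editorial note] For illustration, the proposed dma_sync extension might look something like the following. No such API exists in the kernel; the function name and signature are invented here, and only the existing dma_sync_single_for_device() semantics are assumed as the starting point.

```c
#include <linux/dma-mapping.h>

/*
 * Hypothetical variant of dma_sync_single_for_device(): pass both the mapped
 * length and the actual transfer length, so SWIOTLB only bounces the bytes
 * really being transferred instead of the whole mapping.  Name and signature
 * are invented for illustration.
 */
void dma_sync_single_for_device_partial(struct device *dev, dma_addr_t addr,
					size_t mapped_len, size_t xfer_len,
					enum dma_data_direction dir);

/*
 * Hypothetical usage in a driver: the buffer was mapped as 4KiB, but only
 * 'len' bytes need to reach the device for this transfer.
 */
static void example_kick_device(struct device *dev, dma_addr_t buf_dma,
				size_t len)
{
	dma_sync_single_for_device_partial(dev, buf_dma, 4096, len,
					   DMA_TO_DEVICE);
}
```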